Duncan's Diurnal Diatribe: How to use an Unpivoted Correlation Matrix: Part Two

Introduction

In part one of this two-part series, I explained how and why I use correlation matrices as part of any data analysis work that I do. I illustrated a large correlation matrix that I built from the then latest update of the entire covid-19 data set from the brilliant Our World in Data (OWID) web site. Although I did mention that I use conditional formatting to help with any analysis of such a matrix, I then went on to say why and how I created an unpivoted version of the correlation matrix: this turned a 47 * 47 matrix into a 3 * 1082 table.

In summary, reading and analysing a 47 * 47 matrix can be much more difficult than reading and analysing a 3 * 1082 table. Finally, I illustrated some of the output of my unpivoted matrix by sorting the correlation coefficients from largest to smallest and showing the top ten, middle ten and bottom ten results.
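For readers who prefer code to a spreadsheet, here is a minimal sketch of the same unpivoting idea in Python with pandas. It is not the approach described in part one, only an illustration of it. The function name unpivot_correlation is hypothetical, and the three column names are borrowed from the Headings, Association and Correlation Coefficient columns discussed later on this page. The sketch keeps each pair of variables once and drops the self-correlations, so 47 variables would give 47 * 46 / 2 = 1081 pairs, which is consistent with a 1082-row table if one of those rows is a header.

```python
import numpy as np
import pandas as pd


def unpivot_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per distinct pair of numeric columns in df.

    Hypothetical helper: the output column names are assumptions taken
    from the column names mentioned in the text.
    """
    # Correlate the numeric columns only (text columns are ignored).
    corr = df.select_dtypes("number").corr()

    # Positions strictly above the diagonal: each pair once, no self-pairs.
    rows, cols = np.triu_indices(len(corr), k=1)

    return pd.DataFrame(
        {
            "Headings": corr.index.to_numpy()[rows],
            "Association": corr.columns.to_numpy()[cols],
            "Correlation Coefficient": corr.to_numpy()[rows, cols],
        }
    )
```

Calling unpivot_correlation on a data frame with n numeric columns returns a three-column table with n * (n - 1) / 2 rows, ready to be sorted or filtered.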

Let's be clear: even a 3 * 1082 table is difficult to analyse, so I am going to begin by discussing the ten largest coefficients of correlation, the middle ten and the smallest ten. Then I will do a more or less random sweep of other results.
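The three slices can be sketched in pandas as follows. This assumes a three-column table whose coefficient column is called Correlation Coefficient, and it reads "middle ten" as the ten coefficients closest to zero, which is an assumption on my part about how to pick them. The function name top_middle_bottom is hypothetical.

```python
import pandas as pd


def top_middle_bottom(long: pd.DataFrame, n: int = 10,
                      value_col: str = "Correlation Coefficient"):
    """Split an unpivoted table into its largest, near-zero and smallest rows."""
    # Sort from largest to smallest coefficient.
    ranked = long.sort_values(value_col, ascending=False).reset_index(drop=True)

    top = ranked.head(n)      # the n largest coefficients
    bottom = ranked.tail(n)   # the n smallest (most negative) coefficients

    # Assumption: "middle" means the n coefficients closest to zero.
    nearest_zero = ranked[value_col].abs().nsmallest(n).index
    middle = ranked.loc[nearest_zero].sort_values(value_col, ascending=False)

    return top, middle, bottom
```

With a table this long the three slices do not overlap, but on a very short table the same row could appear in more than one of them.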

Top Ten Results

Here are the top ten results, then:

Eight of the first ten results show very high degrees of association between the new and total numbers of cases and deaths: I am sure this will not come as a surprise to anyone. The 7th largest value is the association between the numbers of people older than 70 and those older than 65: there is a very high correlation between them. The 10th largest value shows a high degree of correlation between the gross number of hospital patients and the number of ICU patients.

Is there analytical value in those results? Maybe not, as eight of the ten results mainly confirm what we might expect, while the other two probably do not tell us a great deal.

Middle Ten Results

What about the ten results either side of 0?

Notice, first of all, that there is zero association between the number of female smokers and total number of vaccinations. Similarly, there is very little association between the reproduction rate of the virus and the total number of cases.

In fact, since we are reviewing the results at and either side of zero, all we can see are cases where there are no associations. Hence, our question here ought to be: are any of these results surprising? That is, should there be an association between the number of female smokers and total vaccinations? If so, why? If not, why not?

Let me point out, for the third time, we are dealing with OWID's entire covid-19 data set, from Afghanistan to Zimbabwe. If you think something we have found so far does not seem to ring true, we can carry out regional analyses or country by country analyses.
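As a rough illustration of what a regional or country by country check might look like in code, the sketch below computes one correlation coefficient per group for a chosen pair of variables. The function name correlation_by_group and the column names passed to it are placeholders, so substitute whatever grouping column (a region or country column, say) your copy of the data actually has.

```python
import pandas as pd


def correlation_by_group(df: pd.DataFrame, group_col: str,
                         x: str, y: str) -> pd.Series:
    """Pearson correlation between columns x and y within each group.

    Hypothetical helper: group_col, x and y are whatever column names
    exist in the data being analysed.
    """
    results = {
        name: group[x].corr(group[y])  # pairwise correlation within the group
        for name, group in df.groupby(group_col)
    }
    return pd.Series(results, name="r")
```

A coefficient that looks odd at world level can then be compared against the same coefficient for each region or country in turn.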

An interesting result will probably turn out to be the very low degree of negative association between vaccinations and ICU admissions. It appears three times: between the number of vaccinations and the weekly ICU admissions per million, between the total vaccinations per hundred and the weekly ICU admissions per million, and between the total vaccinations per hundred and weekly ICU admissions.

We can see similar, positive associations in the above table too, of course: total vaccinations per hundred with new deaths as well as with new deaths smoothed.

Finally, the reproduction rate shows virtually no association with total cases.

Bottom Ten Results

I suppose I shouldn't say bottom ten results since a large but negative correlation coefficient could be just as interesting and revealing as a positive correlation coefficient.

The results we see here have their largest (negative) value of just -0.3962, which is not so large for a significant correlation coefficient. Still, it is one of four measures in which extreme poverty is important and since comorbidity can be a major factor in the survivability from a covid-19 infection, it is important to note it here.

Old age and extreme poverty can be seen to be associated here at around -0.36 for both the over 70s and the over 65s and -0.39 for the median age.

The negative correlations between handwashing facilities and gdp per capita as well as hospital beds per thousand and extreme poverty could well have implications elsewhere in this data set.

Ease of Use

Let me pause here, in a sense, to remind you that this is not so much an essay about covid-19 and the OWID data set but a discussion about the ease of use of my unpivoted correlation matrix. Turning a matrix into a linear table really does make the results easier to interrogate and present here.

Female Smokers

We have already noted that the number of female smokers was completely uncorrelated with the total number of vaccinations. What about female smokers and any other results? Is there anything else we need to review as far as female smokers are concerned?

Yes, it is very easy in this format to search and sort the data set to find that there are 41 different results for female smokers, the most significant of which are:

The age of female smokers: the highest correlation coefficients for female smokers relate to their age! They are generally older women. In addition, the largest negative correlation coefficient relates to extreme poverty: female smokers may not be so poor!
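The search and sort just described can be sketched in pandas like this. It assumes the unpivoted table has columns named Headings, Association and Correlation Coefficient, and that a variable can turn up in either of the first two columns. The function name results_for and the Other Variable column are hypothetical.

```python
import pandas as pd


def results_for(long: pd.DataFrame, variable: str) -> pd.DataFrame:
    """All rows that involve `variable`, sorted from largest to smallest r."""
    in_headings = long["Headings"] == variable
    in_association = long["Association"] == variable
    hits = long[in_headings | in_association].copy()

    # Report the partner variable in a single column, whichever side it was on:
    # keep Association where Headings is the search variable, else use Headings.
    hits["Other Variable"] = hits["Association"].where(
        hits["Headings"] == variable, hits["Headings"]
    )

    ordered = hits.sort_values("Correlation Coefficient", ascending=False)
    return ordered[["Other Variable", "Correlation Coefficient"]]
```

The head of the result holds the strongest positive associations and the tail the strongest negative ones, which is the kind of extract shown for female smokers here.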

Extreme Poverty

We also focused on extreme poverty above, so let's interrogate the data set to see what we can find out about that.

All correlation coefficients for extreme poverty are negative and the above extract shows the most negative of all 38 results.

We already know that age is important for any review of extreme poverty, as is gdp per capita. We can also see that the association between extreme poverty and total deaths and cases per million is negative: on the face of it, the poorer you are, the less chance you have of contracting and dying from covid-19.

Searching for extreme poverty in the Association column, too, brings out some other interesting relationships:

Not surprisingly, life expectancy is associated with extreme poverty, as is the incidence of diabetes: again, we are concerning ourselves with comorbidity.

Random Review

Since this page is mainly concerned with the ease with which an unpivoted correlation matrix can be interrogated and analysed, I did say earlier that at this stage I would take a short and random walk through other parts of the table to see what we can find that might be of interest.

I am showing the coefficient of correlation as the value of r in this section; that is not to be confused with the reproduction rate of the virus.

gdp per capita and new tests smoothed per thousand: r = 0.5026

human development index and aged 65 and older: r = 0.6608

human development index and aged 70 and older: r = 0.6046

icu patients and total deaths per million: r = 0.2187

icu patients per million and new deaths smoothed per million: r = 0.4715

male smokers and new deaths: r = 0.0422

new deaths and new cases smoothed: r = 0.9082

It takes just seconds to sort the Headings column alphabetically and review the different sections that are then clearly revealed. Following on from there, it is just a matter of scrolling down the Association and Correlation Coefficient columns to reveal something of importance.

Alternatively, as I did above, I can click on the down arrow of any column and type the search term I am looking for, to assist me in my review of the data, such as here, where I want to review anything to do with gdp:
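A rough pandas equivalent of that column search, together with the alphabetical sort mentioned above, might look like the sketch below. It assumes the table has Headings and Association columns holding text, and the function name search_table is hypothetical.

```python
import pandas as pd


def search_table(long: pd.DataFrame, term: str) -> pd.DataFrame:
    """Rows whose Headings or Association text contains `term`."""
    # Plain substring match, ignoring upper and lower case.
    in_headings = long["Headings"].str.contains(term, case=False, regex=False)
    in_association = long["Association"].str.contains(term, case=False, regex=False)

    # Sort alphabetically so related rows sit together.
    return long[in_headings | in_association].sort_values(
        ["Headings", "Association"]
    )
```

Calling search_table(long, "gdp") would return every row that mentions gdp on either side of the pair.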

I hope you find these two pages of use as I adopt the role of the unpivoted correlation matrix lieutenant! I urge you to try this technique, which is not too difficult to learn and to use. I also encourage you to take the data from the OWID web site and carry out your own world, regional and country by country analysis.

Duncan Williamson

6th January 2021