How to use an Unpivoted Correlation Matrix: Part One

Introduction

This is Part One of a Two Part article on the use and analysis of a correlation matrix in data analysis. In this first  part, I explain what the data set I have used contains and where I got the data from. I also discuss the setting up of a correlation matrix using the Data Analysis ToolPak in Excel and finally, I use the correlation matrix I created to further create a Query table from it and an unpivoted correlation matrix table from there.

In the Beginning ...

One of the first things I do when I analyse a data set is to create a table of descriptive statistics and a correlation matrix.

Descriptive Statistics

The descriptive statistics tell me the means, medians, modes and standard deviations, maxima and minima. They also show the kurtosis and skewness values. We need to know some or all of them to describe the data we are working with.

Correlation Matrix

The correlation matrix shows me associations and, I've got to say it, not causation. Correlation is not a measure of cause: it's a measure of statistical association. If we have, say 5 or 10 variables in our data set, a first blush correlation analysis can show us what is associated with what and, therefore, it can help to prevent us creating an over fitted model by leaving out variables that are highly correlated. Correlation coefficients can also cause some surprise when we find variable X is highly positively correlated with Y or highly negatively correlated with Y: maybe that is something we had not considered before. Alternatively, something that we felt certainly was highly corelated with another variable turns out to have little or no association at all.

Why is that so good? I find that correlation coefficients and matrices tell us one or more stories: how about the association between infant mortality and the number of physicians in a society: should be obvious, yes? How about the incidence of smoking and lung cancer: again, probably a clear association. Then again, suppose that we expected these two levels of association yet we don't find them: then what? Well, we have some explaining to do!

Data Analysis

In this post, I want to explore two things by using a Covid-19 data set:

  • how we create a correlation matrix in Excel for a data set that contains 47 variables and how we create an unpivoted correlation matrix
  • the insights that a correlation matrix, unpivoted or otherwise can provide

Creating a Correlation Matrix


The data set I have use comes from the highly regarded web site, Our World in Data (OWID), notably their covid-19 section. I downloaded and have used their entire covid-19 database as at early in January 2021 when there were 1.24 million cells of data in it at that time.

My first approach was to convert the data in the OWID *.csv file into an Excel Table and create a Query from it. That action allowed me to clean anything that might have been dirty: not entirely in line with the data types I was expecting, values shown as text, dates badly formatted and so on. By the way, this is no slur on OWID as I know that my computer can mangle data that comes in via a .csv file, especially dates. The Query table looks almost exactly the same as the Excel Table, reflecting the very few changes I made to the data.

I then created my correlation matrix using the Data Analysis ToolPak Correlation Function; follow the illustration below:


Choose the entire Query table headings and all rows of data but excluding country codes and names and the dates columns. Your correlation matrix will be large and it will look, in part, like this:


Please notice that I have used Conditional Formatting to highlight what I consider to be the most important correlation coefficients in that table. That is optional but I always do it and it does help to focus attention on any areas of interest!

Unpivoted Correlation Matrix

Because the correlation matrix is so large and unwieldy, it can be difficult to read and review. That is true even when I use the conditional formatting approach I just mentioned. I believe that unpivoting the correlation matrix is an excellent approach and this is what I did..

  • Convert the correlation matrix into an Excel Table in the normal way
  • Now, create a Query from that table ... Data ... From Table/Range
  • In the Query Editor, highlight the first column, Headings in my case; right click that column heading and click Unpivot Other Columns

Within seconds, your unpivoted correlation matrix will appear and it will look like this:


The Headings column remains the same but there are just two columns left now: Attributes and Value and they are the normal column headings for the two columns that unpivoting always leaves. I suggest you change Attributes to Association and Value to Correlation Coefficient.

Notice, 
  • I have deselected all of the rows that have a correlation coefficient of 1, since they come from the unit matrix and add no value to the analysis. I checked that there were only 47 number 1s in that list, that is the number of 1s in the original matrix ... of course, it is possible for a true association to have a coefficient of 1, hence the need to check for that
  • I sorted the coefficients from largest to smallest ... or smallest to largest if you prefer..
Here is an extract from my final, Unpivoted Correlation Matrix Query and here you can see the largest and smallest 10 coefficients of correlation and the 10 results either side of coefficients of correlation of zero:
:

In Part Two of this article, I will discuss what I consider are some of the more important or at least most interesting findings from this analysis. But in the meantime, why not download the latest version of these data from OWID and replicate what I have done so far.


Duncan Williamson
5th January 2021