Introduction
This is Part One of a Two Part article on the use and analysis of a correlation matrix in data analysis. In this first part, I explain what the data set I have used contains and where I got the data from. I also discuss the setting up of a correlation matrix using the Data Analysis ToolPak in Excel and finally, I use the correlation matrix I created to further create a Query table from it and an unpivoted correlation matrix table from there.
In the Beginning ...
One of the first things I do when I analyse a data set is to create a table of descriptive statistics and a correlation matrix.
Descriptive Statistics
The descriptive statistics tell me the means, medians, modes and standard deviations, maxima and minima. They also show the kurtosis and skewness values. We need to know some or all of them to describe the data we are working with.
Correlation Matrix
The correlation matrix shows me associations and, I've got to say it, not causation. Correlation is not a measure of cause: it's a measure of statistical association. If we have, say 5 or 10 variables in our data set, a first blush correlation analysis can show us what is associated with what and, therefore, it can help to prevent us creating an over fitted model by leaving out variables that are highly correlated. Correlation coefficients can also cause some surprise when we find variable X is highly positively correlated with Y or highly negatively correlated with Y: maybe that is something we had not considered before. Alternatively, something that we felt certainly was highly corelated with another variable turns out to have little or no association at all.
Why is that so good? I find that correlation coefficients and matrices tell us one or more stories: how about the association between infant mortality and the number of physicians in a society: should be obvious, yes? How about the incidence of smoking and lung cancer: again, probably a clear association. Then again, suppose that we expected these two levels of association yet we don't find them: then what? Well, we have some explaining to do!
Data Analysis
In this post, I want to explore two things by using a Covid-19 data set:
- how we create a correlation matrix in Excel for a data set that contains 47 variables and how we create an unpivoted correlation matrix
- the insights that a correlation matrix, unpivoted or otherwise can provide
Creating a Correlation Matrix
The data set I have use comes from the highly regarded web site, Our World in Data (OWID), notably their covid-19 section. I downloaded and have used their entire covid-19 database as at early in January 2021 when there were 1.24 million cells of data in it at that time.
My first approach was to convert the data in the OWID *.csv file into an Excel Table and create a Query from it. That action allowed me to clean anything that might have been dirty: not entirely in line with the data types I was expecting, values shown as text, dates badly formatted and so on. By the way, this is no slur on OWID as I know that my computer can mangle data that comes in via a .csv file, especially dates. The Query table looks almost exactly the same as the Excel Table, reflecting the very few changes I made to the data.
I then created my correlation matrix using the Data Analysis ToolPak Correlation Function; follow the illustration below:
Unpivoted Correlation Matrix
- Convert the correlation matrix into an Excel Table in the normal way
- Now, create a Query from that table ... Data ... From Table/Range
- In the Query Editor, highlight the first column, Headings in my case; right click that column heading and click Unpivot Other Columns
- I have deselected all of the rows that have a correlation coefficient of 1, since they come from the unit matrix and add no value to the analysis. I checked that there were only 47 number 1s in that list, that is the number of 1s in the original matrix ... of course, it is possible for a true association to have a coefficient of 1, hence the need to check for that
- I sorted the coefficients from largest to smallest ... or smallest to largest if you prefer..