Excel File 4.7 million Rows Deep: Power Pivot Can Cope

Introduction

In the UK over the last couple of weeks, there has been a slur campaign against family doctors. A newspaper published an article that is supposed to show how lazy our doctors are. What they did was present a biased picture of how many consultations these doctors provide.

I downloaded the GP Appointments file and carried out my own analysis. I was a bit surprised to find that it all came to 4.7 million rows of data for a year and a half's data so initially I used Power BI for my analysis. I ran into a DAX problem that is now solved. I ran into another problem to do with April 202's data that I cannot fathom.

For that reason, I came back to Excel to do my work and the Date Table is now in the M Language, not DAX. DAX wasn't the problem, it was my use of it that was wrong and I would use it again but chose M in Excel.

I was really using Excel to see if I could replicate the error that Power BI was giving me and my good news is that I could not: that error has disappeared. My conclusion is

a I did something to the data to corrupt it

b Power BI just does not like April 2020's CSV file for a reason I cannot see as yet

I will keep looking at it to see what I can find.

Enough of that: why am I here today, with this?

The Story

A GP in the UK is a family doctor, the first medical port of call for treatment and/or referral to a hospital. There are many thousands of these doctors all over the UK and they provide a sterling service, as they work with nurses and other medical practitioners in their surgeries to provide a wide range of medical care.

The newspaper article I referred to is reported as saying that the number of Friday consultations is significantly less than the number of Monday consultations so Doctors are slackers as they stop providing services on Friday. The truth is two fold

a all doctors work the same on Monday, Tuesday, Wednesday and Thursday

b the demand for consultations in less on Friday ... that is, the public don't go to see their Doctors on Friday as much as to day on the other weekdays.

The newspaper was attacking the reputation of our front line medical services.

My Excel Work

4.7 Million Rows

Knowing that the total file would take up 4.7 million rows, how can we get all of that data into Excel? Firstly, the data came to me in CSV files and Power Pivot will not work directly with them. I couldn't just copy and paste all of the files 22 files and try to combine them because Excel is limited to 1,048,576 rows and the data comprises 4,724,442 rows.

Here is how I got the data into Power Pivot.

Create a Query in Power Query: from file, from folder ... CSV files only ... choose all 22 files to be combined

Transform the data to make sure you get exactly and only the files into the Query that you want: after all I had additional CSV files, an XLSX file and a PBIX file in that folder as well and I needed to exclude them

Having set up the Query I then did this:

Close and Load to ... Only Create Connection ... Add to the Data Model

That does it: all of the 4.7 million rows stay in memory and the Query become a model in Power Pivot.

Date Table

I had a problem with my Date Table, DAX based, as I mentioned above but with help from the Power BI community, I sorted that out.

In my Excel file I created a Date Table using the M Language, as described on this page.

To make the Date Table available to me for my analysis in a Pivot Table, I needed to set up a relationship between the Query and the Date Table. My Date Table looks like this:

I used several of the columns in that Table to help my analysis, as you will see.

Pivot Table

My main work was done in a Pivot Table (PT) and I will share all five of them here: I created the first PT and, having finished it, copied and pasted it to another part of the worksheet. I kept all of my PTs together on one worksheet. It should be clear what is in each PT and you can probably work out the connections I have used between the Query and the Date Table. For all but the first PT, I also used a Slicer for the Regions into which the UK National Health Service is divided. That Slicer is connected to all of the four PTs so that every time I choose a new region in the Slicer, it changes all four of the PTs together.:






Dashboard

It is really useful to be able visualise the contents of a PT and that is easy to do in Excel, albeit there is some tidying up to do between creating a Pivot Chart (PC) and presenting a more useful chart. I created the charts to be next to its own PT and the Slicer but once I was happy with everything I cut the Charts and the Slicer from that worksheet and pasted them onto another worksheet that I call my dashboard!


I never worry about pinpoint accuracy over placing my charts. I don't worry about making the charts look gorgeous ... doing that kind of thing has its place but it can be very time consuming. However, by including a Slicer that allows me to change, in this case, to any of the NHS regions adds a lot of interest to such a dashboard. Of course, I could add other slicers to control other variables as well as or instead of regions.

To that end, it took me one second, literally a second, to change my dashboard from showing All Regions, not shown here, to that one showing Barnsley, to the following one showing my home region of Calderdale:


Conclusions

This article has had the purpose of discussing how to get more than the allowed 1,048,576 rows' worth of data into Excel: in this case by using Power Query and Power Pivot combined. I also discussed setting up relationships between a Date Table, the Pivot Table and thereby allow me to carry out the kind of analysis I have.

By analysing the data, we can see that Friday appointments are down on the other weekdays and to help to focus on that aspect, I excluded Saturday and Sunday data from this discussion. However, it is not possible to prove why Friday shows fewer appointments than the other weekdays and for that, we have to ask the Doctors themselves: a fact checking service provided that information for us and report that it's because of supply and demand.

My Excel File: because I have used Power Query, Power Pivot AND Excel, it is not feasible to share my file here. However, if you write to me and ask for the files, I will happily share them with you.

Duncan Williamson
18th October 2021


No comments: