Duncan's Diurnal Diatribe: Deaths by Risk Factor

Deaths by Risk Factor

Deaths by risk factor, 2023

Reference: Our World in Data (OWID), Deaths by risk factor, 2023
Source of data: IHME, Global Burden of Disease (2025)¹

The accompanying Excel file, https://docs.google.com/spreadsheets/d/1f2iCsi7UXW661cDRgzd5bMrdmhoMQ_2S/edit?usp=sharing&rtpof=true&sd=true, contains the data from Our World in Data (OWID) as well as my initial correlation matrix.

Introduction

This page covers creating a correlation matrix in Excel, either with the Data Analysis ToolPak or with functions and formulas, and then unpivoting the matrix with Power Query so that the correlation coefficients are easier to read and work with.

I created the correlation matrix using the Data Analysis ToolPak in Excel. The resulting matrix begins like this:

Correlation	High blood pressure	Diet high in sodium	Diet low in whole grains	High alcohol use	Diet low in fruits	Unsafe water source
High blood pressure	1
Diet high in sodium	0.9689	1
Diet low in whole grains	0.9950	0.9666	1
High alcohol use	0.9738	0.9187	0.9680	1
Diet low in fruits	0.9652	0.9183	0.9665	0.9106	1
Unsafe water source	0.7524	0.6415	0.7309	0.6998	0.8635	1

The matrix above is the top left corner of the 26 cell * 26 cell matrix that the ToolPak created for this data set. This is the usual layout for a correlation matrix, and it shows every pairwise relationship between the variables in one place.

The main problem with such a matrix is that, at 26 cells * 26 cells, trying to appreciate and summarise it is an overwhelming experience. For that reason, in such a case, I convert the matrix into an Excel Table and use Power Query to unpivot it. The result is a 277 row * 3 column unpivoted correlation matrix in which the correlation column has been sorted, largest to smallest.

For reasons explained in the sections that follow, in a case like this one it is better to express each risk as a percentage of total deaths in that country before correlating. That gives a revised, 301 row * 3 column unpivoted correlation matrix.
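The unpivot step described above can also be sketched outside Power Query. The Python below is only an illustration, not the method used in the workbook: it takes the top left corner of the matrix shown earlier, turns the lower triangle into three-column rows and sorts them largest to smallest. The function name `unpivot` is hypothetical, and dropping the diagonal of 1s is an assumption, chosen because it matches the 276 pairs plus a header row reported for the full matrix.

```python
# Top left corner of the Version A matrix, copied from the excerpt above.
# Only the lower triangle is filled in, as in the ToolPak output.
risks = [
    "High blood pressure",
    "Diet high in sodium",
    "Diet low in whole grains",
    "High alcohol use",
    "Diet low in fruits",
    "Unsafe water source",
]
lower_triangle = [
    [1.0],
    [0.9689, 1.0],
    [0.9950, 0.9666, 1.0],
    [0.9738, 0.9187, 0.9680, 1.0],
    [0.9652, 0.9183, 0.9665, 0.9106, 1.0],
    [0.7524, 0.6415, 0.7309, 0.6998, 0.8635, 1.0],
]


def unpivot(names, triangle):
    """Turn a lower-triangular matrix into (risk_a, risk_b, r) rows."""
    rows = []
    for i, row in enumerate(triangle):
        for j, r in enumerate(row):
            if i != j:  # drop the diagonal of 1s (assumption, see lead-in)
                rows.append((names[j], names[i], r))
    # Sort largest to smallest, as in the sorted unpivoted table.
    return sorted(rows, key=lambda item: item[2], reverse=True)


pairs = unpivot(risks, lower_triangle)
for risk_a, risk_b, r in pairs:
    print(f"{risk_a}\t{risk_b}\t{r:.4f}")
```

For this six-risk excerpt the sketch yields 15 rows, with High blood pressure and Diet low in whole grains (0.9950) at the top.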

1 Overview

This note analyses the Our World in Data (OWID) dataset "Deaths by risk factor, 2023," which records the number of deaths attributable to each of up to 25 risk factors (eg high blood pressure, unsafe water, smoking) for ~210 countries and regions in 2023.

The goal: understand which risk factors tend to co-occur across countries — ie which risks tend to kill people in the same places. The tool used is the Pearson correlation coefficient (r), computed pairwise across countries for each pair of risk factors. r ranges from −1 (perfect inverse relationship) through 0 (no linear relationship) to +1 (perfect positive relationship).
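As a minimal sketch of what r measures, the Python below computes the Pearson coefficient from first principles. It is the same quantity that Excel's CORREL function returns, but the code is illustrative rather than the workbook's method; the function name `pearson_r` and the toy columns are hypothetical and are not OWID data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    spread = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if spread == 0:
        raise ValueError("a constant column has no defined correlation")
    return covariance / spread


# Hypothetical toy columns, one value per country (not OWID data).
risk_x = [10, 20, 30, 40]
same_direction = [1, 2, 3, 4]      # rises with risk_x
opposite_direction = [4, 3, 2, 1]  # falls as risk_x rises

print(pearson_r(risk_x, same_direction))
print(pearson_r(risk_x, opposite_direction))
```

The first toy pair sits at the +1 end of the range and the second at the −1 end.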

Two versions of the analysis are presented:

· Version A — raw death counts. The original correlation matrix supplied in the workbook.
· Version B — share of deaths. A normalised version computed in this workbook, expressing each risk as a percentage of total deaths in that country before correlating.

The two versions tell very different stories:

· Version A is dominated by country size.
· Version B reveals the underlying epidemiology.

2 Version A: Correlations on raw death counts

Setup

24 risk factors → 24 × 23/2 = 276 unique pairs. The correlation column ranks pairs from highest to lowest.
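The pair count is the standard n × (n − 1)/2 formula for unordered pairs. A short Python check, offered as an illustration only (the helper name `unique_pairs` is hypothetical):

```python
from itertools import combinations


def unique_pairs(n_risks):
    """Number of unordered pairs among n_risks columns: n * (n - 1) / 2."""
    return n_risks * (n_risks - 1) // 2


# Cross-check the closed form by listing the pairs explicitly.
assert unique_pairs(24) == len(list(combinations(range(24), 2)))

print(unique_pairs(24))  # 276 pairs in Version A
print(unique_pairs(25))  # 300 pairs in Version B
```

Adding one header row to each count gives the 277 and 301 rows of the two unpivoted tables.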

Headline result

Every single one of the 276 correlations is strongly positive, ranging from 0.999 down to 0.72. There are no negative or near zero values.

Why: the country size effect

Large population countries (India, China, Nigeria, the United States) record more deaths from every cause than small population countries do. As a result, most pairs of raw death count columns correlate at roughly 0.9 or higher simply because both columns scale with population. The correlations on raw counts are largely a restatement of "big countries have lots of deaths": close to tautological and not informative about which risks actually co-occur.
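The size effect can be reproduced with a deliberately artificial example. In the Python sketch below, all numbers are hypothetical and are not OWID figures: two made-up risks have per-person rates that move in opposite directions across eight countries, yet their raw death counts still correlate strongly and positively because both are multiplied by population.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy)
    )


# Hypothetical toy data for eight countries (not OWID figures).
population_millions = [1, 2, 5, 10, 50, 100, 500, 1000]
# Deaths per million from two made-up risks; the rates move in
# opposite directions: where risk A is high, risk B is low.
rate_a = [2, 1, 2, 1, 2, 1, 2, 1]
rate_b = [1, 2, 1, 2, 1, 2, 1, 2]

# Raw death counts scale each rate by the country's population.
deaths_a = [p * r for p, r in zip(population_millions, rate_a)]
deaths_b = [p * r for p, r in zip(population_millions, rate_b)]

r_rates = pearson_r(rate_a, rate_b)
r_counts = pearson_r(deaths_a, deaths_b)
print(f"rates:  {r_rates:+.2f}")
print(f"counts: {r_counts:+.2f}")
```

The rates are perfectly inversely related, while the counts come out strongly positive: the population multiplier swamps the underlying relationship.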

What can still be read from Version A

Even with the size effect, pairs sitting noticeably above the ~0.95 baseline reflect recognisable epidemiological clusters and pairs sitting below it flag genuine divergence between countries.

Top of the list: the noncommunicable disease (NCD) cluster (r > 0.998):

· High cholesterol ↔ High blood pressure: 0.9991

· Low physical activity ↔ High blood sugar: 0.9988

· High cholesterol ↔ Low physical activity: 0.9983

· Outdoor PM pollution ↔ Second-hand smoke: 0.9983

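The first three pairs are cardiometabolic risks that travel together: the same countries record large numbers of deaths from each of them. The fourth pair, outdoor PM pollution and second-hand smoke, is an air pollution pairing rather than a cardiometabolic one, but it is just as tightly coupled.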

Bottom of the list: child stunting decouples (r ≈ 0.72–0.81):

Child stunting's weakest correlations are with smoking, high sodium diet, second hand smoke and outdoor PM pollution. Stunting is concentrated in low income countries, while smoking, sodium and PM deaths are concentrated in middle and high income countries. The same low income pattern appears for unsafe sex and lack of handwashing access.

The clearest two worlds signal

· Lowest pair: Child stunting ↔ Diet high in sodium (cell C277 of the sorted matrix).

· Highest pair: High cholesterol ↔ High blood pressure (cell C2).

The gap between these two values reflects the epidemiological transition: countries tend either to be still primarily fighting undernutrition and poor sanitation or to have moved on to cardiovascular and metabolic disease.

Other tightly coupled sub clusters

· WASH and child wasting (water, sanitation, handwashing, child wasting): all pairs above 0.95.

· Air pollution (indoor + outdoor + second hand smoke): internally 0.96–0.998.

· Diet risks (low fruit, low vegetables, low whole grains, low nuts, high sodium): mostly above 0.97.

Conclusion on Version A

The matrix as supplied measures country size more than it measures epidemiology. The ranking still tells a coherent story (NCDs cluster at the top, child stunting and unsafe sex sit alone at the bottom), but to draw genuine conclusions about which risks co-occur, the analysis must be rerun on a population normalised metric.

3 Version B: Correlations on share of deaths

Method

The OWID workbook does not contain a population column, so death rates per 100,000 cannot be computed without external data. Share of deaths can be computed from the existing data alone:

· share_of_deaths(country, risk) = deaths(country, risk)/total_deaths_across_all_risks(country)

This removes the country size effect because every country's shares sum to 1, irrespective of population. The recomputed correlation matrix lives on the new correl_share tab. It covers 25 risks (Low birthweight, omitted from the original matrix, is included here) and 203 countries (regional aggregates such as Africa, Asia, Europe, Oceania, South America, and World are filtered out). 25 × 24/2 = 300 unique pairs.
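A minimal Python sketch of the share calculation follows. It is an illustration under stated assumptions, not the script behind the correl_share tab: the function name `share_of_deaths` mirrors the formula above, and the two countries and their death counts are hypothetical toy numbers.

```python
def share_of_deaths(deaths_by_risk):
    """Convert one country's deaths per risk into shares that sum to 1."""
    total = sum(deaths_by_risk.values())
    if total == 0:
        raise ValueError("country has no recorded deaths")
    return {risk: deaths / total for risk, deaths in deaths_by_risk.items()}


# Hypothetical toy numbers for two countries of very different size.
deaths = {
    "Large country": {
        "High blood pressure": 600_000,
        "Unsafe water": 200_000,
        "Smoking": 200_000,
    },
    "Small country": {
        "High blood pressure": 600,
        "Unsafe water": 200,
        "Smoking": 200,
    },
}

shares = {country: share_of_deaths(risks) for country, risks in deaths.items()}
for country, risk_shares in shares.items():
    print(country, risk_shares)
```

The two toy countries differ a thousandfold in size but end up with the same shares, which is the sense in which the country size effect is removed.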

Headline result

With country size removed, correlations now span −0.75 to +0.99 and the structure is interpretable. Risks split into distinct epidemiological clusters that are negatively correlated with each other.

Strongest positive correlations — risks that share countries

The top of the list is the low-income/WASH (water, sanitation, hygiene) cluster. Countries where one of these is a major killer tend to have all of them:

· Unsafe water ↔ Unsafe sanitation: 0.994

· Unsafe sanitation ↔ No handwashing: 0.946

· Child wasting ↔ Child stunting: 0.937

· Unsafe water ↔ Child wasting: 0.925

Strongest negative correlations — the epidemiological transition

These pairs are inversely correlated across countries — and this is the headline finding of the share of deaths analysis:

· No handwashing ↔ High cholesterol: −0.75

· No handwashing ↔ High blood pressure: −0.74

· Low birthweight ↔ High blood pressure: −0.74

· Obesity ↔ Indoor air pollution: −0.73

· Obesity ↔ Air pollution (combined): −0.72

Plain reading: countries where a large share of deaths comes from high cholesterol or hypertension have a small share from handwashing and sanitation related risks, and vice versa. This is the same two worlds pattern that was visible only in the ranking under Version A; here it appears directly in the sign of the correlation.

Three clusters emerge

· NCD/cardiometabolic: high blood pressure, high cholesterol, high blood sugar, obesity, low physical activity, smoking, diet risks (sodium, low fruit, low vegetables, low whole grains).

· WASH/child health: unsafe water, unsafe sanitation, no handwashing, child wasting, child stunting, low birthweight, indoor air pollution, unsafe sex.

· Pollution sits in the middle: outdoor PM correlates with both clusters, while indoor air pollution sits firmly with WASH.

4 Caveat: what share of deaths does and does not measure

Share of deaths measures the composition of mortality in a country, not absolute risk. A country can have a high share of deaths from hypertension simply because it has few deaths from anything else (long life expectancy, low child mortality).

Per 100,000 rates would tell us which risks are more dangerous in absolute terms; share tells us which risks dominate the cause of death mix. For a two worlds/epidemiological transition question, share is the appropriate metric. For an absolute burden question, per 100,000 rates would be required and would need population data merged in from an external source.
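The difference between the two metrics can be made concrete with a small Python sketch. Everything in it is hypothetical: the two countries, their death counts and their populations are toy numbers, and `rate_per_100k` is an illustrative helper, since the workbook itself contains no population column.

```python
def rate_per_100k(deaths, population):
    """Deaths per 100,000 people; population must come from an external source."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths / population * 100_000


# Hypothetical toy numbers (not OWID figures) for one risk in two countries.
# Country X: few deaths overall, so the risk has a large share but a low rate.
# Country Y: many deaths overall, so the risk has a small share but a higher rate.
countries = {
    "Country X": {"risk_deaths": 3_000, "all_risk_deaths": 10_000, "population": 5_000_000},
    "Country Y": {"risk_deaths": 6_000, "all_risk_deaths": 60_000, "population": 5_000_000},
}

summary = {}
for name, c in countries.items():
    share = c["risk_deaths"] / c["all_risk_deaths"]
    rate = rate_per_100k(c["risk_deaths"], c["population"])
    summary[name] = (share, rate)
    print(f"{name}: share {share:.0%}, rate {rate:.0f} per 100,000")
```

In this toy case the risk dominates the mix in Country X while being twice as dangerous per person in Country Y, so the two metrics rank the countries in opposite orders.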

5 Note on input table edits

The script underlying Version B already excludes regional aggregates (Africa, Asia, Europe, Oceania, South America, World) before computing correlations. Removing the Africa row from the deaths_risk input therefore does not change any of the results above — Africa was being filtered out at the country list stage in any case. The 203 country sample and all correlations on correl_share remain as reported.
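The reasoning can be illustrated with a short Python sketch. This is not the script behind Version B: the `entity` field name, the `country_rows` helper and the toy rows are all hypothetical, and the aggregate list contains only the six labels named in the text.

```python
# Regional aggregates named in the text; a real run would need the full
# list of aggregate labels used in the export (an assumption here).
AGGREGATES = {"Africa", "Asia", "Europe", "Oceania", "South America", "World"}


def country_rows(rows):
    """Keep only rows whose entity is not a regional aggregate."""
    return [row for row in rows if row["entity"] not in AGGREGATES]


# Hypothetical toy input table (not OWID figures).
deaths_risk = [
    {"entity": "Kenya", "unsafe_water": 120},
    {"entity": "Africa", "unsafe_water": 9000},
    {"entity": "France", "unsafe_water": 3},
    {"entity": "World", "unsafe_water": 20000},
]
without_africa = [row for row in deaths_risk if row["entity"] != "Africa"]

# Deleting Africa by hand changes nothing once the filter has run.
print(country_rows(deaths_risk) == country_rows(without_africa))
```

Because the filter runs before any correlation is computed, the rows that reach the calculation are identical with or without the Africa row in the input.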

Duncan Williamson

9th May 2026

Source: ¹ IHME, Global Burden of Disease (2025) – with major processing by Our World in Data. “High blood pressure – IHME” [dataset]. IHME, Global Burden of Disease, “Global Burden of Disease: Risk Factors - Deaths” [original data].
