Deaths by risk factor, 2023
- Reference: Our World in Data: OWID
- Deaths by risk factor, 2023
- Source of data: IHME, Global Burden of Disease (2025) [1]
The Excel file at https://docs.google.com/spreadsheets/d/1f2iCsi7UXW661cDRgzd5bMrdmhoMQ_2S/edit?usp=sharing&rtpof=true&sd=true accompanies this page. It contains the data from Our World in Data (OWID) as well as my initial correlation matrix.
Introduction
This page covers creating a correlation matrix in Excel, either with the Data Analysis ToolPak or with functions and formulas, and then unpivoting the matrix with Power Query so that the correlation coefficients are easier to read and work with.
I created the correlation matrix with the Data Analysis ToolPak in Excel, and the matrix it produced begins like this:
| Correlation | High blood pressure | Diet high in sodium | Diet low in whole grains | High alcohol use | Diet low in fruits | Unsafe water source |
| --- | --- | --- | --- | --- | --- | --- |
| High blood pressure | 1 | | | | | |
| Diet high in sodium | 0.9689 | 1 | | | | |
| Diet low in whole grains | 0.9950 | 0.9666 | 1 | | | |
| High alcohol use | 0.9738 | 0.9187 | 0.9680 | 1 | | |
| Diet low in fruits | 0.9652 | 0.9183 | 0.9665 | 0.9106 | 1 | |
| Unsafe water source | 0.7524 | 0.6415 | 0.7309 | 0.6998 | 0.8635 | 1 |
The matrix above is the top left corner of the 26 × 26 cell matrix that the ToolPak created for this data set. This is the usual style of correlation matrix, and it gives a comprehensive view of the data and of any possible relationships between the variables.
The main problem with such a matrix, however, is that at 26 × 26 cells, trying to appreciate and summarise it is overwhelming. For that reason, in such a case, I convert the matrix into an Excel Table and use Power Query to unpivot it. The result is a 277 row × 3 column unpivoted correlation matrix (one header row plus 276 pairs of risk factors), with the correlation column sorted largest to smallest.
For reasons explained in the sections that follow, in a case like this one we express each risk as a percentage of total deaths in that country before correlating. This gives a revised 301 row × 3 column unpivoted correlation matrix (one header row plus 300 pairs).
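The unpivot step is easy to sketch outside Excel. The Python below is only an illustration, not the Power Query used for this page. It takes the top left corner of the ToolPak matrix shown earlier, drops the diagonal of 1s, lists each pair once and sorts by correlation, largest to smallest. The function name unpivot is my own label.

```python
# Top left corner of the ToolPak correlation matrix, as shown on this page.
labels = [
    "High blood pressure",
    "Diet high in sodium",
    "Diet low in whole grains",
    "High alcohol use",
    "Diet low in fruits",
    "Unsafe water source",
]
lower_triangle = [
    [1.0],
    [0.9689, 1.0],
    [0.9950, 0.9666, 1.0],
    [0.9738, 0.9187, 0.9680, 1.0],
    [0.9652, 0.9183, 0.9665, 0.9106, 1.0],
    [0.7524, 0.6415, 0.7309, 0.6998, 0.8635, 1.0],
]


def unpivot(labels, lower_triangle):
    """Return (row risk, column risk, r) for each off-diagonal cell, largest r first."""
    rows = []
    for i, matrix_row in enumerate(lower_triangle):
        for j, r in enumerate(matrix_row):
            if i != j:  # skip the diagonal, where every risk correlates 1 with itself
                rows.append((labels[i], labels[j], r))
    return sorted(rows, key=lambda item: item[2], reverse=True)


pairs = unpivot(labels, lower_triangle)
for row_risk, col_risk, r in pairs[:3]:
    print(f"{row_risk} | {col_risk} | {r:.4f}")
```

Six risks give 15 rows here; the full matrix gives the 276 rows described above, with a header row on top.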
1 Overview
This note
analyses the Our World in Data (OWID) dataset "Deaths by risk factor,
2023," which records the number of deaths attributable to each of up to 25
risk factors (eg high blood pressure, unsafe water, smoking) for ~210 countries
and regions in 2023.
The goal: understand which risk factors tend
to co-occur across countries — ie which risks tend to kill people in the same
places. The tool used is the Pearson correlation coefficient (r), computed
pairwise across countries for each pair of risk factors. r ranges from −1
(perfect inverse relationship) through 0 (no linear relationship) to +1
(perfect positive relationship).
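For readers who want to see what r is doing, here is a minimal Python sketch of the Pearson coefficient for one pair of columns. It is an illustration rather than the calculation used in the workbook (Excel's CORREL function returns the same statistic). The function name pearson_r and the toy values are assumptions made for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either sequence is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy values: one entry per country for each risk factor.
risk_a = [10.0, 20.0, 30.0, 40.0]
rises_with_a = [1.0, 2.0, 3.0, 4.0]
falls_with_a = [4.0, 3.0, 2.0, 1.0]

print(round(pearson_r(risk_a, rises_with_a), 6))  # perfect positive relationship
print(round(pearson_r(risk_a, falls_with_a), 6))  # perfect inverse relationship
```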
Two versions of the analysis are presented:
- Version A: raw death counts. The original correlation matrix supplied in the workbook.
- Version B: share of deaths. A normalised version computed in this workbook, expressing each risk as a percentage of total deaths in that country before correlating.
The two versions tell very different stories:
- Version A is dominated by country size.
- Version B reveals the underlying epidemiology.
2 Version A: Correlations on raw death counts
Setup
24 risk
factors → 24 × 23/2 = 276 unique pairs. The correlation column ranks pairs from
highest to lowest.
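The pair count is simply the number of ways of choosing 2 risks from n, that is n(n − 1)/2. A small Python check of the arithmetic, using a function name of my own choosing:

```python
from math import comb


def unique_pairs(n_risks):
    """Number of unordered pairs of distinct risk factors: n(n - 1)/2."""
    return n_risks * (n_risks - 1) // 2


print(unique_pairs(24))  # 276
print(unique_pairs(25))  # 300

# The same figures from the standard library's "n choose 2".
assert unique_pairs(24) == comb(24, 2)
assert unique_pairs(25) == comb(25, 2)
```

Adding one header row to each count gives the 277 and 301 row unpivoted tables mentioned in the introduction.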
Headline result
Every single one of the 276 correlations is strongly positive, ranging from 0.999 down to 0.72. There are no negative or near-zero values.
Why: the country size effect
Large-population countries (India, China, Nigeria, the United States) record more deaths from every cause than small-population countries do. As a result, any two raw death-count columns correlate at roughly 0.9 or higher simply because both scale with population. The correlations on raw counts are largely a restatement of "big countries have lots of deaths": close to tautological and not informative about which risks actually co-occur.
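The size effect can be reproduced with made-up numbers. The sketch below is a simulation under stated assumptions, not an analysis of the OWID data: populations are drawn log-uniformly between 10 thousand and 1 billion, and two death rates are drawn independently of each other, so any correlation in the resulting counts comes from population alone. All names and parameters are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_countries = 200

# Assumption: populations spread log-uniformly from 10^4 to 10^9.
populations = [10 ** rng.uniform(4, 9) for _ in range(n_countries)]

# Two unrelated per-person death rates, each varying +/-20% around a base rate.
rate_a = [0.001 * rng.uniform(0.8, 1.2) for _ in range(n_countries)]
rate_b = [0.002 * rng.uniform(0.8, 1.2) for _ in range(n_countries)]

# Raw death counts are rate times population.
deaths_a = [p * r for p, r in zip(populations, rate_a)]
deaths_b = [p * r for p, r in zip(populations, rate_b)]

r_counts = pearson_r(deaths_a, deaths_b)
r_rates = pearson_r(rate_a, rate_b)
print(f"r on raw counts: {r_counts:.3f}")
print(f"r on rates:      {r_rates:.3f}")
```

Under these assumptions the raw counts should correlate strongly while the underlying rates should not, which is the pattern described above.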
What can still be read from Version A
Even with the
size effect, pairs sitting noticeably above the ~0.95 baseline reflect
recognisable epidemiological clusters and pairs sitting below it flag genuine
divergence between countries.
Top of the list: the noncommunicable disease (NCD) cluster (r > 0.998):
- High cholesterol ↔ High blood pressure: 0.9991
- Low physical activity ↔ High blood sugar: 0.9988
- High cholesterol ↔ Low physical activity: 0.9983
- Outdoor PM pollution ↔ Second-hand smoke: 0.9983
These are the
cardiometabolic risks that travel together: the same countries record large
numbers of deaths from each of them.
Bottom of the list: child stunting decouples (r ≈ 0.72–0.81):
Child stunting's weakest correlations are with smoking, high-sodium diet, second-hand smoke and outdoor PM pollution. Stunting is concentrated in low-income countries, while smoking, sodium and PM deaths are concentrated in middle- and high-income countries. The same low-income pattern appears for unsafe sex and lack of handwashing access.
The clearest two-worlds signal
- Lowest pair: Child stunting ↔ Diet high in sodium (cell C277 of the sorted matrix).
- Highest pair: High cholesterol ↔ High blood pressure (cell C2).
The gap
between these two values is a direct measure of the epidemiological transition:
countries are either still primarily fighting undernutrition and poor
sanitation or they have moved on to cardiovascular and metabolic disease.
Other tightly coupled sub-clusters
- WASH and child wasting (water, sanitation, handwashing, child wasting): all pairs above 0.95.
- Air pollution (indoor + outdoor + second-hand smoke): internally 0.96–0.998.
- Diet risks (low fruit, low vegetables, low whole grains, low nuts, high sodium): mostly above 0.97.
Conclusion on Version A
The matrix as supplied measures country size more than it measures epidemiology. The ranking still tells a coherent story (NCDs cluster at the top; child stunting and unsafe sex sit alone at the bottom), but to draw genuine conclusions about which risks co-occur, the analysis must be rerun on a population-normalised metric.
3 Version B: Correlations on share of deaths
Method
The OWID
workbook does not contain a population column, so death rates per 100,000
cannot be computed without external data. Share of deaths can be computed from
the existing data alone:
share_of_deaths(country, risk) = deaths(country, risk) / total_deaths_across_all_risks(country)
This removes
the country size effect because every country's shares sum to 1, irrespective
of population. The recomputed correlation matrix lives on the new correl_share
tab. It covers 25 risks (Low birthweight, omitted from the original matrix, is
included here) and 203 countries (regional aggregates such as Africa, Asia,
Europe, Oceania, South America, and World are filtered out). 25 × 24/2 = 300
unique pairs.
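The method above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the script behind the correl_share tab: the death counts are hypothetical toy numbers, the function name is my own, and the aggregate list contains only the regions named on this page.

```python
# Hypothetical toy numbers; the real values live in the workbook.
deaths = {
    "Country X": {"High blood pressure": 600.0, "Unsafe water source": 50.0, "Smoking": 350.0},
    "Country Y": {"High blood pressure": 30.0, "Unsafe water source": 60.0, "Smoking": 10.0},
    "World": {"High blood pressure": 630.0, "Unsafe water source": 110.0, "Smoking": 360.0},
}

# Regional aggregates named on this page as being filtered out.
AGGREGATES = {"Africa", "Asia", "Europe", "Oceania", "South America", "World"}


def share_of_deaths(deaths_by_country):
    """Convert each country's deaths per risk into a share of its total across all risks."""
    shares = {}
    for country, by_risk in deaths_by_country.items():
        if country in AGGREGATES:
            continue  # keep countries only
        total = sum(by_risk.values())
        if total == 0:
            continue  # shares are undefined when a country records no deaths
        shares[country] = {risk: count / total for risk, count in by_risk.items()}
    return shares


shares = share_of_deaths(deaths)
for country, by_risk in shares.items():
    print(country, {risk: round(value, 3) for risk, value in by_risk.items()})
```

Each country's shares sum to 1 whatever its population, which is why the size effect drops out before the correlations are computed.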
Headline result
With country
size removed, correlations now span −0.75 to +0.99 and the structure is
interpretable. Risks split into distinct epidemiological clusters that are
negatively correlated with each other.
Strongest positive correlations — risks that share countries
The top of
the list is the low-income/WASH (water, sanitation, hygiene) cluster. Countries
where one of these is a major killer tend to have all of them:
- Unsafe water ↔ Unsafe sanitation: 0.994
- Unsafe sanitation ↔ No handwashing: 0.946
- Child wasting ↔ Child stunting: 0.937
- Unsafe water ↔ Child wasting: 0.925
Strongest negative correlations — the epidemiological transition
These pairs are inversely correlated across countries, and this is the headline finding of the share-of-deaths analysis:
- No handwashing ↔ High cholesterol: −0.75
- No handwashing ↔ High blood pressure: −0.74
- Low birthweight ↔ High blood pressure: −0.74
- Obesity ↔ Indoor air pollution: −0.73
- Obesity ↔ Air pollution (combined): −0.72
Plain reading: countries where a large share of deaths comes from high cholesterol or hypertension have a small share from handwashing- and sanitation-related deaths, and vice versa. This is the same two-worlds pattern that was visible only in the ranking under Version A; here it appears directly in the sign of the correlation.
Three clusters emerge
- NCD/cardiometabolic: high blood pressure, high cholesterol, high blood sugar, obesity, low physical activity, smoking, diet risks (sodium, low fruit, low vegetables, low whole grains).
- WASH/child health: unsafe water, unsafe sanitation, no handwashing, child wasting, child stunting, low birthweight, indoor air pollution, unsafe sex.
- Pollution sits in the middle: outdoor PM correlates with both clusters, while indoor air pollution sits firmly with WASH.
4 Caveat: what share of deaths does and does not measure
Share of
deaths measures the composition of mortality in a country, not absolute risk. A
country can have a high share of deaths from hypertension simply because it has
few deaths from anything else (long life expectancy, low child mortality).
Per-100,000 rates would tell us which risks are more dangerous in absolute terms; share tells us which risks dominate the cause-of-death mix. For a two-worlds/epidemiological-transition question, share is the appropriate metric. For an absolute-burden question, per-100,000 rates would be required, and would need population data merged in from an external source.
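The difference between the two metrics is easiest to see with a pair of invented countries. In the Python sketch below every figure is hypothetical, including the population column, which the OWID workbook does not contain. Country P has the larger share of deaths from hypertension but the smaller rate per 100,000.

```python
def share(deaths_from_risk, deaths_all_risks):
    """Share of a country's deaths attributed to one risk."""
    return deaths_from_risk / deaths_all_risks


def rate_per_100k(deaths_from_risk, population):
    """Deaths from one risk per 100,000 people."""
    return deaths_from_risk / population * 100_000


# Hypothetical toy countries with equal populations.
countries = {
    "Country P": {"population": 1_000_000, "deaths_all_risks": 5_000, "hypertension": 1_000},
    "Country Q": {"population": 1_000_000, "deaths_all_risks": 20_000, "hypertension": 2_000},
}

results = {}
for name, c in countries.items():
    results[name] = {
        "share": share(c["hypertension"], c["deaths_all_risks"]),
        "rate": rate_per_100k(c["hypertension"], c["population"]),
    }
    print(f"{name}: share {results[name]['share']:.0%}, "
          f"rate {results[name]['rate']:.0f} per 100,000")
```

Country P's share is high only because it records few deaths from anything else, which is the situation the caveat describes.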
5 Note on input table edits
The script underlying Version B already excludes regional aggregates (Africa, Asia, Europe, Oceania, South America, World) before computing correlations. Removing the Africa row from the deaths_risk input therefore does not change any of the results above: Africa was being filtered out at the country-list stage in any case. The 203-country sample and all correlations on correl_share remain as reported.
Duncan Williamson
9th May 2026
Source: [1] IHME, Global Burden of Disease (2025), with major processing by Our World in Data. “High blood pressure – IHME” [dataset]. IHME, Global Burden of Disease, “Global Burden of Disease: Risk Factors - Deaths” [original data].
