Since the coronavirus pandemic started, I’ve wondered how accurately countries have been reporting data on their cases. I read as much as I could about it, but still never felt satisfied.
Then I was watching a Netflix show called Connected, and they started talking about a numerical phenomenon called Benford’s Law. Apparently governments use this law to determine the probability of tax fraud, rigged elections, and more.
The premise is that in many real-world datasets, especially ones that span several orders of magnitude, the frequency of the numbers’ leading digits follows a consistent logarithmic distribution: the probability that the first digit is d equals log10(1 + 1/d). About 30% of the time the first digit will be the number 1, about 18% of the time the number 2, and so on down to under 5% for the number 9, as you can see from the chart below.
Datasets from stock prices to population sizes tend to follow this distribution closely.
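To make the distribution concrete, here is a minimal Python sketch that computes the expected share of each leading digit from the standard Benford formula, log10(1 + 1/d). The function name `benford_expected` is just an illustrative choice, not something from the spreadsheet.

```python
import math

def benford_expected():
    """Return {digit: probability} for leading digits 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

expected = benford_expected()
for digit, p in expected.items():
    # Print each digit with its expected share as a percentage.
    print(f"{digit}: {p:.1%}")
```

The nine probabilities sum to 1, because the terms log10((d + 1)/d) telescope to log10(10).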
So after I finished Connected, I decided to test Benford’s Law against COVID data. I grabbed the raw daily COVID cases & deaths dataset from WHO to run an analysis against.
Method:
1. Extract first digit from each country’s daily ‘New Cases’ and ‘New Deaths’ reporting.
2. Calculate the probability of occurrence of each first digit, 1 through 9, on the ‘New Cases’ and ‘New Deaths’ datasets, by country.
3. Run a correlation of each country’s reporting distributions against the Benford’s Law distribution.
4. Rank each country by the average of their ‘New Cases’ and ‘New Deaths’ correlations.
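The four steps above can be sketched in plain Python. This is a rough illustration, not the spreadsheet’s actual formulas: the helper names and the sample numbers are made up, I am assuming the correlation is a Pearson correlation, and I am assuming that zero or negative daily counts are skipped, since they have no meaningful leading digit.

```python
import math
from collections import Counter

# Expected Benford shares for leading digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def first_digit(n):
    """Step 1: leading digit of a positive integer; None for zero or negatives (assumption)."""
    if n <= 0:
        return None
    return int(str(n)[0])

def digit_distribution(counts):
    """Step 2: share of each leading digit 1..9 among the positive values."""
    digits = [d for d in map(first_digit, counts) if d is not None]
    if not digits:
        return [0.0] * 9
    tally = Counter(digits)
    return [tally[d] / len(digits) for d in range(1, 10)]

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)

def benford_score(new_cases, new_deaths):
    """Step 3: correlate each series with Benford, then average the two correlations."""
    cases_r = pearson(digit_distribution(new_cases), BENFORD)
    deaths_r = pearson(digit_distribution(new_deaths), BENFORD)
    return (cases_r + deaths_r) / 2

def rank_countries(reports):
    """Step 4: sort countries by average correlation, highest first.

    `reports` maps a country name to (new_cases, new_deaths) lists.
    """
    scores = {c: benford_score(cases, deaths) for c, (cases, deaths) in reports.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up daily counts, purely for illustration (not WHO data).
sample = {
    "Country A": ([12, 18, 25, 31, 47, 110, 163, 240, 95, 1400], [1, 2, 1, 3, 5, 11, 14, 19, 8, 27]),
    "Country B": ([90, 85, 92, 88, 79, 94, 81, 70, 96, 83], [9, 8, 9, 7, 8, 9, 6, 8, 9, 7]),
}
for country, score in rank_countries(sample):
    print(country, round(score, 3))
```

One caveat with this sketch: a country whose digit shares are all identical produces a NaN correlation, which would need to be filtered out before ranking.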
Result
The results were both reaffirming and surprising.
Over 70% of countries (154, specifically) had daily COVID reports with a correlation above 90% against the Benford’s Law distribution. A handful were above 99%!
This leads me to believe that the true number of new cases and deaths across the globe follows Benford’s Law. It also leads me to believe that most countries are reporting pretty accurately.
In other cases where the correlations were weaker, there was an interesting pattern: either a country’s ‘New Cases’ or its ‘New Deaths’ report was highly correlated, while the other was much less so.
If you look at the sparkline distribution charts, you can see that for a handful of countries the number 1 appears less frequently than other numbers, when it should be the most frequent.[1] Additionally, the rest of the sparkline distributions for the low correlations are bumpy, and in some cases the higher first digits appear more frequently than the lower ones.
I won’t speculate as to why one of the correlations is so weak compared to the other, but I am extremely curious to learn more.
I’ll let you take it from here to explore the data yourself.
Closing
Hopefully this Benford’s Law analysis was as interesting to you as it was to me. Since I only spent an hour or so putting together the report after I finished the Netflix show, some of the statistical methods and analysis could certainly be better.
Feel free to explore the analysis yourself, make a copy, and audit the formulas and methods. If you see a way to improve anything, message me on Twitter and I’ll update it here.
Spreadsheet: COVID Benford’s Law Correlation Analysis
[1] The left-hand side of the line is the frequency of number 1, whereas the right-hand side is the frequency of 9, with other numbers spread across the middle.