Thursday, April 30, 2020

A note on minimizing common problems reporting trends in Covid-19 statistics

Some of the most widely reported statistics about Covid-19 are based on daily updates of total "new" cases in the previous 24 hours. These updates are used to track increases and decreases in everything from the number of tests to the number of deaths from the pandemic.

But these daily totals can be misleading. The day that a case is first reported may not be the day the case actually occurred. Cases are often reported days or weeks after their occurrence because of lags in the collection and distribution of Covid-19 data.

Reports that use the actual day a case occurred are a better way to measure trends. But these day-of-occurrence reports can also be misleading because of lags in reporting cases.

The problems created by lags are especially apparent in many popular dashboards that use graphics to illustrate Covid-19 trends.

The Ohio Department of Health is an exception. The department's dashboard features clear presentations of Covid-19 trends that include the limitations of the data.

The Ohio data is updated every day at 2 p.m. I used data from multiple updates to produce my own graphics and analysis for this post. I have no affiliation with the Ohio Department of Health.


I am using Ohio deaths as an example of general problems reporting Covid-19 tests, cases, hospitalizations and other measures. Covid-19 trends are often reported with bar charts, so I will do the same.

My first two charts (above) illustrate the period from the first Ohio coronavirus death to April 21. The orange chart shows total "new" deaths reported each day. The blue chart shows the days that each death actually occurred.

The orange chart shows that March 20 is the first day that "new" Ohio deaths were reported. But the blue chart shows the first death actually occurred on March 17.

You can also see that the orange distribution of "new" deaths does not accurately depict the actual distribution of deaths in blue.

The number of "new" deaths being reported appeared to be increasing on April 21. But the blue chart shows that deaths per day appeared to be decreasing.
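The difference between the two charts comes down to which date each death is counted under. A minimal sketch in Python, using invented records rather than Ohio data, and assuming each record carries both a date of death and a date of first report:

```python
from collections import Counter
from datetime import date

# Hypothetical records: (date the death occurred, date it was first reported).
# These values are invented for illustration; they are not Ohio data.
records = [
    (date(2020, 4, 1), date(2020, 4, 3)),
    (date(2020, 4, 1), date(2020, 4, 3)),
    (date(2020, 4, 2), date(2020, 4, 3)),
    (date(2020, 4, 2), date(2020, 4, 5)),
    (date(2020, 4, 3), date(2020, 4, 5)),
]

# "Orange" style tally: count each death on the day it was first reported.
by_reported = Counter(reported for _, reported in records)

# "Blue" style tally: count each death on the day it actually occurred.
by_occurred = Counter(occurred for occurred, _ in records)

for day in sorted(set(by_reported) | set(by_occurred)):
    print(day, "reported:", by_reported[day], "occurred:", by_occurred[day])
```

Both tallies add up to the same five deaths, but they place them on different days, which is why the two charts can point in opposite directions.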

These next charts (above) include data from eight more days, extending the analysis from April 21 to April 29. The orange chart includes a dramatic spike on April 29 because 138 "new" deaths were first reported on that day.

The spike is misleading. The blue chart includes these "new" deaths on days the deaths actually occurred. The blue chart again shows the actual number of deaths per day appears to be decreasing.

But the blue chart also uses lagged data, making it incomplete and possibly misleading. Recall the decline in my first blue chart ending on April 21. That decline vanished after eight days of updates.

Five of the 138 "new" deaths occurred on unknown dates, so they are not reported in the blue chart. This is not unusual. Dates are normally added in subsequent updates. But this is another example of how lags complicate efforts to identify Covid-19 trends.

This next graphic (above) compares both charts of "new" deaths reported each day. Ovals highlight April 14-21. The distribution of deaths did not change from the first to the second chart. This shows how daily totals can persistently misrepresent the distribution of deaths.

This graphic (above) compares both charts of the actual number of deaths each day. Ovals highlight April 14-21.

The distribution of deaths changed from the first to the second chart. The first chart incorrectly shows a decline in deaths. The second chart shows deaths were actually constant or increasing from April 14-21.

This illustrates how the accuracy of lagged data improves over time. As more deaths were reported, the counts for the highlighted days were revised upward.
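One way to see these revisions is to subtract an earlier snapshot of the day-of-occurrence counts from a later one. The sketch below assumes each snapshot is stored as a dictionary keyed by day. The function name and all of the numbers are invented for illustration and are not Ohio data.

```python
def revisions(earlier, later):
    """Return the change in each day's count between two snapshots.

    Both arguments are dicts mapping a day label to a count.
    Days missing from a snapshot are treated as zero.
    """
    days = sorted(set(earlier) | set(later))
    return {day: later.get(day, 0) - earlier.get(day, 0) for day in days}

# Hypothetical snapshots of deaths by day of occurrence (invented numbers).
snapshot_first = {"2020-04-19": 10, "2020-04-20": 8, "2020-04-21": 5}
snapshot_later = {"2020-04-19": 11, "2020-04-20": 12, "2020-04-21": 13}

for day, change in revisions(snapshot_first, snapshot_later).items():
    print(day, "revised by", change)
```

In this made-up example the most recent day receives the largest upward revision, which turns an apparent decline into an increase.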

Lags are typically concentrated in the most recent days in any report. So the decline that now appears from April 22-29 might also disappear after new updates in coming days.

My last charts (above) show running totals, another common way to report trends associated with Covid-19. The orange chart is "new" deaths reported each day, and the blue chart is deaths occurring each day.

Circles highlight the most recent seven days. The running total of "new" deaths shows a rapid increase in deaths. This is not accurate.

The running total of deaths each day shows slower increases that are starting to level off. But this may change when daily death reports are updated with lagging data. So this curve might also be inaccurate.
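A running total is simply the cumulative sum of the daily counts. A minimal sketch, with invented daily counts standing in for either chart:

```python
from itertools import accumulate

def running_totals(daily_counts):
    """Return the cumulative sum of a sequence of daily counts."""
    return list(accumulate(daily_counts))

# Hypothetical daily counts (invented, not Ohio data).
daily = [0, 1, 2, 2, 5]
print(running_totals(daily))  # [0, 1, 3, 5, 10]
```

Because each point includes every earlier day, a late upward revision to recent daily counts steepens the end of the curve, so a curve that looks like it is leveling off can change shape after later updates.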

Better ways to accurately report trends associated with Covid-19

A complete count of cases associated with Covid-19 probably will not be available for months or years. But the public, public health officials, and policy makers cannot wait that long. There is enormous demand for immediate information because we need to slow the virus now.

The best way to minimize the inaccuracies created by lagging data is by averaging over a long period of time. Most of this period should be days where counts have stabilized, and major revisions have ended.

For example, counts for the number of deaths in Ohio are typically revised for about 10 days after initial reporting, so trends should cover periods of at least 30 days. But counts for the number of hospitalizations are typically revised for a much longer period, so trends should account for this difference.
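The arithmetic behind this rule of thumb can be sketched as the share of a trend period that falls outside the revision window. The function name is hypothetical, and the sketch assumes revisions are confined to the most recent days of the period.

```python
def stable_share(period_days, revision_days):
    """Fraction of a trend period whose counts are no longer being revised.

    Assumes revisions only affect the most recent `revision_days` days.
    """
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    stable_days = max(period_days - revision_days, 0)
    return stable_days / period_days

# With about 10 days of revisions, a 30-day period is two-thirds stable.
print(round(stable_share(30, 10), 2))
# A 14-day period with the same lag is mostly still under revision.
print(round(stable_share(14, 10), 2))
```

A measure with a longer revision window, such as hospitalizations, would need a correspondingly longer period to keep most of its days stable.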

I use the change every three days, expressed as a ratio, to estimate Covid-19 trends. An example is (April 28 deaths/April 25 deaths). This measure, from economist Arnold Kling, is simple and intuitive -- a result larger than 1 means deaths are increasing, smaller than 1 means deaths are decreasing.

I then calculate the median change for the most recent 30 days to determine the trend (see note a). This statistic was 1.06 on April 29, meaning the median three-day change in deaths was a 6 percent increase. Six days earlier the median three-day change was 1.17, or a 17 percent increase. So increases in Ohio deaths may have slowed.
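The calculation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the exact procedure used for the figures above: the function names are hypothetical, the counts are invented, and the sketch takes "the most recent 30 days" to mean the last 30 three-day ratios.

```python
from statistics import median

def three_day_ratios(counts, step=3):
    """Ratio of each day's count to the count `step` days earlier."""
    ratios = []
    for i in range(step, len(counts)):
        earlier = counts[i - step]
        if earlier > 0:  # skip days where the ratio is undefined
            ratios.append(counts[i] / earlier)
    return ratios

def median_recent_ratio(counts, window=30, step=3):
    """Median of the most recent `window` ratios, or None if there are none."""
    recent = three_day_ratios(counts, step)[-window:]
    return median(recent) if recent else None

# Hypothetical counts for six consecutive days (invented, not Ohio data).
example_counts = [100, 110, 121, 106, 132, 121]
print(three_day_ratios(example_counts))
print(median_recent_ratio(example_counts))
```

Here the three ratios are 106/100, 132/110 and 121/121, and the median is the middle value, 1.06. The single large ratio of 1.2 does not pull the result upward the way it would with a mean.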

Similar measures are the best way to report other trends associated with Covid-19. But I don't think it's realistic to expect such statistics to become the norm.

However, reports should stop focusing on daily reports of "new" cases. Instead, they should report the current total number of cases for a relevant period of time.

The best measure of a trend is cases on the actual days the cases occurred. This should be the preferred measure for reports whenever the data is available.

Regardless of the measure, the time period should be part of every report. This period should minimize the number of days still being revised because of lagging data. All reports should explain the limitations of the measure being used.

Graphics that show trends across time should only be used for day-of-occurrence data. These graphics should explain that counts for recent days may change because of lags in reporting data.

Note a: The median minimizes the influence of unusually large changes, or outliers.