Showing posts with label #journalism. Show all posts

Thursday, April 30, 2020

A note on minimizing common problems reporting trends in Covid-19 statistics

Some of the most widely reported statistics about Covid-19 are based on daily updates of total "new" cases in the previous 24 hours. These updates are used to track increases and decreases in everything from the number of tests to the number of deaths from the pandemic.

But these daily totals can be misleading. The day that a case is first reported may not be the day the case actually occurred. Cases are often reported days or weeks after their occurrence because of lags in the collection and distribution of Covid-19 data.

Reports that use the actual day a case occurred are a better way to measure trends. But reports based on the day of occurrence can also be misleading because of lags in reporting cases.

The problems created by lags are especially apparent in many popular dashboards that use graphics to illustrate Covid-19 trends.

The Ohio Department of Health is an exception. The department's dashboard features clear presentations of Covid-19 trends that include the limitations of the data.

The Ohio data is updated every day at 2 p.m. I used data from multiple updates to produce my own graphics and analysis for this post. I have no affiliation with the Ohio Department of Health.


I am using Ohio deaths as an example of general problems reporting Covid-19 tests, cases, hospitalizations and other measures. Covid-19 trends are often reported with bar charts, so I will do the same.

My first two charts (above) cover the period from the first Ohio coronavirus death to April 21. The orange chart shows total "new" deaths reported each day. The blue chart shows the days on which those deaths actually occurred.

The orange chart shows that March 20 is the first day that "new" Ohio deaths were reported. But the blue chart shows the first death actually occurred on March 17.

You can also see that the orange distribution of "new" deaths does not accurately depict the actual distribution of deaths in blue.

The number of "new" deaths being reported appeared to be increasing on April 21. But the blue chart shows that deaths per day appeared to be decreasing.
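The difference between the two charts comes down to which date each death is counted under. Here is a minimal Python sketch of that idea. The records are invented for illustration and are not Ohio figures, and the variable names are hypothetical.

```python
from collections import Counter
from datetime import date

# Invented records for illustration only (not Ohio data):
# (date the death occurred, date it was first reported).
records = [
    (date(2020, 3, 17), date(2020, 3, 20)),
    (date(2020, 3, 18), date(2020, 3, 20)),
    (date(2020, 3, 18), date(2020, 3, 22)),
    (date(2020, 3, 19), date(2020, 3, 22)),
]

# The "blue" view: count each death on the day it occurred.
by_occurred = Counter(occurred for occurred, _ in records)

# The "orange" view: count each death on the day it was first reported.
by_reported = Counter(reported for _, reported in records)

for day in sorted(set(by_occurred) | set(by_reported)):
    print(day, "occurred:", by_occurred[day], "reported:", by_reported[day])
```

The same four records produce two different daily distributions, which is why the two kinds of chart can point in opposite directions.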

These next charts (above) include data from eight more days, extending the analysis from April 21 to April 29. The orange chart includes a dramatic spike on April 29 because 138 "new" deaths were first reported on that day.

The spike is misleading. The blue chart includes these "new" deaths on days the deaths actually occurred. The blue chart again shows the actual number of deaths per day appears to be decreasing.

But the blue chart also uses lagged data, making it incomplete and possibly misleading. Recall the decline in my first blue chart ending on April 21. That decline vanished after eight days of updates.

Five of the 138 "new" deaths occurred on unknown dates, so they are not reported in the blue chart. This is not unusual; dates are normally added in subsequent updates. But this is another example of how lags complicate efforts to identify Covid-19 trends.

This next graphic (above) compares both charts of "new" deaths reported each day. Ovals highlight April 14-21. The distribution of deaths did not change from the first to the second chart. This shows how daily totals can persistently misrepresent the distribution of deaths.

This graphic (above) compares both charts of the actual number of deaths each day. Ovals highlight April 14-21.

The distribution of deaths changed from the first to the second chart. The first chart incorrectly shows a decline in deaths. The second chart shows deaths were actually constant or increasing from April 14-21.

This illustrates how the accuracy of lagged data improves over time. As more deaths were reported, the counts for the highlighted days were revised upward.
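One way to see these revisions is to compare two snapshots of the same day-of-occurrence counts taken on different update dates. The sketch below assumes each snapshot is a simple mapping from date of death to count. The numbers are invented, and `revisions` is a hypothetical helper, not part of any dashboard.

```python
def revisions(earlier, later):
    """Change in each day's count between two snapshots of the same data."""
    return {day: later.get(day, 0) - earlier.get(day, 0) for day in sorted(later)}

# Invented counts keyed by date of death (month-day); not Ohio figures.
snapshot_first = {"04-19": 30, "04-20": 24, "04-21": 15}
snapshot_later = {"04-19": 33, "04-20": 31, "04-21": 32}

print(revisions(snapshot_first, snapshot_later))
```

In this made-up example the most recent day is revised the most, and the decline in the first snapshot disappears in the later one.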

Lags are typically concentrated in the most recent days in any report. So the decline that now appears from April 22-29 might also disappear after new updates in coming days.

My last charts (above) show running totals, another common way to report trends associated with Covid-19. The orange chart is "new" deaths reported each day, and the blue chart is deaths occurring each day.

Circles highlight the most recent seven days. The running total of "new" deaths shows a rapid increase in deaths. This is not accurate.

The running total of deaths each day shows slower increases that are starting to level off. But this may change when daily death reports are updated with lagging data. So this curve might also be inaccurate.
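A running total is just a cumulative sum of the daily counts, so it inherits whatever distortion the daily counts carry. Below is a small sketch using invented daily counts, not Ohio data. Both series describe the same ten deaths, counted once by report date and once by date of occurrence.

```python
from itertools import accumulate

# Invented daily counts for the same five days; both lists sum to 10.
reported_each_day = [0, 0, 2, 1, 7]   # a backlog is reported on the last day
occurred_each_day = [3, 3, 2, 1, 1]   # the deaths were actually spread out

running_reported = list(accumulate(reported_each_day))
running_occurred = list(accumulate(occurred_each_day))

print("by report date:    ", running_reported)
print("by occurrence date:", running_occurred)
```

The two curves end at the same total, but the report-date curve jumps at the end while the occurrence-date curve flattens.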

Better ways to accurately report trends associated with Covid-19

A complete count of cases associated with Covid-19 probably will not be available for months or years. But the public, public health officials, and policy makers cannot wait that long. There is enormous demand for immediate information because we need to slow the virus now.

The best way to minimize the inaccuracies created by lagging data is by averaging over a long period of time. Most of this period should be days where counts have stabilized, and major revisions have ended.

For example, counts for the number of deaths in Ohio are typically revised for about 10 days after initial reporting, so trends should be for periods of at least 30 days. But counts for the number of hospitalizations are typically revised for a much longer period, so trends should account for this difference.

I use the change every three days, expressed as a ratio, to estimate Covid-19 trends. An example is (April 28 deaths / April 25 deaths). This measure, from economist Arnold Kling, is simple and intuitive -- a result larger than 1 means deaths are increasing, smaller than 1 means deaths are decreasing.

I then calculate the median change for the most recent 30 days to determine the trend.ᵃ This statistic was 1.06 on April 29, meaning the median three-day change in deaths was a 6 percent increase. Six days earlier the median three-day change was 1.17, or a 17 percent increase. So increases in Ohio deaths may have slowed.
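The calculation can be sketched in a few lines of Python. This is an illustration under assumptions, not the exact procedure behind the figures above. It assumes the counts are cumulative totals listed one per day, and it takes "the most recent 30 days" to mean the last 30 three-day ratios. The function names and the sample numbers are invented.

```python
from statistics import median

def three_day_ratios(totals, step=3):
    """Ratio of each day's count to the count `step` days earlier."""
    return [
        totals[i] / totals[i - step]
        for i in range(step, len(totals))
        if totals[i - step] > 0  # skip days where the earlier count is zero
    ]

def median_recent_ratio(totals, window=30, step=3):
    """Median of the most recent `window` three-day ratios."""
    ratios = three_day_ratios(totals, step)
    return median(ratios[-window:])

# Made-up cumulative totals, for illustration only (not Ohio data).
example_totals = [10, 12, 15, 20, 24, 30, 40, 48, 60]
print(median_recent_ratio(example_totals))
```

A result above 1 means the count grew over a typical three-day span, and a result below 1 means it shrank. Using the median rather than the mean keeps one unusually large jump from dominating the result.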

Similar measures are the best way to report other trends associated with Covid-19. But I don't think it's realistic to expect such statistics to become the norm.

However, reports should stop focusing on daily reports of "new" cases. Instead, report the current total number of cases for a relevant period of time.

The best measure of a trend is cases on the actual days the cases occurred. This should be the preferred measure for reports whenever the data is available.

Regardless of the measure, the time period should be part of every report. This period should minimize the number of days still being revised because of lagging data. All reports should explain the limitations of the measure being used.

Graphics that show trends across time should only be used for day-of-occurrence data. These graphics should explain that counts for recent days may change because of lags in reporting data.

a The median minimizes the influence of unusually large changes, or outliers.

Saturday, March 7, 2020

Coronavirus is a test for local journalism, will it pass?

Information and misinformation about the new Coronavirus has for weeks been easily available to anyone with an Internet connection -- i.e. almost everyone in the United States. So news organizations that don't cover this story until the virus is detected in their community are failing an important test.

People want information because they are justifiably concerned about the Coronavirus. Many people are getting sick, and some are dying. There is not yet a medicine or vaccine to treat the virus. Mobile phones and computers make it easy to find, follow, and share reports about the virus on social media, search engines, and websites.

Local journalists compete directly with the information that people are finding on the internet. Journalists who aren't covering this story are losing this competition and signaling irrelevance to potential audiences. This is not a good strategy when local journalism is struggling to survive.

Searches coincide with developments in the news

I live in Athens, Ohio. The state has not yet reported any infections. But Google's data on the volume of Coronavirus searches shows interest in Ohio coincides with major news about the virus.

The chart compares Ohio searches on the topic of Coronavirus with Ohio searches on the topic of the flu from December to March. Each topic includes many different search terms. Interest is measured on a scale from zero to 100, where 100 represents a peak in searches.¹
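To show how a zero-to-100 index of this kind behaves, here is a simplified Python sketch. It is not Google's actual method, which relies on sampling and other steps not described here. It only assumes the index reflects a topic's share of all searches, rescaled so the peak equals 100. The function name and the numbers are invented.

```python
def interest_index(topic_searches, all_searches):
    """Scale a topic's share of searches so the peak period equals 100."""
    shares = [t / a for t, a in zip(topic_searches, all_searches)]
    peak = max(shares)
    if peak == 0:
        return [0] * len(shares)
    return [round(100 * s / peak) for s in shares]

# Invented counts for three periods; not real search data.
topic = [5, 20, 10]
total = [1000, 1000, 2000]

print(interest_index(topic, total))
```

In this example the third period has twice as many topic searches as the first, but the same index value, because overall search volume also doubled. That is why such a chart shows relative interest and not the number of searches.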

The chart begins Dec. 31, 2019, when China first reported the new virus to the world, according to a timeline in the New York Times. Searches for information on the flu have not changed much in response to virus news, but the opposite is true for Coronavirus.

The first Coronavirus case in the United States was reported on Jan. 21, 2020, the day the first spike in Ohio searches begins. The Trump administration announced restrictions on travel from China on Jan. 31, 2020, which was followed by the rapid decline in searches that ended the first spike.

The second spike in Coronavirus searches began Feb. 23, 2020, the day that authorities in Italy responded to a major outbreak by shutting down some Italian towns. The next day the Trump administration asked Congress for $1.25 billion to combat the virus in the United States. Searches in Ohio have been spiking ever since.

The trends in Ohio show journalists throughout the state should have been covering the story no later than Jan. 21.

Athens is home to Ohio University, which has extensive international connections and a medical school. So I've been surprised by the lack of coverage in two local newspapers that claim to serve the community. The first Coronavirus story that I read in either paper was published in The Athens News three days ago, on March 4, 2020. I might have missed some earlier stories, but if so there were very few of them.

Ohio is not uniquely interested in Coronavirus. Google data shows interest across the United States coinciding with the same major developments in the Coronavirus story.

Journalists can develop unique local stories

Coronavirus is a complicated story involving science, public health, politics, and local jobs and businesses. So many local journalists will probably have to learn a lot of new information at the same time they are covering the story.

Repeating information that is already available on the internet will not make local stories competitive. Local journalists must provide new and valuable information to attract and hold audience attention.

Fortunately, the internet also gives local journalists direct access to the global conversation among experts trying to contain the virus. This makes it possible to quickly find accurate information that can be used to develop differentiated local stories. 

For example, former FDA Commissioner Dr. Scott Gottlieb has warned that local health departments and hospitals might be rapidly overwhelmed if the virus becomes epidemic. His concerns are discussed on his Twitter feed (@ScottGottliebMD), which also references his op-eds in the Wall Street Journal and elsewhere.

Local journalists who publish stories on the limited resources available to fight an epidemic are likely to attract an audience that will stay with them for additional coverage. 

Journalists risk losing audiences if they don't cover this story

Local news organizations have limited staff. Many local newspapers are struggling financially. But journalists who don't re-order priorities to provide continuing coverage of the Coronavirus risk making those problems worse.

When someone is concerned or frightened, they keep looking until they find information that answers their concern. Someone who cannot get local Coronavirus information from their community newspaper or television station will go elsewhere to find what they need. They might never return.

The Coronavirus is a major test of credibility for local journalists. But the virus is also an opportunity for journalists to show audiences why their work matters. I hope journalists pass this test.

1 According to Google, trends data is based on representative samples of all searches on a topic. The samples are used to create an index measuring the proportion of searches on a topic. Increases/decreases mean a larger/smaller proportion of searches in Ohio or the United States were about Coronavirus or the flu. This shows increases/decreases in interest about a topic. Charts do not show the actual number of searches.