Using data science to fight the virus and minimize economic consequences

Friday, December 11, 2020
National Security Task Force December 2020 Progress Report*

The current pandemic is not a previously unknown Black Swan, or a perfect storm as a result of an unlikely combination of rare events. There is a long history of global outbreaks, and modern human activities, including increased contact between humans and wild animals and global transportation networks, have increased the threat of new infectious diseases. Participants in the Hoover Institution’s National Security Task Force have undertaken to examine how advances in technology might offer new and better tools to combat the COVID-19 pandemic and increase resilience to future biological threats.

Given the prominence of Stanford and Silicon Valley in the development of artificial intelligence technologies and their application to practical problems, it is natural to examine the potential of data science to counter the pandemic and minimize the economic consequences. We have found that modern machine learning techniques are being used in the fight against the virus in a variety of ways, including in the rapid development of vaccines and therapeutics, and improved methods for detection of the virus from images of lungs and sounds of coughs. But we find few examples of the use of machine learning to inform public health decisions, in large part because of the lack of an adequate supply of health data to support such a system.

The publicly available dashboards and forecasts are generally based on more traditional statistical approaches. Among the most promising is the “nowcasting” indicator dashboard (predicting the state of the pandemic today from analysis of recent data), developed by a research group at Carnegie Mellon to monitor seasonal influenza, and adapted to estimate daily COVID-19 prevalence at the county level.[1] This system incorporates non-traditional data sources, including personal mobility data, web searches, doctor visits, and surveys carried out over social media.

We continue to believe that data science techniques, particularly machine learning, combined with large amounts of data (from virus test results, health information, economic activity, government interventions, and other sources), have potential for informing complex decisions on public health measures to curtail the COVID-19 pandemic and minimize economic disruption. Such a system could evolve over time to take full advantage of modern data collection and analysis to make our society more resilient against future pandemics and other biological threats. Our Task Force’s charge is to understand the feasibility and status of such a system and its potential for informing decisions on responses to the pandemic.

A machine learning system to inform public health decisions

A research group at IT consultancy Cognizant Technology Solutions has taken a first step toward development of a system using artificial intelligence to support public health officials in making decisions on social distancing measures, school and business closings, and other interventions to counter the virus while minimizing economic consequences.[2] They consider their prototype system to be a proof of concept, demonstrating the potential power of data science to address this problem. Cognizant has developed a machine learning algorithm, but the data that it draws upon is limited – publicly available information on confirmed infection cases each day in each state or country, and eight indicators of government response measures (school and workplace closings, and restrictions on gatherings and travel). For example, the data for the state of California would consist of one number for each day representing the number of new confirmed cases of COVID-19, and eight numbers representing the number and stringency of restrictions in place on that day.
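To make the size of that input concrete, the sketch below lays out one region-day as a record of nine numbers: one case count and eight stringency levels. It is an illustration only, not Cognizant's code. The indicator names, the integer stringency scale, and the example values are assumptions introduced here; the report names only the broad categories of measures.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple

# Illustrative names for the eight response indicators (assumed, not taken
# from the report, which lists only the broad categories).
INDICATORS = (
    "school_closing", "workplace_closing", "public_events", "gatherings",
    "public_transport", "stay_at_home", "internal_movement",
    "international_travel",
)


@dataclass(frozen=True)
class DailyRecord:
    """One region on one day: a case count plus eight stringency levels."""
    region: str
    day: date
    new_cases: int
    stringency: Tuple[int, ...]  # one level per entry in INDICATORS

    def __post_init__(self) -> None:
        if len(self.stringency) != len(INDICATORS):
            raise ValueError("expected one stringency level per indicator")
        if self.new_cases < 0:
            raise ValueError("new_cases cannot be negative")


def to_feature_row(rec: DailyRecord) -> List[float]:
    """Flatten a record into the nine numbers a model would see for that day."""
    return [float(rec.new_cases)] + [float(s) for s in rec.stringency]


# Made-up values, for shape only.
example = DailyRecord("California", date(2020, 8, 1), 1000,
                      (3, 2, 2, 4, 1, 2, 1, 3))
print(to_feature_row(example))
```

Seen this way, the limitation described above is plain: a full year of data for one state amounts to a few thousand numbers, with no information on demographics, health status, or economic activity.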

Notwithstanding the paucity of data, the prototype system generates information on the relationship between the stringency of government interventions and the expected future spread of the virus. In principle, public health officials could use such relationships to help inform decisions on interventions (school and business closures, and the like). The authors call for updating the system with new data as it becomes available. “With further data and development, the approach may become a useful tool for policy makers, helping them to minimize the impact of the current as well as future pandemics.”

A next step is now underway to develop a more robust system supported by more and finer-grained data, building on the work of the Cognizant research group, in the form of an XPRIZE Challenge:

“XPRIZE and Cognizant have partnered to launch the Pandemic Response Challenge, a $500K, four-month Challenge to use artificial intelligence (AI) to generate data-driven, actionable plans that will empower decision makers to safely reopen their societies and economies during the COVID-19 Pandemic.”

The Challenge is to produce and demonstrate by late February an improved version of the machine learning system initially developed by the Cognizant group. Participating teams are encouraged to combine cutting edge artificial intelligence tools with more and higher-quality data to enable accurate predictions, stronger intervention plans, and continual improvement as new interventions such as vaccinations and treatments become available. Teams are encouraged to include data on demographics, economics, healthcare factors, social distancing, adherence to policies, and more. The Challenge is likely to attract competitors with strong machine learning skills.

Building a data supply chain

A modern data science system to support data-driven decisions on public health measures to counter the virus and minimize the economic impact could potentially draw upon a vast amount of data. Such a system would require three streams – health data, economic activity data, and data on public health interventions. Task Force participants have conducted a series of interviews with data scientists at Stanford, in Silicon Valley, and elsewhere, and with federal and local public health officials, to explore sources for the health and economic data such a system would require, and the needs of public health officials who make decisions.

A key finding is that access to detailed health data, including information from electronic health records of individuals, would be valuable but raises many issues that will be difficult to resolve. Lack of individual-level health data could be a major obstacle, and one that will be difficult to overcome, to the development of data-hungry machine learning approaches.

A data supply chain could include the following elements:

Testing and health data on individuals, reported promptly in usable form. Data to support a machine learning system could integrate every test result with information from that individual’s electronic health records. An initial report could include:

  • Test results (for virus or antibodies).
  • Demographic information (age, sex, location).
  • Symptoms at the time of the test.
  • Prior health conditions.
  • Admitted to a hospital? Admitted to an ICU?

Subsequent reports could include:

  • Subsequent symptoms.
  • Treatments.
  • Outcome (discharged? died? persistent issues?).

Such a health data stream could be designed to provide the machine learning system the volume and precision necessary to make accurate inferences about the spread of the virus, and at the same time be sufficiently selective and targeted to secure the cooperation of the health care and health records entities holding the data, and to conform to legal restrictions.
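The initial and subsequent reports listed above could be represented as two small record types linked by a pseudonymous token. The sketch below is one possible layout, not a proposed standard: every field name, the `person_token` linking scheme, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class InitialReport:
    person_token: str              # pseudonymous link key (assumed scheme)
    test_type: str                 # "virus" or "antibody"
    test_positive: bool
    age: int
    sex: str
    location: str
    symptoms: List[str] = field(default_factory=list)
    prior_conditions: List[str] = field(default_factory=list)
    hospitalized: bool = False
    icu: bool = False


@dataclass
class FollowUpReport:
    person_token: str
    symptoms: List[str] = field(default_factory=list)
    treatments: List[str] = field(default_factory=list)
    outcome: Optional[str] = None  # e.g. "discharged", "died", "persistent issues"


def group_by_person(
    initials: List[InitialReport], followups: List[FollowUpReport]
) -> Dict[str, Tuple[InitialReport, List[FollowUpReport]]]:
    """Attach follow-up reports to the initial report with the same token.

    Simplification: if a token has several initial reports, the last one wins.
    Follow-ups with no matching initial report are dropped.
    """
    histories = {r.person_token: (r, []) for r in initials}
    for report in followups:
        if report.person_token in histories:
            histories[report.person_token][1].append(report)
    return histories


# Made-up example records.
first = InitialReport("tok-1", "virus", True, 52, "F", "Santa Clara County",
                      symptoms=["cough"])
later = FollowUpReport("tok-1", treatments=["oxygen"], outcome="discharged")
histories = group_by_person([first], [later])
print(histories["tok-1"][1][0].outcome)
```

The linking step is where the difficulties discussed below arise: a follow-up can only be attached if the same token survives across providers and over time.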

The most valuable data, and the most difficult to acquire, is information from electronic health records. These records are designed for billing purposes rather than to support public health, and their use raises legal and privacy issues as well. Test results and demographic information from health records are straightforward to collect. It is not possible, however, to report all information from electronic health records – the holders of health records would not agree to share all their data, and in any event that would be illegal.

Extracting information relevant to COVID-19 (and perhaps other conditions) from electronic health records would be extremely valuable, but faces major challenges. The various electronic health records enterprises collect and record information (e.g. on symptoms and prior conditions) in different ways, sometimes with time lags. Tracking individuals over time for subsequent reports raises difficult issues. Electronic health records are largely maintained for those with employer health care or Medicare, a subset of society and not representative of the population as a whole, and individuals cannot be tracked if they switch health care providers. Researchers with experience in attempting to extract information from individual electronic medical records for public health purposes report the work is slow and challenging and not now close to being ready to deploy.[3]

In principle, however, county public health officials could require health care entities holding electronic health records to provide a standardized set of information daily in usable electronic form. (A nationally coordinated collection effort would be even better.) These reports could be designed and handled to ensure privacy, and would be available only to public health officials and researchers. The content and format of this data stream could be designed by a team including medical professionals, data scientists, health care and health records enterprises, and lawyers. Implementation would raise issues concerning funding the collection and validation of information by health care entities, and ensuring the security and privacy of the data by the government entities that receive it.

The prospect of big data analysis of private medical information will raise privacy concerns that will need to be addressed and resolved in order to secure the support of government officials and the public. There will be public skepticism that data supposedly anonymized and secure could be hacked, manipulated, and abused. And indeed there have been cases where researchers were able to easily identify individuals from supposedly “deidentified” data. Extracting information in real time from individual health records will require many privacy issues to be addressed, and development of data use agreements with the holders of electronic health records. This is likely to be a difficult and time consuming process, but would establish a basis for better responses to future pandemics (and perhaps other conditions as well).
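One common way to gauge the re-identification risk described above is to count how many records share the same combination of indirectly identifying fields such as age, sex, and location (the "k" in k-anonymity). The sketch below is a generic illustration and not a method proposed in this report; the field names, the ten-year age bands, and the four records are invented for the example.

```python
from collections import Counter
from typing import Dict, List, Sequence


def smallest_group_size(records: List[Dict], quasi_identifiers: Sequence[str]) -> int:
    """Return k: the size of the smallest group sharing the same quasi-identifiers.

    k == 1 means at least one person is unique on these fields alone.
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def generalize_age(record: Dict, width: int = 10) -> Dict:
    """Replace an exact age with a band such as '30-39' to coarsen the data."""
    low = (record["age"] // width) * width
    coarse = dict(record)
    coarse["age"] = f"{low}-{low + width - 1}"
    return coarse


# Invented records with names already removed.
records = [
    {"age": 34, "sex": "F", "zip3": "943"},
    {"age": 36, "sex": "F", "zip3": "943"},
    {"age": 61, "sex": "M", "zip3": "940"},
    {"age": 67, "sex": "M", "zip3": "940"},
]
fields = ("age", "sex", "zip3")
k_raw = smallest_group_size(records, fields)
k_coarse = smallest_group_size([generalize_age(r) for r in records], fields)
print(k_raw, k_coarse)
```

In this toy case every record is unique before coarsening and shares its profile with one other record afterward. Coarsening reduces risk at the cost of precision, which is the trade-off any data use agreement would have to settle.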

The difficulty in acquiring health data to support a machine learning system is a reflection of a larger problem – public health authorities struggle to get the information they need to do their jobs. Establishment of a modern system for reporting to public health authorities certain categories of information from electronic medical records could serve both purposes – providing public health officials better situational awareness, and providing data to support a machine learning system to inform public health decisions.

Economic data. In parallel, a second team including data scientists and economists could develop a mechanism for collecting and reporting information on economic and commercial activity, including high-frequency data on employment, business activity, credit card transactions, restaurant reservations, retail traffic, transportation activities, etc.

In contrast to the difficulty of acquiring health data, a number of sources collect and publish near-real-time data on economic activity. For example, the Opportunity Insights Economic Tracker uses anonymized data from private companies such as credit card processors and payroll firms to construct statistics on consumer spending, business revenues, employment rates, and other key indicators of economic activity and reports them daily by county and industry. Other entities collect similar information on retail store foot traffic and restaurant reservations. Such data is available to support a machine learning system to inform policy decisions that take into account the economic impact, and this data stream could be expanded over time.

Data on public health measures and their implementation. A third team including public health officials and data scientists could develop a mechanism for collecting and reporting data on public health measures that are adopted, and on how well they are being implemented (social distancing, mobility, face covering, tele-work, practices at schools, factories, stores, and restaurants, etc.).

Some data is currently available on implementation of public health measures. For example, Google has begun to publish Community Mobility Reports, which use anonymized data provided by applications such as Google Maps to produce a data set updated every day that shows how peoples’ movements have changed during the pandemic. The reports measure visitor numbers to locations such as grocery stores, pharmacies, retail, parks, transit stations, workplaces, and residences. Apple publishes similar data, and other entities publish data on social distancing and sheltering in place. The Carnegie Mellon group conducts surveys of behavior using social media. Again, such data is available to support a machine learning system to inform policy decisions that take into account the implementation of public health measures, and this data stream could be expanded over time.

Contact tracing data. Information could be collected from contact tracing systems, incorporating both traditional labor-intensive methods and location history from smart phone apps.

Future additions. Over time, collection of more types of information could be added, including:

  • Information derived from monitoring media and social media, search histories, data from smart phone apps, and imagery. Modern machine learning techniques can draw inferences from such large data sources.
  • Information and analysis from published reports and studies. Data can be extracted from the very large number of published reports and studies on COVID-19.

Support for data-driven public health interventions

With ample near-real-time data along these lines, data science could be effectively applied to continuously provide accurate awareness of the situation and enable more effective responses.

The first thing that could be done with such a data supply is to infer with great geographic resolution the current state of any epidemic or pandemic. This would further refine the “nowcasting” that has been developed and successfully applied over many years by the research group at Carnegie Mellon, and would help improve current public health practice by improving real-time situational awareness.[4]
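As a toy illustration of the nowcasting idea, the sketch below estimates today's case level from a fast-arriving indicator (for example, a symptom survey rate) by fitting a straight line on past days for which confirmed counts have since become available. This is not the Carnegie Mellon group's method, which combines many indicators; the single-indicator linear model, the function names, and the numbers are assumptions chosen for brevity.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("indicator values are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def nowcast(indicator_history: Sequence[float],
            confirmed_history: Sequence[float],
            indicator_today: float) -> float:
    """Estimate today's confirmed level from today's indicator reading."""
    slope, intercept = fit_line(indicator_history, confirmed_history)
    return slope * indicator_today + intercept


# Made-up numbers: past indicator readings, and the confirmed counts that
# were eventually reported for those same days.
estimate = nowcast([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0], 5.0)
print(estimate)
```

A richer data supply would allow the same idea to be applied with many indicators at once and at county or finer resolution.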

The second thing that could be done with such data is to develop techniques for forecasting the course of an epidemic or pandemic, which takes time because it requires careful measurement of forecasting accuracy over many time periods, locations, and circumstances. The Carnegie Mellon group has over several years developed and demonstrated a capability to forecast seasonal flu. Forecasting the course of a new pandemic is more difficult, in part because there is not enough time to develop confidence in the empirical performance of the system.
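The "careful measurement of forecasting accuracy over many time periods" mentioned above is often done with a rolling-origin evaluation: forecast from each past date using only the data available then, and compare with what later happened. The sketch below shows the bookkeeping with a deliberately naive forecaster. The function names, the persistence baseline, and the sample series are assumptions for illustration and do not describe any group's actual system.

```python
from statistics import mean
from typing import Callable, List, Sequence


def persistence_forecast(history: Sequence[float], horizon: int) -> List[float]:
    """Naive baseline: assume every future day equals the latest observation."""
    return [history[-1]] * horizon


def rolling_origin_mae(
    series: Sequence[float],
    forecaster: Callable[[Sequence[float], int], List[float]],
    horizon: int,
    min_history: int,
) -> float:
    """Mean absolute error over every forecast origin with enough data.

    At each origin the forecaster sees only series[:origin], so the score
    reflects what would have been known at the time.
    """
    errors: List[float] = []
    for origin in range(min_history, len(series) - horizon + 1):
        predicted = forecaster(series[:origin], horizon)
        actual = series[origin:origin + horizon]
        errors.extend(abs(p - a) for p, a in zip(predicted, actual))
    if not errors:
        raise ValueError("series too short for the requested horizon")
    return mean(errors)


# Made-up daily counts rising by two per day.
cases = [10, 12, 14, 16, 18, 20]
print(rolling_origin_mae(cases, persistence_forecast, horizon=1, min_history=2))
print(rolling_origin_mae(cases, persistence_forecast, horizon=2, min_history=2))
```

With only a few months of data on a new pandemic there are few origins to average over, which is one way to see why confidence in a forecasting system takes time to build.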

The third thing that could be done with such data is to learn which public health interventions would have what effects on the spread of the virus and on economic activity, and use that knowledge to optimize the interventions. This is difficult because inferring causality from observational data is difficult. Inferring causality and optimizing public policy generally require experimentation, which takes time and is often difficult in public health settings.[5]

A plausible architecture could include government coordination of the collection of the necessary health data, to establish uniform requirements for the information provided and for the format. While state and local governments currently have the ability to require reporting of information to public health authorities, and to specify the form and content, there would be advantages to national coordination of the collection of health data, with uniform requirements for timely reporting of testing and health data in usable electronic form, to support the basic functions of public health authorities and to support a machine learning system to inform data-driven decisions as well. The reporting of health data would be governed by stringent security and privacy protections, and the information would be used only for public health and research purposes.

The necessary data on economic activity could be acquired from existing sources and their future evolution.

The health data would be made available to research groups under strict privacy and security protections. Research groups at universities and commercial entities could then develop techniques for deriving from the data better awareness of the current state of an epidemic or pandemic, better predictions of its future course, and eventually better ways to optimize public health interventions and minimize the economic consequences.

How can the National Security Task Force contribute?

Given our ties to Stanford and Silicon Valley, the potential role of artificial intelligence in informing decisions on effective actions against COVID-19 and future biological threats while minimizing the economic impact is an obvious subject for Task Force interest and research. This brief paper summarizes what we have learned to date through interviews with data scientists and public health officials, and from published research, on what artificial intelligence can do in this space. We intend to periodically update this paper to remain current with developments in this field.

We also see a potential role for the National Security Task Force in bringing together individuals working on different aspects of this problem. For example, we could consider convening a session with data scientists with differing specialties (for example, experience extracting information from electronic health records, experience in developing machine learning systems, and experience with privacy and security issues), officials of enterprises holding health data, and county and state public health officials. Such a session could facilitate a transition from technical progress to real-world application, and address privacy concerns that could be an obstacle to public acceptance of such use of medical records. Some of the data scientists we have interviewed are part of the Stanford Human-Centered Artificial Intelligence initiative, and HAI could be involved in such a session as well.

*This paper is the first result of a collective effort under the auspices of the National Security Task Force. David Fedor and Admiral James Ellis participated in the interviews and the drafting of this progress report, which also benefited from reviews and comments of other participants in the Task Force.

[1] Delphi Research Group, Carnegie Mellon University, “COVIDcast.” Online at

[2] Risto Miikkulainen, Olivier Francon, Elliot Meyerson, Xin Qiu, Elisa Canzani, and Babak Hodjat, “From Prediction to Prescription: Evolutionary Optimization of Non-Pharmaceutical Interventions in the COVID-19 Pandemic,” arXiv:2005.13766v3, 1 Aug 2020.

[3] Sherri Rose, PhD, Associate Professor, Medicine, Center for Health Policy, Center for Primary Care and Outcomes Research, Stanford University.

[4] Roni Rosenfeld, Leader of the Delphi Research Group, and Professor and Head of the Machine Learning Department, School of Computer Science, Carnegie Mellon University.

[5] Ibid.