In our era of big data, the fastest-growing data sources are web-based: search engine queries, forums, and social media outlets such as Facebook, Twitter, and Instagram. Though social media postings about someone’s lunch or stressful day may appear to be nothing more than a chronicle of life’s most mundane activities, by aggregating and sifting through this electronic water cooler chatter, we have the power to launch a public health revolution.
Skeptics question whether web-based data is trustworthy or valuable, yet my colleagues and I have proven, through multiple studies, the validity of these new information sources. Not only have we substantiated our findings by comparing them to traditional data sources, but we also have done so in a manner that is more timely and cost-efficient—two qualities that can greatly improve health care delivery.
Public health research has two primary stages: information collection, followed by the design and implementation of interventions. For years, this first stage has suffered from a data bottleneck. Traditional data sources (such as reports from clinical visits, telephone polls, and longitudinal studies) are difficult to implement, time-consuming, and expensive. In contrast, web sources such as online drug forums or web searches provide massive amounts of data that are not only cheaper and quicker to collect but that also provide more frank reporting and, through aggregation, maintain the privacy of subjects.
Nowhere is the potential greater than in the area of behavioral medicine, a field that combines health, psychology, biology, and the social sciences. The key to understanding behavioral medicine is an analysis of a person’s health choices and actions: precisely the type of data that can be culled from people’s online searches for health information and their seemingly innocuous social media postings about health behaviors, preferences, and habits.
Consider the case of smoking. According to the Centers for Disease Control and Prevention, each year tobacco smoke kills nearly half a million people in the United States and five million people worldwide. It cost the U.S. economy more than $289 billion between 2009 and 2012, including $133 billion in health care costs and $156 billion in lost productivity. While we know a great deal about how to discourage people from starting to smoke and how to help them quit, such programs require accurate and up-to-date information about the latest trends in tobacco use—information that even at its most basic level is not readily available for the majority of the globe.
So how do we tap the vast reservoir of data in the social chatter occurring around the 21st-century virtual water cooler? By aggregating Google search queries, for example, we have tracked rising interest in electronic cigarettes, as well as an increased interest in smoking cessation after celebrities are diagnosed with cancer. Using advanced natural language processing algorithms to scour the web for articles reporting the latest trends in tobacco use, or to mine Twitter messages about smoking, we can provide public health researchers and public policy organizations with critical, real-time information to aid in the fight against tobacco use.
In the coming years, we will push the limits of mining this data to find even more ways to assist our public health colleagues. Our focus will be on public health areas that traditionally lack data. Consider what these methods can do for understanding mental illness or identifying emerging illicit drugs. Both areas lack good data and thus are years behind in tracking the latest trends. This wealth of data promises to revolutionize what’s possible in these fields, allowing researchers to ask exciting new questions and changing the very nature of public health research. We are on the cusp of an information revolution.
Mark Dredze is an assistant research professor in the Department of Computer Science. Learn more about his work at www.socialmediahealthresearch.org.