Crystal Ball: In the Future, How Will We Be Able to Access the Data We Gather Today?

Summer 2009

Sayeed Choudhury

We’re clearly very good at gathering data. Take a picture with a cell phone and, in addition to the pixels that make up the image, tags describing its context are automatically attached to the photo: the date, the time, and even the GPS coordinates. On a larger scale, scientists using remote sensors placed across the country can gather huge amounts of detailed data without the constraints of location and time. What we need to do is make sure the data we’re collecting can be used, both now and in the future.

One of the problems IDIES is addressing is that we don’t necessarily know how all of this data is going to be used in the future. Though it’s usually collected with a specific purpose in mind, it may be useful to someone else—now or a few years from now—for an entirely different reason. It could be that the metadata, the data about the data, is what makes it valuable. And as more of the research we’re doing is cross-disciplinary, the chances of there being new applications for the data increase.

So the problem becomes more complicated. Not only do we need to capture and store a huge quantity of data, we need to do it in such a way that it’s reliably preserved and accessible to people in multiple disciplines for future use. Data preservation is really about risk management. We need to determine how our energy, time, and resources are best spent so that we can increase the probability that the data we’re collecting today can be used in the future.

Storage is not the solution. In 2008, the world ran out of storage for all the data that’s being collected. Already we have to ask ourselves: How much of the data needs to be stored? And we have to hope that we’re making the right determinations.

Worldwide, industry’s needs have dictated our capacity for data storage. But industry only keeps data as long as it’s legally necessary. If there’s no longer a reason to keep it, it’s gone. Research doesn’t have these kinds of deadlines.

Standardization is an important part of the solution. Right now, data is gathered and metadata attached in a standardized way only in certain disciplines. In astronomy, a field in which researchers from many locations may use the same large instruments, there’s more standardization. But in chemistry, a field where people work in individual labs, there are far fewer standards. Within five years, IDIES hopes to create standards across astronomy, biology, and Earth sciences. This will be critical to the future of multidisciplinary research.

What surprises me more than anything is the lack of urgency around this issue. Hopkins is putting in place an infrastructure that we hope others can copy, and in doing so is defining a new role for libraries in data management.

(Note: Johns Hopkins is currently one of two organizations with a pending award through the National Science Foundation’s DataNet program for $20 million over five years.)