Inverting the Model of Genomics Data Sharing

Spring 2022

Harnessing the power of genomics to find risk factors for major diseases relies on analyzing huge numbers of genomes, a costly and time-consuming task. A team co-led by a Whiting School computer scientist has leveled the playing field by creating a cloud-based platform that grants researchers easy access to one of the world’s largest genomics databases.

Known as AnVIL (Genomic Data Science Analysis, Visualization, and Informatics Lab-space), the new platform gives any researcher with an internet connection access to thousands of analysis tools, patient records, and more than 300,000 genomes. The work, a project of the National Human Genome Research Institute, appeared in Cell Genomics.

“AnVIL is inverting the model of genomics data sharing, offering unprecedented new opportunities for science by connecting researchers and datasets in new ways and promising to enable exciting new discoveries,” says project co-leader Michael Schatz, Bloomberg Distinguished Professor of Computational Biology and Oncology at Johns Hopkins. 

Typically, genomic analysis starts with researchers downloading massive amounts of data from centralized warehouses to their own data centers, a process that not only is time-consuming, inefficient, and expensive, but also makes collaborating with researchers at other institutions difficult. Genetic risk factors for ailments such as cancer or cardiovascular disease are often very subtle, so researchers must analyze thousands of patients’ genomes to discover new associations. The raw data for a single human genome comprises about 40 gigabytes, so downloading thousands of genomes to conduct such research can take several days to several weeks. 

“AnVIL will be transformative for institutions of all sizes, especially smaller institutions that don’t have the resources to build their own data centers,” Schatz says. 

In addition, studies that integrate data collected at multiple institutions require each institution to download its own copy while ensuring that patient data security is maintained. This challenge is expected to become even greater in the future, as researchers embark on ever-larger studies requiring the analysis of hundreds of thousands to millions of genomes at once.

“Connecting to AnVIL remotely eliminates the need for these massive downloads and saves on the overhead,” Schatz says. “Instead of painfully moving data to researchers, we allow researchers to effortlessly move to the data in the cloud. It also makes sharing datasets much easier so that data can be connected in new ways to find new associations, and it simplifies a lot of computing issues, like providing strong encryption and privacy for patient datasets.” 
