Putting Genetics on the Map

Winter 2023

Crucial partners for ambitious geneticists the world over, computer scientist Michael Schatz and his lab most recently contributed to an astounding accomplishment: the first truly complete human genome.

Want to create an extremely accurate, high-quality genomic analysis of, say, a tomato or a human being? No problem. All you need are several million dollars, a good cellular specimen, a small army of skilled postdoctoral fellows, and access to the latest sequencing technology. And one more thing: You’ll probably want a superb computer scientist by your side.

For many geneticists working at the highest levels of the field over the last decade, that role has been played by Michael Schatz, a Bloomberg Distinguished Professor of Computer Science and Biology at the Whiting School of Engineering and Krieger School of Arts and Sciences. Schatz has never formally trained in biology, and he hasn’t handled a petri dish since college. But his lab has established itself as a crucial partner for ambitious geneticists both inside and outside Johns Hopkins. He and his graduate students are beloved by biologists because they design software that can effectively resolve the raw data generated by modern sequencing technology, with a minimum of errors and a maximum of efficiency.

Over the last year, Schatz’s lab has reached a new level of prominence. He and his colleagues were a key part of the international consortium that achieved a major milestone in 2021: the first truly complete human genome. Schatz was one of four members of that consortium to be honored on TIME magazine’s list of the 100 most influential people in the world in 2022.

“I love collaborating with Mike because he’s passionate about the work, and because he has no fear,” says Zachary Lippman, a professor of genetics at Cold Spring Harbor Laboratory who has worked with Schatz for more than a decade. “He always thinks about ways to use the latest technology to allow us to look at aspects of the genome that weren’t evident before.”

In the wake of the human-genome triumph and the TIME award, Schatz hasn’t paused for much rest. His lab is engrossed in several projects, some of which are collaborations with the Johns Hopkins University School of Medicine. They are studying the genome of the tomato, with an eye toward increasing global crop yields; they’re examining newly discovered familial risk factors for early-onset pancreatic cancer; and they’re scrutinizing older libraries of genetic data in light of the newly completed human reference genome.

Schatz has an affable, unpretentious presence, and it’s easy to see why he has won teaching awards. When he describes his research, he makes plain that he feels passionate about the potential human consequences. His algorithms are no abstract card trick. “Why is a computer scientist interested in genetics?” he asks. “It’s because it’s so meaningful and it’s so intellectually interesting. We get to study origins of diseases, we get to look at patterns of evolution, we get to look at agriculture, medicine, fuels, food. It’s incredibly meaningful work to be able to do all of this.”

‘Let’s Ask the Right Questions’

As a fledgling programmer in high school and college, Schatz had no particular interest in biology and no idea that his career would center on genetics. Schatz’s undergraduate training in computer science at Carnegie Mellon University was, he says, “all about core techniques—how to program, how to think about algorithms, how to look for patterns in data and reason about data.”

After graduating from CMU in 2000, Schatz took a programming job at a small firm that focused on network security. “We were developing codes and breaking codes for encryption and authentication,” he says. “And it turns out that was excellent training for genetics.” After Schatz had been there for roughly a year, one of his colleagues left the firm to take a job at The Institute for Genomic Research (TIGR), an independent nonprofit center founded by the legendary geneticist J. Craig Venter. The work sounded interesting to Schatz, so—almost on a whim—he applied for a job at TIGR too.

This serendipitous move was the critical transition point, Schatz says. His first roles at TIGR were meat-and-potatoes programming tasks. But as he got to know the biologists and geneticists who made up the core of TIGR’s staff, Schatz became more and more intrigued by the problems they were tackling. He also met the head of TIGR’s bioinformatics department, who would become the most important mentor of his career: Steven Salzberg.

Salzberg, now Bloomberg Distinguished Professor of Biomedical Engineering, Computer Science, and Biostatistics at the Whiting School and the Bloomberg School of Public Health, was one of the earliest computer scientists to plunge fully into genetics. In his spare time as a graduate student at Harvard in the 1980s, he’d audited courses in cellular biology. It was clear to him, he says, that genetics was the most exciting place to be.

To appreciate why an institution like TIGR required the skills of pure computer scientists like Schatz and Salzberg, it helps to understand the basic challenges of genome assembly.

A complex organism’s genome contains hundreds of millions of base pairs: matched pairs of the DNA bases adenine, thymine, guanine, and cytosine, whose various combinations encode instructions for all of the organism’s cellular activity. The human genome is around 3 billion base pairs long, and its largest chromosome is around 250 million base pairs long. In an ideal world, we could read the genome by uncoiling the DNA from each chromosome, running it through an electron microscope, and directly reading the sequence of bases. That isn’t physically possible, so over the last 40 years scientists have MacGyvered an awkward-but-ingenious procedure for decoding genomes: They make many copies of a cell’s DNA. They cut those copies into small fragments. They use machines to read the fragments (from as few as 70 base pairs at a time to as many as a million, depending on the technology). And then, having amassed a huge pile of fragmented, duplicative base-pair data, they use algorithms to infer the cell’s full DNA sequence.

It’s at this last stage, writing algorithms to guide the final assembly of the genome, that computer science comes in. “You can’t read 250 million base pairs at once,” Schatz says. “So we get little pieces of the genome, and then we have to stitch them together like a jigsaw puzzle.”
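To make the jigsaw-puzzle idea concrete, here is a minimal sketch in Python of one textbook approach, greedy overlap merging. It is an illustration only, not the software Schatz’s lab or the T2T Consortium used. The function names, the toy sequence, and the `min_len` threshold are all assumptions invented for this example, and real assemblers must also cope with sequencing errors, repeats, and both DNA strands.

```python
def overlap(a: str, b: str, min_len: int = 4) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 4) -> list[str]:
    """Repeatedly merge the two fragments that share the longest overlap."""
    # Drop exact duplicates (order preserved) and reads contained in other reads.
    contigs = list(dict.fromkeys(reads))
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]

    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining pair overlaps enough; leave separate contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical toy data: a 20-base "genome" read as shuffled 8-base fragments,
# with one fragment read twice.
toy_reads = ["CTGAAGTC", "ATGCGTAC", "AGTCGGAT", "GTACCTGA", "ATGCGTAC"]
print(greedy_assemble(toy_reads))  # prints ['ATGCGTACCTGAAGTCGGAT']
```

The toy sequence was chosen so that every four-base stretch is unique, which is why the greedy strategy recovers it exactly. Real genomes are full of repeated stretches, and that is where simple strategies like this one break down.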

It was during an early project at TIGR that Salzberg first became impressed by Schatz’s intellectual dexterity. The team was compiling a genome using a novel software system, Salzberg recalls. “And what Mike started doing right away was saying, ‘Oh, let’s play around with this assembler. Let’s change things’—it had a lot of parameters you could adjust. Mike said, ‘Let’s try to optimize it for each genome. Give me the raw data, and I will assemble it 12 different ways and give you a much better assembly than what would have come out if we’d just run it in the default mode.’”

In 2005, Salzberg left TIGR for the University of Maryland, and he took several of his proteges with him. Schatz enrolled as a doctoral student there, with Salzberg as his primary adviser. “I had a huge advantage when I started my PhD,” Schatz says, “because I’d already been working in the trenches at TIGR for almost four years. I’d had early exposure to the key problems, the key technologies.”

Schatz’s years at the University of Maryland, from 2005 to 2010, happened to coincide with the advent of so-called “second-generation” sequencing techniques. These new technologies were far faster and cheaper than previous sequencing systems—which meant that Schatz and his fellow genome-assemblers were suddenly being asked to write algorithms that could digest much larger quantities of data.

In addition to improving his assembly algorithms, Schatz also became concerned during his grad school years with the problem of managing the sheer volume of data being generated by second-generation sequencing. The task of analyzing a single genome might be more than a single server could handle—so Schatz created one of the earliest distributed cloud-computing systems for processing and storing genetic data, a project known as CloudBurst. Today, he helps manage AnVIL, a vastly larger cloud-computing resource for geneticists.

When he completed his doctorate in 2010, Schatz was hired as a faculty member at Cold Spring Harbor Laboratory on Long Island. There he began his long collaboration with Zachary Lippman, one of the most prominent plant geneticists of his generation. “I immediately could sense that Mike has the same enthusiasm and passion for science that I do,” Lippman says. “If I proposed something ambitious, he’d say, ‘Let’s go for it. Let’s just make sure we can get it done. Let’s get the right people together. Let’s ask the right questions. Let’s get the right amount of money.’ Every aspect of how he was pushing the science was exactly the way that I push my own science.”

Schatz left Cold Spring Harbor in 2015, when he was hired by Johns Hopkins as a Bloomberg professor. But he has maintained active collaborations with Lippman and others at his old institution. Last year, Schatz assisted Lippman’s team as they used CRISPR gene-editing technology to produce a tomato variety that matures faster and grows more compactly than typical tomato cultivars, an innovation that might help increase global crop yields. Soon they hope to use similar techniques with other staple crops.

Photograph of the human genome team standing in a lab
Human genome team members (left to right): Alaina Shumate (Salzberg lab), Steven Salzberg, Samantha Zarate (Schatz lab), Paul Hook (Timp lab), Michael Schatz, Winston Timp, Roham Razaghi (Timp lab), Rajiv McCoy, Dylan Taylor (McCoy lab), and Ariel Gershman (Timp lab)

Getting the Full Picture

The work that won Schatz the TIME magazine honor, the creation of a genuinely complete human genome, was the product of the Telomere-to-Telomere (T2T) Consortium, a project that involves hundreds of scholars and dozens of universities. The project was launched in 2018 by Karen Miga, an assistant professor of biomolecular engineering at the University of California at Santa Cruz, and Adam Phillippy, the head of the Genome Informatics Section at the National Human Genome Research Institute (and a member of Schatz’s grad-school cohort).

Prior efforts to construct a complete human genome had not quite gotten the full picture, thanks to the limitations of the sequencing technologies that existed a decade ago. The primary human reference genome in use since 2013 particularly lacked information about regions in the centers of chromosomes (centromeres) and at the distal ends (telomeres) of certain chromosomes’ arms. Those regions were not believed to contain many protein-coding genes, but they are sites of structural variations whose significance is only now beginning to be fully appreciated.

“We knew that there is a good reason to try this,” Schatz says, “but it was totally unknown what it would take to actually sequence a complete human genome from scratch.” Schatz mostly watched the project from the sidelines until early 2020, when Miga and Phillippy announced that they had successfully sequenced a complete X chromosome. With that proof of concept in place, it was time to get serious about sequencing and assembling the other 22 chromosomes—and for that, they needed the expertise of Schatz’s lab.

“I remember meeting with Adam before the pandemic,” Schatz says. “The initial plan was that we’d just do a few chromosomes a year. We thought it would be incredibly difficult and that it would take a decade.” But when the pandemic struck, Schatz says, he and many other people in the consortium started to work from home, concentrating extensively on the T2T project. (Schatz notes here the crucial involvement of current PhD student Samantha Zarate, former PhD students Melanie Kirsche and Mike Alonge, and former postdoctoral fellow Sergey Aganezov.) In that atmosphere of intense focus, problems were solved faster than Schatz and his colleagues had expected, and the complete genome was finished by the spring of 2021. (After a year of peer review, the major T2T papers were published in Science in March 2022.)

“This new genome gives us a much better map than we’d had before,” says Winston Timp, an associate professor of molecular biology and biomedical engineering at Johns Hopkins who was a key member of the project. “There are areas called segmentally duplicated genes that we can now really start to understand for the first time.” Compared to the 2013 reference genome, the T2T genome adds roughly 200 million base pairs and resolves roughly 10 million previously erroneous base pairs. Contrary to expectation, the T2T Consortium also discovered new protein-coding genes—at least 140 of them.

Schatz says that he is excited to see scientists from outside the T2T Consortium beginning to use the new reference genome. But he recognizes that many scholars are now coping with a potential headache: Should they re-analyze their existing libraries of human genomes to correct for the errors and additions that were discovered by the T2T team? “Some mutations that were previously considered variants of concern now appear to be potentially benign—just artifacts of the errors and biases of the previous reference genome,” Schatz says. “But it is complicated to re-process millions and millions of old genome sequences unless there is substantial knowledge to be gained.”

Rajiv McCoy, an assistant professor of biology, brought his lab into the T2T project in order to help Schatz compare the new reference genome to an existing genome model. “It was immediately clear to me that this was going to be a high-impact contribution to the field,” McCoy says. “The opportunity kind of came out of nowhere, but I was able to shift my lab’s resources toward it for much of 2021, and we’re still working on related projects. Mike has been incredibly supportive of me ever since I was hired here. Knowing that he’s had my back has made a huge difference.”

Sharing ‘Great Mysteries’

One of Schatz’s major projects this year is a collaboration with Alison Klein, a professor of oncology and pathology at the Johns Hopkins University School of Medicine. Klein has developed a library of specimens from patients who have experienced early-onset pancreatic cancer. “It looks like there’s a strong genetic component to these cases,” Schatz says, “although it hasn’t been recognized using any of the standard approaches.” Together with Winston Timp’s lab, Schatz hopes to help Klein discover previously invisible genetic risk factors for pancreatic cancer.

Why would a high-risk genetic variation still be difficult to detect in 2022? Because, Schatz explains, some genetic risk factors for cancer and other diseases can only be noticed via “long-read” sequencing, rather than the short-read sequencing techniques that are most commonly used in clinical settings. In the clinic, Schatz says, the mutations most commonly examined are changes in just one or two nucleotides. Such small changes can indeed set off cancer in some cases, but “it’s also common for there to be much larger sets of changes,” Schatz says. “Cancer often involves changes on the scale of hundreds of thousands of nucleotides. Those patterns can only be seen with long-read sequencing.”
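One commonly cited reason that large changes hide from short reads is repetition: a short read taken from inside a repeated stretch of DNA matches the reference in several places equally well, so it cannot show that a copy is missing. The Python sketch below illustrates that idea under heavy simplification. The sequences, the helper names `find_all` and `copies_spanned`, and the exact-match comparison are all assumptions made for this example. The article does not describe how Schatz’s lab detects structural variants, and real tools work with error-tolerant alignments.

```python
from typing import Optional


def find_all(reference: str, read: str) -> list[int]:
    """Every start position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def copies_spanned(read: str, left: str, right: str, unit: str) -> Optional[int]:
    """Count copies of `unit` between two unique anchors, if the read spans both.

    Returns None when the read does not reach from one anchor to the other,
    or when the stretch between them is not a whole number of `unit` copies.
    """
    i, j = read.find(left), read.find(right)
    if i == -1 or j == -1 or j < i:
        return None
    between = read[i + len(left):j]
    n, remainder = divmod(len(between), len(unit))
    if remainder != 0 or between != unit * n:
        return None
    return n


# Hypothetical toy sequences: the reference carries three copies of a repeat
# unit, while the patient's sample has lost one copy (a structural variant).
left_flank, right_flank, unit = "ATCGGCTA", "TTGACCGT", "CAGGTTCA"
reference = left_flank + unit * 3 + right_flank
sample = left_flank + unit * 2 + right_flank

short_read = unit    # an 8-base read from inside the repeat
long_read = sample   # a read long enough to span the whole region

print(find_all(reference, short_read))                            # several equally good placements
print(copies_spanned(short_read, left_flank, right_flank, unit))  # None: no anchors in the read
print(copies_spanned(long_read, left_flank, right_flank, unit))   # copy count in the sample
print(copies_spanned(reference, left_flank, right_flank, unit))   # copy count in the reference
```

In this toy case the short read matches the reference perfectly in three places and looks entirely normal, while the long read reaches the unique sequence on both sides of the repeat and reveals that the sample carries two copies rather than three.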

Even one of the best-known cancer-associated genes, BRCA1, in which harmful variants raise the risk of breast and ovarian cancer, still has not been completely mapped. In 2018, Schatz and his colleagues studied germline and tumor cells from patients with breast cancer in New York and discovered patterns of large-scale structural variation. “We found mutations in BRCA1 that are effectively invisible to short-read sequencing,” Schatz says. “They’re invisible to the cancer [lab] panels that you might get for regular patient care. You can only detect them using long-read sequencing.”

Klein says that she is excited to collaborate with Schatz and Timp as she searches for previously unrecognized variants in her registry of pancreatic cancer cases. “My lab could never do this with off-the-shelf software,” she says. “To do work at this level, you need to work with experts in sequencing and genome assembly.”

Beyond his research, Schatz says that he loves every opportunity to teach. In 2019, he won the Johns Hopkins Alumni Association Excellence in Teaching Award from the Whiting School. In recent years, he’s co-taught courses in genetics and computational biology with McCoy, drawing both biology majors and engineering majors. “Students bring so much energy, with their own questions and their own curiosity,” Schatz says. “It’s a privilege to get to share these great mysteries with them.”

“Mike is sitting at just the right spot,” says Timp. “He’s developing techniques that are then used worldwide to solve problems in biology. It’s typical of Hopkins that he’s been able to take advantage of sitting at the intersection between engineering and the clinic.”