Artificial intelligence systems called graph neural networks (GNNs) are used to analyze complex social networks such as Facebook and Twitter, model traffic flow and transportation systems, discover new medications by analyzing the structure of chemical compounds, detect fraudulent activity in financial systems, and more. But to be useful, GNNs must be trained on the massive amounts of data in these complex systems—a costly process in terms of both time and money.
Now, a team that includes a Johns Hopkins applied mathematician has found a faster, less expensive way to train those systems, a breakthrough that promises to help AI systems handle network data far more efficiently and cost-effectively.
“Usually, you train GNNs on large networks with many connections. Our approach was different: We trained them on a smaller subgraph, which is basically a sample of the larger system we want it to analyze,” said team member Luana Ruiz, an assistant professor in the Whiting School of Engineering’s Department of Applied Mathematics and Statistics. “We found that using these smaller samples helps the computer program handle large sets of information more efficiently.”
The team’s results appeared on the preprint site arXiv.
Ruiz explained that the fundamental element in her team’s new approach is that the subgraphs used for training must have the same connection patterns as the full graph. That requires using sampling methods that preserve what she calls “the local neighborhood structures and connected components of the graph.”
“When you’re working with graphs or networks, it’s important that the way you pick and choose data only affects nearby parts and keeps things connected. So, we studied how these sampling-based GNNs behave using a theory called ‘local graph limits,’ also known as Benjamini-Schramm limits, kind of like looking at how things behave locally in a neighborhood,” said Ruiz.
When the subgraphs retained the connection patterns of the large whole, training GNNs on the smaller sections produced results that were very close to those trained on the entire graph, she said.
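To make the idea of a neighborhood-preserving sample concrete, here is a minimal Python sketch. It is not the team's actual sampling procedure, which the article does not describe in detail. It is one simple illustration that assumes the graph is stored as a dictionary of neighbor sets. The function name `sample_local_subgraph` and its parameters `num_seeds` and `radius` are hypothetical. The sketch picks a few random seed nodes and keeps every node within a fixed number of hops of a seed, so each seed's local neighborhood carries over into the sample.

```python
import random
from collections import deque


def sample_local_subgraph(adj, num_seeds, radius, seed=0):
    """Keep every node within `radius` hops of randomly chosen seed nodes.

    `adj` maps each node to the set of its neighbors (undirected graph).
    Returns the induced subgraph in the same dictionary-of-sets format.
    """
    rng = random.Random(seed)
    seeds = rng.sample(sorted(adj), min(num_seeds, len(adj)))

    kept = set()
    for s in seeds:
        # Breadth-first search that stops expanding at distance `radius`.
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            if dist[u] == radius:
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        kept.update(dist)

    # Induced subgraph: keep only edges whose endpoints were both sampled.
    # Nodes on the outer boundary may lose neighbors that were not sampled.
    return {u: {v for v in adj[u] if v in kept} for u in kept}


# Toy example: a ring of 20 nodes, sampled around a single seed.
n = 20
ring = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
sub = sample_local_subgraph(ring, num_seeds=1, radius=2, seed=0)
print(len(sub), "of", n, "nodes kept")
```

In this toy ring, the sample is a short path centered on the seed, so nodes near the seed see the same local structure they would see in the full graph. A GNN would then be trained on `sub` instead of `ring`.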
Despite the success of the method, Ruiz and her team discovered that sampling-based GNNs have limitations when dealing with large graphs under certain circumstances.
“Our research provides a new perspective on training deep learning models like graph neural networks using sampling on various real-world networks. We’ve found that most network types have limits: the point where adding more data does not improve performance. Sampling saves a lot of computing power compared to working with large networks and allows machine learning experts to test different methods on smaller parts, which closely represent the whole dataset,” said Ruiz.
Ruiz said that the rapid growth of deep learning on graphs and the proliferation of types of GNNs presented her team with a challenge: choosing which GNNs would work best for the specific task they wanted to accomplish.
To tackle this issue, the researchers devised a theory (a set of assumptions, operations, and mathematical relationships) that covers the various ways of building sampling-based GNNs.
“We created a big theoretical framework that covers all the ways we make sampling-based Graph Neural Networks, putting them together under one ‘umbrella.’ Then, we looked at how well this big framework works, especially when we’re dealing with really huge sets of data,” Ruiz said.
The researchers also tested their methods on citation graphs.
“We applied our methods to study citation graphs, essentially a web of scientific articles, and how they reference each other. By using machine learning on this network, we could uncover patterns in how scientific ideas spread, pinpoint emerging research areas, and even categorize new articles into specific fields automatically. Surprisingly, we found that training our system on just 2.5% of all the articles was enough to get meaningful insights, showing the efficiency of our approach,” said Ruiz.
Ruiz and her team also intend to extend their approach to malware detection in software call graphs and to learning how to control teams of robots.