A team of Johns Hopkins applied mathematicians has devised a new method for untangling the intricate web of connections in complex networks such as social media and the internet. The researchers’ findings, described in The Journal of Computational and Graphical Statistics, have the potential to make computer calculations faster and more efficient. They also shed light on finding the right balance between numerical and statistical errors in data analysis: crucial for making reliable decisions based on data.
Carey Priebe, a professor of applied mathematics and statistics worked with Avanti Athreya and Zachary Lubberts, both associate research professors, to identify communities within a YouTube social network based on connections between users. One challenge they faced was that the network’s connections on any given day were sometimes jumbled and unclear, not always accurately representing the community of users. To tackle this, they used a technique called eigendecomposition, which breaks the network down into simpler parts. The aim was to gain insights into the network’s actual community structure. However, performing these eigendecompositions requires a lot of computer power, and the team was eager to make that process more efficient.
“We discovered that striving for a ‘perfect’ eigendecomposition by using a lot of computational resources is not productive, because the endgame isn’t the eigendecomposition itself: It’s to understand the communities within YouTube,” said Athreya.
To address this challenge, the researchers sought to determine how to allocate computational resources more optimally, including a more refined criterion for terminating an eigendecomposition algorithm. An improved termination criterion not only ensures efficiency but also conserves precious resources, such as time, energy, and effort.
They used spectral properties, which are characteristics of the graph of connections between users, to set a better-adapted stopping point for their algorithm. Much of the improvement they discovered came from balancing two types of error in random data: numerical and statistical. Statistical errors happen because the data is naturally noisy and uncertain, and such errors cannot, in general, be entirely eliminated. Numerical error emerges when a computer program makes small approximations, such as rounding, with each calculation.
“To minimize numerical error, one might spend a lot of computational time and resources, but if the statistical error remains significant, reducing the numerical error may not be worth it, as the statistical error will overshadow any gains,” said Lubberts.
Rather than letting the computer program run until it stopped itself, Priebe’s team devised a new method of measuring two different mistakes and deciding when the program should stop. After testing this new method on data from Youtube, they learned that they could finish the work more quickly with accurate results. This yielded a practical and measurable method to balance various aspects of an algorithm or inference procedure, making it relevant for real-world applications, from social media platforms to transportation systems.
“The machine learning algorithms we depend on, whether they are search-engine results, driving directions, or optimal flight prices, all have a termination point, a point at which the algorithm stops, the list of recommendations ends, enough routes have been searched, and so on. An important question is when to stop and how to do so in a systematic and nuanced manner rather than as a one-size-fits-all default,” said Athreya.
Priebe noted that this project illustrates the ongoing challenge of achieving both speed and accuracy in a specific calculation, crucial for effective statistical inference in modern data contexts.
“My takeaway from this experience is that formal consideration of both the statistical implications of numerical calculations, and the relevance of the numerical calculations to statistical inference, can lead to significant breakthroughs. To me, this project perfectly illustrates the wisdom of one my favorite quotations, from Leopold Kronecker to Hermann von Helmholtz: ‘The wealth of your practical experience with sane and interesting problems will give to mathematics a new direction and a new impetus,’” said Priebe.
The researchers emphasize that this concept has broader relevance and reflects everyday decisions we all encounter. From deciding whether to invest hours cleaning a car that will soon be dirty again to contemplating spending time searching for the perfect vacation deal: it is all about finding the optimal point to stop and commit to a decision, they say. The researchers stress that an approach that incorporates and distinguishes between various sources of error can minimize resource expenditures and enhance the accessibility and affordability of analyzing vast and intricate datasets.