Author: Randy Rieland
A crowd of bystanders watches a fire, while two people record on their camera phones.
"Maybe a person wants to write an article about a shooting that happened. They could start searching for anything that had been written about it. But this could help them search for any video that's been uploaded on this event in order to find additional information."—Benjamin Van Durme, Associate Professor of Computer Science (Photo via Adobe Stock AI)

When the roof of the Notre Dame Cathedral in Paris burst into flames in April 2019, dozens of bystanders pulled out their phones to capture the terrible scene, then uploaded the video footage to social media sites.

That’s often the case with breaking news events: Smartphones, not news trucks, are first on the scene. But those videos are usually seen without much context unless they’re incorporated into professional broadcast news footage, and they can be difficult to find later. As a result, the opportunity for a comprehensive retelling of how an event unfolded is often squandered.

Johns Hopkins researchers want to make raw, on-the-scene video more meaningful by providing more depth and perspective. A team of cognitive and computer scientists from the Whiting School of Engineering has created MultiVENT—short for Multilingual Videos of Events with aligned Natural Text—a dataset that contains nearly 2,400 videos related to 260 news events.

Notably, MultiVENT uses artificial intelligence to make it possible to find videos based on the images or audio within the videos themselves rather than on their metadata.
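
In rough terms, content-based retrieval of this kind works by mapping a text query and sampled video frames into a shared embedding space and ranking videos by how closely their visual content matches the query. The sketch below is purely illustrative and is not the team's code; the off-the-shelf CLIP-style model and the frame file paths are assumptions.

```python
# Illustrative sketch only: ranks videos for a text query by comparing
# CLIP-style embeddings of sampled frames against the query embedding,
# instead of relying on titles, tags, or other metadata.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # maps images and text into one embedding space

def score_video(query: str, frame_paths: list[str]) -> float:
    """Return the best frame-to-query similarity for one video."""
    frames = [Image.open(p) for p in frame_paths]   # frames sampled from the video beforehand
    frame_embs = model.encode(frames)               # one embedding per frame
    query_emb = model.encode(query)                 # embedding for the text query
    return float(util.cos_sim(query_emb, frame_embs).max())

# Hypothetical index: video id -> paths of a few sampled frames
videos = {
    "notre_dame_clip_01": ["frames/nd_01_000.jpg", "frames/nd_01_030.jpg"],
    "street_fair_clip_07": ["frames/sf_07_000.jpg", "frames/sf_07_030.jpg"],
}

query = "cathedral roof on fire"
ranked = sorted(videos, key=lambda vid: score_video(query, videos[vid]), reverse=True)
print(ranked)  # videos whose visual content best matches the query come first
```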

Another key feature is that it pairs citizen videos with professional footage and links both to corresponding Wikipedia pages, providing a more nuanced and accurate depiction of each news event. Rather than watching a single TikTok video of the Notre Dame fire, for example, the user could witness the blaze progress over time and from a variety of perspectives.

Uncurated videos are increasingly seen as an important part of news coverage, particularly since they reflect unfiltered eyewitness perspectives, says Ben Van Durme, an associate professor of computer science in the Whiting School of Engineering.

“There are a number of potential motivations for creating this dataset,” he says. “But one thing that I’m particularly interested in is developing technology that will enable citizen journalism. Maybe a person wants to write an article about a shooting that happened. They could start searching for anything that had been written about it. But this could help them search for any video that’s been uploaded on this event in order to find additional information.”

Another member of the MultiVENT team, Kate Sanders, a PhD student at the Center for Language and Speech Processing, points out that professional news organizations provide coverage of only a small percentage of events. Ideally, MultiVENT can be a model for providing easier access to firsthand evidence of incidents that otherwise might not be reported.

“Local organizations and community activists could become better informed and aware of events happening in their communities that are captured on social media,” she says. “And that could help them produce higher quality news reports with fewer resources.”

Another distinctive characteristic of MultiVENT is that it is multilingual, making it possible to retrieve videos narrated in any of five languages: Arabic, Chinese, English, Korean, and Russian. That, Van Durme notes, could significantly reduce the cultural bias that can occur when, for instance, a street video shot in South Korea is described by an English-speaking newscaster.

“This could be a case of a Korean person speaking into their cellphone about an event in a video as compared to something that’s filtered through people in a different culture,” he says. “It means being able to allow citizens of the world to access information in as raw a form as possible, and have artificial intelligence assist them in getting unfiltered views of what’s really happening in the world.”

Making raw footage more accessible and contextual is a step forward in elevating its value as a storytelling asset, Sanders says.

“Most research in this field focuses on the more nicely curated videos because we can retrieve more information from them,” she says. “That may be true, but a really interesting takeaway from the Notre Dame story is that it’s often not the first content to come out from these events. It’s useful to develop systems that can work with less curated videos.”

Sanders notes that MultiVENT will be a core component of a Human Language Technology Center of Excellence workshop at Johns Hopkins this summer. Research scientists, faculty members, and students will explore how to refine the tool’s video retrieval capabilities; for instance, words or sounds in a clip’s audio track could be used to help match it to a search query. The focus will be on using artificial intelligence models to improve access to the different types of content within the dataset, which aligns with the team’s priority of expanding how videos in MultiVENT can be searched rather than adding more clips to the tool.

One possible refinement would be to make individual frames of a video searchable. “It would not just be, Could you find a video of a protest?” Van Durme explains. “It would be, Could you find the particular frames of the video where there’s someone filling the role of law enforcement, as well as someone filling the role of protester?”
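
To make the idea of frame-level search concrete, the same kind of embedding comparison could, in principle, be applied to each sampled frame individually, returning rough timestamps whose visual content matches a query. This is a hedged sketch under the same assumptions as above (an off-the-shelf CLIP-style model and hypothetical frame files), not a description of how the workshop will actually implement it.

```python
# Illustrative sketch: instead of ranking whole videos, score every sampled
# frame against the query and report which frames (and rough timestamps) match.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

def matching_frames(query: str, frame_paths: list[str],
                    fps: float = 1.0, threshold: float = 0.25):
    """Return (frame index, approx. seconds, similarity) for frames that match the query."""
    frames = [Image.open(p) for p in frame_paths]   # frames sampled at `fps` beforehand
    sims = util.cos_sim(model.encode(query), model.encode(frames))[0]
    return [(i, i / fps, float(s)) for i, s in enumerate(sims) if float(s) >= threshold]

# Hypothetical example: locate frames showing both police and protesters
hits = matching_frames("police officers facing protesters",
                       [f"frames/protest_{i:03d}.jpg" for i in range(120)])
for idx, seconds, sim in hits:
    print(f"frame {idx} (~{seconds:.0f}s): similarity {sim:.2f}")
```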