Categories
Seminar

Behind the Scenes of Really Big Data: What It Takes to Compute on the Whole World

Presenter:  Kalev H. Leetaru

Date and Time:  Friday, May 9, 2014 – 3pm

Location:  Research 381

Kalev H. Leetaru is the 2013-2014 Yahoo! Fellow in Residence for International Values, Communications Technology and the Global Internet at the Institute for the Study of Diplomacy in the Edmund A. Walsh School of Foreign Service at Georgetown University. He holds three US patents (cited by a combined 44 other issued US patents), and his work has been profiled in Nature, the New York Times, The Economist, the BBC, the Discovery Channel, and the media of more than 100 countries. His most recent work includes the first in-depth study of the geography of social media and the changing role of distance and location in online communicative behavior around the world (named by Harvard's Nieman Lab as the top social media study of 2013); the creation of the GDELT Project, a database of more than a quarter-billion georeferenced global events from 1979 to the present, along with the people, organizations, locations, and themes connecting the world; and the creation of the SyFy Channel's Twitter Popularity Index, the first realtime character "leaderboard" created for television. Most recently, he was named one of Foreign Policy Magazine's Top 100 Global Thinkers of 2013. More on his latest projects can be found on his website at http://www.kalevleetaru.

Abstract:  What does it take to build a system that monitors the entire world, analyzing global news media in real time, compiling catalogs of everything happening in the world, and making that data accessible for analysis, visualization, forecasting, and operational use? What does it take to support querying of a quarter-billion-record-by-58-column database in near-real time? How do you visualize networks with hundreds of millions of nodes, tease structure from chaotic real-world observational graphs, or explore networks in the multi-petabyte range? How do you process and geographically visualize the emotion of the live Twitter Decahose in real time? How do you rethink tone mining from scratch to power a flagship new reality television show? How do you adapt systems to work with machine translation, OCR and closed-captioning errors, and the messiness of real-world data? How do you process half a million hours of television news, five billion pages of historic books, or 60 million images dating back 500 years?

This talk will pull back the curtain and present a behind-the-scenes view of what it's really like to work with really big data. How does one blend the world's most powerful supercomputers, virtual machines, cloud storage, infrastructure as a service, plus a ton of software, into a single end-to-end environment that supports all of this research? I'll be deep-diving on the GDELT Project (http://gdeltproject.org/), a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what's happening around the world, what its context is and who's involved, and how the world is feeling about it, every single day. What does it take to build and run a system that monitors the entire world each day and delivers a quantitative model that increasingly powers operational conflict watchboards across the world?

This is the last in the spring 2014 CSC Seminar Series.