Visualizing 260M tweets with Google Dataflow, BigQuery and Tableau

Thursday, May 07, 2015

We want to analyse* 260 million Tweets spanning almost 200 days and find 'trending topics' by day and by location. We built a fast solution that filters out the noise and displays the relevant conversations in a dynamic word cloud and on a geographic map. We use the Google Cloud Platform for the parallel data processing and querying, and Tableau Software for the interactive visualisation. With Google Cloud Dataflow we have a powerful and easy-to-use new backend to populate BigQuery (which already proved to work very well with Tableau, see our previous blogpost). Dataflow handled this workload at a rate of more than 10 million tweets per minute!


The Dataset

Our dataset consists of 260M English tweets, spanning 188 days and stored in multiple CSV files (32 GB in total). For each tweet we have stored:
  • tweet["id"]
  • tweet["text"]
  • tweet["in_reply_to_user_id"]
  • tweet["created_at"]
  • tweet["user"]["id"]
  • tweet["user"]["location"]
  • tweet["user"]["time_zone"]

Google Dataflow Backend

The goal of the visualization is to understand what people tweet about in this massive dataset. We want to find the different topics, but we also want to be able to drill down to the level of individual tweets.

To achieve this goal, we need to detect the different topics (single hashtags for now) in the tweets. However, visualizing all of these topics is unmanageable because there are so many of them, so we need to filter out the irrelevant ones. We chose to keep only topics that show bursty behaviour [1], i.e. topics that occur far more often during a limited period of time than during the rest of the period.
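
Kleinberg's model [1] fits a multi-state automaton to the arrival rate of each topic. As a rough illustration of the underlying idea (and not the exact filter used in our pipeline), the sketch below keeps a hashtag only when its busiest day is many times larger than its average day.

    import java.util.Map;

    // Simplified burstiness filter (illustration only): a hashtag is kept
    // when its busiest day is at least `threshold` times its average day.
    public final class BurstFilter {
        public static boolean isBursty(Map<String, Long> countsPerDay, double threshold) {
            if (countsPerDay.isEmpty()) return false;
            long peak = 0, total = 0;
            for (long c : countsPerDay.values()) {
                peak = Math.max(peak, c);
                total += c;
            }
            double average = (double) total / countsPerDay.size();
            return peak >= threshold * average;
        }
    }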

The outputs of the Dataflow processing step are two BigQuery tables:
- a table with the filtered set of topics and their distribution over time and location
- a table with the filtered set of tweets, linked to those topics
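
The original Java pipeline is not listed in this post (it was shared by email, see the comments at the bottom). As a hedged sketch of what such a job could look like, the skeleton below reads the CSV files, extracts hashtags per day, counts them and writes the counts to BigQuery. It is written against the current Apache Beam SDK rather than the 2015 Dataflow SDK; the bucket path, table name and schema are placeholders, and the burstiness filter and the second (tweet-level) table are omitted.

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import java.util.Arrays;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;

    public class TopicPipeline {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).withValidation().create());

        // Placeholder schema for the topic table: "<hashtag>|<day>" and its count.
        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("topic_day").setType("STRING"),
            new TableFieldSchema().setName("tweet_count").setType("INTEGER")));

        p.apply("ReadTweets", TextIO.read().from("gs://YOUR_BUCKET/tweets/*.csv"))
         // Emit one "hashtag|day" element per hashtag occurrence
         // (TweetRecord is the helper sketched in the dataset section).
         .apply("ExtractHashtagPerDay", ParDo.of(new DoFn<String, String>() {
           @ProcessElement
           public void processElement(ProcessContext c) {
             TweetRecord t = TweetRecord.fromCsvLine(c.element());
             String day = t.createdAt.substring(0, 10);  // assumes an ISO-like "YYYY-MM-DD..." timestamp
             for (String token : t.text.split("\\s+")) {
               if (token.startsWith("#")) {
                 c.output(token.toLowerCase() + "|" + day);
               }
             }
           }
         }))
         .apply("CountPerTopicDay", Count.<String>perElement())
         .apply("ToTableRow", ParDo.of(new DoFn<KV<String, Long>, TableRow>() {
           @ProcessElement
           public void processElement(ProcessContext c) {
             c.output(new TableRow()
                 .set("topic_day", c.element().getKey())
                 .set("tweet_count", c.element().getValue()));
           }
         }))
         // The burstiness filter and the tweet-level output table are omitted here.
         .apply("WriteTopicCounts", BigQueryIO.writeTableRows()
             .to("YOUR_PROJECT:twitter.topic_counts")
             .withSchema(schema)
             .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
             .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

        p.run();
      }
    }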

Execution

Once you submit your code for execution on Cloud Dataflow, the service returns a nice visual representation of your pipeline steps, along with the execution speed of each step.

Furthermore, you also get the job log (below), showing the autoscaling capability of Google Dataflow, another killer feature. Note that we restricted the number of workers to 6 here...
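
For reference, the worker cap is set through the pipeline options when the job is launched. The fragment below shows how this would look with the current Dataflow runner for Apache Beam; it is an assumed example rather than the original launch configuration, and would replace the options line in the pipeline skeleton above.

    // Fragment: cap Dataflow autoscaling at 6 workers via pipeline options.
    // Types come from org.apache.beam.runners.dataflow and its options package.
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setMaxNumWorkers(6);   // autoscaling never exceeds 6 workers
    Pipeline p = Pipeline.create(options);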

In total Dataflow took 25 minutes to handle the 260M tweets with 6 workers, or over 10 million tweets per minute.

The Tableau Dashboard

The dashboard (see on top) has a play button that lets it cycle over the days, and it allows you to retrieve the tweets of the selected topics at specific timestamps (and locations). If you are interested in a live demo, feel free to contact us.

*Twitter contains a lot of relevant information from many people about many different subjects, and this information is nicely timestamped and sometimes even georeferenced. Companies can leverage it to gain insights. However, no one can handle the massive scale and the abundant noise by hand, so some preprocessing is necessary to make the data usable. This demo is not focused on any particular business, but it shows how you can extract real information from this huge, messy dataset.


[1] J. Kleinberg. Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery, 7(4):373–397, 2003.

2 comments:

  1. Very interesting article! Can you share the Java code that goes with these transformation steps? (I think they're called pipelines in Dataflow lingo.)

  2. Hi, thanks for your interest!
    If you send me an email (matthias [at] datatonic dot com), I will send you the Java code...
