First Apache Beam meetup in London

Last week, Datatonic organised the first Apache Beam meetup in London, working together with Qubit and Google to educate and spark discussions in the area of data engineering and more specifically: Apache Beam.

Beam history and resources

Apache Beam started as Dataflow within Google and builds upon Google’s experience of building distributed data processing pipelines, with projects like FlumeJava and MillWheel.

Google started developing Dataflow / Beam to make previously written pipelines more portable, so that moving to new engines or environments wouldn’t necessarily involve rewriting well-written and tested business logic. In 2015, Google released the Dataflow Model paper, and in 2016 the project was submitted to the Apache Software Foundation, resulting in the Apache Beam project.

One of the main characteristics that distinguishes Beam from other data processing frameworks is that the processing logic (written using the Beam SDK) is independent of the runner (the system that actually executes the processing), which makes the code portable. For those interested, an overview of the different runners and their capabilities can be found here: Beam Capability Matrix. This portability makes it possible to run Beam code on your existing on-prem Spark cluster if your company hasn’t signed off on cloud technologies, and it also future-proofs your code in case you want to move to a managed cloud service later (e.g. Google Cloud Dataflow, which is the name of Google’s managed Beam runner).

Beam is currently being used at scale by Spotify, Qubit, and others, and we are sure more will follow.

If you want to get started using Apache Beam, have a look at some of the relevant resources at the bottom of the blogpost!


Since we at Datatonic are quite excited about this new technology, and see interest in Beam among our clients as well, we wanted to build a community to share experience and learn from experts and each other, so we set up a meetup. For the first event we invited three experts to talk about Beam-related topics!

Talk 1: Apache Beam primer

Victor Kotai [LinkedIn, Twitter], software engineer at Qubit, gave an introduction to Beam. He went through some of the Beam concepts, like the different transforms and how Beam handles time. The session ended with some code examples and how they translate into a pipeline. This allowed us to get everyone on a similar level before digging deeper. (Although the attending audience was pretty advanced already!)

Talk 2: Apache Beam on Google Cloud: use-case at Qubit

As a second talk, we had a more use-case focused session, in which Jibran Saithi [LinkedIn, Twitter], tech lead at Qubit, laid out how their data infrastructure evolved over time and how they ended up with Apache Beam running on Google’s managed service. They use Beam in both streaming and batch mode at scale, in harmony with other GCP tools like Pub/Sub and BigQuery. We were shown their current architecture, why they are still using Beam, and some of the features they would like to see in the future, which made for an interesting discussion.

Talk 3: State & timers in Apache Beam

To wrap up, we were fortunate enough to have Reuven Lax [LinkedIn, Twitter] give a talk. Reuven is one of the co-authors of both the MillWheel and Dataflow papers, which makes him one of the most knowledgeable people in the Beam world. Since he was partly responsible for the inception of Beam and is currently team lead on the Dataflow team, he was able to talk us through the concepts in a clear and concise way. He helped us understand some of the more complicated aspects of Beam, like state & timers, and answered some of the more advanced questions people in the room had.

Join or get involved!

Overall, we think it was a great first meetup, but we are just getting started. We would love to invite you to our next sessions, which you should be able to find here soon! We will be posting the slides of the first session on the page as well, so stay tuned.
If you want to get involved or have something interesting to share, don’t hesitate to get in touch!

Getting started resources

Some interesting resources to get you started on Beam:

DI Summit, Hands on TensorFlow Course by Datatonic.

Machine learning is all around us, and the recent advances in deep learning deliver breakthrough state of the art performance on a wide range of tasks, from computer vision to time series prediction and even natural language processing...

TensorFlow is Google's deep learning framework that will help you conquer the most challenging problems. So there really is no reason to fall behind here! This introductory course on TensorFlow, including hands-on IPython notebook exercises, will get you started in the field of deep learning in no time! In these notebooks, we will tackle the classic MNIST image data set and get familiar with concepts ranging from a simple logistic regression to deep neural nets and convolutional architectures.

Here you can find a link to the presented slide deck:
Presentation Deep Learning

The solutions to the notebook can be found here:
iPython Notebook Solutions

Join us at Ultimate Machine Learning with Google Cloud

On February 21st we are hosting a Machine Learning event together with Google Cloud.

Our team of experts will show how your company can use machine learning on Google Cloud to generate greater insights from your data, drive business, and improve customer experience. We'll cover some common myths surrounding machine learning, as well as the essential ingredients you'll need to get started with a machine learning project, and why Google Cloud is the perfect place to get started or scale your ML efforts.

We'll also discuss some real-world examples where we have helped transform companies using the power of machine learning. We'll dig deeper into the architectural and technical best practices and the full model development lifecycle for delivering a successful ML project. There will be plenty of time for questions and to discuss in detail the possibilities for your business.

The event will be hosted in London, at Google HQ, from 9.00AM to 1.00PM on Tuesday 21st February.

So why not join us? To attend you need to register at the link below:

We hope to see you there!

Real-time streaming predictions using Google Cloud Dataflow and Google Cloud Machine Learning

Google Cloud Dataflow is probably already embedded somewhere in your daily life, and enables companies to process huge amounts of data in real-time. But imagine that you could combine this - in real-time as well - with the prediction power of neural networks. This is exactly what we will talk about in our latest blogpost!

It all started with some fiddling around with Apache Beam, an incubating Apache project that provides a programming model for both batch and stream processing jobs. We wanted to test its streaming capabilities by running a pipeline on Google Cloud Dataflow, a Google-managed service for running such pipelines. Of course, you need a source of data, and in this case, a streaming one. Enter Twitter, which provides a nice, continuous stream of data in the form of tweets.

Fetching the tweets 

To easily get the tweets into Beam, we hooked the stream into Google Cloud PubSub, a highly scalable messaging queue. We set some filters on the Twitter stream (obviously we included @teamdatatonic and #datatonic) and then hooked the PubSub stream into Dataflow.

Setting up the Beam pipeline

Dataflow has a connector that easily reads data from PubSub. The incoming data points are raw, stringified JSON objects, and they first need some parsing and filtering. This is done in the "Parse tweets" part of the pipeline (see below for an overview of the pipeline). We extract the text from the tweets, but in order to have a valid input for the machine learning model, we have to vectorise them as well. This is done in the "Construct vectors from text" step.
As you might notice in the Dataflow graph below, a side input is used in this step as well. Since the inputs into the ML model we used are numbers rather than strings, we assign each word an index and use a look-up table to convert words to indexes. This is done by loading the vocabulary into the memory of all the Dataflow workers (the full vocabulary was 4.5 MB), where the look-up table is implemented as a HashMap.
Google Cloud Dataflow pipeline for processing tweets, predicting sentiment and writing to BigQuery
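
The "Parse tweets" and "Construct vectors from text" steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the code of our pipeline: the `text` field name, the padding and unknown-word indexes, the fixed vector length and the toy vocabulary are all assumptions made for the example.

```python
import json
from typing import Dict, List, Optional

PAD_INDEX = 0  # hypothetical index used to pad short tweets
UNK_INDEX = 1  # hypothetical index for words missing from the vocabulary


def parse_tweet(raw: str) -> Optional[str]:
    """Return the tweet text, or None if the message is not a usable tweet."""
    try:
        tweet = json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(tweet, dict):
        return None
    text = tweet.get("text")  # assumed field name
    return text if isinstance(text, str) and text.strip() else None


def vectorise(text: str, vocab: Dict[str, int], length: int) -> List[int]:
    """Map words to vocabulary indexes, then truncate or pad to a fixed length."""
    indexes = [vocab.get(word, UNK_INDEX) for word in text.lower().split()]
    indexes = indexes[:length]
    return indexes + [PAD_INDEX] * (length - len(indexes))


# Toy vocabulary standing in for the look-up table held in worker memory.
vocab = {"datatonic": 2, "is": 3, "great": 4}

raw_message = '{"text": "Datatonic is really great"}'
text = parse_tweet(raw_message)
vector = vectorise(text, vocab, length=6)
print(vector)  # [2, 3, 1, 4, 0, 0]
```

In a Beam pipeline the dictionary would typically reach the vectorising step as a side input, so that every worker holds its own in-memory copy.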

Embedding Machine Learning into our pipeline

The next thing we needed was a simple, but cool model we could apply to our tweets. Luckily, there are some ready-to-use models on GitHub such as this one, which predicts sentiment using a convolutional neural network (CNN). We will not go into much detail here as we focused our work on the pipeline rather than getting the most accurate predictions. The only changes we made to the model are the ones outlined here, which enable Google Cloud Machine Learning to handle the inputs and outputs of the TensorFlow graph, and then we deployed it.

One of the capabilities of Cloud Machine Learning is online prediction: predictions are made through HTTP requests, where the body of each request contains the instances we wish to classify. We used a client library for Java to handle the communication between the Dataflow workers and the Cloud Machine Learning API, which requires authorization.

This is the most exciting part of this infrastructure: it is now possible to create a feed of data and make machine learning predictions on it in real-time, and have it as a fully managed and scalable service. In our pipeline we send the tweets, transformed into vectors, to the deployed model and obtain a sentiment prediction for each. All of this is done in the "Predict sentiment using CNN on Cloud Machine Learning" step.
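
To make the request format concrete, here is a small sketch of how a batch of tweet vectors could be serialised into the body of an online prediction request. We used a Java client library in the pipeline itself, so this Python helper (`build_prediction_request` is a hypothetical name) is only an illustration. It assumes a model with a single, unnamed input; a model with named inputs would need each instance wrapped in an object keyed by those names. The HTTP call and the authorization step are left out.

```python
import json
from typing import List


def build_prediction_request(vectors: List[List[int]]) -> str:
    """Serialise a batch of tweet vectors as an online prediction request body."""
    # Each element of "instances" is one input we want the model to classify.
    return json.dumps({"instances": vectors})


body = build_prediction_request([[2, 3, 1, 4, 0, 0], [3, 4, 0, 0, 0, 0]])
print(body)
```

Sending several instances per request reduces the number of HTTP round trips each worker has to make.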


We wrote each tweet with its sentiment to BigQuery, where it can be used for further analysis or visualisation. The overview of the full architecture can be found below. As you can see, we added Google Cloud Datastore and Google Cloud Bigtable, which can be used when the look-up file becomes too large to fit in memory, or to enhance the data in your pipeline in other ways. This would enable a range of applications (such as time series analysis) that need low-latency random access to large amounts of data.
Full architecture
Since this is a proof-of-concept and reference pipeline, the output is not exceptionally exciting. However, it opens up a whole new world of live processing and serving predictions, which can all be seamlessly embedded into your data pipeline. We are really excited about these new possibilities and look forward to applying them in our future projects!

Traffic in London episode II: Predicting congestion with TensorFlow

A few weeks ago Datatonic took part in a hackathon organised by Transport for London (TfL). At the hackathon we received data from road sensors in London, which allowed us to build a nice real-time traffic visualisation that you can find in this blog post here. You can also find more information about the hackathon in that post and some details on how we processed the raw sensor data with Dataflow (Apache Beam).

In this post we use the traffic data and build something quite different: A deep learning model that predicts congestion!

Visualising road traffic

The TfL data we use stems from sensors placed at traffic junctions all over London. Every quarter of a second, each sensor reports whether or not a car is above it (see Episode I). As in the real-time analysis, we use Dataflow to convert the raw bitstream from the sensors into two commonly used measures in traffic engineering: occupancy and flow. Occupancy simply tells us what fraction of the time a sensor detects a car. It can be further divided by the average car length (around 4.5 m) to obtain the traffic density at a sensor. Flow, on the other hand, is a measure of the number of cars passing over a detector in a given time. Combining these two measures allows us to calculate the average speed per car (as flow/density), which we will use later to determine congestion. If you want to learn more about traffic flow have a look here.
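
As a worked example, the sketch below computes these measures for one short window of sensor readings. It is not our Dataflow code: the function names are hypothetical, and counting a car as every 0 -> 1 transition in the bitstream is an assumption made for the illustration.

```python
from typing import Optional, Sequence

SAMPLE_PERIOD_S = 0.25  # one reading every quarter second (from the text)
AVG_CAR_LENGTH_M = 4.5  # average car length quoted in the text


def occupancy(bits: Sequence[int]) -> float:
    """Fraction of readings in which the sensor detected a car."""
    return sum(bits) / len(bits)


def count_cars(bits: Sequence[int]) -> int:
    """Count cars as 0 -> 1 transitions (assumed counting rule)."""
    cars, previous = 0, 0
    for bit in bits:
        if bit == 1 and previous == 0:
            cars += 1
        previous = bit
    return cars


def flow(bits: Sequence[int]) -> float:
    """Cars per second passing over the sensor."""
    return count_cars(bits) / (len(bits) * SAMPLE_PERIOD_S)


def density(bits: Sequence[int]) -> float:
    """Cars per metre of road, derived from occupancy."""
    return occupancy(bits) / AVG_CAR_LENGTH_M


def speed(bits: Sequence[int]) -> Optional[float]:
    """Average speed in metres per second, or None for an empty road."""
    d = density(bits)
    return flow(bits) / d if d > 0 else None


# Four seconds of readings: two cars, each covering the sensor for 0.5 s.
window = [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]
print(occupancy(window), flow(window), speed(window))
```

In this window each 4.5 m car covers the sensor for half a second, so the computed speed comes out at about 9 m/s, which matches the intuition of a car travelling its own length in that time.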

Before we attempt to model the congestion in London, we will first visualise some of the sensor data. This will give us an overview of the key measures and it will also help us to define and quantify congestion. Tableau is a great choice for this task since it offers convenient and great looking map views.

The figure above shows some of the traffic indicators plotted against each other. You can also track the time behaviour by looking at the colours. Overall, the figures agree nicely with the theoretical behaviour found here. We see for example that the speed decreases as the density increases (more cars on the road -> slower traffic).

Defining Congestion

To find out if a road is congested or not we need a single, robust quantity that describes the traffic state on that road. Finding such a measure is far from simple, as we can illustrate by looking at flow: flow is small for small values of density, but also for large values of density. This means zero flow can indicate either a free road with no traffic or total congestion. To resolve this ambiguity we settled, after some testing, on speed as our congestion measure. In more detail, we use speed/max(speed), since the absolute speed can vary widely between roads. Here, max(speed) is the maximum speed observed on a road, which will usually be achieved when there is very little traffic.

A relative speed value of 0 then means complete congestion and a value of 1 means the road is free. We plotted this congestion measure for a small group of sensors in the map view of the above figure (top left). Such a congestion measure based on relative speed has been previously used by traffic engineers. However, it is worth noting that there is no ideal measure of congestion and many different definitions can be found in the literature.
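
The congestion measure itself is a one-line normalisation. The sketch below applies it to the speeds of a single road; the function name and the sample values are made up for illustration.

```python
from typing import List, Sequence


def relative_speed(speeds: Sequence[float]) -> List[float]:
    """Scale one road's speeds by that road's maximum observed speed."""
    max_speed = max(speeds)
    if max_speed <= 0:
        raise ValueError("need at least one positive speed to normalise")
    return [s / max_speed for s in speeds]


# Speeds in m/s for one road over time (illustrative values only).
print(relative_speed([10.0, 5.0, 0.0, 20.0]))  # [0.5, 0.25, 0.0, 1.0]
```

Because every road is divided by its own maximum, a slow residential street and a fast arterial road end up on the same 0 to 1 scale.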

Predicting traffic with the LSTM network

Now that we have a good measure for congestion we can try and predict its value using historic data. For this we build a deep learning model which uses the past 40 minutes of traffic data (relative speed) as input and which will then predict the congestion measure 40 minutes from now.
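
Framing the problem this way means turning each sensor's time series into (input window, target) pairs. The sketch below shows one way to do that in plain Python. The 5-minute aggregation interval, and therefore the 8-step window and 8-step horizon, is an assumption for the example, as is the function name.

```python
from typing import List, Sequence, Tuple

STEP_MINUTES = 5                   # assumed aggregation interval
INPUT_STEPS = 40 // STEP_MINUTES   # past 40 minutes -> 8 values
HORIZON_STEPS = 40 // STEP_MINUTES # predict 40 minutes ahead -> 8 steps


def make_training_pairs(
    series: Sequence[float], input_steps: int, horizon_steps: int
) -> List[Tuple[List[float], float]]:
    """Pair each input window with the value horizon_steps after its last point."""
    pairs = []
    for start in range(len(series) - input_steps - horizon_steps + 1):
        window = list(series[start:start + input_steps])
        target = series[start + input_steps - 1 + horizon_steps]
        pairs.append((window, target))
    return pairs


# A dummy series where each value equals its own time index.
series = [float(i) for i in range(20)]
pairs = make_training_pairs(series, INPUT_STEPS, HORIZON_STEPS)
print(len(pairs), pairs[0])
```

With the dummy series, the first window covers time steps 0 to 7 and its target is the value at step 15, eight steps after the last input.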

For the model we selected an LSTM recurrent neural network (RNN), a type of network that is well suited to time series data. We won't go into detail here about how the neural network works and how it is set up, but you can find a great article explaining LSTM networks here. Our implementation of the model was done in TensorFlow, which has built-in functions for RNNs and the LSTM network. To simplify the problem we selected a small group of around 300 sensors out of the 12,000 sensors in London (the group can be seen in the map view above).

We trained the model on a small set of only 8 days of traffic, but the results are already promising. The model can forecast the traffic 40 minutes into the future with good accuracy and it predicts the morning and afternoon rush hours nicely. The image below shows the predicted vs. actual speed for a few different sensors.

You can see how the neural network captures the different daily behaviour of each sensor individually. However, looking closely at the above graph you will find that some of the details of the time series data are not well reproduced. We also found a few sensors that show a large overall prediction error. To increase the accuracy of the LSTM network, it could be trained on much more data than just 8 days. Including more sensors might also help training, and a congestion prediction system for the whole of Greater London would of course need all of them. Still, the model we have developed has a lot of potential and, after some more tuning, might help to do the following:

   +    predict future traffic state
   +    predict congested areas so they can be avoided or reduced
   +    detect incidents quickly

Final thoughts

The road sensor data we received from TfL is fascinating for two reasons. First, it is highly relevant, as most of us have to deal with some sort of road traffic in our everyday lives. Second, it can be analysed in many different ways. In these two blog posts we leveraged the streaming capabilities of Dataflow (Apache Beam) to power a live visualisation, and we forecasted congestion. Both are great use cases of the data and we hope that we are going to see more such datasets from TfL in the future.

Thanks to the great team at TfL for organising the demo and providing the data.