Real-time streaming predictions using Google Cloud Dataflow and Google Cloud Machine Learning

Google Cloud Dataflow is probably already embedded somewhere in your daily life, enabling companies to process huge amounts of data in real time. But imagine that you could combine this, still in real time, with the prediction power of neural networks. This is exactly what we will talk about in this blog post!

It all started with some fiddling around with Apache Beam, an incubating Apache project that provides a programming model for both batch and stream processing jobs. We wanted to test the streaming capabilities by running a pipeline on Google Cloud Dataflow, a Google-managed service for running such pipelines. Of course, you need a source of data, and in this case a streaming one. Enter Twitter, which provides a nice, continuous stream of data in the form of tweets.

Fetching the tweets 

To easily get the tweets into Beam, we hooked the stream into Google Cloud PubSub, a highly scalable messaging queue. We set some filters on the Twitter stream (obviously we included @teamdatatonic and #datatonic) and then hooked the PubSub stream into Dataflow.
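
For readers who want to reproduce this step, a minimal sketch of such a Twitter-to-PubSub bridge could look like the following (using the tweepy 3.x streaming API and the google-cloud-pubsub client; the project, topic and credential names are placeholders, not the ones we used):

```python
# Sketch: forward filtered tweets from the Twitter streaming API into Cloud Pub/Sub.
import json

import tweepy
from google.cloud import pubsub_v1

PROJECT_ID = "my-gcp-project"   # placeholder
TOPIC_NAME = "tweets"           # placeholder

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_NAME)


class PubSubListener(tweepy.StreamListener):
    def on_status(self, status):
        # Publish the raw tweet JSON; Pub/Sub payloads are bytes.
        publisher.publish(topic_path, data=json.dumps(status._json).encode("utf-8"))

    def on_error(self, status_code):
        # Stop on rate limiting instead of reconnecting aggressively.
        return status_code != 420


auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
tweepy.Stream(auth, PubSubListener()).filter(track=["@teamdatatonic", "#datatonic"])
```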

Setting up the Beam pipeline

Dataflow has a connector that easily reads data from PubSub. The incoming data points are raw JSON strings and first need some parsing and filtering. This is done in the "Parse tweets" part of the pipeline (see below for an overview of the pipeline). We extract the text from the tweets, but in order to have a valid input for the machine learning model, we have to vectorise it as well. This is done in the "Construct vectors from text" step.
As you might notice in the Dataflow graph below, a side input is used in this step as well. Since the inputs into the ML model we used are numbers rather than strings, we assign different words to different indexes and use a look-up table to convert them. This is done by loading the full vocabulary (4.5 MB) into the memory of all the Dataflow workers, where the look-up table is implemented as a HashMap.
Google Cloud Dataflow pipeline for processing tweets, predicting sentiment and writing to BigQuery
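
Our pipeline was written with the Beam Java SDK, but to make the two steps above more concrete, here is a rough Python-SDK sketch of "Parse tweets" and "Construct vectors from text" with the vocabulary passed as a side input. The topic, bucket, fixed vector length and unknown-word index are illustrative assumptions:

```python
# Sketch (Apache Beam Python SDK) of the parse and vectorise steps.
import json

import apache_beam as beam

MAX_LEN = 56        # assumed fixed input length of the sentiment model
UNKNOWN_ID = 0      # assumed index for out-of-vocabulary words


def parse_tweet(message):
    """Extract the text field from the raw JSON string coming out of PubSub."""
    return json.loads(message).get("text", "")


def vectorise(text, vocab):
    """Map each word to its vocabulary index and pad to a fixed length."""
    ids = [vocab.get(word, UNKNOWN_ID) for word in text.lower().split()]
    ids = ids[:MAX_LEN]
    ids += [0] * (MAX_LEN - len(ids))
    return {"text": text, "vector": ids}


def to_kv(line):
    word, index = line.split("\t")
    return word, int(index)


with beam.Pipeline() as p:
    vocab = (p
             | "Read vocabulary" >> beam.io.ReadFromText("gs://my-bucket/vocab.tsv")
             | "To (word, index)" >> beam.Map(to_kv))

    (p
     | "Read from PubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/tweets")
     | "Parse tweets" >> beam.Map(parse_tweet)
     | "Construct vectors from text" >> beam.Map(vectorise, vocab=beam.pvalue.AsDict(vocab)))
```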

Embedding Machine Learning into our pipeline

The next thing we needed was a simple but cool model we could apply to our tweets. Luckily, there are some ready-to-use models on GitHub, such as this one, which predicts sentiment using a convolutional neural network (CNN). We will not go into much detail here, as we focused our work on the pipeline rather than on getting the most accurate predictions. The only changes we made to the model are the ones outlined here, which enable Google Cloud Machine Learning to handle the inputs and outputs of the TensorFlow graph; we then deployed it.

One of the capabilities of Cloud Machine Learning is online prediction: predictions are made via HTTP requests whose body contains the instances we wish to classify. We used a client library for Java to handle the communication between the Dataflow workers and the Cloud Machine Learning API, which requires authorization.

This is the most exciting part of this infrastructure: it is now possible to create a feed of data, make machine learning predictions on it in real time, and have it all as a fully managed and scalable service. In our pipeline we send the tweets, transformed into vectors, to the deployed model and obtain a prediction of sentiment. All of this is done in the "Predict sentiment using CNN on Cloud Machine Learning" step.
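
We used the Java client library mentioned above, but the same call is easy to sketch in Python with the Google API client; the project and model names below are placeholders:

```python
# Sketch: online prediction against a model deployed on Cloud Machine Learning.
from googleapiclient import discovery

MODEL_NAME = "projects/my-gcp-project/models/tweet_sentiment_cnn"   # placeholder

service = discovery.build("ml", "v1")


def predict_sentiment(instances):
    """Send a batch of vectorised tweets to the deployed model and return its predictions."""
    response = service.projects().predict(
        name=MODEL_NAME,
        body={"instances": instances},
    ).execute()
    if "error" in response:
        raise RuntimeError(response["error"])
    return response["predictions"]


# Example call with one (dummy) padded word-index vector.
print(predict_sentiment([[12, 485, 3, 0, 0, 0]]))
```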

Conclusion

We wrote each tweet with its sentiment to BigQuery, where it can be used for further analysis or visualisation. An overview of the full architecture can be found below. As you can see, we added Google Cloud Datastore and Google Cloud Bigtable, which can be used when the look-up file becomes too large to fit in memory, or to enhance the data in your pipeline in other ways. This enables a range of applications (such as time series analysis) where we need low-latency random access to large amounts of data.
Full architecture
Since this is more of a proof-of-concept and reference pipeline, the output is not exceptionally exciting. However, it opens up a whole new world of live processing and serving predictions, all of which can be seamlessly embedded into your data pipeline. We are really excited about these new possibilities and are looking forward to applying them in our future projects!

Traffic in London episode II: Predicting congestion with Tensorflow

A few weeks ago Datatonic took part in a hackathon organised by Transport for London (TfL). At the hackathon we received data from road sensors in London, which allowed us to build a nice real-time traffic visualisation that you can find in this blog post. You can also find more information about the hackathon in that post, along with some details on how we processed the raw sensor data with Dataflow (Apache Beam).

In this post we use the traffic data and build something quite different: A deep learning model that predicts congestion!

Visualising road traffic

The TfL data we use stems from sensors placed at traffic junctions all over London. These sensors detect every quarter of a second whether a car is above them or not (see Episode I). As in the real-time analysis, we use Dataflow to convert the raw bitstream from the sensors into two measures commonly used in traffic engineering: occupancy and flow. Occupancy simply tells us how often a sensor detects a car. It can be further divided by the average car length (around 4.5 m) to obtain the traffic density at a sensor. Flow, on the other hand, is a measure of the number of cars passing over a detector in a given time. Combining these two measures allows us to calculate the average speed per car (as flow/density), which we will use later to determine congestion. If you want to learn more about traffic flow, have a look here.
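
To make these definitions concrete, here is a small sketch (not our actual Dataflow code) of how occupancy, flow, density and speed could be derived from one sensor's quarter-second bitstream over an aggregation window; counting vehicles as 0-to-1 transitions is a simplifying assumption:

```python
AVG_CAR_LENGTH_M = 4.5   # average car length used to turn occupancy into density


def traffic_measures(bits, window_seconds=300):
    """`bits` is a string such as "0011100100..." with one reading every 0.25 s."""
    occupancy = bits.count("1") / float(len(bits))              # fraction of time occupied
    # Assumption: each 0 -> 1 transition in the bitstream is a new vehicle.
    vehicles = sum(1 for prev, cur in zip("0" + bits, bits) if prev == "0" and cur == "1")
    flow = vehicles / (window_seconds / 3600.0)                 # vehicles per hour
    density = occupancy / (AVG_CAR_LENGTH_M / 1000.0)           # vehicles per km (approximate)
    speed = flow / density if density > 0 else None             # km per hour, as flow / density
    return occupancy, flow, density, speed
```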

Before we attempt to model the congestion in London, we will first visualise some of the sensor data. This will give us an overview of the key measures and will also help us to define and quantify congestion. Tableau is a great choice for this task since it offers convenient and great-looking map views.




The figure above shows some of the traffic indicators plotted against each other. You can also track the behaviour over time by looking at the colours. Overall, the figures agree nicely with the theoretical behaviour described here. We see, for example, that the speed decreases as the density increases (more cars on the road -> slower traffic).

Defining Congestion

To find out whether a road is congested or not, we need a single, robust quantity that describes the traffic state on that road. Finding such a measure is far from simple, as we can illustrate by looking at flow: flow is small for small values of density, but also for large values of density. This means zero flow can mean either a free road with no traffic or total congestion. To resolve this ambiguity we settled, after some testing, on speed as our congestion measure. More precisely, we use speed/max(speed), since the absolute speed can vary greatly between roads. Here, max(speed) is the maximum speed of a road, which is usually reached when there is very little traffic.

A relative speed value of 0 then means complete congestion and a value of 1 means the road is free. We plotted this congestion measure for a small group of sensors in the map view of the figure above (top left). Such a congestion measure based on relative speed has been used by traffic engineers before. However, it is worth noting that there is no ideal measure of congestion and many different definitions can be found in the literature.
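
As a small illustration (not our production code), the per-sensor normalisation can be expressed in a few lines of pandas:

```python
# Sketch: turning absolute speed into the relative congestion measure speed / max(speed).
import pandas as pd

df = pd.DataFrame({
    "sensor_id": ["A", "A", "B", "B"],        # dummy sensor IDs
    "speed_kmh": [12.0, 48.0, 30.0, 60.0],    # dummy speeds derived from flow / density
})

# 0 = completely congested, 1 = free-flowing road.
df["congestion"] = df["speed_kmh"] / df.groupby("sensor_id")["speed_kmh"].transform("max")
```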

Predicting traffic with the LSTM network

Now that we have a good measure for congestion, we can try to predict its value using historic data. For this we build a deep learning model that takes the past 40 minutes of traffic data (relative speed) as input and predicts the congestion measure 40 minutes into the future.

For the model we selected an LSTM recurrent neural network (RNN), which performs exceptionally well on time series data. We won't go into detail here about how the network works and how it is set up, but you can find a great article explaining LSTM networks here. We implemented the model in TensorFlow, which has built-in functions for RNNs and LSTM cells. To simplify the problem, we selected a small group of around 300 sensors out of the 12,000 sensors in London (the group can be seen in the map view above).
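
We won't reproduce our training code here, but a minimal Keras-style sketch of such a model could look as follows; the 8-step history (8 x 5-minute windows = 40 minutes), layer sizes and training settings are illustrative assumptions:

```python
import tensorflow as tf

N_SENSORS = 300   # size of the selected sensor group
N_STEPS = 8       # 8 x 5-minute windows = 40 minutes of history

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(128, input_shape=(N_STEPS, N_SENSORS)),
    tf.keras.layers.Dense(N_SENSORS, activation="sigmoid"),   # relative speed lies in [0, 1]
])
model.compile(optimizer="adam", loss="mse")

# x_train shape: (num_examples, N_STEPS, N_SENSORS); y_train shape: (num_examples, N_SENSORS)
# model.fit(x_train, y_train, epochs=20, batch_size=64)
```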





We trained the model on a small set of only 8 days of traffic data, but the results are already promising. The model can forecast the traffic 40 minutes into the future with good accuracy, and it predicts the morning and afternoon rush hours nicely. The image below shows the predicted vs. actual speed for a few different sensors.







You can see how the neural network captures the different daily behaviour of each sensor individually. However, looking closely at the graph above you will find that some of the details of the time series are not well reproduced. We also found a few sensors that show a large overall prediction error. To increase the accuracy of the LSTM network, it could be trained on much more data than just 8 days. Including more sensors might also help training, and of course all the sensors would be needed for a congestion prediction system covering the whole of Greater London. Still, the model we have developed has a lot of potential and, after some more tuning, might help to do the following:

   +    predict future traffic state
   +    predict congested areas so they can be avoided or reduced
   +    detect incidents quickly

Final thoughts

The road sensor data we received from TfL is fascinating for two reasons: first, it is highly relevant, as most of us have to deal with some sort of road traffic in our everyday lives. Second, it can be analysed in many different ways. In these two blog posts we leveraged the streaming capabilities of Dataflow (Apache Beam) to power a live visualisation, and we forecasted congestion. Both are great use cases for the data, and we hope to see more such datasets from TfL in the future.


Thanks to the great team at TfL for organising the demo and providing the data. 








Traffic in London episode I: processing 100 billion IoT events




A couple of weeks ago we - the Datatonic UK team - participated in a hackathon organised by Transport for London. At the kick-off session, we received background information on the challenges Transport for London (TfL), and by extension London, is facing: firstly, congestion on the wider transport network (managing 30 million journeys daily); secondly, air pollution throughout London; and thirdly, more specifically, congestion on the roads.

The hackathon

After the general introduction, the datasets were presented, and one of them immediately grabbed our attention: the data originating from the Urban Traffic Control system (UTC data), spanning 3 months and registering car activity all over London using more than 14,000 sensors. The whole dataset amounted to over 120 billion lines, and we were curious about the challenge of processing this data and putting it to good use.

The goals

After some discussion and brainstorming, we came up with two goals. We would direct our efforts at:

  1. Visualising this vast amount of data in a cool and interactive way;
  2. Analysing the data in a more aggregated way and trying to predict the future state of the traffic - with the ultimate goal of predicting congestion. This included trying to become traffic experts in a very short amount of time... (if you want to become one yourself, have a look here and here)


Transport for London Hackweek: kick-off session

The data

Let us give you a quick introduction to the data. London is divided into 5 zones (NORTH, SOUTH, CENTER, OUTER and EAST). All the files were stored as zipped CSVs on Google Cloud Storage (GCS), about 5 TB in total, with every file containing 5 minutes of data for one zone. Every line in a file contained a timestamp (measurements are taken every quarter of a second), a number of sensors (up to 8), each specified by a sensor ID, a bitstring, and some less important information.
The part we are interested in is the bitstring. It represents the presence or absence of a car; for example, for timestamp 26-10-2016 16:38:02 and bitstring "0010", a car was present on top of the sensor at 16:38:02.500 and no car was present for the timestamps 16:38:02.000, 16:38:02.250 and 16:38:02.750.
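
As a small illustration (the real files contain more fields and several sensors per line), expanding one simplified line into quarter-second readings could look like this:

```python
from datetime import datetime, timedelta


def expand_bitstring(timestamp_str, sensor_id, bits):
    """Yield (sensor_id, timestamp, car_present) for each quarter-second bit."""
    base = datetime.strptime(timestamp_str, "%d-%m-%Y %H:%M:%S")
    for i, bit in enumerate(bits):
        yield sensor_id, base + timedelta(milliseconds=250 * i), bit == "1"


for record in expand_bitstring("26-10-2016 16:38:02", "N05/123a", "0010"):
    print(record)   # only the 16:38:02.500000 reading has car_present == True
```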

The architecture

When we hear "5TB of data" and "processing, analysing and transforming data" in one sentence, we immediately think Apache Beam (running as "Google Dataflow"). So we got to the drawing board and starting drafting an architecture for all this. You can see the result below:

Processing 120 billion rows using Google Dataflow

The pipeline includes a look-up against a file containing the coordinates of each sensor, some windowing (to get an aggregated view over 5 minutes) and some grouping and combining by sensor ID. We won't go into too much detail here, because that might lead us too far (and you guys might stop reading).
The take-away is that Dataflow offers a rather easy way to write processing logic and can scale with the click of a button: we ran the pipeline on 150 machines in order to get the resulting data written to BigQuery overnight, so we could keep up with the high pace of the hackweek.
We should mention, though, that the pipeline we constructed here is just a proof of concept. Optimisation is certainly possible (and recommended) and could improve the efficiency of the pipeline by several factors, which would in turn reduce the computation time needed and/or the number of machines used. This puts the time you see in each block into perspective: the ReadFromStorage step, for example, did not last 1 day 2 hr 33 min in real life; that is merely the combined computation time of all the machines running the code and processing the data. If you were to time this step while it is running, it would most likely last about 10 minutes!
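
To give an idea of the shape of such a pipeline (our actual job was more involved), here is a rough Beam Python-SDK sketch; parse_utc_line and combine_readings stand in for the parsing and aggregation logic, and the bucket and table names are placeholders:

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

with beam.Pipeline() as p:
    (p
     | "ReadFromStorage" >> beam.io.ReadFromText("gs://my-bucket/utc/*.csv.gz")
     | "Parse lines" >> beam.FlatMap(parse_utc_line)           # placeholder: -> (sensor_id, reading)
     | "Add timestamps" >> beam.Map(
           lambda kv: TimestampedValue(kv, kv[1]["unix_time"]))
     | "Window into 5 minutes" >> beam.WindowInto(FixedWindows(5 * 60))
     | "Aggregate per sensor" >> beam.CombinePerKey(combine_readings)   # placeholder: occupancy, flow, ...
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery("my-project:tfl.traffic_5min"))
```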

We didn't rest on our laurels while the Dataflow job was running, though. Instead, we went back to the drawing board and came up with the full architecture for our problem: reading the files from GCS (and in the future maybe from Google PubSub, when TfL provides the data as a streaming service), processing with Dataflow, writing to BigQuery for Tableau analysis and predictions, and writing to PubSub for live visualisation and predictions.

Overview of the full architecture

The results

For this blog post we left the prediction part out of scope (but it gives you a sneak peek into what our next blog post might be about ;)). For the live visualisation we used Processing, an open-source programming language aimed at programming in a visual context and largely based on Java.

Screenshot of the visualisation

Some more context on what you see might be interesting:
   + In the box on the left you can see:
         + The selected sensor.
         + The location of the sensor.
         + The date and timestamp of the window we're looking at.
         + Some calculated measures relevant for traffic engineers: flow, occupancy and average speed estimation.
         + The raw data: a box is white when a car is on top of the sensor, blue when no car is on top of the sensor.
         + The current timestamp.

   + The map itself shows:
         + All the sensors in their exact locations.
         + The sensor lights up (white) when a burst of 1's (consecutive white boxes in the grid on the left) starts.

The takeaways

Apart from looking very cool, this tool might be useful to people monitoring traffic or learning about traffic engineering. Moreover, it's a useful first step into analysing this type of sensor data. Technically, we saw that Google Dataflow is capable of scaling up easily and quickly, and was able to meet our large processing requirements. A nice add-on is that the Apache Beam SDK is future proof: this code can be re-used, with very few changes, to process data from a real-time live stream instead of batch file input (one of the key benefits of Apache Beam).

To be continued... read our next blog post about how we use this processed data to do congestion predictions one hour ahead.

Thanks to the great team at TfL for organising the demo and providing the data. Watch this space for an interactive demo for you to play with!

GDG Cloud meetup: Cloud Machine Learning and TensorFlow

We were very happy with the great turnout at the latest GDG Cloud meetup, which we sponsored. It was about Cloud Machine Learning and TensorFlow. What's even better is that between the meetup and today, Cloud ML went into public beta, so you can start playing with it yourself!


For the people who missed the presentations, and for those who would like to see them again, you can find the slides below:

Google Cloud Release notes

https://docs.google.com/presentation/d/1dOe4api48mytgO-km0tooOThAJA65jP66FrhK-D9aoQ

Pre-trained models: Speech, Vision and NLP API

https://docs.google.com/presentation/d/1dOe4api48mytgO-km0tooOThAJA65jP66FrhK-D9aoQ/

Custom models: how to build and deploy your own models using TensorFlow


For the people interested in learning more about TensorFlow, you can join the new TensorFlow meetup group!

Make bots great again

An illustrated longread on the perils & pitfalls of building a natural language AI.


Sometime in 2016, bots became the next big thing. Spurred by positive press coverage and hyped demand, many digital companies jumped on the bandwagon and started providing mediocre solutions for a very complex problem, one that’s still unsolved after more than 60 years of research in the field.

Forget what you read in the Sunday paper: machines can’t use language efficiently, form abstractions without being coaxed to do so, or represent machine-born concepts in human terms. The artificial intelligence Hollywood has you dreaming of just isn’t here yet.

At Datatonic, the company I work for, we’re no experts in bots. We’re a machine learning company, and since we mainly work with very large customers we’re basically immune to the media’s hype machine. We also tend to deliver technology that works, and bots, in their current incarnation, did not fit the bill.

Or so we thought, until Google called us with a challenging offer. A very large IT provider for an even larger European telecom company had been struggling for quite some time with creating believable bots that would take over their customer service. After almost one year, the project had stalled. The IT provider had turned to Google, which in turn decided to ping our team. We had our reservations, but all in all it was an offer we simply couldn’t refuse.

Our NDA unfortunately forbids us from sharing actual code, but we thought we’d give something back to the community and to the general public at large by charting our thought process, technical approach and final results.

Part 1: Why bots don’t work


A complete explanation of each and every issue involved in natural language understanding would be beyond the scope of this article. But we’d like to pinpoint how and why the bot-hype machine lied to you.
First of all: bots (and AIs in general) have no semantic understanding of what a word actually means. Humans are very good at inferring the meaning of a word from its context: if I told you I’d spent my weekend hunting axolotls in Extremadura, you’d be able to deduce that axolotls are animals and Extremadura a location, probably a Hispanic one, even though you can’t really put your finger on why. And even if you’ve never seen an axolotl in your life, you’d immediately understand what it looks like if I told you it’s a ‘transparent lizard that lives underwater’. The semantic representation in your head allows you to do any operation of the sort: Axolotl = lizard + water minus color, even though there’s no obvious way to sum or subtract words.

Second major pain point: a bot has no general context understanding. If I told you ‘someone drew a gun’, your reaction would be vastly different depending on whether we were talking about artists or bank robbers.

Last main issue to consider: a bot has no memory. You can trick it into holding a rudimentary ‘database’ of things that have already been handled in the conversation; but even at the sentence level every bot suffers from a bad case of memory loss. Just think about the sentence ‘I’d like a pizza’, and how it changes if you prepend ‘I’m not sure if-’, ‘I’m totally positive-’ or ‘I don’t think-’. A bot that has no memory of what has been said previously in the sentence and in the conversation cannot possibly understand humans correctly.

Many of the existing off-the-shelf bot AI solutions have decided to sidestep those issues by naively focusing on a very narrow, pre-processed user-experience path. To configure your bot for prime time, you need to input all of the possible conversation branches, after which the AI tries to recognize what your users are actually asking. All of the user’s interactions have to be charted beforehand. It’s more of a choose-your-own-adventure book than a real chatbot.

In this widely used brute-force approach, only an (almost) exact sentence match will be recognized. So you have to define your entire conversation beforehand.

Bag-of-words is slightly better: the match probability is calculated for each single word and then aggregated, so you don’t need an exact sentence match.

Bots that focus on intent recognition will often simplify sentences by disregarding their grammar and structure entirely, throwing all of the sentence’s words into a single ‘bag’ and counting the number of occurrences. And while this bag-of-words approach has been very successful in the past for simple tasks such as spam filtering (an e-mail containing many occurrences of ‘Russian brides’ or ‘enlargement’ is probably spam), it fails to account for many nuances of human speech, such as word order and punctuation (keep in mind that a single comma separates the friendly "let’s eat, grandpa" from the cannibalistic "let’s eat grandpa").
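
To see how little a bag of words retains, here is a bare-bones sketch; after typical tokenisation the two grandpa sentences become exactly the same bag:

```python
import re
from collections import Counter


def bag_of_words(sentence):
    """Lower-case, strip punctuation and count word occurrences."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


print(bag_of_words("Let's eat, grandpa"))   # Counter({"let's": 1, 'eat': 1, 'grandpa': 1})
print(bag_of_words("Let's eat grandpa"))    # identical bag: the comma is simply lost
```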

Even worse, because of their catch-all structure, general purpose bots will only try to count words with a specific overarching meaning (‘entities’), and by doing so miss important clues about the true meaning of the conversation.

For these and many other reasons, not only our client’s bot but actually most chatbots are failing at providing a new, compelling UX paradigm. They are simply not smart enough.



Part 2: how to fix bots


I. giving our Bot semantic knowledge
The first thing we do when starting a new machine learning project is brainstorming about the most important features of the problem at hand. A feature is a parameter that allows you to discern between outcomes, or ‘labels’. For example, if you were to label pizzas as ‘Margherita’, ‘Pesto’ or ‘Romana’, then ‘sauce colour’ would be a great feature to use, whereas ‘is_round’ would be extremely unhelpful in finding the right category.

In our case the ‘labels’ we wanted to predict were numerous and open-ended: they are basically all answers to the question ‘what does the customer mean/want?’. That part is pretty straightforward. The major challenge is identifying relevant features. After all, what features would you, as a human, use to distinguish between the sentences ‘I’d like a large pizza’ and ‘my smartphone is not working’?

It’s clear, then, that the meaning of words and their actual structure are completely disconnected. ‘Dog’, ‘Chien’ and ‘Perro’ all refer to the same concept, but otherwise have nothing in common. The bag-of-words approach only counts total occurrences, and can therefore get away with treating words as arbitrary tags. But, as research groups realized in the late seventies, this very same flexibility is both a blessing and a curse, making bag-of-words bots very robust but wildly inaccurate.

A more modern standard for context reconstruction was tested in the late eighties under the name WordNet: it attempted to model relations between words by having humans assign them to synsets, general tree-structured groups of categories such as ‘n.dog’, ‘n.canine’ and ‘n.domestic_animal’.


The main issue with WordNet is that humans still need to tag each and every word that is going to be used in the system, for each and every language. That is a monumental task, considering the numerous, ever-changing vocabularies of our planet. It would be just like having our system discern between ‘cats’ and ‘lions’ by photographing each and every cat and lion under the sun: a no-go, given the size and scope of our project.

Therefore, we needed something entirely different. Our choice fell on a seminal discovery pioneered by Tomas Mikolov at Google in 2013. This approach is called Word2vec, or more generally a ‘word embedding’ approach, and just like many deep learning systems it allows a computer to model features all on its own instead of relying on human ‘translators’.

Word2vec basically ingests a very large corpus of text (usually Wikipedia in the local language) and assigns an N-dimensional vector to each word based on the context that usually occurs around it. So, for example, ‘I’m eating cheese’, ‘I’m eating pasta’ and ‘I’m eating pizza’ will have the system identify ‘cheese’, ‘pasta’ and ‘pizza’ as belonging to a single category: those three words will end up as neighbours in an N-dimensional vector space.


As it turns out, this kind of representation has three main advantages: first of all, it’s easy to store, understand and debug, allowing us to reduce a very large set of words (usually in the tens of thousands) to a matrix with just N columns (200, in our case). Secondly, it represents words as vectors, allowing mathematical operations on them (something that would be impossible with either bag-of-words or WordNet-like methods).
The third main advantage: it’s pretty damn similar to what’s actually in our brains. So if you take the vector for ‘Rome’, subtract ‘Italy’ and add ‘France’, what you get is actually ‘Paris’. And if you take ‘King’ + ‘Woman’ - ‘Man’, you totally do get ‘Queen’!
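
You can try this yourself with, for example, gensim and a pre-trained word2vec model; the file name below refers to the publicly available GoogleNews vectors and is used purely as an illustration:

```python
from gensim.models import KeyedVectors

# Load a pre-trained word2vec model (illustration only; it is a large download).
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# 'King' + 'Woman' - 'Man' ~ 'Queen'
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# The capital-of relation: 'Rome' - 'Italy' + 'France' ~ 'Paris'
print(vectors.most_similar(positive=["Rome", "France"], negative=["Italy"], topn=1))
```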



All similar verbs are grouped together, as are prepositions, junk foods, famous figures, past tenses and so on. In the word2vec representation, a basic version of the equality ‘Axolotl = lizard + water - color’ actually holds true. And while it’s true that, unlike a real person, the system still has no idea of what those N-dimensional clusters actually represent (and how could a computer actually understand what a ‘queen’ is?), we are computer scientists, not linguists or philosophers, and it seemed to us we had solved the first issue. Our bot was now capable of, if not semantic understanding, at least semantic clustering.

II. giving our Bot contextual understanding
Once you have a convincing semantic representation, you still have to account for the relations between words. The bag-of-words approach does away with grammar and sequencing order entirely, and simply focuses on which words appear in the sentence. This is a very naive approach, but the technical challenge increases exponentially as soon as you deviate from it: you then need to account for a large number of details and grammatical nuances. The length of your sequence becomes a variable too (and how long can a sentence get?), which makes it extremely difficult to build coherent optimization routines, to allocate memory and computing power efficiently, and to actually identify what’s important and what isn’t.

To solve this problem, you just need to think about how humans read: not by ingesting the entire sequence at once, but word by word. A human is able to read a sequence sequentially and break it down into sub-sentences, isolating the most important words.

To implement this, we borrowed a technology that’s used in image and signal recognition: a convolutional neural network (CNN). A CNN uses a ‘sliding window’ (think of your eyes while you’re reading) to cluster neighbouring words in a sentence, filter them and isolate key concepts. Just like in image recognition, a convolutional network can robustly ignore small differences in the words used and in sentence ordering by considering many windows of varying sizes with different ‘filters’. All those filters output a smaller sentence vector, from which the most important dimensions are selected in a step called ‘max-pooling’. Those key terms are then scanned again in a second ‘convolution’ step, then ‘pooled’ again, and so on.

This CNN has just two steps: a convolution step with a window size of 3, and a max-pooling step with 2 filters at a time.
Such a network will ignore small variations in ordering and context (the so-called location invariance). It will also account for the local context inside the sliding window and build many filtered sub-vectors, breaking the sequencing problem down into many smaller subsets. Last but not least, it will analyze an entire variable-length phrase and reduce it to a fixed-length vector that is guaranteed to contain the most relevant information about the original input (the so-called compositional completeness).
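
A minimal word-level CNN of this kind can be sketched in a few lines of Keras-style TensorFlow; the vocabulary size, sentence length and number of intent classes below are illustrative assumptions, not our production settings:

```python
import tensorflow as tf

VOCAB_SIZE = 20000   # assumed vocabulary size
EMBED_DIM = 200      # matches the 200-dimensional word vectors mentioned above
MAX_LEN = 56         # assumed (padded) sentence length
N_INTENTS = 100      # assumed number of intent classes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN),
    tf.keras.layers.Conv1D(128, kernel_size=3, activation="relu"),   # sliding window of 3 words
    tf.keras.layers.GlobalMaxPooling1D(),                            # max-pooling over the sentence
    tf.keras.layers.Dense(N_INTENTS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```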

Convolutional networks are a very wide topic that deserves its own write-up (and we recommend this one from the amazingly talented Chris Olah). But for the sake of keeping it short, let’s just quickly recap our results. CNNs were our answer to the fact that our semantic representation (a vectorial one, with Word2vec) was missing the general context. By coupling Word2vec with CNNs we were finally able to have our bot take the entire sentence context into account, break longer phrases down into sub-vectors, and filter them into their most relevant components. Repeat those steps multiple times, and you get a bot that’s able to understand the entire sequence by systematically reducing its complexity.

At this point our chatbot was performing much better than most competitors, and at least 20% better than our own baseline model, but it was still missing an important component: memory. A CNN breaks sentences down into clusters and only keeps the most relevant items from each of them: it discriminates based on context and word ordering (that is, it sees the difference between ‘it is a great pizza, isn’t it?’ and ‘it isn’t a great pizza, is it?’), but it still forgets that we’re talking about pizza after just a few words.

III. giving memory to our Bot
This was the last and most important point. A bot with no memory will probably understand short statements and unambiguous sentences, but fail badly at decoding meaning swings in a longer phrase. Humans remember the context they’re working with, and use it to nuance the meaning of every following word; CNNs, on the other hand, start from scratch with each filtered word cluster.


To solve this problem, we chose to work with an LSTM (long short-term memory) network. LSTMs are a special kind of recurrent neural network: whereas a normal neural network (such as a CNN) takes the entire sentence as input and processes it at once, a recurrent network ingests information sequentially: the state of the network at the previous timestep is also used as input, giving the network a rudimentary form of memory.

Unfortunately, plain RNNs tend to be unable to remember more than the last few words. LSTMs, on the other hand, replace normal neurons with ‘memory cells’ that allow them to keep track of important (past) information while discarding what is no longer useful.

A word-based LSTM memory cell takes two inputs: the word to consider and the memory state of the network. A forget gate allows the network to discard information that’s no longer relevant: if the sentence’s subject has changed, for example, the cell will forget the previous subject’s gender and number. An input gate selects relevant information about the current word and its interaction with the current network state; that information is committed to memory. An output gate, in turn, produces two outputs: the classification for the current word and a general, updated memory state. The latter is used as input to the following cell, along with the next word, and so on.
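
For the curious, the gates described above boil down to a handful of equations; here is a single LSTM step sketched in plain NumPy (random placeholder weights, not a trained model):

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """x: current word vector; h_prev: previous output; c_prev: previous memory state."""
    z = W @ x + U @ h_prev + b                      # all four gates computed in one go
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)    # forget, input and output gates
    c = f * c_prev + i * np.tanh(g)                 # drop stale memory, write new memory
    h = o * np.tanh(c)                              # output for this timestep
    return h, c


embed_dim, hidden = 200, 128
W = np.random.randn(4 * hidden, embed_dim) * 0.01   # placeholder (untrained) weights
U = np.random.randn(4 * hidden, hidden) * 0.01
b = np.zeros(4 * hidden)
h, c = lstm_step(np.random.randn(embed_dim), np.zeros(hidden), np.zeros(hidden), W, U, b)
```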


An LSTM has a memory state flow (above) and an input/output flow (below). Each gate uses its own set of neurons and activations.
LSTMs have been involved in many of the most interesting breakthroughs of the last few years, and they do indeed work as advertised. After a very lengthy training process, our chatbot was ready: time to ship.

Part 3: booting up


We unveiled the first version of the Datatonic-bot to our client after two very long weeks of research, testing and training on Google’s online infrastructure. DT-bot uses its brains to process chat requests from tens of thousands of users every day; it is written in TensorFlow and runs on Google’s Cloud ML, but it is nonetheless able to classify up to 4,000 sentences per second into hundreds of different categories (describing the needs of the calling user), and to do so correctly 85% of the time.


Since then, we’ve been busy adding even more functionality to our chatbot, enabling it to work in different languages, automatically process entity names (such as the caller’s address, name, and more), and work not only with chat data, but also with voice and telephone data.







And we’ve grown quite fond of it. It’s not the smart bot you’d see in a big-budget movie. It’s no HAL 9000 or Samantha; heck, it’s not even WALL-E! But it’s a great little product that works amazingly well. A shining example of how a small, passionate, talented technology outfit can pull off state-of-the-art artificial intelligence. It’s lacking in marketing buzzwords, but high in functionality. It’s a work of art, a labour of love, a big step forward.

Of course it’s not the artificial intelligence Hollywood has you dreaming of. Because this one, we dreamed it up ourselves.