GDG Cloud meetup: Cloud Machine Learning and TensorFlow

We were very happy with the great turnout at the latest GDG Cloud meetup, which we sponsored.
It was about Cloud Machine Learning and TensorFlow. What's even better is that between this meetup and today, CloudML went into public beta, so you can start playing with it yourself!

For those who missed the presentations, and for those who would like to see them again, the slides are below:

Google Cloud Release notes

Pre-trained models: Speech, Vision and NLP API

Custom models: how to build and deploy your own models using TensorFlow

For the people interested in learning more about TensorFlow, you can join the new TensorFlow meetup group!

Make bots great again

An illustrated longread on the perils & pitfalls of building a natural language AI.

Sometime in 2016, bots became the next big thing. Spurred by positive press coverage and hyped demand, many digital companies jumped on the bandwagon and started providing mediocre solutions for a very complex problem; one that’s still unsolved after more than 60 years of research in the field.

Forget what you read in the Sunday paper: machines can’t use language efficiently, form abstractions without being coaxed to do so, or represent machine-born concepts in human terms. The artificial intelligence Hollywood has you dreaming of just isn’t here yet.

At Datatonic, the company I work for, we’re no experts in bots. We’re a machine learning company; and since we mainly work with very large customers, we’re basically immune to the media’s hype machine. We also tend to deliver technology that works — and bots, in their current incarnation, did not fit the bill.

Or so we thought, until Google called us with a challenging offer. A very large IT provider for an even larger European telecom company had been struggling for quite some time with creating believable bots that would take over their customer service. After almost one year, the project had stalled. The IT provider had turned to Google, which in turn decided to ping our team. We had our reservations, but all in all it was an offer we simply couldn’t refuse.

Our NDA unfortunately forbids us from sharing actual code, but we thought we’d give something back to the community and to the general public at large by charting our thought process, technical approach and final results.

Part 1: Why bots don’t work

A complete explanation of each and every issue involved in natural language understanding would be beyond the scope of this article. But we’d like to pinpoint how and why the bot-hype machine lied to you.
First of all: bots (and AIs in general) have no semantic understanding of what a word actually means. Humans are very good at detecting the meaning of a word based on context: if I told you I’d spent my weekend hunting axolotls in Extremadura, you’d be able to deduce that axolotls are animals and Extremadura a location — probably a Hispanic one, even though you can’t really put your finger on why. And even if you’ve never seen an axolotl in your life, you’d immediately understand what it looks like if I were to tell you it’s a ‘transparent lizard that lives underwater’. The semantic representation in your head allows you to do operations of the sort: axolotl = lizard + water − color, even though there’s no obvious way to sum or subtract words.

Second major pain point: a bot has no general context understanding. If I told you ‘someone drew a gun’, your reaction would be vastly different depending on whether we were talking about artists or bank robbers.

Last main issue to consider: a bot has no memory. You can trick it into holding a rudimentary ‘database’ of stuff that has already been handled in the conversation; but even on a sentence level every bot suffers from a bad case of memory loss. Just think about the sentence ‘I’d like a pizza’, and how it changes if you prepend ‘I’m not sure if-’, ‘I’m totally positive- ’ or ‘I don’t think -’. A bot that has no memory of what’s been said previously in the sentence and in the conversation cannot possibly understand humans correctly.

Many of the existing off-the-shelf bot AI solutions have decided to sidestep those issues by naively focusing on a very narrow pre-processed user experience path. To configure your bot for prime-time, you need to input all of the possible conversation branches, after which the AI tries to recognize what your users are actually asking. All of the user’s interactions have to be charted beforehand. It’s more of a choose-your-adventure book than a real chatbot.

In this widely used brute-force approach, only an (almost) exact sentence match will be recognized. So you have to define your entire conversation beforehand.

Bag-of-words is slightly better: the match probability is calculated on each single word and aggregated, so you don’t need an exact sentence match.

Bots that focus on intent recognition will often simplify sentences by disregarding their grammar and structure entirely, throwing all of the sentence’s words in a single ‘bag’ and counting the number of occurrences. And while this bag-of-words approach has been very successful in the past for simple tasks such as spam filters (as an e-mail containing many occurrences of ‘Russian brides’ or ‘enlargement’ is probably spam), it fails to account for many nuances in human speech such as word order and punctuation (keep in mind a single comma separates the friendly “let’s eat, grandpa” from the cannibalistic “let’s eat grandpa”).
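Purely as an illustration (this is not from our project code), here is what a bag-of-words representation does to the grandpa example, using scikit-learn’s CountVectorizer: both sentences collapse to the same vector.

from sklearn.feature_extraction.text import CountVectorizer

# Two sentences with opposite meanings but the same words
sentences = ["let's eat, grandpa", "let's eat grandpa"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # ['eat', 'grandpa', 'let']
print(counts.toarray())                    # two identical rows: word order and punctuation are lost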

Even worse, because of their catch-all structure, general purpose bots will only try to count words with a specific overarching meaning (‘entities’), and by doing so miss important clues about the true meaning of the conversation.

For these and many other reasons, not only our client’s bot but actually most chatbots are failing at providing a new, compelling UX paradigm. They are simply not smart enough.

Part 2: how to fix bots

I. giving our Bot semantic knowledge
The first thing we do when starting a new machine learning project is brainstorming about the most important features of the problem at hand. A feature is a parameter that allows you to discern between outcomes, or ‘labels’. For example, if you were to label pizzas into ‘Margherita’, ‘Pesto’ or ‘Romana’, then ‘sauce colour’ would be a great feature to use, whereas ‘is_round’ would be extremely unhelpful in finding out the right category.

In our case the ‘labels’ we wanted to predict were numerous and open-ended — they are basically all answers to the question: ‘what does the customer mean/want?’. That’s all pretty straightforward. The major challenge is identifying relevant features. After all, what features would you, as a human, use to distinguish between the sentences ‘I’d like a large pizza’ and ‘my smartphone is not working’?

It’s clear, then, that the meaning of words and their actual structure are completely disconnected. ‘Dog’, ‘Chien’ and ‘Perro’ all refer to the same concept, but they have otherwise nothing in common. The bag-of-words approach only counts total occurrences, and can therefore get away with random tags. But, as research groups realized in the late seventies, this very same flexibility was both a blessing and a curse, making bag-of-words bots very robust but wildly inaccurate.

A more modern standard for context reconstruction was tested in the late eighties under the name WordNet: it attempted to model relations between words by having humans assign them to synsets: general, tree-structured groups of categories such as ‘n.canine’ or ‘n.domestic_animal’.

The main issue with WordNet is that humans still needed to tag each and every word that was going to be used in the system, for each and every language. That is a monumental task considering the numerous, ever-changing vocabularies of our planet. It would be just like having our system discern between ‘cats’ and ‘lions’ by photographing each and every cat and lion under the sun: a no-go, given the size and scope of our project.

Therefore, we needed something entirely different. Our choice fell on a seminal discovery, pioneered by Tomas Mikolov at Google in 2013. This approach is called Word2vec, or more generally a ‘word-embedding approach’, and just like many deep-learning systems it allows a computer to model features all on its own instead of relying on human ‘translators’.

Word2vec basically ingests a very large corpus of texts (usually Wikipedia in the local language), and assigns an N-dimensional vector to each word based on the context that usually occurs around it. So for example, ‘I’m eating cheese’, ‘I’m eating pasta’ and ‘I’m eating pizza’ will have the system identify ‘cheese’, ‘pasta’ and ‘pizza’ as belonging to a single category: therefore, those three words will be neighbours in an N-dimensional vector space.

As it turns out, this kind of representation has three main advantages: first of all, it’s easy to store, understand and debug, allowing us to reduce a very large set of words (usually in the tens of thousands) to a matrix with just N columns (200, in our case). Secondly, it represents words as vectors, allowing mathematical operations on them (something that would be impossible both using bag-of-words or wordnet-like methods).
Third main advantage: it’s pretty damn similar to what’s actually in our brains. And so if you take the vector for ‘Rome’, subtract ‘Italy’ from it and add ‘France’, what you get is actually ‘Paris’. If you take ‘King’ + ‘Woman’ − ‘Man’, you totally do get ‘Queen’!
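To make this concrete, here is a minimal, purely illustrative sketch using the open-source gensim library (our actual pipeline is covered by the NDA); the toy corpus and parameters are ours, not the client’s, and the API shown is gensim 4.x.

from gensim.models import Word2Vec

# Tiny toy corpus of tokenised sentences; a real model is trained on something
# the size of Wikipedia in the target language
corpus = [
    ['i', 'am', 'eating', 'cheese'],
    ['i', 'am', 'eating', 'pasta'],
    ['i', 'am', 'eating', 'pizza'],
]

# vector_size=200 matches the dimensionality mentioned in the text
model = Word2Vec(corpus, vector_size=200, window=5, min_count=1, workers=1)

# Words that share a context become neighbours in the vector space
print(model.wv.most_similar('pizza', topn=2))

# On a model trained on enough text, vector arithmetic works too:
# model.wv.most_similar(positive=['king', 'woman'], negative=['man'])  # ~ 'queen'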

All similar verbs are grouped together, as are prepositions, junk foods, famous figures, past tenses and so on. In the word2vec representation, a basic version of the equality ‘axolotl = lizard + water − color’ actually holds true. And while it’s true that, unlike a real person, the system still has no idea of what those N-dimensional clusters actually represent (and how could a computer actually understand what a ‘queen’ is?), we are computer scientists, not linguists or philosophers, and it seemed to us we had solved the first issue. Our bot could now do, if not semantic understanding, at least semantic clustering.

II. giving our Bot contextual understanding
Once you have a convincing semantic representation, you still have to account for the relations between words. The bag-of-words approach does away entirely with grammar and sequencing order, and simply focuses on which words appear in the entire sentence. This is a very naive approach; but the technical challenge increases exponentially as soon as you deviate from it. For, if you do so, you need to account for a large number of details and grammatical nuances. The length of your sequence becomes a variable too (and how long can a sentence get?), something that makes it extremely difficult to build coherent optimization routines, to allocate memory and computing power efficiently, and to actually identify what’s important and what isn’t.

To solve this problem you just need to think about how humans read: not by ingesting the entire sequence at once, but by reading it word by word. A human is able to read a sequence sequentially and break it down into sub-sentences, isolating the most important words.

To implement this, we borrowed a technology that’s used in image and signal recognition: a convolutional neural network. A CNN uses a ‘sliding window’ (think about your eyes while you’re reading) to cluster neighbouring words in a sentence, filter them and isolate key concepts. Just like in image recognition, a convolutional network can robustly ignore small differences in words used and sentence ordering by considering many windows of varying sizes using different ‘filters’. All those filters output a smaller sentence vector, from which the most important dimensions are selected in the step called ‘max-pooling’. Those key terms are then scanned again in a second ‘convolution’ step, then ‘pooled’ again, and so on.

This CNN has just two steps: a convolution step with a window size of 3, and a max-pooling step with 2 filters at a time.
Such a network will ignore small variations in ordering and context (the so-called location invariance). It will also account for the local context inside the sliding window, and build many filtered sub-vectors, breaking down the sequencing problem into many smaller subsets. Last but not least, it will analyze an entire variable-length phrase and reduce it to a fixed-length vector that is guaranteed to contain the most relevant information about the original input (the so-called compositional completeness).
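For illustration only, here is a minimal tf.keras sketch of a one-convolution text CNN of this kind; the vocabulary size, sentence length, filter count and number of intent classes are placeholders, not our production values.

import tensorflow as tf

vocab_size, embedding_dim, max_len, n_classes = 10000, 200, 50, 100  # placeholder sizes

model = tf.keras.Sequential([
    # word indices -> 200-dimensional vectors (could be initialised from word2vec)
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_len),
    # 'sliding window' of 3 words at a time, 128 filters
    tf.keras.layers.Conv1D(filters=128, kernel_size=3, activation='relu'),
    # max-pooling: keep only the strongest response of each filter over the sentence
    tf.keras.layers.GlobalMaxPooling1D(),
    # classify the filtered sentence vector into intent categories
    tf.keras.layers.Dense(n_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])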

Convolutional networks are a very wide topic that deserves its own write-up (and we recommend this one from the amazingly talented Chris Olah). But for the sake of keeping it short let’s just quickly recap our results. CNNs were our answer to the general context that our semantic representation (a vectorial one, with Word2vec) was missing. By coupling Word2vec with CNNs we were finally able to have our bots take into account the entire sentence context, break down longer phrases into sub-vectors, and filter them into their most relevant components. Repeat those steps multiple times, and you get a bot that’s able to understand the entire sequence by systematically reducing its complexity.

At this point our chatbot was performing much better than most competitors, and at least 20% better than our own baseline model, but it was still missing an important component: memory. A CNN breaks down sentences into clusters and only keeps the most relevant items from each of those: it discriminates based on context and word ordering (that is: it sees the difference between ‘it is a great pizza, isn’t it?’ and ‘it isn’t a great pizza, is it?’), but it still forgets that we’re talking about pizza after just a few words.

III. giving memory to our Bot
This was the last, most important point. A bot with no memory will probably understand short statements and unambiguous sentences, but fail badly at decoding meaning swings in a longer phrase. Humans remember the context they’re working with, and use it to nuance the meaning of every following word; CNNs on the other hand start from scratch with each filtered word cluster.

To solve this problem, we chose to work with an LSTM (Long Short-Term Memory) network. LSTMs are a special kind of recurrent neural network: whereas a normal neural network (such as a CNN) takes the entire sentence as input and processes it at once, a recurrent network actually ingests information sequentially: the state of the network at the previous timestep is also used as input, allowing the network to have a rudimentary form of memory.

Unfortunately, plain RNNs tend to be unable to remember more than the last few words. LSTMs, on the other hand, replace normal neurons with ‘memory cells’ that allow them to keep track of important (past) information, while discarding what is no longer useful.

A word-based LSTM memory cell will take two inputs: the word to consider and the memory state of the network. A forget gate will allow the network to discard information that’s no longer relevant: if the sentence’s subject has changed, for example, the cell will forget the previous subject’s gender and number. An input gate will select relevant information about the current word and its interaction with the current network state. That information will be committed to memory; and an output gate will, in turn, spew two outputs: the classification for the current word as well as a general, updated memory state. The second will be used as input in the following cell, along with the next word, and so on.
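Again purely as a sketch (not our production model), here is how an LSTM-based intent classifier can be wired up in tf.keras; the gates described above live inside the LSTM layer, and all sizes are placeholders.

import tensorflow as tf

vocab_size, embedding_dim, max_len, n_classes = 10000, 200, 50, 100  # placeholder sizes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_len),
    # the LSTM layer reads the sentence word by word and carries a memory state forward;
    # the forget/input/output gates described above are handled inside each cell
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(n_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])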

an LSTM has a memory state flow (above) and an input/output flow (below). Each gate uses its own set of neurons and activations.
LSTMs have been involved in many of the most interesting breakthroughs of the last few years, and they do indeed work as advertised. After a very lengthy training process, our chatbot was ready: time to ship.

Part 3: booting up

We unveiled the first version of the Datatonic-bot to our client after two very long weeks of research, testing and training on Google’s online infrastructure. DT-bot uses its brains to process chat requests from tens of thousands of users every day; it is written in TensorFlow and runs on Google’s Cloud ML, yet it is able to classify up to 4000 sentences per second into hundreds of different categories (describing the needs of the calling user), and to do so correctly 85% of the time.

Since then, we’ve been busy adding even more functionality to our chatbot, enabling it to work in different languages, automatically process entity names (such as the caller’s address, name, and more), and work not only with chat data, but also with voice and telephone data.

And we’ve grown quite fond of it. It’s not the smart bot you’d see in a big-budget movie. It’s no HAL 9000 or Samantha; heck, it’s not even WALL-E! But it’s a great little product that works amazingly well. A shining example of how a small, passionate, talented technology outfit can pull off a state-of-the-art artificial intelligence. It’s lacking in marketing buzzwords, but high in functionality. It’s a work of art, a labour of love, a big step forward.

Of course it’s not the artificial intelligence Hollywood has you dreaming of. Because this one, we dreamed it up ourselves.

Simple Recommendations using Spark on Google Cloud Dataproc

In this blog post we’re going to show how to build a very simple recommendation engine using basket analysis and frequent pattern mining. We’re going to implement it using Spark on Google Cloud Dataproc and show how to visualise the output in an informative way using Tableau.

Given ‘baskets’ of items bought by individual customers, one can use frequent pattern mining to identify which items are likely to be bought together. We will show this with a simple example using the groceries dataset, but it could easily be extended to movies, tv, music, etc!

The groceries dataset contains a list of baskets from a grocery store in the format

+ Citrus fruit, Semi-finished bread, Margarine, Ready soups
+ Tropical fruit, Yogurt, Coffee
+ Whole milk
+ Pip fruit, Yogurt, Cream cheese, Meat spreads
+ Other vegetables, Whole milk, Condensed milk, Long life bakery product

So we see customer 1’s basket contained some fruit, bread, margarine and soup, customer 2’s basket contained tropical fruit, yogurt and coffee, etc. Now let’s see what a basket analysis tells us.

Basket analysis

Let’s start by defining some terms. The support (or frequency) of a particular item is defined as the percentage of baskets that item features in, which is a measure of the popularity of individual items. In this dataset the most popular items are whole milk, vegetables and rolls/buns.

We similarly define the support for the item pair [A, B] (or we could generalise to item groups [A, B, C], etc.) as the percentage of baskets the two items feature in together.

Can we use this measure for a recommendation engine? We might naively think so: a high support for [A, B] means that lots of people bought items A and B together, so why not recommend item B to someone buying item A?

To see the problem, consider this example. Say 50% of baskets contain fruit and 50% contain chocolate. If there were no correlation between buying chocolate and fruit, then we would expect 50% of the baskets that contain fruit to also contain chocolate, so 25% of all baskets would contain the pair [fruit, chocolate]. What if 10% of all baskets contain sweets? Then we might expect 5% of baskets to contain [fruit, sweets], and 5% to contain [sweets, chocolate].

Now say we look at the data, we discover the following:

+ [fruit, chocolate] feature in 20% of baskets
+ [fruit, sweets] feature in 5% of baskets
+ [sweets, chocolate] feature in 9% of baskets

Now we begin to see why using support was a bad choice. The number of baskets containing fruit and chocolate is lower than what we argued for the case where the buying of these two items is uncorrelated; this means that you are actually less likely to buy fruit if you’ve bought chocolate, and vice versa! Conversely, sweets and chocolate feature in many more baskets than we expected, so if you’ve bought sweets you are more likely to buy chocolate!

Using support would have led us to recommend fruit to people who buy chocolate. But we have seen that a better recommendation would have been sweets! This is where lift comes in:

lift(A, B) = support for [A, B] / ((support for A) x (support for B))

Lift is the support for item pair [A, B] normalised to the product of the support for A and the support for B.

What is this normalisation? Well, "support for A x support for B" is the number we found above when we discussed the expected values if the buying of A and B were uncorrelated! So lift is the actual support for A and B, normalised by the expected support if the items were uncorrelated. Lift therefore measures correlation: given you bought chocolate, you are more likely than the average customer to also buy sweets (and less likely to buy fruit)!
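In code, lift is just a one-liner; plugging in the numbers from the example above:

def lift(support_a, support_b, support_ab):
    # observed co-occurrence divided by the co-occurrence expected under independence
    return support_ab / (support_a * support_b)

# Numbers from the example above: 50% fruit, 50% chocolate, 10% sweets
print(lift(0.5, 0.5, 0.20))  # fruit & chocolate: 0.8 -> negatively correlated
print(lift(0.5, 0.1, 0.05))  # fruit & sweets: 1.0 -> independent
print(lift(0.1, 0.5, 0.09))  # sweets & chocolate: 1.8 -> bought together more than expected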

How did we do it?

We used Google Cloud Platform to perform the analysis. Such a small input file probably didn’t need big data tools, but this analysis is an interesting use case for them. We created a (small) cluster in Google Cloud Dataproc and, using an initialisation script, were able to install Spark on the cluster. We transferred the file from Google Cloud Storage onto the master node of the cluster and then we were good to go in under five minutes!

We used the FPGrowth algorithm in Spark to perform frequent pattern mining. This algorithm efficiently finds frequent patterns in a dataset; in our case, frequent itemsets!

from pyspark.mllib.fpm import FPGrowth

# Load the groceries file into an RDD via the SparkContext `sc` (path is illustrative):
# one basket per line, comma-separated items
data = sc.textFile("groceries.csv").map(lambda line: line.strip().split(','))

# Mine all itemsets that appear in at least 0.05% of baskets
model = FPGrowth.train(data, minSupport=0.0005, numPartitions=10)
result = sorted(model.freqItemsets().collect())

The algorithm finds all sets of frequent items that have support greater than minSupport. The output is a list of itemsets of any length, together with the support of each itemset in the dataset:

[Item A], 0.5
[Item B], 0.4
[Item A, Item B], 0.1
[Item C], 0.09
[Item A, Item B, Item C], 0.01

And that is it! We now have everything we need to build a very simple recommendation engine.


Visualising product affinity in an interesting and simple way is tricky. We focus only on itemsets of length 2, so pairs of items only. We do this because if there is an itemset of length >2, e.g. (item A, item B, item C), there must also be itemsets of length 2 for every combination of the longer itemset: (item A, item B), (item B, item C), (item C, item A).

We can simply manipulate the output of the FPGrowth algorithm to restrict it to item pairs, and then use the itemsets of length 1 to calculate the lift; we also give each pair a unique identifier. We get the output

Itemset ID, item 1, item 1 support, item 2, item 2 support, pair support, lift
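A hedged sketch of how this table can be derived from the FPGrowth output (plain Python on the collected result; the `result` and `data` variables come from the earlier snippet, everything else is illustrative):

# supports for single items, keyed by item name (counts from the FPGrowth result above)
singles = {fi.items[0]: fi.freq for fi in result if len(fi.items) == 1}
n_baskets = data.count()  # total number of baskets, to turn counts into supports

pairs = [fi for fi in result if len(fi.items) == 2]
rows = []
for pair_id, fi in enumerate(pairs):
    item_1, item_2 = fi.items
    support_1 = singles[item_1] / n_baskets
    support_2 = singles[item_2] / n_baskets
    pair_support = fi.freq / n_baskets
    lift = pair_support / (support_1 * support_2)
    rows.append((pair_id, item_1, support_1, item_2, support_2, pair_support, lift))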

We write this output into Google Cloud BigQuery so we can easily visualise the results using Tableau.
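One possible way to push that table to BigQuery from Python (the dataset, table and project names below are placeholders, not our actual ones):

import pandas as pd
import pandas_gbq

columns = ['itemset_id', 'item_1', 'item_1_support',
           'item_2', 'item_2_support', 'pair_support', 'lift']
df = pd.DataFrame(rows, columns=columns)

# dataset, table and project id are placeholders
pandas_gbq.to_gbq(df, 'basket_analysis.item_pairs', project_id='my-gcp-project', if_exists='replace')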

We like this visualisation method as it’s intuitive and allows us to show the results clearly. In the scatter plot we plot each item: the x axis shows the maximum lift between that item and a matching pair, the y axis shows the average support for that item and its matching pairs, and the size is the support of the item itself. So size and height denote how popular that item is and how many frequent item pairs it belongs to. The really interesting axis is the maximum lift direction.

If an item is located on the right-hand side of the plot, then there is another item that pairs very well with it. Let’s look at a few cool examples:


The lift parameter tells us that people who are buying flour are likely to also buy sugar and baking powder, all ingredients for baking!

If we look at the third column on the bottom instead, we see the largest support is for sugar, but also root vegetables, which doesn’t seem right at all! In fact, if we scroll down, the highest support is actually for flour and whole milk. Whole milk just so happens to be the most popular item in the whole dataset! Luckily, the use of lift normalises for this fact and we’re left only with the most relevant matching items at the top. So instead of recommending milk to someone buying flour (who doesn’t buy milk already?), we would recommend sugar and baking powder.

Frozen fish

People who buy frozen fish are also likely to buy frozen meals. These people probably don’t like to cook, or don’t have the time to. We’re starting to see now how this becomes useful for a recommendation engine.

Our final use case is Processed Cheese, which pairs very strongly with ham and white bread (perhaps for a lovely sandwich!).

Concluding Remarks

We’ve managed to build a very simple recommender using basket analysis in only a few lines of code using Spark on Google Cloud Dataproc. It is easy to see that this could be extended to lots more interesting sectors: for instance, we could recommend music, films or TV programmes; in that case the ‘baskets’ are the albums/movies downloaded by individual customers. Now we can recommend new movies for a viewer based on what they’re currently watching! And this is just a first step; there are lots more interesting and complex things one could do, for instance collaborative filtering to build a recommendation engine.

Learning deeper: a Tensorflow use case

Understanding what a new technology is and how it fits in your life used to be an easy task. The electronic spreadsheet, the iPod, fast printing and low-cost flights are advancements that are easy to grasp, easy to measure, and follow the ‘10x rule’. They are 10 times cheaper, or faster, or better than the previous solution — and that’s why their adoption happened at a breakneck pace.

But as our world has grown increasingly complex, it has gotten quite difficult for all of us to discern between hyped lemons and real game-changers. And when it became harder to understand whether or not a product could be the ‘next big thing’, most companies actually decided not to choose at all.

The problem with this approach (and a pretty big one at that) is that standing still does not guarantee the preservation of the status quo. Much like the Red Queen in Lewis Carroll’s masterpiece, sometimes you need to run just to keep your place. And what has worked in the past might actually not be what you need now.

A contagious case of linear regression:

Case in point: linear regression. Invented more than a century ago, this tool became a staple of the analyst’s bag of tricks by being easy to implement, intuitive to interpret and (probably the biggest culprit) integrated as a one-click solution in most analytics packages.

The principle behind this method is extremely simple: map an input x to the predicted output y by multiplying it by a coefficient α. Then compare your prediction to the actual outcome, update your coefficients and repeat until the difference between prediction and reality is minimized.
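In code, that loop is only a few lines; here is a toy numpy version with a single coefficient (the data and learning rate are illustrative):

import numpy as np

# toy data: y is roughly 3 * x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3 * x + rng.normal(0, 0.1, 100)

alpha = 0.0          # the coefficient we want to learn
learning_rate = 0.1
for _ in range(1000):
    prediction = alpha * x                              # map input to predicted output
    error = prediction - y                              # compare prediction to reality
    alpha -= learning_rate * 2 * np.mean(error * x)     # update the coefficient

print(alpha)  # converges to roughly 3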

While linear regression can provide valid estimates and be a decent tool for getting intuitive insights on mid-sized data, it fails miserably when confronted with more complex datasets; and it is generally outperformed by modern solutions that, believe it or not, are just as intuitive and easy to implement. In this case, we’ll use a neural network.

Both z and w are non-linearly transformed (in our case with a rectified linear unit, or ReLU).

A neural network can simply be explained as a regression of transformed regressions: instead of linearly mapping your input x to the prediction y, you first use it to detect features using so-called hidden layers; the outcome of those hidden layers (z and w in our figure) is fed to the following layer until the final layer — the prediction — is reached.

Since an example is worth more than a thousand words, we’ll show you a step-by-step comparison of past and future tools along with some very readable Python code you can easily try and tinker with on your own. So let’s get started!

For this example we’ll be working with the UCI Bike Sharing dataset. This is a very popular real-life dataset showing the number of bikes shared in Porto, aggregated hourly over the span of two years, along with a number of convenient predictors, such as ‘time of day’, ‘humidity’, ‘temperature’ and more.

With a couple of simple plots we can see how our data distribution follows what one might expect from such a dataset: more bikes are shared during normal working hours, when the climate is milder and if there’s not that much wind.

A more in-depth graphical analysis can be found in the complete code listing; but let’s cut to the chase and see how well our linear regression performs using the scikit-learn package, which executes a standard OLS linear regression in just a few lines of code:
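The original embedded snippet isn’t reproduced here, but a minimal sketch of such an OLS baseline looks like this (the feature subset is illustrative; hour.csv is the standard file from the UCI repository, and it does not reproduce our exact split or score):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# hourly UCI bike sharing data; column names follow the public hour.csv file
df = pd.read_csv('hour.csv')
features = ['season', 'hr', 'holiday', 'workingday', 'weathersit', 'temp', 'hum', 'windspeed']
X_train, X_test, y_train, y_test = train_test_split(df[features], df['cnt'], test_size=0.2, random_state=0)

ols = LinearRegression().fit(X_train, y_train)
print(r2_score(y_test, ols.predict(X_test)))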

How did it do? Not very well: if we were to follow what our model prescribes, we’d end up with around 140 bikes too many or too few during the forecast period. And sure enough, our R² score is negative (a statistical quirk that can happen when expressing R² as 1 − RSS/TSS on out-of-sample data), meaning that our model is not effective in predicting the number of bikes we’d need to fulfill our demand and would never be useful in an actual production scenario.

Faced with a similar problem, a motivated analytics team would probably start adding quadratic terms to the regression function to model polynomial relationships, or simply try different functions in their favorite software package until an acceptable solution is reached. We sure tried, and the best we could do was a Random Forest model with an R² score of 56%.

But what if there was a better way? A technology that’s an order of magnitude better than our current toolkit?

Ten times better

Enter deep learning. Despite being touted as the latest cutting-edge advancement, deep learning as such has been around for the last 40 years. We simply didn’t have enough data and enough computing power to really let it shine. With modern implementations such as Google’s TensorFlow (which we use at Datatonic for most of our commercial projects), scalable online infrastructure and continuous advances in computing power, replacing our old tricks with new ones has never been this easy.

Case in point: our deep neural network model for the bike dataset. We explicitly decided to use tensorflow.learn (formerly skflow) to show how a state-of-the-art deep regressor can be implemented with as many lines of code as our standard linear models. And here it is:
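The original snippet used the pre-1.0 tensorflow.learn API, which has since been retired; a roughly equivalent sketch in today’s tf.keras (the hidden layer sizes and epoch count are our illustrative choices, not the original configuration) looks like this:

import tensorflow as tf

# a deep regressor with two hidden layers; X_train / X_test / y_train / y_test
# are the same splits as in the OLS sketch above
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(len(features),)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=50, batch_size=64, verbose=0)

print(model.evaluate(X_test, y_test, verbose=0))  # mean squared error on the held-out hours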

Doesn’t look that complicated, does it? And yet this model is able to predict our test set with 92% accuracy, reducing the Mean Squared Error by a factor of 10. Best of all, and thanks to the amazing work of a number of very talented developers, this incredible increase in predictive power does not make the implementation much more complicated: on the contrary, with a model as simple as our standard OLS we are basically able to produce a production-ready forecaster.

If you’re feeling dizzy after all those numbers, here’s a visual representation of the linear model’s predictions vs the actual values for a random subset of our test set:

 And here’s its counterpart, this time for the DNN model’s predictions:

There are just as many values on this plot as on the first one — the relative emptiness is simply due to the DNN model’s predictive power (as smaller differences between predicted and actual values lead to much shorter blue segments).

In this particular case we decided to keep the mood light by predicting the number of bikes in sunny Portugal. But in today’s world (and especially if we consider the kind of projects we usually take on at the office) it might as well have been something crucially important — cancer occurrences maybe, or streams of products sold, or high-speed financial data, or the number of people crossing the border and in need of assistance. When the stakes are this high, a combination of old and new techniques can make the difference between greatness and failure.

That’s why there’s simply no excuse for any modern team to stop innovating. The technology to be at the cutting edge of the analytics game is out there: it’s demonstrably orders of magnitude better than the incumbent solutions, and extremely easy to actually deploy in a real-world environment.
If you want to see our code and play with it, we have bundled the entire codebase along with our visualization at this link, and the dataset at this link. Just install the needed dependencies and run it in your editor of choice (we generally use Jupyter Notebook on a Google Compute Engine instance).

And should you have any questions (or if you want to hear first-hand how we use this kind of insights to solve real problems for our clients), don’t hesitate to contact us via our website, or at our European or British headquarters!

-the Datatonic team

Interactive Map Spotfire

Creating custom interactive maps for TIBCO Spotfire

Have you ever wanted to create an interactive dashboard with your own custom process map? If so, then this blogpost is exactly what you're looking for!

Here at Datatonic we also specialize in data visualization tools, and all of our blogposts so far have been about Tableau Software; this time, however, we will be using TIBCO Spotfire.

In order to create a custom interactive map the first thing we need is the actual map image that we want to use as our background. The original image used in this tutorial shows a nuclear power plant with several modules, each of which should be clickable in our final dashboard.

To create feature shapes from our image we will use ArcGIS (ArcMap), a downloadable, free-to-try tool that has all the functionality we need.

Upon opening ArcGIS, right-click the Layers folder and click 'Add Data' to import the background image of our nuclear plant. Now, under the 'Windows' menu in the top bar we can open the Catalog and browse to our current folder. By right-clicking the folder we can create a new shapefile for our image. In this case we will select 'Polygon type'.

A new layer is now added to our hierarchy. Right click the new layer and select 'edit features' and 'start editing'. We can now start drawing all the feature shapes we need on top of our map using the tools in ArcMap. The ultimate goal will be to link these features to subsets of our data so that we can integrate the map in a dashboard. To do this, we have to label the features that we create. Select each feature, right click, go to attributes and give it an Id number. Note that by right clicking the feature layer in the left menu we can choose to display the labels on top of the map if we want to keep track of everything. In the 'Symbology' tab we can also choose to display different colors for each of the labels if we want to. 

Once the feature layer is finished, we can export our layers to be used in Spotfire. Right click the image layer, select Data - Export Data... Set both the extent and spatial reference options to use the original image. Export the image as .BMP format. For the feature layer, repeat the same process, making sure to export all features for the layer's source data.

In Spotfire

The first thing we will need in addition to some data from our plant is a lookup table that will provide the link between the feature layer we just created and the actual data from our nuclear plant.

For each of the Id labels in our feature layer (as created in ArcMap) we need to have a corresponding instance label that indicates the data column it should be linked to. The linkfile could for example look like this:


Upon opening Spotfire, we import both our actual data file and our linkfile. Note that in order to be able to link the two, we need an 'instance' field for each of the rows in the data file. A possible format could look like this:

[Example rows from the data file: each row carries an instance label plus a measured value such as 452 °C or 15.3 MPa.]

Once both files are imported, it's time to start visualizing our interactive map.

  1. Add a new map chart
  2. Delete all the existing layers
  3. Set the Appearance Coordinate Reference system to 'None'

  • Add an image layer, import the background image (.BMP file) and set the Coordinate Reference System to 'None' in the Appearance menu
  • Add a feature layer using the linkfile
  • Again, set the Coordinate Reference system to 'None'
  • Now, under geocoding: add a new Geocoding Hierarchy and import the shapefile (.shp) we just created in ArcMap
  • In the "Feature by" menu at the top, tell Spotfire that each polygon in the shapefile corresponds to one Id of the linkfile
  • Next, we have to edit the column matches to link the feature layer elements to the data columns using the Id from ArcGis. To do this, add a column match between the imported shapefile and the linkfile using the "Id" field


Since we didn't specify a custom coordinate system it is possible that the feature layer and the image layer are not aligned correctly. To fix this:
  • Right click the map and select the image layer
  • Under the data tab: change the 'Extent Settings'
  • In ArcGis: right click the image layer and go to properties - Extent
  • Copy "the current settings of this layer" numbers into Spotfire and apply.

Making the map interactive:

OK, now that the map is correctly initialized we can start building a dashboard around it. However, if we want the dashboard to be interactive inside Spotfire we will want to apply a marking whenever we click one of the plant instances on the map and then update any existing graphs using only data from that selected instance. This means that we will have to apply a marking in the linkfile (which is the data file used as feature layer) and carry it over to a different data file in our analysis, namely the actual plant data.

To do this we need to add a relation between these two data tables.
  • Go to Edit - Data table properties - relations - manage relations
  • Create a new relation between the linkfile and the data using the "instance" column and apply it to the analysis
Now, whenever we mark something on the process map, the corresponding rows in the data table (using the "instance" match) will also be marked.

And that's it! We can now create any visualization we want using the data table, while limiting the displayed views to whatever instance was marked in the interactive map. Moreover, we could color each of the feature layers using whatever calculation we want, to display useful information about the corresponding instance.

Note that it is also possible to add multiple feature layers on top of each other. We could for example create a top layer that displays all the instances and a drill down layer below it that shows features within each of the instances. Using data functions we could then toggle specific feature layers on or off, allowing drill downs into the interactive map!