Learning deeper: a TensorFlow use case



Understanding what a new technology is and how it fits in your life used to be an easy task. The electronic spreadsheet, the iPod, fast printing, low-cost flights are advancements that are easy to grasp, easy to measure, and follow the ‘10x rule’. They are 10 times cheaper, or faster, or better than the previous solution — and that’s why their adoption happened at a breakneck pace.

But as our world has grown increasingly complex, it has gotten quite difficult for all of us to discern between hyped lemons and real game-changers. And when it became harder to understand whether or not a product could be the ‘next big thing’, most companies actually decided not to choose at all.

The problem with this approach (and a pretty big one at that) is that standing still does not preserve the status quo. Much like the Red Queen in Lewis Carroll's masterpiece, sometimes you need to run just to keep your place. And what has worked in the past might not be what you need now.

A contagious case of linear regression:

Case in point: linear regression. Invented more than a century ago, this tool became a staple of the analyst’s bag of tricks by being easy to implement, intuitive to interpret and (probably the biggest culprit) integrated as a one-click solution in most analytics packages.


The principle behind this method is extremely simple: map an input x to the predicted output y by multiplying it by a coefficient α. Then compare your prediction to the actual outcome, update your coefficients and repeat until the difference between prediction and reality is minimized.
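As a toy illustration (not from the original analysis, and with made-up numbers), the whole loop fits in a few lines of numpy:

```python
import numpy as np

# Toy data: 100 points with a "true" coefficient of 3 plus some noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + rng.normal(0, 1, size=100)

# Start with a guess for alpha, then repeatedly nudge it in the direction
# that shrinks the mean squared error between prediction and reality.
alpha, learning_rate = 0.0, 0.01
for _ in range(1000):
    y_hat = alpha * x                          # prediction
    gradient = -2 * np.mean((y - y_hat) * x)   # d(MSE)/d(alpha)
    alpha -= learning_rate * gradient

print(alpha)  # ends up close to the true coefficient, 3
```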

While linear regression can provide valid estimates and serve as a decent tool for intuitive insights on mid-sized data, it fails miserably when confronted with more complex datasets, and it is generally outperformed by modern solutions that, believe it or not, are just as intuitive and easy to implement. In this case, we'll use a neural network.

both z and w are non-linearly transformed (in our case with a rectified linear unit, or ReLU)

A neural network can simply be explained as a regression of transformed regressions: instead of linearly mapping your input x to the prediction y, you first use it to detect features via so-called hidden layers. The outcome of each hidden layer (z and w in our figure) is fed into the next layer until the final layer, the prediction, is reached.
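In code, that forward pass is nothing more exotic than a few matrix multiplications. Here is a minimal numpy sketch; the layer sizes are made up purely for illustration:

```python
import numpy as np

def relu(a):
    # The rectifier: negative activations are clipped to zero.
    return np.maximum(a, 0)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))              # one sample with 4 input features

# Randomly initialised weights for two hidden layers and a linear output.
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
W3, b3 = rng.normal(size=(8, 1)), np.zeros(1)

z = relu(x @ W1 + b1)                    # first hidden layer
w = relu(z @ W2 + b2)                    # second hidden layer
y_pred = w @ W3 + b3                     # final layer: the prediction
```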

Since an example is worth more than a thousand words, we’ll show you a step-by-step comparison of past and future tools along with some very readable Python code you can easily try and tinker with on your own. So let’s get started!

For this example we'll be working with the UCI Bike Sharing dataset. This is a very popular real-life dataset recording the number of bikes shared through Washington, D.C.'s Capital Bikeshare system, aggregated hourly over the span of two years, along with a number of convenient predictors, such as 'time of day', 'humidity', 'temperature' and more.




With a couple of simple plots we can see that our data distribution follows what one might expect from such a dataset: more bikes are shared during normal working hours, when the weather is milder and when there's not too much wind.
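The exploratory step boils down to something like the sketch below (column names follow the UCI hour.csv file; the exact plots we used may differ slightly):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the hourly bike sharing data (hour.csv from the UCI repository).
bikes = pd.read_csv("hour.csv")

# Average rentals per hour of the day, plus rentals against two weather variables.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
bikes.groupby("hr")["cnt"].mean().plot(kind="bar", ax=axes[0], title="Average rentals per hour")
bikes.plot.scatter(x="temp", y="cnt", ax=axes[1], title="Rentals vs temperature")
bikes.plot.scatter(x="windspeed", y="cnt", ax=axes[2], title="Rentals vs wind speed")
plt.tight_layout()
plt.show()
```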

A more in-depth graphical analysis can be found in the complete code listing; but let's cut to the chase and see how well our linear regression performs using the scikit-learn package, which runs a standard OLS linear regression in just a few lines of code:
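(The snippet below is a representative sketch reusing the bikes DataFrame loaded above; the exact feature selection and split live in the full code listing linked at the end, so your numbers may differ.)

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# A handful of predictors from hour.csv; the full listing may use a different set.
features = ["season", "hr", "workingday", "weathersit", "temp", "hum", "windspeed"]
X, y = bikes[features].values, bikes["cnt"].values

# Chronological split: fit on the first ~80% of the period, forecast the rest.
split = int(len(bikes) * 0.8)
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

ols = LinearRegression().fit(X_train, y_train)
ols_pred = ols.predict(X_test)

print("MAE: %.1f" % mean_absolute_error(y_test, ols_pred))
print("R^2: %.2f" % r2_score(y_test, ols_pred))
```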



How did it do? Not very well: if we were to follow what our model prescribes, we'd end up with around 140 bikes too many or too few during the forecast period. And sure enough, our R² score is negative (which can happen because R² is computed as 1 - RSS/TSS, and a model that does worse than simply predicting the mean makes RSS larger than TSS), meaning that our model is not effective at predicting the number of bikes we'd need to meet demand and would never be useful in an actual production scenario.

Faced with a similar problem, a motivated analytics team would probably start adding quadratic terms to the regression function to model polynomial relationships, or simply try different functions in their favorite software package until an acceptable solution is reached. We sure tried, and the best we could do was a Random Forest model with an R² score of 56%.
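For reference, that Random Forest attempt could look like the sketch below, reusing the same split; the hyperparameters are illustrative rather than the exact ones behind the 56% figure:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# An off-the-shelf ensemble of 100 trees on the same features and split.
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print("R^2: %.2f" % r2_score(y_test, forest.predict(X_test)))
```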

But what if there was a better way? A technology that’s an order of magnitude better than our current toolkit?


Ten times better


Enter deep learning. Despite being touted as the latest cutting-edge advancement, deep learning as such has been around for the last 40 years. We simply didn't have enough data and enough computing power to really let it shine. With modern implementations such as Google's TensorFlow (which we use at Datatonic for most of our commercial projects), scalable online infrastructure and continuous advances in computing power, replacing our old tricks with new ones has never been this easy.

Case in point: our deep neural network model for the bike dataset. We deliberately decided to use tensorflow.learn (formerly SkFlow) to show how a state-of-the-art deep regressor can be implemented in about as many lines of code as our standard linear models. And here it is:
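What follows is a representative sketch using the tf.contrib.learn flavour of that API; the layer sizes and number of training steps are assumptions rather than our exact settings, and newer TensorFlow releases have since moved this functionality to tf.estimator:

```python
import numpy as np
from tensorflow.contrib import learn

# tf.contrib.learn estimators expect float32 inputs.
X_train_f, X_test_f = X_train.astype(np.float32), X_test.astype(np.float32)
y_train_f = y_train.astype(np.float32)

# Tell the estimator that every input column is a real-valued feature.
feature_columns = learn.infer_real_valued_columns_from_input(X_train_f)

# A deep regressor with two fully connected hidden layers.
dnn = learn.DNNRegressor(feature_columns=feature_columns,
                         hidden_units=[64, 32])
dnn.fit(X_train_f, y_train_f, steps=5000, batch_size=64)

dnn_pred = np.array(list(dnn.predict(X_test_f)))
```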



Doesn’t look that complicated, does it? And yet this model is able to predict our test set with 92% accuracy, reducing the Mean Squared Error by a factor 10. Best of all, and thanks to the amazing work of a number of very talented developers, this incredible increase in efficiency does not make the implementation much more complicated: on the contrary, with a model as simple as our standard OLS we are basically able to produce a production-ready forecaster.

If you’re feeling dizzy after all those numbers, here’s a visual representation of the linear model’s predictions vs the actual values for a random subset of our test set:



 And here’s its counterpart, this time for the DNN model’s predictions:

  
There are just as many values on this plot as on the first one: the relative emptiness is simply due to the DNN model's predictive power, as smaller differences between predicted and actual values lead to much shorter blue segments.

In this particular case we decided to keep the mood light by predicting the number of shared bikes in Washington, D.C. But in today's world (and especially if we consider the kind of projects we usually take on at the office) it might as well have been something crucially important: cancer occurrences maybe, or streams of products sold, or high-speed financial data, or the number of people crossing the border and in need of assistance. When the stakes are this high, a combination of old and new techniques can make the difference between greatness and failure.

That’s why there’s simply no excuse for any modern team to stop innovating. The technology to be at the cutting edge of the analytics game is out there: it’s demonstrably order of magnitudes better than the incumbent solutions, and extremely easy to actually deploy it in a real-world environment. 
If you want to see our code and play with it, we have bundled the entire codebase along with our visualization at this link, and the dataset at this link. Just install the needed dependencies and run it in your editor of choice (we generally use Jupyter Notebook on a Google Compute Engine instance).

And should you have any questions (or if you want to hear first-hand how we use these kinds of insights to solve real problems for our clients), don't hesitate to contact us via our website, or at our European or British headquarters!

-the Datatonic team




Interactive Map Spotfire

Creating custom interactive maps for TIBCO Spotfire

Have you ever wanted to create an interactive dashboard with your own custom process map? If so, then this blog post is exactly what you're looking for!

Here at Datatonic we also specialize in data visualization tools. All of our blog posts so far have been about Tableau Software; this time, however, we will be using TIBCO Spotfire.

In order to create a custom interactive map, the first thing we need is the map image that we want to use as our background. The original image used in this tutorial shows a nuclear power plant with several modules, each of which should be clickable in our final dashboard.


To create feature shapes from our image we will use ArcGIS (ArcMap), a downloadable, free-to-try tool that has all the functionality we need.

Upon opening ArcGIS, right-click the Layers folder and click 'Add Data' to import the background image of our nuclear plant. Now, under the 'Windows' menu in the top bar, we can open the Catalog and browse to our current folder. By right-clicking the folder we can create a new shapefile for our image; in this case we will select the 'Polygon' type.

A new layer is now added to our hierarchy. Right-click the new layer and select 'Edit Features' and 'Start Editing'. We can now start drawing all the feature shapes we need on top of our map using the tools in ArcMap. The ultimate goal is to link these features to subsets of our data so that we can integrate the map in a dashboard. To do this, we have to label the features that we create: select each feature, right-click, go to Attributes and give it an Id number. Note that by right-clicking the feature layer in the left menu we can choose to display the labels on top of the map if we want to keep track of everything. In the 'Symbology' tab we can also choose to display a different color for each of the labels.

Once the feature layer is finished, we can export our layers to be used in Spotfire. Right-click the image layer and select Data - Export Data... Set both the extent and spatial reference options to use the original image, and export the image in .BMP format. For the feature layer, repeat the same process, making sure to export all features for the layer's source data.

In Spotfire

In addition to some data from our plant, the first thing we will need is a lookup table that links the feature layer we just created to the actual data from our nuclear plant.

For each of the Id labels in our feature layer (as created in ArcMap) we need to have a corresponding instance label that indicates the data column it should be linked to. The linkfile could for example look like this:


Id     Instance
0      Reactor
1      SteamGenerator
...    ...
Upon opening Spotfire, we import both our actual data file and our linkfile. Note that in order to be able to link the two, we need an 'instance' field for each of the rows in the data file. A possible format could look like this:


Timestamp    Instance   Temperature   Pressure   ...
3/14/2016    Reactor    452 °C        15.3 MPa   ...
...          ...        ...           ...        ...


Once both files are imported, it's time to start visualizing our interactive map.


  1. Add a new map chart
  2. Delete all the existing layers
  3. Set the Appearance Coordinate Reference System to 'None'


  • Add an image layer, import the background image (.BMP file) and set the Coordinate Reference System to 'None' in the Appearance menu
  • Add a feature layer using the linkfile
  • Again, set the Coordinate Reference system to 'None'
  • Now, under geocoding: add a new Geocoding Hierarchy and import the shapefile (.shp) we just created in ArcMap
  • In the "Feature by" menu at the top, tell Spotfire that each polygon in the shapefile corresponds to one Id of the linkfile
  • Next, we have to edit the column matches to link the feature layer elements to the data columns using the Id from ArcGIS. To do this, add a column match between the imported shapefile and the linkfile using the "Id" field

Troubleshooting:

Since we didn't specify a custom coordinate system it is possible that the feature layer and the image layer are not aligned correctly. To fix this:
  • Right click the map and select the image layer
  • Under the data tab: change the 'Extent Settings'
  • In ArcGIS: right-click the image layer and go to Properties - Extent
  • Copy "the current settings of this layer" numbers into Spotfire and apply.

Making the map interactive:

OK, now that the map is correctly initialized we can start building a dashboard around it. However, if we want the dashboard to be interactive inside Spotfire, we need to apply a marking whenever we click one of the plant instances on the map and then update any existing graphs using only data from that selected instance. This means that we have to apply a marking in the linkfile (which is the data file used as the feature layer) and carry it over to a different data table in our analysis, namely the actual plant data.

To do this we need to add a relation between these two data tables.
  • Go to Edit - Data table properties - relations - manage relations
  • Create a new relation between the linkfile and the data using the "instance" column and apply it to the analysis
Now, whenever we mark something on the process map, the corresponding rows in the data table (using the "instance" match) will also be marked.

And that's it! We can now create any visualization we want using the data table, while limiting the displayed views to whatever instance was marked in the interactive map. Moreover, we could color each of the feature layers using whatever calculation we want, to display useful information about the corresponding instance.

Note that it is also possible to add multiple feature layers on top of each other. We could for example create a top layer that displays all the instances and a drill down layer below it that shows features within each of the instances. Using data functions we could then toggle specific feature layers on or off, allowing drill downs into the interactive map!










Datatonic wins global Google Cloud Service Partner 2015


We are thrilled to have won the global Google Cloud Service Partner of the year award last week in Las Vegas, which we obviously celebrated in style! This is a big recognition for our innovative work with the big data components of Google Cloud Platform: BigQuery, Datalab, Dataproc, Dataflow, ...


Most of all, this gives us confidence to push harder and make a big impact with our customers. We are sure that Google's Cloud offering is unrivaled, especially in the space of big data and machine learning. The best is yet to come!

Google Cloud Platform now has 'its Netflix': Spotify

Last week our partner Google Cloud Platform published some exciting news: Spotify is moving onto their infrastructure.

Spotify will now rely on Google's virtual machines and big data components to serve its 75 million users, stream its 30 million songs, and manage its 2 billion playlists. This GCP+Spotify partnership shows the huge value and capabilities of the managed data services of Google Cloud, and without a doubt other major data players will follow in the coming months and quarters.




Datatonic is very keen to continue building solutions on Google Cloud, internally and for our customers. Reach out to us to understand how technologies like BigQuery, Dataflow or Dataproc can help your own business.

Read more about our GCP offering on www.datatonic.com/google

GDG Cloud Belgium - Kickoff: Big Data on Google Cloud

Last week was the kickoff of the Google Developers Group Cloud Belgium.
The focus of this meetup was on Big Data.



We started with a customer use case (Vente-Exclusive) and subsequently presented two of the coolest technologies on GCP for your data needs: BigQuery for interactive data analytics and Dataflow for data pipelines, be they streaming or batch.

We would like to thank everybody who was present, and Google for providing some food and drinks afterwards!

You can find all the slides below:







Also, make sure to register on the GDG Cloud meetup page if you would like to be informed about future events.