Simple Recommendations using Spark on Google Cloud Dataproc

Wednesday, September 14, 2016 Andrew Powell

In this blog post we’re going to show how to build a very simple recommendation engine using basket analysis and frequent pattern mining. We’re going to implement it using Spark on Google Cloud Dataproc and show how to visualise the output in an informative way using Tableau.

Given ‘baskets’ of items bought by individual customers, one can use frequent pattern mining to identify which items are likely to be bought together. We will show this with a simple example using the groceries dataset, but it could easily be extended to movies, TV, music, and more!

The groceries dataset contains a list of baskets from a grocery store in the following format:

+ Citrus fruit, Semi-finished bread, Margarine, Ready soups
+ Tropical fruit, Yogurt, Coffee
+ Whole milk
+ Pip fruit, Yogurt, Cream cheese, Meat spreads
+ Other vegetables, Whole milk, Condensed milk, Long life bakery product

So we see customer 1’s basket contained some fruit, bread, margarine and soup, customer 2’s basket contained tropical fruit, yogurt and coffee, and so on. Now let’s see what a basket analysis tells us.

Basket analysis

Let’s start by defining some terms. The support (or frequency) of a particular item is defined as the percentage of baskets that item features in, which makes it a measure of the popularity of individual items. In this dataset the most popular items are whole milk, vegetables and rolls/buns.

We similarly define the support for the item pair [A, B] (or, generalising, for item groups [A, B, C], etc.) as the percentage of baskets the two items feature in together.

Can we use this measure for a recommendation engine? We might naively think so: a high support for [A, B] means that lots of people bought items A and B together, so why not recommend item B to someone buying item A?

To see the problem, consider this example. Say 50% of baskets contain fruit and 50% contain chocolate. If there were no correlation between buying chocolate and buying fruit, we would expect 50% of the baskets that contain fruit to also contain chocolate, so 25% of all baskets would contain the pair [fruit, chocolate]. What if 10% of all baskets contain sweets? Then we would similarly expect 5% of baskets to contain [fruit, sweets], and 5% to contain [sweets, chocolate].

Now say we look at the data and discover the following:

+ [fruit, chocolate] feature in 20% of baskets
+ [fruit, sweets] feature in 5% of baskets
+ [sweets, chocolate] feature in 9% of baskets

Now we begin to see why using support was a bad choice. The fraction of baskets containing fruit and chocolate is lower than the value we argued for when the buying of the two items was uncorrelated; this means you are actually less likely to buy fruit if you’ve bought chocolate, and vice versa! Conversely, sweets and chocolate feature in many more baskets than expected, so if you’ve bought sweets you are more likely to buy chocolate!

Using support would have led us to recommend fruit to people who buy chocolate. But we have seen that a better recommendation would have been sweets! This is where lift comes in:

lift(A, B) = support for [A, B] / ((support for A) x (support for B))

Lift is the support for the item pair [A, B], normalised by the product of the support for A and the support for B.

What is this normalisation? Well, "support for A x support for B" is exactly the number we computed above when we worked out the expected values assuming the buying of A and B is uncorrelated. So lift is the actual support for [A, B], normalised by the expected support if the items were uncorrelated; in other words, lift measures correlation. Given you bought chocolate, you are more likely than the average customer to also buy sweets (and less likely to buy fruit)!
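To make this concrete, here is a minimal sketch of the lift calculation applied to the toy numbers above (plain Python, no Spark needed):

```python
def lift(pair_support, support_a, support_b):
    """Observed pair support divided by the expected support
    if the two items were bought independently."""
    return pair_support / (support_a * support_b)

# Toy numbers from the example above: fruit and chocolate each
# appear in 50% of baskets, sweets in 10%.
print(lift(0.20, 0.5, 0.5))  # [fruit, chocolate] -> 0.8, negatively correlated
print(lift(0.09, 0.5, 0.1))  # [sweets, chocolate] -> 1.8, positively correlated
```

A lift above 1 means the pair appears together more often than chance would suggest; below 1, less often.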

How did we do it?

We used Google Cloud Platform to perform the analysis. Such a small input file probably didn’t need big data tools, but the analysis makes an interesting use case for them. We created a (small) cluster in Google Cloud Dataproc and, using an initialisation script, installed Spark on the cluster. We then transferred the file from Google Cloud Storage onto the master node of the cluster, and we were good to go in under five minutes!

We used the FPGrowth algorithm in Spark to perform frequent pattern mining. This algorithm efficiently finds frequent patterns in the datasets, in our case: frequent itemsets!

from pyspark import SparkContext
from pyspark.mllib.fpm import FPGrowth

sc = SparkContext(appName="basket-analysis")
# Load the data: one basket per line, items comma-separated
data = sc.textFile("groceries.csv").map(lambda line: line.strip().split(","))
model = FPGrowth.train(data, minSupport=0.0005, numPartitions=10)
result = sorted(model.freqItemsets().collect())

The algorithm finds all itemsets with support greater than minSupport. The output is a list of itemsets of any length, together with the support of each itemset in the dataset:

[Item A], 0.5
[Item B], 0.4
[Item A, Item B], 0.1
[Item C], 0.09
[Item A, Item B, Item C], 0.01
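One practical note: in pyspark.mllib, model.freqItemsets() returns FreqItemset records carrying a raw basket count (freq) rather than a fraction, so to get support values like those above you divide by the total number of baskets. A minimal sketch, with the collected records mocked as plain tuples:

```python
# Mocked (items, freq) records, standing in for model.freqItemsets().collect()
num_baskets = 100
freq_itemsets = [(["Item A"], 50), (["Item B"], 40), (["Item A", "Item B"], 10)]

# Map each itemset (as a sorted tuple, so it can be a dict key)
# to its support fraction
support = {tuple(sorted(items)): freq / num_baskets
           for items, freq in freq_itemsets}

print(support[("Item A", "Item B")])  # 0.1
```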

And that is it! We now have everything we need to build a very simple recommendation engine.


Visualising product affinity in an interesting and simple way is tricky. We focus only on itemsets of length 2, i.e. pairs of items. We can do this because if there is an itemset of length greater than 2, say (item A, item B, item C), there must also be itemsets of length 2 for every pair within the longer itemset: (item A, item B), (item B, item C) and (item C, item A).

We can simply manipulate the output of the FPGrowth algorithm to restrict it to item pairs, then use the itemsets of length 1 to calculate the lift, and give each pair a unique identifier. This gives the output

Itemset ID, item 1, item 1 support, item 2, item 2 support, pair support, lift
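A minimal sketch of that manipulation in plain Python (the item names and support values here are made up for illustration):

```python
# Supports keyed by itemset, as produced in the previous step
support = {("flour",): 0.02, ("sugar",): 0.03, ("flour", "sugar"): 0.005}

rows = []
pairs = [s for s in support if len(s) == 2]
for pair_id, (a, b) in enumerate(pairs):
    # lift = pair support / (item 1 support * item 2 support)
    lift = support[(a, b)] / (support[(a,)] * support[(b,)])
    # One row per pair, matching the schema above:
    # Itemset ID, item 1, item 1 support, item 2, item 2 support, pair support, lift
    rows.append((pair_id, a, support[(a,)], b, support[(b,)], support[(a, b)], lift))

print(rows[0])
```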

We write this output into Google Cloud BigQuery so we can easily visualise the results using Tableau.

We use the visualisation method shown above. We like this method as it’s intuitive and shows the results clearly. In the scatter plot we plot each item: the x axis shows the maximum lift between that item and any matching pair, the y axis shows the average support for that item and its matching pairs, and the point size is the support for the item itself. So size and height denote how popular an item is and how many frequent item pairs it belongs to. The really interesting axis is the maximum lift direction.

If an item is located on the right-hand side of the plot, then there is another item that pairs very well with it. Let’s look at a few cool examples:


Flour

The lift parameter tells us that people who are buying flour are likely to also buy sugar and baking powder: all ingredients for baking!

If we look at the third column on the bottom instead, we see the largest support is for sugar, but also root vegetables, which doesn’t seem right at all! In fact, if we scroll down, the highest support is actually for flour and whole milk. Whole milk just so happens to be the most popular item in the whole dataset! Luckily the use of lift normalises away this fact, and we’re left with only the most relevant matching items at the top. So instead of recommending milk to someone buying flour (who doesn’t buy milk already?), we would recommend sugar and baking powder.

Frozen fish

People who buy frozen fish are also likely to buy frozen meals. These people probably don’t like to cook, or don’t have time to. We’re starting to see how this becomes useful for a recommendation engine.

Our final use case is Processed Cheese, which pairs very strongly with ham and white bread (perhaps for a lovely sandwich!).

Concluding Remarks

We’ve managed to build a very simple recommender using basket analysis, in only a few lines of code, using Spark on Google Cloud Dataproc. It is easy to see how this could be extended to many more interesting sectors: for instance, we could recommend music, films, or TV programmes, where the ‘baskets’ are the albums or movies downloaded by individual customers. Then we could recommend new movies to a viewer based on what they’re currently watching! And this is just a first step; there are many more interesting and complex techniques one could apply to build a recommendation engine, for instance collaborative filtering.