How I Would Use the Google Prediction API (To Find Your Musical Profile)
This past week Google announced that it would be offering BigQuery and Prediction APIs to help people analyze massive datasets and predict outcomes. On the surface, this type of offering is quite intriguing to people who work in machine learning.
At Grooveshark, we have a variety of problems that require analysis and prediction. For example, a few of the things we do include:
- Music recommendations for both long tail and popular music. Pandora, of course, is already famous for their expert-style recommendations. What makes our problem so unique is that we have so much more content that needs to be recommended. It is rumored that Pandora only has 700,000 tracks. Our library runs in the millions.
- People recommendations for users to follow others with similar tastes in music. Like Facebook, we have a social discovery feature. However, instead of recommending people in your social graph, we analyze your listening behavior and offer you the opportunity to follow others who listen to similar styles of music.
- Artist analytics for new and established artists who want to know where they have new fans, how their music is being accepted, where they might want to go on tour, or steps to take to get signed (if unsigned) or to further promote their music.
- Search. While we currently use Sphinx full-text search to power our search backend, we are constantly testing new methods for music discovery.
- Advertising analytics. Advertising is our core revenue generator, and so it makes perfect sense that we use every tool at our disposal to make the user experience more enjoyable. What types of ads resonate well with our users given their music taste, location, and usage of the site? How can we provide our advertisers with deep knowledge about their targeted users while respecting the rights and privacy of our users? How can we forecast whether new campaigns might be successful?
Getting Technical: The Google Prediction API
Now that you know the types of problems the BigQuery and Prediction APIs might be good for, how do we go about using them? Well, unfortunately Google didn’t provide many details at launch. In the forums, they declined to say which algorithms they would be offering. Old tried and true? New state-of-the-art? Proprietary or open? No clue. Upon signing up for an invitation to test, though, I did gain a few insights.
They seem to be offering categorical classification and prediction services.
Huh, you say? Let’s dig in further.
I have three friends: Elico, Ashlee, and Pat. Elico likes a mix of popular and niche music, everything from Dierks Bentley to Ke$ha to Rock Kills Kid. He is the easiest user to generate recommendations for. Ashlee likes primarily Jack’s Mannequin and any type of Rap. Her tastes are disparate, but nevertheless easy to characterize. Pat likes a mix of New Age and Electronica. His tastes are the most unusual and the most difficult to characterize.
Our Listening Matrix may look like this, where the numbers are the number of times they listened to that artist in a week:
What we might want to do is come up with a musical profile for each user. That way when I (as a Friend and a Follower on GS) try to see what new songs they might have discovered, I don’t need to see the songs I’m not interested in (the ones that don’t fit MY musical profile).
Of course, what we feed into Google’s API isn’t a Keynote table. And this is not how data is represented in our database either. We have to anonymize the data like this:
UserID ArtistID Count TotalCount
1 1 15 71
1 2 2 27
1 3 1 108
2 1 21 71
2 2 15 27
2 3 7 108
3 1 35 71
3 2 0 27
3 3 100 108
4 1 0 71
4 2 10 27
4 3 0 108
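As a rough sketch of how the TotalCount column in the table above could be derived, here is a bit of Python that sums plays per artist and attaches that total to every row. The names plays and add_total_counts are made up for illustration; this is not how our database or Google’s API does it, just one way to reproduce the numbers.

```python
from collections import defaultdict

# (user_id, artist_id, play_count) rows, copied from the table above.
plays = [
    (1, 1, 15), (1, 2, 2), (1, 3, 1),
    (2, 1, 21), (2, 2, 15), (2, 3, 7),
    (3, 1, 35), (3, 2, 0), (3, 3, 100),
    (4, 1, 0), (4, 2, 10), (4, 3, 0),
]


def add_total_counts(rows):
    """Append each artist's total play count (across all users) to its rows."""
    totals = defaultdict(int)
    for _, artist_id, count in rows:
        totals[artist_id] += count
    return [(user, artist, count, totals[artist]) for user, artist, count in rows]


rows_with_totals = add_total_counts(plays)
for row in rows_with_totals:
    print(*row)
```

The first printed row is 1 1 15 71, matching the first row of the table.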
Now what? Well, this is the part that Google didn’t make clear. What are you supposed to do with that data to get a musical profile once you’ve uploaded it to Google Storage?
One algorithm they might choose to employ is Naive Bayes, a classic, tried-and-true probabilistic algorithm for classification, particularly for things like spam classification. Paul Graham has a good example here. More complicated techniques also exist, such as Fisher’s Linear Discriminant, Support Vector Machines, and Relevance Vector Machines. All are used both in research and production to do categorical classification.
In our case, we first need to add categories, tags, or labels (the terminology differs depending on who you talk to, but they all mean the same thing). We could, for example, put those artists in 3 distinct categories (Hip-Hop, Electronica, and Rock) and re-label our dataset as follows:
UserID ArtistID LabelID Count TotalCount
1 1 1 15 71
1 2 2 2 27
1 3 3 1 108
2 1 1 21 71
2 2 2 15 27
2 3 3 7 108
3 1 1 35 71
3 2 2 0 27
3 3 3 100 108
4 1 1 0 71
4 2 2 10 27
4 3 3 0 108
Now we are free to use our algorithm of choice (or not, if we have to use Google’s algorithms) to calculate our musical profile, that is, what proportion of our listening behavior can be attributed to Hip-Hop, Electronica, and Rock.
Let’s take Naive Bayes as an example. First we calculate what are known as prior probabilities, or how often Hip-Hop, Electronica, and Rock naturally occur in our dataset. Our dataset is pretty simple, so this is easy to calculate:
Hip-Hop (Label ID 1): 71/206
Electronica (Label ID 2): 27/206
Rock (Label ID 3): 108/206
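The priors above are just each label’s share of all plays, which can be sketched in a few lines of Python. The label_totals dictionary and the priors function are illustrative names of my own, and I use fractions so the values stay exact (Python reduces 108/206 to 54/103, which is the same number).

```python
from fractions import Fraction

# Total plays per label, taken from the TotalCount column.
# Label IDs: 1 = Hip-Hop, 2 = Electronica, 3 = Rock.
label_totals = {1: 71, 2: 27, 3: 108}


def priors(totals):
    """Return each label's share of all plays as an exact fraction."""
    grand_total = sum(totals.values())  # 206 for this dataset
    return {label: Fraction(total, grand_total) for label, total in totals.items()}


prior = priors(label_totals)
for label, p in prior.items():
    print(label, p, round(float(p), 3))
```

Since every play belongs to exactly one label here, the three priors add up to 1.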
Now we have to calculate what are known as likelihood probabilities. In this simplified example, the likelihood is the share of a genre’s (or label’s) total plays that came from a given user, that is, Count divided by TotalCount.

Lastly, we multiply the Likelihood by the Prior to get what is called the Posterior. This final probability is what we can use to classify each user under a certain genre (or label).
UserID LabelID Prior Likelihood Posterior
1 1 71/206 15/71 .073
1 2 27/206 2/27 .010
1 3 108/206 1/108 .005
2 1 71/206 21/71 .102
2 2 27/206 15/27 .073
2 3 108/206 7/108 .034
3 1 71/206 35/71 .170
3 2 27/206 0/27 0
3 3 108/206 100/108 .485
4 1 71/206 0/71 0
4 2 27/206 10/27 .049
4 3 108/206 0/108 0
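To make the arithmetic in the table above concrete, here is a minimal Python sketch of the same calculation: prior times likelihood for every user and label, then picking the label with the highest score. It follows the simplified steps in this post rather than a full Naive Bayes implementation, the function names are hypothetical, and it assumes every label has at least one play (otherwise the likelihood would divide by zero).

```python
from fractions import Fraction

# Plays per user per label, copied from the labeled table.
# Label IDs: 1 = Hip-Hop, 2 = Electronica, 3 = Rock.
counts = {
    1: {1: 15, 2: 2, 3: 1},
    2: {1: 21, 2: 15, 3: 7},
    3: {1: 35, 2: 0, 3: 100},
    4: {1: 0, 2: 10, 3: 0},
}


def posteriors(counts):
    """Compute prior * likelihood for each (user, label) pair."""
    label_totals = {}
    for by_label in counts.values():
        for label, count in by_label.items():
            label_totals[label] = label_totals.get(label, 0) + count
    grand_total = sum(label_totals.values())

    result = {}
    for user, by_label in counts.items():
        result[user] = {}
        for label, count in by_label.items():
            prior = Fraction(label_totals[label], grand_total)
            likelihood = Fraction(count, label_totals[label])
            result[user][label] = prior * likelihood
    return result


def top_label(scores):
    """Return the label with the highest posterior score."""
    return max(scores, key=scores.get)


scores = posteriors(counts)
best = {user: top_label(user_scores) for user, user_scores in scores.items()}
print(best)
```

Note that with this setup the label total cancels out, so each posterior is simply the user’s play count for that label divided by 206.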
So we can see here (based on our limited dataset) that I am mostly a Hip-Hop listener, Elico is mostly a Hip-Hop listener, Ashlee is a Rock listener, and Pat is an Electronica listener. It’s not yet known what Google will offer, but they could choose to offer either a binary classification (Am I a Hip-Hop listener? Yes. Am I a Rock listener? No), or they could choose to output multiple classes, i.e. show to what % someone could be each type of listener.
Since a dataset could potentially contain millions of users and hundreds to thousands of different categories, scaling this type of algorithm (or other more complex ones) could get very difficult, which is why I am eagerly awaiting Google’s offering so I can take it for a test drive.
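The two output styles mentioned above can be sketched from the same posterior scores. This is only my guess at what such outputs might look like, not anything Google has documented; the function names are invented, and the input is user 2’s row of posteriors (Elico, going by the order of names in the paragraph above).

```python
from fractions import Fraction

# Posterior scores for user 2, taken from the table (count / 206).
elico_scores = {
    "Hip-Hop": Fraction(21, 206),
    "Electronica": Fraction(15, 206),
    "Rock": Fraction(7, 206),
}


def multi_class_profile(scores):
    """Rescale the scores so they sum to 1, giving a share per label."""
    total = sum(scores.values())
    if total == 0:
        # No plays at all: report a zero share for every label.
        return {label: Fraction(0) for label in scores}
    return {label: score / total for label, score in scores.items()}


def binary_profile(scores):
    """Answer yes/no per label: True only for the highest-scoring label."""
    best = max(scores, key=scores.get)
    return {label: label == best for label in scores}


profile = multi_class_profile(elico_scores)
for label, share in profile.items():
    print(f"{label}: {float(share):.1%}")
print(binary_profile(elico_scores))
```

The multi-class version keeps the information that this listener is a mix of genres, while the binary version throws it away in exchange for a simple yes or no.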
