With the recent news of image recognition start-up Clarifai raising $10M, I decided to experiment with their Web API. While Deep Learning (the core of their approach) has been used for music streaming, such as recommendations on Spotify, what about the image recognition side of it?

Here, I’ll describe an experiment which combines the Spotify Web API, Clarifai’s Image Recognition API and Google Prediction in order to identify an artist’s music genre based on their album covers.


Clarifai’s deep learning API: Machine-Learning-as-a-Service

Clarifai is one of the new services in the Machine-Learning-as-a-Service (MLaaS) area (Azure ML, Wise, etc.). If you think about how Web development has evolved over the past few years, it makes complete sense as an additional step towards the No-stack start-up.

Many start-ups need Machine-Learning approaches to scale their product: classifying customers, recommending music, identifying offensive user-generated content, etc. Yet they are better off outsourcing this to platforms with core ML expertise rather than spending time and resources on this utility, unless it becomes their core business. In the same way, you don’t buy a fleet of servers and hire dev-ops but use Google App Engine, and you don’t set up a telephony system but rely on Twilio.

Clarifai’s API, which uses deep learning for image recognition, is very easy to use, with a simple API key / token system and bindings for multiple languages such as Python. For instance, tagging Green Day’s Dookie album cover starts with the following.

from clarifai.client import ClarifaiApi
clarifai_api = ClarifaiApi()

The API also lets you provide feedback on the extraction results; this feedback is then used to feed their algorithm with additional data.


SpotiTag: Tagging artists through album covers

The first step of my experiment was to tag artists using their album covers. To do so, I wrote a simple Python class which queries the Spotify Web API to get an artist’s top-10 albums and passes their covers to the Clarifai API, filtering out overly broad tags like “graphic”, “art”, etc.

This way, we can find the most relevant tags for an artist, as below.

>>> from spotitags import SpotiTags
>>> sp = SpotiTags()
>>> print sp.tag(SOCIAL_DISTORTION)[:10]
[(u'text', 5.78628945350647), (u'silhouette', 4.906140387058258),
(u'people', 4.833337247371674), (u'background', 4.743582487106323),
(u'vintage', 3.9088551998138428), (u'banner', 3.8920465111732483),
(u'men', 3.76175057888031), (u'card', 3.67703515291214),
(u'party', 3.6343321204185486), (u'old', 2.952597975730896)]

The SpotiTags class is available on GitHub.
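
As a rough sketch of the aggregation step described above (not the actual SpotiTags code), the snippet below assumes that each album cover yields a list of (tag, probability) pairs, sums the probabilities per tag across covers, and drops a small stop-list of broad tags. The names `aggregate_tags` and `BROAD_TAGS` are illustrative assumptions, and summing is only a guess suggested by the scores above being larger than 1.

```python
from collections import defaultdict

# Hypothetical stop-list; the post only names "graphic" and "art" as examples.
BROAD_TAGS = {"graphic", "art"}


def aggregate_tags(covers, broad_tags=BROAD_TAGS):
    """Sum per-cover tag probabilities and rank tags by total score.

    `covers` holds one entry per album cover, each entry being a list
    of (tag, probability) pairs.
    """
    scores = defaultdict(float)
    for cover in covers:
        for tag, probability in cover:
            if tag not in broad_tags:
                scores[tag] += probability
    # Highest total first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


covers = [
    [("text", 0.75), ("art", 0.5), ("vintage", 0.5)],
    [("text", 0.5), ("people", 0.25)],
]
ranked = aggregate_tags(covers)
print(ranked)
```

Tags that show up on several covers accumulate a higher score, which is why a recurring visual theme ends up at the top of the list.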

From artists to genres

To bridge the gap between artist tags and genre tags, I used “top x songs” playlists from Spotify, starting with two very unrelated genres: Doom-metal and K-pop! After gathering a small dataset of 140 Doom-metal artists and 102 K-pop ones, and passing them through the previous tagging process, here are the top tags for both genres.

K-pop | Doom-metal
people | nobody
female | dark
women | people
nobody | abstract
men | light
isolated | night
fashion | old
adult | vintage
group | death
business | nature
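
One way such per-genre rankings could be derived, and their overlap measured, is sketched below. It assumes that each artist’s tags are (tag, score) pairs and that a genre’s score for a tag is the sum over its artists; the post does not say how the scores were combined, so `genre_top_tags` and `shared_tags` are hypothetical helpers.

```python
from collections import Counter


def genre_top_tags(artists_tags, n=10):
    """Return the n highest-scoring tags over all artists of one genre.

    `artists_tags` holds one list of (tag, score) pairs per artist.
    Summing the scores is an assumption, not the documented method.
    """
    totals = Counter()
    for tags in artists_tags:
        for tag, score in tags:
            totals[tag] += score
    return [tag for tag, _ in totals.most_common(n)]


def shared_tags(top_a, top_b):
    """Tags present in both rankings, in the order of the first one."""
    in_b = set(top_b)
    return [tag for tag in top_a if tag in in_b]


# First five tags of each column in the table above.
kpop = ["people", "female", "women", "nobody", "men"]
doom = ["nobody", "dark", "people", "abstract", "light"]
print(shared_tags(kpop, doom))  # ['people', 'nobody']
```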

This genre-tagging approach is, in a way, similar to what Spotify published about words associated with the different genres of music. Except that I’m not analyzing song titles, but album covers!

As the tags shared by both lists show (“people” and “nobody”, for instance), the overlap between genres is quite large, and I’ll come back to that later. But for now, let’s look at how I used this data to build an artist classifier.

Cloud-based classification with Google Prediction

As its name suggests, Google Prediction is Google’s (cloud-based) Machine-Learning API, predicting an output value from a set of features. It works whether this value is a class (in our case, a music genre) or a continuous value (e.g. the expected number of streams per month for an artist), i.e. classification or regression in ML terms.

[Image: example of training data in Google Prediction]

To predict whether a set of tags belongs to the Doom-metal or K-pop category, I’ve built a simple training set as follows (note that Google Prediction splits the string itself to build its own internal model):

genre,"list of tags"

Splitting the previous dataset into training and testing sets led to 180 training examples like the following ones.

doom,"nobody vintage christmas light people nature architecture girl woman paper west history celebration islam traditional astrology event round party background reflection antique dark old shadow conceptual death man postcard back festival east years music fireworks banner couple night female abstract women model adult travel greeting religion card gold street church fine art silhouette style map sepia venice rain broken government country lantern castle color love one nude arms sexy shiny texture pin velvet wall pride"
kpop,"people adult one men north america vintage two vehicle motor vehicle transportation street cap group three police classic car humor hat cartoon outerwear xmas rapper fedora banner christmas jacket musician invitation man celebration background clothing decoration splash necktie text vest facial expression fast wedding automobile audio tape cassette stereo sound mono analogue obsolete record nobody tree radio broadcast compact reel fish eighties tuner unique nostalgia invertebrate conifer noise moon fine art black and white outfit sculpture winter singer season outdoors law nature monochrome greeting fashion menswear women serious dark change boy stroke merry travel garden female face war still life forest music teenager looking leader youth"
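
A minimal sketch of how such rows could be produced from a ranked tag list, optionally keeping only the n highest-scoring tags. The helper name `training_row` and its parameters are assumptions made for illustration, not part of the original scripts.

```python
def training_row(genre, ranked_tags, top_n=None):
    """Format one example as: genre,"tag1 tag2 ..."

    `ranked_tags` is a list of (tag, score) pairs, best first.
    `top_n` optionally keeps only the first n tags.
    """
    kept = ranked_tags if top_n is None else ranked_tags[:top_n]
    text = " ".join(tag for tag, _ in kept)
    # Tags are assumed to contain no double quotes, so nothing is escaped.
    return '{},"{}"'.format(genre, text)


row = training_row(
    "doom",
    [("nobody", 5.1), ("vintage", 4.2), ("light", 3.3)],
    top_n=2,
)
print(row)  # doom,"nobody vintage"
```

Note that a multi-word tag such as “north america” simply becomes two space-separated words in this representation, as in the sample rows above.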

In addition, to try different models, I varied the list of tags in the feature set: either all the tags for an artist, or only their top n. The results for the different approaches are as follows.

Model | Success rate
All artist tags | 88.71%
Top 100 artist tags | 93.55%
Top 75 artist tags | 87.10%
Top 50 artist tags | 91.93%
Top 25 artist tags | 87.10%
Top 10 artist tags | 82.25%
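
For reference, the success rate is simply the share of test artists whose predicted genre matches the expected one. The sketch below assumes that the 180 examples mentioned earlier are the training portion of the 242 artists, leaving 62 for testing; that test-set size is inferred, not stated, although the percentages above are consistent with it.

```python
def success_rate(predicted, expected):
    """Percentage of positions where the predicted label equals the expected one."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return 100.0 * correct / len(expected)


# Assumed test-set size: 242 artists - 180 training examples = 62.
expected = ["doom"] * 62
predicted = ["doom"] * 58 + ["kpop"] * 4
print(round(success_rate(predicted, expected), 2))  # 93.55
```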

The fun part comes next, with a small script that combines the three APIs together to return the expected class (i.e. music genre) for any Spotify artist:

  1. Using SpotiTags to query the Spotify API, get the artist’s top-10 albums and pass their covers to Clarifai, which builds the artist tag-set;
  2. Passing this tag-set to Google Prediction to predict the class, using the models trained earlier.
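
The two steps above can be sketched as a small function. To avoid making any claim about the real client libraries, the tagging and classification steps are injected as plain callables; `predict_genre`, `fake_tagger` and `fake_classifier` are hypothetical names used only for illustration.

```python
def predict_genre(artist_id, tag_artist, classify, top_n=100):
    """Chain the two steps: build the tag-set, then classify it.

    `tag_artist(artist_id)` must return (tag, score) pairs, best first;
    `classify(text)` must return a genre label.
    """
    ranked = tag_artist(artist_id)[:top_n]
    text = " ".join(tag for tag, _ in ranked)
    return classify(text)


# Stand-ins for the real services, for illustration only.
def fake_tagger(artist_id):
    return [("dark", 4.0), ("death", 3.0), ("night", 2.5)]


def fake_classifier(text):
    return "doom" if "death" in text.split() else "kpop"


print(predict_genre("some-artist-id", fake_tagger, fake_classifier))  # doom
```

In a real script, the stand-ins would be replaced by calls to the tagging class and to the trained prediction model.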

And here we are!

(env)marvin-7:clarifai alex$ python predict.py -a7zDtfSB0AOZWhpuAHZIOw5
Candlemass: doom

[Image: guessing an artist’s genre based on their album covers]

More genres, more models

That’s all good and fun, but an ideal system would be able to classify between more genres, e.g. 10 genres of popular music. I haven’t gone that far yet, but I added Punk-rock to the equation in order to try additional models.

Adding 75 Punk-rock artists, let’s have a look at the top-tags:

K-pop | Doom-metal | Punk-rock
people | nobody | people
female | dark | nobody
women | people | vintage
nobody | abstract | north america
men | light | men
isolated | night | text
fashion | old | old
adult | vintage | adult
group | death | street
business | nature | business

As before, there is a lot of overlap here, and that overlap is expected to grow as new genres are added. Thus, using the former top-n tags approach, the results are unsurprisingly worse than before.

Model | Success rate
All genre tags | 82.50%
Top 100 genre tags | 82.50%
Top 75 genre tags | 82.50%
Top 50 genre tags | 80.00%
Top 25 genre tags | 71.25%
Top 10 genre tags | 71.25%

Finding the most distinctive tags

So instead of the top tags, let’s focus on the ones that are specific to each genre, i.e. tags (again, extracted from album covers via Clarifai’s deep learning API) which appear in the top 100 of one genre but not in the top 100 of the others (still limited to those three genres).
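
This filtering amounts to a set difference per genre. A minimal sketch, assuming the top-n tag lists per genre are already available as ranked lists; `distinct_tags` is a hypothetical helper name, and the tiny lists below stand in for the real top-100 lists.

```python
def distinct_tags(top_tags_by_genre):
    """For each genre, keep the tags absent from every other genre's list.

    `top_tags_by_genre` maps a genre name to its ranked top-n tag list;
    the original ranking order is preserved.
    """
    distinct = {}
    for genre, tags in top_tags_by_genre.items():
        others = set()
        for other_genre, other_tags in top_tags_by_genre.items():
            if other_genre != genre:
                others.update(other_tags)
        distinct[genre] = [tag for tag in tags if tag not in others]
    return distinct


top = {
    "kpop": ["people", "female", "girl"],
    "doom": ["nobody", "dark", "people", "horror"],
    "punk": ["people", "nobody", "graffiti"],
}
result = distinct_tags(top)
print(result["doom"])  # ['dark', 'horror']
```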

K-pop | Doom-metal | Punk-rock
girl | horror | graffiti
young | fantasy | european
model | water | festival
style | smoke | collection
creativity | black and white | urban
beautiful | history | message
celebration | fire | city
sexy | pattern | performance
shape | sky | cartoon
person | scary | two

This definitely gives a better feeling of what each genre is about: sexy and beautiful for K-pop, scary and horrific for Doom-metal!

[Image: K-pop album covers, via Google Images]

Using this distinct-tags approach, I’ve updated the previous models (and accordingly, the training and test sets) to take into account not the top-n tags, but the top-n distinct ones.

Here are the results of the new classifiers, deciding whether an artist plays Doom-metal, K-pop or Punk-rock based on their album covers’ tags.

Model | Success rate
Top 100 distinct genre tags | 97.50%
Top 75 distinct genre tags | 98.75%
Top 50 distinct genre tags | 97.50%
Top 25 distinct genre tags | 96.25%
Top 10 distinct genre tags | 95.00%

Much better than the previous approach. Yet, as the number of genres grows, there will probably be a need to fine-tune those models to accurately identify an artist’s genre. This means using more examples in the training sets, but probably also additional data that goes further than images alone, such as album titles and maybe some MIR techniques.

The MLaaS opportunity for developers, and for Clarifai

Although it is a limited experiment, this post showcases how different elements of a cloud-based machine-learning approach can help identify what an artist plays, based solely on what their album covers look like!

More generally, using such APIs definitely simplifies the life of developers and entrepreneurs. While you still need to grasp the underlying concepts, there is no need for an army of Ph.D.s and a fleet of GPU-based servers running scikit-learn, Caffe or similar tools to make sense of your data!

As for Clarifai itself, I believe that classification and clustering could be two additional features that the audience described above would enjoy.

And I’m not even thinking of features outside the image domain (music?), as they’ve already started with video analysis. In any case, with $10M now in the bank and a growing team of experts in the area, I have no doubt we will see new features in the API sooner rather than later, showcasing what deep learning can bring to Web engineering!