How Shazam Works – Technology Explained
Shazam is a cellphone app that most of us use for identifying songs that we hear on the radio, playing in a restaurant, or anywhere else really. For us humans, identifying a song is very easy. We can hear a small snippet of a song and identify it very quickly, even if it is played on a different instrument or hummed. We can even identify a song from one single instrument playing, or possibly even the bass line.
Before we dip into how Shazam works, let’s look at how our brains handle the same task. Our brains don’t take a clip of sound and compare it against everything we’ve ever heard until a result turns up. A computer works differently: every match is found by comparing the clip against the songs in its database.
How Songs are Identified
Imagine having to find a needle in a haystack by picking up each individual piece of straw and comparing its length and colour to a reference image until you have a match. In its simplest form, that is how a computer identifies things: it compares the input against each entry in the database until it finds a match.
Our brains are essentially hardwired to identify sounds on the fly. When we hear something, certain neurons in our brains are activated that let us recall what we have heard before, even if it isn’t an exact match. Much of what our brains rely on is the timbre of the note, which is a characteristic of sound, independent of pitch or loudness, from which its source or manner of production can be inferred. Thanks to this, we can intuitively identify sounds.
For example, a guitar is a harmonic instrument. It doesn’t just produce a single tone at a single frequency. Alongside the base (not bass) note, it produces multiple complementary tones called overtones, whose frequencies are whole-number multiples of the base note’s frequency.
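To make the idea of overtones concrete, here is a minimal sketch that lists the first few tones in such a series. The function name `harmonic_series` is made up for this illustration, and a real guitar string only follows these ideal whole-number multiples approximately.

```python
def harmonic_series(base_hz, count):
    """Return the base frequency followed by its overtones.

    Each entry is a whole-number multiple of the base note's frequency.
    """
    return [base_hz * n for n in range(1, count + 1)]


# The open A string on a guitar in standard tuning is 110 Hz.
tones = harmonic_series(110.0, 5)
print(tones)  # [110.0, 220.0, 330.0, 440.0, 550.0]
```

The first entry is the base note itself; the remaining four are its overtones.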
In order to quantify these sounds, we use something called a spectrogram. A spectrogram is essentially a three dimensional graph used as a representation of sound. Time is shown on the X axis, frequency on the Y axis, and amplitude (or volume) on the Z axis as colours. This is easiest to explain with the following image.
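As a rough sketch of how such a graph can be computed, the code below cuts a signal into short frames and measures how strong each frequency is within each frame. It uses a slow, naive Fourier transform from the standard library so that it stays self-contained; the sample rate, frame size, and all function names are assumptions chosen for the example, not anything Shazam is known to use.

```python
import cmath
import math

SAMPLE_RATE = 1024  # samples per second (toy value for this example)
FRAME_SIZE = 64     # samples per time slice


def frame_amplitudes(frame):
    """Amplitude of each frequency bin in one frame (naive DFT)."""
    n = len(frame)
    amplitudes = []
    for k in range(n // 2 + 1):  # bins from 0 Hz up to half the sample rate
        total = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        amplitudes.append(abs(total))
    return amplitudes


def spectrogram(samples, frame_size):
    """Rows are time slices (X), columns are frequencies (Y), values are amplitudes (Z)."""
    return [
        frame_amplitudes(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def sine(freq_hz, start, stop):
    """Samples of a pure tone between two sample indices."""
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(start, stop)]


# A quarter second at 128 Hz followed by a quarter second at 256 Hz.
signal = sine(128, 0, 256) + sine(256, 256, 512)
spec = spectrogram(signal, FRAME_SIZE)

hz_per_bin = SAMPLE_RATE / FRAME_SIZE  # each column covers 16 Hz
loudest = [max(range(len(row)), key=lambda k: row[k]) * hz_per_bin for row in spec]
print(loudest)  # [128.0, 128.0, 128.0, 128.0, 256.0, 256.0, 256.0, 256.0]
```

Each row of `spec` is one column of the picture: a moment in time, with one amplitude per frequency. Real audio tools use the much faster FFT for this step.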
Spectrograms and Fingerprints
A computer is very good at working with spectrograms, but they contain an incredible amount of information. The more information a computer has to deal with, the slower it is at processing that information. Shazam converts the spectrogram into a star map called a fingerprint, where each dot represents the highest amplitude at a given time. Because no two recordings are exactly the same, these fingerprints are, for practical purposes, unique to each song.
This method of fingerprinting reduces the amount of information the computer has to deal with by an extremely large amount, as not only has the graph been reduced from a 3D graph to a 2D graph but the sampling frequency can be reduced from thousands of data points per second to a much lower number.
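The reduction from spectrogram to star map can be sketched in a few lines. This follows the simplified description above (one dot per time slice, at the loudest frequency) and is not Shazam's actual peak-picking code; `fingerprint`, `min_amplitude`, and the toy data are invented for the example.

```python
def fingerprint(spec, min_amplitude=0.0):
    """Reduce a spectrogram to a list of (time_index, frequency_bin) dots.

    `spec` is a list of rows, one per time slice, each holding one
    amplitude per frequency bin. Silent slices produce no dot.
    """
    dots = []
    for time_index, row in enumerate(spec):
        peak_bin = max(range(len(row)), key=lambda k: row[k])
        if row[peak_bin] > min_amplitude:
            dots.append((time_index, peak_bin))
    return dots


toy_spec = [
    [0.1, 0.9, 0.2, 0.1],
    [0.1, 0.2, 0.8, 0.1],
    [0.0, 0.0, 0.0, 0.0],  # silence
    [0.7, 0.1, 0.2, 0.3],
]
dots = fingerprint(toy_spec)
print(dots)  # [(0, 1), (1, 2), (3, 0)]
```

Sixteen amplitude values shrink to three dots, and the amplitude itself is dropped, which is the 3D to 2D reduction described above.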
When you open Shazam and tell it to identify a song, it uses your phone’s microphone to begin making its own fingerprint to be uploaded to the Shazam servers. This is not only less information to process, but also drastically reduces the amount of data that needs to be uploaded.
Once the fingerprint has been uploaded to the Shazam server it can begin comparing the uploaded fingerprint, which is a short part of the song, to its entire fingerprint database. If we look at the following simplified image, we can see how the comparison is done.
This comparison would then have to be made against every part of every song in Shazam’s entire database, which contains millions of songs. As you can imagine, doing these billions of comparisons is still incredibly time consuming, even though it is a greatly reduced amount of data when compared to the original spectrogram.
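The brute-force comparison described above might look something like the following sketch, which slides the short clip along every song and counts how many dots line up at each position. The two-song database, the function names, and the scoring rule are all assumptions made for illustration.

```python
def best_offset(song_dots, clip_dots):
    """Slide the clip along one song; return (offset, hits) for the best position."""
    song_set = set(song_dots)
    song_length = max(t for t, _ in song_dots) + 1
    best = (0, 0)
    for offset in range(song_length):
        hits = sum((t + offset, f) in song_set for t, f in clip_dots)
        if hits > best[1]:
            best = (offset, hits)
    return best


def identify(database, clip_dots):
    """Compare the clip against every part of every song in the database."""
    scores = {title: best_offset(dots, clip_dots) for title, dots in database.items()}
    title = max(scores, key=lambda name: scores[name][1])
    return title, scores[title]


# Hypothetical fingerprints: lists of (time_index, frequency_bin) dots.
database = {
    "song_a": [(0, 5), (1, 7), (2, 3), (3, 9), (4, 4), (5, 7)],
    "song_b": [(0, 2), (1, 2), (2, 8), (3, 6), (4, 1), (5, 3)],
}
clip = [(0, 3), (1, 9), (2, 4)]  # recorded from the middle of song_a

print(identify(database, clip))  # ('song_a', (2, 3))
```

Every song is checked at every offset, so the work grows with both the number of songs and their length, which is why this approach becomes so slow at the scale of millions of songs.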
Hashing Fingerprints
To further reduce the amount of data that needs to be processed, these fingerprints are converted into hashes. A hash function takes data of any length and produces an output of a fixed length.
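The fixed-length property is easy to see with a general-purpose hash function. The sketch below uses SHA-256 from Python's standard library purely as a stand-in; it is not the hash Shazam uses, and `fixed_length_hash` is a name made up for the example.

```python
import hashlib


def fixed_length_hash(data: bytes) -> str:
    """Hash input of any length to a 64-character hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


one_word = fixed_length_hash(b"word")
many_words = fixed_length_hash(b"word " * 100_000)  # roughly 500 KB of input

print(len(one_word), len(many_words))  # 64 64
print(one_word == many_words)          # False
```

Both outputs have the same length, even though one input is over a hundred thousand times larger than the other.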
Anything from a single word to an entire dictionary could be converted into a hash such as “46C93F334463A7B,” and we can then categorize these hashes based on their characteristics. A well-designed hash function also tries to minimize collisions, which occur when multiple inputs result in the same output. The hash stores a lot less information than the data being hashed. These hashes represent very short pieces of the song, so multiple matches in succession are required.
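Here is one way the idea of short hashes plus multiple matches in succession could be sketched. Pairing neighbouring dots into a key, storing the keys in a lookup table, and voting on the time offset are assumptions made for this illustration rather than a description of Shazam's real system, and a plain tuple stands in for the hash value (Python hashes it internally when it is used as a dictionary key).

```python
from collections import Counter, defaultdict


def pair_hashes(dots):
    """Yield (key, time_index) for each pair of neighbouring dots.

    The key (first frequency, second frequency, time gap) describes a
    very short piece of the song.
    """
    for (t1, f1), (t2, f2) in zip(dots, dots[1:]):
        yield (f1, f2, t2 - t1), t1


def build_index(database):
    """Map every key to the songs and positions where it occurs."""
    index = defaultdict(list)
    for title, dots in database.items():
        for key, t in pair_hashes(dots):
            index[key].append((title, t))
    return index


def lookup(index, clip_dots):
    """Return ((title, offset), votes) for the most consistent match, or None."""
    votes = Counter()
    for key, clip_t in pair_hashes(clip_dots):
        for title, song_t in index.get(key, []):
            votes[(title, song_t - clip_t)] += 1
    return votes.most_common(1)[0] if votes else None


# Hypothetical fingerprints: lists of (time_index, frequency_bin) dots.
database = {
    "song_a": [(0, 5), (1, 7), (2, 3), (3, 9), (4, 4), (5, 7)],
    "song_b": [(0, 2), (1, 2), (2, 8), (3, 6), (4, 1), (5, 3)],
}
index = build_index(database)

result = lookup(index, [(0, 3), (1, 9), (2, 4)])
print(result)  # (('song_a', 2), 2)
```

Each key is looked up directly instead of being compared against every song, and a song only wins when several keys in a row agree on the same offset.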
This is an extremely simplified view of the complex workings of Shazam, but it should give you a better understanding of the process that goes into the app, which was sold to Apple for a reported $400 million.