Demystifying Shazam (or Soundhound)

Dasun Pubudumal · Published in The Startup · Dec 27, 2018

Twenty-five years ago, if somebody had said that a mobile device would be able to recognize whatever song another device was playing just by listening to it, it would have sounded like a joke, not merely a fantasy but pure mythology. Then again, so did most of the wonderful things people have achieved since! Music, back in the day, was in the blood of many of us, in the sense that it gave us a little time off in a serene environment, however lousy the actual situation we were in. So when society was hit by these fast-moving technologies, plenty of questions were asked, and the answer to one such convenient question was Shazam. I'm not going to delve too much into the history of this wonderful application; you can check it out here.

The main use case of the application is straightforward: a user records any part of a song and sends that recording to the Shazam servers, and the servers return the relevant metadata about that particular piece of music.

Isn’t this trivial?

If you think about it, it seems simple. Imagine you have a big tape on which the waveform of a song is drawn, like the diagram below.

The waveform of I Want To Break Free by Queen, generated by Sonic Visualizer

Now imagine you also have a chunk of audio from the same song, recorded on some sort of mobile device.

A portion of the same song (From 3:13 to 3:50)

So all the Shazam servers have to do is compare each song's waveform until they find a segment that looks like this one, right?

Well, yes, that is the case. But think of the mammoth dataset the matching algorithm has to go through. In [1], Chris Barton says that the Shazam databases hold around 45 years' worth of music! So it is nearly impossible to accomplish the task in the naive manner described above. Moreover, think about the degradation (or attenuation, in more scientific terms) and the noise that can creep into a recording because of the surroundings and other factors. Shazam therefore needed a robust, fast and concise algorithm. Chris Barton and the crew approached several universities and found two engineers working in audio processing, both of them Ph.D. holders.

Into the realm of Spectrograms

The two major concerns for the Shazam algorithm were noise resilience and fast searching. The engineers needed a solid, digitized representation of a song. The raw waveform was not very appealing, because it does not offer an efficient search mechanism, as the discussion above showed. It is also worth remembering that this work took place in the late 90s, near the millennium, so there were few tools and frameworks in the software layer for generating spectrograms. Still, the core fundamentals, such as the Fourier Transform needed to produce a spectrogram, were already well established.

So what is a spectrogram, exactly?

A spectrogram is a three-dimensional object, even though people usually see it as a two-dimensional picture. Time, frequency and amplitude make up its three axes. Although it is possible to draw a spectrogram as a three-dimensional surface, people found it easier to depict the third axis, amplitude, as a color. The same is true of contour plots. Take a look at the following image.

Source — http://hplgit.github.io/primer.html/doc/pub/plot/._plot-readable007.html

The contour plot above shows clearly how a 2D representation of a 3D surface is made: the color encodes the intensity (or height) of the value along the third axis, the one eliminated from the 3D space.

This is the same for spectrograms as well. Take a look at this image.

Image Source — https://en.wikipedia.org/wiki/Spectrogram

This is a nicely rendered spectrogram. The color grows more intense (reddish) as the third axis, amplitude, approaches a peak.

This dramatically reduces the complexity of representing a 3D surface. Therefore, the engineers at Shazam used this representation for a full song as well as for a part of a song (a recording).

Figure A — Image Source — [2]

Pardon the poor image quality, but this figure comes from the original paper written by the lead engineer, Avery Li-Chun Wang. It is the spectrogram of a musical clip.
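To make this concrete, here is a minimal sketch of how such a spectrogram can be computed today with SciPy's short-time Fourier transform. The file name, window length and overlap below are illustrative assumptions, not the parameters Shazam actually used.

```python
from scipy.io import wavfile
from scipy.signal import spectrogram

# Load a (hypothetical) WAV file and collapse stereo to mono if needed.
sample_rate, samples = wavfile.read("recording.wav")
if samples.ndim > 1:
    samples = samples.mean(axis=1)

# Short-time Fourier transform: 4096-sample windows with 50% overlap (assumed values).
frequencies, times, amplitudes = spectrogram(
    samples, fs=sample_rate, nperseg=4096, noverlap=2048
)

# amplitudes[i, j] holds the energy of frequency bin i at time frame j,
# i.e. the "third axis" that the color encodes in the plots above.
print(frequencies.shape, times.shape, amplitudes.shape)
```

Plotting amplitudes against times and frequencies (for instance with matplotlib's pcolormesh) produces an image like Figure A.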

However convincing this may be, the representation did not solve all of the problems, as it was still too complex for a computer to work with directly. The engineers needed a solid data structure to hold a spectrogram so that the data inside it could be parsed and analyzed. This led to the idea of fingerprinting an audio clip.

Fingerprinting?

Yes, it sounds a bit odd. But the term refers to the process of deterministically generating a digital summary of an audio signal, so that the signal can be identified quickly. So how is this digital summary generated in Shazam?

Remember the reddish regions of the spectrogram? This is where they come in handy. Since they represent the highest peaks in amplitude, those points of the audio signal are the ones with the least chance of being degraded. Does that make sense? Think of it like this.

Say you’re in a DJ, and all your friends are talking to each other (Some party! They do not dance, but they talk in midst of the DJ). Say Fred, who has a lousy voice speaks very loudly to his friend. And Daphnie, whom we all loved once, speaks in a very feminine voice and volume. If you record this, whose voice has the highest chance of getting affected by the noise of the DJ? It’s Daphnies’.

In Oscillations and Waves 101, we all learned that the energy of a wave is directly proportional to the square of its amplitude. The points with the highest energy are therefore the least likely to be distorted.

So the Shazam algorithm achieves its noise resilience by sticking to these high-energy points of the spectrogram.

Constellation Map. Image Source — [2]

The stars above are the points of the spectrogram with the highest amplitude. Since the image looks like a star map, it is called a Constellation Map.

A Constellation Diagram generated for the song I Want To Break Free by Queen (generated by Sonic Visualizer)
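In code, peak picking can be sketched with a simple local-maximum filter over the amplitudes array from the previous snippet. The neighborhood size and amplitude threshold below are assumptions chosen for illustration, not Shazam's actual tuning.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def constellation_map(amplitudes, neighborhood=20, threshold=10.0):
    """Return (time_index, frequency_index) pairs for the spectrogram peaks."""
    # A bin is a candidate peak if it equals the maximum of its surrounding region...
    local_max = maximum_filter(amplitudes, size=neighborhood) == amplitudes
    # ...and it is loud enough that noise is unlikely to wipe it out.
    loud_enough = amplitudes > threshold
    freq_idx, time_idx = np.nonzero(local_max & loud_enough)
    return sorted(zip(time_idx, freq_idx))

peaks = constellation_map(amplitudes)  # `amplitudes` from the spectrogram sketch above
```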

Storing a constellation map is done via a Hash Table data structure like this.

Image Source — https://gizmodo.com/5647458/how-shazam-works-to-identify-nearly-every-song-you-throw-at-it

Now that we have a constellation map and have accounted for the noise, we can get to the main use case: matching the recording against the song. Just as with the waveform earlier, we could slide the recording's constellation map along every song's map, but that still takes too long. A better way of matching is needed.

So the engineers came up with a very perceptive idea: take a point in the constellation diagram, called an anchor point, define a target zone of points (related to the length of the recorded audio, generally 10 to 15 seconds), and compute a hash for each point in the zone with respect to the anchor point.

Image Source — [2]

A hash function is essentially a function that takes a variable-length input and produces a fixed-length output. A common illustration is a bookshelf. You have books that you need to place on a set of shelves in an orderly manner. What you do is take the title of each book and compute its character length, which produces an integer. (See how we get a fixed-length output from a variable-length input?) Book titles are unlikely to share the same character length, so the outputs end up spread out.
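As a toy version of that bookshelf analogy (and nothing more than a toy), a "hash" could map an arbitrarily long title to a fixed shelf number; the shelf count of 26 is arbitrary.

```python
def shelf_for(title: str, num_shelves: int = 26) -> int:
    # Variable-length input (the title) -> fixed-size output (a shelf index).
    return len(title) % num_shelves

print(shelf_for("An Industrial-Strength Audio Search Algorithm"))  # 19
print(shelf_for("I Want To Break Free"))                           # 20
```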

Why do we use a hash?

Why Hash? Why?

Let me explain this as we go along; I think that will serve the reader better.

So this hash function, say h, is a function of three variables.

h = f(f₁, f₂, Δt)

The symbols are self-explanatory if you look at the image below.

Image Source — [2]

I think it is now possible to answer the question: Why hash? Why?

The key principle here is that the search must be time-invariant, because the use case is not to identify just the first 10 seconds or the last 10 seconds, but any continuous 10 seconds of the song. Since the hashing method focuses on Δt rather than on t (absolute time), the same hashes are generated even if you record only a part of the song!

The same hashes will be generated even for this slice of the song, Image Source — [2]

Looked at mathematically: since A′ ⊂ A, where A is the whole song, we have shazam(A′) ⊂ shazam(A), where shazam(A) is the set of hashes of the whole song A.
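Here is a compact sketch of that combinatorial hashing step, reusing the hypothetical `peaks` list from the constellation snippet: each anchor peak is paired with the peaks in a target zone, and (f₁, f₂, Δt) is packed into a single integer. The fan-out of 10 and the 10-bit packing follow the back-of-the-envelope numbers discussed below and are illustrative, not Shazam's exact scheme.

```python
def generate_hashes(peaks, fan_out=10):
    """Yield (hash, anchor_time) pairs from a time-sorted constellation map."""
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        # Target zone: the next `fan_out` peaks after the anchor point.
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            dt = t2 - t1
            # Pack f1, f2 and delta-t into three 10-bit fields (values are
            # truncated here purely for illustration). Only delta-t, never
            # absolute time, enters the hash, which is what makes the lookup
            # time-invariant.
            h = (int(f1) & 0x3FF) << 20 | (int(f2) & 0x3FF) << 10 | (int(dt) & 0x3FF)
            hashes.append((h, int(t1)))
    return hashes

song_hashes = generate_hashes(peaks)  # `peaks` from the constellation sketch above
```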

All these hashes need to be stored in a database in a way that supports fast lookups, and the record structure itself is quite simple:

[32-bit hash as an unsigned integer]:[32-bit time offset plus the track ID]

This time offset, say t₁, is the time from the beginning of the song to the anchor point. It will come in handy when we prepare time pairs later.

So the fingerprinting step is all about generating the constellation map and hashing each anchor point. Each sample (or recording) needs to be fingerprinted in order to perform a match, and the song database itself is a datastore of fingerprints, labeled with track IDs.
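As a sketch of that fingerprint datastore, a plain Python dictionary can stand in for the real database: it maps each hash to the (time offset, track ID) pairs that produced it. The helper and the track ID reuse the hypothetical names from the earlier snippets.

```python
from collections import defaultdict

# hash -> list of (t1, track_id); every song in the catalogue is indexed once, offline.
index = defaultdict(list)

def index_track(track_id, peaks):
    for h, t1 in generate_hashes(peaks):
        index[h].append((t1, track_id))

# For example (hypothetical track ID):
# index_track("queen-i-want-to-break-free", peaks)
```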

Performance

For a moment, let us compare this method with a matching algorithm that compares single points of the constellation map.

As an example, say each frequency component takes about 10 bits and the Δt component also takes about 10 bits, so a hash (two frequencies plus one Δt) occupies roughly 30 bits of storage. A single-point search would only need to store one frequency, about 10 bits, so the combinatorial hash costs 20 extra bits of space. But those extra 20 bits make the hash far more specific, constraining the search space by a factor of roughly 2²⁰ (about a million) compared with a single-point search. The paper [2] states that if the fan-out factor (the density of the target zone) is F, the combinatorial hashing approach is about 1,000,000/F² times faster than a single-point search; with a fan-out of F = 10, for example, that is still a 10,000-fold speedup.

That is not the only complication. Let p be the probability that a peak of the constellation map survives the journey from the original source (say, a radio) to the mobile phone. Since each hash involves two such points, the joint probability of both points surviving is p², which is smaller than p (as p < 1). This is a tradeoff the algorithm accepts in exchange for speed.

Searching and Sorting algorithm

This is the genius part of the algorithm. For each hash that matches between the recorded audio clip and the database, a time pair is formed and put into a bin keyed by track ID. Remember that we attached a track ID when fingerprinting the database audio files? That is where the track IDs come from. Say the hash x is found both in the audio recording and in the database. The two time offsets are then paired as (t₁, t₂), where t₂ is the time offset of the recording (the time from the beginning of the recording to the anchor point) and t₁ is the time offset of the original song in the database.

After this, each bin is scanned and a scatterplot is drawn. If the scatterplot shows a linear relationship between most of the points, the song is a match.

How in the world does that happen??

If the song is a match, the difference between t₁ and t₂ should be the same for every matching pair, shouldn't it? Given a match, the only difference between the database constellation map and the recorded constellation map is where the recording starts within the song. So, for all matching points, this difference should be constant. Searching for a match therefore boils down to checking whether a significant cluster of points in a bin's scatterplot forms a diagonal line. This check is said to run in N*log(N) time, which is very fast compared with other methods (N being the number of points in the scatterplot).

Scatterplot of t’s in a non-matching case, Image Source — [2]
Scatterplot of t’s in a matching case forming a diagonal line, Image Source — [2]

Furthermore, to score how strong the match is, these differences are collected into a histogram, as below.

Histogram of a non-matching time difference — Image Source — [2]
Histogram of a matching time difference — Image Source — [2]

If a match occurs, a peak appears in the histogram, and the score of the match is essentially the number of matching pairs inside that histogram peak.
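Putting the last two sections together, here is a rough sketch of the binning and scoring, again reusing the hypothetical helpers from the earlier snippets. Counting the most common value of t₁ - t₂ per track is equivalent to finding the histogram peak described above; a real deployment would of course run this against a distributed datastore rather than a Python dict.

```python
from collections import Counter, defaultdict

def best_match(recording_peaks, index):
    """Return (track_id, score) for the most likely match, or None."""
    diffs_per_track = defaultdict(Counter)
    for h, t2 in generate_hashes(recording_peaks):
        for t1, track_id in index.get(h, []):
            # For a true match, t1 - t2 stays (roughly) constant across many
            # pairs: that constant offset is the diagonal line / histogram peak.
            diffs_per_track[track_id][t1 - t2] += 1
    if not diffs_per_track:
        return None
    # Score each candidate by its tallest histogram bin and return the best one.
    scores = {tid: c.most_common(1)[0][1] for tid, c in diffs_per_track.items()}
    return max(scores.items(), key=lambda kv: kv[1])
```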

References

[1] DraperTV, “How Shazam Developed Their Signature, Awesome Algorithm | Shazam Founder Chris Barton,” YouTube, 15-Jun-2015. [Online]. Available: https://www.youtube.com/watch?v=O_iv702jj90. [Accessed: 26-Dec-2018].

[2] A. Li-Chun Wang, "An Industrial-Strength Audio Search Algorithm," in Proceedings of the 4th International Conference on Music Information Retrieval (ISMIR), 2003.
