In an attempt to get the latter to perform a bit more like the former, researchers at Deezer developed an artificially intelligent system that can associate certain tracks with moods. They describe their work in a new paper (“Music Mood Detection Based on Audio and Lyrics with Deep Neural Nets”) published on the preprint server arXiv.org.
“Automatic music mood detection has been an active field of research … for the past twenty years,” they wrote. “It consists of automatically determining the emotion felt when listening to a track. In this work, we focus on the task of multimodal mood detection based on the audio signal and the lyrics of the track.”
The team, citing psychological studies suggesting that lyrics “should be jointly considered” when analyzing musical mood, designed a neural network into which they separately fed audio signals and word2vec embeddings trained on 1.6 million lyrics. To teach it to gauge songs’ emotional resonance, they tapped the Million Song Dataset (MSD), a database of tracks associated with tags from Last.fm, some of which relate to moods, as well as 14,000 English words rated for valence (from negative to positive) and arousal (from calm to energetic), whose embeddings they used to select the aforementioned tags for training.
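For readers curious how lyric embeddings like these are typically produced, here is a minimal sketch using the gensim library. The corpus file, vector dimensionality, and preprocessing are assumptions for illustration, not details from the paper.

```python
# Hypothetical sketch: training word2vec embeddings on a lyrics corpus
# (gensim 4.x API; corpus path, vector size, and preprocessing are assumptions).
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

def load_lyrics(path):
    """Yield one tokenized lyric document per line of a plain-text corpus."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield simple_preprocess(line)

# Train embeddings on the lyrics corpus (1.6 million documents in the paper's setup).
sentences = list(load_lyrics("lyrics_corpus.txt"))
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

# A song's lyrics can then be mapped to a sequence of vectors,
# which would serve as the lyric-side input to the network.
song_tokens = simple_preprocess("walking on sunshine and it feels so good")
song_vectors = [model.wv[w] for w in song_tokens if w in model.wv]
```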
Because the MSD doesn’t include audio signals or lyrics, the team matched it against Deezer’s catalog using song metadata (specifically the song title, artist name, and album title) and extracted words from the lyrics at the corresponding location relative to the length of the lyrics.
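A metadata join of this kind might look like the following sketch, which pairs MSD entries with catalog entries that share a normalized title, artist, and album. The field names and the normalization rule are assumptions made for illustration.

```python
# Hypothetical sketch of the metadata-matching step: join MSD entries to a
# music catalog on normalized (title, artist, album) keys.
import re

def norm(text):
    """Lowercase and strip punctuation/extra whitespace for more forgiving matching."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def match_tracks(msd_rows, catalog_rows):
    """Return (msd_id, catalog_id) pairs that share title, artist, and album."""
    index = {
        (norm(r["title"]), norm(r["artist"]), norm(r["album"])): r["id"]
        for r in catalog_rows
    }
    matches = []
    for r in msd_rows:
        key = (norm(r["title"]), norm(r["artist"]), norm(r["album"]))
        if key in index:
            matches.append((r["id"], index[key]))
    return matches
```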
About 60 percent of the resulting dataset — 18,644 annotated tracks in all — was used to train the model, with 40 percent reserved for validation and testing.
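A split along those lines could be produced as in the sketch below, which holds out 40 percent of the tracks and divides that portion evenly between validation and test sets; the even division is an assumption, since the article only states the 60/40 figure.

```python
# Hypothetical sketch of the 60/40 dataset split described above.
from sklearn.model_selection import train_test_split

track_ids = list(range(18_644))   # placeholder IDs for the annotated tracks
labels = [0] * len(track_ids)     # placeholder mood labels

train_ids, holdout_ids, train_y, holdout_y = train_test_split(
    track_ids, labels, train_size=0.6, random_state=42
)
val_ids, test_ids, val_y, test_y = train_test_split(
    holdout_ids, holdout_y, test_size=0.5, random_state=42
)
print(len(train_ids), len(val_ids), len(test_ids))  # roughly 11186, 3729, 3729
```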
Compared to classical systems that draw on lexicons related to emotion, the deep learning model was superior in arousal detection. When it came to valence detection, the results were more of a mixed bag — the researchers note that lyrics-based methods in deep learning tend to perform poorly — but it still managed to match the performance of feature engineering-based approaches.
“It seems that this gain of performance is the result of the capacity of our model to unveil and use mid-level correlations between audio and lyrics, particularly when it comes to predicting valence,” the researchers wrote. “Studying and optimizing in detail ConvNets for music mood detection offers the opportunity to temporally localize zones responsible for the valence and arousal of a track.”
They suggest that subsequent research could use a database with labels to indicate the degree of ambiguity in the mood of tracks, or leverage an unsupervised model trained on high volumes of unlabeled data. Both tacks, they contend, would “improve significantly” the prediction accuracy of future models.