Daniel-Ro-Santiago/Spotify-Songs-MDS-Analysis

Spotify-Songs-MDS-Analysis

Project overview

This project applies multidimensional scaling (MDS) to a Spotify songs dataset with the objective of identifying the two most similar songs among the top 30 most popular tracks. The analysis is based on standardized musical and audio features and relies on distance-based similarity measures to capture relationships between songs in a reduced dimensional space.

Data

The dataset used, "songs_normalize.csv", contains musical and audio information for 2000 songs across the following variables:

| Variable | Definition |
| --- | --- |
| artist | Name of the artist |
| song | Song title |
| duration_ms | Duration of the song in milliseconds |
| explicit | Indicates whether the song contains explicit content |
| year | Release year of the song |
| popularity | Popularity score assigned by Spotify |
| danceability | Measure of how suitable a track is for dancing |
| energy | Measure of intensity and activity of the song |
| key | Musical key of the track |
| loudness | Overall loudness of the track in decibels |
| mode | Modality of the track (1 = Major, 0 = Minor) |
| speechiness | Presence of spoken words in the track |
| acousticness | Measure of whether the track is acoustic |
| instrumentalness | Probability that the track contains no vocals |
| liveness | Presence of an audience in the recording |
| valence | Musical positiveness conveyed by the track |
| tempo | Estimated tempo of the track in beats per minute |
| genre | Musical genre of the song |

Technologies

  • Python
  • Jupyter Notebook

Features

Here is what this project does:

  • Top 30 most popular songs: Identification and selection of the 30 most popular tracks based on Spotify’s popularity score.
  • Multidimensional analysis: Application of MDS using selected musical attributes to reduce dimensionality.
  • Most similar songs: Identification of the two most similar songs based on distances in the MDS space.

Limitations

The main limitations of this project are:

  • The stress value is relatively high, which may indicate distortion between the original and reduced dimensional spaces.
  • Only two components are considered, which may not capture all the variability in the data.
  • The analysis relies exclusively on the Euclidean distance metric.

Process

First, the dataset "songs_normalize.csv" was loaded, its dimensions (2000 rows and 18 columns) were verified, data types were reviewed, and the presence of missing values and outliers was checked:

[Figure: Box plots]

Next, the songs were sorted in descending order of the popularity variable and the top 30 were kept. A subset of numeric musical attributes (duration_ms, danceability, energy, loudness, and others) was selected to compare the songs. These features were then standardized to a mean of 0 and a standard deviation of 1, so that variables measured on larger scales, such as duration_ms, do not dominate the distance calculations.
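The selection and standardization step can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's code: the array names, the four example features, and the choice to compute the mean and standard deviation on the 30 selected songs (rather than on all 2000) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the dataset (assumption): 200 songs with four
# numeric features measured on very different scales.
n_songs = 200
popularity = rng.integers(0, 90, size=n_songs)
features = np.column_stack([
    rng.normal(230_000, 40_000, n_songs),  # duration_ms
    rng.uniform(0.2, 0.9, n_songs),        # danceability
    rng.uniform(0.2, 1.0, n_songs),        # energy
    rng.normal(-6.0, 2.0, n_songs),        # loudness (dB)
])

# Indices of the 30 most popular songs, in descending order of popularity.
top_idx = np.argsort(-popularity, kind="stable")[:30]
top_features = features[top_idx]

# Standardize each column to mean 0 and standard deviation 1 (z-scores),
# using statistics of the 30 selected songs only (assumption).
mean = top_features.mean(axis=0)
std = top_features.std(axis=0)  # population standard deviation (ddof=0)
X_std = (top_features - mean) / std
```

Without this step, a difference of a few thousand milliseconds in duration_ms would outweigh any possible difference in danceability, which only ranges between 0 and 1.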

Subsequently, a dissimilarity matrix was computed using Euclidean distance, measuring how different each pair of songs is in the standardized feature space. Multidimensional scaling (MDS) was then applied to this matrix to place the songs in a two-dimensional space where distances between points approximate the dissimilarities between the original songs. The resulting MDS coordinates were used to perform linear regressions between each MDS component and the original features. These regressions yield direction vectors that indicate how each attribute contributes to the positioning of songs in the MDS space.
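The distance matrix and the embedding can be sketched as below. The README does not say which MDS implementation the notebook uses; since it reports a stress value, it is probably an iterative one. This sketch substitutes classical (Torgerson) MDS because it has a closed form in plain NumPy, so its coordinates will generally differ from the notebook's. The function names and the random placeholder data are assumptions.

```python
import numpy as np

def euclidean_distance_matrix(X):
    """Pairwise Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(D, n_components=2):
    """Classical (Torgerson) MDS from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

rng = np.random.default_rng(0)
X_std = rng.standard_normal((30, 9))  # placeholder for 30 standardized songs
D = euclidean_distance_matrix(X_std)
coords = classical_mds(D, n_components=2)  # one (x, y) point per song
```

With as many components as the data has dimensions, classical MDS reproduces Euclidean distances exactly; truncating to two components is what introduces distortion.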

The songs were then visualized in the MDS space, as shown below:

[Figure: Songs in the MDS space]

The attribute vectors were included to aid interpretation of the most influential musical features along different directions of the plot:

[Figure: Attribute vectors]
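One common way to obtain such attribute vectors is to regress each standardized feature on the two MDS coordinates and use the fitted slopes as the arrow for that feature. The sketch below assumes this formulation; the README does not specify the exact regression setup, and the function name `attribute_vectors` is hypothetical.

```python
import numpy as np

def attribute_vectors(coords, X_std):
    """Least-squares direction of each feature in the MDS plane.

    Returns an array of shape (n_features, n_components): row j holds the
    slopes from regressing feature j on the MDS coordinates.
    """
    n = coords.shape[0]
    A = np.column_stack([np.ones(n), coords])   # intercept + coordinates
    coef, *_ = np.linalg.lstsq(A, X_std, rcond=None)
    return coef[1:].T                           # drop the intercept row

# Toy check with features that are exact linear functions of the coordinates.
rng = np.random.default_rng(1)
coords = rng.standard_normal((30, 2))
feature_a = 2.0 * coords[:, 0]          # varies only along the first axis
feature_b = -1.0 * coords[:, 1] + 0.5   # varies only along the second axis
vectors = attribute_vectors(coords, np.column_stack([feature_a, feature_b]))
```

In the toy data, the arrow for `feature_a` points along the first axis and the arrow for `feature_b` points along the negative second axis. A long arrow means the feature changes quickly in that direction of the plot.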

Finally, Euclidean distances between all pairs of songs were computed in the reduced MDS space. The zero distances along the diagonal were replaced so that no song could be matched with itself, and the pair of songs with the minimum distance was identified as the most similar.
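A minimal sketch of this search is shown below. The README does not state which value replaces the diagonal; `np.inf` is an assumption, chosen because it can never be the minimum. The function name `closest_pair` and the four toy points are also illustrative.

```python
import numpy as np

def closest_pair(coords):
    """Return (i, j, distance) for the two closest distinct points."""
    diff = coords[:, None, :] - coords[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(D, np.inf)  # exclude self-comparisons (assumed value)
    i, j = np.unravel_index(np.argmin(D), D.shape)
    return int(i), int(j), float(D[i, j])

# Four toy points standing in for songs in the 2D MDS space.
coords = np.array([[0.0, 0.0], [3.0, 4.0], [0.3, 0.4], [10.0, 10.0]])
i, j, d = closest_pair(coords)
```

Here the points at indices 0 and 2 are the closest pair, at a distance of about 0.5. In the notebook, the two indices would be mapped back to song titles and artists.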

Results

The analysis revealed that the two most similar songs according to Euclidean distance in the MDS space are:

  • Song 1: Perfect - Ed Sheeran
  • Song 2: Do I Wanna Know? - Arctic Monkeys

The distance between these two songs in the MDS space is 0.7979. A comparison of their original features shows similarities in:

  • duration_ms: A difference of only 8994 milliseconds (about 9 seconds).
  • popularity: The songs differ by just one popularity point.
  • mode: Both share the same musical mode.

Although the songs differ in features such as energy, loudness, danceability, and tempo, their proximity in the MDS space suggests that, of all pairs in the selected set, they have the most similar overall musical profile. Because the stress value is relatively high, this proximity should be read as an approximation of their similarity in the original feature space. The direction vectors in the MDS plot help explain which attributes contribute most to this similarity.

Final recommendation

To improve the quality of the dimensionality reduction, the stress value could be reduced by increasing the number of MDS components to three or four, which lets the embedding preserve more of the original distances, at the cost of a plot that is harder to visualize.
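The effect of adding components can be checked numerically. The sketch below is an illustration on random placeholder data, not the notebook's code: it uses classical MDS as a stand-in for whichever MDS variant the notebook applies, and it measures stress with Kruskal's stress-1, taking the original distances as the disparities. The notebook may report stress on a different scale.

```python
import numpy as np

def pairwise_distances(X):
    """Pairwise Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(D, n_components):
    """Classical (Torgerson) MDS, used here as a stand-in (assumption)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))

def stress_1(D_original, D_embedded):
    """Kruskal's stress-1, counting each pair of points once."""
    iu = np.triu_indices_from(D_original, k=1)
    num = ((D_embedded[iu] - D_original[iu]) ** 2).sum()
    den = (D_embedded[iu] ** 2).sum()
    return float(np.sqrt(num / den))

rng = np.random.default_rng(0)
X_std = rng.standard_normal((30, 6))  # placeholder: 30 songs, 6 features
D = pairwise_distances(X_std)

# Stress for 2, 3, 4 and 6 components (6 = full dimension of the toy data).
stress_by_k = {
    k: stress_1(D, pairwise_distances(classical_mds(D, k)))
    for k in (2, 3, 4, 6)
}
```

On this toy data the stress falls with every added component and is essentially zero once the number of components equals the number of features.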

What I learned

The most important lesson from this project is that effective multivariate analysis depends heavily on selecting features that meaningfully represent the data in a reduced dimensional space. It is also crucial to match the distance metric to the nature of the data and the objective of the analysis. For musical data, alternatives such as cosine similarity may provide more insightful results than Euclidean distance.

Additionally, evaluating performance metrics such as stress is essential to assess how well the reduced space preserves original distances. This evaluation provides valuable insight into the accuracy and reliability of the MDS representation.

How it can be improved

  • Optimize the stress value and the number of components used.
  • Explore alternative distance metrics (Manhattan, correlation, cosine similarity, etc.).
  • Implement non-metric MDS for more flexible distance preservation.
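As a starting point for the second item, the sketch below computes Manhattan and cosine distance matrices that could be passed to MDS in place of the Euclidean one. The function names and the toy array are illustrative assumptions, and the cosine version assumes no song has an all-zero feature row.

```python
import numpy as np

def manhattan_distance_matrix(X):
    """Pairwise Manhattan (L1) distances between the rows of X."""
    return np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)

def cosine_distance_matrix(X):
    """Pairwise cosine distances (1 - cosine similarity) between rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms                         # assumes no all-zero rows
    similarity = np.clip(unit @ unit.T, -1.0, 1.0)
    return 1.0 - similarity

# Rows 0 and 2 point in the same direction but have different magnitudes.
X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
D_manhattan = manhattan_distance_matrix(X)
D_cosine = cosine_distance_matrix(X)
```

Cosine distance ignores magnitude, so rows 0 and 2 are at distance 0 from each other even though their Manhattan distance is 2. Whether that behavior is desirable for standardized audio features is a modeling decision to evaluate, not a given.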

Running the project

To run the project, open the Jupyter Notebook "Spotify_Songs_MDS_Analysis", make sure the CSV file "songs_normalize.csv" is available to the notebook, and run all cells.
