Spotify-Songs-MDS-Analysis

Project overview

This project applies multidimensional scaling (MDS) to a Spotify songs dataset with the objective of identifying the two most similar songs among the top 30 most popular tracks. The analysis is based on standardized musical and audio features and relies on distance-based similarity measures to capture relationships between songs in a reduced dimensional space.

Data

The dataset used, "songs_normalize.csv", contains musical and audio information for 2000 songs across the following variables:

Variable	Definition
`artist`	Name of the artist
`song`	Song title
`duration_ms`	Duration of the song in milliseconds
`explicit`	Indicates whether the song contains explicit content
`year`	Release year of the song
`popularity`	Popularity score assigned by Spotify
`danceability`	Measure of how suitable a track is for dancing
`energy`	Measure of intensity and activity of the song
`key`	Musical key of the track
`loudness`	Overall loudness of the track in decibels
`mode`	Modality of the track (1 = Major, 0 = Minor)
`speechiness`	Presence of spoken words in the track
`acousticness`	Measure of whether the track is acoustic
`instrumentalness`	Probability that the track contains no vocals
`liveness`	Presence of an audience in the recording
`valence`	Musical positiveness conveyed by the track
`tempo`	Estimated tempo of the track in beats per minute
`genre`	Musical genre of the song

Technologies

Python
Jupyter Notebook

Features

Here is what this project does:

Top 30 most popular songs: Identification and selection of the 30 most popular tracks based on Spotify’s popularity score.
Multidimensional analysis: Application of MDS using selected musical attributes to reduce dimensionality.
Most similar songs: Identification of the two most similar songs based on distances in the MDS space.

Limitations

The main limitations of this project are:

The stress value is relatively high, which may indicate distortion between the original and reduced dimensional spaces.
Only two components are considered, which may not capture all the variability in the data.
The analysis relies exclusively on the Euclidean distance metric.

Process

First, the dataset "songs_normalize.csv" was loaded and its dimensions (2000 rows and 18 columns) were verified. Data types were then reviewed, and the data was checked for missing values and outliers.
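The inspection step can be sketched as follows. This is a minimal illustration, not the notebook's code: the inline CSV is a hypothetical stand-in for "songs_normalize.csv", and the 1.5 × IQR rule for flagging outliers is an assumption, since the README does not state which outlier check was used.

```python
import io

import pandas as pd

# Tiny inline stand-in for songs_normalize.csv (hypothetical values). With the
# real file, pd.read_csv("songs_normalize.csv") would take its place.
csv_text = """artist,song,duration_ms,popularity,energy
A,one,200000,80,0.7
B,two,210000,75,0.6
C,three,190000,78,
D,four,205000,77,0.65
E,five,900000,79,0.72
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (rows, columns)
print(df.dtypes)   # data type of each column

# Number of missing values per column.
missing = df.isna().sum()

# Assumed outlier rule: flag values more than 1.5 * IQR outside the quartiles.
numeric = df.select_dtypes("number")
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
print(missing)
print(outliers)
```

In this toy data the empty `energy` field is counted as one missing value, and the 900000 ms duration is the only value flagged by the assumed rule.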

Next, the data was sorted in descending order of the popularity variable and filtered to the top 30 most popular songs. A subset of relevant musical attributes (including duration_ms, danceability, energy, and loudness) was selected to compare the songs. These features were then standardized to a mean of 0 and a standard deviation of 1, which prevents variables with larger scales from dominating the distance calculations.
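A minimal sketch of the selection and standardization step is shown below. The DataFrame is randomly generated stand-in data, and `feature_cols` is an assumed subset of the columns from the Data section; the notebook's exact feature list is not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset: random values in a few of the
# columns listed in the Data section.
rng = np.random.default_rng(0)
n = 200
songs = pd.DataFrame({
    "song": [f"song_{i}" for i in range(n)],
    "popularity": rng.integers(0, 90, size=n),
    "duration_ms": rng.integers(120_000, 360_000, size=n),
    "danceability": rng.uniform(0, 1, size=n),
    "energy": rng.uniform(0, 1, size=n),
    "loudness": rng.uniform(-20, 0, size=n),
})

# Assumed feature subset used to compare songs.
feature_cols = ["duration_ms", "danceability", "energy", "loudness"]

# Keep the 30 rows with the highest popularity, in descending order.
top30 = songs.nlargest(30, "popularity").reset_index(drop=True)

# Z-score each feature: subtract the column mean and divide by the column
# standard deviation (population form, ddof=0).
features = top30[feature_cols].to_numpy(dtype=float)
standardized = (features - features.mean(axis=0)) / features.std(axis=0)
```

After this step every column of `standardized` has mean 0 and standard deviation 1, so a feature measured in milliseconds carries no more weight in a distance than one measured on a 0 to 1 scale.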

Subsequently, a dissimilarity matrix was computed using Euclidean distance, measuring how different each pair of songs is in the standardized feature space. Multidimensional scaling (MDS) was then applied to this matrix, reducing the data to two dimensions and creating a 2D space where distances between points represent dissimilarities between the original songs. The resulting MDS coordinates were used to perform linear regressions between each MDS component and the original features. This allowed the calculation of direction vectors, indicating how each attribute contributes to the positioning of songs in the MDS space.
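The following sketch illustrates these three steps with NumPy only. Two assumptions should be kept in mind: it uses classical (Torgerson) MDS, which may differ from the MDS variant used in the notebook, and it reads the regression step as fitting each standardized feature on the two MDS coordinates, which is one common way to obtain direction vectors. The feature matrix `X` is random stand-in data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical standardized feature matrix: 30 songs x 5 features.
X = rng.normal(size=(30, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Pairwise Euclidean distances via broadcasting: D[i, j] = ||X[i] - X[j]||.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(D, n_components=2):
    """Classical (Torgerson) MDS from a matrix of pairwise distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


coords = classical_mds(D, n_components=2)      # shape (30, 2)

# Direction vectors: least-squares fit of every feature on the two MDS
# coordinates plus an intercept. The two slopes of a feature form its arrow.
A = np.column_stack([coords, np.ones(len(coords))])
coef, *_ = np.linalg.lstsq(A, X, rcond=None)   # shape (3, n_features)
directions = coef[:2].T                        # shape (n_features, 2)
```

A long arrow in `directions` means the feature changes strongly along that direction of the 2D map, while a short arrow means the two dimensions explain little of that feature.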

The songs were then visualized in the MDS space, as shown below:

The attribute vectors were included to aid interpretation of the most influential musical features along different directions of the plot:

Finally, Euclidean distances between all pairs of songs were computed in the reduced MDS space. The zero distances on the diagonal were replaced so that no song could be matched with itself, and the pair of songs with the minimum distance was identified as the most similar.
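The closest-pair search can be sketched as below. The coordinates are made-up values for five songs, and replacing the diagonal with infinity is an assumption; the README does not say which replacement value the notebook uses.

```python
import numpy as np

# Hypothetical 2D MDS coordinates for five songs (illustrative values only).
coords = np.array([
    [0.0, 0.0],
    [3.0, 4.0],
    [0.5, 0.0],
    [-2.0, 1.0],
    [3.0, 3.0],
])

# Pairwise Euclidean distances in the reduced space.
diff = coords[:, None, :] - coords[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Assumption: overwrite the zero diagonal with infinity so that a song is
# never selected as its own nearest neighbour.
np.fill_diagonal(dist, np.inf)

# Row and column of the smallest remaining distance.
i, j = np.unravel_index(np.argmin(dist), dist.shape)
closest = (int(i), int(j))
print(closest, float(dist[i, j]))  # (0, 2) 0.5
```

With real data, `closest` would index into the top 30 table to recover the two song titles and artists.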

Results

The analysis revealed that the two most similar songs according to Euclidean distance in the MDS space are:

Song 1: Perfect - Ed Sheeran
Song 2: Do I Wanna Know? - Arctic Monkeys

The distance between these two songs in the MDS space is 0.7979. A comparison of their original features shows similarities in:

duration_ms: A difference of only 8994 milliseconds.
popularity: The two songs differ by just one popularity point.
mode: Both share the same musical mode.

Although differences exist in features such as energy, loudness, danceability, and tempo, their proximity in the MDS space indicates that they share the most similar overall musical profile within the selected set. The direction vectors in the MDS plot further help explain which attributes contribute most to this similarity.

Final recommendation

To improve the quality of the dimensionality reduction, the number of MDS components could be increased to three or four. This would be expected to lower the stress value, because additional dimensions let the embedding preserve more of the original pairwise distances.
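The effect of adding components can be checked with a sketch like the one below. It rests on several assumptions: the data is random stand-in data, the embedding is classical (Torgerson) MDS rather than whatever variant the notebook uses, and stress is computed as a normalized stress-1 (square root of the summed squared distance errors divided by the summed squared original distances), which may not match the definition behind the value reported in the notebook.

```python
import numpy as np


def pairwise_euclidean(X):
    """Matrix of Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(D, n_components):
    """Classical (Torgerson) MDS from a matrix of pairwise distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))


def stress1(D, coords):
    """Normalized stress over pairs i < j: sqrt(sum (D - d)^2 / sum D^2)."""
    upper = np.triu_indices_from(D, k=1)
    d = pairwise_euclidean(coords)
    return float(np.sqrt(((D[upper] - d[upper]) ** 2).sum() / (D[upper] ** 2).sum()))


rng = np.random.default_rng(42)
X = rng.normal(size=(30, 8))      # hypothetical standardized features
D = pairwise_euclidean(X)

# Stress for two, three, and four components.
stress_by_k = {k: stress1(D, classical_mds(D, k)) for k in (2, 3, 4)}
for k, s in stress_by_k.items():
    print(f"{k} components: stress = {s:.3f}")
```

Under these assumptions stress cannot increase as components are added, and it reaches zero once the number of components equals the number of features.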

What I learned

The most important thing I learned from this project is that effective multivariate analysis depends heavily on selecting features that meaningfully represent the data in the reduced dimensional space. It is also crucial to match the distance metric to the nature of the data and the objective of the analysis. For musical data, alternatives such as cosine similarity may provide more insightful results than Euclidean distance.

Additionally, evaluating performance metrics such as stress is essential to assess how well the reduced space preserves original distances. This evaluation provides valuable insight into the accuracy and reliability of the MDS representation.

How can it be improved

Optimize the stress value and the number of components used.
Explore alternative distance metrics (Manhattan, correlation, cosine similarity, etc.).
Implement non-metric MDS for more flexible distance preservation.
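As a starting point for the second item, the alternative metrics can be sketched with NumPy as shown below. The function names and the small matrix `X` are illustrative only and are not part of the notebook; any of the resulting matrices could replace the Euclidean dissimilarity matrix as input to MDS.

```python
import numpy as np


def manhattan_distances(X):
    """Sum of absolute feature differences between every pair of rows."""
    return np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)


def cosine_distances(X):
    """1 - cosine similarity between rows (assumes no all-zero rows)."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    similarity = np.clip(unit @ unit.T, -1.0, 1.0)
    return 1.0 - similarity


def correlation_distances(X):
    """1 - Pearson correlation between rows: cosine distance after
    centering each row on its own mean (assumes no constant rows)."""
    return cosine_distances(X - X.mean(axis=1, keepdims=True))


# Illustrative feature vectors for three songs.
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 1.0, -2.0],
])

manhattan = manhattan_distances(X)
cosine = cosine_distances(X)
correlation = correlation_distances(X)
```

Here the first two rows point in the same direction, so their cosine and correlation distances are zero even though their Manhattan distance is 6. This is the kind of difference in behaviour that makes the choice of metric matter.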

Running the project

To run the project, open the Jupyter Notebook "Spotify_Songs_MDS_Analysis.ipynb", load the CSV file "songs_normalize.csv", and run all cells.
