- Project overview
- Data
- Technologies
- Features
- Limitations
- Process
- Results
- What I learned
- How can it be improved
- Running the project
This project applies multidimensional scaling (MDS) to a Spotify songs dataset with the objective of identifying the two most similar songs among the top 30 most popular tracks. The analysis is based on standardized musical and audio features and relies on distance-based similarity measures to capture relationships between songs in a reduced dimensional space.
The dataset used, "songs_normalize.csv", contains musical and audio information for 2000 songs across the following variables:
| Variable | Definition |
|---|---|
| artist | Name of the artist |
| song | Song title |
| duration_ms | Duration of the song in milliseconds |
| explicit | Indicates whether the song contains explicit content |
| year | Release year of the song |
| popularity | Popularity score assigned by Spotify |
| danceability | Measure of how suitable a track is for dancing |
| energy | Measure of intensity and activity of the song |
| key | Musical key of the track |
| loudness | Overall loudness of the track in decibels |
| mode | Modality of the track (1 = Major, 0 = Minor) |
| speechiness | Presence of spoken words in the track |
| acousticness | Measure of whether the track is acoustic |
| instrumentalness | Probability that the track contains no vocals |
| liveness | Presence of an audience in the recording |
| valence | Musical positiveness conveyed by the track |
| tempo | Estimated tempo of the track in beats per minute |
| genre | Musical genre of the song |
- Python
- Jupyter Notebook
Here is what this project does:
- Top 30 most popular songs: Identification and selection of the 30 most popular tracks based on Spotify’s popularity score.
- Multidimensional analysis: Application of MDS using selected musical attributes to reduce dimensionality.
- Most similar songs: Identification of the two most similar songs based on distances in the MDS space.
The main limitations of this project are:
- The stress value is relatively high, which may indicate distortion between the original and reduced dimensional spaces.
- Only two components are considered, which may not capture all the variability in the data.
- The analysis relies exclusively on the Euclidean distance metric.
First, the dataset "songs_normalize.csv" was loaded, its dimensions (2000 rows and 18 columns) were verified, data types were reviewed, and the presence of missing values and outliers was checked:
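A minimal sketch of these checks is given here. The helper name `summarize` and the 1.5 × IQR outlier rule are assumptions made for illustration, not necessarily what the notebook uses, and a small made-up DataFrame stands in for the CSV file so the snippet runs on its own.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Collect shape, dtypes, missing values and a simple outlier count."""
    numeric = df.select_dtypes(include="number")
    # Assumed outlier rule: values outside 1.5 * IQR of each numeric column.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
        "outliers": outliers.to_dict(),
    }

# In the notebook this would be: df = pd.read_csv("songs_normalize.csv")
# The values below are invented purely for demonstration.
demo = pd.DataFrame({
    "song": ["a", "b", "c", "d", "e"],
    "popularity": [80, 82, 79, 81, 5],
    "tempo": [120.0, 118.0, None, 121.0, 119.0],
})
report = summarize(demo)
print(report["shape"], report["missing"], report["outliers"])
```

With the real file, `report["shape"]` would be expected to match the 2000 rows and 18 columns mentioned above.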
Next, the data was sorted in descending order of the popularity variable and the top 30 most popular songs were kept. A subset of relevant musical attributes (duration_ms, danceability, energy, loudness, and others) was selected to compare the songs. These features were then standardized to a mean of 0 and a standard deviation of 1, which prevents variables with larger scales from dominating the distance calculations.
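The selection and standardization step could look roughly like the following sketch. The function name `top_n_standardized`, the four-feature list and the random demo data are assumptions for illustration; the notebook's actual feature subset is longer. The z-score uses the population standard deviation (`ddof=0`), the same convention as scikit-learn's `StandardScaler`.

```python
import numpy as np
import pandas as pd

def top_n_standardized(df, features, n=30):
    """Keep the n most popular rows and z-score the chosen feature columns."""
    top = df.sort_values("popularity", ascending=False).head(n).reset_index(drop=True)
    X = top[features].to_numpy(dtype=float)
    # Column-wise z-score: mean 0 and (population) standard deviation 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return top, Z

# Invented stand-in data; the real input would be the loaded CSV.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "song": [f"song_{i}" for i in range(50)],
    "popularity": rng.integers(0, 90, size=50),
    "duration_ms": rng.integers(150_000, 300_000, size=50),
    "danceability": rng.random(50),
    "energy": rng.random(50),
    "loudness": rng.uniform(-12, -2, size=50),
})
features = ["duration_ms", "danceability", "energy", "loudness"]  # assumed subset
top30, Z = top_n_standardized(demo, features)
print(Z.shape, Z.mean(axis=0).round(6), Z.std(axis=0).round(6))
```

Without this step, duration_ms (hundreds of thousands) would swamp features such as danceability (between 0 and 1) in any Euclidean distance.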
Subsequently, a dissimilarity matrix was computed using Euclidean distance, measuring how different each pair of songs is in the standardized feature space. Multidimensional scaling (MDS) was then applied to this matrix, reducing the data to two dimensions and creating a 2D space where distances between points represent dissimilarities between the original songs. The resulting MDS coordinates were used to perform linear regressions between each MDS component and the original features. This allowed the calculation of direction vectors, indicating how each attribute contributes to the positioning of songs in the MDS space.
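The pipeline in this paragraph can be sketched with NumPy alone. Two assumptions are worth stating loudly: the sketch uses classical (Torgerson) MDS via eigendecomposition, which may differ from the MDS variant used in the notebook, and it reads "linear regressions" as regressing each standardized feature on the two MDS coordinates, which is one common way to obtain direction vectors. All function names are hypothetical.

```python
import numpy as np

def pairwise_euclidean(Z):
    """Full matrix of Euclidean distances between the rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: an assumed stand-in for the notebook's MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]         # indices of the k largest
    vals_k = np.clip(vals[order], 0.0, None)   # guard against tiny negatives
    return vecs[:, order] * np.sqrt(vals_k)

def direction_vectors(coords, Z):
    """Regress each feature on the MDS coordinates; slopes give its direction."""
    A = np.column_stack([np.ones(len(coords)), coords])  # intercept + coordinates
    beta, *_ = np.linalg.lstsq(A, Z, rcond=None)         # shape (1 + k, n_features)
    return beta[1:].T                                    # shape (n_features, k)

rng = np.random.default_rng(0)
Z = rng.normal(size=(30, 4))       # invented stand-in for 30 standardized songs
D = pairwise_euclidean(Z)          # dissimilarity matrix
coords = classical_mds(D, k=2)     # 2D configuration
vectors = direction_vectors(coords, Z)
print(coords.shape, vectors.shape)
```

Each row of `vectors` can be drawn as an arrow from the origin of the plot: songs lying further along an arrow tend to have higher values of that feature.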
The songs were then visualized in the MDS space, as shown below:
The attribute vectors were included to aid interpretation of the most influential musical features along different directions of the plot:
Finally, Euclidean distances between all pairs of songs were computed in the reduced MDS space. The diagonal entries of the distance matrix, which are each song's zero distance to itself, were replaced so that self-comparisons could not be selected, and the pair of songs with the minimum remaining distance was identified as the most similar.
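As an illustration of this last step, the sketch below finds the closest pair in a small made-up set of 2D coordinates. Replacing the diagonal with infinity is an assumption; the text only says the diagonal was replaced, and any value larger than every real distance would serve the same purpose. The function name `closest_pair` is hypothetical.

```python
import numpy as np

def closest_pair(coords):
    """Return the indices of the two closest points and their distance."""
    diff = coords[:, None, :] - coords[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # Assumed replacement value: infinity, so a song never matches itself.
    np.fill_diagonal(D, np.inf)
    i, j = np.unravel_index(np.argmin(D), D.shape)
    return int(i), int(j), float(D[i, j])

# Invented coordinates standing in for the MDS output.
coords = np.array([[0.0, 0.0], [3.0, 4.0], [0.5, 0.0], [-2.0, 1.0]])
i, j, d = closest_pair(coords)
print(i, j, d)  # 0 2 0.5
```

In the notebook, `i` and `j` would index back into the top 30 table to recover the song titles and artists.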
The analysis revealed that the two most similar songs according to Euclidean distance in the MDS space are:
- Song 1: Perfect - Ed Sheeran
- Song 2: Do I Wanna Know? - Arctic Monkeys
The distance between these two songs in the MDS space is 0.7979. A comparison of their original features shows similarities in:
- duration_ms: A difference of only 8994 milliseconds.
- popularity: Both songs differ by just one popularity point.
- mode: Both share the same musical mode.
Although differences exist in features such as energy, loudness, danceability, and tempo, their proximity in the MDS space indicates that they share the most similar overall musical profile within the selected set. The direction vectors in the MDS plot further help explain which attributes contribute most to this similarity.
Final recommendation
To improve the quality of the dimensionality reduction, the number of MDS components could be increased to three or four. This should reduce the stress value and allow the model to capture more of the variability in the original data.
The most important thing I learned from this project is that effective multivariate analysis depends heavily on selecting features that meaningfully represent the data in a reduced dimensional space. It is also crucial to match the distance metric to the nature of the data and the objective of the analysis. For musical data, alternatives such as cosine similarity may provide more insightful results than Euclidean distance.
Additionally, evaluating performance metrics such as stress is essential to assess how well the reduced space preserves original distances. This evaluation provides valuable insight into the accuracy and reliability of the MDS representation.
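To make the idea of stress concrete, here is a small sketch of a normalized stress measure: the square root of the summed squared differences between original and embedded distances, divided by the summed squared original distances. Definitions of stress differ in their normalization, so this value is not guaranteed to match the number reported by the notebook. The function names and the random data are assumptions, and the low-dimensional configurations are obtained by simply dropping columns rather than by running MDS.

```python
import numpy as np

def pairwise_euclidean(X):
    """Full matrix of Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def normalized_stress(D_original, coords):
    """sqrt( sum (d_orig - d_embedded)^2 / sum d_orig^2 ) over all pairs i < j."""
    D_embedded = pairwise_euclidean(coords)
    iu = np.triu_indices_from(D_original, k=1)   # count each pair once
    num = ((D_original[iu] - D_embedded[iu]) ** 2).sum()
    den = (D_original[iu] ** 2).sum()
    return float(np.sqrt(num / den))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                     # invented 5-feature data
D = pairwise_euclidean(X)
stress_2d = normalized_stress(D, X[:, :2])       # keep only 2 dimensions
stress_3d = normalized_stress(D, X[:, :3])       # keep 3 dimensions
stress_full = normalized_stress(D, X)            # nothing discarded
print(stress_2d, stress_3d, stress_full)
```

A value of 0 means the distances are preserved exactly, and keeping more dimensions in this example can only bring the value closer to 0.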
- Optimize the stress value and the number of components used.
- Explore alternative distance metrics (Manhattan, correlation, cosine similarity, etc.).
- Implement non-metric MDS for more flexible distance preservation.
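As a starting point for the second item, the sketch below computes Manhattan and cosine distance matrices with NumPy. The function names are hypothetical, and the cosine version assumes no row is all zeros. Either matrix could be passed to an MDS routine in place of the Euclidean one.

```python
import numpy as np

def manhattan_distance(Z):
    """Sum of absolute feature differences between every pair of rows."""
    return np.abs(Z[:, None, :] - Z[None, :, :]).sum(axis=-1)

def cosine_distance(Z):
    """1 - cosine similarity; assumes every row has a non-zero norm."""
    U = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return 1.0 - np.clip(U @ U.T, -1.0, 1.0)

# Invented feature vectors: rows 0 and 2 point the same way but differ in size.
Z = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
print(manhattan_distance(Z))
print(cosine_distance(Z))
```

Rows 0 and 2 illustrate the difference: their Manhattan distance is 1, while their cosine distance is 0 because cosine distance ignores magnitude and compares only the direction of the feature vectors.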
To run the project, open the Jupyter Notebook "Spotify_Songs_MDS_Analysis", load the CSV file "songs_normalize.csv", and run all cells.


