Commit 425f498

committed
Supporting spotify python app
1 parent 9dc1075 commit 425f498

File tree

9 files changed, +567 −0 lines changed
Lines changed: 3 additions & 0 deletions

```ini
[flake8]
max-line-length = 180
```
Lines changed: 4 additions & 0 deletions

```ini
[FORMAT]
# Maximum number of characters on a single line.
max-line-length=180
```
Lines changed: 147 additions & 0 deletions
# Spotify to Elasticsearch

What does it do?

- It works with the Spotify API to retrieve metadata.
- It imports your Spotify privacy export.
- It sends all your songs to an Elasticsearch cluster for analysis.
## Requirements

It uses the [Spotipy](https://spotipy.readthedocs.io/en/2.25.0/) library to connect to and interact with the Spotify API, so you need to create your own Spotify developer account.

To minimize the strain on the Spotify API, the script maintains a local `metadata_cache.json` file that stores each unique song ID together with the metadata retrieved for it. If you listen to a song twice, the Spotify API is only asked once for the metadata.

This was written and tested with Python 3.13.
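The caching idea can be sketched as follows. `MetadataCache` here is a hypothetical stand-in for the real implementation in the `services` module, assuming metadata is keyed by track ID:

```python
import json
from pathlib import Path


class MetadataCache:
    """A minimal JSON-backed cache: call the (rate-limited) fetcher only once per track ID."""

    def __init__(self, path: str = "metadata_cache.json"):
        self.path = Path(path)
        # Reload previously cached metadata if the file already exists.
        self.cache = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get(self, track_id: str, fetch):
        # Only hit the Spotify API on a cache miss.
        if track_id not in self.cache:
            self.cache[track_id] = fetch(track_id)
        return self.cache[track_id]

    def save(self):
        # Persist the cache so the next run starts warm.
        self.path.write_text(json.dumps(self.cache))
```

A second listen of the same track is then served entirely from the local file, which is why repeated plays do not add API load.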
### Spotify data export

This can take up to 30 days. You will get an email as soon as the download is ready.

1. Go to [Spotify Privacy](https://www.spotify.com/account/privacy/).
2. Scroll down and select `Extended Streaming History` (the top-right option).
3. Click `Request data` at the bottom.
4. You will get an email asking you to confirm that you want this data.
5. Wait until you receive an email that your data is ready for download.
### Spotify developer account

We need a Spotify developer account, because without one we are not allowed to query the API.

1. Go to [Spotify Developer](http://developer.spotify.com/).
2. Click `Log In` in the top right corner.
3. Log in with your normal Spotify account.
4. In the top right corner, where the `Log In` button was, click on your name and select `Dashboard`.
5. Click `Create App`.
6. Give it an app name like `Elasticsearch Wrapped`.
7. Give it a description like `Reading metadata about songs for Elasticsearch`.
8. Under `Redirect URIs`, enter `http://localhost:9100`.
9. Under `Which API/SDKs are you planning to use?`, select `Web API`.
10. Accept the terms and conditions.
11. In the top right corner, select `Settings`.
12. Copy the `client ID` and `client secret`. (We pass these as parameters when we run the script.)
### Elastic API Key & Elasticsearch URL

1. Log into your Elastic cluster at [Elastic Cloud](https://cloud.elastic.co) and create either a serverless project or a hosted deployment. (It works with on-premise or any other form of deployment as well.)
2. Serverless: create an `Observability Project`.
   1. Go to manage, click `Connection details` in the top right corner, and note down the `Elasticsearch URL`. It should look something like `https://<project-name>-number.es.<region>`.
   2. API key (note that this gives the API key the same permissions you have, which is the easiest and quickest option):
      1. UI: Project Settings => Management => API keys => Create API Key => `spotify` as the name. Copy the `encoded` value. It will only be shown once.
      2. Developer Tools:

```json
POST _security/api_key
{
  "name": "spotify"
}
```

3. Hosted deployment:
   1. Go to your deployment, or create a new one.
   2. Press the `Copy endpoint` button for Elasticsearch.
   3. API key:

```json
POST _security/api_key
{
  "name": "spotify"
}
```

> Note: If you want more fine-grained control, this is the minimum the application needs:

<details>
<summary> API Request </summary>

```json
POST _security/api_key
{
  "name": "spotify",
  "role_descriptors": {
    "spotify_history": {
      "cluster": [
        "monitor",
        "manage_ingest_pipelines"
      ],
      "indices": [
        {
          "names": [
            "spotify-history"
          ],
          "privileges": [
            "all"
          ],
          "field_security": {
            "grant": [
              "*"
            ],
            "except": []
          },
          "allow_restricted_indices": false
        }
      ],
      "applications": [],
      "run_as": [],
      "metadata": {},
      "transient_metadata": {
        "enabled": true
      }
    }
  }
}
```

</details>
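For reference, the `encoded` value returned by `_security/api_key` is the Base64 encoding of `id:api_key`, and it is what goes into the `ApiKey` authorization header when talking to Elasticsearch. A small sketch with made-up credential values:

```python
import base64

# Hypothetical response fields from POST _security/api_key (values invented here).
api_key_response = {
    "id": "abc123",
    "api_key": "secret456",
    # "encoded" is base64("<id>:<api_key>"), which is the value the UI shows once.
    "encoded": base64.b64encode(b"abc123:secret456").decode(),
}

# Elasticsearch clients send this header on every request.
auth_header = {"Authorization": f"ApiKey {api_key_response['encoded']}"}
```

This is why copying the `encoded` value alone is enough: the script never needs the raw `id` and `api_key` separately.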
## Executing

1. Place the extracted files from the zip into the `to_read` folder. It needs to be the JSON files directly, not the zip.
2. Run `pip install -r requirements.txt` to install all the dependencies.
3. Run the following in your favorite shell and it will find and process all the files in the `to_read` folder:

```shell
python3 python/main.py \
    --es-url "https://spotify.es....:443" \
    --es-api-key "WFdNcE1KTU...==" \
    --spotify-client-id "f972762..." \
    --spotify-client-secret "74bcf5196b..." \
    --user-name "philipp"
```

The `--user-name` parameter is optional but helpful if you index the data of your friends and family as well. The field in Elastic is then called `user`.
## Caveats

- It only works with songs. There is no support for videos, podcasts, or anything else yet.
- If you restart it at any point, it will simply index everything again and overwrite what is already there. It moves each finished file to the `processed` folder. Once a file is fully done, it won't be touched again unless you move it back into the `to_read` folder.
- The way I set the `_id` means that you can only listen to one artist per second.
- It logs any track for which it cannot find metadata. That can happen when Spotify changes a track ID, for example because the album the track was part of was removed.
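The one-artist-per-second caveat follows directly from how the document `_id` is built in `main.py`: epoch seconds of the play concatenated with the artist name. A sketch with hypothetical values:

```python
from datetime import datetime

# Hypothetical listen: timestamp and artist are invented example values.
played_at = datetime(2024, 5, 1, 12, 0, 0)
artist = "Daft Punk"

# Same construction as in main.py: seconds since the epoch + "_" + artist name.
doc_id = str(int((played_at - datetime(1970, 1, 1)).total_seconds())) + "_" + artist

# Two plays of the same artist within the same second collide on this _id,
# so the later document overwrites the earlier one in Elasticsearch.
```

Since Elasticsearch upserts on `_id`, this collision is also what makes re-running the import idempotent rather than duplicating documents.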
## Kibana Dashboard

There is a prebuilt dashboard available; you can import it through Saved Objects in Kibana. It was built on 8.17.

![Kibana Dashboard Preview](kibana/dashboard.jpeg)

supporting-blog-content/spotify-to-elasticsearch/kibana/dashboard.ndjson

Lines changed: 3 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 187 additions & 0 deletions
```python
import json
import logging
from datetime import datetime
from pathlib import Path

import typer
from rich.console import Console
from rich.logging import RichHandler
from rich.progress import (
    Progress,
    SpinnerColumn,
    BarColumn,
    TaskProgressColumn,
    TimeElapsedColumn,
)

from services import SpotifyService, ElasticsearchService
from models import SpotifyTrack

# Set up rich logging once at module level so every function shares the logger.
logging.basicConfig(
    level=logging.INFO,
    format="%(message)s",
    handlers=[RichHandler(rich_tracebacks=True)],
)
logger = logging.getLogger(__name__)


def try_parsing_date(text):
    """Attempt to parse a timestamp in one of the known Spotify formats."""
    for fmt in ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%dT%H:%M:%S.%fZ"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    # Only log an error once every known format has failed.
    logger.error(f"Error parsing date: {text}")
    return None


def process_history_file(
    file_path: Path,
    spotify_svc: SpotifyService,
    es_svc: ElasticsearchService,
    user_name: str,
):
    """Main processing function"""
    console = Console()

    with open(file_path) as f:
        history = json.load(f)

    console.print(f"[green]Processing {file_path}")

    documents = []
    with Progress(
        SpinnerColumn(),
        "[progress.description]{task.description}",
        BarColumn(),
        TaskProgressColumn(),
        TimeElapsedColumn(),
    ) as progress:
        task = progress.add_task("[cyan]Processing tracks...", total=len(history))

        total_entries = len(history)
        batch_size = 50
        for i in range(0, total_entries, batch_size):
            entries_batch = history[i : i + batch_size]
            metadata_batch = spotify_svc.get_tracks_metadata(entries_batch)
            for entry in entries_batch:
                try:
                    # Make sure to only look at songs: we do not support
                    # videos, podcasts or anything else yet.
                    if entry["spotify_track_uri"] is not None and entry[
                        "spotify_track_uri"
                    ].startswith("spotify:track:"):
                        track_id = entry["spotify_track_uri"].replace(
                            "spotify:track:", ""
                        )
                        metadata = metadata_batch.get(track_id, None)
                        played_at = try_parsing_date(entry["ts"])
                        if metadata is not None:
                            documents.append(
                                SpotifyTrack(
                                    # _id: epoch seconds of the play + artist name.
                                    id=str(
                                        int(
                                            (
                                                played_at - datetime(1970, 1, 1)
                                            ).total_seconds()
                                        )
                                    )
                                    + "_"
                                    + entry["master_metadata_album_artist_name"],
                                    artist=[
                                        artist["name"] for artist in metadata["artists"]
                                    ],
                                    album=metadata["album"]["name"],
                                    country=entry["conn_country"],
                                    duration=metadata["duration_ms"],
                                    explicit=metadata["explicit"],
                                    listened_to_pct=(
                                        entry["ms_played"] / metadata["duration_ms"]
                                        if metadata["duration_ms"] > 0
                                        else None
                                    ),
                                    listened_to_ms=entry["ms_played"],
                                    ip=entry["ip_addr"],
                                    reason_start=entry["reason_start"],
                                    reason_end=entry["reason_end"],
                                    shuffle=entry["shuffle"],
                                    skipped=entry["skipped"],
                                    offline=entry["offline"],
                                    title=metadata["name"],
                                    platform=entry["platform"],
                                    played_at=played_at,
                                    spotify_metadata=metadata,
                                    hourOfDay=played_at.hour,
                                    dayOfWeek=played_at.strftime("%A"),
                                    url=metadata["external_urls"]["spotify"],
                                    user=user_name,
                                )
                            )
                        else:
                            console.print(f"[red]Metadata not found for track: {entry}")
                        # Index in batches of 500 to keep bulk requests small.
                        if len(documents) >= 500:
                            console.print(
                                f"[green]Indexing batch of tracks... {len(documents)}"
                            )
                            es_svc.bulk_index(documents)
                            documents = []
                    progress.advance(task)

                except Exception as e:
                    logger.error(f"Error processing track: {e}")
                    # Persist the metadata cache before aborting.
                    spotify_svc.metadata_cache.save_cache()
                    raise

    if documents:
        console.print(f"[green]Indexing final batch of tracks... {len(documents)}")
        es_svc.bulk_index(documents)
    console.print(f"[green]Done! {file_path} processed!")

    spotify_svc.metadata_cache.save_cache()


app = typer.Typer()


@app.command()
def process_history(
    es_url: str = typer.Option(..., help="Elasticsearch URL"),
    es_api_key: str = typer.Option(..., help="Elasticsearch API Key"),
    spotify_client_id: str = typer.Option(None, help="Spotify Client ID"),
    spotify_client_secret: str = typer.Option(None, help="Spotify Client Secret"),
    user_name: str = typer.Option(None, help="User name"),
):
    """Set up the services and process all history files."""
    if not (spotify_client_id and spotify_client_secret):
        # Without Spotify credentials we cannot retrieve any metadata.
        raise typer.BadParameter(
            "Both --spotify-client-id and --spotify-client-secret are required"
        )
    spotify_svc = SpotifyService(
        client_id=spotify_client_id,
        client_secret=spotify_client_secret,
        redirect_uri="http://localhost:9100",
    )
    es_svc = ElasticsearchService(es_url=es_url, api_key=es_api_key)
    # Ensure index and ingest pipeline exist
    es_svc.check_index()
    es_svc.check_pipeline()

    files = list(Path("to_read").glob("*Audio*.json"))
    if not files:
        raise ValueError(
            "No JSON files found in 'to_read' directory, expected them to be "
            "named *Audio*.json, like Streaming_History_Audio_2023_8.json"
        )
    for file_path in files:
        process_history_file(file_path, spotify_svc, es_svc, user_name)
        move_file(file_path)


def move_file(file_path: Path):
    """Move the file to the 'processed' directory"""
    processed_dir = Path("processed")
    processed_dir.mkdir(exist_ok=True)
    file_path.rename(processed_dir / file_path.name)


if __name__ == "__main__":
    app()
```
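The metadata lookups in `process_history_file` above are requested in slices of 50 entries, matching the batch limits of the Spotify track endpoints. The slicing pattern, extracted into a small standalone sketch:

```python
def batched(history, batch_size=50):
    """Slice a play-history list into fixed-size batches, as in main.py's loop."""
    return [history[i : i + batch_size] for i in range(0, len(history), batch_size)]


# Stand-in for parsed history entries; real entries are dicts from the export JSON.
entries = list(range(120))
batches = batched(entries)  # 120 entries -> batches of 50, 50, and 20
```

The last batch is simply shorter, since Python slicing past the end of a list truncates rather than raising.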
