Commit 2cd0d29

feat: download of artifact without version (uses latest) and group (all artifacts and their latest version)
1 parent 8eead89 commit 2cd0d29

4 files changed: +168 -17 lines changed

Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 FROM python:3.10-slim

-WORKDIR /app
+WORKDIR /data

 COPY . .

README.md

Lines changed: 98 additions & 8 deletions
@@ -25,13 +25,16 @@ Commands:
2525
deploy
2626
download
2727
```
28+
29+
## Docker Image Usage
30+
31+
A docker image is available at [dbpedia/databus-python-client](https://hub.docker.com/r/dbpedia/databus-python-client). See [download section](#usage-of-docker-image) for details.
32+
2833
### Deploy command
2934
```
3035
databusclient deploy --help
3136
```
3237
```
33-
34-
3538
Usage: databusclient deploy [OPTIONS] DISTRIBUTIONS...
3639
3740
Arguments:
@@ -40,14 +43,14 @@ Arguments:
4043
content variants of a distribution, fileExt and Compression can be set, if not they are inferred from the path [required]
4144
4245
Options:
43-
--versionid TEXT target databus version/dataset identifier of the form <h
46+
--versionid TEXT Target databus version/dataset identifier of the form <h
4447
ttps://databus.dbpedia.org/$ACCOUNT/$GROUP/$ARTIFACT/$VE
4548
RSION> [required]
46-
--title TEXT dataset title [required]
47-
--abstract TEXT dataset abstract max 200 chars [required]
48-
--description TEXT dataset description [required]
49-
--license TEXT license (see dalicc.net) [required]
50-
--apikey TEXT apikey [required]
49+
--title TEXT Dataset title [required]
50+
--abstract TEXT Dataset abstract max 200 chars [required]
51+
--description TEXT Dataset description [required]
52+
--license TEXT License (see dalicc.net) [required]
53+
--apikey TEXT API key [required]
5154
--help Show this message and exit.
5255
```
5356
Examples of using deploy command
@@ -65,6 +68,93 @@ A few more notes for CLI usage:
6568
* For complete inferred: Just use the URL with `https://raw.githubusercontent.com/dbpedia/databus/master/server/app/api/swagger.yml`
6669
* If other parameters are used, you need to leave them empty like `https://raw.githubusercontent.com/dbpedia/databus/master/server/app/api/swagger.yml||yml|7a751b6dd5eb8d73d97793c3c564c71ab7b565fa4ba619e4a8fd05a6f80ff653:367116`
6770

71+
### Download command
72+
```
73+
python3 -m databusclient download --help
74+
```
75+
76+
```
77+
Usage: python3 -m databusclient download [OPTIONS] DATABUSURIS...
78+
79+
Arguments:
80+
DATABUSURIS... databus uris to download from https://databus.dbpedia.org,
81+
or a query statement that returns databus uris from https://databus.dbpedia.org/sparql
82+
to be downloaded [required]
83+
84+
Download datasets from databus, optionally using vault access if vault
85+
options are provided.
86+
87+
Options:
88+
--localdir TEXT Local databus folder (if not given, databus folder
89+
structure is created in current working directory)
90+
--databus TEXT Databus URL (if not given, inferred from databusuri, e.g.
91+
https://databus.dbpedia.org/sparql)
92+
--token TEXT Path to Vault refresh token file
93+
--authurl TEXT Keycloak token endpoint URL [default:
94+
https://auth.dbpedia.org/realms/dbpedia/protocol/openid-
95+
connect/token]
96+
--clientid TEXT Client ID for token exchange [default: vault-token-
97+
exchange]
98+
--help Show this message and exit.
99+
```
100+
101+
Examples of using download command
102+
103+
**File**: download of a single file
104+
```
105+
python3 -m databusclient download https://databus.dbpedia.org/dbpedia/mappings/mappingbased-literals/2022.12.01/mappingbased-literals_lang=az.ttl.bz2
106+
```
107+
108+
**Version**: download of all files of a specific version
109+
```
110+
python3 -m databusclient download https://databus.dbpedia.org/dbpedia/mappings/mappingbased-literals/2022.12.01
111+
```
112+
113+
**Artifact**: download of all files of the latest version of an artifact
114+
```
115+
python3 -m databusclient download https://databus.dbpedia.org/dbpedia/mappings/mappingbased-literals
116+
```
117+
118+
**Group**: download of all files of the latest version of every artifact in a group
119+
```
120+
python3 -m databusclient download https://databus.dbpedia.org/dbpedia/mappings
121+
```
122+
123+
If no `--localdir` is provided, the current working directory is used as base directory. The downloaded files will be stored in the working directory in a folder structure according to the databus structure, i.e. `./$ACCOUNT/$GROUP/$ARTIFACT/$VERSION/`.
124+
125+
**Collection**: download of all files within a collection
126+
```
127+
python3 -m databusclient download https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2022-12
128+
```
129+
130+
**Query**: download of all files returned by a query (the SPARQL endpoint must be provided with `--databus`)
131+
```
132+
python3 -m databusclient download 'PREFIX dcat: <http://www.w3.org/ns/dcat#> SELECT ?x WHERE { ?sub dcat:downloadURL ?x . } LIMIT 10' --databus https://databus.dbpedia.org/sparql
133+
```
134+
135+
#### Authentication with vault
136+
137+
For downloading files from the vault, you need to provide a vault token. See [getting-the-access-refresh-token](https://github.com/dbpedia/databus-vault-access?tab=readme-ov-file#step-1-getting-the-access-refresh-token) for details. You can come back here once you have a `vault-token.dat` file. To use it, just provide the path to the file with `--token /path/to/vault-token.dat`.
138+
139+
Example:
140+
```
141+
python3 -m databusclient download https://databus.dbpedia.org/dbpedia-enterprise/live-fusion-snapshots/fusion/2025-08-23 --token vault-token.dat
142+
```
143+
144+
If vault authentication is required for downloading a file, the client will use the token. If no vault authentication is required, the token will not be used.
145+
146+
#### Usage of docker image
147+
148+
A docker image is available at [dbpedia/databus-python-client](https://hub.docker.com/r/dbpedia/databus-python-client). You can use it like this:
149+
150+
```
151+
docker run --rm -v $(pwd):/data dbpedia/databus-python-client download https://databus.dbpedia.org/dbpedia/mappings/mappingbased-literals/2022.12.01
152+
```
153+
If using vault authentication, make sure the token file is available in the container, e.g. by placing it in the current working directory.
154+
```
155+
docker run --rm -v $(pwd):/data dbpedia/databus-python-client download https://databus.dbpedia.org/dbpedia-enterprise/live-fusion-snapshots/fusion/2025-08-23/fusion_props=all_subjectns=commons-wikimedia-org_vocab=all.ttl.gz --token vault-token.dat
156+
```
157+
68158
## Module Usage
69159

70160
### Step 1: Create lists of distributions for the dataset
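
Review note on the `--localdir` paragraph in the README hunk above: the stated layout, `$ACCOUNT/$GROUP/$ARTIFACT/$VERSION/` below the base directory, can be illustrated with a minimal sketch. `local_target_dir` is a hypothetical helper written for this note, not a function of `databusclient`. It assumes the URI path starts with account, group, artifact, and version segments, as in the README examples.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_target_dir(databus_uri: str, base_dir: str = ".") -> PurePosixPath:
    """Map a version or file URI to <base_dir>/ACCOUNT/GROUP/ARTIFACT/VERSION."""
    # Path segments after the host: account, group, artifact, version[, file]
    parts = [p for p in urlparse(databus_uri).path.split("/") if p]
    if len(parts) < 4:
        raise ValueError("expected at least account/group/artifact/version")
    account, group, artifact, version = parts[:4]
    return PurePosixPath(base_dir) / account / group / artifact / version


# Hypothetical usage with the version URI from the README examples
print(local_target_dir(
    "https://databus.dbpedia.org/dbpedia/mappings/mappingbased-literals/2022.12.01"
))
```

Since the Dockerfile change sets the working directory to `/data`, passing `base_dir="/data"` would presumably correspond to the host directory mounted with `-v $(pwd):/data`.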

databusclient/cli.py

Lines changed: 2 additions & 2 deletions
@@ -63,15 +63,15 @@ def app():

 @app.command()
 @click.option(
-    "--version-id",
+    "--versionid",
     required=True,
     help="Target databus version/dataset identifier of the form "
     "<https://databus.dbpedia.org/$ACCOUNT/$GROUP/$ARTIFACT/$VERSION>",
 )
 @click.option("--title", required=True, help="Dataset title")
 @click.option("--abstract", required=True, help="Dataset abstract max 200 chars")
 @click.option("--description", required=True, help="Dataset description")
-@click.option("--license-uri", required=True, help="License (see dalicc.net)")
+@click.option("--license", required=True, help="License (see dalicc.net)")
 @click.option("--apikey", required=True, help="API key")
 @click.argument(
     "distributions",

databusclient/client.py

Lines changed: 67 additions & 6 deletions
@@ -533,10 +533,12 @@ def __handle_databus_file_query__(endpoint_url, query) -> List[str]:
533533
yield value
534534

535535

536-
def __handle_databus_file_json__(json_str: str) -> List[str]:
536+
def __handle_databus_artifact_version__(json_str: str) -> List[str]:
537537
"""
538538
Parse the JSON-LD of a databus artifact version to extract download URLs.
539539
Don't get downloadURLs directly from the JSON-LD, but follow the "file" links to count access to databus accurately.
540+
541+
Returns a list of download URLs.
540542
"""
541543

542544
databusIdUrl = []
@@ -549,6 +551,48 @@ def __handle_databus_file_json__(json_str: str) -> List[str]:
549551
return databusIdUrl
550552

551553

554+
def __get_databus_latest_version_of_artifact__(json_str: str) -> str:
555+
"""
556+
Parse the JSON-LD of a databus artifact to extract the URL of the latest version.
557+

Returns the URL of the latest version of the artifact.
559+
"""
560+
json_dict = json.loads(json_str)
561+
versions = json_dict.get("databus:hasVersion") or []  # missing key: fall through to the ValueError below
562+
563+
# Single version case {}
564+
if isinstance(versions, dict):
565+
versions = [versions]
566+
# Multiple versions case [{}, {}]
567+
568+
version_urls = [v["@id"] for v in versions if "@id" in v]
569+
if not version_urls:
570+
raise ValueError("No versions found in artifact JSON-LD")
571+
572+
version_urls.sort(reverse=True) # Sort versions in descending order
573+
return version_urls[0] # Return the latest version URL
574+
575+
576+
def __get_databus_artifacts_of_group__(json_str: str) -> List[str]:
577+
"""
578+
Parse the JSON-LD of a databus group to extract URLs of all artifacts.
579+
580+
Returns a list of artifact URLs.
581+
"""
582+
json_dict = json.loads(json_str)
583+
artifacts = json_dict.get("databus:hasArtifact", [])
584+
585+
result = []
586+
for item in artifacts:
587+
uri = item.get("@id")
588+
if not uri:
589+
continue
590+
_, _, _, _, version, _ = __get_databus_id_parts__(uri)
591+
if version is None:
592+
result.append(uri)
593+
return result
594+
595+
552596
def wsha256(raw: str):
553597
return sha256(raw.encode('utf-8')).hexdigest()
554598

@@ -558,7 +602,7 @@ def __handle_databus_collection__(uri: str) -> str:
558602
return requests.get(uri, headers=headers).text
559603

560604

561-
def __handle_databus_artifact_version__(uri: str) -> str:
605+
def __get_json_ld_from_databus__(uri: str) -> str:
562606
headers = {"Accept": "application/ld+json"}
563607
return requests.get(uri, headers=headers).text
564608

@@ -607,6 +651,7 @@ def download(
607651
client_id: Client ID for token exchange
608652
"""
609653

654+
# TODO: make pretty
610655
for databusURI in databusURIs:
611656
host, account, group, artifact, version, file = __get_databus_id_parts__(databusURI)
612657

@@ -627,15 +672,31 @@ def download(
627672
__download_list__([databusURI], localDir, vault_token_file=token, auth_url=auth_url, client_id=client_id)
628673
# databus artifact version
629674
elif version is not None:
630-
json_str = __handle_databus_artifact_version__(databusURI)
631-
res = __handle_databus_file_json__(json_str)
675+
json_str = __get_json_ld_from_databus__(databusURI)
676+
res = __handle_databus_artifact_version__(json_str)
632677
__download_list__(res, localDir, vault_token_file=token, auth_url=auth_url, client_id=client_id)
633678
# databus artifact
634679
elif artifact is not None:
635-
print("artifactId not supported yet") # TODO
680+
json_str = __get_json_ld_from_databus__(databusURI)
681+
latest = __get_databus_latest_version_of_artifact__(json_str)
682+
print(f"No version given, using latest version: {latest}")
683+
json_str = __get_json_ld_from_databus__(latest)
684+
res = __handle_databus_artifact_version__(json_str)
685+
__download_list__(res, localDir, vault_token_file=token, auth_url=auth_url, client_id=client_id)
686+
636687
# databus group
637688
elif group is not None:
638-
print("groupId not supported yet") # TODO
689+
json_str = __get_json_ld_from_databus__(databusURI)
690+
artifacts = __get_databus_artifacts_of_group__(json_str)
691+
for artifact_uri in artifacts:
692+
print(f"Processing artifact {artifact_uri}")
693+
json_str = __get_json_ld_from_databus__(artifact_uri)
694+
latest = __get_databus_latest_version_of_artifact__(json_str)
695+
print(f"No version given, using latest version: {latest}")
696+
json_str = __get_json_ld_from_databus__(latest)
697+
res = __handle_databus_artifact_version__(json_str)
698+
__download_list__(res, localDir, vault_token_file=token, auth_url=auth_url, client_id=client_id)
699+
639700
# databus account
640701
elif account is not None:
641702
print("accountId not supported yet") # TODO
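
Review note on latest-version selection: the new `__get_databus_latest_version_of_artifact__` picks the latest version by sorting the version URLs as strings in descending order. The sketch below reproduces that idea in standalone form to show when it holds. `pick_latest` and the sample JSON-LD are hypothetical, made up for this note, and the real databus JSON-LD may contain more fields. Treating a missing `databus:hasVersion` key as an empty list is also an assumption of the sketch.

```python
import json


def pick_latest(artifact_json_ld: str) -> str:
    """Return the lexicographically greatest version URL in an artifact document."""
    doc = json.loads(artifact_json_ld)
    versions = doc.get("databus:hasVersion") or []
    if isinstance(versions, dict):  # a single version arrives as a bare object
        versions = [versions]
    urls = [v["@id"] for v in versions if "@id" in v]
    if not urls:
        raise ValueError("No versions found in artifact JSON-LD")
    return max(urls)  # same element as sorting in descending order and taking [0]


# Hypothetical artifact document with two date-style versions
base = "https://databus.dbpedia.org/dbpedia/mappings/mappingbased-literals"
sample = json.dumps({"databus:hasVersion": [
    {"@id": f"{base}/2022.03.01"},
    {"@id": f"{base}/2022.12.01"},
]})
print(pick_latest(sample))
```

String ordering matches chronological order for zero-padded, date-style versions such as `2022.12.01`. It would not for unpadded numeric schemes: compared as strings, `v9` sorts after `v10`.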
