Merged

Commits (43)
3f73738
added deploy script with uploading to given rclone remote
gg46ixav Jul 3, 2025
9edc0dc
added webdav-url argument
gg46ixav Jul 4, 2025
a56f01d
added deploying to the databus without upload to nextcloud
gg46ixav Jul 25, 2025
5fdf78b
Merge branch 'download-capabilities' into nextcloudclient
gg46ixav Oct 21, 2025
800256c
updated pyproject.toml and content-hash
gg46ixav Oct 21, 2025
66f1c8e
Merge branch 'main' into nextcloudclient
gg46ixav Oct 28, 2025
4259229
Merge remote-tracking branch 'origin/main' into nextcloudclient
gg46ixav Oct 28, 2025
b179f90
updated README.md
gg46ixav Oct 28, 2025
a504b9d
Merge remote-tracking branch 'origin/nextcloudclient' into nextcloudc…
gg46ixav Oct 28, 2025
0ce0c24
added checksum validation
gg46ixav Oct 28, 2025
6596cbc
updated upload_to_nextcloud function to accept list of source_paths
gg46ixav Oct 28, 2025
b9f9854
only add result if upload successful
gg46ixav Oct 28, 2025
2f8493d
use os.path.basename instead of .split("/")[-1]
gg46ixav Oct 28, 2025
07359cc
added __init__.py and updated README.md
gg46ixav Oct 28, 2025
8047968
changed append to extend (no nested list)
gg46ixav Oct 28, 2025
0172450
fixed windows separators and added rclone error message
gg46ixav Oct 28, 2025
f957512
moved deploy.py to cli upload_and_deploy
gg46ixav Nov 3, 2025
607f527
changed metadata to dict list
gg46ixav Nov 3, 2025
6cb7e11
removed python-dotenv
gg46ixav Nov 3, 2025
7651c31
small updates
gg46ixav Nov 3, 2025
df17a7c
refactored upload_and_deploy function
gg46ixav Nov 3, 2025
7492531
updated README.md
gg46ixav Nov 3, 2025
c985603
updated metadata_string for new metadata format
gg46ixav Nov 3, 2025
62a3611
updated README.md
gg46ixav Nov 3, 2025
22ac02f
updated README.md
gg46ixav Nov 3, 2025
3faaf4d
Changed context url back
gg46ixav Nov 3, 2025
5dfebe5
added check for known compressions
gg46ixav Nov 3, 2025
f9367c0
updated checksum to sha256
gg46ixav Nov 3, 2025
5d474db
updated README.md
gg46ixav Nov 3, 2025
bef78ef
size check
gg46ixav Nov 3, 2025
529f2ae
updated checksum validation
gg46ixav Nov 3, 2025
77dca5a
added doc
gg46ixav Nov 3, 2025
02b1873
- refactored deploy, upload_and_deploy and deploy_with_metadata to on…
gg46ixav Nov 4, 2025
04c0b6e
updated README.md
gg46ixav Nov 4, 2025
fb93bc9
fixed docstring
gg46ixav Nov 4, 2025
8e6167b
removed metadata.json
gg46ixav Nov 4, 2025
943e30b
moved COMPRESSION_EXTS out of loop
gg46ixav Nov 4, 2025
1274cbc
removed unnecessary f-strings
gg46ixav Nov 4, 2025
02481b3
set file_format and compression to None
gg46ixav Nov 4, 2025
a5ec24d
get file_format and compression from metadata file
gg46ixav Nov 4, 2025
f95155f
updated README.md
gg46ixav Nov 4, 2025
274f252
chores
Integer-Ctrl Nov 5, 2025
f22c71d
updated metadata format (removed filename - used url instead)
gg46ixav Nov 5, 2025
94 changes: 93 additions & 1 deletion README.md
@@ -221,8 +221,100 @@ If using vault authentication, make sure the token file is available in the cont
docker run --rm -v $(pwd):/data dbpedia/databus-python-client download https://databus.dbpedia.org/dbpedia-enterprise/live-fusion-snapshots/fusion/2025-08-23/fusion_props=all_subjectns=commons-wikimedia-org_vocab=all.ttl.gz --token vault-token.dat
```

## Upload and Deployment Commands

### Upload-and-deploy command
```bash
databusclient upload-and-deploy --help
```
```text
Usage: databusclient upload-and-deploy [OPTIONS] [FILES]...

Upload files to Nextcloud and deploy to DBpedia Databus.

Arguments:
FILES... files in the form of List[path], where every path must exist locally, which will be uploaded and deployed

Options:
--webdav-url TEXT WebDAV URL (e.g.,
https://cloud.example.com/remote.php/webdav)
--remote TEXT rclone remote name (e.g., 'nextcloud')
--path TEXT Remote path on Nextcloud (e.g., 'datasets/mydataset')
--no-upload Skip file upload and use existing metadata
--metadata PATH Path to metadata JSON file (required if --no-upload is
used)
--version-id TEXT Target databus version/dataset identifier of the form <h
ttps://databus.dbpedia.org/$ACCOUNT/$GROUP/$ARTIFACT/$VE
RSION> [required]
--title TEXT Dataset title [required]
--abstract TEXT Dataset abstract max 200 chars [required]
--description TEXT Dataset description [required]
--license TEXT License (see dalicc.net) [required]
--apikey TEXT API key [required]
--help Show this message and exit.
```
The command uploads all given files, and all files inside any given folders, to the specified rclone remote, then registers them on the Databus.


#### Example of using the upload-and-deploy command

```bash
databusclient upload-and-deploy \
--webdav-url https://cloud.scadsai.uni-leipzig.de/remote.php/webdav \
--remote scads-nextcloud \
--path test \
--version-id https://databus.org/user/dataset/version/1.0 \
--title "Test Dataset" \
--abstract "This is a short abstract of the test dataset." \
--description "This dataset was uploaded for testing the Nextcloud → Databus deployment pipeline." \
--license https://dalicc.net/licenselibrary/Apache-2.0 \
--apikey "API-KEY" \
/home/test \
/home/test_folder/test
```
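
The same pipeline can also be driven from Python. A minimal sketch with placeholder values, assuming `upload.upload_to_nextcloud` and `client.deploy_from_metadata` behave as wired up in `cli.py` in this PR:

```python
from databusclient import client
from nextcloudclient import upload

# Upload local files/folders to the rclone remote; per this PR's metadata format,
# this returns a list of dicts with "filename", "checksum", "size", and "url".
metadata = upload.upload_to_nextcloud(
    ["/home/test", "/home/test_folder/test"],                  # local paths (placeholders)
    "scads-nextcloud",                                         # rclone remote name
    "test",                                                    # remote path
    "https://cloud.scadsai.uni-leipzig.de/remote.php/webdav",  # WebDAV URL
)

# Register the uploaded files on the Databus.
client.deploy_from_metadata(
    metadata,
    version_id="https://databus.org/user/dataset/version/1.0",
    title="Test Dataset",
    abstract="This is a short abstract of the test dataset.",
    description="Uploaded for testing the Nextcloud → Databus pipeline.",
    license_url="https://dalicc.net/licenselibrary/Apache-2.0",
    apikey="API-KEY",
)
```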


### Deploy-with-metadata command
```bash
databusclient deploy-with-metadata --help
```
```text
Usage: databusclient deploy-with-metadata [OPTIONS]

Deploy to DBpedia Databus using a metadata JSON file.

Options:
--metadata PATH Path to metadata JSON file [required]
--version-id TEXT Target databus version/dataset identifier of the form <h
ttps://databus.dbpedia.org/$ACCOUNT/$GROUP/$ARTIFACT/$VE
RSION> [required]
--title TEXT Dataset title [required]
--abstract TEXT Dataset abstract max 200 chars [required]
--description TEXT Dataset description [required]
--license TEXT License (see dalicc.net) [required]
--apikey TEXT API key [required]
--help Show this message and exit.
```

Use a metadata JSON file (see the bundled [databusclient/metadata.json](databusclient/metadata.json)) to list all files that should be added to the Databus. The command then registers every listed file on the Databus.
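
The expected format, mirroring the bundled example file, is a JSON list of entries with `filename`, `checksum` (SHA-256 hex digest), `size` (bytes), and `url`:

```json
[
  {
    "filename": "example.ttl",
    "checksum": "0929436d44bba110fc7578c138ed770ae9f548e195d19c2f00d813cca24b9f39",
    "size": 12345,
    "url": "https://cloud.example.com/remote.php/webdav/datasets/mydataset/example.ttl"
  }
]
```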


#### Example of using the deploy-with-metadata command

```bash
databusclient deploy-with-metadata \
--metadata /home/metadata.json \
--version-id https://databus.org/user/dataset/version/1.0 \
--title "Test Dataset" \
--abstract "This is a short abstract of the test dataset." \
--description "This dataset was uploaded for testing the Nextcloud → Databus deployment pipeline." \
--license https://dalicc.net/licenselibrary/Apache-2.0 \
--apikey "API-KEY"
```
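
Equivalently from Python, a minimal sketch of what the command does (mirroring `deploy_with_metadata` in `cli.py`; paths and values are placeholders):

```python
import json

from databusclient import client

# Load the metadata entries produced by a previous upload.
with open("/home/metadata.json") as f:
    metadata = json.load(f)

# Register all listed files on the Databus.
client.deploy_from_metadata(
    metadata,
    version_id="https://databus.org/user/dataset/version/1.0",
    title="Test Dataset",
    abstract="This is a short abstract of the test dataset.",
    description="This dataset was uploaded for testing the Nextcloud → Databus deployment pipeline.",
    license_url="https://dalicc.net/licenselibrary/Apache-2.0",
    apikey="API-KEY",
)
```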


## Module Usage
### Step 1: Create lists of distributions for the dataset

75 changes: 75 additions & 0 deletions databusclient/cli.py
@@ -1,8 +1,11 @@
#!/usr/bin/env python3
import json

import click
from typing import List
from databusclient import client

from nextcloudclient import upload

@click.group()
def app():
@@ -36,6 +39,78 @@ def deploy(version_id, title, abstract, description, license_url, apikey, distri
    client.deploy(dataid=dataid, api_key=apikey)


@app.command()
@click.option(
    "--metadata", "metadata_file",
    required=True,
    type=click.Path(exists=True),
    help="Path to metadata JSON file",
)
@click.option(
    "--version-id", "version_id",
    required=True,
    help="Target databus version/dataset identifier of the form "
         "<https://databus.dbpedia.org/$ACCOUNT/$GROUP/$ARTIFACT/$VERSION>",
)
@click.option("--title", required=True, help="Dataset title")
@click.option("--abstract", required=True, help="Dataset abstract max 200 chars")
@click.option("--description", required=True, help="Dataset description")
@click.option("--license", "license_url", required=True, help="License (see dalicc.net)")
@click.option("--apikey", required=True, help="API key")
def deploy_with_metadata(metadata_file, version_id, title, abstract, description, license_url, apikey):
"""
Deploy to DBpedia Databus using metadata json file.
"""

with open(metadata_file, 'r') as f:
metadata = json.load(f)
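    # Expected shape (see databusclient/metadata.json): a JSON list of entries,
    # each with "filename", "checksum" (SHA-256 hex), "size" (bytes), and "url".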

    client.deploy_from_metadata(metadata, version_id, title, abstract, description, license_url, apikey)


@app.command()
@click.option(
    "--webdav-url", "webdav_url",
    required=True,
    help="WebDAV URL (e.g., https://cloud.example.com/remote.php/webdav)",
)
@click.option(
    "--remote",
    required=True,
    help="rclone remote name (e.g., 'nextcloud')",
)
@click.option(
    "--path",
    required=True,
    help="Remote path on Nextcloud (e.g., 'datasets/mydataset')",
)
@click.option(
    "--version-id", "version_id",
    required=True,
    help="Target databus version/dataset identifier of the form "
         "<https://databus.dbpedia.org/$ACCOUNT/$GROUP/$ARTIFACT/$VERSION>",
)
@click.option("--title", required=True, help="Dataset title")
@click.option("--abstract", required=True, help="Dataset abstract max 200 chars")
@click.option("--description", required=True, help="Dataset description")
@click.option("--license", "license_url", required=True, help="License (see dalicc.net)")
@click.option("--apikey", required=True, help="API key")
@click.argument(
    "files",
    nargs=-1,
    type=click.Path(exists=True),
)
def upload_and_deploy(webdav_url, remote, path, version_id, title, abstract, description, license_url, apikey,
                      files: List[str]):
"""
Upload files to Nextcloud and deploy to DBpedia Databus.
"""

click.echo(f"Uploading data to nextcloud: {remote}")
metadata = upload.upload_to_nextcloud(files, remote, path, webdav_url)
client.deploy_from_metadata(metadata, version_id, title, abstract, description, license_url, apikey)


@app.command()
@click.argument("databusuris", nargs=-1, required=True)
@click.option("--localdir", help="Local databus folder (if not given, databus folder structure is created in current working directory)")
124 changes: 123 additions & 1 deletion databusclient/client.py
@@ -205,6 +205,79 @@ def create_distribution(

    return f"{url}|{meta_string}"

def create_distributions_from_metadata(metadata: List[Dict[str, Union[str, int]]]) -> List[str]:
    """
    Create distributions from metadata entries.

    Parameters
    ----------
    metadata : List[Dict[str, Union[str, int]]]
        List of metadata entries, each containing:
        - filename: str - Name of the file
        - checksum: str - SHA-256 hex digest (64 characters)
        - size: int - File size in bytes (positive integer)
        - url: str - Download URL for the file

    Returns
    -------
    List[str]
        List of distribution identifier strings for use with create_dataset
    """
    # Known compression extensions
    COMPRESSION_EXTS = {"gz", "bz2", "xz", "zip", "7z", "tar", "lz", "zst"}

    distributions = []
    for counter, entry in enumerate(metadata):
        # Validate required keys
        required_keys = ["filename", "checksum", "size", "url"]
        missing_keys = [key for key in required_keys if key not in entry]
        if missing_keys:
            raise ValueError(f"Metadata entry missing required keys: {missing_keys}")

        filename = entry["filename"]
        checksum = entry["checksum"]
        size = entry["size"]
        if not isinstance(size, int) or size <= 0:
            raise ValueError(f"Invalid size for {filename}: expected positive integer, got {size}")
        url = entry["url"]

        # Validate SHA-256 hex digest (64 hex chars)
        if not isinstance(checksum, str) or len(checksum) != 64 or not all(
                c in '0123456789abcdefABCDEF' for c in checksum):
            raise ValueError(f"Invalid checksum for {filename}")

        parts = filename.split(".")
        if len(parts) == 1:
            file_format = "none"
            compression = "none"
        elif len(parts) == 2:
            file_format = parts[-1]
            compression = "none"
        else:
            # Check if the last part is a known compression extension
            if parts[-1] in COMPRESSION_EXTS:
                compression = parts[-1]
                # Handle compound extensions like .tar.gz
                if len(parts) > 2 and parts[-2] in COMPRESSION_EXTS:
                    file_format = parts[-3] if len(parts) > 3 else "file"
                else:
                    file_format = parts[-2]
            else:
                file_format = parts[-1]
                compression = "none"
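        # Worked examples of the mapping above (illustrative):
        #   "example.ttl"    -> file_format="ttl",  compression="none"
        #   "example.csv.gz" -> file_format="csv",  compression="gz"
        #   "data.tar.gz"    -> file_format="file", compression="gz"  (no base format before .tar.gz)
        #   "dump.nt.tar.gz" -> file_format="nt",   compression="gz"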

        distributions.append(
            create_distribution(
                url=url,
                cvs={"count": str(counter)},
                file_format=file_format,
                compression=compression,
                sha256_length_tuple=(checksum, size)
            )
        )

    return distributions


def create_dataset(
    version_id: str,
@@ -393,6 +466,55 @@ def deploy(
    print(resp.text)


def deploy_from_metadata(
    metadata: List[Dict[str, Union[str, int]]],
    version_id: str,
    title: str,
    abstract: str,
    description: str,
    license_url: str,
    apikey: str
) -> None:
"""
Deploy a dataset from metadata entries.

Parameters
----------
metadata : List[Dict[str, Union[str, int]]]
List of file metadata entries (see create_distributions_from_metadata)
version_id : str
Dataset version ID in the form $DATABUS_BASE/$ACCOUNT/$GROUP/$ARTIFACT/$VERSION
title : str
Dataset title
abstract : str
Short description of the dataset
description : str
Long description (Markdown supported)
license_url : str
License URI
apikey : str
API key for authentication
"""
distributions = create_distributions_from_metadata(metadata)

dataset = create_dataset(
version_id=version_id,
title=title,
abstract=abstract,
description=description,
license_url=license_url,
distributions=distributions
)

print(f"Deploying dataset version: {version_id}")
deploy(dataset, apikey)

print(f"Successfully deployed to {version_id}")
print(f"Deployed {len(metadata)} file(s):")
for entry in metadata:
print(f" - {entry['filename']}")


def __download_file__(url, filename, vault_token_file=None, auth_url=None, client_id=None) -> None:
"""
Download a file from the internet with a progress bar using tqdm.
@@ -635,7 +757,7 @@ def __download_list__(urls: List[str],
def __get_databus_id_parts__(uri: str) -> Tuple[Optional[str], Optional[str], Optional[str], Optional[str], Optional[str], Optional[str]]:
    uri = uri.removeprefix("https://").removeprefix("http://")
    parts = uri.strip("/").split("/")
    parts += [None] * (6 - len(parts))  # pad with None if less than 6 parts
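    # Example (hypothetical URI):
    #   "https://databus.dbpedia.org/alice/mygroup/myartifact/2025-01-01/file.ttl"
    #   -> ("databus.dbpedia.org", "alice", "mygroup", "myartifact", "2025-01-01", "file.ttl")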
    return tuple(parts[:6])  # return only the first 6 parts


14 changes: 14 additions & 0 deletions databusclient/metadata.json
@@ -0,0 +1,14 @@
[
  {
    "filename": "example.ttl",
    "checksum": "0929436d44bba110fc7578c138ed770ae9f548e195d19c2f00d813cca24b9f39",
    "size": 12345,
    "url": "https://cloud.example.com/remote.php/webdav/datasets/mydataset/example.ttl"
  },
  {
    "filename": "example.csv.gz",
    "checksum": "2238acdd7cf6bc8d9c9963a9f6014051c754bf8a04aacc5cb10448e2da72c537",
    "size": 54321,
    "url": "https://cloud.example.com/remote.php/webdav/datasets/mydataset/example.csv.gz"
  }
]
Empty file added nextcloudclient/__init__.py