mapping-suite-sdk

pylint

Note: The Pylint badge is a static indicator. For actual Pylint scores, see the automated Pylint reports in PR comments generated by our CI checks.

The Mapping Suite SDK (MSSDK) is a software development kit designed to standardize and simplify the handling of packages that contain transformation rules and related artefacts for mapping data from XML to RDF using the RDF Mapping Language (RML).

Mapping package anatomy

A mapping package is a standardized collection of files and directories that contains all the necessary components for transforming data from one format to another, specifically from XML to RDF using RDF Mapping Language (RML).

Structure Overview

A mapping package consists of the following core components:

  1. Metadata - Essential identifying information about the package including:

    • Identifier
    • Title
    • Issue date
    • Description
    • Mapping version
    • Ontology version
    • Type
    • Eligibility constraints
    • Signature (hash digest for integrity verification)
  2. Conceptual Mapping Asset - Excel spreadsheets that define high-level mapping concepts and relationships between source data and target ontologies.

  3. Technical Mapping Suite - A collection of implementation-specific mapping files:

    • RML Mapping files - Define transformations from heterogeneous data structures to RDF
  4. Vocabulary Mapping Suite - Files that define specific value transformations and mappings between source and target data values (JSON, CSV, XML).

  5. Test Data Suites - Collections of test data files used for validation and verification of mapping processes.

  6. SPARQL Test Suites - Collections of SPARQL query files used for testing and validation of the transformed data.

  7. SHACL Test Suites - Collections of SHACL (Shapes Constraint Language) files used for RDF data validation.
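The metadata's signature is described only as a hash digest for integrity verification; the algorithm and the exact bytes it covers are not stated here. To illustrate the general idea, the sketch below computes a SHA-256 digest over every file in a folder in a stable order. The helper name `folder_digest` and the choice of SHA-256 are assumptions, not the SDK's implementation.

```python
import hashlib
from pathlib import Path


def folder_digest(folder: Path) -> str:
    """Return a SHA-256 hex digest over relative file paths and file contents.

    Hypothetical helper: the SDK may compute its signature differently.
    """
    digest = hashlib.sha256()
    # Sort so the result does not depend on filesystem iteration order.
    for file_path in sorted(p for p in folder.rglob("*") if p.is_file()):
        digest.update(file_path.relative_to(folder).as_posix().encode("utf-8"))
        digest.update(file_path.read_bytes())
    return digest.hexdigest()
```

Any change to a file name or to file contents yields a different digest, which is the property an integrity check relies on.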

Package Structure Diagram

mapping-package/
├── metadata.json                  # Package metadata
├── transformation/                # Transformation assets
│   ├── conceptual_mappings.xlsx   # Excel file with conceptual mappings
│   ├── mappings/                  # Technical mapping suite
│   │   ├── mapping1.rml.ttl       # RML mapping files
│   │   ├── mapping2.rml.ttl
│   │   └── mapping3.rml.ttl
│   └── resources/                 # Vocabulary mapping suite
│       ├── codelist1.json         # Value mapping files in various formats
│       └── codelist2.csv
├── validation/                    # Validation assets
│   ├── shacl/                     # SHACL test suites
│   │   └── shacl_suite1/          # Domain-specific SHACL shapes
│   │       └── shape1.ttl         # SHACL shape files
│   └── sparql/                    # SPARQL test suites
│       └── sparql_suite1/         # Category-specific SPARQL queries
│           ├── query1.rq          # SPARQL query files
│           └── query2.rq
└── test_data/                     # Test data suites
    ├── test_data_suite1/          # Test case directory
    │   └── input.xml              # Input test data
    └── test_data_suite2/          # Another test case directory
        └── input.xml              # Input test data

This standardized structure ensures consistency across mapping packages and simplifies the process of loading, validating, and executing data transformations.
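Before handing a folder to the SDK, it can be handy to check that the top-level entries from the diagram are present. The following is a minimal standard-library sketch; `EXPECTED_ENTRIES` and `missing_entries` are hypothetical names, and the check is only a quick sanity test, not a substitute for loading the package with the SDK.

```python
from pathlib import Path

# Top-level entries taken from the structure diagram above.
EXPECTED_ENTRIES = ("metadata.json", "transformation", "validation", "test_data")


def missing_entries(package_root: Path) -> list[str]:
    """Return the expected top-level entries that are absent from package_root."""
    return [name for name in EXPECTED_ENTRIES if not (package_root / name).exists()]
```

An empty result means all four entries exist; anything else lists what to add before loading.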

Quick Start

Install the SDK using pip:

pip install mapping-suite-sdk

or using poetry:

poetry add mapping-suite-sdk

Loading a Mapping Package

The SDK provides several ways to load mapping packages:

from pathlib import Path
import mapping_suite_sdk as mssdk 

# Load from a local folder
package = mssdk.load_mapping_package_from_folder(
    mapping_package_folder_path=Path("/path/to/mapping/package")
)

# Load from a ZIP archive
package = mssdk.load_mapping_package_from_archive(
    mapping_package_archive_path=Path("/path/to/package.zip")
)

# Load from GitHub
packages = mssdk.load_mapping_packages_from_github(
    github_repository_url="https://github.com/your-org/mapping-repo",
    packages_path_pattern="mappings/package*",
    branch_or_tag_name="main"
)
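When a script receives a path that may be either a folder or a ZIP archive, a small dispatcher keeps the calling code simple. The sketch below uses only the two loader functions shown above; the wrapper name `load_mapping_package` and the rule of deciding by `.zip` suffix are assumptions made for illustration.

```python
from pathlib import Path


def load_mapping_package(path: Path):
    """Load a package from a folder or a .zip archive (hypothetical helper)."""
    if path.is_dir():
        source = "folder"
    elif path.suffix.lower() == ".zip":
        source = "archive"
    else:
        raise ValueError(f"Expected a directory or a .zip file, got: {path}")

    # Imported lazily so the path check above works without the SDK installed.
    import mapping_suite_sdk as mssdk

    if source == "folder":
        return mssdk.load_mapping_package_from_folder(mapping_package_folder_path=path)
    return mssdk.load_mapping_package_from_archive(mapping_package_archive_path=path)
```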

Serializing a Mapping Package

# Serialize a mapping package (for example, `package` loaded above) to a dictionary
package_dict = mssdk.serialise_mapping_package(package)
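Once a package is a dictionary, it can be written out with the standard library. The sketch assumes the dictionary may hold values that `json` cannot encode natively (dates, for example) and falls back to `str` for those; whether the SDK's output needs that fallback is not stated here. `dump_package_dict` is a hypothetical helper name.

```python
import json
from pathlib import Path


def dump_package_dict(package_dict: dict, output_path: Path) -> None:
    """Write a serialized package dictionary to a JSON file."""
    # default=str converts values json cannot encode (an assumption, see above).
    text = json.dumps(package_dict, indent=2, default=str)
    output_path.write_text(text, encoding="utf-8")
```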

Converting Mapping Packages

The SDK provides a CLI command to convert mapping packages between versions:

Convert Single Package

Convert a single mapping package from one version to another (in-place conversion):

mssdk convert --to-version v3 --from-version v2 \
    from-package /path/to/mapping/package

Convert Multiple Packages from Folder

Convert all mapping packages in a folder (in-place conversion):

mssdk convert --to-version v3 --from-version v2 \
    from-folder /path/to/mappings/folder

The from-folder command will:

  • Iterate through all subdirectories in the specified folder
  • Convert each valid mapping package in-place
  • Skip packages that cannot be converted (e.g., already in the target version)
  • Report a summary with counts of successful and failed conversions

Options:

  • --to-version: Target mapping package version (e.g., v3)
  • --from-version: Source mapping package version (e.g., v2)
  • --verbose, -v: Show detailed debug logs
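The same conversions can be driven from a Python script by invoking the CLI as a subprocess. The sketch below builds the argument list in the order used in the examples above and uses only the flags and subcommands listed there; `build_convert_command` and `run_convert` are hypothetical helper names, and the CLI's exit-code conventions are not documented here, so the return code is passed back without interpretation.

```python
import subprocess
from pathlib import Path

SUBCOMMANDS = ("from-package", "from-folder")


def build_convert_command(
    subcommand: str, path: Path, from_version: str, to_version: str
) -> list[str]:
    """Build the argument list for `mssdk convert` (hypothetical helper)."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"Unknown subcommand: {subcommand}")
    return [
        "mssdk", "convert",
        "--to-version", to_version,
        "--from-version", from_version,
        subcommand, str(path),
    ]


def run_convert(subcommand: str, path: Path, from_version: str, to_version: str) -> int:
    """Run the conversion and return the process exit code."""
    command = build_convert_command(subcommand, path, from_version, to_version)
    return subprocess.run(command, check=False).returncode
```

Because the conversion is in place, copying the folder first (for example with `shutil.copytree`) is a cheap safeguard.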

Extractors

The SDK provides flexible extractors for working with mapping packages from different sources.

Archive Package Extractor

Extract mapping packages from ZIP archives:

from pathlib import Path
from mapping_suite_sdk import ArchivePackageExtractor

extractor = ArchivePackageExtractor()

# Extract to a specific location
output_path = extractor.extract(
    source_path=Path("package.zip"),
    destination_path=Path("output_directory")
)

# Extract to a temporary location (automatically cleaned up)
with extractor.extract_temporary(Path("package.zip")) as temp_path:
    # Work with files in temp_path
    pass  # Cleanup is automatic
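For readers unfamiliar with the temporary-extraction pattern, here is a minimal standard-library sketch of the same idea: extract into a temporary directory, yield its path, and delete everything on exit. It is not the SDK's implementation; `extract_zip_temporarily` is a hypothetical name.

```python
import tempfile
import zipfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def extract_zip_temporarily(archive_path: Path) -> Iterator[Path]:
    """Extract a ZIP archive into a temporary directory and yield its path."""
    with tempfile.TemporaryDirectory() as temp_dir:
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(temp_dir)
        # The directory and its contents are deleted when the block exits.
        yield Path(temp_dir)
```

Anything that must outlive the `with` block has to be copied elsewhere before the block ends.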

GitHub Package Extractor

Clone and extract mapping packages directly from GitHub repositories:

from mapping_suite_sdk import GithubPackageExtractor

extractor = GithubPackageExtractor()

# Extract multiple packages matching a pattern
with extractor.extract_temporary(
    repository_url="https://github.com/org/repo",
    packages_path_pattern="mappings/package*",
    branch_or_tag_name="v1.0.0"
) as package_paths:
    for path in package_paths:
        # Process each package
        print(f"Found package at: {path}")
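The `packages_path_pattern` value above looks like a shell-style glob relative to the repository root. Assuming that reading is right (the matching rules are not spelled out here), the standard-library sketch below shows which directories such a pattern would select in a local checkout. `find_package_dirs` is a hypothetical helper.

```python
from pathlib import Path


def find_package_dirs(repo_root: Path, pattern: str) -> list[Path]:
    """Return directories under repo_root that match a glob pattern, sorted."""
    return sorted(p for p in repo_root.glob(pattern) if p.is_dir())
```

Trying a pattern locally this way is a quick check that it matches the intended package folders and nothing else.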

MongoDB Support

The SDK provides seamless integration with MongoDB for storing and retrieving mapping packages.

Setting Up the Repository

from pymongo import MongoClient
from mapping_suite_sdk import MongoDBRepository
from mapping_suite_sdk.models.mapping_package_v2 import MappingPackageABC

# Initialize MongoDB client
mongo_client = MongoClient("mongodb://localhost:27017/")

# Create a repository for mapping packages
repository = MongoDBRepository(
    model_class=MappingPackageABC,
    mongo_client=mongo_client,
    database_name="mapping_suites",
    collection_name="packages"
)

Loading and Storing Packages

from pathlib import Path
from mapping_suite_sdk import load_mapping_package_from_folder, load_mapping_package_from_mongo_db

# Load a package from a folder
package = load_mapping_package_from_folder(
    mapping_package_folder_path=Path("/path/to/package")
)

# Store the package in MongoDB
repository.create(package)

# Retrieve the package by ID
retrieved_package = load_mapping_package_from_mongo_db(
    mapping_package_id=package.id,
    mapping_package_repository=repository
)

# Query multiple packages
packages = repository.read_many({"metadata.version": "1.0.0"})

OpenTelemetry Tracing

The SDK includes built-in support for OpenTelemetry tracing, which helps with performance monitoring and debugging.

Enabling Tracing

from mapping_suite_sdk import set_mssdk_tracing, get_mssdk_tracing

# Enable tracing
set_mssdk_tracing(True)

# Check if tracing is enabled
is_enabled = get_mssdk_tracing()

Adding Custom Span Processors

from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from mapping_suite_sdk import add_span_processor_to_mssdk_tracer_provider

# Add a console exporter for tracing output
console_exporter = ConsoleSpanExporter()
span_processor = SimpleSpanProcessor(console_exporter)
add_span_processor_to_mssdk_tracer_provider(span_processor)

Using Tracer with OTLP Exporter

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from mapping_suite_sdk import add_span_processor_to_mssdk_tracer_provider, set_mssdk_tracing

# Configure and enable OpenTelemetry with OTLP exporter
otlp_exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
add_span_processor_to_mssdk_tracer_provider(span_processor)
set_mssdk_tracing(True)

# Now all SDK operations will be traced and sent to your collector

Contributing

Contributions to the Mapping Suite SDK are welcome! Please use the fork-and-pull-request workflow.

Development Setup

# Clone the repository
git clone https://github.com/meaningfy-ws/mapping-suite-sdk.git
cd mapping-suite-sdk

# Install dependencies
# Use Makefile commands
make install

# Run tests
make test-unit

Dependency Restrictions

  • LinkML 1.9.5 onwards introduces breaking changes in our data
  • Click 8.2 onwards introduces breaking changes in our CLI
  • Pandas 2.1.4 and OpenTelemetry 1.29.0 are required due to a downstream consumer which relies on Airflow 2.10.x
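A quick way to see whether an environment respects these restrictions is to compare installed versions against them. The sketch below reads the list above as "LinkML below 1.9.5, Click below 8.2, Pandas and OpenTelemetry pinned exactly"; that reading, the distribution names (in particular `opentelemetry-api`), and the helper names are all assumptions.

```python
import re
from importlib import metadata

# Assumed distribution names and bounds, derived from the list above.
UPPER_BOUNDS = {"linkml": (1, 9, 5), "click": (8, 2)}
PINNED = {"pandas": (2, 1, 4), "opentelemetry-api": (1, 29, 0)}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.1.4' into (2, 1, 4); stops at the first non-numeric piece."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_allowed(name: str, version: str) -> bool:
    """Check one version string against the assumed restrictions."""
    parsed = parse_version(version)
    if name in UPPER_BOUNDS:
        return parsed < UPPER_BOUNDS[name]
    if name in PINNED:
        return parsed == PINNED[name]
    return True


def check_installed() -> dict[str, bool]:
    """Report which installed, restricted packages satisfy the rules."""
    results = {}
    for name in (*UPPER_BOUNDS, *PINNED):
        try:
            results[name] = is_allowed(name, metadata.version(name))
        except metadata.PackageNotFoundError:
            continue  # Not installed, nothing to check.
    return results
```

The version parsing is deliberately simple and ignores pre-release markers; a real check would more likely rely on the project's own dependency specification.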

Get in Touch
