diff --git a/docs/package/api-reference/c2pa_interop.md b/docs/package/api-reference/c2pa_interop.md index b5fb82f..d55b7d8 100644 --- a/docs/package/api-reference/c2pa_interop.md +++ b/docs/package/api-reference/c2pa_interop.md @@ -1,230 +1,146 @@ # C2PA Interoperability Module -The `c2pa` module provides utilities for interoperability between EncypherAI's metadata formats and the C2PA (Coalition for Content Provenance and Authenticity) standard. This enables text content to benefit from the same provenance and verification capabilities that C2PA provides for images and videos. +The ``encypher.interop.c2pa`` package groups the helpers that make EncypherAI's +text pipeline interoperable with the Coalition for Content Provenance and +Authenticity (C2PA) specification. These utilities are shared between the core +``UnicodeMetadata`` implementation, integration tests, and third-party +integrations that want to reason about manifests at a lower level. -## Overview +The module focuses on three areas: -C2PA is an open technical standard for providing provenance and verifiability for digital content. While C2PA was initially designed for media files like images and videos, EncypherAI extends these principles to text content through our Unicode variation selector embedding technique. +1. Building and decoding the FEFF-prefixed ``C2PATextManifestWrapper`` that + carries a complete C2PA manifest store inside a Unicode text asset. +2. Normalising text and calculating the SHA-256 content hash with the exact same + procedure during embedding and verification. +3. Converting manifests between EncypherAI's convenience shape and the canonical + C2PA structure when interoperability with external tooling is required. 
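The selector alphabet used by the wrapper in (1) can be sketched in plain Python. This is an illustrative reimplementation, not the library's code, and it assumes bytes 0–15 map onto ``U+FE00``–``U+FE0F`` and bytes 16–255 onto ``U+E0100``–``U+E01EF`` in order:

```python
# Illustrative byte <-> variation-selector mapping (not the library's code).
# Assumption: bytes 0-15 -> U+FE00..U+FE0F and bytes 16-255 -> U+E0100..U+E01EF,
# assigned in order.

def byte_to_selector(b: int) -> str:
    if not 0 <= b <= 255:
        raise ValueError("byte out of range")
    return chr(0xFE00 + b) if b < 16 else chr(0xE0100 + (b - 16))

def selector_to_byte(ch: str) -> int:
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    raise ValueError("not a variation selector")

# Round-trip a small payload through the invisible encoding.
payload = b"C2PATXT\x00"
encoded = "".join(byte_to_selector(b) for b in payload)
decoded = bytes(selector_to_byte(ch) for ch in encoded)
assert decoded == payload
```

Because every selector code point is invisible when rendered, the encoded string can ride along inside ordinary text without changing its appearance.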
-Our implementation: -- Creates C2PA-compliant manifests for text content -- Embeds these manifests directly into the text using Unicode variation selectors -- Provides verification and tamper detection capabilities -- Maintains compatibility with C2PA concepts and structures +## Text Wrapper Helpers -## Hard Binding Implementation - -Our approach to C2PA for text is classified as a **hard binding** technique: - -- The manifest is embedded directly within the text content itself -- The embedding uses invisible Unicode variation selectors -- The binding is inseparable from the content - -This differs from soft binding approaches where the manifest exists separately from the content with only a reference included in the content. +```python +from encypher.interop.c2pa import encode_wrapper, find_and_decode +``` -## Content Hash Coverage +### ``encode_wrapper(manifest_bytes: bytes) -> str`` -A critical component of our C2PA implementation is the content hash assertion: +- Packs the ``magic | version | manifestLength`` header defined by the + ``C2PATextManifestWrapper`` proposal. +- Converts every byte of the header and manifest store into a Unicode variation + selector (0–15 → ``U+FE00``–``U+FE0F``; 16–255 → ``U+E0100``–``U+E01EF``). +- Prefixes the selector block with ``U+FEFF`` and returns the resulting string so + it can be appended to the visible text content. -- The hash covers the plain text content only (not HTML markup or other formatting) -- SHA-256 is used as the hashing algorithm -- The hash is computed before embedding the metadata -- This creates a cryptographic fingerprint of the original content +### ``find_and_decode(text: str) -> Tuple[Optional[bytes], str, Optional[Tuple[int, int]]]`` -This content hash enables tamper detection - if the text is modified after embedding, the current hash will no longer match the stored hash. +- Scans ``text`` for a ``U+FEFF`` marker followed by a contiguous run of + variation selectors. 
+- Verifies the ``C2PATXT\0`` magic value, version number, and manifest length + before returning the decoded JUMBF bytes. +- Normalises the remaining text to NFC and returns both the clean string and the + wrapper span (start/end indices) so callers can exclude the wrapper bytes when + recomputing hashes. -## API Reference +These helpers are used internally by ``UnicodeMetadata`` to append the wrapper at +embedding time and to detect tampering during verification. -### `c2pa_like_dict_to_encypher_manifest` +## Normalisation and Hashing Helpers ```python -def c2pa_like_dict_to_encypher_manifest( - c2pa_like_dict: Dict[str, Any] -) -> Dict[str, Any]: +from encypher.interop.c2pa import compute_normalized_hash, normalize_text ``` -Converts a C2PA-like dictionary to EncypherAI's internal manifest format for embedding. +### ``normalize_text(text: str) -> str`` -**Parameters:** -- `c2pa_like_dict`: A dictionary following the C2PA manifest structure +Returns the NFC-normalised form of ``text``. Normalisation occurs before any +byte offsets are calculated to guarantee that exclusion ranges match the C2PA +specification. -**Returns:** -- A dictionary in EncypherAI's internal manifest format ready for embedding +### ``compute_normalized_hash(text: str, exclusions: Sequence[Tuple[int, int]] | None = None, *, algorithm: str = "sha256")`` -### `encypher_manifest_to_c2pa_like_dict` +- Normalises ``text`` to NFC and encodes it as UTF-8 bytes. +- Removes the byte ranges specified by ``exclusions`` (each expressed as + ``(start, length)`` offsets into the normalised byte array). +- Computes a SHA-256 digest of the filtered bytes and returns a + ``NormalizedHashResult`` object containing the normalised text, raw bytes, and + digest. + +Embedding and verification both call this helper so that the hash recorded in +``c2pa.hash.data.v1`` matches the value recomputed by validators. 
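The normalise–filter–hash pipeline described above can be sketched with the standard library. This is an illustration only; the real helper returns a ``NormalizedHashResult`` object rather than a bare hex string:

```python
import hashlib
import unicodedata

def normalized_sha256(text, exclusions=None):
    # 1. NFC-normalise first, then work on the UTF-8 byte array.
    data = bytearray(unicodedata.normalize("NFC", text).encode("utf-8"))
    # 2. Strip each (start, length) byte range, back to front so earlier
    #    offsets stay valid while we delete.
    for start, length in sorted(exclusions or [], reverse=True):
        del data[start:start + length]
    # 3. Hash whatever bytes remain.
    return hashlib.sha256(bytes(data)).hexdigest()

# Excluding bytes 2-3 of "abcdef" hashes only b"abef".
assert normalized_sha256("abcdef", [(2, 2)]) == hashlib.sha256(b"abef").hexdigest()
```

Running the same function at embedding time and at verification time is what guarantees the two digests agree byte for byte.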
+ +## Manifest Conversion Helpers ```python -def encypher_manifest_to_c2pa_like_dict( - encypher_manifest: Dict[str, Any] -) -> Dict[str, Any]: +from encypher.interop.c2pa import ( + c2pa_like_dict_to_encypher_manifest, + encypher_manifest_to_c2pa_like_dict, + get_c2pa_manifest_schema, +) ``` -Converts an EncypherAI internal manifest back to a C2PA-like dictionary structure. - -**Parameters:** -- `encypher_manifest`: A dictionary in EncypherAI's internal manifest format - -**Returns:** -- A dictionary following the C2PA manifest structure - -## C2PA Manifest Structure - -A C2PA-like manifest for text content typically includes: - -```json -{ - "claim_generator": "EncypherAI/2.3.0", - "timestamp": "2025-06-16T15:00:00Z", - "assertions": [ - { - "label": "stds.schema-org.CreativeWork", - "data": { - "@context": "https://schema.org/", - "@type": "CreativeWork", - "headline": "Article Title", - "author": {"@type": "Person", "name": "Author Name"}, - "publisher": {"@type": "Organization", "name": "Publisher Name"}, - "datePublished": "2025-06-15", - "description": "Article description" - } - }, - { - "label": "stds.c2pa.content.hash", - "data": { - "hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855", - "alg": "sha256" - }, - "kind": "ContentHash" - } - ] -} -``` +These functions convert between the convenience dictionaries exposed by the +EncypherAI SDK and schema-compliant C2PA manifests. They are useful when you need +full control over the actions and assertions that will be embedded inside the +text manifest store. -## Example Usage +## End-to-End Example -### Creating and Embedding a C2PA Manifest +The snippet below demonstrates how the helpers combine with +``UnicodeMetadata`` to embed a manifest and verify it later. 
```python -import hashlib from datetime import datetime from encypher.core.keys import generate_ed25519_key_pair from encypher.core.unicode_metadata import UnicodeMetadata -from encypher.interop.c2pa import c2pa_like_dict_to_encypher_manifest +from encypher.interop.c2pa import compute_normalized_hash -# Generate keys +# Prepare signer credentials private_key, public_key = generate_ed25519_key_pair() -signer_id = "example-key-001" - -# Article text -article_text = """This is the full article text. -It contains multiple paragraphs. -All of this text will be hashed for the content hash assertion.""" - -# Calculate content hash -content_hash = hashlib.sha256(article_text.encode('utf-8')).hexdigest() - -# Create C2PA manifest -c2pa_manifest = { - "claim_generator": "EncypherAI/2.3.0", - "timestamp": datetime.now().isoformat(), - "assertions": [ - { - "label": "stds.schema-org.CreativeWork", - "data": { - "@context": "https://schema.org/", - "@type": "CreativeWork", - "headline": "Example Article", - "author": {"@type": "Person", "name": "John Doe"}, - "publisher": {"@type": "Organization", "name": "Example Publisher"}, - "datePublished": "2025-06-15", - "description": "An example article for C2PA demonstration" - } - }, - { - "label": "stds.c2pa.content.hash", - "data": { - "hash": content_hash, - "alg": "sha256" - }, - "kind": "ContentHash" - } - ] -} - -# Convert to EncypherAI format -encypher_manifest = c2pa_like_dict_to_encypher_manifest(c2pa_manifest) - -# Extract first paragraph for embedding -first_paragraph = article_text.split('\n')[0] - -# Embed into first paragraph -embedded_paragraph = UnicodeMetadata.embed_metadata( - text=first_paragraph, +signer_id = "example-signer" + +text = "Breaking news: invisible provenance ships today." + +# Embed a C2PA manifest – the wrapper is appended as FEFF + variation selectors. 
+embedded = UnicodeMetadata.embed_metadata( + text=text, private_key=private_key, signer_id=signer_id, - metadata_format='cbor_manifest', - claim_generator=encypher_manifest.get("claim_generator"), - actions=encypher_manifest.get("assertions"), - ai_info=encypher_manifest.get("ai_assertion", {}), - custom_claims=encypher_manifest.get("custom_claims", {}), - timestamp=encypher_manifest.get("timestamp") + metadata_format="c2pa", + actions=[{"label": "c2pa.created", "when": datetime.now().isoformat()}], ) +assert embedded.startswith(text) +assert embedded != text # invisible wrapper appended -# Replace first paragraph in article -embedded_article = article_text.replace(first_paragraph, embedded_paragraph) -``` +# Copy/paste operations preserve the wrapper, so validators can recover it. +def resolver(requested_signer_id: str): + return public_key if requested_signer_id == signer_id else None -### Verifying and Extracting a C2PA Manifest - -```python -from encypher.core.unicode_metadata import UnicodeMetadata -from encypher.interop.c2pa import encypher_manifest_to_c2pa_like_dict -import hashlib - -# Define key provider function -def key_provider(kid): - if kid == signer_id: - return public_key - return None - -# Extract first paragraph (which contains the embedded metadata) -first_paragraph = embedded_article.split('\n')[0] - -# Verify and extract metadata -is_verified, extracted_signer_id, extracted_manifest = UnicodeMetadata.verify_and_extract_metadata( - text=first_paragraph, - public_key_provider=key_provider, - return_payload_on_failure=True +verified, recovered_signer, manifest = UnicodeMetadata.verify_metadata( + text=embedded, + public_key_resolver=resolver, ) +assert verified and recovered_signer == signer_id -if is_verified: - # Convert back to C2PA format - c2pa_extracted = encypher_manifest_to_c2pa_like_dict(extracted_manifest) - - # Verify content hash - current_content_hash = hashlib.sha256(article_text.encode('utf-8')).hexdigest() - - # Find content hash 
assertion - stored_hash = None - for assertion in c2pa_extracted.get("assertions", []): - if assertion.get("label") == "stds.c2pa.content.hash": - stored_hash = assertion["data"]["hash"] - break - - if stored_hash == current_content_hash: - print("Content hash verification successful!") - else: - print("Content hash verification failed - content may have been tampered with.") -else: - print("Signature verification failed!") +# The manifest records a hard-binding hash computed with the shared helper. +content_hash_assertion = next( + assertion + for assertion in manifest["assertions"] + if assertion["label"] == "c2pa.hash.data.v1" +) +exclusions = [ + (item["start"], item["length"]) + for item in content_hash_assertion["data"].get("exclusions", []) +] +hash_result = compute_normalized_hash(embedded, exclusions) +print(hash_result.hexdigest) ``` -## Tamper Detection - -Our C2PA implementation enables two types of tamper detection: - -1. **Content Tampering**: If the text content is modified after embedding, the current hash will no longer match the stored hash in the manifest. +During verification the library: -2. **Metadata Tampering**: If the embedded manifest itself is modified, the digital signature verification will fail. +1. Calls ``find_and_decode`` to locate the FEFF-prefixed wrapper and recover the + JUMBF manifest store. +2. Verifies the COSE ``Sign1`` signature and actions. +3. Uses ``compute_normalized_hash`` with the recorded exclusions to recompute the + hard-binding digest. -These mechanisms ensure the integrity and authenticity of both the content and its provenance information. +Any mismatch in the wrapper structure, manifest signature, or content hash causes +verification to fail, surfacing provenance tampering to downstream consumers. 
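The structural checks in step 1 can be illustrated with a stdlib sketch. Only the ``C2PATXT\0`` magic value is taken from the text above; the one-byte version field and four-byte big-endian length below are assumptions made for illustration, not the specified widths:

```python
import struct

MAGIC = b"C2PATXT\x00"  # magic value checked during decoding (from the spec text above)

def pack_header(manifest: bytes, version: int = 1) -> bytes:
    # magic | version | manifestLength | payload layout. Field widths and
    # big-endian byte order are assumptions for this sketch.
    return MAGIC + struct.pack(">BI", version, len(manifest)) + manifest

def parse_header(blob: bytes) -> bytes:
    if blob[:8] != MAGIC:
        raise ValueError("bad magic value")
    version, length = struct.unpack(">BI", blob[8:13])
    payload = blob[13:13 + length]
    if len(payload) != length:
        raise ValueError("truncated manifest store")
    return payload

assert parse_header(pack_header(b'{"demo": true}')) == b'{"demo": true}'
```

A wrong magic value or a declared length that overruns the selector block is rejected before any signature work happens, which is the cheapest possible tamper check.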
diff --git a/docs/package/api-reference/unicode_metadata.md b/docs/package/api-reference/unicode_metadata.md index aa55a78..1fa0115 100644 --- a/docs/package/api-reference/unicode_metadata.md +++ b/docs/package/api-reference/unicode_metadata.md @@ -21,9 +21,20 @@ Unicode variation selectors (ranges U+FE00-FE0F and U+E0100-E01EF) are special c - These selectors are invisible when rendered in text - The encoded data travels with the content as part of the text itself +### C2PA Text Wrapper + +When `metadata_format="c2pa"`, the module emits a `C2PATextManifestWrapper` as defined by the latest text-embedding +specification. The wrapper: + +- Is prefixed with a single Zero-Width No-Break Space (`U+FEFF`). +- Stores the manifest inside a JUMBF container following the `magic | version | length | payload` layout. +- Appears as a single block of variation selectors appended to the end of the visible text. + +This behaviour is specific to the C2PA format—legacy formats continue to use the legacy targets described below. + ### Embedding Targets -The module supports several embedding targets: +The module supports several embedding targets for legacy formats (`basic`, `manifest`, and `cbor_manifest`): | Target | Description | Use Case | |--------|-------------|----------| @@ -35,23 +46,31 @@ The module supports several embedding targets: | `FILE_END` | Appends variation selectors at the very end of the text | Useful when you prefer not to alter in-text positions | | `FILE_END_ZWNBSP` | Appends a zero-width no-break space (U+FEFF) followed by variation selectors at the end | Improves robustness in some pipelines that trim trailing selectors | +> **Note:** C2PA manifests always use the wrapper block described above and ignore the legacy target configuration. + ### Embedding Approaches -The module supports two embedding approaches: +The module supports two embedding approaches for legacy formats: 1. 
**Single-Point Embedding** (default): All metadata is embedded after a single target character (typically the first whitespace)
2. **Distributed Embedding**: Metadata is distributed across multiple target characters throughout the text

-Single-point embedding is generally recommended as it minimizes the impact on text processing and is easier to manage.
+Single-point embedding is generally recommended as it minimizes the impact on text processing and is easier to manage. When
+embedding C2PA manifests we always append a single wrapper block instead of inserting selectors inside the visible content.

## Content Hash Coverage

-When using the C2PA manifest format, a content hash assertion is included in the manifest:
+When using the C2PA manifest format, a content hash assertion (`c2pa.hash.data.v1`) is included in the manifest:
+
+- The text is normalised to NFC before hashing.
+- SHA-256 is used as the hashing algorithm.
+- The hash is computed on the UTF-8 bytes of the normalised text **after removing the wrapper span** recorded in the
+  `exclusions` list.
+- Exclusions are expressed as byte offsets `{ "start": <offset>, "length": <byte_count> }` so validators can strip the same
+  region before recomputing the digest.

-- The hash covers the plain text content only (not HTML markup or other formatting)
-- SHA-256 is used as the hashing algorithm
-- The hash is computed before embedding the metadata
-- This creates a cryptographic fingerprint of the original content
+This creates a cryptographic fingerprint of the original, human-visible content while keeping the wrapper bytes outside the hash
+coverage.

This content hash enables tamper detection - if the text is modified after embedding, the current hash will no longer match the stored hash.
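The NFC step matters because visually identical strings can carry different code point sequences. A quick standard-library demonstration:

```python
import hashlib
import unicodedata

composed = "caf\u00e9"     # 'é' as a single code point (already NFC)
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

# The two spellings render identically but hash differently as raw bytes...
assert hashlib.sha256(composed.encode()).hexdigest() != hashlib.sha256(decomposed.encode()).hexdigest()

# ...while NFC collapses them to the same byte sequence, so the embedder and
# every validator hash exactly the same input.
nfc = lambda s: unicodedata.normalize("NFC", s)
assert nfc(composed) == nfc(decomposed)
```

Without this normalisation, a round-trip through a system that recomposes or decomposes accents would falsely register as tampering.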
@@ -66,8 +85,7 @@ def embed_metadata( text: str, private_key: PrivateKeyTypes, signer_id: str, - metadata_format: Literal["basic", "manifest", "cbor_manifest", "c2pa"] = "basic", - c2pa_manifest: Optional[Dict[str, Any]] = None, + metadata_format: Literal["basic", "manifest", "cbor_manifest", "c2pa"] = "manifest", model_id: Optional[str] = None, timestamp: Optional[Union[str, datetime, date, int, float]] = None, target: Optional[Union[str, MetadataTarget]] = None, @@ -76,7 +94,9 @@ def embed_metadata( actions: Optional[List[Dict[str, Any]]] = None, ai_info: Optional[Dict[str, Any]] = None, custom_claims: Optional[Dict[str, Any]] = None, + omit_keys: Optional[List[str]] = None, distribute_across_targets: bool = False, + add_hard_binding: bool = True, ) -> str: ``` @@ -90,14 +110,21 @@ Embeds metadata into text using Unicode variation selectors, signing with a priv - `basic`: A simple key-value payload. - `manifest`: A legacy C2PA-like manifest. - `cbor_manifest`: A CBOR-encoded version of the legacy manifest. - - `c2pa`: The C2PA-compliant format using COSE Sign1. -- `c2pa_manifest`: A dictionary representing the full C2PA manifest. Required when `metadata_format` is `c2pa`. + - `c2pa`: The C2PA-compliant format using COSE Sign1. This mode builds the + manifest internally using the provided `claim_generator` and `actions`. - `model_id`: Model identifier (used in 'basic' payload). - `timestamp`: Optional timestamp (datetime, ISO string, int/float epoch). When omitted, the outer payload omits `timestamp`, and C2PA action assertions that normally include `when` will omit that field. - `target`: Where to embed metadata. Options: `"whitespace"`, `"punctuation"`, `"first_letter"`, `"last_letter"`, `"all_characters"`, `"file_end"`, `"file_end_zwnbsp"`. - `file_end` and `file_end_zwnbsp` append the encoded selectors at the end of the text (the latter prefixes a zero-width no-break space U+FEFF before the selectors). 
+  - This parameter is ignored when `metadata_format="c2pa"`; C2PA manifests always append a wrapper block at end-of-file.
- `custom_metadata`: Dictionary for custom fields (used in 'basic' payload).
-- `claim_generator`, `actions`, `ai_info`, `custom_claims`: Used for legacy 'manifest' formats.
+- `claim_generator`: Optional identifier for the software agent.
+- `actions`: Optional list of action dictionaries to seed `c2pa.actions.v1`.
+- `ai_info`, `custom_claims`: Used for legacy 'manifest' formats.
+- `omit_keys`: List of metadata keys to remove from legacy payloads before signing.
+- `distribute_across_targets`: If True, distribute bits across multiple targets (legacy formats only).
+- `add_hard_binding`: When `metadata_format="c2pa"`, controls whether the
+  `c2pa.hash.data.v1` assertion is included.
-- `omit_keys`: List of metadata keys to remove from the payload before signing.
-- `distribute_across_targets`: If True, distribute bits across multiple targets.
@@ -201,7 +228,7 @@ This example demonstrates embedding a full C2PA v2.2 manifest.
```python
from encypher.core.unicode_metadata import UnicodeMetadata
from encypher.core.keys import generate_ed25519_key_pair
-import hashlib
+from datetime import datetime
+from encypher.interop.c2pa import compute_normalized_hash

# 1. Generate keys
private_key, public_key = generate_ed25519_key_pair()
@@ -209,48 +236,42 @@ signer_id = "example-c2pa-key-001"

# 2. Define the text content and create its hash
clean_text = "This is the article content that we want to protect."
-clean_text_hash = hashlib.sha256(clean_text.encode('utf-8')).hexdigest()
-
-# 3. 
Create the C2PA manifest -c2pa_manifest = { - "claim_generator": "EncypherAI/2.3.0", - "assertions": [ - { - "label": "stds.schema-org.CreativeWork", - "data": { - "@context": "https://schema.org/", - "@type": "CreativeWork", - "headline": "Example Article", - "author": {"@type": "Person", "name": "Jane Doe"} - } - }, - { - "label": "c2pa.hash.data.v1", - "data": { - "hash": clean_text_hash, - "alg": "sha256" - }, - "kind": "ContentHash" - } - ] -} +hash_result = compute_normalized_hash(clean_text) +clean_text_hash = hash_result.hexdigest +print("Baseline NFC hash:", clean_text_hash) + +# 3. Provide optional custom actions (the library adds c2pa.watermarked automatically) +custom_actions = [ + { + "label": "c2pa.created", + "softwareAgent": "EncypherAI/examples", + "when": datetime.now().isoformat(), + } +] # 4. Embed the manifest into the text embedded_text = UnicodeMetadata.embed_metadata( text=clean_text, private_key=private_key, signer_id=signer_id, - metadata_format='c2pa', - c2pa_manifest=c2pa_manifest, + metadata_format="c2pa", + claim_generator="EncypherAI/examples", + actions=custom_actions, ) # 5. Verify the embedded manifest -is_verified, _, payload = UnicodeMetadata.verify_metadata( +is_verified, _, manifest = UnicodeMetadata.verify_metadata( text=embedded_text, public_key_resolver=lambda kid: public_key if kid == signer_id else None ) print(f"C2PA Verification: {is_verified}") + +content_hash_assertion = next( + assertion for assertion in manifest["assertions"] if assertion["label"] == "c2pa.hash.data.v1" +) +print(content_hash_assertion["data"]) +# Validators remove the recorded exclusion span before recomputing the hash. ``` ### Embedding without a timestamp diff --git a/docs/package/changelog.md b/docs/package/changelog.md index 0040cdb..2e0592d 100644 --- a/docs/package/changelog.md +++ b/docs/package/changelog.md @@ -2,6 +2,22 @@ This document provides a chronological list of notable changes for each version of EncypherAI. 
+## Unreleased + +### Added +- Introduced a shared `text_hashing` helper in `encypher.interop.c2pa` that performs NFC normalisation, exclusion filtering, and SHA-256 hashing so embedding and verification reuse the exact same pipeline. +- Documented the end-of-text `C2PATextManifestWrapper` flow, including the FEFF prefix, contiguous variation selector block, and wrapper exclusion handling mandated by the latest C2PA text embedding proposal. + +### Changed +- Updated the C2PA embedding path to always append a single FEFF-prefixed wrapper containing a JUMBF manifest store encoded with the `magic | version | length | payload` layout. +- Refactored hard-binding exclusion tracking to record `{start, length}` byte ranges derived from the NFC-normalised text and to stabilise those offsets prior to signing. + +### Fixed +- Ensured validators normalise text, remove wrapper exclusions, and recompute the content hash using the shared helper, eliminating discrepancies between embedding and verification. + +### Documentation +- Refreshed C2PA API references, tutorials, and provenance guides to explain the FEFF-prefixed wrapper workflow, normalised hashing routine, and the new helper utilities. 
+ ## 2.8.1 (2025-01-03) ### Fixed diff --git a/docs/package/examples/advanced-usage.md b/docs/package/examples/advanced-usage.md index 2c626fb..90a25cb 100644 --- a/docs/package/examples/advanced-usage.md +++ b/docs/package/examples/advanced-usage.md @@ -19,7 +19,7 @@ from encypher.core.unicode_metadata import UnicodeMetadata from encypher.core.keys import generate_key_pair from cryptography.hazmat.primitives.asymmetric.types import PublicKeyTypes from typing import Optional, Dict, Any, Tuple -import hashlib +from encypher.interop.c2pa import compute_normalized_hash import time import json @@ -54,7 +54,7 @@ class EnhancedMetadataHandler: """Embed metadata with additional content hash.""" # Add content hash if enabled if self.include_hash: - content_hash = hashlib.sha256(text.encode()).hexdigest() + content_hash = compute_normalized_hash(text).hexdigest metadata["content_hash"] = content_hash # Use UnicodeMetadata to perform the embedding @@ -86,7 +86,7 @@ class EnhancedMetadataHandler: original_text_for_hash_check = UnicodeMetadata.extract_original_text(text) # Calculate hash of original text - current_hash = hashlib.sha256(original_text_for_hash_check.encode()).hexdigest() + current_hash = compute_normalized_hash(original_text_for_hash_check).hexdigest # Compare with stored hash hash_verification = current_hash == verified_payload_dict.get("content_hash") diff --git a/docs/package/examples/basic_text_embedding.md b/docs/package/examples/basic_text_embedding.md index b72e36f..640650d 100644 --- a/docs/package/examples/basic_text_embedding.md +++ b/docs/package/examples/basic_text_embedding.md @@ -1,437 +1,124 @@ # Basic Text Embedding Tutorial -This tutorial provides a step-by-step guide for embedding C2PA-compliant provenance metadata into text content using EncypherAI's Unicode variation selector approach. +This tutorial walks through the minimal steps required to embed a C2PA manifest +into plain text with EncypherAI. 
The workflow follows the latest +``C2PATextManifestWrapper`` proposal: the manifest is encoded as a JUMBF +container, converted to Unicode variation selectors, and appended to the end of +the text after a ``U+FEFF`` marker. The visible content never changes, but copy +and paste operations retain the provenance wrapper. ## Prerequisites Before starting, ensure you have: -- EncypherAI Python package installed (`uv add encypher-ai`) -- Basic understanding of Python programming -- Familiarity with content provenance concepts - -## Step 1: Set Up Your Environment - -First, import the necessary modules: +- The EncypherAI Python package installed (``uv add encypher-ai``) +- An understanding of Python basics and Ed25519 keys ```python -from encypher.core.unicode_metadata import UnicodeMetadata -from encypher.core.keys import generate_ed25519_key_pair, load_ed25519_key_pair -from encypher.interop.c2pa import c2pa_like_dict_to_encypher_manifest -import hashlib -import json from datetime import datetime -import os -``` - -### Embedding without a timestamp (optional) - -If you prefer not to include a timestamp, remove it from the manifest and do not pass it to `embed_metadata()`: - -```python -# Create a C2PA manifest WITHOUT a timestamp -c2pa_manifest_no_time = { - "claim_generator": "EncypherAI/2.3.0", - "assertions": [ - { - "label": "stds.schema-org.CreativeWork", - "data": { - "@context": "https://schema.org/", - "@type": "CreativeWork", - "headline": "Sample Article Title", - "author": {"@type": "Person", "name": "John Doe"}, - "publisher": {"@type": "Organization", "name": "Example Publisher"}, - "description": "A sample article demonstrating text embedding without a timestamp" - } - } - ] -} - -# Convert to EncypherAI format -encypher_manifest_no_time = c2pa_like_dict_to_encypher_manifest(c2pa_manifest_no_time) - -# Embed without passing a timestamp (it will be omitted in the payload) -embedded_paragraph_no_time = UnicodeMetadata.embed_metadata( - text=first_paragraph, 
- private_key=private_key, - signer_id=signer_id, - metadata_format='cbor_manifest', - claim_generator=encypher_manifest_no_time.get("claim_generator"), - actions=encypher_manifest_no_time.get("assertions") -) +from encypher.core.keys import generate_ed25519_key_pair +from encypher.core.unicode_metadata import UnicodeMetadata +from encypher.interop.c2pa import compute_normalized_hash ``` -## Step 2: Generate or Load Keys - -You'll need a key pair for signing the metadata. You can either generate a new pair or load an existing one: +## Step 1: Generate a Signing Key ```python -# Option 1: Generate a new key pair private_key, public_key = generate_ed25519_key_pair() signer_id = "example-key-001" - -# Save keys for future use -keys_dict = { - "private_key": private_key.hex(), - "public_key": public_key.hex(), - "signer_id": signer_id -} - -with open("keys.json", "w") as f: - json.dump(keys_dict, f) - -# Option 2: Load existing keys -if os.path.exists("keys.json"): - with open("keys.json", "r") as f: - keys_dict = json.load(f) - - private_key = bytes.fromhex(keys_dict["private_key"]) - public_key = bytes.fromhex(keys_dict["public_key"]) - signer_id = keys_dict["signer_id"] ``` -## Step 3: Prepare Your Text Content +The private key signs the manifest. The public key is used later to verify the +embedded metadata. -Define the text content you want to embed metadata into: +## Step 2: Prepare the Text Content ```python -# Full article text -article_text = """# Sample Article Title - -This is the first paragraph of the sample article. This paragraph will contain -the embedded metadata using Unicode variation selectors. - -This is the second paragraph with additional content. The content hash will -cover all paragraphs in the article, ensuring the integrity of the entire text. - -## Subsection - -This is a subsection with more content. The embedding process will not affect -the visual appearance of the text, even though the metadata is embedded directly -within it. 
-""" - -# For demonstration, we'll embed metadata into the first paragraph -first_paragraph = article_text.split("\n\n")[1] -``` - -## Step 4: Calculate Content Hash - -Calculate a hash of the full content to enable tamper detection: - -```python -# Calculate content hash (using plain text without formatting) -plain_text = "\n".join([line.strip() for line in article_text.split("\n")]) -content_hash = hashlib.sha256(plain_text.encode("utf-8")).hexdigest() -``` - -## Step 5: Create C2PA Manifest - -Create a C2PA-compliant manifest with relevant assertions: - -```python -# Current timestamp -timestamp = datetime.now().isoformat() - -# Create C2PA manifest -c2pa_manifest = { - "claim_generator": "EncypherAI/2.3.0", - "timestamp": timestamp, - "assertions": [ - { - "label": "stds.schema-org.CreativeWork", - "data": { - "@context": "https://schema.org/", - "@type": "CreativeWork", - "headline": "Sample Article Title", - "author": {"@type": "Person", "name": "John Doe"}, - "publisher": {"@type": "Organization", "name": "Example Publisher"}, - "datePublished": timestamp.split("T")[0], - "description": "A sample article demonstrating text embedding" - } - }, - { - "label": "stds.c2pa.content.hash", - "data": { - "hash": content_hash, - "alg": "sha256" - }, - "kind": "ContentHash" - } - ] -} - -# Convert to EncypherAI format -encypher_manifest = c2pa_like_dict_to_encypher_manifest(c2pa_manifest) +text = "EncypherAI now emits FEFF-prefixed text manifests." ``` -> Note: Timestamp optional -> -> - The `timestamp` is optional across all metadata formats, including C2PA. -> - When omitted, C2PA action assertions (e.g., `c2pa.created`, `c2pa.watermarked`) will simply omit their `when` fields. -> - You can provide a timestamp (recommended) or skip it depending on your needs. 
-
-## Step 6: Embed Metadata into Text
-
-Embed the manifest into the first paragraph of your text:
+## Step 3: Embed the C2PA Manifest

```python
-# Embed metadata into the first paragraph
-embedded_paragraph = UnicodeMetadata.embed_metadata(
-    text=first_paragraph,
+embedded_text = UnicodeMetadata.embed_metadata(
+    text=text,
     private_key=private_key,
     signer_id=signer_id,
-    metadata_format='cbor_manifest',
-    claim_generator=encypher_manifest.get("claim_generator"),
-    actions=encypher_manifest.get("assertions"),
-    timestamp=encypher_manifest.get("timestamp")
+    metadata_format="c2pa",
+    add_hard_binding=True,
+    actions=[{"label": "c2pa.created", "when": datetime.now().isoformat()}],
 )
-
-# Replace the first paragraph in the article
-embedded_article = article_text.replace(first_paragraph, embedded_paragraph)
-
-# Save the embedded article
-with open("embedded_article.txt", "w", encoding="utf-8") as f:
-    f.write(embedded_article)
```

-## Step 7: Verify and Extract Metadata
-
-To verify the embedded metadata and check for tampering:
+What happens under the hood:

-```python
-# Define a key provider function
-def key_provider(kid):
-    if kid == signer_id:
-        return public_key
-    return None
+1. ``UnicodeMetadata`` constructs a C2PA manifest with ``c2pa.actions.v1``,
+   ``c2pa.soft_binding.v1``, and ``c2pa.hash.data.v1`` assertions.
+2. The manifest is serialised to CBOR, wrapped in a COSE ``Sign1`` signature, and
+   packaged as a compact JUMBF container.
+3. ``encode_wrapper`` converts the header + manifest bytes into variation selectors
+   and returns ``"\ufeff"`` followed by the selector block.
+4. The wrapper is appended to the original text so the output renders identically
+   to the input.
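The scan performed later at verification time, locating the ``U+FEFF`` marker and the contiguous selector run after it, can be sketched as follows. This is an illustration, not the library's ``find_and_decode``:

```python
# Illustrative wrapper scan (not the library's find_and_decode): find U+FEFF,
# then collect the contiguous run of variation selectors that follows it.

def is_selector(ch: str) -> bool:
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def wrapper_span(text: str):
    start = text.find("\ufeff")
    if start == -1:
        return None  # no wrapper present
    end = start + 1
    while end < len(text) and is_selector(text[end]):
        end += 1
    return (start, end)

# "visible text" is 12 characters, so the wrapper occupies indices 12..14.
sample = "visible text" + "\ufeff" + chr(0xFE00) + chr(0xE0100)
assert wrapper_span(sample) == (12, 15)
```

Returning the span (rather than just the decoded bytes) is what lets a validator exclude exactly the wrapper region when it recomputes the content hash.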
-# Extract the first paragraph (which contains the embedded metadata) -embedded_first_paragraph = embedded_article.split("\n\n")[1] - -# Verify and extract metadata -is_verified, extracted_signer_id, extracted_manifest = UnicodeMetadata.verify_and_extract_metadata( - text=embedded_first_paragraph, - public_key_provider=key_provider -) - -if is_verified: - print(f"✓ Signature verification successful!") - print(f"✓ Signer ID: {extracted_signer_id}") - - # Check for content tampering - current_content_hash = hashlib.sha256(plain_text.encode("utf-8")).hexdigest() - - # Find content hash assertion - stored_hash = None - for assertion in extracted_manifest.get("assertions", []): - if assertion.get("label") == "stds.c2pa.content.hash": - stored_hash = assertion["data"]["hash"] - break - - if stored_hash == current_content_hash: - print("✓ Content hash verification successful!") - else: - print("✗ Content hash verification failed - content may have been tampered with.") - print(f" Stored hash: {stored_hash}") - print(f" Current hash: {current_content_hash}") -else: - print("✗ Signature verification failed!") -``` - -## Step 8: Simulate Tampering (Optional) - -To demonstrate tamper detection, you can modify the content and verify again: +You can confirm that the wrapper exists without being visible: ```python -# Simulate content tampering -tampered_article = embedded_article.replace("sample article", "modified article") - -# Calculate new content hash -tampered_plain_text = "\n".join([line.strip() for line in tampered_article.split("\n")]) -tampered_content_hash = hashlib.sha256(tampered_plain_text.encode("utf-8")).hexdigest() - -# Extract the first paragraph (which contains the embedded metadata) -tampered_first_paragraph = tampered_article.split("\n\n")[1] - -# Verify and extract metadata -is_verified, extracted_signer_id, extracted_manifest = UnicodeMetadata.verify_and_extract_metadata( - text=tampered_first_paragraph, - public_key_provider=key_provider -) - -if 
is_verified:
-    print("\nTampered Content Test:")
-    print(f"✓ Signature verification successful!")
-
-    # Find content hash assertion
-    stored_hash = None
-    for assertion in extracted_manifest.get("assertions", []):
-        if assertion.get("label") == "stds.c2pa.content.hash":
-            stored_hash = assertion["data"]["hash"]
-            break
-
-    if stored_hash == tampered_content_hash:
-        print("✓ Content hash verification successful (unexpected!)")
-    else:
-        print("✓ Content hash verification failed - tampering detected!")
-        print(f"  Stored hash: {stored_hash}")
-        print(f"  Current hash: {tampered_content_hash}")
-else:
-    print("\nTampered Content Test:")
-    print("✗ Signature verification failed!")
+assert embedded_text.startswith(text)
+assert embedded_text[len(text)] == "\ufeff"
+print(len(embedded_text) - len(text), "additional code points appended")
 ```

-## Complete Example
+## Step 4: Inspect the Hard-Binding Hash

-Here's the complete code for the basic text embedding workflow:
+The ``c2pa.hash.data.v1`` assertion records a SHA-256 digest of the NFC
+normalised text with the wrapper bytes excluded. You can reproduce it with the
+shared helper:

 ```python
-from encypher.core.unicode_metadata import UnicodeMetadata
-from encypher.core.keys import generate_ed25519_key_pair
-from encypher.interop.c2pa import c2pa_like_dict_to_encypher_manifest
-import hashlib
-from datetime import datetime
-
-# 1. Generate keys
-private_key, public_key = generate_ed25519_key_pair()
-signer_id = "example-key-001"
-
-# 2. Prepare article text
-article_text = """# Sample Article Title
-
-This is the first paragraph of the sample article. This paragraph will contain
-the embedded metadata using Unicode variation selectors.
+from encypher.interop.c2pa import compute_normalized_hash, find_and_decode, normalize_text

-This is the second paragraph with additional content. The content hash will
-cover all paragraphs in the article, ensuring the integrity of the entire text.
+manifest_bytes, _, span = find_and_decode(embedded_text) +assert manifest_bytes and span is not None -## Subsection +wrapper_segment = embedded_text[span[0] : span[1]] +normalized_full = normalize_text(embedded_text) +normalized_index = normalized_full.rfind(wrapper_segment) +assert normalized_index >= 0 -This is a subsection with more content. The embedding process will not affect -the visual appearance of the text, even though the metadata is embedded directly -within it. -""" +# Remove the wrapper span from the normalised text when hashing +exclusion_start = len(normalized_full[:normalized_index].encode("utf-8")) +exclusion_length = len(wrapper_segment.encode("utf-8")) -# Extract first paragraph for embedding -first_paragraph = article_text.split("\n\n")[1] - -# 3. Calculate content hash -plain_text = "\n".join([line.strip() for line in article_text.split("\n")]) -content_hash = hashlib.sha256(plain_text.encode("utf-8")).hexdigest() - -# 4. Create C2PA manifest -c2pa_manifest = { - "claim_generator": "EncypherAI/2.3.0", - "timestamp": datetime.now().isoformat(), - "assertions": [ - { - "label": "stds.schema-org.CreativeWork", - "data": { - "@context": "https://schema.org/", - "@type": "CreativeWork", - "headline": "Sample Article Title", - "author": {"@type": "Person", "name": "John Doe"}, - "publisher": {"@type": "Organization", "name": "Example Publisher"}, - "datePublished": datetime.now().date().isoformat(), - "description": "A sample article demonstrating text embedding" - } - }, - { - "label": "stds.c2pa.content.hash", - "data": { - "hash": content_hash, - "alg": "sha256" - }, - "kind": "ContentHash" - } - ] -} - -# 5. Convert to EncypherAI format -encypher_manifest = c2pa_like_dict_to_encypher_manifest(c2pa_manifest) - -# 6. 
Embed into first paragraph -embedded_paragraph = UnicodeMetadata.embed_metadata( - text=first_paragraph, - private_key=private_key, - signer_id=signer_id, - metadata_format='cbor_manifest', - claim_generator=encypher_manifest.get("claim_generator"), - actions=encypher_manifest.get("assertions"), - timestamp=encypher_manifest.get("timestamp") -) - -# 7. Replace first paragraph in article -embedded_article = article_text.replace(first_paragraph, embedded_paragraph) - -# 8. Save the embedded article -with open("embedded_article.txt", "w", encoding="utf-8") as f: - f.write(embedded_article) - -# 9. Define key provider function -def key_provider(kid): - if kid == signer_id: - return public_key - return None - -# 10. Extract first paragraph (which contains the embedded metadata) -embedded_first_paragraph = embedded_article.split("\n\n")[1] - -# 11. Verify and extract metadata -is_verified, extracted_signer_id, extracted_manifest = UnicodeMetadata.verify_and_extract_metadata( - text=embedded_first_paragraph, - public_key_provider=key_provider +hash_result = compute_normalized_hash( + embedded_text, + exclusions=[(exclusion_start, exclusion_length)], ) - -# 12. 
Check verification results -if is_verified: - print(f"✓ Signature verification successful!") - print(f"✓ Signer ID: {extracted_signer_id}") - - # Check for content tampering - current_content_hash = hashlib.sha256(plain_text.encode("utf-8")).hexdigest() - - # Find content hash assertion - stored_hash = None - for assertion in extracted_manifest.get("assertions", []): - if assertion.get("label") == "stds.c2pa.content.hash": - stored_hash = assertion["data"]["hash"] - break - - if stored_hash == current_content_hash: - print("✓ Content hash verification successful!") - else: - print("✗ Content hash verification failed - content may have been tampered with.") -else: - print("✗ Signature verification failed!") +print("NFC hash:", hash_result.hexdigest) ``` -## Output Example +## Step 5: Verify the Embedded Manifest -When running the verification code on an untampered article, you should see: +```python +def resolver(requested_signer_id: str): + return public_key if requested_signer_id == signer_id else None +verified, recovered_signer, manifest = UnicodeMetadata.verify_metadata( + text=embedded_text, + public_key_resolver=resolver, +) +assert verified and recovered_signer == signer_id ``` -✓ Signature verification successful! -✓ Signer ID: example-key-001 -✓ Content hash verification successful! -``` - -If the content has been tampered with, you'll see: -``` -✓ Signature verification successful! -✓ Signer ID: example-key-001 -✗ Content hash verification failed - content may have been tampered with. - Stored hash: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 - Current hash: a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6q7r8s9t0u1v2w3x4y5z6a7b8c9d0 -``` +Verification automatically: -## Next Steps +1. Locates the FEFF-prefixed wrapper via ``find_and_decode``. +2. Validates the COSE ``Sign1`` signature and actions. +3. Recomputes the hard-binding hash with ``compute_normalized_hash`` and the + recorded exclusion offsets. 
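The NFC normalisation step exists because canonically equivalent strings can differ at the byte level. A standard-library illustration, independent of the package:

```python
import hashlib
import unicodedata

composed = "caf\u00e9"     # "café" with a precomposed é (already NFC)
decomposed = "cafe\u0301"  # "café" as "e" plus a combining acute accent

# The strings render identically but their raw UTF-8 bytes differ,
# so naive hashing produces two different digests.
raw_a = hashlib.sha256(composed.encode("utf-8")).hexdigest()
raw_b = hashlib.sha256(decomposed.encode("utf-8")).hexdigest()
assert raw_a != raw_b

# Hashing after NFC normalisation yields a single stable digest.
nfc_a = hashlib.sha256(unicodedata.normalize("NFC", composed).encode("utf-8")).hexdigest()
nfc_b = hashlib.sha256(unicodedata.normalize("NFC", decomposed).encode("utf-8")).hexdigest()
assert nfc_a == nfc_b
```

``compute_normalized_hash`` bundles the normalisation, encoding, and hashing so producers and validators cannot drift apart.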
-Now that you've learned the basics of text embedding with EncypherAI, you can: +## Step 6: Share the Provenance-Enabled Text -1. Integrate this workflow into your content management system -2. Explore more complex C2PA assertions for richer provenance information -3. Implement a user-friendly verification interface -4. Check out the advanced C2PA text demo for HTML integration examples +The string in ``embedded_text`` can be copied into documents, chat systems, or +CMS platforms. The invisible wrapper survives round-trips, allowing downstream +consumers to authenticate the text without requiring sidecar files. diff --git a/docs/package/examples/c2pa_text_demo.md b/docs/package/examples/c2pa_text_demo.md index b387967..4fe9b23 100644 --- a/docs/package/examples/c2pa_text_demo.md +++ b/docs/package/examples/c2pa_text_demo.md @@ -1,12 +1,9 @@ -Supported targets: - -- `whitespace` — embed after whitespace characters (default) -- `punctuation` — embed after punctuation marks -- `first_letter` — embed after the first letter of words -- `last_letter` — embed after the last letter of words -- `all_characters` — embed after any character -- `file_end` — append variation selectors at the very end of the text -- `file_end_zwnbsp` — append a zero-width no-break space (U+FEFF) followed by the variation selectors at the end; improves robustness in some pipelines that trim trailing selectors +> **Note:** Legacy embedding targets (``whitespace``, ``punctuation`` and similar) +> remain available for the ``basic``, ``manifest``, and ``cbor_manifest`` formats. +> When ``metadata_format="c2pa"`` the library automatically appends a +> FEFF-prefixed ``C2PATextManifestWrapper`` to the end of the visible text so the +> manifest stays contiguous, satisfying the latest specification. + # Advanced C2PA Text Demo This guide provides a comprehensive walkthrough of the C2PA text demo located in `demos/c2pa_demo/`. 
The demo showcases how to embed C2PA manifests into HTML articles and implement a verification UI with visual indicators.
@@ -54,35 +51,39 @@ Before running the demo:

 ### Embedding Process

-The embedding process uses BeautifulSoup for robust HTML parsing and embeds the C2PA manifest into the first paragraph of the article:
+The embedding process uses BeautifulSoup for robust HTML parsing and appends the C2PA wrapper to the first paragraph of the article:

 ```python
 # From embed_manifest_improved.py
+from datetime import datetime
+
 from bs4 import BeautifulSoup
 from encypher.core.unicode_metadata import UnicodeMetadata
-from encypher.interop.c2pa import c2pa_like_dict_to_encypher_manifest
+from encypher.interop.c2pa import compute_normalized_hash, normalize_text

 # Parse HTML
-soup = BeautifulSoup(html_content, 'html.parser')
+soup = BeautifulSoup(html_content, "html.parser")

 # Find first paragraph within content column
-first_p = soup.select_one('.content-column p')
-
-# Create C2PA manifest
-c2pa_manifest = create_c2pa_manifest(article_text)
-
-# Convert to EncypherAI format
-encypher_manifest = c2pa_like_dict_to_encypher_manifest(c2pa_manifest)
+first_p = soup.select_one(".content-column p")
+
+# Prepare optional custom actions for the c2pa.actions.v1 assertion
+custom_actions = [
+    {
+        "label": "c2pa.created",
+        "when": datetime.now().isoformat(),
+        "softwareAgent": "encypher-ai/demo",
+    }
+]

-# Embed metadata into paragraph text
+# Embed metadata into paragraph text (wrapper appended to the end)
+paragraph_text = normalize_text(first_p.get_text())
 embedded_text = UnicodeMetadata.embed_metadata(
-    text=first_p.get_text(),
+    text=paragraph_text,
     private_key=private_key,
     signer_id=signer_id,
-    metadata_format='cbor_manifest',
-    claim_generator=encypher_manifest.get("claim_generator"),
-    actions=encypher_manifest.get("assertions"),
-    timestamp=encypher_manifest.get("timestamp")
+    metadata_format="c2pa",
+    claim_generator="encypher-ai/demo",
+    actions=custom_actions,
+    add_hard_binding=True,
 )

 # 
Replace paragraph text with embedded text @@ -91,44 +92,19 @@ first_p.string = embedded_text ### Content Hash Calculation -The content hash covers the plain text content of the article: +``UnicodeMetadata`` normalises the paragraph, appends the wrapper, and records the +wrapper's byte span in the ``c2pa.hash.data.v1`` assertion automatically. If you +need to pre-compute the digest for auditing or logging, use the shared helper: ```python -# From embed_manifest_improved.py -def create_c2pa_manifest(article_text): - # Calculate content hash - content_hash = hashlib.sha256(article_text.encode('utf-8')).hexdigest() - - # Create C2PA manifest with content hash assertion - c2pa_manifest = { - "claim_generator": "EncypherAI/2.3.0", - "timestamp": datetime.now().isoformat(), - "assertions": [ - { - "label": "stds.schema-org.CreativeWork", - "data": { - "@context": "https://schema.org/", - "@type": "CreativeWork", - "headline": "The Future of AI", - "author": {"@type": "Person", "name": "Dr. Jane Smith"}, - "publisher": {"@type": "Organization", "name": "Tech Insights"}, - "datePublished": "2025-06-15" - } - }, - { - "label": "stds.c2pa.content.hash", - "data": { - "hash": content_hash, - "alg": "sha256" - }, - "kind": "ContentHash" - } - ] - } - - return c2pa_manifest +paragraph_hash = compute_normalized_hash(paragraph_text) +print("Pre-embed NFC hash:", paragraph_hash.hexdigest) ``` +After embedding you can reproduce the recorded digest by removing the wrapper +bytes from the normalised text and re-running ``compute_normalized_hash`` (see +the basic tutorial for a standalone example). + ### UI Integration The demo includes a Streamlit dashboard that displays the article with verification indicators: @@ -224,7 +200,7 @@ def test_content_tampering(): article_text = '\n'.join([p.get_text() for p in paragraphs]) print("1. 
Original content hash calculation:") - original_hash = hashlib.sha256(article_text.encode('utf-8')).hexdigest() + original_hash = compute_normalized_hash(article_text).hexdigest print(f" Original hash: {original_hash[:10]}...{original_hash[-10:]}") # Tamper with content (change second paragraph) @@ -242,7 +218,7 @@ def test_content_tampering(): # Calculate new hash tampered_paragraphs = soup.find_all('p') tampered_article_text = '\n'.join([p.get_text() for p in tampered_paragraphs]) - tampered_hash = hashlib.sha256(tampered_article_text.encode('utf-8')).hexdigest() + tampered_hash = compute_normalized_hash(tampered_article_text).hexdigest print("\n3. New content hash after tampering:") print(f" Tampered hash: {tampered_hash[:10]}...{tampered_hash[-10:]}") @@ -346,27 +322,27 @@ The original `article.html` is a simple HTML article with a title, author, and m ### Step 2: Embed the C2PA Manifest -Running `embed_manifest_improved.py` embeds a C2PA manifest into the first paragraph of the article: +Running `embed_manifest_improved.py` appends a C2PA text wrapper to the article: -1. The script extracts the text content of the article -2. Calculates a SHA-256 hash of the content -3. Creates a C2PA manifest with the content hash and metadata -4. Embeds the manifest into the first paragraph using Unicode variation selectors -5. Saves the result as `encoded_article.html` +1. The script extracts and normalises the text content of the article. +2. Calculates a SHA-256 hash of the normalised content. +3. Creates a C2PA manifest with the content hash and metadata (including the wrapper exclusion offsets). +4. Packages the manifest store into a JUMBF box and encodes it as a FEFF-prefixed block of variation selectors. +5. Appends the wrapper block to the end of the visible text and saves the result as `encoded_article.html`. 
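Steps 4 and 5 can be approximated with the standard library alone. This sketch assumes the big-endian field order conventional for ISO BMFF-style structures and substitutes placeholder bytes for a real signed JUMBF manifest store:

```python
import struct

MAGIC = 0x4332504154585400  # "C2PATXT\0"


def encode_selector(b: int) -> str:
    # Bytes 0-15 -> U+FE00..U+FE0F; bytes 16-255 -> U+E0100..U+E01EF.
    return chr(0xFE00 + b) if b < 16 else chr(0xE0100 + (b - 16))


def wrap_manifest_store(jumbf: bytes) -> str:
    # Header layout: magic (8 bytes) | version (1 byte) | manifestLength (4 bytes).
    header = struct.pack(">QBI", MAGIC, 1, len(jumbf))
    return "\ufeff" + "".join(encode_selector(b) for b in header + jumbf)


block = wrap_manifest_store(b"placeholder manifest store")
assert block.startswith("\ufeff")
assert len(block) == 1 + 13 + len(b"placeholder manifest store")
```

Appending `block` to the visible article text is all step 5 does; the library performs the same packing internally.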
### Step 3: View the Encoded Article -The encoded article looks visually identical to the original, but the first paragraph now contains invisible Unicode variation selectors that encode the C2PA manifest. +The encoded article looks visually identical to the original, but it now terminates with an invisible block of Unicode variation selectors that contains the C2PA manifest. ### Step 4: Verify the Encoded Article Running `temp_verify.py` verifies the embedded metadata and content hash: -1. Extracts the embedded metadata from the first paragraph -2. Verifies the digital signature using the public key -3. Extracts the stored content hash from the manifest -4. Calculates the current content hash -5. Compares the stored and current hashes +1. Locates the FEFF-prefixed wrapper at the end of the text and decodes the JUMBF manifest store. +2. Verifies the COSE signature using the public key. +3. Reads the recorded exclusion offsets and removes the wrapper bytes from the normalised text. +4. Calculates the current content hash. +5. Compares the stored and current hashes. If the verification is successful, you'll see: @@ -398,57 +374,43 @@ The Streamlit dashboard displays: ## Customization Options -### Embedding Target +### Embedding Location -You can customize where the metadata is embedded by modifying the target parameter: +C2PA manifests always append a FEFF-prefixed wrapper to the end of the text. +The ``target`` parameter is ignored for ``metadata_format="c2pa"`` because the +specification requires the wrapper to remain contiguous. + +### Manifest Content + +Customise the manifest by adjusting the ``claim_generator`` string, providing +pre-existing action entries, or toggling ``add_hard_binding``. The library +constructs the remaining assertions automatically. 
```python +custom_actions = [ + { + "label": "c2pa.created", + "softwareAgent": "YourApp/1.0.0", + "when": datetime.now().isoformat(), + }, + { + "label": "c2pa.captured", + "softwareAgent": "YourApp/1.0.0", + "description": "Article prepared with EncypherAI", + }, +] + embedded_text = UnicodeMetadata.embed_metadata( text=first_p.get_text(), private_key=private_key, signer_id=signer_id, - metadata_format='cbor_manifest', - target="whitespace", # Options: "whitespace", "punctuation", "first_letter", "last_letter", - # "all_characters", "file_end", "file_end_zwnbsp" - # Other parameters... + metadata_format="c2pa", + claim_generator="YourApp/1.0.0", + actions=custom_actions, + add_hard_binding=True, ) ``` -### Manifest Content - -You can customize the C2PA manifest by modifying the `create_c2pa_manifest` function: - -```python -def create_c2pa_manifest(article_text): - # Calculate content hash - content_hash = hashlib.sha256(article_text.encode('utf-8')).hexdigest() - - # Create C2PA manifest with custom assertions - c2pa_manifest = { - "claim_generator": "YourApp/1.0.0", - "timestamp": datetime.now().isoformat(), - "assertions": [ - # Add your custom assertions here - { - "label": "stds.schema-org.CreativeWork", - "data": { - # Your metadata here - } - }, - { - "label": "stds.c2pa.content.hash", - "data": { - "hash": content_hash, - "alg": "sha256" - }, - "kind": "ContentHash" - } - ] - } - - return c2pa_manifest -``` - ### UI Customization You can customize the verification UI by modifying the CSS and HTML in `demo_dashboard.py`: diff --git a/docs/package/examples/jupyter.md b/docs/package/examples/jupyter.md index 72a95b9..cb66db8 100644 --- a/docs/package/examples/jupyter.md +++ b/docs/package/examples/jupyter.md @@ -441,7 +441,7 @@ from cryptography.hazmat.primitives.asymmetric.types import PublicKeyTypes from typing import Optional, Dict import time import json -import hashlib +from encypher.interop.c2pa import compute_normalized_hash # Create a custom metadata 
encoder class class CustomMetadataEncoder: @@ -458,7 +458,7 @@ class CustomMetadataEncoder: metadata[f"{self.custom_prefix}_timestamp"] = int(time.time()) # Add a hash of the text - text_hash = hashlib.sha256(text.encode()).hexdigest() + text_hash = compute_normalized_hash(text).hexdigest metadata[f"{self.custom_prefix}_text_hash"] = text_hash # Use the parent class to encode diff --git a/docs/package/technical/content-hash-and-embedding.md b/docs/package/technical/content-hash-and-embedding.md index de607c5..f51dcea 100644 --- a/docs/package/technical/content-hash-and-embedding.md +++ b/docs/package/technical/content-hash-and-embedding.md @@ -1,163 +1,125 @@ # Content Hash Coverage and Embedding Technical Details -This document provides a detailed technical explanation of how EncypherAI's C2PA text embedding approach works, specifically focusing on content hash coverage and the embedding mechanism. +This document explains how EncypherAI embeds Coalition for Content Provenance and Authenticity (C2PA) manifests inside plain +text while respecting the updated `C2PATextManifestWrapper` specification. It focuses on two critical implementation details: -## What the Content Hash Covers +1. **How we compute and record the hard-binding content hash** +2. **How the manifest store is wrapped, encoded as Unicode variation selectors, and appended to the text** -The content hash in our implementation covers the plain text content of the article - specifically: +The goal is to make the manifest portable with the text itself—copy and paste operations keep the provenance intact—while +remaining fully compatible with the C2PA validation model. -### Text Extraction Process +## Content Hash Normalisation and Exclusions -1. 
The code extracts all paragraph text from the article
-   - It looks for paragraphs in content columns first, then falls back to direct paragraph search
-   - All paragraph texts are joined with double newlines (`"\n\n"`)
-   - This extracted plain text is saved to `clean_text_for_hashing.txt` as a reference
+C2PA requires producers and consumers to normalise text to Unicode Normalisation Form C (NFC) before hashing. To guarantee that
+every code path performs the exact same sequence of operations we rely on the shared helper `compute_normalized_hash()` from
+`encypher.interop.c2pa.text_hashing`.

-### Hash Generation
+Our embedding pipeline follows these rules:

-1. A SHA-256 hash is calculated on this extracted text
-2. The hash is computed on the UTF-8 encoded version of the text
-3. This happens before any metadata embedding occurs
+1. **Normalise and hash via the helper**: `compute_normalized_hash(original_text)` returns the NFC-normalised string,
+   its UTF-8 bytes, and the SHA-256 digest used for the `c2pa.hash.data.v1` assertion.
+2. **Append the wrapper** (described later) to the end of the original text. The wrapper occupies a contiguous range of bytes
+   that do not belong to the visible content.
+3. **Record exclusion offsets** for the wrapper. Offsets are expressed as byte positions within the NFC-normalised text, using
+   the structure `{"start": <byte offset>, "length": <byte count>}`. This exclusion list is stored in the `c2pa.hash.data.v1`
+   assertion so that validators know which bytes to ignore before hashing.

-### Hash Usage
+During verification we repeat the same procedure:

-1. The hash is included in the manifest as a `stds.c2pa.content.hash` assertion
-2. This assertion includes both the hash value and the algorithm used (sha256)
+- Detect the wrapper span and pass the full text plus the exclusion tuple to `compute_normalized_hash()`.
+- Apply the exclusion offsets from the manifest and ensure they match the detected wrapper span.
+- Compare the calculated hash against the manifest assertion. Any mismatch triggers tamper detection. -### Important Distinction +This guarantees that copy/paste operations (which keep the wrapper) and validators (which must remove it before hashing) remain +synchronised. -- The hash covers only the plain text content, not the HTML markup -- The hash does not include the embedded metadata itself (the Unicode variation selectors) -- This creates a "snapshot" of the original content at the time of signing +## `C2PATextManifestWrapper` Layout -This approach allows for tamper detection - if the text content is modified after embedding, the hash of the current content will no longer match the hash stored in the embedded manifest. +All manifests embedded in unstructured text conform to the binary layout mandated by the specification: -## How Our C2PA-like Embedding Actually Works +```text +aligned(8) class C2PATextManifestWrapper { + unsigned int(64) magic = 0x4332504154585400; // "C2PATXT\0" + unsigned int(8) version = 1; + unsigned int(32) manifestLength; + unsigned int(8) jumbfContainer[manifestLength]; +} +``` -### Single-Point Embedding with Zero-Width Characters +Key points: -1. The metadata (manifest) is embedded as a sequence of Unicode variation selectors -2. These are zero-width, non-printing characters (code points in ranges U+FE00-FE0F and U+E0100-E01EF) -3. All metadata is attached to a single character in the text (by default, the first whitespace) -4. The original character is preserved, and the variation selectors are inserted immediately after it +- The wrapper is **prefixed with a single U+FEFF** (Zero-Width No-Break Space). This marker makes it easy for validators to + locate the wrapper even if other variation selectors appear in the text for unrelated reasons. +- `manifestLength` records the size of the embedded C2PA manifest store. +- `jumbfContainer` carries the manifest store encoded as a JUMBF box. 
We serialise the store with canonical JSON ordering to + obtain deterministic bytes before signing. -### This is Still Hard Binding Because +## Variation Selector Encoding -- The manifest is directly embedded within the content itself -- The manifest travels with the content as part of the same file -- The binding is inseparable from the content +Every byte of the header and manifest store is converted to an invisible Unicode variation selector so that the wrapper travels +with the text: -### Not a Hybrid Approach Because +- Bytes 0–15 map to `U+FE00`–`U+FE0F`. +- Bytes 16–255 map to `U+E0100`–`U+E01EF`. -In a true hybrid approach, you would have: -1. A manifest stored separately from the content (soft binding component) -2. A small reference embedded in the content pointing to the external manifest (hard binding component) +Decoding performs the inverse mapping and rejects any code points outside these ranges, ensuring corrupted wrappers are detected. -Our implementation embeds the entire manifest directly in the content. The content hash we include is just an assertion within the hard-bound manifest. +## Embedding Workflow -## Implementation Details +The high-level embedding steps executed by `UnicodeMetadata._embed_c2pa` are: -### Embedding Process +1. **Build the manifest**: Construct the C2PA manifest with mandatory assertions (actions, optional AI metadata, etc.). If a + hard binding is requested we insert a `c2pa.hash.data.v1` assertion whose `exclusions` list initially matches the last + computed offsets. +2. **Sign the manifest**: Serialise the manifest to CBOR, produce a COSE `Sign1` structure with the Ed25519 private key, and + package the result inside a minimal JUMBF box. +3. **Encode the wrapper**: Pack the `magic`, `version`, and `manifestLength` fields with the JUMBF bytes, convert them to + variation selectors, and prefix the block with U+FEFF. +4. **Append the block**: Place the wrapper after the visible text as a single contiguous run. 
The plain text itself is not
+   otherwise modified; the wrapper is the only addition.
+5. **Stabilise exclusion offsets**: Because the wrapper length depends on the manifest, we recompute the exclusion list until it
+   stabilises (usually immediately). The final manifest is re-signed once the offsets are correct.

-```python
-from encypher.core.payloads import serialize_jumbf_payload, deserialize_jumbf_payload
-
-def embed_metadata(text, metadata, metadata_format="json"):
-    """
-    Embeds metadata into text using Unicode variation selectors.
-
-    Args:
-        text (str): The text to embed metadata into
-        metadata (dict or bytes): The metadata to embed
-        metadata_format (str): Format of the metadata
-                               ("json", "cbor_manifest", or "jumbf")
-
-    Returns:
-        str: Text with embedded metadata
-    """
-    # Serialize metadata based on format
-    if metadata_format == "json":
-        serialized = json.dumps(metadata).encode("utf-8")
-    elif metadata_format == "cbor_manifest":
-        if isinstance(metadata, dict):
-            serialized = cbor2.dumps(metadata)
-        else:
-            serialized = metadata  # Already serialized
-    elif metadata_format == "jumbf":
-        serialized = serialize_jumbf_payload(metadata)
-    else:
-        raise ValueError(f"Unsupported metadata format: {metadata_format}")
-
-    # Convert to binary and encode using variation selectors
-    binary_data = base64.b64encode(serialized).decode("ascii")
-    encoded_metadata = _encode_to_variation_selectors(binary_data)
-
-    # Find position to insert (typically after first character)
-    if len(text) > 0:
-        return text[0] + encoded_metadata + text[1:]
-    else:
-        return encoded_metadata
-```
+The resulting string looks like `visible_text + "\uFEFF" + <variation selectors>`. When rendered, the wrapper is invisible but it
+remains part of the Unicode stream.

-### Extraction Process
+### Example

 ```python
-def extract_metadata(text, metadata_format="json"):
-    """
-    Extracts metadata from text with embedded Unicode variation selectors.
- - Args: - text (str): Text with embedded metadata - metadata_format (str): Format of the metadata ("json", "cbor_manifest", or "jumbf") - - Returns: - dict or bytes: Extracted metadata - """ - # Extract variation selectors - encoded_data = "" - for char in text: - if 0xFE00 <= ord(char) <= 0xFE0F or 0xE0100 <= ord(char) <= 0xE01EF: - encoded_data += char - - if not encoded_data: - return None - - # Decode from variation selectors to binary - binary_data = _decode_from_variation_selectors(encoded_data) - serialized = base64.b64decode(binary_data) - - # Deserialize based on format - if metadata_format == "json": - return json.loads(serialized.decode("utf-8")) - elif metadata_format == "cbor_manifest": - return cbor2.loads(serialized) - elif metadata_format == "jumbf": - return deserialize_jumbf_payload(serialized) - else: - raise ValueError(f"Unsupported metadata format: {metadata_format}") +from encypher.core.unicode_metadata import UnicodeMetadata +from encypher.core.keys import generate_ed25519_key_pair + +text = "Provenance-enabled article" +private_key, _ = generate_ed25519_key_pair() +wrapper_ready_text = UnicodeMetadata.embed_metadata( + text=text, + private_key=private_key, + signer_id="demo-signer", + metadata_format="c2pa", +) +assert wrapper_ready_text.endswith("\ufeff") is False # The FEFF is followed by variation selectors ``` -## Verification Process - -The verification process involves two key steps: +## Extraction and Verification Workflow -1. **Signature Verification**: Ensures the manifest itself hasn't been tampered with - - Extracts the embedded metadata using Unicode variation selectors - - Verifies the digital signature using the provided public key - - If the signature is invalid, verification fails immediately +Validators follow the inverse process: -2. 
**Content Hash Verification**: Ensures the text content hasn't been modified - - Extracts the stored content hash from the manifest - - Calculates a fresh hash of the current content using the same algorithm - - Compares the stored hash with the freshly calculated hash - - If they don't match, the content has been tampered with +1. **Locate the wrapper** by scanning for U+FEFF followed by a contiguous run of variation selectors. +2. **Decode the header**, verify the `C2PATXT\0` magic value, check the version, and ensure the manifest length matches the + decoded byte count. +3. **Recover the JUMBF manifest store** and feed it into the COSE verification flow. +4. **Normalise and hash the visible text**, excluding the byte range recorded in the manifest, and compare the digest against the + stored `c2pa.hash.data.v1` assertion. -This two-step verification process provides comprehensive tamper detection for both the manifest and the content it describes. +If multiple wrappers appear the verifier rejects the content with the `manifest.text.multipleWrappers` failure code. If decoding +fails part-way through the block we emit `manifest.text.corruptedWrapper`. -## Advantages of This Approach +## Advantages of the Updated Flow -1. **Invisibility**: The embedding doesn't visibly alter the text appearance -2. **Portability**: The metadata travels with the content -3. **Robustness**: Works across different text formats and platforms -4. **Standards Alignment**: Compatible with C2PA concepts and structures -5. **Tamper Detection**: Provides comprehensive verification of both metadata and content integrity +- **Specification alignment**: The wrapper structure, FEFF prefix, and exclusion handling match the proposed C2PA text + embedding rules. +- **Copy/paste resilience**: The wrapper stays attached to the text while remaining invisible to readers. 
+- **Deterministic hashing**: NFC normalisation and explicit exclusion offsets guarantee that hard-binding hashes are stable + across producers and consumers. +- **Interoperability**: By serialising a complete manifest store in JUMBF we are compatible with the broader C2PA ecosystem. diff --git a/docs/package/user-guide/binding_approaches.md b/docs/package/user-guide/binding_approaches.md index b9d463a..3fb8bb5 100644 --- a/docs/package/user-guide/binding_approaches.md +++ b/docs/package/user-guide/binding_approaches.md @@ -145,7 +145,7 @@ Our implementation embeds a complete C2PA manifest directly into the text, inclu ```python from encypher.core.unicode_metadata import UnicodeMetadata from encypher.core.keys import generate_ed25519_key_pair -import hashlib +from encypher.interop.c2pa import compute_normalized_hash # Generate keys private_key, public_key = generate_ed25519_key_pair() @@ -159,7 +159,7 @@ metadata = { "author": "John Doe", "publisher": "Example Publisher", "timestamp": "2025-06-16T15:00:00Z", - "content_hash": hashlib.sha256(text.encode('utf-8')).hexdigest() + "content_hash": compute_normalized_hash(text).hexdigest } # Embed metadata (hard binding) diff --git a/docs/package/user-guide/c2pa-relationship.md b/docs/package/user-guide/c2pa-relationship.md index bc0673d..675219a 100644 --- a/docs/package/user-guide/c2pa-relationship.md +++ b/docs/package/user-guide/c2pa-relationship.md @@ -28,9 +28,9 @@ Our implementation fully supports the core security features of the C2PA standar This self-contained example demonstrates the end-to-end workflow: creating a manifest, embedding it, and verifying it. ```python -import hashlib from encypher.core.keys import generate_ed25519_key_pair from encypher.core.unicode_metadata import UnicodeMetadata +from encypher.interop.c2pa import compute_normalized_hash def run_c2pa_text_demo(): """Demonstrates embedding and verifying a C2PA manifest in text.""" @@ -44,6 +44,8 @@ def run_c2pa_text_demo(): # 3. 
Create a C2PA manifest dictionary # This includes a hard-binding content hash to protect against tampering. + hash_result = compute_normalized_hash(original_text) + c2pa_manifest = { "claim_generator": "EncypherAI-SDK/1.1.0", "assertions": [ @@ -61,7 +63,7 @@ def run_c2pa_text_demo(): { "label": "c2pa.hash.data.v1", "data": { - "hash": hashlib.sha256(original_text.encode("utf-8")).hexdigest(), + "hash": hash_result.hexdigest, "alg": "sha256", }, "kind": "ContentHash", diff --git a/docs/package/user-guide/text_provenance.md b/docs/package/user-guide/text_provenance.md index a6fe0c4..8937345 100644 --- a/docs/package/user-guide/text_provenance.md +++ b/docs/package/user-guide/text_provenance.md @@ -37,9 +37,13 @@ Our approach uses Unicode variation selectors (ranges U+FE00-FE0F and U+E0100-E0 - The embedded data travels with the text as part of the content itself - The visual appearance of the text remains unchanged +When you embed a full C2PA manifest (`metadata_format="c2pa"`), the bytes follow the `C2PATextManifestWrapper` layout. The +wrapper is prefixed with `U+FEFF`, contains a JUMBF manifest store, and is appended to the end of the text as a contiguous run +of variation selectors. + ### Single-Point Embedding -The default embedding strategy places all metadata after a single target character (typically the first whitespace or the first letter): +For legacy formats, the default embedding strategy places all metadata after a single target character (typically the first whitespace or the first letter): ``` Original: This is example text. @@ -48,14 +52,16 @@ Embedded: This⁠︀︁︂︃︄︅︆︇︈︉︊︋︌︍︎️ is example tex The variation selectors (represented by ⁠︀︁︂︃︄︅︆︇︈︉︊︋︌︍︎️ above, though invisible in actual use) are attached to the first character, encoding the entire manifest. +When using the C2PA format we instead append the FEFF-prefixed wrapper to the end of the text so validators can easily locate it and remove the wrapper before hashing. 
+ ### Content Hash Coverage A critical component of our implementation is the content hash assertion: -- The hash covers the plain text content (all paragraphs concatenated) -- It does not include HTML markup or the variation selectors themselves -- SHA-256 is used as the hashing algorithm -- The hash is computed before embedding the metadata +- The text is normalised to NFC before hashing. +- The hash covers the plain text content (all paragraphs concatenated) with the wrapper bytes excluded. +- SHA-256 is used as the hashing algorithm. +- The hash is computed before embedding the metadata, and the wrapper byte range is recorded in the manifest `exclusions` list. This content hash enables tamper detection - if the text is modified after embedding, the current hash will no longer match the stored hash. @@ -121,8 +127,7 @@ A robust verification process should: ```python from encypher.core.unicode_metadata import UnicodeMetadata from encypher.core.keys import generate_ed25519_key_pair -from encypher.interop.c2pa import c2pa_like_dict_to_encypher_manifest -import hashlib +from encypher.interop.c2pa import compute_normalized_hash from datetime import datetime # 1. Generate keys (or load existing keys) @@ -134,97 +139,74 @@ article_text = """This is the full article text. It contains multiple paragraphs. All of this text will be hashed for the content hash assertion.""" -# 3. Calculate content hash -content_hash = hashlib.sha256(article_text.encode('utf-8')).hexdigest() - -# 4. 
Create C2PA manifest
-c2pa_manifest = {
-    "claim_generator": "EncypherAI/2.3.0",
-    "timestamp": datetime.now().isoformat(),
-    "assertions": [
-        {
-            "label": "stds.schema-org.CreativeWork",
-            "data": {
-                "@context": "https://schema.org/",
-                "@type": "CreativeWork",
-                "headline": "Example Article",
-                "author": {"@type": "Person", "name": "John Doe"},
-                "publisher": {"@type": "Organization", "name": "Example Publisher"},
-                "datePublished": "2025-06-15"
-            }
-        },
-        {
-            "label": "stds.c2pa.content.hash",
-            "data": {
-                "hash": content_hash,
-                "alg": "sha256"
-            },
-            "kind": "ContentHash"
-        }
-    ]
-}
-
-# 5. Convert to EncypherAI format
-encypher_manifest = c2pa_like_dict_to_encypher_manifest(c2pa_manifest)
-
-# 6. Extract first paragraph for embedding
-first_paragraph = article_text.split('\n')[0]
-# 7. Embed into first paragraph
-embedded_paragraph = UnicodeMetadata.embed_metadata(
-    text=first_paragraph,
+# 3. (Optional) Inspect the baseline hash before embedding
+baseline_hash = compute_normalized_hash(article_text).hexdigest
+print("Baseline NFC hash:", baseline_hash)
+
+# 4. Define optional action entries that will appear in c2pa.actions.v1
+custom_actions = [
+    {
+        "label": "c2pa.created",
+        "softwareAgent": "EncypherAI/guide",
+        "when": datetime.now().isoformat(),
+    }
+]
+
+# 5. Embed the manifest as a FEFF-prefixed wrapper at the end of the article
+embedded_article = UnicodeMetadata.embed_metadata(
+    text=article_text,
     private_key=private_key,
     signer_id=signer_id,
-    metadata_format='cbor_manifest',
-    claim_generator=encypher_manifest.get("claim_generator"),
-    actions=encypher_manifest.get("assertions"),
-    timestamp=encypher_manifest.get("timestamp")
+    metadata_format="c2pa",
+    claim_generator="EncypherAI/guide",
+    actions=custom_actions,
+    add_hard_binding=True,
 )
-
-# 8. 
Replace first paragraph in article -embedded_article = article_text.replace(first_paragraph, embedded_paragraph) ``` ### Verification Example ```python from encypher.core.unicode_metadata import UnicodeMetadata -from encypher.interop.c2pa import encypher_manifest_to_c2pa_like_dict -import hashlib +from encypher.interop.c2pa import compute_normalized_hash -# Define key provider function -def key_provider(kid): - if kid == signer_id: + +def key_provider(requested_signer_id: str): + if requested_signer_id == signer_id: return public_key return None -# Extract first paragraph (which contains the embedded metadata) -first_paragraph = embedded_article.split('\n')[0] -# Verify and extract metadata -is_verified, extracted_signer_id, extracted_manifest = UnicodeMetadata.verify_and_extract_metadata( - text=first_paragraph, - public_key_provider=key_provider +is_verified, extracted_signer_id, manifest = UnicodeMetadata.verify_metadata( + text=embedded_article, + public_key_resolver=key_provider, ) -if is_verified: - # Convert back to C2PA format - c2pa_extracted = encypher_manifest_to_c2pa_like_dict(extracted_manifest) - - # Verify content hash - current_content_hash = hashlib.sha256(article_text.encode('utf-8')).hexdigest() - - # Find content hash assertion - stored_hash = None - for assertion in c2pa_extracted.get("assertions", []): - if assertion.get("label") == "stds.c2pa.content.hash": - stored_hash = assertion["data"]["hash"] - break - - if stored_hash == current_content_hash: +if is_verified and manifest is not None: + # Locate the hard-binding assertion + content_hash_assertion = next( + assertion + for assertion in manifest.get("assertions", []) + if assertion.get("label") == "c2pa.hash.data.v1" + ) + exclusions = [ + (item["start"], item["length"]) + for item in content_hash_assertion["data"].get("exclusions", []) + ] + current_hash = compute_normalized_hash(embedded_article, exclusions).hexdigest + if current_hash == content_hash_assertion["data"]["hash"]: 
print("Content hash verification successful!") else: - print("Content hash verification failed - content may have been tampered with.") + print("Content hash verification failed – content may have been tampered with.") else: print("Signature verification failed!") ``` diff --git a/encypher/core/unicode_metadata.py b/encypher/core/unicode_metadata.py index e064726..87e208d 100644 --- a/encypher/core/unicode_metadata.py +++ b/encypher/core/unicode_metadata.py @@ -11,6 +11,7 @@ import hashlib import json import re +import unicodedata import uuid from datetime import date, datetime, timezone from typing import Any, Callable, Dict, List, Literal, Optional, Tuple, Union, cast @@ -22,6 +23,9 @@ from cryptography.hazmat.primitives.asymmetric.types import PrivateKeyTypes from pycose.messages import CoseMessage +from encypher.interop.c2pa.text_hashing import compute_normalized_hash, normalize_text +from encypher.interop.c2pa.text_wrapper import encode_wrapper, find_and_decode + from encypher import __version__ from .constants import MetadataTarget @@ -777,139 +781,130 @@ def _embed_c2pa( if not isinstance(private_key, ed25519.Ed25519PrivateKey): raise TypeError("For C2PA embedding, 'private_key' must be an Ed25519PrivateKey instance.") - # --- 1. Construct the C2PA Manifest --- - c2pa_manifest: C2PAPayload = { - "@context": "https://c2pa.org/schemas/v2.2/c2pa.jsonld", - "instance_id": str(uuid.uuid4()), - "claim_generator": claim_generator or f"encypher-ai/{__version__}", - "assertions": [], - } + base_hash_result = compute_normalized_hash(text) + content_hash = base_hash_result.hexdigest + + current_exclusions: List[Dict[str, int]] = [] - # 0. Compute text hash (hard-binding) before wrapper is attached - cls._compute_text_hash(text, algorithm="sha256") + base_actions: List[Dict[str, Any]] = copy.deepcopy(actions) if actions else [] + claim_gen = claim_generator or f"encypher-ai/{__version__}" + instance_id = str(uuid.uuid4()) - # 1. 
Build mandatory C2PA manifest skeleton - # a) c2pa.actions.v1 assertion - actions_data: Dict[str, Any] = {"actions": actions if actions is not None else []} - if not any(a.get("label") == "c2pa.created" for a in actions_data["actions"]): - created_action = { + if not any(a.get("label") == "c2pa.created" for a in base_actions): + created_action: Dict[str, Any] = { "label": "c2pa.created", "digitalSourceType": "http://cv.iptc.org/newscodes/digitalsourcetype/trainedAlgorithmicMedia", - "softwareAgent": c2pa_manifest["claim_generator"], + "softwareAgent": claim_gen, } if iso_timestamp is not None: created_action["when"] = iso_timestamp - actions_data["actions"].insert(0, created_action) - c2pa_manifest["assertions"].append({"label": "c2pa.actions.v1", "data": actions_data, "kind": "Actions"}) - - # b) c2pa.hash.data.v1 (Hard Binding) - if add_hard_binding: - clean_text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest() - c2pa_manifest["assertions"].append( - {"label": "c2pa.hash.data.v1", "data": {"hash": clean_text_hash, "alg": "sha256", "exclusions": []}, "kind": "ContentHash"} - ) + base_actions.insert(0, created_action) - # --- 3. Prepare for Soft Binding (Deterministic Hashing) --- - # a) Create a temporary manifest copy that includes a placeholder soft binding. - manifest_for_hashing = copy.deepcopy(c2pa_manifest) - - # b) Add the placeholder soft binding and watermarked action to the copy. 
- placeholder_soft_binding: "C2PAAssertion" = { - "label": "c2pa.soft_binding.v1", - "data": {"alg": "encypher.unicode_variation_selector.v1", "hash": ""}, - "kind": "SoftBinding", - } - manifest_for_hashing["assertions"].append(placeholder_soft_binding) - - actions_data_copy = next((a["data"] for a in manifest_for_hashing["assertions"] if a["label"] == "c2pa.actions.v1"), None) - if actions_data_copy and isinstance(actions_data_copy.get("actions"), list): - wm_action_copy = { - "label": "c2pa.watermarked", - "softwareAgent": c2pa_manifest["claim_generator"], - "description": "Text embedded with Unicode variation selectors.", - } - if iso_timestamp is not None: - wm_action_copy["when"] = iso_timestamp - actions_data_copy["actions"].append(wm_action_copy) - - # c) Serialize the modified manifest and calculate the definitive hash. - cbor_for_hashing = serialize_c2pa_payload_to_cbor(manifest_for_hashing) - actual_soft_binding_hash = hashlib.sha256(cbor_for_hashing).hexdigest() - - # d) Create the final soft binding assertion with the real hash. - final_soft_binding_assertion: "C2PAAssertion" = { - "label": "c2pa.soft_binding.v1", - "data": {"alg": "encypher.unicode_variation_selector.v1", "hash": actual_soft_binding_hash}, - "kind": "SoftBinding", - } - c2pa_manifest["assertions"].append(final_soft_binding_assertion) - - # e) Add the 'watermarked' action to the original manifest. - wm_action = { + wm_action: Dict[str, Any] = { "label": "c2pa.watermarked", - "softwareAgent": c2pa_manifest["claim_generator"], + "softwareAgent": claim_gen, "description": "Text embedded with Unicode variation selectors.", } if iso_timestamp is not None: wm_action["when"] = iso_timestamp - actions_data["actions"].append(wm_action) - # --- 4. Finalize, Serialize, and Sign --- - # Re-serialize the final manifest, which now includes the correct soft binding hash. 
- final_cbor_payload_bytes = serialize_c2pa_payload_to_cbor(c2pa_manifest) + wrapper_text = "" + cose_sign1_bytes: Optional[bytes] = None - # Sign the final CBOR payload using COSE_Sign1. - cose_sign1_bytes = sign_c2pa_cose(private_key, final_cbor_payload_bytes) + MAX_ITERATIONS = 6 + for _ in range(MAX_ITERATIONS): + c2pa_manifest: C2PAPayload = { + "@context": "https://c2pa.org/schemas/v2.2/c2pa.jsonld", + "instance_id": instance_id, + "claim_generator": claim_gen, + "assertions": [], + } - # --- 5. Package and Embed --- - # The outer structure contains the signed COSE object. - outer_payload_to_embed = { - "cose_sign1": base64.b64encode(cose_sign1_bytes).decode("utf-8"), - "signer_id": signer_id, - "format": "c2pa", - } - outer_bytes = serialize_payload(dict(outer_payload_to_embed)) - selector_chars = cls._bytes_to_variation_selectors(outer_bytes) + actions_data: Dict[str, Any] = {"actions": copy.deepcopy(base_actions)} + c2pa_manifest["assertions"].append({"label": "c2pa.actions.v1", "data": actions_data, "kind": "Actions"}) + + if add_hard_binding: + hard_binding_data = { + "hash": content_hash, + "alg": "sha256", + "exclusions": copy.deepcopy(current_exclusions), + } + c2pa_manifest["assertions"].append( + {"label": "c2pa.hash.data.v1", "data": hard_binding_data, "kind": "ContentHash"} + ) - # --- 6. 
Find Targets and Embed --- - embedding_target = target if target is not None else MetadataTarget.WHITESPACE - target_indices = cls.find_targets(text, embedding_target) - target_display = embedding_target.value if hasattr(embedding_target, "value") else embedding_target + manifest_for_hashing = copy.deepcopy(c2pa_manifest) + placeholder_soft_binding: C2PAAssertion = { + "label": "c2pa.soft_binding.v1", + "data": {"alg": "encypher.unicode_variation_selector.v1", "hash": ""}, + "kind": "SoftBinding", + } + manifest_for_hashing["assertions"].append(placeholder_soft_binding) - if not target_indices: - raise ValueError(f"No suitable targets found in text using target '{target_display}'.") + actions_data_copy = next( + (a["data"] for a in manifest_for_hashing["assertions"] if a.get("label") == "c2pa.actions.v1"), + None, + ) + if actions_data_copy and isinstance(actions_data_copy.get("actions"), list): + actions_data_copy["actions"].append(copy.deepcopy(wm_action)) - # Ensure we have the original text in the output - logger.debug(f"Embedding {len(selector_chars)} variation selectors into text") + cbor_for_hashing = serialize_c2pa_payload_to_cbor(manifest_for_hashing) + actual_soft_binding_hash = hashlib.sha256(cbor_for_hashing).hexdigest() - if distribute_across_targets: - if len(target_indices) < len(selector_chars): - raise ValueError(f"Not enough targets ({len(target_indices)}) found to distribute {len(selector_chars)} selectors.") - result_parts = [] - last_text_idx = 0 - for i, target_idx in enumerate(target_indices): - if i < len(selector_chars): - result_parts.append(text[last_text_idx : target_idx + 1]) # Include the target character - result_parts.append(selector_chars[i]) # Add the selector after the target - last_text_idx = target_idx + 1 - else: - break - result_parts.append(text[last_text_idx:]) # Add remaining text - return "".join(result_parts) - else: - # Insert all selectors after the first target character - # Ensure the original text is preserved by 
keeping it intact and only adding selectors - target_idx = target_indices[0] - result = text[: target_idx + 1] + "".join(selector_chars) + text[target_idx + 1 :] + final_soft_binding_assertion: C2PAAssertion = { + "label": "c2pa.soft_binding.v1", + "data": {"alg": "encypher.unicode_variation_selector.v1", "hash": actual_soft_binding_hash}, + "kind": "SoftBinding", + } + c2pa_manifest["assertions"].append(final_soft_binding_assertion) + actions_data["actions"].append(copy.deepcopy(wm_action)) - # Verify the original text is preserved in the result - if text not in result: - logger.warning("Original text not preserved in embedding result. Adjusting embedding strategy.") - # Alternative approach: append selectors at the end of text - result = text + "".join(selector_chars) + final_cbor_payload_bytes = serialize_c2pa_payload_to_cbor(c2pa_manifest) + cose_sign1_bytes = sign_c2pa_cose(private_key, final_cbor_payload_bytes) - return result + jumbf_payload = { + "format": "c2pa", + "signer_id": signer_id, + "cose_sign1": base64.b64encode(cose_sign1_bytes).decode("utf-8"), + } + jumbf_bytes = serialize_jumbf_payload(jumbf_payload) + wrapper_text = encode_wrapper(jumbf_bytes) + + final_text = text + wrapper_text + + exclusion_tuples: List[Tuple[int, int]] = [] + new_exclusions: List[Dict[str, int]] = [] + + if add_hard_binding: + wrapper_length_bytes = len(wrapper_text.encode("utf-8")) + normalized_final_text = normalize_text(final_text) + wrapper_index = normalized_final_text.rfind(wrapper_text) + if wrapper_index < 0: + raise RuntimeError("Failed to locate C2PA wrapper inside normalized text") + exclusion_start = len(normalized_final_text[:wrapper_index].encode("utf-8")) + exclusion_tuples = [(exclusion_start, wrapper_length_bytes)] + new_exclusions = [{"start": exclusion_start, "length": wrapper_length_bytes}] + + hash_result = compute_normalized_hash(final_text, exclusion_tuples) + actual_hash = hash_result.hexdigest + + if add_hard_binding: + if actual_hash != 
content_hash: + content_hash = actual_hash + current_exclusions = new_exclusions + continue + if new_exclusions != current_exclusions: + current_exclusions = new_exclusions + continue + break + else: + raise RuntimeError("Failed to stabilise C2PA wrapper exclusion offsets") + + if not wrapper_text: + raise RuntimeError("Failed to produce C2PA wrapper text") + logger.info("Successfully embedded C2PA manifest for signer '%s'.", signer_id) + return text + wrapper_text @classmethod def verify_metadata( cls, @@ -960,13 +955,33 @@ def verify_metadata( # --- Format-Specific Verification Dispatch --- if payload_format == "c2pa": - clean_text = cls._strip_variation_selectors(text) + try: + manifest_bytes, clean_text, span = find_and_decode(text) + except ValueError as err: + logger.warning(f"Failed to decode C2PA wrapper during verification: {err}") + return False, signer_id, None + + if manifest_bytes is None or span is None: + logger.warning("C2PA format indicated but no text wrapper found.") + return False, signer_id, None + + wrapper_segment = text[span[0] : span[1]] + normalized_full_text = unicodedata.normalize("NFC", text) + normalized_index = normalized_full_text.rfind(wrapper_segment) + if normalized_index < 0: + logger.warning("Unable to locate wrapper segment in normalized text during verification.") + return False, signer_id, None + + exclusion_start = len(normalized_full_text[:normalized_index].encode("utf-8")) + exclusion_length = len(wrapper_segment.encode("utf-8")) + return cls._verify_c2pa( - text=clean_text, + original_text=text, outer_payload=outer_payload, public_key_resolver=public_key_resolver, return_payload_on_failure=return_payload_on_failure, require_hard_binding=require_hard_binding, + wrapper_exclusion=(exclusion_start, exclusion_length), ) # --- Legacy Format Verification ('basic', 'manifest', 'cbor_manifest') --- @@ -1076,11 +1091,12 @@ def verify_metadata( @classmethod def _verify_c2pa( cls, - text: str, + original_text: str, outer_payload: 
OuterPayload, public_key_resolver: Callable[[str], Optional[Ed25519PublicKey]], return_payload_on_failure: bool, require_hard_binding: bool, # New parameter + wrapper_exclusion: Optional[Tuple[int, int]], ) -> Tuple[bool, Optional[str], Union[C2PAPayload, None]]: """ Verifies a C2PA-compliant manifest. @@ -1090,7 +1106,7 @@ def _verify_c2pa( soft binding, and hard binding checks. Args: - text: The clean text content, stripped of metadata. + original_text: The full text asset (including the wrapper) provided for verification. outer_payload: The deserialized outer payload containing the manifest. public_key_resolver: A function to retrieve the public key for a given signer ID. return_payload_on_failure: Flag to control returning the payload on failure. @@ -1200,7 +1216,39 @@ def _verify_c2pa( logger.warning("C2PA verification: Hard binding assertion not found.") return False, signer_id if signer_id is not None else None, c2pa_manifest expected_hard_hash = hard_binding_assertion["data"].get("hash") - actual_hard_hash = hashlib.sha256(text.encode("utf-8")).hexdigest() + + exclusions_data = hard_binding_assertion["data"].get("exclusions") + expected_exclusion: Optional[Tuple[int, int]] = None + if isinstance(exclusions_data, list) and exclusions_data: + first = exclusions_data[0] + if isinstance(first, dict): + start_val = first.get("start") + length_val = first.get("length") + elif isinstance(first, (list, tuple)) and len(first) >= 2: + start_val, length_val = first[0], first[1] + else: + start_val = length_val = None + if start_val is not None and length_val is not None: + try: + expected_exclusion = (int(start_val), int(length_val)) + except (TypeError, ValueError): + expected_exclusion = None + + if wrapper_exclusion is not None: + if expected_exclusion != wrapper_exclusion: + logger.warning( + "C2PA verification: Hard binding exclusion range mismatch. 
Expected %s, got %s.", + expected_exclusion, + wrapper_exclusion, + ) + return False, signer_id, c2pa_manifest + elif expected_exclusion: + logger.warning("C2PA verification: Manifest recorded exclusions but none were detected in text.") + return False, signer_id, c2pa_manifest + + exclusion_ranges = [wrapper_exclusion] if wrapper_exclusion is not None else [] + hard_hash_result = compute_normalized_hash(original_text, exclusion_ranges) + actual_hard_hash = hard_hash_result.hexdigest if expected_hard_hash != actual_hard_hash: logger.warning( @@ -1243,7 +1291,41 @@ def _extract_outer_payload(cls, text: str) -> Optional[OuterPayload]: Raises: (Indirectly via called methods) UnicodeDecodeError, json.JSONDecodeError, TypeError """ - # 1. Extract Bytes: + # 0. Prefer C2PA text wrappers (FEFF-prefixed contiguous blocks) + try: + manifest_bytes, _clean_text, _span = find_and_decode(text) + except ValueError as err: + logger.warning(f"Failed to decode C2PA text wrapper: {err}") + return None + + if manifest_bytes is not None: + logger.debug("Detected C2PA text wrapper – attempting JUMBF decode.") + try: + manifest_store = deserialize_jumbf_payload(manifest_bytes) + except Exception as exc: + logger.warning(f"Failed to deserialize JUMBF payload from text wrapper: {exc}") + return None + + if not isinstance(manifest_store, dict): + logger.warning("Decoded JUMBF manifest store is not a dictionary.") + return None + + signer_id = manifest_store.get("signer_id") + cose_sign1 = manifest_store.get("cose_sign1") + if not signer_id or not isinstance(cose_sign1, str): + logger.warning("C2PA manifest store missing signer_id or cose_sign1.") + return None + + outer_payload: OuterPayload = { + "format": "c2pa", + "signer_id": signer_id, + "payload": base64.b64encode(manifest_bytes).decode("utf-8"), + "signature": "c2pa_manifest_store", + } + outer_payload["cose_sign1"] = cose_sign1 + return outer_payload + + # 1. 
Extract Bytes for legacy/other formats: logger.debug("Attempting to extract bytes from text.") outer_bytes = cls.extract_bytes(text) if not outer_bytes: @@ -1352,12 +1434,29 @@ def extract_metadata(cls, text: str) -> Optional[Union[BasicPayload, ManifestPay payload_format = outer_payload.get("format") if payload_format == "c2pa": + cose_sign1_b64 = outer_payload.get("cose_sign1") + try: + if isinstance(cose_sign1_b64, str): + cose_sign1_bytes = base64.b64decode(cose_sign1_b64) + cbor_bytes = extract_payload_from_cose_sign1(cose_sign1_bytes) + if cbor_bytes is not None: + return deserialize_c2pa_payload_from_cbor(cbor_bytes) + except (binascii.Error, ValueError, cbor2.CBORDecodeError) as exc: + logger.warning(f"Failed to decode COSE payload during C2PA extraction: {exc}") + return None + if isinstance(inner_payload, str): try: - cbor_bytes = base64.b64decode(inner_payload) - return deserialize_c2pa_payload_from_cbor(cbor_bytes) - except (binascii.Error, ValueError, cbor2.CBORDecodeError): - logger.warning("Failed to decode C2PA payload during non-verifying extraction.") + manifest_store_bytes = base64.b64decode(inner_payload) + manifest_store = deserialize_jumbf_payload(manifest_store_bytes) + cose_embedded = manifest_store.get("cose_sign1") if isinstance(manifest_store, dict) else None + if isinstance(cose_embedded, str): + cose_sign1_bytes = base64.b64decode(cose_embedded) + cbor_bytes = extract_payload_from_cose_sign1(cose_sign1_bytes) + if cbor_bytes is not None: + return deserialize_c2pa_payload_from_cbor(cbor_bytes) + except (binascii.Error, ValueError, cbor2.CBORDecodeError) as exc: + logger.warning(f"Failed to decode C2PA manifest store during extraction: {exc}") return None return None diff --git a/encypher/interop/c2pa/__init__.py b/encypher/interop/c2pa/__init__.py index d738eb4..cc722a0 100644 --- a/encypher/interop/c2pa/__init__.py +++ b/encypher/interop/c2pa/__init__.py @@ -18,4 +18,11 @@ ) # noqa: F401 # Text manifest wrapper utilities (public 
re-exports) -from .text_wrapper import ALGORITHM_IDS, MAGIC, VERSION, encode_wrapper, find_and_decode # noqa: F401 +from .text_wrapper import MAGIC, VERSION, encode_wrapper, find_and_decode # noqa: F401 + +# Normalisation + hashing helpers +from .text_hashing import ( # noqa: F401 + NormalizedHashResult, + compute_normalized_hash, + normalize_text, +) diff --git a/encypher/interop/c2pa/text_hashing.py b/encypher/interop/c2pa/text_hashing.py new file mode 100644 index 0000000..ac38631 --- /dev/null +++ b/encypher/interop/c2pa/text_hashing.py @@ -0,0 +1,118 @@ +"""Normalized hashing helpers for C2PA text assets. + +This module centralises the NFC normalisation and hash computation rules +mandated by the C2PA text manifest specification. Both the embedding and +verification flows call into these helpers so that offsets, exclusions, +and hash algorithms remain perfectly aligned. +""" + +from __future__ import annotations + +from dataclasses import dataclass +import hashlib +from typing import List, Sequence, Tuple +import unicodedata + + +@dataclass(frozen=True) +class NormalizedHashResult: + """Container returned by :func:`compute_normalized_hash`. + + Attributes + ---------- + normalized_text: + NFC-normalised version of the input text. + normalized_bytes: + UTF-8 bytes for :attr:`normalized_text` (before exclusions are + applied). + filtered_bytes: + UTF-8 bytes remaining after removing the requested exclusion ranges. + hexdigest: + Hex encoded digest of :attr:`filtered_bytes`. 
+ """ + + normalized_text: str + normalized_bytes: bytes + filtered_bytes: bytes + hexdigest: str + + @property + def filtered_text(self) -> str: + """Return the post-exclusion text as a Unicode string.""" + + return self.filtered_bytes.decode("utf-8") + + +def normalize_text(text: str) -> str: + """Return the NFC-normalised variant of *text*.""" + + return unicodedata.normalize("NFC", text) + + +def _coerce_ranges(exclusions: Sequence[Tuple[int, int]]) -> List[Tuple[int, int]]: + coerced: List[Tuple[int, int]] = [] + for start, length in exclusions: + coerced_start = int(start) + coerced_length = int(length) + if coerced_start < 0 or coerced_length < 0: + raise ValueError("Exclusion ranges must be non-negative") + coerced.append((coerced_start, coerced_length)) + return sorted(coerced, key=lambda item: item[0]) + + +def _apply_exclusions(normalized_bytes: bytes, exclusions: Sequence[Tuple[int, int]]) -> bytes: + if not exclusions: + return normalized_bytes + + filtered = bytearray() + position = 0 + for start, length in _coerce_ranges(exclusions): + end = start + length + if start < position: + raise ValueError("Exclusion ranges must be non-overlapping and sorted") + if end > len(normalized_bytes): + raise ValueError("Exclusion range exceeds the length of the normalised data") + filtered.extend(normalized_bytes[position:start]) + position = end + filtered.extend(normalized_bytes[position:]) + return bytes(filtered) + + +def compute_normalized_hash( + text: str, + exclusions: Sequence[Tuple[int, int]] | None = None, + *, + algorithm: str = "sha256", +) -> NormalizedHashResult: + """Compute the hash mandated by the text C2PA specification. + + Parameters + ---------- + text: + The textual asset to normalise and hash. + exclusions: + Iterable of ``(start, length)`` byte ranges within the normalised UTF-8 + representation that must be removed prior to hashing. + algorithm: + Name of the hashing algorithm to use. 
``sha256`` is the only value the + specification currently allows but the parameter remains configurable + for completeness. + """ + + normalized = normalize_text(text) + normalized_bytes = normalized.encode("utf-8") + filtered_bytes = _apply_exclusions(normalized_bytes, exclusions or []) + try: + digest = hashlib.new(algorithm.replace("-", "")) + except ValueError as exc: + raise ValueError(f"Unsupported hash algorithm '{algorithm}' for C2PA") from exc + digest.update(filtered_bytes) + return NormalizedHashResult( + normalized_text=normalized, + normalized_bytes=normalized_bytes, + filtered_bytes=filtered_bytes, + hexdigest=digest.hexdigest(), + ) + + +__all__ = ["NormalizedHashResult", "compute_normalized_hash", "normalize_text"] diff --git a/encypher/interop/c2pa/text_wrapper.py b/encypher/interop/c2pa/text_wrapper.py index 7b67cda..85f3992 100644 --- a/encypher/interop/c2pa/text_wrapper.py +++ b/encypher/interop/c2pa/text_wrapper.py @@ -11,18 +11,12 @@ MAGIC = b"C2PATXT\0" # 8-byte magic sequence VERSION = 1 # Current wrapper version we emit / accept -ALGORITHM_IDS = { - "sha256": 1, - "sha384": 2, - "sha512": 3, - "sha3-256": 4, - "sha3-384": 5, - "sha3-512": 6, -} +_HEADER_STRUCT = struct.Struct("!8sBI") +_HEADER_SIZE = _HEADER_STRUCT.size ZWNBSP = "\ufeff" _VS_CHAR_CLASS = "[\ufe00-\ufe0f\U000e0100-\U000e01ef]" -_WRAPPER_RE = re.compile(ZWNBSP + f"({_VS_CHAR_CLASS}{{15,}})") +_WRAPPER_RE = re.compile(ZWNBSP + f"({_VS_CHAR_CLASS}{{{_HEADER_SIZE},}})") def _byte_to_vs(byte: int) -> str: @@ -41,18 +35,8 @@ def _vs_to_byte(codepoint: int) -> Optional[int]: return None -def encode_wrapper(manifest_bytes: bytes, alg: str = "sha256") -> str: - if alg not in ALGORITHM_IDS: - raise ValueError(f"Unsupported algorithm '{alg}'.") - - header = b"".join( - [ - MAGIC, - struct.pack("!B", VERSION), - struct.pack("!H", ALGORITHM_IDS[alg]), - struct.pack("!I", len(manifest_bytes)), - ] - ) +def encode_wrapper(manifest_bytes: bytes) -> str: + header = 
_HEADER_STRUCT.pack(MAGIC, VERSION, len(manifest_bytes)) payload = header + manifest_bytes vs = [_byte_to_vs(b) for b in payload] return ZWNBSP + "".join(vs) @@ -74,17 +58,21 @@ def attach_wrapper_to_text(text: str, manifest_bytes: bytes, alg: str = "sha256" If *at_end* is True (default) the wrapper is appended; otherwise it is prepended before the first line break. """ - wrapper = encode_wrapper(manifest_bytes, alg) + # The ``alg`` parameter is retained for backwards compatibility with + # earlier APIs that allowed selecting a hash algorithm, but the updated + # wrapper format encodes only the manifest bytes. + wrapper = encode_wrapper(manifest_bytes) return text + wrapper if at_end else wrapper + text -def extract_from_text(text: str) -> Tuple[Optional[bytes], Optional[str], str, Optional[Tuple[int, int]]]: +def extract_from_text(text: str) -> Tuple[Optional[bytes], str, Optional[Tuple[int, int]]]: """Extract wrapper from text. - Returns (manifest_bytes, alg_name, clean_text, span) where *clean_text* is NFC normalised text with wrapper removed. - If wrapper not found returns (None, None, normalised_text, None). + Returns ``(manifest_bytes, clean_text, span)`` where ``clean_text`` is NFC + normalised text with the wrapper removed. If no wrapper is present the + function returns ``(None, normalised_text, None)``. 
""" - """Alias for find_and_decode for external callers.""" + return find_and_decode(text) @@ -93,11 +81,11 @@ def _normalize(text: str) -> str: return unicodedata.normalize("NFC", text) -def find_and_decode(text: str) -> Tuple[Optional[bytes], Optional[str], str, Optional[Tuple[int, int]]]: +def find_and_decode(text: str) -> Tuple[Optional[bytes], str, Optional[Tuple[int, int]]]: # Search for first wrapper m = _WRAPPER_RE.search(text) if not m: - return None, None, _normalize(text), None + return None, _normalize(text), None # Ensure there is no second wrapper occurrence (spec §4.2) second = _WRAPPER_RE.search(text, pos=m.end()) @@ -107,21 +95,15 @@ def find_and_decode(text: str) -> Tuple[Optional[bytes], Optional[str], str, Opt try: raw = _decode_vs_sequence(seq) except ValueError: - return None, None, _normalize(text), None - if len(raw) < 15: - return None, None, _normalize(text), None - magic, version, alg_id, length = struct.unpack("!8sBHI", raw[:15]) - if magic != MAGIC or version != VERSION or len(raw) < 15 + length: - raise ValueError("Invalid C2PA text wrapper header or length") - # Map algorithm id - alg_name = None - for name, _id in ALGORITHM_IDS.items(): - if _id == alg_id: - alg_name = name - break - if alg_name is None: - raise ValueError(f"Unknown hash algorithm id {alg_id}") - manifest_bytes = raw[15 : 15 + length] + raise ValueError("Invalid variation selector sequence in wrapper") + if len(raw) < _HEADER_SIZE: + raise ValueError("C2PA text wrapper shorter than required header length") + magic, version, length = _HEADER_STRUCT.unpack(raw[:_HEADER_SIZE]) + if magic != MAGIC or version != VERSION: + raise ValueError("Invalid C2PA text wrapper header values") + if len(raw) < _HEADER_SIZE + length: + raise ValueError("C2PA text wrapper truncated before manifest bytes") + manifest_bytes = raw[_HEADER_SIZE : _HEADER_SIZE + length] start, end = m.span() clean_text = _normalize(text[:start] + text[end:]) - return manifest_bytes, alg_name, clean_text, 
(start, end)
+    return manifest_bytes, clean_text, (start, end)
diff --git a/tests/integration/test_c2pa_text_embedding.py b/tests/integration/test_c2pa_text_embedding.py
index b9c68c4..273d466 100644
--- a/tests/integration/test_c2pa_text_embedding.py
+++ b/tests/integration/test_c2pa_text_embedding.py
@@ -1,8 +1,14 @@
+import struct
+import unicodedata
 import unittest
 
 from encypher.core.keys import generate_ed25519_key_pair
 from encypher.core.unicode_metadata import UnicodeMetadata
-from encypher.interop.c2pa import c2pa_like_dict_to_encypher_manifest, encypher_manifest_to_c2pa_like_dict
+from encypher.interop.c2pa import (
+    c2pa_like_dict_to_encypher_manifest,
+    encypher_manifest_to_c2pa_like_dict,
+)
+from encypher.interop.c2pa.text_wrapper import find_and_decode
 
 
 class TestC2PATextEmbedding(unittest.TestCase):
@@ -428,6 +434,55 @@ def public_key_resolver(kid: str):
 
         # Compare the dictionaries (excluding timestamp)
         self.assertEqual(comparison_dict, original_comparison, "Round-trip conversion with single-assertion CBOR manifest does not match original.")
 
+    def test_c2pa_text_wrapper_appended_with_feff(self):
+        private_key, public_key = generate_ed25519_key_pair()
+        key_id = "c2pa-wrapper-key"
+        sample_text = "Café document for wrapper"
+
+        embedded_text = UnicodeMetadata.embed_metadata(
+            text=sample_text,
+            private_key=private_key,
+            signer_id=key_id,
+            metadata_format="c2pa",
+            claim_generator="EncypherAI/WrapperTest/1.0",
+        )
+
+        self.assertNotEqual(embedded_text, sample_text)
+
+        manifest_bytes, clean_text, span = find_and_decode(embedded_text)
+        self.assertIsNotNone(manifest_bytes)
+        self.assertEqual(clean_text, unicodedata.normalize("NFC", sample_text))
+        self.assertIsNotNone(span)
+        self.assertEqual(span[1], len(embedded_text))
+
+        wrapper_segment = embedded_text[span[0] : span[1]]
+        self.assertTrue(wrapper_segment.startswith("\ufeff"))
+
+        self.assertGreaterEqual(len(manifest_bytes), 8)
+        length, box_type = struct.unpack(">I4s", manifest_bytes[:8])
+ 
self.assertEqual(box_type, b"jumb") + self.assertEqual(length, len(manifest_bytes)) + + def resolver(kid: str): + return public_key if kid == key_id else None + + verified, extracted_signer, manifest = UnicodeMetadata.verify_metadata( + text=embedded_text, + public_key_resolver=resolver, + return_payload_on_failure=True, + ) + + self.assertTrue(verified) + self.assertEqual(extracted_signer, key_id) + self.assertIsNotNone(manifest) + + hard_binding = next((a for a in manifest["assertions"] if a.get("label") == "c2pa.hash.data.v1"), None) + self.assertIsNotNone(hard_binding) + exclusions = hard_binding["data"].get("exclusions") + expected_start = len(unicodedata.normalize("NFC", sample_text).encode("utf-8")) + expected_length = len(wrapper_segment.encode("utf-8")) + self.assertEqual(exclusions, [{"start": expected_start, "length": expected_length}]) + if __name__ == "__main__": unittest.main() diff --git a/tests/interop/test_text_hashing.py b/tests/interop/test_text_hashing.py new file mode 100644 index 0000000..502fa09 --- /dev/null +++ b/tests/interop/test_text_hashing.py @@ -0,0 +1,51 @@ +import hashlib +import unicodedata + +import pytest + +from encypher.interop.c2pa.text_hashing import compute_normalized_hash, normalize_text + + +def test_compute_normalized_hash_normalizes_text(): + text = "e\u0301clair" # contains combining accent + result = compute_normalized_hash(text) + + expected_normalized = unicodedata.normalize("NFC", text) + expected_hash = hashlib.sha256(expected_normalized.encode("utf-8")).hexdigest() + + assert result.normalized_text == expected_normalized + assert result.filtered_text == expected_normalized + assert result.hexdigest == expected_hash + + +def test_compute_normalized_hash_with_exclusion_removes_wrapper(): + visible = "C2PA document" + wrapper = "\ufeff" + "".join(chr(0xFE00 + i) for i in range(4)) + full_text = visible + wrapper + + normalized_full = normalize_text(full_text) + wrapper_index = normalized_full.rfind(wrapper) + assert 
wrapper_index >= 0 + + exclusion_start = len(normalized_full[:wrapper_index].encode("utf-8")) + exclusion_length = len(wrapper.encode("utf-8")) + + result = compute_normalized_hash(full_text, [(exclusion_start, exclusion_length)]) + + expected_clean = normalize_text(visible) + expected_hash = hashlib.sha256(expected_clean.encode("utf-8")).hexdigest() + + assert result.filtered_text == expected_clean + assert result.hexdigest == expected_hash + + +def test_compute_normalized_hash_rejects_invalid_exclusions(): + text = "sample" + with pytest.raises(ValueError): + compute_normalized_hash(text, [(-1, 2)]) + with pytest.raises(ValueError): + compute_normalized_hash(text, [(0, -2)]) + with pytest.raises(ValueError): + compute_normalized_hash(text, [(0, 10)]) + with pytest.raises(ValueError): + compute_normalized_hash(text, [(0, 3), (2, 1)])
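
The wrapper and hashing changes above can be exercised end to end with a small standalone sketch. All names below are local illustrations that mirror the logic this diff introduces (the `!8sBI` header, the byte-to-variation-selector mapping, and NFC-normalised hashing); nothing is imported from the `encypher` package itself:

```python
import hashlib
import struct
import unicodedata

MAGIC = b"C2PATXT\0"             # 8-byte magic, as in text_wrapper.py
VERSION = 1
HEADER = struct.Struct("!8sBI")  # magic | version | manifest length


def byte_to_vs(b: int) -> str:
    # 0-15 map to U+FE00..U+FE0F; 16-255 map to U+E0100..U+E01EF.
    return chr(0xFE00 + b) if b < 16 else chr(0xE0100 + (b - 16))


def vs_to_byte(cp: int) -> int:
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    raise ValueError(f"U+{cp:06X} is not a variation selector")


def encode_wrapper(manifest: bytes) -> str:
    # Pack the header, convert every payload byte to an invisible
    # variation selector, and prefix the block with U+FEFF.
    payload = HEADER.pack(MAGIC, VERSION, len(manifest)) + manifest
    return "\ufeff" + "".join(byte_to_vs(b) for b in payload)


def decode_wrapper(wrapper: str) -> bytes:
    raw = bytes(vs_to_byte(ord(c)) for c in wrapper.lstrip("\ufeff"))
    magic, version, length = HEADER.unpack(raw[: HEADER.size])
    if magic != MAGIC or version != VERSION:
        raise ValueError("Invalid wrapper header")
    return raw[HEADER.size : HEADER.size + length]


# Round-trip a toy manifest store through the invisible wrapper.
manifest = b"\x00\x00\x00\x08jumb"
visible = unicodedata.normalize("NFC", "Café document")
asset = visible + encode_wrapper(manifest)
assert decode_wrapper(asset[len(visible):]) == manifest

# Hashing the NFC-normalised text with the wrapper bytes excluded reduces
# to hashing the visible text alone, as compute_normalized_hash does.
content_hash = hashlib.sha256(visible.encode("utf-8")).hexdigest()
```

Note that the exclusion ranges in the diff are expressed in bytes of the normalised UTF-8 encoding, not in code points, which is why the tests compute `len(...encode("utf-8"))` rather than `len(...)`.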