Commit 0ccd1eb

Merge pull request #19 from iscc/develop
Develop → Main: CDC optimization, docs, CI fixes
2 parents d5dc430 + 6cfae5e commit 0ccd1eb

70 files changed: +2599 -506 lines changed

.claude/agent-memory/advance/MEMORY.md

Lines changed: 81 additions & 61 deletions
@@ -24,6 +24,7 @@ iterations.
2424
## Build and Tooling
2525

2626
- `cargo build -p iscc-jni` must run before `mvn test` (native library prerequisite)
27+
- Maven POM is at `crates/iscc-jni/java/pom.xml` — run `mvn test` from `crates/iscc-jni/java/`
2728
- CI workflow at `.github/workflows/ci.yml` has 9 jobs: version-check, rust, python, nodejs, wasm,
2829
c-ffi, java, go, bench. The `bench` job runs `cargo bench --no-run` (compile-only, no execution)
2930
- `version-check` job: lightweight (checkout + setup-python only), runs
@@ -40,66 +41,57 @@ iterations.
4041
- wasm-opt release flags: `[package.metadata.wasm-pack.profile.release]` with
4142
`wasm-opt = ["-O", "--enable-bulk-memory", "--enable-nontrapping-float-to-int"]`
4243

43-
## Go Pure Go Rewrite
44-
45-
- Pure Go codec: `packages/go/codec.go` — type enums (`MainType`, `SubType`, `Version` with `iota`),
46-
varnibble header encoding/decoding, base32/base64, `EncodeComponent`, `IsccDecompose`,
47-
`IsccDecode`. Zero external dependencies
48-
- Go type naming: `MTMeta`..`MTFlake`, `STNone`..`STWide`, `STText = STNone`, `VSV0 Version = 0`
49-
- Internal helpers are unexported (lowercase): `encodeHeader`, `decodeHeader`, etc.
50-
- `IsccDecode` uses `DecodeResult` struct defined in `codec.go`
51-
- Base32: `base32.StdEncoding.WithPadding(base32.NoPadding)`. Base64: `base64.RawURLEncoding`
52-
- Pure Go text utils: `TextClean` (NFKC + control-char + empty-line collapse), `TextCollapse` (NFD +
53-
lowercase + filter C/M/P + NFKC), `TextTrim` (UTF-8 byte-boundary), `TextRemoveNewlines`
54-
(strings.Fields join). Uses `golang.org/x/text/unicode/norm`
55-
- CDC: `cdcGear` table is `var` not `const` (Go no const arrays). `min()` builtin Go 1.21+
56-
- MinHash: `minhashFn` naming (avoids conflict). `maxi64`/`mprime`/`maxH` are `var` not `const`
57-
- SimHash: `AlgSimhash` returns `([]byte, error)`, `SlidingWindow` returns `([]string, error)`. Uses
58-
`[]rune` for Unicode-correct SlidingWindow
59-
- CDC integer ceiling: `(minSize + 1) / 2` (Go has no div_ceil method)
60-
- DCT: `algDct` (unexported) + `dctRecursive` helper. Only uses `math` stdlib. Nayuki recursive
61-
divide-and-conquer. Input must be power of 2 — checked via `n > 0 && n&(n-1) == 0`
62-
- WTA-Hash: `AlgWtahash` (exported) + `wtaVideoIdPermutations` `[256][2]int` table. No external deps
63-
- Gen functions: `code_content_text.go` (GenTextCodeV0 + softHashTextV0), `code_meta.go`
64-
(GenMetaCodeV0 + metaNameSimhash + softHashMetaV0 + softHashMetaV0WithBytes + interleaveDigests
65-
\+ slidingWindowBytes + decodeDataURL + parseMetaJSON + jsonHasContext + buildMetaDataURL +
66-
multiHashBlake3), `code_data.go` (GenDataCodeV0 + DataHasher with Push/Finalize),
67-
`code_instance.go` (GenInstanceCodeV0 + InstanceHasher with Push/Finalize),
68-
`code_content_image.go` (GenImageCodeV0 + softHashImageV0 + transposeMatrix + flatten8x8 +
69-
computeMedian), `code_content_audio.go` (GenAudioCodeV0 + softHashAudioV0 + arraySplit[T]).
70-
Result types: `TextCodeResult`, `MetaCodeResult`, `DataCodeResult`, `InstanceCodeResult`,
71-
`ImageCodeResult`, `AudioCodeResult`, `VideoCodeResult`, `MixedCodeResult`, `IsccCodeResult`
72-
- xxh32: `xxh32.go` — standalone xxHash32 impl (~80 lines). Used by softHashTextV0 for n-gram
73-
feature hashing. Unexported: `xxh32(data, seed)`, `xxh32Round`, `rotl32`, `readU32LE`
74-
- JCS canonicalization: uses Go stdlib `json.Marshal` (sorts keys, compact format). Works for
75-
string/null values in conformance vectors. For full RFC 8785 float compliance, would need a
76-
dedicated library
77-
- BLAKE3 dependency: `github.com/zeebo/blake3` (SIMD-optimized). `blake3.Sum256(data)` returns
78-
`[32]byte`
79-
- Test naming for gen functions: `TestPureGo*` prefix (historical — could be renamed to `Test*` in
80-
future cleanup)
81-
- Go docs: `packages/go/README.md` and `docs/howto/go.md` describe pure Go API (no WASM/wazero).
82-
Examples use `iscc.Function(...)` pattern with typed result structs (`*MetaCodeResult`, etc.)
83-
- Image-Code helpers: `transposeMatrix`, `flatten8x8`, `computeMedian` are unexported in
84-
`code_content_image.go`. `bitsToBytes` reused from `codec.go`
85-
- Audio-Code: `arraySplit[T any]` is generic (Go 1.18+), used for splitting digests into quarters/
86-
thirds. `AlgSimhash` on 4-byte digests returns 4 bytes (output = input digest length)
87-
- `sort.Slice` for int32: `func(i, j int) bool { return s[i] < s[j] }` (no built-in int32 sort)
88-
- Video-Code: `SoftHashVideoV0` exported (matching Rust `pub fn`). Dedup via
89-
`fmt.Sprintf("%v", sig)` string keys in `map[string][]int32`. Column-wise int64 sums →
90-
`AlgWtahash`
91-
- Mixed-Code: `softHashCodesV0` unexported (matching Rust non-pub). Preserves first header byte for
92-
type info in SimHash entries. Uses `decodeHeader`/`decodeLength` to validate Content MainType
93-
and bit length. `AlgSimhash` error safely discarded (all entries identical length)
94-
- Go module dependencies: `github.com/zeebo/blake3` (BLAKE3, SIMD), `golang.org/x/text` (Unicode).
95-
No wazero or WASM dependencies. `github.com/klauspost/cpuid/v2` indirect (blake3 SIMD detection)
96-
- Test naming: `TestCodec*`, `TestUtils*`, `TestCdc*`, `TestMinhash*`, `TestSimhash*`,
97-
`TestAlgDct*`, `TestAlgWtahash*`, `TestPermutation*`
98-
- Conformance tests (per-function): `os.ReadFile("../../crates/iscc-lib/tests/data.json")`
99-
- Conformance selftest: `//go:embed testdata/data.json` in conformance.go.
100-
`ConformanceSelftest() (bool, error)` — package-level function (no receiver). Uses
101-
`vectorEntry` struct + 9 `run*Tests` section runners. `decodeStream` shared helper for
102-
Data/Instance hex decoding
44+
## Go Pure Go Rewrite (Summary)
45+
46+
- Pure Go in `packages/go/` — all 10 gen functions + codec + algorithms. Zero WASM deps
47+
- Dependencies: `github.com/zeebo/blake3`, `golang.org/x/text`. Indirect: `cpuid/v2`
48+
- Go idioms: unexported helpers (lowercase), `var` for arrays/large uint64 (Go const limitations),
49+
`[]rune` for Unicode SlidingWindow, generics for `arraySplit[T]`
50+
- Conformance: `//go:embed testdata/data.json`, per-function tests use
51+
`os.ReadFile("../../crates/iscc-lib/tests/data.json")`
52+
- 151 Go tests total. CI: 4 steps (checkout, setup-go, test, vet) — no Rust deps
53+
54+
## gen_sum_code_v0
55+
56+
- `gen_sum_code_v0(path: &Path, bits: u32, wide: bool) -> IsccResult<SumCodeResult>` in `lib.rs`
57+
- Single-pass file I/O: opens file, reads in `IO_READ_SIZE` chunks, feeds both `DataHasher` and
58+
`InstanceHasher`, composes ISCC-CODE via `gen_iscc_code_v0`
59+
- `SumCodeResult { iscc, datahash, filesize }` in `types.rs` — same `#[non_exhaustive]` pattern
60+
- File I/O errors mapped to `IsccError::InvalidInput("Cannot open/read file: {e}")`
61+
- `units: Vec<String>` field deferred (not in scope for initial core implementation)
62+
- 32nd and final Tier 1 symbol for Rust core — all 32 symbols now implemented
63+
- Python binding: PyO3 wrapper in `crates/iscc-py/src/lib.rs` accepts `&str` path, `SumCodeResult`
64+
class in `__init__.py`, public wrapper accepts `str | os.PathLike` via `os.fspath()`, 6 tests in
65+
`tests/test_smoke.py`
66+
- Node.js binding: `NapiSumCodeResult` struct (`#[napi(object)]`) + `gen_sum_code_v0` napi fn in
67+
`crates/iscc-napi/src/lib.rs`. Uses `i64` for `filesize` (napi-rs no u64 support). 6 tests in
68+
`__tests__/functions.test.mjs`
69+
- WASM binding: `WasmSumCodeResult` struct (`#[wasm_bindgen(getter_with_clone)]`) +
70+
`gen_sum_code_v0` fn in `crates/iscc-wasm/src/lib.rs`. Accepts `&[u8]` (no filesystem in WASM).
71+
Uses `f64` for `filesize` (wasm-bindgen `u64` maps to `BigInt`, awkward for JS). Composes
72+
internally via `DataHasher` + `InstanceHasher` + `gen_iscc_code_v0`. 6 tests in `tests/unit.rs`,
73+
75 total WASM tests (9 conformance + 66 unit; 1 behind `conformance` feature gate)
74+
- C FFI binding: `IsccSumCodeResult` repr(C) struct with `ok: bool`, `iscc: *mut c_char`,
75+
`datahash: *mut c_char`, `filesize: u64`. `iscc_gen_sum_code_v0(path, bits, wide)` extern "C"
76+
function + `iscc_free_sum_code_result` free function in `crates/iscc-ffi/src/lib.rs`. Follows
77+
`IsccDecodeResult` struct-return pattern. 4 Rust tests + 3 C tests. 82 total Rust tests, 57
78+
total C test assertions
79+
- JNI binding: `SumCodeResult.java` (immutable, `String iscc`, `String datahash`, `long filesize`)
80+
- `Java_io_iscc_iscc_1lib_IsccLib_genSumCodeV0` in `crates/iscc-jni/src/lib.rs`. Returns `jobject`
81+
via `env.find_class("io/iscc/iscc_lib/SumCodeResult")` + `env.new_object()` with signature
82+
`(Ljava/lang/String;Ljava/lang/String;J)V`. 4 Maven tests. 62 total Maven tests
83+
- Go binding: `packages/go/code_sum.go`: `SumCodeResult` struct (`Iscc`, `Datahash`, `Filesize`) +
84+
`GenSumCodeV0(path string, bits uint32, wide bool)`. Single-pass file I/O with `os.Open` +
85+
`DataHasher` + `InstanceHasher` + `GenIsccCodeV0`. 4 tests in `code_sum_test.go`. 151 total Go
86+
tests. ALL 7 bindings complete for issue #15
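
The single-pass pattern noted above (open once, read in fixed-size chunks, feed every chunk to two hashers) can be sketched as follows. This is a minimal illustration only: `StreamHasher`, `ByteCounter`, `ByteSum`, `READ_SIZE`, and `hash_file_single_pass` are hypothetical stand-ins, not the `DataHasher`/`InstanceHasher` API, and the chunk size is an assumed value rather than the real `IO_READ_SIZE`.

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

/// Hypothetical stand-in for a streaming hasher (not the iscc-lib API).
trait StreamHasher {
    fn push(&mut self, chunk: &[u8]);
}

/// Toy hasher that counts the bytes it has seen.
#[derive(Default)]
struct ByteCounter {
    total: u64,
}

impl StreamHasher for ByteCounter {
    fn push(&mut self, chunk: &[u8]) {
        self.total += chunk.len() as u64;
    }
}

/// Toy hasher that keeps a wrapping sum of all byte values.
#[derive(Default)]
struct ByteSum {
    sum: u32,
}

impl StreamHasher for ByteSum {
    fn push(&mut self, chunk: &[u8]) {
        for &b in chunk {
            self.sum = self.sum.wrapping_add(b as u32);
        }
    }
}

/// Assumed chunk size for this sketch only.
const READ_SIZE: usize = 4096;

/// Reads the file once and feeds each chunk to both hashers.
/// I/O errors are mapped to strings, mirroring the "Cannot open/read file" note.
fn hash_file_single_pass(path: &Path) -> Result<(u64, u32), String> {
    let mut file = File::open(path).map_err(|e| format!("Cannot open file: {e}"))?;
    let mut counter = ByteCounter::default();
    let mut summer = ByteSum::default();
    let mut buf = vec![0u8; READ_SIZE];
    loop {
        let n = file
            .read(&mut buf)
            .map_err(|e| format!("Cannot read file: {e}"))?;
        if n == 0 {
            break; // end of file
        }
        counter.push(&buf[..n]);
        summer.push(&buf[..n]);
    }
    Ok((counter.total, summer.sum))
}
```

Both hashers see identical chunks, so the file is read only once regardless of how many digests are derived from it.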
87+
88+
## Benchmarks
89+
90+
- `crates/iscc-lib/benches/benchmarks.rs` — all 10 `gen_*_v0` + DataHasher streaming + CDC chunks
91+
- `bench_sum_code` uses `tempfile::NamedTempFile` since `gen_sum_code_v0` takes `&Path` (not
92+
`&[u8]`)
93+
- Temp files created outside bench closure (setup cost excluded from measurement)
94+
- `tempfile` is a dev-dependency only (workspace dep `tempfile = "3"`)
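
The "setup outside the measured closure" rule can be illustrated without any benchmark framework. The sketch below is an assumption-laden stand-in: it uses `std::time::Instant` instead of criterion, a plain `fs::write` instead of `tempfile::NamedTempFile`, and `fs::read` in place of `gen_sum_code_v0`; the names `measure` and `bench_file_read` are hypothetical.

```rust
use std::fs;
use std::path::Path;
use std::time::{Duration, Instant};

/// Times only the routine itself; anything done before calling this is excluded.
fn measure<F: FnMut() -> u64>(iters: u32, mut routine: F) -> (Duration, u64) {
    let start = Instant::now();
    let mut checksum = 0u64;
    for _ in 0..iters {
        // Accumulate a result so the work is observable to the caller.
        checksum = checksum.wrapping_add(routine());
    }
    (start.elapsed(), checksum)
}

/// Creates the input file first (setup), then measures repeated reads of it.
fn bench_file_read(path: &Path, size: usize, iters: u32) -> std::io::Result<(Duration, u64)> {
    // Setup: happens before the timer starts, so it is not measured.
    fs::write(path, vec![0xAB; size])?;
    // Measured section: only the path-based read runs inside the closure.
    let result = measure(iters, || fs::read(path).map(|d| d.len() as u64).unwrap_or(0));
    // Teardown: also outside the measured section.
    fs::remove_file(path)?;
    Ok(result)
}
```

If the file were written inside the closure, every iteration would pay for file creation and the reported time would no longer reflect the path-based routine alone.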
10395

10496
## Codec Internals
10597

@@ -148,9 +140,37 @@ iterations.
148140

149141
- All 4 Reference pages complete: Rust API, Python API, C FFI, Java API
150142

143+
## Binding Constant Export Patterns
144+
145+
- NAPI: `#[napi(js_name = "CONST_NAME")] pub const CONST_NAME: u32 = iscc_lib::CONST_NAME as u32;`
146+
- WASM: `#[wasm_bindgen(js_name = "CONST_NAME")] pub fn const_name() -> u32 { ... }` (getter fn, not
147+
const — wasm-bindgen limitation)
148+
- C FFI: `#[unsafe(no_mangle)] pub extern "C" fn iscc_const_name() -> u32 { ... }` + inline
149+
`#[test]` in same file. cbindgen auto-generates the C header
150+
- NAPI JS tests: `describe('CONST_NAME', () => { it('equals X'); it('is a number'); })`
151+
- WASM tests: `#[wasm_bindgen_test]` in `tests/unit.rs` (requires wasm-pack to run)
152+
- C tests: `ASSERT_EQ(iscc_const_name(), value, "label")` in `tests/test_iscc.c`
153+
- 5 constants currently exported: META_TRIM_NAME, META_TRIM_DESCRIPTION, META_TRIM_META,
154+
IO_READ_SIZE, TEXT_NGRAM_SIZE
155+
156+
## Documentation Sweep Patterns
157+
158+
- "N gen" count references exist in: READMEs (9 files), docs/ (14 files), howto/ (6 files), crate
159+
CLAUDE.md files (5), notes/ (2), source comments (.rs, .py, .mjs, .pyi), benchmarks/ (2)
160+
- The Edit tool requires a full Read call (not offset/limit) before the first edit per file
161+
- mdformat auto-reformats after edits — always run `mise run format` twice after doc changes
162+
- iscc-core-ts is external and may have different function counts than iscc-lib
163+
151164
## Gotchas
152165

153166
- JNI package underscore encoding: `iscc_lib` → `iscc_1lib` in function names
154167
- mdformat auto-formats markdown — keep backtick expressions short to avoid wrapping crashes
155168
- `from __future__ import annotations` in `__init__.py` — use `|` union syntax, not `Union`
156-
- Python `__all__` has 45 entries (30 API + 10 result types + `__version__` + MT, ST, VS, core_opts)
169+
- Python `__all__` has 48 entries (32 API + 11 result types + `__version__` + MT, ST, VS, core_opts)
170+
- `gen_sum_code_v0` wide mode only differs from normal when `bits >= 128` (wide requires 128-bit+
171+
codes)
172+
- After adding new symbols to `crates/iscc-py/src/lib.rs`, MUST rebuild the `.so` with
173+
`uv run maturin develop -m crates/iscc-py/Cargo.toml` before `pytest` will work
174+
- JSON `{"x":""}` overhead is 8 bytes (not 7) — relevant for boundary tests on META_TRIM_META
175+
- META_TRIM_META validation: pre-decode check uses `META_TRIM_META * 4/3 + 256` (base64 inflation +
176+
media type header), post-decode check uses `META_TRIM_META` directly
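
The two size-related gotchas above reduce to simple arithmetic, sketched here. The function names are hypothetical and the limit is passed as a parameter, so no particular value of `META_TRIM_META` is assumed; only the formula `META_TRIM_META * 4/3 + 256` and the 8-byte `{"x":""}` wrapper come from the notes.

```rust
/// The empty JSON wrapper from the note; its length is the fixed overhead.
const WRAPPER: &str = r#"{"x":""}"#;

/// Pre-decode bound: base64 inflation (4/3) plus 256 bytes for the media type header.
/// Integer arithmetic: the multiplication happens before the division.
fn pre_decode_limit(meta_trim_meta: usize) -> usize {
    meta_trim_meta * 4 / 3 + 256
}

/// Longest ASCII value that still fits in `limit` bytes once wrapped as {"x":"..."}.
fn max_value_len(limit: usize) -> usize {
    limit.saturating_sub(WRAPPER.len())
}

/// Builds the wrapped JSON text for a value that needs no escaping.
fn wrap(value: &str) -> String {
    format!("{{\"x\":\"{}\"}}", value)
}
```

For a boundary test, a value of `max_value_len(limit)` ASCII characters produces a payload of exactly `limit` bytes, and one more character pushes it over.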

.claude/agent-memory/define-next/MEMORY.md

Lines changed: 29 additions & 52 deletions
Original file line numberDiff line numberDiff line change
@@ -7,80 +7,57 @@ iterations.
77

88
## Scope Calibration Principles
99

10-
- CI job additions are small, single-file changes that provide high value. Pattern: copy existing
11-
job structure, swap language-specific setup action and build/test commands
1210
- Critical issues always take priority regardless of feature trajectory
1311
- Multiple small issues in the same crate are a natural batch (e.g., 3 fixes touching 2 files)
14-
- README files are "create" operations — less risky than code changes. Doc files excluded from
15-
3-file limit
12+
- Doc files are excluded from the 3-file modification limit — can batch all 6 howto guides in one
13+
step since they follow identical patterns
1614
- When CI is red, formatting/lint fixes are always the first priority regardless of handoff "Next"
1715
- Prefer concrete deliverables over research tasks when both are available
18-
- File deletions don't count toward the 3-file modification limit — they are simpler than edits
19-
- After a major rewrite (e.g., Go pure rewrite), docs/CI lag behind — schedule a cleanup step to
20-
bring all stale references in sync before moving to the next feature
21-
- State assessments can go stale — always verify claimed gaps by reading the actual files. The state
22-
may say "met" for something that still has stale content
23-
- When state says "all automatable work complete," cross-check the spec's verification criteria
24-
against actual files — state assessment may miss spec requirements that were never implemented
25-
(e.g., missing Reference pages)
16+
- State assessments can go stale — always verify claimed gaps by reading the actual files
17+
- New Tier 1 symbols: always implement in Rust core first, then propagate to bindings in separate
18+
steps. Core + tests in one step, bindings in subsequent steps
19+
- When previous next.md already contains correct scoping, verify line references are still accurate
20+
and refresh rather than rewrite from scratch — avoid unnecessary churn
21+
- Repetitive doc additions across language guides: all 6 howto files follow identical structure
22+
(heading, 1-line description, fenced code block). Safe to batch all in one step
2623

2724
## Architecture Decisions
2825

29-
- Java conformance tests use `data.json` via relative path from Maven's working directory
30-
- Maven Surefire `-Djava.library.path` points to `target/debug/` for finding native cdylib
3126
- Go bindings are pure Go (no WASM, no wazero, no binary artifacts)
3227
- All binding conformance tests follow the same structure: load data.json, iterate per-function
3328
groups, decode inputs per signature, compare `.iscc` output
3429
- `gen_iscc_code_v0` test vectors have no `wide` parameter — always pass `false`
3530
- `"stream:<hex>"` prefix denotes hex-encoded byte data for Data/Instance-Code tests
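
A minimal sketch of how a `"stream:<hex>"` test input might be turned into bytes, assuming plain two-digit hex pairs after the prefix. The function name `decode_stream` and its error strings are hypothetical and not taken from any of the bindings.

```rust
/// Decodes a "stream:<hex>" conformance input into raw bytes.
fn decode_stream(input: &str) -> Result<Vec<u8>, String> {
    let hex = input
        .strip_prefix("stream:")
        .ok_or_else(|| format!("missing \"stream:\" prefix in {input:?}"))?;
    // Reject anything that is not a hex digit up front; this also guarantees
    // the string is ASCII, so byte-index slicing below cannot split a character.
    if !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err("non-hex character in stream data".to_string());
    }
    if hex.len() % 2 != 0 {
        return Err("odd number of hex digits".to_string());
    }
    (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).map_err(|e| e.to_string()))
        .collect()
}
```

An empty payload (`"stream:"`) decodes to an empty byte vector, which is a useful edge case for Data-Code and Instance-Code vectors.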
3631

37-
## Documentation Reference Pages Status (Iteration 19)
32+
## Benchmark Patterns
3833

39-
Documentation spec requires 4 Reference pages:
34+
- `benchmarks.rs` uses `criterion_group!` macro to register all bench functions
35+
- Data/Instance/ISCC-Code benchmarks use `BenchmarkId` + `Throughput::Bytes` for throughput metrics
36+
- `deterministic_bytes(size)` helper generates reproducible test data
37+
- `gen_sum_code_v0` requires `&Path` (temp file needed) — unlike `gen_data_code_v0` which takes
38+
`&[u8]` directly. Temp file must be created OUTSIDE the bench closure
4039

41-
1. Rust API (`rust-api.md`) — ✓ exists
42-
2. Python API (`api.md`) — ✓ exists
43-
3. C FFI reference (`c-ffi-api.md`) — ✓ exists (added iteration 18, 694 lines)
44-
4. Java API — ✗ missing (scoped for iteration 19)
40+
## Documentation Sweep Patterns
4541

46-
## Java API Reference Page Facts
47-
48-
- `IsccLib.java`: 382 lines, 30 `public static native` methods + 4 `public static final int`
49-
constants + private constructor + static NativeLoader block
50-
- `IsccDecodeResult.java`: 42 lines, 5 `public final` fields (maintype, subtype, version, length,
51-
digest)
52-
- Streaming hashers use opaque `long` handles (JNI pointers) — must document free lifecycle
53-
- All methods throw `IllegalArgumentException` on invalid input
54-
- Package: `io.iscc.iscc_lib`, Maven coordinates: `io.iscc:iscc-lib:0.0.2`
42+
- `crates/iscc-wasm/pkg/README.md` must always be identical to `crates/iscc-wasm/README.md` — both
43+
are published to npm
44+
- When updating "9 gen functions" to "10", distinguish context: data.json has 9 function sections
45+
(no gen_sum_code_v0), so conformance/benchmark code correctly says "9"
46+
- Two docs pages (architecture.md, development.md) share identical directory tree and crate summary
47+
table — edits must be synced between them
5548

5649
## CI/Release Patterns
5750

58-
- Release workflow has `workflow_dispatch` inputs: `crates-io`, `pypi`, `npm`, `maven` (booleans)
59-
- All publish jobs have idempotency checks (version-existence pre-check, `skip` output)
60-
- `scripts/version_sync.py` uses only stdlib — can run as `python scripts/version_sync.py --check`
61-
62-
## Project Near-Completion State (Iteration 19)
63-
64-
All 7 bindings at 30/30, CI green with 9 jobs. PR #10 exists from develop→main.
65-
66-
**Remaining automated gaps:**
67-
68-
1. Java API reference page — SCOPED (iteration 19, LAST automatable task)
69-
2. Tab order standardization — LOW priority, needs human review
70-
71-
**Human-only tasks remaining after iteration 19:**
72-
73-
- Merge PR #10 (develop → main)
74-
- Trigger releases for npm, PyPI
75-
- Maven Central publishing (GPG, Sonatype)
76-
- Tab order decision
77-
- OIDC for crates.io
51+
- v0.0.3 released to all registries. Next release after remaining gaps closed.
52+
- Release workflow has `workflow_dispatch` with per-registry checkboxes
7853

7954
## Gotchas
8055

8156
- JNI function names encode Java package underscores as `_1`
8257
- WASM howto uses `@iscc/wasm` (not `@iscc/iscc-wasm`). npm lib is `@iscc/lib`
83-
- ISCC Foundation URL is `https://iscc.io`
8458
- Java `byte` is signed — values 128-255 wrap, JNI handles correctly
85-
- Two docs pages (architecture.md, development.md) share identical directory tree and crate summary
86-
table — edits must be synced between them
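
The JNI underscore gotcha in this list follows from the JNI name-mangling rules, where `.` becomes `_` and a literal `_` becomes `_1`. The sketch below covers only the ASCII escapes (`_1`, `_2`, `_3`) and omits the `_0xxxx` Unicode escape; `jni_escape` and `jni_symbol` are hypothetical helper names.

```rust
/// Escapes one fully qualified class name or method name (ASCII subset only).
fn jni_escape(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    for ch in name.chars() {
        match ch {
            '.' | '/' => out.push('_'),    // package separator
            '_' => out.push_str("_1"),     // literal underscore
            ';' => out.push_str("_2"),     // only appears in signatures
            '[' => out.push_str("_3"),     // only appears in signatures
            other => out.push(other),      // non-ASCII escaping omitted here
        }
    }
    out
}

/// Builds the native symbol name for a method of the given class.
fn jni_symbol(class: &str, method: &str) -> String {
    format!("Java_{}_{}", jni_escape(class), jni_escape(method))
}
```

With these rules, class `io.iscc.iscc_lib.IsccLib` and method `genSumCodeV0` yield the symbol name used in the JNI crate.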
59+
60+
## Project Status
61+
62+
- Iteration 16: Only gap is bench_sum_code. After that, only issue #16 (low priority, feature flags)
63+
remains. Project approaching full target completion.
