
feat: add seed sidecar bootstrap workflow #10

Open
huntharo wants to merge 3 commits into main from codex/seed-sidecar-bootstrap

@huntharo
Contributor

Summary

This PR adds the first end-to-end seed sidecar workflow for openclaw/openclaw:

  • relaxes ghcrawl init so GitHub-only setup is valid
  • adds seed-install for importing a published starter sidecar into a local repo
  • adds dev-only seed-export for maintainers to generate the sidecar locally
  • adds seed-audit to validate a generated sidecar before publishing it anywhere
  • updates docs/tests around the starter-data flow

How It Works

Install path

ghcrawl seed-install openclaw/openclaw now:

  1. does a metadata sync unless --no-sync is requested
  2. downloads a published sidecar asset
  3. verifies the SHA-256 checksum from the checked-in manifest
  4. validates the sidecar manifest against the current CLI version and embed model
  5. matches sidecar rows onto local threads by stable GitHub identity plus content_hash
  6. imports published embeddings and similarity edges
  7. rebuilds a normal local cluster run from those imported edges
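
As an illustration of step 3, here is a minimal sketch of a SHA-256 check using Node's built-in `crypto` module. The helper names `verifySha256` and `assertSha256` are hypothetical and are not taken from this PR. The sketch hashes an in-memory buffer for brevity; an asset of several hundred megabytes would more plausibly be hashed as a stream.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper (not from the PR): compare the SHA-256 digest of a
// downloaded asset with the hex digest recorded in the checked-in manifest.
export function verifySha256(asset: Uint8Array, expectedHex: string): boolean {
  const actualHex = createHash("sha256").update(asset).digest("hex");
  // digest("hex") is lowercase, so normalize the expected value the same way.
  return actualHex === expectedHex.trim().toLowerCase();
}

// Hypothetical wrapper: fail loudly so nothing is imported on a mismatch.
export function assertSha256(asset: Uint8Array, expectedHex: string): void {
  if (!verifySha256(asset, expectedHex)) {
    throw new Error("seed sidecar checksum mismatch; refusing to import");
  }
}
```

Checking the digest before parsing the archive means a corrupted or swapped download is rejected before any row reaches the local database.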

The sidecar import is intentionally narrow:

  • it only imports embeddings and derived cluster-edge data
  • it does not overwrite thread text, comments, summaries, or sync cursor state
  • it is currently scoped to openclaw/openclaw
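
The matching in step 5 could look roughly like the sketch below. The row shapes, field names (`githubId`, `contentHash`, `sourceKind`), and the function `matchSeedRows` are assumptions for illustration only; the PR description does not show the real schema. The point is that a sidecar row is only attached to a local thread when both the GitHub identity and the content hash agree, so stale embeddings are skipped rather than applied to edited threads.

```typescript
// Hypothetical shapes; the real schema is not shown in the PR description.
interface LocalThread {
  id: number; // local database id
  githubId: string; // stable GitHub identity (assumed to be a string)
  contentHash: string;
}

interface SeedRow {
  githubId: string;
  contentHash: string;
  sourceKind: string;
  embedding: number[];
}

// Match sidecar rows to local threads on (githubId, contentHash).
export function matchSeedRows(local: LocalThread[], rows: SeedRow[]) {
  const byKey = new Map<string, LocalThread>();
  for (const thread of local) {
    byKey.set(`${thread.githubId}\u0000${thread.contentHash}`, thread);
  }

  const matched: Array<{ threadId: number; row: SeedRow }> = [];
  let skipped = 0;
  for (const row of rows) {
    const thread = byKey.get(`${row.githubId}\u0000${row.contentHash}`);
    if (thread) {
      matched.push({ threadId: thread.id, row });
    } else {
      skipped++; // unknown thread, or its content changed since the export
    }
  }
  return { matched, skipped };
}
```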

Export path

Maintainers can generate a seed locally with:

pnpm seed-export openclaw/openclaw --output /tmp/ghcrawl-seeds

The sidecar format is a streamed gzip NDJSON archive with:

  • one manifest record
  • thread embedding rows
  • similarity edge rows

The exporter now streams rows directly from SQLite, so large OpenClaw exports no longer run out of memory (OOM).
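
To make the archive layout concrete, here is a small sketch of a gzip NDJSON round trip using Node's `zlib`. The record shapes and the `type` discriminator are assumptions; the actual field names are not listed in this description. Unlike the PR's exporter, this sketch buffers everything in memory for brevity, so it only illustrates the format, not the streaming behavior.

```typescript
import { gzipSync, gunzipSync } from "node:zlib";

// Hypothetical record shapes: one manifest line, then thread and edge lines.
type SeedRecord =
  | { type: "manifest"; snapshotId: string; threadCount: number; edgeCount: number }
  | { type: "thread"; githubId: string; sourceKind: string; embedding: number[] }
  | { type: "edge"; left: string; right: string; score: number };

// NDJSON: one JSON document per line, newline-terminated, then gzip-compressed.
export function encodeSidecar(records: SeedRecord[]): Buffer {
  const ndjson = records.map((record) => JSON.stringify(record)).join("\n") + "\n";
  return gzipSync(Buffer.from(ndjson, "utf8"));
}

export function decodeSidecar(archive: Uint8Array): SeedRecord[] {
  return gunzipSync(archive)
    .toString("utf8")
    .split("\n")
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as SeedRecord);
}
```

Because each line is an independent JSON document, a reader can process one row at a time, which is what makes a streaming export and a streaming audit possible.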

Audit path

Before publishing a generated seed, maintainers can audit it with:

pnpm seed-audit --asset /tmp/ghcrawl-seeds/<snapshot>.seed.json.gz --repo openclaw/openclaw --sources title,body

The audit is a streaming validation pass that fails if:

  • the manifest targets the wrong repo
  • any thread or edge row points outside the expected repo
  • unexpected payload keys show up
  • source kinds drift outside the expected set
  • manifest counts do not match the observed row counts
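
The failure conditions above can be sketched as a single pass over the rows. Everything named here (`AuditRow`, `AuditManifest`, `ALLOWED_KEYS`, `auditSeed`, and the individual field names) is hypothetical; the sketch only mirrors the five checks listed, it is not the PR's implementation.

```typescript
// Hypothetical shapes for illustration only.
interface AuditManifest {
  repo: string;
  threadCount: number;
  edgeCount: number;
}

interface AuditRow {
  type: "thread" | "edge";
  repo: string;
  sourceKind?: string;
  [key: string]: unknown;
}

// Assumed allow-list of payload keys per row type.
const ALLOWED_KEYS: Record<AuditRow["type"], ReadonlySet<string>> = {
  thread: new Set(["type", "repo", "sourceKind", "githubId", "contentHash", "embedding"]),
  edge: new Set(["type", "repo", "left", "right", "score"]),
};

// Returns a list of problems; an empty list means the audit passed.
export function auditSeed(
  manifest: AuditManifest,
  rows: Iterable<AuditRow>,
  expectedRepo: string,
  expectedSources: string[],
): string[] {
  const errors: string[] = [];
  const sources = new Set(expectedSources);
  let threads = 0;
  let edges = 0;

  if (manifest.repo !== expectedRepo) errors.push(`manifest targets ${manifest.repo}`);

  for (const row of rows) {
    if (row.type === "thread") threads++;
    else edges++;
    if (row.repo !== expectedRepo) errors.push(`${row.type} row points at ${row.repo}`);
    for (const key of Object.keys(row)) {
      if (!ALLOWED_KEYS[row.type].has(key)) errors.push(`unexpected key ${key}`);
    }
    if (row.type === "thread" && !sources.has(row.sourceKind ?? "")) {
      errors.push(`unexpected source kind ${String(row.sourceKind)}`);
    }
  }

  if (threads !== manifest.threadCount) errors.push("thread count mismatch");
  if (edges !== manifest.edgeCount) errors.push("edge count mismatch");
  return errors;
}
```

Accepting any `Iterable` keeps the check compatible with a streaming reader, since rows never need to be held in memory all at once.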

Seed Contents

The exported seed currently contains:

  • title embeddings
  • body embeddings
  • derived similarity edges from the latest completed cluster run

It explicitly does not export dedupe_summary embeddings.

Where Seeds Are Supposed To Go

This PR does not publish any seed automatically.

The intended model is:

  • a maintainer runs seed-export on a populated local dev box
  • the maintainer runs seed-audit locally
  • the resulting .seed.json.gz and .sha256 are uploaded manually to a large-file distribution target
  • the checked-in known-seed manifest is updated to point at the published URL and checksum

The code already supports downloading from an arbitrary HTTPS URL, so the published artifact can live in:

  • a GitHub Release asset
  • object storage such as Cloudflare R2
  • another stable HTTPS file host

There is intentionally no CI or merge-triggered seed publishing in this PR.

Compatibility

The sidecar manifest carries compatibility metadata:

  • schemaVersion
  • format
  • snapshotId
  • compatibleCli
  • embedModel
  • sourceKinds

seed-install validates that metadata before import and rejects incompatible assets.
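
A compatibility gate over those fields might look like the following sketch. The field types, the `LocalExpectations` shape, and the function `incompatibilities` are assumptions. The format of `compatibleCli` is not specified in this description, so the sketch delegates that check to a caller-supplied predicate instead of guessing at a version syntax.

```typescript
// Field names come from the list above; their types are assumed.
interface SeedManifest {
  schemaVersion: number;
  format: string;
  snapshotId: string;
  compatibleCli: string;
  embedModel: string;
  sourceKinds: string[];
}

// Hypothetical description of what the local CLI can accept.
interface LocalExpectations {
  schemaVersion: number;
  format: string;
  embedModel: string;
  sourceKinds: string[];
  // Caller decides whether the running CLI satisfies the manifest's constraint.
  cliSatisfies: (compatibleCli: string) => boolean;
}

// Returns the reasons an asset should be rejected; empty means compatible.
export function incompatibilities(manifest: SeedManifest, local: LocalExpectations): string[] {
  const reasons: string[] = [];
  if (manifest.schemaVersion !== local.schemaVersion) reasons.push("schemaVersion");
  if (manifest.format !== local.format) reasons.push("format");
  if (manifest.embedModel !== local.embedModel) reasons.push("embedModel");
  if (!local.cliSatisfies(manifest.compatibleCli)) reasons.push("compatibleCli");
  const allowed = new Set(local.sourceKinds);
  if (manifest.sourceKinds.some((kind) => !allowed.has(kind))) reasons.push("sourceKinds");
  return reasons;
}
```

Rejecting on a mismatched `embedModel` matters because embeddings produced by different models are not comparable, so mixing them would corrupt similarity results.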

Notes

  • The checked-in known-seed manifest is still a placeholder until a real seed is published.
  • The init wizard only offers starter data when a real published seed URL is configured.
  • If this PR were merged as-is, it would add support for the workflow but would not publish any starter asset by itself.

Testing

  • pnpm typecheck
  • pnpm --filter @ghcrawl/api-core test
  • pnpm --filter ghcrawl test
  • manual real-world export: openclaw/openclaw sidecar generated successfully at ~607 MB compressed with the streaming exporter
  • manual audit: passing audit for openclaw/openclaw and intentional failing audit for a mismatched repo

@github-actions

Cluster Performance

  • Status: PASS
  • Fixture median: 510.3 ms (12 samples, 3 cluster rebuilds/sample)
  • Fixture baseline: 535.1 ms
  • Fixture delta: -24.8 ms (-4.6%)
  • Projected openclaw/openclaw duration: 9m 32.2s
  • Projected openclaw/openclaw baseline: 10m 0.0s
  • Projected delta: -27816.5 ms (-4.6%)
  • Regression threshold: +50.0%
  • Fixture shape: 512 threads x 3 source kinds
  • Sample durations: 507.8 ms, 525.6 ms, 512.7 ms, 512.9 ms, 487.7 ms, 522.2 ms, 514.6 ms, 490.2 ms, 507.0 ms, 515.7 ms, 488.5 ms, 490.5 ms
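
For readers who want to check the fixture numbers, here is a small sketch that recomputes the median and the delta from the listed samples. The helper names are hypothetical and this is not the benchmark's own code. The listed durations are rounded to one decimal place, so results can differ from the report in the last digit.

```typescript
// Median of a non-empty list: middle value, or mean of the two middle values.
export function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty sample set");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Absolute and relative change of a measurement against a baseline.
export function deltaAgainst(baselineMs: number, measuredMs: number) {
  const deltaMs = measuredMs - baselineMs;
  return { deltaMs, deltaPercent: (deltaMs / baselineMs) * 100 };
}

// Sample durations copied from the report above (milliseconds).
export const samples = [
  507.8, 525.6, 512.7, 512.9, 487.7, 522.2,
  514.6, 490.2, 507.0, 515.7, 488.5, 490.5,
];

export const fixtureMedian = median(samples);
export const fixtureDelta = deltaAgainst(535.1, fixtureMedian);
```

With these rounded samples the median works out to about 510.25 ms and the delta to about -4.6% of the 535.1 ms baseline, which is consistent with the reported figures and well inside the +50.0% regression threshold.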

Run: workflow run for 4f9a6f4


const snapshotId =
params.snapshotId ??
`${params.owner}-${params.repo}-${nowIso().replace(/[:.]/g, '-').replace(/Z$/, 'Z')}`;

Check warning

Code scanning / CodeQL

Replacement of a substring with itself (severity: Medium)

This replaces 'Z' with itself.

Copilot Autofix

AI 15 days ago

In general, to fix this issue you should either remove the no-op replace or change it to perform the intended transformation (for example, removing the Z or replacing it with a different suffix). Since we must avoid changing existing functionality where possible, the safest correction is to remove the redundant replace(/Z$/, 'Z') call, leaving the snapshot ID construction unchanged in observable behavior (because the call had no effect) while eliminating the CodeQL warning.

Concretely, in packages/api-core/src/service.ts around line 1398, the snapshotId default is built as:

`${params.owner}-${params.repo}-${nowIso().replace(/[:.]/g, '-').replace(/Z$/, 'Z')}`;

We should remove the .replace(/Z$/, 'Z') segment so that the code becomes:

`${params.owner}-${params.repo}-${nowIso().replace(/[:.]/g, '-')}`;

No additional imports, methods, or definitions are needed, and this does not change the resulting string, since the removed operation was a no-op.
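
The no-op claim can be checked directly with a short sketch. It assumes `nowIso()` returns a string in the `Date.prototype.toISOString()` format, which this thread does not confirm; the function names below are illustrative only.

```typescript
// Original expression: the second replace swaps a trailing "Z" for "Z".
export function suffixWithNoop(iso: string): string {
  return iso.replace(/[:.]/g, "-").replace(/Z$/, "Z");
}

// Suggested expression: same result without the redundant call.
export function suffixWithoutNoop(iso: string): string {
  return iso.replace(/[:.]/g, "-");
}

// Assumed stand-in for nowIso(): a fixed ISO 8601 timestamp.
export const exampleIso = new Date(0).toISOString(); // "1970-01-01T00:00:00.000Z"

console.log(suffixWithNoop(exampleIso)); // 1970-01-01T00-00-00-000Z
console.log(suffixWithNoop(exampleIso) === suffixWithoutNoop(exampleIso)); // true
```

Both forms keep the trailing `Z` and replace every `:` and `.` with `-`, so existing snapshot IDs would be unaffected by the cleanup.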

Suggested changeset 1
packages/api-core/src/service.ts

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/packages/api-core/src/service.ts b/packages/api-core/src/service.ts
--- a/packages/api-core/src/service.ts
+++ b/packages/api-core/src/service.ts
@@ -1397,7 +1397,7 @@
 
     const snapshotId =
       params.snapshotId ??
-      `${params.owner}-${params.repo}-${nowIso().replace(/[:.]/g, '-').replace(/Z$/, 'Z')}`;
+      `${params.owner}-${params.repo}-${nowIso().replace(/[:.]/g, '-')}`;
     const archiveManifest: SeedSidecarArchiveWriterInput = {
       manifest: {
         schemaVersion: 1,
EOF
@huntharo huntharo added this to ghcrawl Mar 19, 2026
@huntharo huntharo moved this to In Review in ghcrawl Mar 19, 2026