
fix(ccusage): avoid RangeError when parsing large transcript JSONL files #875

Open
MumuTW wants to merge 4 commits into ryoppippi:main from MumuTW:fix-ccusage-file-history-snapshot-873

Conversation

@MumuTW

@MumuTW MumuTW commented Mar 6, 2026

Summary

  • replace calculateContextTokens full-file readFile parsing with streaming readline-based parsing
  • skip transcript files early when first non-empty line has type: "file-history-snapshot"
  • add regression test for file-history-snapshot transcript inputs
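
The streaming approach in the first bullet can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `latestAssistantUsage` and the `Usage` shape are hypothetical names, and the hand-written checks stand in for the `transcriptMessageSchema` validation that the real implementation uses.

```typescript
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

// Hypothetical shape for illustration only; the PR validates each line
// with transcriptMessageSchema instead of these hand-written checks.
type Usage = {
    input_tokens: number;
    cache_creation_input_tokens?: number;
    cache_read_input_tokens?: number;
};

// Stream a JSONL transcript line by line and keep only the most recent
// assistant usage record, so the file is never held in memory as one string.
export async function latestAssistantUsage(path: string): Promise<Usage | null> {
    const rl = createInterface({
        input: createReadStream(path, { encoding: 'utf-8' }),
        crlfDelay: Infinity,
    });
    let latest: Usage | null = null;
    for await (const line of rl) {
        const trimmed = line.trim();
        if (trimmed === '') {
            continue;
        }
        try {
            const obj = JSON.parse(trimmed) as {
                type?: string;
                message?: { usage?: Partial<Usage> };
            };
            const usage = obj.message?.usage;
            if (obj.type === 'assistant' && usage?.input_tokens != null) {
                latest = {
                    input_tokens: usage.input_tokens,
                    cache_creation_input_tokens: usage.cache_creation_input_tokens,
                    cache_read_input_tokens: usage.cache_read_input_tokens,
                };
            }
        } catch {
            // Skip malformed lines instead of aborting the whole file.
        }
    }
    return latest;
}
```

Note that readline still buffers each individual line in full, so a sketch like this only bounds memory when lines are reasonably sized. A single very large line needs a separate guard before the reader is created.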

Testing

  • pnpm --dir ccusage --filter ccusage test

Fixes #873

Summary by CodeRabbit

  • Performance

    • Reduced memory use and improved speed for large transcript processing via incremental streaming parsing.
  • Bug Fixes

    • More robust error handling and resilience during data loading.
    • More accurate context-token calculations and usage-percentage reporting.
    • Early-skip handling for specific transcript types to avoid incorrect results.
  • Behavior Changes

    • JSON output mode now suppresses standard log output for quieter machine-readable results.
  • Tests

    • Added coverage for edge-case transcript processing and early-skip behavior.

@coderabbitai

coderabbitai bot commented Mar 6, 2026

📝 Walkthrough

Rewrites ccusage transcript parsing to stream-read JSONL files line-by-line with an early skip for file-history-snapshot entries and incremental assistant-usage extraction to compute context token percentages. Separately, opencode CLI commands now silence logs when JSON output is requested.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Streaming JSONL parser: apps/ccusage/src/data-loader.ts | Replaces full-file reads with createReadStream + readline streaming; fast-prefix check to early-return on file-history-snapshot; per-line JSON parse + transcriptMessageSchema validation; track latest assistant usage (tokens, cacheTokens); request context limit from PricingFetcher when modelId present; robust per-line error handling; preserves null semantics for no usable data. |
| CLI JSON output logging changes: apps/opencode/src/commands/daily.ts, apps/opencode/src/commands/weekly.ts, apps/opencode/src/commands/monthly.ts, apps/opencode/src/commands/session.ts | When --json / jsonOutput is set, set logger.level = 0 at start to silence normal logging before loading/output. |

Sequence Diagram(s)

sequenceDiagram
  participant FS as File System
  participant Stream as Stream Reader
  participant Parser as Per-line Parser/Validator
  participant Aggregator as Usage Aggregator
  participant Pricing as PricingFetcher

  FS->>Stream: open JSONL (createReadStream)
  Stream->>Parser: emit next line
  Parser-->>Stream: parsed object or error
  alt first-line indicates file-history-snapshot
    Parser->>Aggregator: signal skip -> return null
  else assistant usage line found
    Parser->>Aggregator: update latestUsage (inputTokens, cacheTokens)
    Aggregator->>Stream: continue reading
  end
  Stream->>Aggregator: EOF
  Aggregator->>Pricing: request contextLimit (modelId)
  Pricing-->>Aggregator: contextLimit or failure
  Aggregator->>Caller: compute percentage or return null
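
The final step of the diagram ("compute percentage or return null") might look like the sketch below. The formula is an assumption: it treats context usage as input tokens plus both cache token counts, which the walkthrough implies but does not spell out. `contextPercentage` is a hypothetical name, and `contextLimit` stands in for whatever PricingFetcher returns.

```typescript
// Hypothetical usage shape, mirroring the fields named in the review.
type Usage = {
    input_tokens: number;
    cache_creation_input_tokens?: number;
    cache_read_input_tokens?: number;
};

// Assumption: context usage = input tokens + cache creation + cache read tokens.
// Returns null when no usable context limit is available.
export function contextPercentage(usage: Usage, contextLimit: number | null): number | null {
    if (contextLimit == null || contextLimit <= 0) {
        return null;
    }
    const used
        = usage.input_tokens
            + (usage.cache_creation_input_tokens ?? 0)
            + (usage.cache_read_input_tokens ?? 0);
    // Clamp to 0..100 so an over-limit transcript still reports a sane value.
    return Math.min(100, Math.max(0, Math.round((used / contextLimit) * 100)));
}
```

For example, 1,000 input tokens plus 49,000 cache-read tokens against a 200,000-token limit gives 25.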

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • ryoppippi

Poem

🐰 I hop through lines and parse with care,

Skipping snapshot mountains, light as air.
I count the tokens, gently, one by one,
Streaming safe and tidy — job well done. 🥕

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Out of Scope Changes check | ⚠️ Warning | Changes to apps/opencode/src/commands files (daily.ts, monthly.ts, session.ts, weekly.ts) adding logger.level=0 for JSON output appear out of scope relative to the linked issue #873, which focuses solely on fixing RangeError in ccusage data-loader.ts. | Remove logger-level changes from opencode commands or link a separate issue documenting this JSON logging behavior as a requirement. |
| Linked Issues check | ❓ Inconclusive | The PR addresses the core requirements from issue #873: streaming-based parsing with early file-type detection [#873], skip file-history-snapshot files [#873], and prevent RangeError crashes [#873]. However, changes to opencode commands (daily.ts, monthly.ts, session.ts, weekly.ts) adding logger.level=0 for JSON mode appear unrelated to the linked issue. | Clarify whether logger-level changes in opencode commands are part of issue #873 or a separate concern, as they appear unrelated to the stated objective of fixing file parsing crashes. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: replacing file-read approach with streaming to avoid RangeError when parsing large JSONL files, which is the primary objective of this PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
apps/ccusage/src/data-loader.ts (1)

1295-1314: Type assignment may not fully narrow input_tokens to required.

The assignment at line 1309 assigns obj.message.usage (where input_tokens is optional per transcriptUsageSchema) to latestUsage (where input_tokens is required). While the check at line 1307 ensures input_tokens != null at runtime, TypeScript's property narrowing may not fully narrow the parent object type.

Consider using a type assertion or explicit object construction to ensure type safety:

💡 Suggested refactor for explicit type construction
 if (
     obj.type === 'assistant' &&
     obj.message != null &&
     obj.message.usage != null &&
     obj.message.usage.input_tokens != null
 ) {
-    latestUsage = obj.message.usage;
+    latestUsage = {
+        input_tokens: obj.message.usage.input_tokens,
+        cache_creation_input_tokens: obj.message.usage.cache_creation_input_tokens,
+        cache_read_input_tokens: obj.message.usage.cache_read_input_tokens,
+    };
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/ccusage/src/data-loader.ts` around lines 1295 - 1314, The assignment of
obj.message.usage to latestUsage can leave TypeScript unconvinced that
input_tokens is present because transcriptUsageSchema marks it optional; to fix,
explicitly construct or cast a value with the required shape before assigning to
latestUsage — e.g., after the runtime check (obj.message.usage != null &&
obj.message.usage.input_tokens != null) create a new object with the needed
properties (or use a type assertion to the required type) and assign that to
latestUsage; update the logic around transcriptMessageSchema,
transcriptUsageSchema, obj, input_tokens, and latestUsage in the try block so
the compiler sees a value that definitely satisfies latestUsage's required
fields.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3f65d700-2b4a-40b7-bd2d-fc00166209e9

📥 Commits

Reviewing files that changed from the base of the PR and between c40ea6e and 0896c64.

📒 Files selected for processing (1)
  • apps/ccusage/src/data-loader.ts

@MumuTW
Author

MumuTW commented Mar 6, 2026

Follow-up for #829: silenced logger output in JSON mode for ccusage-opencode.

What changed:
- Set when is active in:
  -
  -
  -
  -

Validation:
- vinext |  WARN  The field "pnpm.peerDependencyRules" was found in /home/opc/.paperclip/instances/default/workspaces/7948d02f-b91e-4189-b9eb-32bf0b5923d2/vinext/package.json. This will not take effect. You should configure "pnpm.peerDependencyRules" at the root of the workspace instead.
vinext |  WARN  The field "pnpm.onlyBuiltDependencies" was found in /home/opc/.paperclip/instances/default/workspaces/7948d02f-b91e-4189-b9eb-32bf0b5923d2/vinext/package.json. This will not take effect. You should configure "pnpm.onlyBuiltDependencies" at the root of the workspace instead.

@ccusage/opencode@18.0.8 test /home/opc/.paperclip/instances/default/workspaces/7948d02f-b91e-4189-b9eb-32bf0b5923d2/ccusage/apps/opencode
TZ=UTC vitest

RUN v4.0.15 /home/opc/.paperclip/instances/default/workspaces/7948d02f-b91e-4189-b9eb-32bf0b5923d2/ccusage/apps/opencode

✓ src/data-loader.ts (2 tests) 4ms
✓ src/commands/weekly.ts (4 tests) 4ms

Test Files 2 passed (2)
Tests 6 passed (6)
Start at 05:10:22
Duration 463ms (transform 238ms, setup 0ms, import 435ms, tests 8ms, environment 0ms)
- (from )

Commit:

@MumuTW
Author

MumuTW commented Mar 6, 2026

Follow-up for #829: silenced logger output in JSON mode for ccusage-opencode.

What changed:

  • Set logger.level = 0 when --json is active in:
    • apps/opencode/src/commands/daily.ts
    • apps/opencode/src/commands/monthly.ts
    • apps/opencode/src/commands/session.ts
    • apps/opencode/src/commands/weekly.ts
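
The pattern described in this list can be illustrated with the sketch below. The `Logger` class is a hypothetical stand-in with a numeric `level` and is not the project's actual logger; `runDaily` is likewise an invented handler. The only point is ordering: the level is lowered before any work that might log, so the JSON document is the only output.

```typescript
// Hypothetical stand-in logger: info messages are recorded only when level >= 3.
class Logger {
    level = 3;
    readonly lines: string[] = [];

    info(message: string): void {
        if (this.level >= 3) {
            this.lines.push(message);
        }
    }
}

// Sketch of a command handler. The logger is silenced first when JSON output
// is requested, so nothing else is mixed into the machine-readable result.
export function runDaily(logger: Logger, jsonOutput: boolean): string {
    if (jsonOutput) {
        logger.level = 0;
    }
    logger.info('Loading usage data...');
    const report = { days: [] as string[] };
    return jsonOutput ? JSON.stringify(report) : 'No usage data found.';
}
```

With this ordering, piping the JSON result into a tool such as `jq` does not fail on stray log lines.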

Validation:

  • pnpm --filter @ccusage/opencode test
  • bun ./src/index.ts daily --json | jq . (run from apps/opencode)

Commit: 9995939

@ryoppippi
Owner

thanks! lmc

@pkg-pr-new

pkg-pr-new bot commented Mar 6, 2026


@ccusage/amp

npm i https://pkg.pr.new/ryoppippi/ccusage/@ccusage/amp@875

ccusage

npm i https://pkg.pr.new/ryoppippi/ccusage@875

@ccusage/codex

npm i https://pkg.pr.new/ryoppippi/ccusage/@ccusage/codex@875

@ccusage/mcp

npm i https://pkg.pr.new/ryoppippi/ccusage/@ccusage/mcp@875

@ccusage/opencode

npm i https://pkg.pr.new/ryoppippi/ccusage/@ccusage/opencode@875

@ccusage/pi

npm i https://pkg.pr.new/ryoppippi/ccusage/@ccusage/pi@875

commit: 9995939

…ement

The latestUsage variable requires input_tokens as a non-optional number,
but obj.message.usage has it as optional. Explicitly construct the object
after the null check so TypeScript can see the narrowed type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/ccusage/src/data-loader.ts`:
- Around line 1269-1288: The current fast-path checks the first non-empty line
via readline (createInterface) which forces Node to buffer the entire line and
crashes on huge single-line records; fix by reading a bounded prefix from the
file before creating the readline reader: open transcriptPath with fs (e.g.,
fs.open + filehandle.read or createReadStream with { start: 0, end: N-1 }), read
a small prefix (e.g., 4 KiB), trim leading whitespace, attempt to parse only
that prefix (or regex-extract the initial {"type":...} token) to detect if type
=== "file-history-snapshot", and if so log via logger.debug and return null;
otherwise close the temp handle/stream and then create the original
createReadStream + createInterface and continue as before. Ensure you properly
close file handles/streams (or destroy the temp stream) and preserve the
existing variables firstNonEmptyLineSeen and the rest of the processing flow.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fc0d4887-4b7a-4c69-9bb9-34c1e89f1600

📥 Commits

Reviewing files that changed from the base of the PR and between 9995939 and b9fd7cb.

📒 Files selected for processing (1)
  • apps/ccusage/src/data-loader.ts

…tion

Read only the first 4 KiB of the file to detect file-history-snapshot
type instead of using readline, which buffers the entire first line
and crashes on huge single-line records (e.g. 734 MB).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
apps/ccusage/src/data-loader.ts (1)

1299-1320: Keep the new transcript path in the repo’s Result style.

This adds fresh try/catch JSON parsing plus repeated Result.isSuccess(...) checks. Switching the throwable parse to Result.try() and branching on Result.isFailure(contextLimitResult) would match the project’s byethrow conventions and keep the happy path flatter. As per coding guidelines, "Prefer @praha/byethrow Result type over traditional try-catch for functional error handling", "Use Result.try() for wrapping operations that may throw (JSON parsing, etc.)", and "Use Result.isFailure() for checking errors (more readable than !Result.isSuccess())".

Also applies to: 1342-1352
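
The flatter control flow requested here can be sketched with a hand-rolled `Result` type. This is deliberately not the `@praha/byethrow` API, whose exact signatures are not shown in this thread; `Result`, `isFailure`, `tryParseJson`, and `describeLine` are all hypothetical names used only to show the shape of the refactor.

```typescript
// Hand-rolled stand-in for a Result type; NOT the @praha/byethrow API.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

function isFailure<T, E>(r: Result<T, E>): r is { ok: false; error: E } {
    return !r.ok;
}

// Wrap the throwing JSON.parse once so callers never need their own try/catch.
function tryParseJson(line: string): Result<unknown, Error> {
    try {
        return { ok: true, value: JSON.parse(line) };
    } catch (error) {
        return { ok: false, error: error instanceof Error ? error : new Error(String(error)) };
    }
}

// Returning early on failure keeps the happy path unindented.
function describeLine(line: string): string {
    const parsed = tryParseJson(line);
    if (isFailure(parsed)) {
        return `skipped: ${parsed.error.name}`;
    }
    return `parsed: ${typeof parsed.value}`;
}
```

The single `try/catch` lives inside the wrapper, and every call site branches on the failure case first, which is the readability gain the guideline is after.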

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/ccusage/src/data-loader.ts`:
- Around line 1267-1284: The fast-path currently assumes the first 4KiB begins
with {"type":...} and misses snapshots when "type" is not the first property;
change the probe to extract the first line from prefixBuf (find first newline
CR/LF within PREFIX_SIZE), parse that first-line substring as JSON (safe
try/catch) and read its top-level "type" property (instead of regex anchored to
the start) to detect "file-history-snapshot" and short-circuit (symbols:
PREFIX_SIZE, prefixBuf, readSync, transcriptPath, typeMatch/logger.debug); if no
newline is present in the prefix keep the existing fallback to readline; add a
regression test that writes a snapshot line where "type" is not the first field
to ensure detection still works.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2c75f78d-d9c4-4ca0-9955-a44ec79acec1

📥 Commits

Reviewing files that changed from the base of the PR and between b9fd7cb and a3bc6cb.

📒 Files selected for processing (1)
  • apps/ccusage/src/data-loader.ts

Comment on lines +1267 to +1284
// Fast-path: read a small prefix to detect file-history-snapshot without
// buffering a potentially huge first line via readline (see #873).
const PREFIX_SIZE = 4096;
const prefixBuf = Buffer.alloc(PREFIX_SIZE);
const fd = openSync(transcriptPath, 'r');
let bytesRead: number;
try {
    bytesRead = readSync(fd, prefixBuf, 0, PREFIX_SIZE, 0);
} finally {
    closeSync(fd);
}
if (bytesRead > 0) {
    const prefix = prefixBuf.subarray(0, bytesRead).toString('utf-8').trimStart();
    const typeMatch = prefix.match(/^\s*\{\s*"type"\s*:\s*"([^"]+)"/);
    if (typeMatch != null && typeMatch[1] === 'file-history-snapshot') {
        logger.debug('Skipping file-history-snapshot transcript file for context tokens');
        return null;
    }

⚠️ Potential issue | 🟠 Major

Don't make the snapshot fast-path depend on "type" being the first key.

The new probe only recognizes file-history-snapshot when the first record starts with {"type": ...} and that field appears inside the first 4 KiB. A valid snapshot line with another leading property will miss this check, fall back to readline, and reopen the huge-line crash path this PR is fixing. Please extract the first line’s top-level type without assuming field order, and add a regression where type is not serialized first.

Also applies to: 4780-4799
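
One way to read the first record's top-level `type` without assuming key order is sketched below. It is a minimal illustration, not the PR's code: `probeFirstLineType` is a hypothetical helper, and it only answers when the whole first line fits inside the prefix.

```typescript
import { closeSync, openSync, readSync } from 'node:fs';

// Returns the top-level "type" of the first line when that line fits entirely
// inside the prefix, or null when it cannot be determined cheaply.
export function probeFirstLineType(path: string, prefixSize = 4096): string | null {
    const buf = Buffer.alloc(prefixSize);
    const fd = openSync(path, 'r');
    let bytesRead = 0;
    try {
        bytesRead = readSync(fd, buf, 0, prefixSize, 0);
    } finally {
        closeSync(fd);
    }
    const prefix = buf.subarray(0, bytesRead).toString('utf-8').trimStart();
    const newline = prefix.search(/\r?\n/);
    // No newline and a full buffer means the first line is probably truncated.
    if (newline === -1 && bytesRead === prefixSize) {
        return null;
    }
    const firstLine = newline === -1 ? prefix : prefix.slice(0, newline);
    try {
        const obj: unknown = JSON.parse(firstLine);
        if (typeof obj === 'object' && obj !== null && 'type' in obj) {
            const type = (obj as { type: unknown }).type;
            return typeof type === 'string' ? type : null;
        }
    } catch {
        // Not valid JSON; the caller falls back to its normal path.
    }
    return null;
}
```

This handles `{"messageId":"...","type":"file-history-snapshot"}` correctly, but it returns null for a first line longer than the prefix. That is exactly the shape of the huge single-line case, so a caller would still need a cheap answer there, which this sketch does not attempt.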



Development

Successfully merging this pull request may close these issues.

[BUG] RangeError: Invalid string length caused by file-history-snapshot JSONL files (734MB)

2 participants