
Commit c62bf7a

committed
fix(translate): chunk large pages to avoid GPT-5.2 structured-output token limit
Pages like "Creating a New Observation" (~486K tokens) exceed OpenAI's 272K-token limit for json_schema strict mode, causing translation to fail for both pt-BR and es.

Fix: translateText now splits oversized markdown into chunks before calling the API, then reassembles the translated pieces transparently. Callers and function signatures are unchanged.

Key details:
- TRANSLATION_CHUNK_MAX_CHARS = 500_000 (~143K tokens, conservative buffer)
- Fence-aware section splitter: "#" inside code blocks is never a boundary
- 3-level fallback: headings -> paragraphs -> lines -> character slicing
- Leading oversized tokens are correctly split even when no content has yet been accumulated in the current chunk
- token_overflow error code (non-critical) enables targeted recovery
- Adaptive fallback halves any chunk that still overflows after splitting
- 11 tests, including lossless round-trip and leading-token edge cases
1 parent 2af1767 commit c62bf7a
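A minimal sketch of the chunking strategy the commit message describes, assuming a fence-aware heading splitter with a hard character-slicing fallback. The function names here (`splitAtHeadings`, `chunkMarkdown`) are illustrative, not the repository's actual API, and the real splitMarkdownIntoChunks additionally falls back through paragraphs and lines before slicing:

````typescript
// Split markdown at top-level headings, treating "#" inside ``` fences as
// plain text so fenced code blocks never become chunk boundaries.
function splitAtHeadings(markdown: string): string[] {
  const sections: string[] = [];
  let current = "";
  let inFence = false;
  // split with a lookbehind keeps each trailing "\n", so join("") is lossless
  for (const line of markdown.split(/(?<=\n)/)) {
    if (line.trimStart().startsWith("```")) inFence = !inFence;
    if (!inFence && line.startsWith("# ") && current !== "") {
      sections.push(current);
      current = "";
    }
    current += line;
  }
  if (current !== "") sections.push(current);
  return sections;
}

// Pack sections into chunks of at most maxChars. A section that alone exceeds
// the limit is hard-sliced, which also covers the oversized-leading-token case
// where nothing has been accumulated into the current chunk yet.
function chunkMarkdown(markdown: string, maxChars: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const section of splitAtHeadings(markdown)) {
    if (current.length + section.length <= maxChars) {
      current += section;
      continue;
    }
    if (current !== "") {
      chunks.push(current);
      current = "";
    }
    // Last-resort fallback: character slicing for an oversized section
    let rest = section;
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current = rest;
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
````

Because every split keeps its newlines attached, `chunks.join("")` always reconstructs the input exactly, which is the lossless round-trip property the tests below verify.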

File tree: 3 files changed, +390 −14 lines


scripts/constants.ts

Lines changed: 4 additions & 0 deletions
@@ -129,6 +129,10 @@ export const ENGLISH_DIR_SAVE_ERROR =
 // Translation retry configuration
 export const TRANSLATION_MAX_RETRIES = 3;
 export const TRANSLATION_RETRY_BASE_DELAY_MS = 750;
+/** Max characters per translation chunk.
+ * Targets ~143K tokens (500K chars / 3.5 chars per token).
+ * Leaves generous buffer within OpenAI's 272K structured-output limit. */
+export const TRANSLATION_CHUNK_MAX_CHARS = 500_000;
 
 // URL handling
 export const INVALID_URL_PLACEHOLDER =
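A back-of-envelope check of the constant above, assuming the commit's ~3.5 chars-per-token heuristic (a rough average for markdown-heavy English text, not a real tokenizer):

```typescript
const TRANSLATION_CHUNK_MAX_CHARS = 500_000;
const CHARS_PER_TOKEN = 3.5; // heuristic, not an actual tokenizer count
const OPENAI_STRUCTURED_OUTPUT_TOKEN_LIMIT = 272_000;

// ~143K tokens per chunk under the heuristic
const estimatedTokens = Math.ceil(TRANSLATION_CHUNK_MAX_CHARS / CHARS_PER_TOKEN);

// Headroom left for the system prompt, schema, and tokenizer variance
const headroom = OPENAI_STRUCTURED_OUTPUT_TOKEN_LIMIT - estimatedTokens;
```

With these numbers a chunk is estimated at 142,858 tokens, leaving roughly 129K tokens of headroom before the 272K limit, so even a text that tokenizes much worse than 3.5 chars/token should still fit.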

scripts/notion-translate/translateFrontMatter.test.ts

Lines changed: 111 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -54,4 +54,115 @@ describe("notion-translate translateFrontMatter", () => {
       })
     );
   });
+
+  it("classifies token overflow errors as non-critical token_overflow code", async () => {
+    const { translateText } = await import("./translateFrontMatter");
+
+    mockOpenAIChatCompletionCreate.mockRejectedValueOnce({
+      status: 400,
+      message:
+        "Input tokens exceed the configured limit of 272000 tokens. Your messages resulted in 486881 tokens.",
+    });
+
+    await expect(translateText("# Body", "Title", "pt-BR")).rejects.toEqual(
+      expect.objectContaining({
+        code: "token_overflow",
+        isCritical: false,
+      })
+    );
+  });
+
+  it("takes the single-call fast path for small content", async () => {
+    const { translateText } = await import("./translateFrontMatter");
+
+    const result = await translateText(
+      "# Small page\n\nJust a paragraph.",
+      "Small",
+      "pt-BR"
+    );
+
+    expect(mockOpenAIChatCompletionCreate).toHaveBeenCalledTimes(1);
+    expect(result.title).toBe("Mock Title");
+    expect(result.markdown).toBe("# translated\n\nMock content");
+  });
+
+  it("chunks large content and calls the API once per chunk", async () => {
+    const { translateText, splitMarkdownIntoChunks } = await import(
+      "./translateFrontMatter"
+    );
+
+    // Build content that is larger than the chunk threshold
+    const bigSection1 = "# Section One\n\n" + "word ".repeat(100_000);
+    const bigSection2 = "\n# Section Two\n\n" + "word ".repeat(100_000);
+    const bigContent = bigSection1 + bigSection2;
+
+    // Sanity: verify it would be split
+    const chunks = splitMarkdownIntoChunks(bigContent, 500_000);
+    expect(chunks.length).toBeGreaterThan(1);
+
+    // translateText should call the API once per chunk
+    const result = await translateText(bigContent, "Big Page", "pt-BR");
+
+    expect(
+      mockOpenAIChatCompletionCreate.mock.calls.length
+    ).toBeGreaterThanOrEqual(2);
+    expect(result.title).toBe("Mock Title"); // taken from first chunk
+    expect(typeof result.markdown).toBe("string");
+    expect(result.markdown.length).toBeGreaterThan(0);
+  });
+
+  it("splitMarkdownIntoChunks does not split on headings inside fenced code blocks", async () => {
+    const { splitMarkdownIntoChunks } = await import("./translateFrontMatter");
+
+    const content =
+      "# Real Heading\n\n```\n# not a heading\n```\n\n# Another Heading\n\ntext\n";
+
+    // With a small limit, only the real headings should be split boundaries
+    const chunks = splitMarkdownIntoChunks(content, 40);
+
+    // The "# not a heading" line inside the fence should stay in one chunk
+    const joined = chunks.join("");
+    expect(joined).toBe(content); // round-trip must be lossless
+    const fenceChunk = chunks.find((c) => c.includes("```"));
+    expect(fenceChunk).toBeDefined();
+    expect(fenceChunk).toContain("# not a heading");
+  });
+
+  it("splitMarkdownIntoChunks reassembly is lossless", async () => {
+    const { splitMarkdownIntoChunks } = await import("./translateFrontMatter");
+
+    const original =
+      "# Heading 1\n\nParagraph one.\n\n# Heading 2\n\nParagraph two.\n";
+    const chunks = splitMarkdownIntoChunks(original, 30);
+    const reassembled = chunks.join("");
+    expect(reassembled).toBe(original);
+  });
+
+  it("splitMarkdownIntoChunks splits an oversized leading paragraph (no current accumulation bug)", async () => {
+    const { splitMarkdownIntoChunks } = await import("./translateFrontMatter");
+
+    // Leading paragraph exceeds the chunk limit with no preceding content
+    const bigParagraph = "a".repeat(200);
+    const chunks = splitMarkdownIntoChunks(bigParagraph, 50);
+
+    // Every chunk must respect the limit
+    for (const chunk of chunks) {
+      expect(chunk.length).toBeLessThanOrEqual(50);
+    }
+    // Round-trip must be lossless
+    expect(chunks.join("")).toBe(bigParagraph);
+  });
+
+  it("splitMarkdownIntoChunks splits an oversized leading line (splitByLines leading bug)", async () => {
+    const { splitMarkdownIntoChunks } = await import("./translateFrontMatter");
+
+    // A single very long line with no newlines (worst case for splitByLines)
+    const longLine = "x".repeat(300);
+    const chunks = splitMarkdownIntoChunks(longLine, 100);
+
+    for (const chunk of chunks) {
+      expect(chunk.length).toBeLessThanOrEqual(100);
+    }
+    expect(chunks.join("")).toBe(longLine);
+  });
 });
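The first test above pins down the error shape (`code: "token_overflow"`, `isCritical: false`). A minimal sketch of the classification it implies, assuming a status-400 message matching OpenAI's input-token-limit wording; the names here (TranslationError, classifyOpenAIError) are illustrative, not the module's actual exports:

```typescript
interface TranslationError {
  code: string;
  isCritical: boolean;
  message: string;
}

function classifyOpenAIError(err: {
  status?: number;
  message?: string;
}): TranslationError {
  const message = err.message ?? "unknown OpenAI error";
  // A 400 complaining about the input-token limit is recoverable: the caller
  // can re-split the offending chunk and retry, so it is marked non-critical.
  if (err.status === 400 && /exceed.*limit of \d+ tokens/i.test(message)) {
    return { code: "token_overflow", isCritical: false, message };
  }
  // Anything else is treated as a hard failure for this page.
  return { code: "translation_failed", isCritical: true, message };
}
```

Keeping the overflow path non-critical is what lets the adaptive fallback halve the offending chunk and retry instead of aborting the whole page.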
