
Conversation

@lukasdotcom
Member

The implementation for TopicsProvider isn't that great, but is passable.
Also, I added progress reporting to all the chunked providers because it made debugging easier.
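
For illustration, a rough Python sketch of what per-chunk progress reporting could look like; the chunk_text helper, the complete() callable, and the report_progress callback are assumptions for this sketch, not the providers' actual API:

```python
# Illustrative sketch only: per-chunk progress reporting
# (the helper names and the report_progress callback are assumptions).
from typing import Callable, List


def chunk_text(text: str, max_chars: int = 4000) -> List[str]:
    """Naive chunking by character count; a real provider may split on sentence boundaries."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def process_chunks(text: str,
                   complete: Callable[[str], str],
                   report_progress: Callable[[float], None]) -> List[str]:
    chunks = chunk_text(text)
    results = []
    for i, chunk in enumerate(chunks):
        results.append(complete(chunk))            # one completion request per chunk
        report_progress((i + 1) / len(chunks))     # per-chunk progress makes debugging easier
    return results
```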

@lukasdotcom lukasdotcom force-pushed the feat/chunking branch 3 times, most recently from ce995c1 to 72c3dff Compare July 21, 2025 12:41
…formulateProvider, TopicsProvider, and TranslateProvider

Signed-off-by: Lukas Schaefer <[email protected]>
Member

@julien-nc julien-nc left a comment


For some task types like proofread, the output can be weird because one chunk could be correct and another one incorrect. So one completion request might produce "there are no spelling or grammar mistakes" while another one lists the mistakes, ending up in a confusing/contradictory answer.

Maybe the proofread prompt could be adjusted to something like "do not output anything if there is no mistake or correction suggestion". Then, in the end, we assemble the results and, if all the chunks were clean, we artificially set the response to "there are no spelling or grammar mistakes". Wdyt?

@lukasdotcom lukasdotcom force-pushed the feat/chunking branch 2 times, most recently from c8b9842 to c9ecd9c Compare July 29, 2025 12:35
@lukasdotcom
Member Author

For some task types like proofread, the output can be weird because one chunk could be correct and another one incorrect. So one completion request might produce "there are no spelling or grammar mistakes" while another one lists the mistakes, ending up in a confusing/contradictory answer.

Maybe the proofread prompt could be adjusted to something like "do not output anything if there is no mistake or correction suggestion". Then, in the end, we assemble the results and, if all the chunks were clean, we artificially set the response to "there are no spelling or grammar mistakes". Wdyt?

I did try changing the prompt so that it outputs nothing, but it really doesn't want to do that: it will either ignore that instruction, literally say 'empty string' or 'nothing', or output an empty JSON list. This happens with many different models. Also, I don't think we can just set the response to "there are no spelling and grammar mistakes" when none of the chunks produce output, since the user might not expect or speak English.

@marcelklehr
Member

We could use another LLM request to combine the outputs

@julien-nc
Member

We could use another LLM request to combine the outputs

@lukasdotcom That sounds safer indeed. This will hopefully get rid of the contradiction 😁

@kyteinsky
Contributor

For some task types like proofread, the output can be weird as one chunk could be correct and another one incorrect
We could use another LLM request to combine the outputs

One other method could be to use structured outputs and instruct the LLM to output changes in different JSON keys. Even smaller models are capable of that now.
For example, for the summary task type, appending this text works nicely:

output the summary in the following JSON format:
{"summary": "string"}

The actual text can be wrapped in something like """ """.
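
For illustration, a rough Python sketch of the structured-output idea applied to proofreading; the prompt wording, the complete() callable, and the "corrections" key are assumptions for this sketch, not code from this PR:

```python
# Illustrative sketch of the structured-output idea for proofreading (assumed prompt and names).
import json

PROOFREAD_SUFFIX = (
    'Output the corrections in the following JSON format:\n'
    '{"corrections": ["string"]}\n'
    'Use an empty list if there are no mistakes.\n'
    'The text to proofread is wrapped in """ """.'
)


def proofread_chunk(chunk: str, complete) -> list[str]:
    prompt = f'{PROOFREAD_SUFFIX}\n\n"""\n{chunk}\n"""'
    raw = complete(prompt)                   # one completion request per chunk
    try:
        data = json.loads(raw)
        return data.get("corrections", [])
    except (json.JSONDecodeError, AttributeError):
        return [raw]                         # fall back to the raw answer if parsing fails
```

An empty "corrections" list would let the assembly step skip clean chunks instead of collecting contradictory "no mistakes" answers.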

@lukasdotcom
Member Author

For some task types like proofread, the output can be weird as one chunk could be correct and another one incorrect
We could use another LLM request to combine the outputs

One other method could be to use structured outputs and instruct the LLM to output changes in different JSON keys. Even smaller models are capable of that now. For example, for the summary task type, appending this text works nicely:

output the summary in the following JSON format:
{"summary": "string"}

The actual text can be wrapped in something like """ """.

The advantage I see in @marcelklehr's method is that it would also allow duplicate recommendations to be removed, e.g. if the person misspelled the same word multiple times in different chunks. The JSON approach, on the other hand, could save an LLM request.

@marcelklehr
Member

What would be the benefit of having the LLM wrap the output in JSON? We still need to combine potentially conflicting outputs somehow.

@lukasdotcom
Member Author

What would be the benefit of having the LLM wrap the output in JSON? We still need to combine potentially conflicting outputs somehow.

That LLMs will actually output an empty string.

@lukasdotcom
Member Author

What would be the benefit of having the LLM wrap the output in JSON? We still need to combine potentially conflicting outputs somehow.

Implemented another LLM request, used when multiple chunks exist, to merge all the feedback into one list. Also made the formatting nicer, because a lot of the time the LLM spits out a numbered list and it looks weird when the assembled result counts 1, 2, 3, 1, 2...
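
As a rough Python sketch of that merge step (the merge prompt wording and function names are assumptions, not the code actually merged here):

```python
# Illustrative sketch of merging per-chunk feedback with one extra LLM request (assumed names).
def merge_feedback(chunk_outputs: list[str], complete) -> str:
    if len(chunk_outputs) == 1:
        return chunk_outputs[0]            # single chunk: no merge request needed
    merge_prompt = (
        "The following are proofreading results for consecutive parts of one document.\n"
        "Merge them into a single list, remove duplicate suggestions, and renumber the items:\n\n"
        + "\n\n".join(chunk_outputs)
    )
    return complete(merge_prompt)          # extra LLM request combines the chunk results
```

Asking the merge request to renumber the items also avoids the 1, 2, 3, 1, 2... pattern from naively concatenating numbered lists.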

Member

@julien-nc julien-nc left a comment


Works nicely with gpt-3.5.

We should improve the system prompts to increase the chances the answer is given in the same language as the input text. Can be done in another issue/PR.

@lukasdotcom lukasdotcom merged commit bbc46db into main Jul 30, 2025
29 checks passed
@lukasdotcom lukasdotcom deleted the feat/chunking branch July 30, 2025 12:06
@kyteinsky kyteinsky mentioned this pull request Oct 6, 2025