fix(core): implement spaCy sentencization and sliding window chunking #473
SxBxcoder wants to merge 4 commits into AOSSIE-Org:main from
Conversation
📝 Walkthrough: Adds a thread-safe, lazily initialized
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 5 passed checks
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/Generator/main.py`:
- Around line 520-544: The _split_into_segments function currently can append
empty or oversized segments when a single tokenized paragraph exceeds
MAX_TOKENS; fix it by detecting when len(paragraph) > MAX_TOKENS and chunking
that paragraph into sublists of at most MAX_TOKENS (respecting OVERLAP_SIZE
between chunks), ensure you check that current_segment is non-empty before
appending to segments, and after chunking continue merging remaining chunks with
current_segment using the same overlap logic; reference symbols:
_split_into_segments, MAX_TOKENS, OVERLAP_SIZE, tokenized_paragraphs,
current_segment, segments, and self.qg_tokenizer.decode to produce final decoded
segments.
- Around line 510-518: The _split_text function currently filters out sentences
with ≤4 words, which can return an empty list and break downstream functions
like _prepare_qg_inputs and _prepare_qg_inputs_MC; modify _split_text so that
after building the filtered list (sentences = [ ... if len(...) > 4]), if that
filtered list is empty then fall back to returning the original sentences list
(deduplicated) from doc.sents (i.e., preserve the original short sentences
rather than returning []). Keep deduplication behavior and use the same ordering
and strip logic so callers of _prepare_qg_inputs/_prepare_qg_inputs_MC receive
non-empty inputs.
- Around line 405-411: The lazy spaCy loader in the QuestionGenerator.nlp
property is not thread-safe and can double-load the model; fix it by adding a
lock (e.g., a threading.Lock stored as self._nlp_lock or a class-level lock) and
applying double-checked locking inside the nlp property: first check for
hasattr(self, '_nlp'), if missing acquire the lock, re-check hasattr(self,
'_nlp'), and only then call en_core_web_sm.load() to set self._nlp; ensure the
lock is initialized (e.g., in __init__ or as a class attribute) so the
module-level singleton used by backend/server.py will safely lazy-load the model
under concurrent Flask requests.
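The oversized-paragraph fix from the first comment can be sketched as a small helper operating on plain integer token lists. The function name and the 450/50 defaults are illustrative, not the actual code in `main.py`:

```python
def chunk_oversized(paragraph, max_tokens=450, overlap=50):
    """Split a token list longer than max_tokens into overlapping chunks.

    Each chunk holds at most max_tokens tokens, and consecutive chunks
    share `overlap` tokens so context survives the chunk boundary.
    """
    step = max_tokens - overlap
    return [paragraph[i:i + max_tokens] for i in range(0, len(paragraph), step)]
```

Chunks produced this way can then be merged into `current_segment` with the same overlap logic as regular paragraphs.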
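The second comment's fallback can be sketched over plain sentence strings rather than a spaCy `Doc`; the function name and the 4-word threshold mirror the review description, not the exact implementation:

```python
def split_text_with_fallback(all_sentences):
    """Filter out short fragments, but never return an empty list."""
    sentences = [s.strip() for s in all_sentences if len(s.split()) > 4]
    if not sentences:
        # Every sentence was <= 4 words: fall back to the originals so
        # downstream QG input preparation never receives [].
        sentences = [s.strip() for s in all_sentences]
    # Deduplicate while preserving the original ordering.
    return list(dict.fromkeys(sentences))
```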
…back, and oversized chunks
Addressed all three architectural edge cases in commit 8063a39. These changes ensure the backend is resilient to the root causes identified in #443 and #467.
Actionable comments posted: 2
♻️ Duplicate comments (4)
backend/Generator/main.py (4)
406-415: Thread-safe lazy initialization is correctly implemented. Double-checked locking with `_nlp_lock` properly prevents concurrent first-access from loading the model twice.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 406 - 415, No change required: the nlp property already uses thread-safe lazy initialization via double-checked locking with the _nlp_lock and the _nlp attribute; keep the existing implementation of the nlp property, _nlp_lock, and the _nlp attribute as-is.
406-415: Thread-safe lazy initialization is correctly implemented. Double-checked locking with `_nlp_lock` properly prevents concurrent first-access from loading the model twice.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 406 - 415, Thread-safe lazy initialization for the spaCy model is implemented correctly: the nlp property uses double-checked locking with the _nlp_lock and checks for _nlp before and inside the lock, so no code changes are required—leave the nlp property, _nlp_lock, _nlp attribute, and en_core_web_sm.load() usage as-is.
514-527: Fallback and deduplication correctly address the empty-sentence edge case. The `all_sentences → sentences → fallback` pattern is sound. Note that `self.nlp(text)` runs the full `en_core_web_sm` pipeline (tagger, parser, NER, etc.) even though only sentence boundary detection is needed. For high-throughput paths, consider a purpose-built lightweight pipeline (e.g., `spacy.blank("en")` + `sentencizer`) instead of reusing the shared full-featured pipeline.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 514 - 527, Optimize _split_text by avoiding the full en_core_web_sm pipeline: replace using self.nlp(text) for sentence splitting with a lightweight sentencizer (e.g., spacy.blank("en") + Sentencizer) to only run boundary detection; instantiate and cache this sentencizer once (e.g., self.sentencizer) during class initialization and call self.sentencizer(text) inside _split_text, preserving the same filtering, fallback, and deduplication logic.
514-527: Fallback and deduplication correctly address the empty-sentence edge case. The `all_sentences → sentences → fallback` pattern is sound. Note that `self.nlp(text)` runs the full `en_core_web_sm` pipeline (tagger, parser, NER, etc.) even though only the sentence boundary detection (`senter`) component is needed. For high-throughput paths, consider a purpose-built lightweight pipeline (e.g., `spacy.blank("en")` + just the `sentencizer` or `senter` component) instead of reusing the shared full-featured pipeline.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 514 - 527, The current _split_text uses self.nlp (the full en_core_web_sm pipeline) which is heavy for high-throughput sentence splitting; change it to use a lightweight sentencizer-only pipeline (e.g., create and cache an nlp_sent = spacy.blank("en") and add the "sentencizer" or "senter" component) and call that pipeline inside _split_text instead of self.nlp (or fall back to self.nlp if no sentencizer is available). Update or add a lightweight pipeline initializer (cached on the Generator instance) and reference it from _split_text so sentence boundary detection uses only the minimal sentencizer component.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/Generator/main.py`:
- Around line 542-549: The loop handling tokenized_paragraphs drops any
accumulated current_segment when encountering a paragraph longer than
MAX_TOKENS; before splitting the oversized paragraph into chunks, if
current_segment is non-empty append it to segments (and clear it) so previously
accumulated tokens are not lost, or alternatively prepend current_segment to the
first chunk of the oversized paragraph before adding chunks to segments; update
the block around the len(paragraph) > MAX_TOKENS check in main.py to ensure
segments and current_segment are handled accordingly (symbols:
tokenized_paragraphs, MAX_TOKENS, OVERLAP_SIZE, segments, current_segment).
- Around line 542-549: When encountering an oversized paragraph in the loop over
tokenized_paragraphs (the branch where len(paragraph) > MAX_TOKENS), don't
discard the already-accumulated current_segment; flush it to segments before
resetting. Specifically, in the oversized-paragraph branch that currently sets
current_segment = [], first check if current_segment is non-empty and append it
to segments (or otherwise finalize it) so preceding tokens are not lost, then
proceed to chunk the long paragraph into segments using MAX_TOKENS and
OVERLAP_SIZE and finally reset current_segment.
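Putting both points together, the segment loop might look like this sketch. Names mirror the symbols in the comments; the 450/50 defaults are assumptions based on the `SEQ_LENGTH` discussion elsewhere in this thread, and the sketch treats any paragraph longer than `max_tokens - overlap` as oversized (slightly stricter than the `len(paragraph) > MAX_TOKENS` check described above) so the size invariant is easy to verify:

```python
def split_into_segments(tokenized_paragraphs, max_tokens=450, overlap=50):
    """Group token lists into overlapping segments, flushing before
    chunking any oversized paragraph so accumulated tokens are not lost."""
    step = max_tokens - overlap
    segments, current_segment = [], []
    for paragraph in tokenized_paragraphs:
        if len(paragraph) > step:
            # Flush what we have accumulated so it is not silently dropped.
            if current_segment:
                segments.append(current_segment)
                current_segment = []
            # Slide a max_tokens window over the long paragraph.
            for i in range(0, len(paragraph), step):
                segments.append(paragraph[i:i + max_tokens])
            continue
        if current_segment and len(current_segment) + len(paragraph) > step:
            segments.append(current_segment)
            # Keep the tail as overlap to bridge context into the next segment.
            current_segment = current_segment[-overlap:]
        current_segment = current_segment + paragraph
    if current_segment:
        segments.append(current_segment)
    return segments
```

With these rules every emitted segment stays within `max_tokens`, and tokens accumulated before an oversized paragraph are flushed rather than overwritten.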
---
Duplicate comments:
In `@backend/Generator/main.py`:
- Around line 406-415: No change required: the nlp property already uses
thread-safe lazy initialization via double-checked locking with the _nlp_lock
and the _nlp attribute; keep the existing implementation of the nlp property,
_nlp_lock, and the _nlp attribute as-is.
- Around line 406-415: Thread-safe lazy initialization for the spaCy model is
implemented correctly: the nlp property uses double-checked locking with the
_nlp_lock and checks for _nlp before and inside the lock, so no code changes are
required—leave the nlp property, _nlp_lock, _nlp attribute, and
en_core_web_sm.load() usage as-is.
- Around line 514-527: Optimize _split_text by avoiding the full en_core_web_sm
pipeline: replace using self.nlp(text) for sentence splitting with a lightweight
sentencizer (e.g., spacy.blank("en") + Sentencizer) to only run boundary
detection; instantiate and cache this sentencizer once (e.g., self.sentencizer)
during class initialization and call self.sentencizer(text) inside _split_text,
preserving the same filtering, fallback, and deduplication logic.
- Around line 514-527: The current _split_text uses self.nlp (the full
en_core_web_sm pipeline) which is heavy for high-throughput sentence splitting;
change it to use a lightweight sentencizer-only pipeline (e.g., create and cache
an nlp_sent = spacy.blank("en") and add the "sentencizer" or "senter" component)
and call that pipeline inside _split_text instead of self.nlp (or fall back to
self.nlp if no sentencizer is available). Update or add a lightweight pipeline
initializer (cached on the Generator instance) and reference it from _split_text
so sentence boundary detection uses only the minimal sentencizer component.
…e oversized chunks
Yes, good catch on the silent data-loss edge case. I have added logic to flush current_segment to the segments array before chunking the oversized paragraph. (Regarding the sentencizer optimization: leaving it as self.nlp for now to ensure maximum stability with the existing NER/QA pipeline, but keeping it in mind for future perf tickets.) Fix pushed in e6c417f.
Actionable comments posted: 1
♻️ Duplicate comments (3)
backend/Generator/main.py (3)
514-527: `_split_text` fallback is correctly guarded. The two-stage filter (word-count threshold → fall back to all sentences → deduplicate) covers the empty-output scenario flagged in prior review iterations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 514 - 527, The _split_text function correctly guards against empty output by filtering sentences by word count then falling back to all_sentences and deduplicating; no code change needed—leave the implementation of _split_text (the doc = self.nlp(text) → all_sentences → sentences fallback → dedupe via dict.fromkeys) as-is.
406-415: Thread-safe lazy loading is correctly implemented. The double-checked locking pattern with `_nlp_lock` correctly prevents concurrent model loads under Flask's multi-threaded request handling.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 406 - 415, No change required: the double-checked locking pattern around the nlp property is correct and thread-safe; leave the nlp property and the _nlp_lock initialization as implemented (refer to the nlp property and _nlp_lock attribute) and do not modify this lazy-loading logic.
542-553: Oversized-paragraph flush is correctly applied. `current_segment` is now drained to `segments` before chunking the oversized paragraph, preventing the silent data loss that was flagged in prior review iterations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 542 - 553, The loop handling oversized paragraphs needs to flush any accumulated tokens before chunking so prior data isn't lost; in the for-loop over tokenized_paragraphs, when len(paragraph) > MAX_TOKENS append the current_segment to segments (using segments.append(current_segment)) and then reset current_segment = [] before creating chunks using step = MAX_TOKENS - OVERLAP_SIZE and slicing paragraph[i:i + MAX_TOKENS]; ensure you keep the variable names current_segment, segments, MAX_TOKENS, OVERLAP_SIZE and preserve types so subsequent logic that relies on current_segment being an empty list continues to work.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/Generator/main.py`:
- Around line 555-560: The sliding-window flush condition can produce segments
larger than MAX_TOKENS after retaining the overlap; change the flush check in
the block that manipulates current_segment and paragraph so it uses MAX_TOKENS -
OVERLAP_SIZE (not MAX_TOKENS) as the threshold, i.e. flush when
len(current_segment) + len(paragraph) > MAX_TOKENS - OVERLAP_SIZE so that after
trimming to the last OVERLAP_SIZE and extending with paragraph the resulting
current_segment will always be ≤ MAX_TOKENS; verify this change addresses
overflow risks when feeding segments into _encode_qg_input (SEQ_LENGTH) and keep
references to current_segment, paragraph, MAX_TOKENS, OVERLAP_SIZE, and
_encode_qg_input for locating the code.
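A back-of-envelope check of the bound that comment is after, under the assumption that any paragraph merged without a flush carries at most `MAX_TOKENS - OVERLAP_SIZE` tokens (the 450/50/512 values come from the author's reply later in this thread, not from code shown here):

```python
MAX_TOKENS = 450
OVERLAP_SIZE = 50
SEQ_LENGTH = 512  # T5 encoder limit referenced via _encode_qg_input

# Corrected flush threshold: flush when the combined length exceeds this.
threshold = MAX_TOKENS - OVERLAP_SIZE  # 400

# Worst case right after a flush: the retained overlap tail plus a
# paragraph that just fits under the threshold.
worst_case_segment = OVERLAP_SIZE + threshold
assert worst_case_segment == MAX_TOKENS       # never exceeds 450
assert worst_case_segment <= SEQ_LENGTH       # safely under 512
```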
---
Duplicate comments:
In `@backend/Generator/main.py`:
- Around line 514-527: The _split_text function correctly guards against empty
output by filtering sentences by word count then falling back to all_sentences
and deduplicating; no code change needed—leave the implementation of _split_text
(the doc = self.nlp(text) → all_sentences → sentences fallback → dedupe via
dict.fromkeys) as-is.
- Around line 406-415: No change required: the double-checked locking pattern
around the nlp property is correct and thread-safe; leave the nlp property and
the _nlp_lock initialization as implemented (refer to the nlp property and
_nlp_lock attribute) and do not modify this lazy-loading logic.
- Around line 542-553: The loop handling oversized paragraphs needs to flush any
accumulated tokens before chunking so prior data isn't lost; in the for-loop
over tokenized_paragraphs, when len(paragraph) > MAX_TOKENS append the
current_segment to segments (using segments.append(current_segment)) and then
reset current_segment = [] before creating chunks using step = MAX_TOKENS -
OVERLAP_SIZE and slicing paragraph[i:i + MAX_TOKENS]; ensure you keep the
variable names current_segment, segments, MAX_TOKENS, OVERLAP_SIZE and preserve
types so subsequent logic that relies on current_segment being an empty list
continues to work.
🧹 Nitpick comments (1)
backend/Generator/main.py (1)
530-533: Consider promoting chunking constants to class-level configuration. Keeping `MAX_TOKENS`/`OVERLAP_SIZE` as class attributes (or constructor params) would make tuning and tests easier without changing behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 530 - 533, Move the hard-coded MAX_TOKENS and OVERLAP_SIZE out of the local method and into class-level configuration on the Generator class (or into the class constructor as parameters with defaults). Change the function that "Groups sentences into overlapping segments to preserve context for the T5 model" to reference self.MAX_TOKENS/self.OVERLAP_SIZE (or instance attributes set by __init__) instead of local constants, update the constructor signature to accept optional max_tokens and overlap_size with the current values as defaults, and ensure any tests or callers construct the class accordingly so tuning and tests can override these values without editing the method body.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/Generator/main.py`:
- Around line 530-533: Move the hard-coded MAX_TOKENS and OVERLAP_SIZE out of
the local method and into class-level configuration on the Generator class (or
into the class constructor as parameters with defaults). Change the function
that "Groups sentences into overlapping segments to preserve context for the T5
model" to reference self.MAX_TOKENS/self.OVERLAP_SIZE (or instance attributes
set by __init__) instead of local constants, update the constructor signature to
accept optional max_tokens and overlap_size with the current values as defaults,
and ensure any tests or callers construct the class accordingly so tuning and
tests can override these values without editing the method body.
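The suggested refactor might look like this sketch. The class shape, defaults, and the trivial body of `_split_into_segments` are illustrative only; the real `QuestionGenerator` constructor has more parameters and more involved merging logic:

```python
class QuestionGenerator:
    """Sketch: chunking constants promoted to constructor parameters."""

    def __init__(self, max_tokens=450, overlap_size=50):
        self.max_tokens = max_tokens
        self.overlap_size = overlap_size

    def _split_into_segments(self, tokenized_paragraphs):
        # References self.max_tokens / self.overlap_size instead of local
        # constants, so tests can construct a tiny generator for fast cases
        # without editing this method body.
        step = self.max_tokens - self.overlap_size
        return [p[i:i + self.max_tokens]
                for p in tokenized_paragraphs
                for i in range(0, len(p), step)]
```

A test can then exercise the windowing with small numbers, e.g. `QuestionGenerator(max_tokens=8, overlap_size=2)`, rather than building 450-token fixtures.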
I adjusted the flush threshold to MAX_TOKENS - OVERLAP_SIZE to strictly guarantee that no segment ever exceeds 450 tokens after the overlap is retained, keeping it safely under the 512 SEQ_LENGTH. Fix pushed.
Since this PR is focused strictly on resolving the immediate crash/empty generation bugs, I'll leave these scoped locally for now to keep the logic encapsulated and minimize the blast radius of changes. We can definitely extract these to a global config in a future refactor ticket.
Addressed Issues:
Resolves the backend root cause of the "empty generation" crashes reported in the frontend:
Fixes #443
Fixes #467
Screenshots/Recordings:
Note on Local Testing (per pinned guidelines): I am relying on the automated GitHub Actions CI/CD pipeline to verify the 200 OK responses for this specific PR. I was unable to attach a local Postman screenshot because fresh Windows environments (Python 3.14) are currently failing to build the `scipy` and `torch` wheels due to the outdated version pins in `requirements.txt` (tracked in Issue #413). (Note to maintainers: my pending PR #430 resolves these exact Windows dependency crashes. Once #430 is merged, I can easily provide local curl/Postman logs for all future backend work!)
Additional Notes:
This PR addresses the "insufficient/zero questions generated" issue frequently encountered when users upload PDFs or long documents. The root cause was traced to how the text was pre-processed before hitting the T5 transformer.
The Problems:
- The `_split_text` function relied on `re.findall(".*?[.!\?]", text)`. This caused false splits on abbreviations (e.g., "U.S.A.", "Dr."), sending unusable, fragmented context to the model.
- The `_split_into_segments` function enforced a hard token limit. If a core concept spanned across the chunk boundary, the T5 model lost the context required to formulate a valid question.

The Solutions:
- Replaced the regex splitter with spaCy's robust sentence boundary detection. It now intelligently handles abbreviations and filters out contextless fragments.
- Added a 50-token overlap between text chunks. This acts as a "context bridge," ensuring the AI never loses the semantic thread between paragraphs.
- Refactored `QuestionGenerator` to lazy-load the `en_core_web_sm` model via an `@property` to prevent OOM spikes.

Checklist
AI Usage Disclosure
Check one of the checkboxes below:
I have used the following AI models and tools: Gemini 3.1 Pro (Used strictly as a pair-programming assistant to safely translate the regex text-splitting logic into the optimized spaCy sentencizer and to structure the sliding-window overlap matrix. All logic was reviewed and manually integrated).
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.