
fix(core): implement spaCy sentencization and sliding window chunking#473

Open
SxBxcoder wants to merge 4 commits into AOSSIE-Org:main from SxBxcoder:feat/sliding-window-segmentation

Conversation

@SxBxcoder
Contributor

@SxBxcoder SxBxcoder commented Feb 24, 2026

Addressed Issues:

Resolves the backend root cause of the "empty generation" crashes reported in the frontend:
Fixes #443
Fixes #467

Screenshots/Recordings:

Note on Local Testing (per pinned guidelines): I am relying on the automated GitHub Actions CI/CD pipeline to verify the 200 OK responses for this specific PR. I was unable to attach a local Postman screenshot because fresh Windows environments (Python 3.14) are currently failing to build the scipy and torch wheels due to the outdated version pins in requirements.txt (Tracked in Issue #413).

(Note to maintainers: My pending PR #430 resolves these exact Windows dependency crashes. Once #430 is merged, I can easily provide local curl/Postman logs for all future backend work!)

Additional Notes:

This PR addresses the "insufficient/zero questions generated" issue frequently encountered when users upload PDFs or long documents. The root cause was traced to how the text was pre-processed before hitting the T5 transformer.

The Problems:

  1. Naive Regex Segmentation: The _split_text function relied on re.findall(".*?[.!\?]", text). This caused false splits on abbreviations (e.g., "U.S.A.", "Dr."), sending unusable, fragmented context to the model.
  2. Context Loss at Chunk Seams: The _split_into_segments function enforced a hard token limit. If a core concept spanned across the chunk boundary, the T5 model lost the context required to formulate a valid question.
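The first failure mode is easy to reproduce in isolation (the pattern is taken from the description above; the sample text is illustrative):

```python
import re

text = "Dr. Smith lives in the U.S.A. He teaches biology."

# The old pattern treats every '.', '!' or '?' as a sentence boundary,
# so abbreviations shatter the text into unusable fragments.
fragments = [m.strip() for m in re.findall(r".*?[.!?]", text)]
print(fragments)
# Yields pieces like "Dr.", "S." and "A." instead of whole sentences.
```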

The Solutions:

  • Linguistic Sentencization: Replaced the regex approach with spaCy's robust sentence boundary detection. It now intelligently handles abbreviations and filters out contextless fragments.
  • Sliding Window Chunking: Implemented a 50-token overlap between text chunks. This acts as a "context bridge," ensuring the AI never loses the semantic thread between paragraphs.
  • Memory Optimization: Refactored QuestionGenerator to lazy-load the en_core_web_sm model via an @property to prevent OOM spikes.
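The sliding-window idea in the second bullet can be sketched in isolation (a simplified illustration with toy limits; the real code operates on T5 tokenizer IDs with a 50-token overlap):

```python
def sliding_window_chunks(tokens, max_tokens=10, overlap=3):
    """Yield chunks of at most max_tokens, each repeating the last
    `overlap` tokens of the previous chunk as a context bridge."""
    step = max_tokens - overlap
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), step)]

chunks = sliding_window_chunks(list(range(25)))
# Adjacent chunks share their boundary tokens, so a concept that
# straddles a seam is fully visible in at least one chunk.
```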

Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions
  • If applicable, I have made corresponding changes or additions to the documentation
  • If applicable, I have made corresponding changes or additions to tests
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contribution Guidelines
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.

AI Usage Disclosure

Check one of the checkboxes below:

  • This PR does not contain AI-generated code at all.
  • This PR contains AI-generated code. I have tested the code locally and I am responsible for it.

I have used the following AI models and tools: Gemini 3.1 Pro (Used strictly as a pair-programming assistant to safely translate the regex text-splitting logic into the optimized spaCy sentencizer and to structure the sliding-window overlap matrix. All logic was reviewed and manually integrated).

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • Refactor
    • NLP model now initializes lazily and thread-safely, reducing startup time and memory usage.
    • Text processing unified to use a shared pipeline, improving consistency of question generation and multiple-choice prep.
    • Sentence and long-text segmentation revamped with sliding-window overlap, shorter-fragment filtering, and de-duplication for more accurate, stable outputs.

@coderabbitai

coderabbitai bot commented Feb 24, 2026

📝 Walkthrough

Walkthrough

Adds a thread-safe, lazily initialized nlp property to QuestionGenerator that caches a spaCy en_core_web_sm pipeline; replaces per-call spaCy loads with self.nlp/self.nlp.pipe; refactors _split_text to use spaCy with filtering/deduplication and sliding-window overlapping segmentation for long texts.

Changes

Cohort / File(s) Summary
NLP Pipeline & Segmentation
backend/Generator/main.py
Add a cached, thread-safe @property nlp that lazy-loads en_core_web_sm using a _nlp_lock. Replace ad-hoc spaCy loads with self.nlp and self.nlp.pipe. Refactor _split_text to use spaCy sentence/segmenting with short-fragment filtering and order-preserving deduplication. Implement sliding-window overlapping segmentation for long texts (updated MAX_TOKENS / OVERLAP_SIZE) and adjust MC input preparation to use the shared pipeline.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I waited in a threaded nook, then gently woke the nlp book,
Lazy hops through sentences and seams,
Sliding windows chase their dreams,
Cached and calm, the pipeline springs,
Questions flourish on quick rabbit wings.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: implementing spaCy-based sentencization and sliding window chunking for text segmentation.
Linked Issues check ✅ Passed The PR addresses root backend causes for both issues by improving sentence segmentation, preserving cross-chunk context, and adding fallback logic to prevent empty generation.
Out of Scope Changes check ✅ Passed All changes focus on the QuestionGenerator class and text segmentation logic directly related to fixing the identified backend root causes.
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/main.py`:
- Around line 520-544: The _split_into_segments function currently can append
empty or oversized segments when a single tokenized paragraph exceeds
MAX_TOKENS; fix it by detecting when len(paragraph) > MAX_TOKENS and chunking
that paragraph into sublists of at most MAX_TOKENS (respecting OVERLAP_SIZE
between chunks), ensure you check that current_segment is non-empty before
appending to segments, and after chunking continue merging remaining chunks with
current_segment using the same overlap logic; reference symbols:
_split_into_segments, MAX_TOKENS, OVERLAP_SIZE, tokenized_paragraphs,
current_segment, segments, and self.qg_tokenizer.decode to produce final decoded
segments.
- Around line 510-518: The _split_text function currently filters out sentences
with ≤4 words, which can return an empty list and break downstream functions
like _prepare_qg_inputs and _prepare_qg_inputs_MC; modify _split_text so that
after building the filtered list (sentences = [ ... if len(...) > 4]), if that
filtered list is empty then fall back to returning the original sentences list
(deduplicated) from doc.sents (i.e., preserve the original short sentences
rather than returning []). Keep deduplication behavior and use the same ordering
and strip logic so callers of _prepare_qg_inputs/_prepare_qg_inputs_MC receive
non-empty inputs.
- Around line 405-411: The lazy spaCy loader in the QuestionGenerator.nlp
property is not thread-safe and can double-load the model; fix it by adding a
lock (e.g., a threading.Lock stored as self._nlp_lock or a class-level lock) and
applying double-checked locking inside the nlp property: first check for
hasattr(self, '_nlp'), if missing acquire the lock, re-check hasattr(self,
'_nlp'), and only then call en_core_web_sm.load() to set self._nlp; ensure the
lock is initialized (e.g., in __init__ or as a class attribute) so the
module-level singleton used by backend/server.py will safely lazy-load the model
under concurrent Flask requests.
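The double-checked locking the last comment asks for looks roughly like this (a hedged sketch of the suggested fix, not the PR's exact code; `_load_model` is a stand-in for `en_core_web_sm.load()`):

```python
import threading

def _load_model():
    # Stand-in for en_core_web_sm.load(): any expensive one-time init.
    return object()

class QuestionGenerator:
    _nlp_lock = threading.Lock()  # class-level lock shared by all instances

    @property
    def nlp(self):
        # Fast path: once loaded, later accesses never touch the lock.
        if not hasattr(self, "_nlp"):
            with self._nlp_lock:
                # Re-check inside the lock so two racing threads
                # cannot both trigger the expensive load.
                if not hasattr(self, "_nlp"):
                    self._nlp = _load_model()
        return self._nlp
```

Every access after the first returns the same cached object, which is what makes the module-level singleton safe under concurrent Flask requests.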

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc3bf1a and 8007e48.

📒 Files selected for processing (1)
  • backend/Generator/main.py

@SxBxcoder
Contributor Author

Addressed all three architectural edge cases in commit 8063a39:

  1. Thread Safety: Implemented a threading.Lock and double-checked locking in the nlp property to prevent race conditions during model initialization.

  2. Empty Input Fallback: Added a fallback in _split_text to preserve original sentences if filtering removes all context, preventing "zero questions generated" crashes.

  3. Oversized Paragraph Chunking: Implemented an intra-paragraph sliding window in _split_into_segments to safely handle massive blocks of text that exceed the T5 token limit.

These changes ensure the backend is resilient to the root causes identified in #443 and #467.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (4)
backend/Generator/main.py (4)

406-415: Thread-safe lazy initialization is correctly implemented.

Double-checked locking with _nlp_lock properly prevents concurrent first-access from loading the model twice.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 406 - 415, No change required: the
nlp property already uses thread-safe lazy initialization via double-checked
locking with the _nlp_lock and the _nlp attribute; keep the existing
implementation of the nlp property, _nlp_lock, and the _nlp attribute as-is.


514-527: Fallback and deduplication correctly address the empty-sentence edge case.

The all_sentences → sentences → fallback pattern is sound. Note that self.nlp(text) runs the full en_core_web_sm pipeline (tagger, parser, NER, etc.) even though only sentence boundary detection is needed. For high-throughput paths, consider a purpose-built lightweight pipeline (e.g., spacy.blank("en") + sentencizer) instead of reusing the shared full-featured pipeline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 514 - 527, Optimize _split_text by
avoiding the full en_core_web_sm pipeline: replace using self.nlp(text) for
sentence splitting with a lightweight sentencizer (e.g., spacy.blank("en") +
Sentencizer) to only run boundary detection; instantiate and cache this
sentencizer once (e.g., self.sentencizer) during class initialization and call
self.sentencizer(text) inside _split_text, preserving the same filtering,
fallback, and deduplication logic.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/main.py`:
- Around line 542-549: The loop handling tokenized_paragraphs drops any
accumulated current_segment when encountering a paragraph longer than
MAX_TOKENS; before splitting the oversized paragraph into chunks, if
current_segment is non-empty append it to segments (and clear it) so previously
accumulated tokens are not lost, or alternatively prepend current_segment to the
first chunk of the oversized paragraph before adding chunks to segments; update
the block around the len(paragraph) > MAX_TOKENS check in main.py to ensure
segments and current_segment are handled accordingly (symbols:
tokenized_paragraphs, MAX_TOKENS, OVERLAP_SIZE, segments, current_segment).
- Around line 542-549: When encountering an oversized paragraph in the loop over
tokenized_paragraphs (the branch where len(paragraph) > MAX_TOKENS), don't
discard the already-accumulated current_segment; flush it to segments before
resetting. Specifically, in the oversized-paragraph branch that currently sets
current_segment = [], first check if current_segment is non-empty and append it
to segments (or otherwise finalize it) so preceding tokens are not lost, then
proceed to chunk the long paragraph into segments using MAX_TOKENS and
OVERLAP_SIZE and finally reset current_segment.
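The fix these comments describe can be sketched end-to-end (an illustrative reimplementation with toy limits, not the file's exact code; the accumulation path for small paragraphs is simplified, and variable names follow the review comment):

```python
MAX_TOKENS, OVERLAP_SIZE = 10, 3

def split_into_segments(tokenized_paragraphs):
    segments, current_segment = [], []
    for paragraph in tokenized_paragraphs:
        if len(paragraph) > MAX_TOKENS:
            # Flush first so previously accumulated tokens are not lost.
            if current_segment:
                segments.append(current_segment)
                current_segment = []
            # Chunk the oversized paragraph with a sliding-window overlap.
            step = MAX_TOKENS - OVERLAP_SIZE
            segments.extend(paragraph[i:i + MAX_TOKENS]
                            for i in range(0, len(paragraph), step))
        else:
            current_segment = current_segment + paragraph
    if current_segment:
        segments.append(current_segment)
    return segments

segs = split_into_segments([[1, 2, 3], list(range(25)), [4, 5]])
# The [1, 2, 3] prefix survives as its own segment instead of vanishing.
```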


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8007e48 and 8063a39.

📒 Files selected for processing (1)
  • backend/Generator/main.py

@SxBxcoder
Contributor Author

Yes, good catch on the silent data-loss edge case. I've added logic to flush current_segment to the segments array before chunking the oversized paragraph. (Regarding the sentencizer optimization: I'm leaving it as self.nlp for now to ensure maximum stability with the existing NER/QA pipeline, but will keep it in mind for future perf tickets.) Fix pushed in e6c417f.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
backend/Generator/main.py (3)

514-527: _split_text fallback is correctly guarded.

The two-stage filter (word-count threshold → fall back to all sentences → deduplicate) covers the empty-output scenario flagged in prior review iterations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 514 - 527, The _split_text function
correctly guards against empty output by filtering sentences by word count then
falling back to all_sentences and deduplicating; no code change needed—leave the
implementation of _split_text (the doc = self.nlp(text) → all_sentences →
sentences fallback → dedupe via dict.fromkeys) as-is.

406-415: Thread-safe lazy loading is correctly implemented.

The double-checked locking pattern with _nlp_lock correctly prevents concurrent model loads under Flask's multi-threaded request handling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 406 - 415, No change required: the
double-checked locking pattern around the nlp property is correct and
thread-safe; leave the nlp property and the _nlp_lock initialization as
implemented (refer to the nlp property and _nlp_lock attribute) and do not
modify this lazy-loading logic.

542-553: Oversized-paragraph flush is correctly applied.

current_segment is now drained to segments before chunking the oversized paragraph, preventing the silent data-loss that was flagged in prior review iterations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 542 - 553, The loop handling
oversized paragraphs needs to flush any accumulated tokens before chunking so
prior data isn't lost; in the for-loop over tokenized_paragraphs, when
len(paragraph) > MAX_TOKENS append the current_segment to segments (using
segments.append(current_segment)) and then reset current_segment = [] before
creating chunks using step = MAX_TOKENS - OVERLAP_SIZE and slicing paragraph[i:i
+ MAX_TOKENS]; ensure you keep the variable names current_segment, segments,
MAX_TOKENS, OVERLAP_SIZE and preserve types so subsequent logic that relies on
current_segment being an empty list continues to work.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/main.py`:
- Around line 555-560: The sliding-window flush condition can produce segments
larger than MAX_TOKENS after retaining the overlap; change the flush check in
the block that manipulates current_segment and paragraph so it uses MAX_TOKENS -
OVERLAP_SIZE (not MAX_TOKENS) as the threshold, i.e. flush when
len(current_segment) + len(paragraph) > MAX_TOKENS - OVERLAP_SIZE so that after
trimming to the last OVERLAP_SIZE and extending with paragraph the resulting
current_segment will always be ≤ MAX_TOKENS; verify this change addresses
overflow risks when feeding segments into _encode_qg_input (SEQ_LENGTH) and keep
references to current_segment, paragraph, MAX_TOKENS, OVERLAP_SIZE, and
_encode_qg_input for locating the code.
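The bound this comment is after follows from simple arithmetic (illustrative; 450 and 512 echo the values quoted later in the thread, and the derivation assumes no single appended paragraph exceeds MAX_TOKENS - OVERLAP_SIZE tokens):

```python
MAX_TOKENS, OVERLAP_SIZE, SEQ_LENGTH = 450, 50, 512

# After a flush, the segment retains only its last OVERLAP_SIZE tokens.
# With the corrected threshold, accumulation stops before the combined
# length passes MAX_TOKENS - OVERLAP_SIZE, so a flushed segment holds
# at most the retained overlap plus that many new tokens:
worst_case = OVERLAP_SIZE + (MAX_TOKENS - OVERLAP_SIZE)

assert worst_case == MAX_TOKENS   # hard cap per segment
assert worst_case <= SEQ_LENGTH   # safely inside the T5 input window
```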


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8063a39 and e6c417f.

📒 Files selected for processing (1)
  • backend/Generator/main.py


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
backend/Generator/main.py (1)

530-533: Consider promoting chunking constants to class-level configuration.

Keeping MAX_TOKENS/OVERLAP_SIZE as class attributes (or constructor params) would make tuning and tests easier without changing behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 530 - 533, Move the hard-coded
MAX_TOKENS and OVERLAP_SIZE out of the local method and into class-level
configuration on the Generator class (or into the class constructor as
parameters with defaults). Change the function that "Groups sentences into
overlapping segments to preserve context for the T5 model" to reference
self.MAX_TOKENS/self.OVERLAP_SIZE (or instance attributes set by __init__)
instead of local constants, update the constructor signature to accept optional
max_tokens and overlap_size with the current values as defaults, and ensure any
tests or callers construct the class accordingly so tuning and tests can
override these values without editing the method body.
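The nitpick amounts to a small constructor change, roughly (a hypothetical sketch; the parameter names follow the review comment, and the defaults echo the 450-token cap and 50-token overlap quoted elsewhere in the thread):

```python
class QuestionGenerator:
    # Defaults mirror the values currently hard-coded in the method;
    # a test can now override them without editing the method body.
    def __init__(self, max_tokens=450, overlap_size=50):
        self.max_tokens = max_tokens
        self.overlap_size = overlap_size

    def chunk(self, tokens):
        # Same sliding-window slicing, driven by instance configuration.
        step = self.max_tokens - self.overlap_size
        return [tokens[i:i + self.max_tokens]
                for i in range(0, len(tokens), step)]

# A test can shrink the window to exercise the overlap logic cheaply.
tiny = QuestionGenerator(max_tokens=8, overlap_size=2)
chunks = tiny.chunk(list(range(20)))
```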

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6c417f and 7967616.

📒 Files selected for processing (1)
  • backend/Generator/main.py

@SxBxcoder
Contributor Author

I adjusted the flush threshold to MAX_TOKENS - OVERLAP_SIZE to strictly guarantee that no segment ever exceeds 450 tokens after the overlap is retained, keeping it safely under the 512 SEQ_LENGTH. Fix pushed.

@SxBxcoder
Copy link
Contributor Author

Since this PR is focused strictly on resolving the immediate crash/empty generation bugs, I'll leave these scoped locally for now to keep the logic encapsulated and minimize the blast radius of changes. We can definitely extract these to a global config in a future refactor ticket.
