Added pre-processing for implementing RAG for Louis:Chatbot #1118
base: master
Conversation
Pull request overview
This PR introduces XML parsing and chunking infrastructure for processing SICP (Structure and Interpretation of Computer Programs) textbook content for RAG (Retrieval-Augmented Generation) ingestion. The implementation includes structure-aware parsing that traverses chapter/section/subsection hierarchies, extracts text and code segments, and produces token-bounded chunks suitable for embedding.
Changes:
- Added parse_sicp.py for recursive XML parsing of chapter structures
- Added chunking.py with sophisticated text/code segmentation and token-aware chunking
- Generated chapter3_chunks.json as sample output from the parsing pipeline
Reviewed changes
Copilot reviewed 2 out of 8 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| parser/parse_sicp.py | Recursively parses SICP XML files, extracting text blocks while following section/subsection references |
| parser/chunking.py | Comprehensive chunking system with XML parsing, text normalization, sentence tokenization, and token-bounded chunk generation |
| parser/chapter3_chunks.json | Generated JSON output containing 2378 parsed chunks from Chapter 3 content |
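For orientation, here is a minimal sketch of how the two modules might be wired together. The `__main__` driver, import path, and file paths below are assumptions for illustration, not code from this PR:

```python
# Hypothetical driver: parse a chapter tree into paragraph records and dump
# them to JSON. Assumes parse_sicp.py exposes parse_file (reviewed below);
# the input path and output filename are illustrative.
import json

from parse_sicp import parse_file

if __name__ == "__main__":
    records = parse_file("chapter3/chapter3.xml")
    with open("chapter3_chunks.json", "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    print(f"Wrote {len(records)} records")
```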
parser/parse_sicp.py (context):

```python
        section_folder = os.path.join(os.path.dirname(file_path), f"section{ref.split('.')[0]}")
        section_file = os.path.join(section_folder, f"section{ref.split('.')[0]}.xml")
        print(f"{indent}➡️ Going into section file: {section_file}")
        chunks.extend(parse_file(section_file, parent_title=title, depth=depth + 1))

    # Recurse into subsections
    for ref in subsection_refs:
        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref.split('.')[0]}.xml")
```
Copilot AI commented on Jan 12, 2026:
The section reference extraction logic uses split('.')[0] which only takes the first part before a dot. For references like "§ion1.2.3;", this would incorrectly extract only "1" instead of the full identifier "1.2.3". This will cause the wrong section files to be loaded.
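To make the failure mode concrete, a quick REPL-style illustration of what `split('.')[0]` does to a dotted reference:

```python
ref = "1.2.3"
print(f"section{ref.split('.')[0]}")  # section1      (wrong: drops ".2.3")
print(f"section{ref}")                # section1.2.3  (full identifier)
```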
Suggested change:

```diff
-        section_folder = os.path.join(os.path.dirname(file_path), f"section{ref.split('.')[0]}")
-        section_file = os.path.join(section_folder, f"section{ref.split('.')[0]}.xml")
+        section_folder = os.path.join(os.path.dirname(file_path), f"section{ref}")
+        section_file = os.path.join(section_folder, f"section{ref}.xml")
         print(f"{indent}➡️ Going into section file: {section_file}")
         chunks.extend(parse_file(section_file, parent_title=title, depth=depth + 1))

     # Recurse into subsections
     for ref in subsection_refs:
-        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref.split('.')[0]}.xml")
+        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref}.xml")
```
parser/parse_sicp.py (context):

```python
    # Recurse into subsections
    for ref in subsection_refs:
        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref.split('.')[0]}.xml")
```
Copilot AI commented on Jan 12, 2026:
The subsection reference extraction uses split('.')[0] which is incorrect. Subsection references typically include the full path (e.g., "1.2.3"), so this will extract only the first number and construct the wrong file path.
Suggested change:

```diff
-        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref.split('.')[0]}.xml")
+        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref}.xml")
```
parser/parse_sicp.py:

```python
def parse_file(file_path, parent_title=None, depth=0):
    """
    Recursively parse any XML file (chapter, section, or subsection).
    """
    indent = " " * depth  # for nice indentation in logs

    if not os.path.exists(file_path):
        print(f"{indent}⚠️ Missing file: {file_path}")
        return []

    print(f"{indent}📄 Parsing ({depth=}): {file_path}")

    # Parse and unescape
    try:
        tree = ET.parse(file_path)
        root = tree.getroot()
    except Exception as e:
        print(f"{indent}❌ XML parse error in {file_path}: {e}")
        return []

    xml_text = html.unescape(ET.tostring(root, encoding="unicode"))
    chunks = []

    # Identify tag type
    tag_type = root.tag.upper()
    if root.find("NAME") is not None:
        title = " ".join(root.find("NAME").itertext())
        title = re.sub(r"\s+", " ", title).strip()
    else:
        title = "Untitled"

    # Extract text paragraphs
    text_blocks = root.findall(".//TEXT")
    print(f"{indent}🧩 Found {len(text_blocks)} <TEXT> blocks in {os.path.basename(file_path)}")

    for i, t in enumerate(text_blocks, start=1):
        for bad_tag in ["INDEX", "LABEL", "CITATION", "FOOTNOTE", "COMMENT", "WEB_ONLY"]:
            for el in t.findall(f".//{bad_tag}"):
                el.clear()

        text_content = " ".join(t.itertext()).strip()
        text_content = re.sub(r"\s+", " ", text_content)

        if text_content:
            chunks.append({
                "source_file": os.path.basename(file_path),
                "tag_type": tag_type,
                "title": title,
                "parent_title": parent_title,
                "depth": depth,
                "paragraph_index": i,
                "content": text_content
            })

    # Look for section and subsection references
    section_refs = re.findall(r"&section([\d\.]+);", xml_text)
    subsection_refs = re.findall(r"&subsection([\d\.]+);", xml_text)

    if section_refs:
        print(f"{indent}🔍 Found {len(section_refs)} section ref(s): {section_refs}")
    if subsection_refs:
        print(f"{indent}   ↳ Found {len(subsection_refs)} subsection ref(s): {subsection_refs}")

    # Recurse into sections
    for ref in section_refs:
        section_folder = os.path.join(os.path.dirname(file_path), f"section{ref.split('.')[0]}")
        section_file = os.path.join(section_folder, f"section{ref.split('.')[0]}.xml")
        print(f"{indent}➡️ Going into section file: {section_file}")
        chunks.extend(parse_file(section_file, parent_title=title, depth=depth + 1))

    # Recurse into subsections
    for ref in subsection_refs:
        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref.split('.')[0]}.xml")
        print(f"{indent}➡️ Going into subsection file: {subsection_file}")
        chunks.extend(parse_file(subsection_file, parent_title=title, depth=depth + 1))

    print(f"{indent}✅ Done parsing {os.path.basename(file_path)}, total chunks so far: {len(chunks)}\n")
    return chunks
```
Copilot AI commented on Jan 12, 2026:
The code lacks any mechanism to prevent infinite recursion if there are circular references between XML files. While the chunking.py has a visited set to track processed files, parse_sicp.py does not have this protection, which could lead to stack overflow errors.
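A minimal sketch of the kind of guard that could be added, assuming the same visited-set idea chunking.py already uses; the parameter threading shown here is an illustration, not the PR's code:

```python
import os

def parse_file(file_path, parent_title=None, depth=0, visited=None):
    """Cycle-guarded variant: skip any file we have already parsed."""
    if visited is None:
        visited = set()
    real_path = os.path.realpath(file_path)
    if real_path in visited:
        print(f"↩️ Skipping already-visited file: {file_path}")
        return []
    visited.add(real_path)

    chunks = []
    # ... existing extraction logic unchanged ...
    # Every recursive call must thread the same set through, e.g.:
    #   chunks.extend(parse_file(section_file, parent_title=title,
    #                            depth=depth + 1, visited=visited))
    return chunks
```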
parser/chunking.py (context):

```python
        else:
            buffer.append(text)
            types.append(ttype)
            current_tokens = current_tokens + unit_tokens
```
Copilot AI commented on Jan 12, 2026:
The accumulation logic has two issues. Line 468 uses current_tokens = current_tokens + unit_tokens, which should be += for consistency. More importantly, when overflow occurs and flush() is called, current_tokens is reset to 0, but the unit that caused the overflow can be lost because it is only appended once the else branch is reached, potentially dropping content.
Suggested change:

```diff
-        else:
-            buffer.append(text)
-            types.append(ttype)
-            current_tokens = current_tokens + unit_tokens
+        buffer.append(text)
+        types.append(ttype)
+        current_tokens += unit_tokens
```
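One way to satisfy the review without losing the overflowing unit is to flush first and then unconditionally append, as in this sketch (the function and helper names are hypothetical, not chunking.py's actual API):

```python
def pack_units(units, max_tokens, count_tokens):
    """Greedy token-bounded packing that never drops the overflow-triggering unit."""
    chunks, buffer, current_tokens = [], [], 0

    def flush():
        nonlocal buffer, current_tokens
        if buffer:
            chunks.append(" ".join(buffer))
            buffer, current_tokens = [], 0

    for text in units:
        unit_tokens = count_tokens(text)
        if buffer and current_tokens + unit_tokens > max_tokens:
            flush()          # close the full chunk first...
        buffer.append(text)  # ...then start the next one with this unit
        current_tokens += unit_tokens

    flush()  # emit the trailing partial chunk
    return chunks

# e.g. pack_units(sentences, 350, lambda s: len(s.split()))
```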
| "parent_title": null, | ||
| "depth": 0, | ||
| "paragraph_index": 2, | ||
| "content": "One powerful design strategy, which is particularly appropriate to the construction of programs for" |
Copilot AI commented on Jan 12, 2026:
Line 18 contains truncated content: "One powerful design strategy, which is particularly appropriate to the construction of programs for" - this sentence is incomplete. This indicates a parsing issue where text extraction is cutting off mid-sentence.
| "content": "One powerful design strategy, which is particularly appropriate to the construction of programs for" | |
| "content": "One powerful design strategy, which is particularly appropriate to the construction of programs for modeling physical systems, is to base the structure of our program on the structure of the system being modeled." |
parser/chunking.py (context):

```python
nltk.download("punkt", quiet=True)
try:
    nltk.download("punkt_tab", quiet=True)
except Exception:
```
Copilot AI commented on Jan 12, 2026:
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
 except Exception:
+    # 'punkt_tab' is an optional resource; ignore failures so we can still use 'punkt'.
```
This PR adds a structure-aware XML parsing and meso-level chunking pipeline for the SICP-style corpus. It walks chapters → sections → subsections, resolves entity-like references, and propagates document context into every extracted unit. Non-content tags (indexing, Scheme-only, editorial) are pruned, and text and code are segmented and normalized. Finally, content is regrouped by location and packed into token-bounded, markdown-friendly chunks suitable for RAG ingestion. Below is a detailed report:
Report for Parsing+Chunking.pdf