
Conversation

@Isha-Sovasaria

This PR adds a structure-aware XML parsing and meso-level chunking pipeline for the SICP-style corpus. It walks chapters → sections → subsections, resolves entity-like references, and propagates document context into every extracted unit. Non-content tags (indexing, Scheme-only, editorial) are pruned, and text and code are segmented and normalized. Finally, content is regrouped by location and packed into token-bounded, markdown-friendly chunks suitable for RAG ingestion. A detailed report is attached below:
Report for Parsing+Chunking.pdf
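
Each extracted unit carries its document context; the fields below mirror the chunks.append call in parse_sicp.py (the concrete values are illustrative, not taken from the generated output):

{
    "source_file": "chapter3.xml",   # basename of the XML file the text came from
    "tag_type": "CHAPTER",           # root tag: CHAPTER, SECTION, or SUBSECTION
    "title": "Modularity, Objects, and State",
    "parent_title": None,            # title of the referencing file, if any
    "depth": 0,                      # recursion depth in the chapter tree
    "paragraph_index": 2,            # 1-based index of the <TEXT> block
    "content": "One powerful design strategy, ..."
}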


Copilot AI left a comment


Pull request overview

This PR introduces XML parsing and chunking infrastructure for processing SICP (Structure and Interpretation of Computer Programs) textbook content for RAG (Retrieval-Augmented Generation) ingestion. The implementation includes structure-aware parsing that traverses chapter/section/subsection hierarchies, extracts text and code segments, and produces token-bounded chunks suitable for embedding.

Changes:

  • Added parse_sicp.py for recursive XML parsing of chapter structures
  • Added chunking.py with sophisticated text/code segmentation and token-aware chunking
  • Generated chapter3_chunks.json as sample output from the parsing pipeline

Reviewed changes

Copilot reviewed 2 out of 8 changed files in this pull request and generated 7 comments.

parser/parse_sicp.py: Recursively parses SICP XML files, extracting text blocks while following section/subsection references
parser/chunking.py: Comprehensive chunking system with XML parsing, text normalization, sentence tokenization, and token-bounded chunk generation
parser/chapter3_chunks.json: Generated JSON output containing 2378 parsed chunks from Chapter 3 content
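
A minimal usage sketch of the pipeline (the import path and chapter file location are assumptions; parse_file is the entry point shown in the review below):

import json
from parse_sicp import parse_file

# Walk chapter -> sections -> subsections and collect context-tagged text units.
chunks = parse_file("chapter3/chapter3.xml")
with open("chapter3_chunks.json", "w", encoding="utf-8") as f:
    json.dump(chunks, f, indent=2, ensure_ascii=False)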


Comment on lines +75 to +82
        section_folder = os.path.join(os.path.dirname(file_path), f"section{ref.split('.')[0]}")
        section_file = os.path.join(section_folder, f"section{ref.split('.')[0]}.xml")
        print(f"{indent}➡️ Going into section file: {section_file}")
        chunks.extend(parse_file(section_file, parent_title=title, depth=depth + 1))

    # Recurse into subsections
    for ref in subsection_refs:
        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref.split('.')[0]}.xml")

Copilot AI Jan 12, 2026


The section reference extraction logic uses split('.')[0] which only takes the first part before a dot. For references like "&section1.2.3;", this would incorrectly extract only "1" instead of the full identifier "1.2.3". This will cause the wrong section files to be loaded.

Suggested change
-        section_folder = os.path.join(os.path.dirname(file_path), f"section{ref.split('.')[0]}")
-        section_file = os.path.join(section_folder, f"section{ref.split('.')[0]}.xml")
-        print(f"{indent}➡️ Going into section file: {section_file}")
-        chunks.extend(parse_file(section_file, parent_title=title, depth=depth + 1))
-    # Recurse into subsections
-    for ref in subsection_refs:
-        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref.split('.')[0]}.xml")
+        section_folder = os.path.join(os.path.dirname(file_path), f"section{ref}")
+        section_file = os.path.join(section_folder, f"section{ref}.xml")
+        print(f"{indent}➡️ Going into section file: {section_file}")
+        chunks.extend(parse_file(section_file, parent_title=title, depth=depth + 1))
+    # Recurse into subsections
+    for ref in subsection_refs:
+        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref}.xml")
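
The failure mode is easy to reproduce in isolation (the ref value here is hypothetical):

ref = "1.2.3"
ref.split('.')[0]          # -> "1": only the first component survives
f"subsection{ref}.xml"     # -> "subsection1.2.3.xml": the full identifier is kept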


    # Recurse into subsections
    for ref in subsection_refs:
        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref.split('.')[0]}.xml")

Copilot AI Jan 12, 2026


The subsection reference extraction uses split('.')[0] which is incorrect. Subsection references typically include the full path (e.g., "1.2.3"), so this will extract only the first number and construct the wrong file path.

Suggested change
-        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref.split('.')[0]}.xml")
+        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref}.xml")

Comment on lines +10 to +87
def parse_file(file_path, parent_title=None, depth=0):
    """
    Recursively parse any XML file (chapter, section, or subsection).
    """
    indent = " " * depth  # for nice indentation in logs

    if not os.path.exists(file_path):
        print(f"{indent}⚠️ Missing file: {file_path}")
        return []

    print(f"{indent}📄 Parsing ({depth=}): {file_path}")

    # Parse and unescape
    try:
        tree = ET.parse(file_path)
        root = tree.getroot()
    except Exception as e:
        print(f"{indent}❌ XML parse error in {file_path}: {e}")
        return []

    xml_text = html.unescape(ET.tostring(root, encoding="unicode"))
    chunks = []

    # Identify tag type
    tag_type = root.tag.upper()
    if root.find("NAME") is not None:
        title = " ".join(root.find("NAME").itertext())
        title = re.sub(r"\s+", " ", title).strip()
    else:
        title = "Untitled"

    # Extract text paragraphs
    text_blocks = root.findall(".//TEXT")
    print(f"{indent}🧩 Found {len(text_blocks)} <TEXT> blocks in {os.path.basename(file_path)}")

    for i, t in enumerate(text_blocks, start=1):
        for bad_tag in ["INDEX", "LABEL", "CITATION", "FOOTNOTE", "COMMENT", "WEB_ONLY"]:
            for el in t.findall(f".//{bad_tag}"):
                el.clear()

        text_content = " ".join(t.itertext()).strip()
        text_content = re.sub(r"\s+", " ", text_content)

        if text_content:
            chunks.append({
                "source_file": os.path.basename(file_path),
                "tag_type": tag_type,
                "title": title,
                "parent_title": parent_title,
                "depth": depth,
                "paragraph_index": i,
                "content": text_content
            })

    # Look for section and subsection references
    section_refs = re.findall(r"&section([\d\.]+);", xml_text)
    subsection_refs = re.findall(r"&subsection([\d\.]+);", xml_text)

    if section_refs:
        print(f"{indent}🔍 Found {len(section_refs)} section ref(s): {section_refs}")
    if subsection_refs:
        print(f"{indent} ↳ Found {len(subsection_refs)} subsection ref(s): {subsection_refs}")

    # Recurse into sections
    for ref in section_refs:
        section_folder = os.path.join(os.path.dirname(file_path), f"section{ref.split('.')[0]}")
        section_file = os.path.join(section_folder, f"section{ref.split('.')[0]}.xml")
        print(f"{indent}➡️ Going into section file: {section_file}")
        chunks.extend(parse_file(section_file, parent_title=title, depth=depth + 1))

    # Recurse into subsections
    for ref in subsection_refs:
        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref.split('.')[0]}.xml")
        print(f"{indent}➡️ Going into subsection file: {subsection_file}")
        chunks.extend(parse_file(subsection_file, parent_title=title, depth=depth + 1))

    print(f"{indent}✅ Done parsing {os.path.basename(file_path)}, total chunks so far: {len(chunks)}\n")
    return chunks

Copilot AI Jan 12, 2026


The code lacks any mechanism to prevent infinite recursion if there are circular references between XML files. While chunking.py keeps a visited set to track processed files, parse_sicp.py has no such protection, so a reference cycle would recurse until Python raises a RecursionError.
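
A minimal guard, sketched against parse_file's existing signature (the visited parameter is an addition, not part of the PR):

def parse_file(file_path, parent_title=None, depth=0, visited=None):
    if visited is None:
        visited = set()
    real_path = os.path.realpath(file_path)   # normalize so symlinked duplicates match
    if real_path in visited:
        print(f"{' ' * depth}🔁 Skipping already-visited file: {file_path}")
        return []
    visited.add(real_path)
    # ... existing body unchanged; thread visited through every recursive call:
    # chunks.extend(parse_file(section_file, parent_title=title, depth=depth + 1, visited=visited))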

Comment on lines +465 to +469
    else:
        buffer.append(text)
        types.append(ttype)
        current_tokens = current_tokens + unit_tokens


Copilot AI Jan 12, 2026


The accumulation logic has two problems. Minor: line 468 uses current_tokens = current_tokens + unit_tokens where += would be more consistent. More importantly, when overflow occurs and flush() is called, current_tokens is reset to 0, but the unit that caused the overflow is appended only in the else branch, so that content can be silently dropped.

Suggested change
-    else:
-        buffer.append(text)
-        types.append(ttype)
-        current_tokens = current_tokens + unit_tokens
+    buffer.append(text)
+    types.append(ttype)
+    current_tokens += unit_tokens
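
One way to implement the fix so the overflowing unit is never dropped: flush first, then append unconditionally (flush, max_tokens, buffer, types, and current_tokens are assumed from the surrounding chunking.py code):

if current_tokens + unit_tokens > max_tokens and buffer:
    flush()  # emit the buffered chunk; resets buffer, types, and current_tokens
# Always keep the unit that triggered the overflow so no content is lost.
buffer.append(text)
types.append(ttype)
current_tokens += unit_tokens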

"parent_title": null,
"depth": 0,
"paragraph_index": 2,
"content": "One powerful design strategy, which is particularly appropriate to the construction of programs for"

Copilot AI Jan 12, 2026


Line 18 contains truncated content: "One powerful design strategy, which is particularly appropriate to the construction of programs for" - this sentence is incomplete. This indicates a parsing issue where text extraction is cutting off mid-sentence.

Suggested change
-  "content": "One powerful design strategy, which is particularly appropriate to the construction of programs for"
+  "content": "One powerful design strategy, which is particularly appropriate to the construction of programs for modeling physical systems, is to base the structure of our program on the structure of the system being modeled."

nltk.download("punkt", quiet=True)
try:
nltk.download("punkt_tab", quiet=True)
except Exception:

Copilot AI Jan 12, 2026


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
-except Exception:
+except Exception:
+    # 'punkt_tab' is an optional resource; ignore failures so we can still use 'punkt'.
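
A slightly more explicit variant checks for each resource before downloading; nltk.data.find raises LookupError when a resource is missing, and 'punkt_tab' only exists on newer NLTK releases:

import nltk

for resource in ("punkt", "punkt_tab"):
    try:
        nltk.data.find(f"tokenizers/{resource}")
    except LookupError:
        try:
            nltk.download(resource, quiet=True)
        except Exception:
            pass  # network errors or unknown resources shouldn't break startup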
