
Conversation

@Isha-Sovasaria

This PR adds a structure-aware XML parsing and meso-level chunking pipeline for the SICP-style corpus. It walks chapters → sections → subsections, resolves entity-like references, and propagates document context into every extracted unit. Non-content tags (indexing, Scheme-only, editorial) are pruned, and text and code are segmented and normalized. Finally, content is regrouped by location and packed into token-bounded, markdown-friendly chunks suitable for RAG ingestion. A detailed report is attached below:
Report for Parsing+Chunking.pdf
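
Each extracted unit carries its document context; the fields below mirror the chunks.append call in parse_sicp.py (the concrete values are illustrative, not taken from the generated output):

{
    "source_file": "chapter3.xml",   # basename of the XML file the text came from
    "tag_type": "CHAPTER",           # root tag: CHAPTER, SECTION, or SUBSECTION
    "title": "Modularity, Objects, and State",
    "parent_title": None,            # title of the referencing file, if any
    "depth": 0,                      # recursion depth in the chapter tree
    "paragraph_index": 2,            # 1-based index of the <TEXT> block
    "content": "One powerful design strategy, ..."
}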


Copilot AI left a comment


Pull request overview

This PR introduces XML parsing and chunking infrastructure for processing SICP (Structure and Interpretation of Computer Programs) textbook content for RAG (Retrieval-Augmented Generation) ingestion. The implementation includes structure-aware parsing that traverses chapter/section/subsection hierarchies, extracts text and code segments, and produces token-bounded chunks suitable for embedding.

Changes:

  • Added parse_sicp.py for recursive XML parsing of chapter structures
  • Added chunking.py with sophisticated text/code segmentation and token-aware chunking
  • Generated chapter3_chunks.json as sample output from the parsing pipeline

Reviewed changes

Copilot reviewed 2 out of 8 changed files in this pull request and generated 7 comments.

parser/parse_sicp.py: Recursively parses SICP XML files, extracting text blocks while following section/subsection references
parser/chunking.py: Comprehensive chunking system with XML parsing, text normalization, sentence tokenization, and token-bounded chunk generation
parser/chapter3_chunks.json: Generated JSON output containing 2378 parsed chunks from Chapter 3 content
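
A minimal usage sketch of the pipeline (the import path and chapter file location are assumptions; parse_file is the entry point shown in the review below):

import json
from parse_sicp import parse_file

# Walk chapter -> sections -> subsections and collect context-tagged text units.
chunks = parse_file("chapter3/chapter3.xml")
with open("chapter3_chunks.json", "w", encoding="utf-8") as f:
    json.dump(chunks, f, indent=2, ensure_ascii=False)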


Comment on lines +75 to +82
        section_folder = os.path.join(os.path.dirname(file_path), f"section{ref.split('.')[0]}")
        section_file = os.path.join(section_folder, f"section{ref.split('.')[0]}.xml")
        print(f"{indent}➡️ Going into section file: {section_file}")
        chunks.extend(parse_file(section_file, parent_title=title, depth=depth + 1))

    # Recurse into subsections
    for ref in subsection_refs:
        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref.split('.')[0]}.xml")

Copilot AI Jan 12, 2026


The section reference extraction logic uses split('.')[0] which only takes the first part before a dot. For references like "&section1.2.3;", this would incorrectly extract only "1" instead of the full identifier "1.2.3". This will cause the wrong section files to be loaded.

Suggested change
-        section_folder = os.path.join(os.path.dirname(file_path), f"section{ref.split('.')[0]}")
-        section_file = os.path.join(section_folder, f"section{ref.split('.')[0]}.xml")
-        print(f"{indent}➡️ Going into section file: {section_file}")
-        chunks.extend(parse_file(section_file, parent_title=title, depth=depth + 1))
-    # Recurse into subsections
-    for ref in subsection_refs:
-        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref.split('.')[0]}.xml")
+        section_folder = os.path.join(os.path.dirname(file_path), f"section{ref}")
+        section_file = os.path.join(section_folder, f"section{ref}.xml")
+        print(f"{indent}➡️ Going into section file: {section_file}")
+        chunks.extend(parse_file(section_file, parent_title=title, depth=depth + 1))
+    # Recurse into subsections
+    for ref in subsection_refs:
+        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref}.xml")
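
The failure mode is easy to reproduce in isolation (the ref value here is hypothetical):

ref = "1.2.3"
ref.split('.')[0]          # -> "1": only the first component survives
f"subsection{ref}.xml"     # -> "subsection1.2.3.xml": the full identifier is kept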


    # Recurse into subsections
    for ref in subsection_refs:
        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref.split('.')[0]}.xml")

Copilot AI Jan 12, 2026


The subsection reference extraction uses split('.')[0] which is incorrect. Subsection references typically include the full path (e.g., "1.2.3"), so this will extract only the first number and construct the wrong file path.

Suggested change
-        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref.split('.')[0]}.xml")
+        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref}.xml")

Comment on lines +10 to +87
def parse_file(file_path, parent_title=None, depth=0):
    """
    Recursively parse any XML file (chapter, section, or subsection).
    """
    indent = " " * depth  # for nice indentation in logs

    if not os.path.exists(file_path):
        print(f"{indent}⚠️ Missing file: {file_path}")
        return []

    print(f"{indent}📄 Parsing ({depth=}): {file_path}")

    # Parse and unescape
    try:
        tree = ET.parse(file_path)
        root = tree.getroot()
    except Exception as e:
        print(f"{indent}❌ XML parse error in {file_path}: {e}")
        return []

    xml_text = html.unescape(ET.tostring(root, encoding="unicode"))
    chunks = []

    # Identify tag type
    tag_type = root.tag.upper()
    if root.find("NAME") is not None:
        title = " ".join(root.find("NAME").itertext())
        title = re.sub(r"\s+", " ", title).strip()
    else:
        title = "Untitled"

    # Extract text paragraphs
    text_blocks = root.findall(".//TEXT")
    print(f"{indent}🧩 Found {len(text_blocks)} <TEXT> blocks in {os.path.basename(file_path)}")

    for i, t in enumerate(text_blocks, start=1):
        for bad_tag in ["INDEX", "LABEL", "CITATION", "FOOTNOTE", "COMMENT", "WEB_ONLY"]:
            for el in t.findall(f".//{bad_tag}"):
                el.clear()

        text_content = " ".join(t.itertext()).strip()
        text_content = re.sub(r"\s+", " ", text_content)

        if text_content:
            chunks.append({
                "source_file": os.path.basename(file_path),
                "tag_type": tag_type,
                "title": title,
                "parent_title": parent_title,
                "depth": depth,
                "paragraph_index": i,
                "content": text_content
            })

    # Look for section and subsection references
    section_refs = re.findall(r"&section([\d\.]+);", xml_text)
    subsection_refs = re.findall(r"&subsection([\d\.]+);", xml_text)

    if section_refs:
        print(f"{indent}🔍 Found {len(section_refs)} section ref(s): {section_refs}")
    if subsection_refs:
        print(f"{indent} ↳ Found {len(subsection_refs)} subsection ref(s): {subsection_refs}")

    # Recurse into sections
    for ref in section_refs:
        section_folder = os.path.join(os.path.dirname(file_path), f"section{ref.split('.')[0]}")
        section_file = os.path.join(section_folder, f"section{ref.split('.')[0]}.xml")
        print(f"{indent}➡️ Going into section file: {section_file}")
        chunks.extend(parse_file(section_file, parent_title=title, depth=depth + 1))

    # Recurse into subsections
    for ref in subsection_refs:
        subsection_file = os.path.join(os.path.dirname(file_path), f"subsection{ref.split('.')[0]}.xml")
        print(f"{indent}➡️ Going into subsection file: {subsection_file}")
        chunks.extend(parse_file(subsection_file, parent_title=title, depth=depth + 1))

    print(f"{indent}✅ Done parsing {os.path.basename(file_path)}, total chunks so far: {len(chunks)}\n")
    return chunks

Copilot AI Jan 12, 2026


The code lacks any mechanism to prevent infinite recursion if there are circular references between XML files. While chunking.py keeps a visited set to track processed files, parse_sicp.py has no such protection, so a reference cycle would recurse until Python raises a RecursionError.
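
A minimal guard, sketched against parse_file's existing signature (the visited parameter is an addition, not part of the PR):

def parse_file(file_path, parent_title=None, depth=0, visited=None):
    if visited is None:
        visited = set()
    real_path = os.path.realpath(file_path)   # normalize so symlinked duplicates match
    if real_path in visited:
        print(f"{' ' * depth}🔁 Skipping already-visited file: {file_path}")
        return []
    visited.add(real_path)
    # ... existing body unchanged; thread visited through every recursive call:
    # chunks.extend(parse_file(section_file, parent_title=title, depth=depth + 1, visited=visited))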

Comment on lines +465 to +469
    else:
        buffer.append(text)
        types.append(ttype)
        current_tokens = current_tokens + unit_tokens


Copilot AI Jan 12, 2026


The accumulation logic has two problems. Minor: line 468 uses current_tokens = current_tokens + unit_tokens where += would be more consistent. More importantly, when overflow occurs and flush() is called, current_tokens is reset to 0, but the unit that caused the overflow is appended only in the else branch, so that content can be silently dropped.

Suggested change
-    else:
-        buffer.append(text)
-        types.append(ttype)
-        current_tokens = current_tokens + unit_tokens
+    buffer.append(text)
+    types.append(ttype)
+    current_tokens += unit_tokens
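
One way to implement the fix so the overflowing unit is never dropped: flush first, then append unconditionally (flush, max_tokens, buffer, types, and current_tokens are assumed from the surrounding chunking.py code):

if current_tokens + unit_tokens > max_tokens and buffer:
    flush()  # emit the buffered chunk; resets buffer, types, and current_tokens
# Always keep the unit that triggered the overflow so no content is lost.
buffer.append(text)
types.append(ttype)
current_tokens += unit_tokens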

"parent_title": null,
"depth": 0,
"paragraph_index": 2,
"content": "One powerful design strategy, which is particularly appropriate to the construction of programs for"

Copilot AI Jan 12, 2026


Line 18 contains truncated content: "One powerful design strategy, which is particularly appropriate to the construction of programs for" - this sentence is incomplete. This indicates a parsing issue where text extraction is cutting off mid-sentence.

Suggested change
-  "content": "One powerful design strategy, which is particularly appropriate to the construction of programs for"
+  "content": "One powerful design strategy, which is particularly appropriate to the construction of programs for modeling physical systems, is to base the structure of our program on the structure of the system being modeled."

nltk.download("punkt", quiet=True)
try:
nltk.download("punkt_tab", quiet=True)
except Exception:

Copilot AI Jan 12, 2026


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
-except Exception:
+except Exception:
+    # 'punkt_tab' is an optional resource; ignore failures so we can still use 'punkt'.
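
A slightly more explicit variant checks for each resource before downloading; nltk.data.find raises LookupError when a resource is missing, and 'punkt_tab' only exists on newer NLTK releases:

import nltk

for resource in ("punkt", "punkt_tab"):
    try:
        nltk.data.find(f"tokenizers/{resource}")
    except LookupError:
        try:
            nltk.download(resource, quiet=True)
        except Exception:
            pass  # network errors or unknown resources shouldn't break startup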
