Commit 901f666

realmarcin and claude committed
Add document concatenation script and Makefile targets
Script Features (src/download/concatenate_documents.py):
- Concatenates documents from a directory in reproducible alphabetical order
- Includes table of contents with all file paths
- Adds headers with filename, path, and size for each document
- Handles multiple encodings (UTF-8, UTF-8-sig, latin-1)
- Supports filtering by file extension
- Supports recursive directory search
- Customizable separators between documents
- Optional headers and summary sections

Makefile Targets:
- make concat-docs INPUT_DIR=dir OUTPUT_FILE=file
  Generic concatenation with required parameters
  Optional: EXTENSIONS=".txt .md" RECURSIVE=true
- make concat-extracted
  Concatenates D4D YAML files from data/extracted_by_column/
  Creates one file per project column (AI_READI, CHORUS, CM4AI, VOICE)
  Output: data/concatenated/{column}_d4d.txt
- make concat-downloads
  Concatenates raw downloads from downloads_by_column/
  Creates one file per project column
  Output: data/concatenated/{column}_raw.txt

Documentation:
- Added comprehensive "Document Concatenation" section to CLAUDE.md
- Documented all command options and use cases
- Updated help menu in Makefile

Use Cases:
- Combine all downloaded dataset documentation for a project
- Create single input documents for LLM processing
- Merge documentation fragments into complete documents
- Aggregate logs or reports from multiple files

Tested with data/extracted_by_column/:
- AI_READI: 6 files → 18K
- CHORUS: 2 files → 4.0K
- CM4AI: 4 files → 6.4K
- VOICE: 4 files → 19K

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
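The source of `src/download/concatenate_documents.py` is not shown in this view, so the following is only a minimal sketch of how the reproducible ordering and the encoding fallback described in the commit message might work. The function names `collect_files` and `read_text_with_fallback`, their parameters, and the choice to sort on the path relative to the input directory are assumptions for illustration, not the script's actual API.

```python
from pathlib import Path


def collect_files(input_dir, extensions=None, recursive=False):
    """Return the files under input_dir in a stable, alphabetical order.

    Hypothetical helper: the real script may filter and sort differently.
    """
    root = Path(input_dir)
    candidates = root.rglob("*") if recursive else root.glob("*")
    wanted = {ext.lower() for ext in extensions} if extensions else None
    files = [
        p for p in candidates
        if p.is_file() and (wanted is None or p.suffix.lower() in wanted)
    ]
    # Sorting on the relative POSIX path removes any dependence on the
    # order in which the filesystem happens to list directory entries.
    return sorted(files, key=lambda p: p.relative_to(root).as_posix())


def read_text_with_fallback(path, encodings=("utf-8", "utf-8-sig", "latin-1")):
    """Decode a file by trying each encoding in turn (order from the commit message)."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 accepts every byte sequence, so this line is only reached
    # when the caller passes a list without such a catch-all encoding.
    return data.decode("utf-8", errors="replace")
```

Because latin-1 maps every byte to a character, placing it last means a read never fails on decoding alone, at the cost of possibly mis-rendering text that was written in some other legacy encoding.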
1 parent b3a2bd4 commit 901f666

4 files changed

+401
-0
lines changed

CLAUDE.md

Lines changed: 48 additions & 0 deletions
@@ -225,6 +225,51 @@ The D4D agents use the `aurelian` framework:
 - Processes HTML, PDF, JSON, and text documents
 - Can be run via CLI: `aurelian datasheets <URL>` or `aurelian datasheets --ui`
 
+## Document Concatenation
+
+This project includes tools to concatenate multiple documents from a directory into a single document in reproducible order.
+
+### Concatenation Commands
+
+```bash
+# Concatenate documents from a specific directory
+make concat-docs INPUT_DIR=path/to/dir OUTPUT_FILE=path/to/output.txt
+
+# Optional parameters:
+make concat-docs INPUT_DIR=path/to/dir OUTPUT_FILE=output.txt EXTENSIONS=".txt .md" RECURSIVE=true
+
+# Concatenate extracted D4D documents by column (from data/extracted_by_column)
+make concat-extracted
+
+# Concatenate raw downloads by column (from downloads_by_column)
+make concat-downloads
+
+# Direct script usage with more options:
+python src/download/concatenate_documents.py -i input_dir -o output.txt [OPTIONS]
+
+# Script options:
+# -e, --extensions .txt .md # Filter by file extensions
+# -r, --recursive # Search subdirectories
+# --no-headers # Exclude file headers
+# --no-summary # Exclude table of contents
+# -s "separator" # Custom separator between files
+```
+
+### Features
+
+- **Reproducible ordering**: Files are sorted alphabetically for consistent results
+- **Multiple formats**: Handles text, HTML, YAML, and other text-based formats
+- **File metadata**: Includes headers with filename, path, and size
+- **Table of contents**: Summary section lists all concatenated files
+- **Error handling**: Gracefully handles encoding issues and read errors
+
+### Use Cases
+
+- Combine all downloaded dataset documentation for a project
+- Create single input documents for LLM processing
+- Merge documentation fragments into complete documents
+- Aggregate logs or reports from multiple files
+
 ## Custom Makefile Targets
 
 Beyond standard LinkML targets, this project adds:
@@ -235,6 +280,9 @@ make gen-html # Generate HTML from D4D YAML files using human_readab
 make full-schema # Generate data_sheets_schema_all.yaml (merged schema)
 make test-modules # Validate all individual D4D module schemas
 make lint-modules # Lint all individual D4D module schemas
+make concat-docs # Concatenate documents from a directory
+make concat-extracted # Concatenate extracted D4D documents by column
+make concat-downloads # Concatenate raw downloads by column
 ```
 
 ## Null/Empty Value Handling
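
The script options listed in the CLAUDE.md hunk could be declared with `argparse` roughly as below. This is a sketch under stated assumptions: only the short flags and the long names `--extensions`, `--recursive`, `--no-headers`, and `--no-summary` appear in the diff. The long names `--input-dir`, `--output`, and `--separator`, the default separator, and the function name `build_parser` are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (long names partly assumed)."""
    parser = argparse.ArgumentParser(
        description="Concatenate documents from a directory in reproducible order"
    )
    # "--input-dir" and "--output" are assumed; the diff documents only -i and -o.
    parser.add_argument("-i", "--input-dir", required=True, help="directory to read")
    parser.add_argument("-o", "--output", required=True, help="file to write")
    # nargs="+" accepts several extensions after one flag: -e .txt .md
    parser.add_argument("-e", "--extensions", nargs="+", default=None,
                        help="only include files with these extensions")
    parser.add_argument("-r", "--recursive", action="store_true",
                        help="search subdirectories")
    parser.add_argument("--no-headers", action="store_true",
                        help="exclude per-file headers")
    parser.add_argument("--no-summary", action="store_true",
                        help="exclude the table of contents")
    # "--separator" and its default are assumed; the diff documents only -s.
    parser.add_argument("-s", "--separator", default="\n",
                        help="text placed between files")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Declaring `-e` with `nargs="+"` fits the way `concat-docs` passes `EXTENSIONS=".txt .md"`: the Make variable is expanded unquoted, so the shell hands the script two separate words after `-e`.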

Makefile

Lines changed: 3 additions & 0 deletions
@@ -50,6 +50,9 @@ help: status
 	@echo "make test-modules -- validate all D4D module schemas"
 	@echo "make lint -- perfom linting"
 	@echo "make lint-modules -- lint all D4D module schemas"
+	@echo "make concat-docs INPUT_DIR=dir OUTPUT_FILE=file -- concatenate documents from directory"
+	@echo "make concat-extracted -- concatenate extracted D4D documents by column"
+	@echo "make concat-downloads -- concatenate raw downloads by column"
 	@echo "make testdoc -- builds docs and runs local test server"
 	@echo "make deploy -- deploys site"
 	@echo "make update -- updates linkml version"

project.Makefile

Lines changed: 45 additions & 0 deletions
@@ -15,3 +15,48 @@ gen-minimal-examples:
 # Generate HTML from current D4D YAML files
 gen-html:
 	$(RUN) python src/html/human_readable_renderer.py
+
+# Concatenate documents from a directory
+# Usage: make concat-docs INPUT_DIR=path/to/dir OUTPUT_FILE=path/to/output.txt
+# Optional: EXTENSIONS=".txt .md" RECURSIVE=true
+concat-docs:
+ifndef INPUT_DIR
+	$(error INPUT_DIR is not defined. Usage: make concat-docs INPUT_DIR=path/to/dir OUTPUT_FILE=path/to/output.txt)
+endif
+ifndef OUTPUT_FILE
+	$(error OUTPUT_FILE is not defined. Usage: make concat-docs INPUT_DIR=path/to/dir OUTPUT_FILE=path/to/output.txt)
+endif
+	@echo "Concatenating documents from $(INPUT_DIR) to $(OUTPUT_FILE)"
+	$(RUN) python src/download/concatenate_documents.py -i $(INPUT_DIR) -o $(OUTPUT_FILE) \
+		$(if $(EXTENSIONS),-e $(EXTENSIONS),) \
+		$(if $(RECURSIVE),-r,)
+
+# Concatenate extracted D4D documents by column
+# This creates a single file per project column from data/extracted_by_column
+concat-extracted:
+	@echo "Concatenating extracted D4D documents by column..."
+	@mkdir -p data/concatenated
+	@for column_dir in data/extracted_by_column/*/; do \
+		if [ -d "$$column_dir" ]; then \
+			column_name=$$(basename "$$column_dir"); \
+			output_file="data/concatenated/$${column_name}_d4d.txt"; \
+			echo "Processing $$column_name..."; \
+			$(RUN) python src/download/concatenate_documents.py -i "$$column_dir" -o "$$output_file" || exit 1; \
+		fi \
+	done
+	@echo "✅ All columns concatenated to data/concatenated/"
+
+# Concatenate documents from downloads_by_column subdirectories
+# This creates a single file per project column from raw downloads
+concat-downloads:
+	@echo "Concatenating downloaded documents by column..."
+	@mkdir -p data/concatenated
+	@for column_dir in downloads_by_column/*/; do \
+		if [ -d "$$column_dir" ]; then \
+			column_name=$$(basename "$$column_dir"); \
+			output_file="data/concatenated/$${column_name}_raw.txt"; \
+			echo "Processing $$column_name..."; \
+			$(RUN) python src/download/concatenate_documents.py -i "$$column_dir" -o "$$output_file" || exit 1; \
+		fi \
+	done
+	@echo "✅ All downloads concatenated to data/concatenated/"
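
The `concat-extracted` and `concat-downloads` recipes share one pattern: visit each immediate subdirectory of a source root and derive an output path of the form `data/concatenated/{column}_{suffix}.txt`. As a rough illustration of that mapping only, the same plan could be computed in Python as follows. The function name `plan_column_outputs` and its parameters are hypothetical and do not exist in the commit.

```python
from pathlib import Path


def plan_column_outputs(source_root, output_dir="data/concatenated", suffix="d4d"):
    """Pair each column directory with the output file the Makefile loop would write.

    Hypothetical helper mirroring the shell loop; it only computes paths
    and does not run the concatenation script.
    """
    root = Path(source_root)
    out = Path(output_dir)
    columns = sorted(
        (p for p in root.iterdir()
         # The shell glob "*/" skips hidden entries, so skip them here too.
         if p.is_dir() and not p.name.startswith(".")),
        key=lambda p: p.name,
    )
    return [(col, out / f"{col.name}_{suffix}.txt") for col in columns]


if __name__ == "__main__":
    source = Path("data/extracted_by_column")
    if source.is_dir():
        for column_dir, output_file in plan_column_outputs(source):
            print(f"{column_dir} -> {output_file}")
```

With `suffix="raw"` and `downloads_by_column` as the source root, the same function describes the `concat-downloads` outputs.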
