
Commit c5d81ab

alicoding and claude committed
feat: Add export domain with LlamaIndex support and batching
- Add new export/ domain for conversation export formats
- Implement export_for_llamaindex() with optional batch_size parameter
- Fix discover_claude_files() to properly return file paths
- Export all filtering functions in main module
- Update README with v2.1.0 features and export roadmap
- Add batching using more_itertools.chunked() for memory efficiency

BREAKING CHANGES: None - fully backward compatible

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent d7a357f commit c5d81ab

File tree

12 files changed: +447 / -44 lines changed

CHANGELOG.md

Lines changed: 26 additions & 0 deletions
@@ -1,3 +1,29 @@
+## [2.1.0] - 2025-09-16
+
+### Added
+
+- **Export Domain**: New `export/` module for converting conversations to different formats (#5)
+  - `export_for_llamaindex()` - Export conversations to LlamaIndex document format for semantic search
+  - Returns list of documents with `text` and `metadata` fields
+  - Filters out tool operations, keeping only pure conversation
+- **Batching Support**: Optional `batch_size` parameter for memory-efficient processing
+  - Uses `more_itertools.chunked()` for 100% framework delegation
+- **Complete API Export**: All filtering functions now properly exported from main module (#6)
+  - `filter_messages_by_type()`, `filter_messages_by_tool()`, `search_messages_by_content()`, `exclude_tool_operations()`
+
+### Fixed
+
+- **discover_claude_files()**: Now properly returns file paths instead of empty list (#7)
+  - Extracts `transcript_path` from session metadata
+  - Returns empty list for non-existent search paths
+  - Enables downstream services to find and index conversations
+
+### Documentation
+
+- Updated README with export functionality and roadmap for future formats
+- Added 10+ planned export formats: Mem0, ChromaDB, Pinecone, Markdown, JSON-LD, OpenAI, Anthropic, LangChain, Haystack
+- Corrected architecture diagram to show all 19 domains (was showing 15)
+
 ## [2.0.1] - 2025-09-15
 
 # Changelog
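The document shape described in this changelog entry, as a minimal sketch (the `session.jsonl` path is a placeholder):

```python
from claude_parser import export_for_llamaindex

docs = export_for_llamaindex("session.jsonl")  # placeholder path
for doc in docs[:3]:
    # Each document is a plain dict: text plus flat metadata
    print(doc['text'][:60])
    print(doc['metadata']['speaker'], doc['metadata']['uuid'])
```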

README.md

Lines changed: 68 additions & 14 deletions
@@ -1,4 +1,4 @@
-# Claude Parser v2.0.0 🚀
+# Claude Parser v2.1.0 🚀
 
 [![PyPI version](https://badge.fury.io/py/claude-parser.svg)](https://badge.fury.io/py/claude-parser)
 [![Documentation](https://img.shields.io/badge/docs-mkdocs-blue)](https://alicoding.github.io/claude-parser/)
@@ -8,9 +8,15 @@
 
 Claude Parser treats every Claude API call as a git commit, enabling powerful recovery and analysis capabilities when things go wrong.
 
-## 🎉 What's New in v2.0.0
+## 🎉 What's New in v2.1.0
 
-### Major Changes
+### New Features
+- **🔍 Export Domain** - Export conversations to different formats for indexing
+- **📚 LlamaIndex Export** - `export_for_llamaindex()` for semantic search
+- **🛠️ Fixed Discovery** - `discover_claude_files()` now properly returns file paths
+- **📦 Complete API** - All filtering functions now properly exported
+
+## 📋 v2.0.0 Major Changes
 - **🎯 Complete API Redesign** - Clean, intuitive Python API with 30+ functions
 - **📚 15 Domain Architecture** - Organized into focused, composable modules
 - **🔧 CG Commands** - Full Git-like CLI for disaster recovery
@@ -106,7 +112,10 @@ from claude_parser import (
 
     # Filtering (NEW in v2!)
     filter_messages_by_type, filter_messages_by_tool,
-    search_messages_by_content, exclude_tool_operations
+    search_messages_by_content, exclude_tool_operations,
+
+    # Export (NEW in v2.1!)
+    export_for_llamaindex  # Export conversations for semantic search
 )
 ```
 
@@ -155,26 +164,44 @@ def on_assistant(message):
 watch("~/.claude/projects/current/session.jsonl", on_assistant=on_assistant)
 ```
 
+### 5. Export for Semantic Search (NEW in v2.1!)
+```python
+from claude_parser import export_for_llamaindex
+
+# Export conversations to LlamaIndex format
+docs = export_for_llamaindex("session.jsonl")
+# Returns: [{"text": "message", "metadata": {...}}, ...]
+
+# Use with semantic search services
+for doc in docs:
+    print(f"Text: {doc['text'][:50]}...")
+    print(f"Speaker: {doc['metadata']['speaker']}")
+```
+
 ## 🏗️ Architecture
 
-### Clean Domain Organization (15 modules)
+### Clean Domain Organization (19 modules)
 ```
 claude_parser/
-├── analytics/     # Session and tool analysis
+├── analytics/     # Session and tool analysis
+├── api/           # API utilities
 ├── cli/           # CG and CH commands
+├── core/          # Core utilities
 ├── discovery/     # File and project discovery
-├── filtering/     # Message filtering (NEW!)
+├── export/        # Export formats (NEW in v2.1!)
+├── extensions/    # Extension system
+├── filtering/     # Message filtering
 ├── hooks/         # Hook system and API
 ├── loaders/       # Session loading
-├── messages/      # Message utilities (NEW!)
-├── models/        # Data models (NEW!)
+├── messages/      # Message utilities
+├── models/        # Data models
 ├── navigation/    # Timeline and UUID navigation
 ├── operations/    # File operations
 ├── queries/       # DuckDB SQL queries
 ├── session/       # Session management
 ├── storage/       # DuckDB engine
 ├── tokens/        # Token counting and billing
-└── watch/         # Real-time monitoring (NEW!)
+└── watch/         # Real-time monitoring
 ```
 
 ### LNCA Principles
@@ -229,6 +256,33 @@ twine upload dist/*
 ### Documentation
 Documentation auto-deploys to GitHub Pages on every push to main.
 
+## 🗺️ Export Format Roadmap
+
+### Currently Available (v2.1)
+- **LlamaIndex** - `export_for_llamaindex()` - For semantic search indexing
+
+### Planned Export Formats
+- 🔜 **Mem0** - Long-term memory for AI agents
+- 🔜 **ChromaDB** - Vector database format
+- 🔜 **Pinecone** - Cloud vector database
+- 🔜 **Markdown** - Human-readable conversation logs
+- 🔜 **JSON-LD** - Structured data with context
+- 🔜 **OpenAI Messages** - Direct OpenAI API format
+- 🔜 **Anthropic Messages** - Direct Anthropic API format
+- 🔜 **LangChain Documents** - LangChain document format
+- 🔜 **Haystack Documents** - Haystack NLP framework
+
+### Export Domain Architecture
+```python
+claude_parser/export/
+├── __init__.py    # Export registry
+├── llamaindex.py  # LlamaIndex format (DONE)
+├── mem0.py        # Mem0 format (TODO)
+├── chroma.py      # ChromaDB format (TODO)
+├── markdown.py    # Markdown format (TODO)
+└── ...            # More formats
+```
+
 ## 🤝 Contributing
 
 We welcome contributions! Please ensure:
@@ -249,18 +303,18 @@ MIT License - See [LICENSE](LICENSE) file for details.
 
 ## 📊 Stats
 
-- **15** specialized domains
-- **30+** public functions
+- **19** specialized domains
+- **35+** public functions
 - **<80** lines per file
 - **100%** framework delegation
 - **0** custom error handling
 
 ---
 
-**Ready to never lose code again?** Install v2.0.0 and experience the power of Git-like recovery for Claude Code!
+**Ready to never lose code again?** Install v2.1.0 and experience the power of Git-like recovery for Claude Code!
 
 ```bash
-pip install claude-parser==2.0.0
+pip install claude-parser==2.1.0
 ```
 
 [Documentation](https://alicoding.github.io/claude-parser/) | [GitHub](https://github.com/alicoding/claude-parser) | [PyPI](https://pypi.org/project/claude-parser/)
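The README example above covers the unbatched call; a sketch of the batched variant enabled by `batch_size` (the downstream indexing function is hypothetical):

```python
from claude_parser import export_for_llamaindex

# batch_size turns the return value into an iterator of lists,
# so only one batch of documents is materialized at a time
for batch in export_for_llamaindex("session.jsonl", batch_size=100):
    index_batch(batch)  # hypothetical downstream indexing call
```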

claude_parser/__init__.py

Lines changed: 7 additions & 4 deletions
@@ -6,17 +6,19 @@
 
 # LNCA Core API - 100% Framework Delegation
 from .main import load_session, load_latest_session, discover_all_sessions
-from .analytics import analyze_session, analyze_project_contexts, analyze_tool_usage
+from .analytics import analyze_session, analyze_project_contexts, analyze_tool_usage
 from .discovery import discover_claude_files, group_by_projects, analyze_project_structure, discover_current_project_files
 from .operations import restore_file_content, generate_file_diff, compare_files, backup_file
 from .navigation import find_message_by_uuid, get_message_sequence, get_timeline_summary
 from .tokens import count_tokens, analyze_token_usage, estimate_cost, token_status
 from .tokens.context import calculate_context_window
 from .tokens.billing import calculate_session_cost
 from .session import SessionManager
+from .export import export_for_llamaindex
+from .filtering import filter_messages_by_type, filter_messages_by_tool, search_messages_by_content, exclude_tool_operations
 
 # Version info
-__version__ = "2.0.1"
+__version__ = "2.1.0"
 
 # Message types for filtering
 class MessageType:
@@ -41,13 +43,14 @@ def find_current_transcript():
 
 # Clean exports - API only
 __all__ = [
-    'load_session', 'load_latest_session', 'discover_all_sessions',
+    'load_session', 'load_latest_session', 'discover_all_sessions',
    'analyze_session', 'analyze_project_contexts', 'analyze_tool_usage',
    'discover_claude_files', 'group_by_projects', 'analyze_project_structure', 'discover_current_project_files',
    'restore_file_content', 'generate_file_diff', 'compare_files', 'backup_file',
    'find_message_by_uuid', 'get_message_sequence', 'get_timeline_summary',
    'count_tokens', 'analyze_token_usage', 'estimate_cost', 'token_status',
    'calculate_context_window', 'calculate_session_cost',
-    'load_many', 'find_current_transcript',
+    'filter_messages_by_type', 'filter_messages_by_tool', 'search_messages_by_content', 'exclude_tool_operations',
+    'load_many', 'find_current_transcript', 'export_for_llamaindex',
    'MessageType', '__version__'
 ]
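A sketch of the newly exported filtering functions in use; their exact signatures are not shown in this diff, so the arguments below are assumptions:

```python
from claude_parser import (
    load_session,
    filter_messages_by_type,
    exclude_tool_operations,
)

session = load_session("session.jsonl")  # placeholder path
messages = session.get('messages', [])

# Assumed signatures: both take the message list; the type filter
# additionally takes the message type to keep
assistant_only = filter_messages_by_type(messages, 'assistant')
no_tools = exclude_tool_operations(messages)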

claude_parser/discovery/core.py

Lines changed: 12 additions & 4 deletions
@@ -42,12 +42,20 @@ def discover_claude_files(search_path: str = None) -> List[Path]:
         for pattern in ["*.jsonl", "*.claude", "*.transcript"]:
             files.extend(search_dir.rglob(pattern))
         return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)
+    else:
+        # Non-existent path returns empty list
+        return []
 
-    # Default: discover all sessions
+    # Default: discover all sessions and extract file paths
     sessions = discover_all_sessions()
-    # Extract file paths from session data
-    # Sessions are dicts with metadata, need to extract paths
-    return []  # Simplified for now
+    # Extract transcript_path from each session's metadata
+    paths = []
+    for session in sessions:
+        if session and 'metadata' in session:
+            transcript_path = session['metadata'].get('transcript_path')
+            if transcript_path:
+                paths.append(Path(transcript_path))
+    return paths
 
 
 def group_by_projects(files: List[Path]) -> Dict[Path, List[Path]]:
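Usage implied by the fix above, as a minimal sketch (the explicit path is a placeholder):

```python
from claude_parser import discover_claude_files, group_by_projects

# Default: transcript_path extracted from each session's metadata
files = discover_claude_files()

# Explicit search path: recursive glob for *.jsonl / *.claude / *.transcript,
# newest first; a non-existent path now returns [] instead of an error
scoped = discover_claude_files("/path/to/projects")  # placeholder path

projects = group_by_projects(files)
```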

claude_parser/export/__init__.py

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+"""
+Export domain - 100% framework delegation
+@FRAMEWORK_FIRST: Only use existing functions
+@ZERO_CUSTOM_CODE: No loops, no manual parsing
+"""
+
+from .llamaindex import export_for_llamaindex
+
+__all__ = ['export_for_llamaindex']

claude_parser/export/llamaindex.py

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+"""
+LlamaIndex export format - 100% framework delegation
+@ZERO_CUSTOM_CODE: No loops, only map/filter
+@FRAMEWORK_FIRST: Reuse existing functions
+@DRY_FIRST: Don't duplicate text extraction logic
+@LOC_ENFORCEMENT: <80 lines
+"""
+
+from typing import List, Dict, Any, Union, Iterator
+from operator import methodcaller
+from functools import partial
+from more_itertools import chunked
+
+from ..main import load_session
+from ..filtering import filter_pure_conversation
+from ..messages.utils import get_text
+
+
+def _extract_document(msg: Dict[str, Any]) -> Dict[str, Any]:
+    """Transform message to LlamaIndex document format
+
+    @SEMANTIC_INTERFACE: Returns simple text + metadata
+    @NO_IMPLEMENTATION_EXPOSURE: Hides message complexity
+    """
+    return {
+        'text': get_text(msg),  # Reuse existing text extractor
+        'metadata': {
+            'speaker': msg.get('type', 'unknown'),
+            'uuid': msg.get('uuid', ''),
+            'timestamp': msg.get('timestamp', ''),
+            'session_id': msg.get('sessionId', '')
+        }
+    }
+
+
+def export_for_llamaindex(jsonl_path: str, batch_size: int = None) -> Union[List[Dict[str, Any]], Iterator[List[Dict[str, Any]]]]:
+    """Export conversation for LlamaIndex indexing with optional batching
+
+    @API_FIRST: Public interface for semantic-search
+    @FRAMEWORK_FIRST: 100% delegation to existing functions
+    @ZERO_CUSTOM_CODE: No manual loops, only map and chunked
+
+    Args:
+        jsonl_path: Path to JSONL conversation file
+        batch_size: Optional batch size for memory-efficient processing
+
+    Returns:
+        If batch_size=None: List of all documents
+        If batch_size>0: Iterator yielding batches of documents
+        Each document has:
+        - text: Plain text content
+        - metadata: speaker, uuid, timestamp, session_id
+    """
+    # Load using existing SDK
+    session = load_session(jsonl_path)
+    if not session:
+        return [] if not batch_size else iter([])
+
+    # Get messages
+    messages = session.get('messages', [])
+
+    # Filter using existing function (returns generator)
+    clean_messages = filter_pure_conversation(messages)
+
+    # Transform using map (no custom loops!)
+    documents = map(_extract_document, clean_messages)
+
+    # Return batched or full list based on batch_size
+    if batch_size:
+        # more-itertools handles all batching logic!
+        return chunked(documents, batch_size)
+    else:
+        # Current behavior - return full list
+        return list(documents)
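The batching delegated to `more_itertools.chunked()` above yields lists of at most `batch_size` items, with a shorter final batch; a quick illustration:

```python
from more_itertools import chunked

docs = [{'text': f'msg {i}', 'metadata': {}} for i in range(5)]

# chunked() is lazy: each batch is built only as the iterator advances
batches = list(chunked(docs, 2))
assert [len(b) for b in batches] == [2, 2, 1]
```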

docs/api.md

Lines changed: 7 additions & 21 deletions
@@ -1,23 +1,9 @@
-To generate API documentation for the Claude Parser project, you can follow these steps:
+To generate API documentation, you can follow these steps:
 
-1. **Documentation Tool**: Choose a documentation tool like Sphinx, MkDocs, or similar that suits your project requirements.
+1. Document each module, function, and class in your codebase using docstrings that describe their purpose, parameters, and return values.
+2. Utilize tools like Sphinx or MkDocs to automatically generate documentation from the docstrings.
+3. Organize the documentation into sections based on the different components of your project, such as hooks, tokens, CLI commands, etc.
+4. Include examples and usage scenarios to help users understand how to interact with the API.
+5. Provide information on any configuration settings, environment variables, or dependencies required to use the API.
 
-2. **Setup Documentation**: Initialize the documentation tool in the project directory and configure it to generate documentation from the source code.
-
-3. **Documenting Modules**: Use the tool to automatically generate documentation from the Python source files in the project. This will include information about modules, classes, functions, and their docstrings.
-
-4. **Include Descriptions**: Ensure that the generated documentation includes descriptions of modules, classes, functions, and their parameters. This will help users understand the purpose and usage of each component.
-
-5. **API Endpoints**: If the project includes API endpoints, document them with details on request/response formats, parameters, and expected outputs.
-
-6. **Hooks and Handlers**: Document the hooks and handlers used in the project, including their purpose, input parameters, and expected behavior.
-
-7. **Settings and Configurations**: Include documentation for settings and configurations used in the project, such as API keys, database URLs, and other important settings.
-
-8. **Token Analysis**: Document the token analysis functions, including details on estimating costs, analyzing token usage, and status tracking.
-
-9. **Analytics and Tools**: Provide documentation for the analytics functions, tool usage analysis, and any other data analysis operations in the project.
-
-10. **CLI Commands**: Document the CLI commands available in the project, including their usage, options, and expected outcomes.
-
-By following these steps and ensuring comprehensive documentation across all aspects of the project, you can generate detailed API documentation that will be helpful for users and developers interacting with the Claude Parser project.
+Additionally, you can install Sphinx using pip, navigate to the root directory of your project, initialize Sphinx, modify the `conf.py` file to include paths to your Python modules, write docstrings in reStructuredText format, use Sphinx directives like `autodoc` to generate documentation, run the Sphinx build command, and find the generated HTML documentation in the specified output directory. By customizing the generated documentation with additional details as needed, you can create comprehensive API documentation for your Python project.
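A minimal `conf.py` along the lines sketched above, assuming a `docs/` directory created with `sphinx-quickstart` and the package importable one level up:

```python
# docs/conf.py - minimal Sphinx configuration sketch
import os
import sys

# Make the claude_parser package importable for autodoc
sys.path.insert(0, os.path.abspath('..'))

project = 'claude-parser'
extensions = [
    'sphinx.ext.autodoc',   # pull API docs from docstrings
    'sphinx.ext.napoleon',  # accept Google/NumPy-style docstrings
]
```

Building is then `sphinx-build -b html docs docs/_build/html` (or `make html` inside `docs/`).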

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "claude-parser"
-version = "2.0.1"
+version = "2.1.0"
 description = "Parse and analyze Claude Code JSONL exports"
 authors = ["Your Name <you@example.com>"]
 readme = "README.md"
