Commit 870de45

refactor(core): reorganize project structure and add streaming analysis capabilities
- Restructure source code: src/ → src/src/ for better package organization - Reorganize Jupyter notebooks into snapshot_based/ and per_maker_based/ directories - Add StreamingOfferAnalyzer for real-time maker behavior change detection - Add maker_offers.py for maker-specific analysis pipeline - Add utility scripts for data processing and reorganization - Update .gitignore to exclude data files (pkl, csv, logs) - Add CLAUDE.md with project documentation and development guidelines - Update all tests and imports to reflect new structure
1 parent 7368f58 commit 870de45

34 files changed (+34857 −1520 lines)

.gitignore

Lines changed: 16 additions & 4 deletions

```diff
@@ -9,17 +9,29 @@ src/__pycache__/
 # Notebooks & checkpoints
 src/.ipynb_checkpoints/
+src/jupyter_notebooks/**/.ipynb_checkpoints/
 
 # Data files
+data-*
 data/
 data_*
 data_old
 dataframe_old.pkl
+data_cache/
+
+# Processed dataframes (pkl files)
+dataframe.pkl
+dataframe_test.pkl
+dataframe_*.pkl
+
+# Analysis outputs
+*.csv
+output-*.log
+OPTIMIZATION_SUMMARY.md
+
+# Temporary scripts
+temp_scripts/
 
 # Poetry & test caches
 poetry.lock
 .pytest_cache/
-
-# Project-specific
-dataframe.pkl
-dataframe_test.pkl
```
CLAUDE.md

Lines changed: 98 additions & 0 deletions
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a research project for analyzing Joinmarket orderbook snapshots as part of a master's thesis. The codebase parses JSON orderbook data and creates visualizations to answer questions about liquidity, maker offer lifetimes, and coinjoin protocol dynamics.

## Development Environment

**Package Management**: Poetry
- Use `poetry install` to set up dependencies
- Use `poetry shell` to activate the virtual environment
- Dependencies are managed in `pyproject.toml`

**Testing**: pytest
- Run tests with `poetry run pytest`
- Test files are located in the `test/` directory
- Main test modules: `test_snapshot_processing.py`, `test_extended_snapshot_data.py`, `test_visualisations.py`

**Data Analysis Environment**: Jupyter notebooks
- Start with `poetry run jupyter notebook`
- Analysis notebooks are located in the `src/` directory
## Architecture

The codebase follows a modular structure with three main components:

### Data Processing Pipeline (`src/preprocessing/`)
- **`snapshot.py`**: Core data loading and orderbook snapshot processing
  - `load_and_process_snapshot()`: Main entry point for processing individual snapshots
  - `process_offers()`: Analyzes offer data to extract liquidity, fees, and order sizes
  - `compute_statistics()`: Calculates statistical measures from processed offers
- **`dataframe.py`**: Batch processing and DataFrame management
  - `load_snapshots_to_dataframe()`: Processes multiple snapshots with timestamp filtering
  - Handles data serialization in pickle format for performance
- **`utils.py`**: Utility functions for file processing and timestamp extraction
- **`maker_offers.py`**: Specialized processing for maker-specific analysis

### Analysis Modules (`src/analysis/`)
- **`offer_changes.py`**: Streaming analysis for detecting offer modifications
  - `StreamingOfferAnalyzer`: Single-pass algorithm for tracking maker behavior changes
  - `OfferSignature`: Data structure for comparing similar offers
  - Real-time change detection (appeared, disappeared, modified, quantity changes)
- **`fees.py`**: Fee analysis and calculations

### Visualization (`src/visualisations/`)
- **`plot.py`**: Core plotting utilities
- **`fees.py`**: Fee-specific visualizations
## Data Structure

**Raw Data**: JSON files containing orderbook snapshots
- Located in timestamped directories (e.g., `2025-jsons/2025-07-27/`)
- Format: `orderbook_HH-MM.json`
- Contains `offers` and `fidelitybonds` arrays

**Processed Data**:
- Cached as pickle files (`dataframe.pkl`) for performance
- Structured DataFrames with temporal analysis capabilities
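Given the raw layout above, a single snapshot can be read with a short helper. This is a minimal sketch: only the `offers` and `fidelitybonds` key names come from this document; the function name is illustrative and is not part of the project's API.

```python
import json

def load_snapshot(path):
    """Read one orderbook snapshot file and return its two top-level arrays."""
    with open(path) as f:
        snapshot = json.load(f)
    # `offers` and `fidelitybonds` are the arrays described above; default to
    # empty lists so a snapshot missing a key does not raise a KeyError.
    return snapshot.get("offers", []), snapshot.get("fidelitybonds", [])
```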
## Key Processing Concepts

**Offer Types**:
- `sw0reloffer`: Relative fee offers (percentage-based)
- `sw0absoffer`: Absolute fee offers (fixed satoshi amounts)
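To compare the two offer types on equal footing, a relative fee has to be scaled by the coinjoin amount. The sketch below assumes Joinmarket's convention that `cjfee` is a fraction (often serialized as a string) for relative offers and a satoshi amount for absolute ones; the function itself is illustrative, not part of the codebase.

```python
def effective_fee_sats(offer_type, cjfee, amount_sats):
    """Fee in satoshis that a taker would pay for a given coinjoin amount.

    Assumption: `cjfee` is a fraction of the amount for `sw0reloffer`
    and a fixed satoshi value for `sw0absoffer` (Joinmarket convention).
    """
    if offer_type == "sw0reloffer":
        return round(float(cjfee) * amount_sats)
    if offer_type == "sw0absoffer":
        return int(cjfee)
    raise ValueError(f"unknown offer type: {offer_type}")
```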
**Timestamp Filtering**: Built-in support for processing snapshots at a configurable minimum time interval, so very large datasets can be thinned to a manageable size
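The filtering rule can be sketched directly from the filename layout described earlier. This is an illustrative helper, not the project's actual `load_snapshots_to_dataframe` internals; it assumes the `YYYY-MM-DD/orderbook_HH-MM.json` layout holds for every path.

```python
from datetime import datetime, timedelta

def filter_by_min_interval(filepaths, min_interval_minutes):
    """Keep only snapshots that are at least `min_interval_minutes` apart.

    Timestamps are recovered from the assumed path layout
    `.../YYYY-MM-DD/orderbook_HH-MM.json` described in this document.
    """
    kept, last_ts = [], None
    for path in sorted(filepaths):
        day, name = path.split("/")[-2], path.split("/")[-1]
        hh, mm = name.removeprefix("orderbook_").removesuffix(".json").split("-")
        ts = datetime.fromisoformat(day) + timedelta(hours=int(hh), minutes=int(mm))
        if last_ts is None or ts - last_ts >= timedelta(minutes=min_interval_minutes):
            kept.append(path)
            last_ts = ts
    return kept
```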
**Streaming Analysis**: Single-pass algorithms for real-time change detection without loading entire datasets into memory
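The single-pass idea can be illustrated by diffing consecutive snapshots while holding only two of them in memory. This is a sketch of the technique, not the project's `StreamingOfferAnalyzer`; the offer-identity fields are assumptions.

```python
def diff_snapshots(prev, curr):
    """One step of single-pass change detection between consecutive snapshots.

    Each argument maps an offer identity (e.g. a (counterparty, ordertype,
    oid) tuple -- field names assumed here) to its mutable attributes such
    as (minsize, maxsize). Only two snapshots are ever held in memory, so
    the full dataset never needs to be loaded at once.
    """
    appeared = [k for k in curr if k not in prev]
    disappeared = [k for k in prev if k not in curr]
    modified = [k for k in curr if k in prev and curr[k] != prev[k]]
    return appeared, disappeared, modified
```

A driver would fold this step over the snapshot sequence, emitting change events as it goes.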
## Common Workflows

**Loading and Processing Data**:

```python
from src.src.preprocessing.dataframe import load_snapshots_to_dataframe

df = load_snapshots_to_dataframe(filepaths, min_interval_minutes=60)
```

**Analyzing Offer Changes**:

```python
from src.src.analysis import StreamingOfferAnalyzer

analyzer = StreamingOfferAnalyzer()
summary = analyzer.analyze_complete_dataset(df_offers)
```

**Data Persistence**:

```python
from src.src.preprocessing.dataframe import save_dataframe, load_dataframe

save_dataframe(df, 'processed_data.pkl')
df = load_dataframe('processed_data.pkl')
```

pyproject.toml

Lines changed: 5 additions & 1 deletion

```diff
@@ -9,10 +9,14 @@ readme = "README.md"
 python = "^3.11"
 matplotlib = "^3.9.2"
 jupyter = "^1.1.1"
-pandas = "^2.2.3"
+pandas = "^2.3.1"
 seaborn = "^0.13.2"
 pytest = "^8.3.5"
 setuptools = "^78.1.0"
+notebook = "^7.5.0"
+pandoc = "^2.4"
+pyarrow = "^21.0.0"
+psutil = "^6.1.1"
 
 
 [build-system]
```
Lines changed: 83 additions & 0 deletions

```python
#!/usr/bin/env python3
"""
Example script showing how to use the enhanced dataframe loading with:
- Progress tracking with ETA
- Memory monitoring
- Periodic checkpointing
- Resume from checkpoint
- Configurable snapshot frequency
"""

from src.src.preprocessing.utils import get_snapshot_filepaths
from src.src.preprocessing.dataframe import load_snapshots_to_dataframe, save_dataframe

# Configuration
DATA_DIR = "2025-jsons"  # Your data directory
OUTPUT_FILE = "dataframe_processed.pkl"
CHECKPOINT_FILE = "dataframe_checkpoint.pkl"

# Processing options
MIN_INTERVAL_MINUTES = 120  # Only process snapshots at least 120 minutes apart
CHECKPOINT_EVERY = 100  # Save a checkpoint every 100 files
MAX_FILES = None  # Process all files (set to a number to limit)


def main():
    print("=" * 70)
    print("Orderbook Snapshot Processing with Progress Tracking")
    print("=" * 70)
    print()

    # Get all snapshot filepaths
    print("📂 Scanning for snapshot files...")
    filepaths = get_snapshot_filepaths(DATA_DIR)
    print(f"   Found {len(filepaths)} snapshot files")
    print()

    # Configuration summary
    print("⚙️  Configuration:")
    print(f"   Min interval: {MIN_INTERVAL_MINUTES} minutes")
    print(f"   Checkpoint every: {CHECKPOINT_EVERY} files")
    print(f"   Checkpoint file: {CHECKPOINT_FILE}")
    print(f"   Output file: {OUTPUT_FILE}")
    print(f"   Max files: {MAX_FILES or 'All'}")
    print()

    # Process snapshots with all the enhanced features
    print("🚀 Starting processing...")
    print()

    df = load_snapshots_to_dataframe(
        filepaths=filepaths,
        min_interval_minutes=MIN_INTERVAL_MINUTES,
        max_files=MAX_FILES,
        checkpoint_every=CHECKPOINT_EVERY,
        checkpoint_path=CHECKPOINT_FILE,
        resume_from_checkpoint=True,  # Will resume if a checkpoint exists
    )

    # Save final result
    if len(df) > 0:
        print()
        print(f"💾 Saving final dataframe to {OUTPUT_FILE}...")
        save_dataframe(df, OUTPUT_FILE)
        print()
        print("=" * 70)
        print("✅ Processing complete!")
        print(f"   Total snapshots in dataframe: {len(df)}")
        print(f"   Date range: {df.index.min()} to {df.index.max()}")
        print(f"   Columns: {len(df.columns)}")
        print("=" * 70)
    else:
        print()
        print("⚠️  No data processed. Check your configuration.")


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\n\n⚠️  Processing interrupted by user")
        print("   Progress has been saved to the checkpoint file")
        print("   Run the script again to resume from the checkpoint")
    except Exception as e:
        print(f"\n\n❌ Error: {e}")
        raise
```