Commit 870de45

refactor(core): reorganize project structure and add streaming analysis capabilities
- Restructure source code: src/ → src/src/ for better package organization - Reorganize Jupyter notebooks into snapshot_based/ and per_maker_based/ directories - Add StreamingOfferAnalyzer for real-time maker behavior change detection - Add maker_offers.py for maker-specific analysis pipeline - Add utility scripts for data processing and reorganization - Update .gitignore to exclude data files (pkl, csv, logs) - Add CLAUDE.md with project documentation and development guidelines - Update all tests and imports to reflect new structure
1 parent 7368f58 commit 870de45

34 files changed (+34857 −1520 lines)

.gitignore

Lines changed: 16 additions & 4 deletions

```diff
@@ -9,17 +9,29 @@ src/__pycache__/
 # Notebooks & checkpoints
 src/.ipynb_checkpoints/
+src/jupyter_notebooks/**/.ipynb_checkpoints/
 
 # Data files
+data-*
 data/
 data_*
 data_old
 dataframe_old.pkl
+data_cache/
+
+# Processed dataframes (pkl files)
+dataframe.pkl
+dataframe_test.pkl
+dataframe_*.pkl
+
+# Analysis outputs
+*.csv
+output-*.log
+OPTIMIZATION_SUMMARY.md
+
+# Temporary scripts
+temp_scripts/
 
 # Poetry & test caches
 poetry.lock
 .pytest_cache/
-
-# Project-specific
-dataframe.pkl
-dataframe_test.pkl
```
CLAUDE.md

Lines changed: 98 additions & 0 deletions
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a research project for analyzing Joinmarket orderbook snapshots as part of a master's thesis. The codebase parses JSON orderbook data and creates visualizations to answer questions about liquidity, maker offer lifetimes, and coinjoin protocol dynamics.

## Development Environment

**Package Management**: Poetry
- Use `poetry install` to set up dependencies
- Use `poetry shell` to activate the virtual environment
- Dependencies are managed in `pyproject.toml`

**Testing**: pytest
- Run tests with `poetry run pytest`
- Test files are located in the `test/` directory
- Main test modules: `test_snapshot_processing.py`, `test_extended_snapshot_data.py`, `test_visualisations.py`

**Data Analysis Environment**: Jupyter notebooks
- Start with `poetry run jupyter notebook`
- Analysis notebooks are located in the `src/` directory
## Architecture

The codebase follows a modular structure with three main components:

### Data Processing Pipeline (`src/preprocessing/`)
- **`snapshot.py`**: Core data loading and orderbook snapshot processing
  - `load_and_process_snapshot()`: Main entry point for processing individual snapshots
  - `process_offers()`: Analyzes offer data to extract liquidity, fees, and order sizes
  - `compute_statistics()`: Calculates statistical measures from processed offers
- **`dataframe.py`**: Batch processing and DataFrame management
  - `load_snapshots_to_dataframe()`: Processes multiple snapshots with timestamp filtering
  - Handles data serialization in pickle format for performance
- **`utils.py`**: Utility functions for file processing and timestamp extraction
- **`maker_offers.py`**: Specialized processing for maker-specific analysis

### Analysis Modules (`src/analysis/`)
- **`offer_changes.py`**: Streaming analysis for detecting offer modifications
  - `StreamingOfferAnalyzer`: Single-pass algorithm for tracking maker behavior changes
  - `OfferSignature`: Data structure for comparing similar offers
  - Real-time change detection (appeared, disappeared, modified, quantity changes)
- **`fees.py`**: Fee analysis and calculations

### Visualization (`src/visualisations/`)
- **`plot.py`**: Core plotting utilities
- **`fees.py`**: Fee-specific visualizations
## Data Structure

**Raw Data**: JSON files containing orderbook snapshots
- Located in timestamped directories (e.g., `2025-jsons/2025-07-27/`)
- Format: `orderbook_HH-MM.json`
- Contains `offers` and `fidelitybonds` arrays

**Processed Data**:
- Cached as pickle files (`dataframe.pkl`) for performance
- Structured DataFrames with temporal analysis capabilities
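Given the raw layout above, a single snapshot can be read with a short helper. This is a minimal sketch: only the `offers` and `fidelitybonds` key names come from this document; the function name is illustrative and is not part of the project's API.

```python
import json

def load_snapshot(path):
    """Read one orderbook snapshot file and return its two top-level arrays."""
    with open(path) as f:
        snapshot = json.load(f)
    # `offers` and `fidelitybonds` are the arrays described above; default to
    # empty lists so a snapshot missing a key does not raise a KeyError.
    return snapshot.get("offers", []), snapshot.get("fidelitybonds", [])
```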
## Key Processing Concepts

**Offer Types**:
- `sw0reloffer`: Relative fee offers (percentage-based)
- `sw0absoffer`: Absolute fee offers (fixed satoshi amounts)
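To compare the two offer types on equal footing, a relative fee has to be scaled by the coinjoin amount. The sketch below assumes Joinmarket's convention that `cjfee` is a fraction (often serialized as a string) for relative offers and a satoshi amount for absolute ones; the function itself is illustrative, not part of the codebase.

```python
def effective_fee_sats(offer_type, cjfee, amount_sats):
    """Fee in satoshis that a taker would pay for a given coinjoin amount.

    Assumption: `cjfee` is a fraction of the amount for `sw0reloffer`
    and a fixed satoshi value for `sw0absoffer` (Joinmarket convention).
    """
    if offer_type == "sw0reloffer":
        return round(float(cjfee) * amount_sats)
    if offer_type == "sw0absoffer":
        return int(cjfee)
    raise ValueError(f"unknown offer type: {offer_type}")
```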
**Timestamp Filtering**: Built-in support for processing snapshots at a configurable minimum time interval, so very large datasets can be thinned to a manageable size
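The filtering rule can be sketched directly from the filename layout described earlier. This is an illustrative helper, not the project's actual `load_snapshots_to_dataframe` internals; it assumes the `YYYY-MM-DD/orderbook_HH-MM.json` layout holds for every path.

```python
from datetime import datetime, timedelta

def filter_by_min_interval(filepaths, min_interval_minutes):
    """Keep only snapshots that are at least `min_interval_minutes` apart.

    Timestamps are recovered from the assumed path layout
    `.../YYYY-MM-DD/orderbook_HH-MM.json` described in this document.
    """
    kept, last_ts = [], None
    for path in sorted(filepaths):
        day, name = path.split("/")[-2], path.split("/")[-1]
        hh, mm = name.removeprefix("orderbook_").removesuffix(".json").split("-")
        ts = datetime.fromisoformat(day) + timedelta(hours=int(hh), minutes=int(mm))
        if last_ts is None or ts - last_ts >= timedelta(minutes=min_interval_minutes):
            kept.append(path)
            last_ts = ts
    return kept
```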
**Streaming Analysis**: Single-pass algorithms for real-time change detection without loading entire datasets into memory
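The single-pass idea can be illustrated by diffing consecutive snapshots while holding only two of them in memory. This is a sketch of the technique, not the project's `StreamingOfferAnalyzer`; the offer-identity fields are assumptions.

```python
def diff_snapshots(prev, curr):
    """One step of single-pass change detection between consecutive snapshots.

    Each argument maps an offer identity (e.g. a (counterparty, ordertype,
    oid) tuple -- field names assumed here) to its mutable attributes such
    as (minsize, maxsize). Only two snapshots are ever held in memory, so
    the full dataset never needs to be loaded at once.
    """
    appeared = [k for k in curr if k not in prev]
    disappeared = [k for k in prev if k not in curr]
    modified = [k for k in curr if k in prev and curr[k] != prev[k]]
    return appeared, disappeared, modified
```

A driver would fold this step over the snapshot sequence, emitting change events as it goes.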
## Common Workflows

**Loading and Processing Data**:

```python
from src.src.preprocessing.dataframe import load_snapshots_to_dataframe

df = load_snapshots_to_dataframe(filepaths, min_interval_minutes=60)
```

**Analyzing Offer Changes**:

```python
from src.src.analysis import StreamingOfferAnalyzer

analyzer = StreamingOfferAnalyzer()
summary = analyzer.analyze_complete_dataset(df_offers)
```

**Data Persistence**:

```python
from src.src.preprocessing.dataframe import save_dataframe, load_dataframe

save_dataframe(df, 'processed_data.pkl')
df = load_dataframe('processed_data.pkl')
```

pyproject.toml

Lines changed: 5 additions & 1 deletion

```diff
@@ -9,10 +9,14 @@ readme = "README.md"
 python = "^3.11"
 matplotlib = "^3.9.2"
 jupyter = "^1.1.1"
-pandas = "^2.2.3"
+pandas = "^2.3.1"
 seaborn = "^0.13.2"
 pytest = "^8.3.5"
 setuptools = "^78.1.0"
+notebook = "^7.5.0"
+pandoc = "^2.4"
+pyarrow = "^21.0.0"
+psutil = "^6.1.1"
 
 
 [build-system]
```
Lines changed: 83 additions & 0 deletions

```python
#!/usr/bin/env python3
"""
Example script showing how to use the enhanced dataframe loading with:
- Progress tracking with ETA
- Memory monitoring
- Periodic checkpointing
- Resume from checkpoint
- Configurable snapshot frequency
"""

from src.src.preprocessing.utils import get_snapshot_filepaths
from src.src.preprocessing.dataframe import load_snapshots_to_dataframe, save_dataframe

# Configuration
DATA_DIR = "2025-jsons"  # Your data directory
OUTPUT_FILE = "dataframe_processed.pkl"
CHECKPOINT_FILE = "dataframe_checkpoint.pkl"

# Processing options
MIN_INTERVAL_MINUTES = 120  # Only process snapshots at least 120 minutes apart
CHECKPOINT_EVERY = 100  # Save a checkpoint every 100 files
MAX_FILES = None  # Process all files (set to a number to limit)


def main():
    print("=" * 70)
    print("Orderbook Snapshot Processing with Progress Tracking")
    print("=" * 70)
    print()

    # Get all snapshot filepaths
    print("📂 Scanning for snapshot files...")
    filepaths = get_snapshot_filepaths(DATA_DIR)
    print(f"   Found {len(filepaths)} snapshot files")
    print()

    # Configuration summary
    print("⚙️  Configuration:")
    print(f"   Min interval: {MIN_INTERVAL_MINUTES} minutes")
    print(f"   Checkpoint every: {CHECKPOINT_EVERY} files")
    print(f"   Checkpoint file: {CHECKPOINT_FILE}")
    print(f"   Output file: {OUTPUT_FILE}")
    print(f"   Max files: {MAX_FILES or 'All'}")
    print()

    # Process snapshots with all the enhanced features
    print("🚀 Starting processing...")
    print()

    df = load_snapshots_to_dataframe(
        filepaths=filepaths,
        min_interval_minutes=MIN_INTERVAL_MINUTES,
        max_files=MAX_FILES,
        checkpoint_every=CHECKPOINT_EVERY,
        checkpoint_path=CHECKPOINT_FILE,
        resume_from_checkpoint=True,  # Will resume if a checkpoint exists
    )

    # Save final result
    if len(df) > 0:
        print()
        print(f"💾 Saving final dataframe to {OUTPUT_FILE}...")
        save_dataframe(df, OUTPUT_FILE)
        print()
        print("=" * 70)
        print("✅ Processing complete!")
        print(f"   Total snapshots in dataframe: {len(df)}")
        print(f"   Date range: {df.index.min()} to {df.index.max()}")
        print(f"   Columns: {len(df.columns)}")
        print("=" * 70)
    else:
        print()
        print("⚠️  No data processed. Check your configuration.")


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\n\n⚠️  Processing interrupted by user")
        print("   Progress has been saved to the checkpoint file")
        print("   Run the script again to resume from the checkpoint")
    except Exception as e:
        print(f"\n\n❌ Error: {e}")
        raise
```