Commit cd3f9e6

nov data

2 parents c9fcb0a + f6228e1

20 files changed: +198481 −132525 lines

.env.example

Lines changed: 18 additions & 0 deletions
# Cloudflare R2 Storage Configuration
# See R2_SETUP_GUIDE.md for instructions on obtaining these credentials

# Your Cloudflare R2 Account ID (32-character hex string)
# Find in: R2 Dashboard → Right Sidebar → Account ID
R2_ACCOUNT_ID=your_account_id_here

# R2 API Access Key ID (20-character alphanumeric string)
# Created in: R2 Dashboard → Manage R2 API Tokens
R2_ACCESS_KEY_ID=your_access_key_id_here

# R2 API Secret Access Key (40-character alphanumeric string)
# IMPORTANT: Only shown once when creating API token
R2_SECRET_ACCESS_KEY=your_secret_access_key_here

# Your R2 Bucket Name
# The bucket where database backups are stored
R2_BUCKET_NAME=your_bucket_name_here

PIPELINE_GUIDE.md

Lines changed: 359 additions & 0 deletions
# Pipeline Automation Guide

This guide explains how to use the automated pipeline script (`run_pipeline.sh`) to set up and run the KIA Live Server data pipeline.

## Quick Start

```bash
# Run the complete pipeline
./run_pipeline.sh

# Run with options
./run_pipeline.sh --skip-download --no-copy
```

## Overview

The pipeline automates these steps from `steps.txt`:

1. **Download database backups from R2** - Fetches backup files from Cloudflare R2 storage
2. **Extract vehicle positions** - Converts SQLite data to CSV for ML training
3. **Train ML model** - Trains the universal prediction model
4. **Generate input files** - Fetches the latest data from the BMTC API and generates GTFS files

## Prerequisites

### 1. Install Poetry

```bash
curl -sSL https://install.python-poetry.org | python3 -
```

### 2. Install Dependencies

```bash
poetry install
```

### 3. Set Up R2 Credentials (for Step 1)

Create a `.env` file with your R2 credentials:

```bash
cp .env.example .env
# Edit .env and add your credentials
```

Required variables:

- `R2_ACCOUNT_ID`
- `R2_ACCESS_KEY_ID`
- `R2_SECRET_ACCESS_KEY`
- `R2_BUCKET_NAME`

See `R2_SETUP_GUIDE.md` for detailed setup instructions.

## Usage

### Run Complete Pipeline

```bash
./run_pipeline.sh
```

This will:

1. Download all database backups from R2
2. Extract vehicle positions to CSV
3. Train the universal prediction model
4. Generate input files from the BMTC API
5. Copy the generated files to the `in/` directory

### Command-Line Options

```bash
./run_pipeline.sh [OPTIONS]
```

Available options:

| Option | Description |
|--------|-------------|
| `--skip-download` | Skip downloading from R2 (use existing files in `db/`) |
| `--skip-extract` | Skip extracting vehicle positions (use existing CSV) |
| `--skip-train` | Skip training the model (use existing model in `models/`) |
| `--skip-generate` | Skip generating input files (use existing files) |
| `--no-copy` | Don't copy generated files to the `in/` directory |
| `--help` | Show the help message |

### Common Scenarios

#### First-Time Setup

```bash
# Complete pipeline with all steps
./run_pipeline.sh
```

#### Update Input Files Only

```bash
# Skip download, extract, and training - just regenerate input files
./run_pipeline.sh --skip-download --skip-extract --skip-train
```

#### Retrain Model with Existing Data

```bash
# Skip download and extract, but retrain the model and regenerate files
./run_pipeline.sh --skip-download --skip-extract
```

#### Test Generated Files Without Copying

```bash
# Generate files but don't copy them to the in/ directory
./run_pipeline.sh --skip-download --skip-extract --skip-train --no-copy
```

#### Use Local Database Files

If you already have database files in `db/`:

```bash
# Skip the R2 download
./run_pipeline.sh --skip-download
```

## Pipeline Steps Detail

### Step 1: Download from R2

**Script:** `download_from_r2.py`

Downloads database backup files from Cloudflare R2 storage to the `db/` directory.

**Options:**

```bash
# Download only the latest backup
poetry run python download_from_r2.py --latest-only

# Download files matching a pattern
poetry run python download_from_r2.py --pattern "database-2025*.db"
```

**Skip if:**

- You already have database files in `db/`
- You're using a local development database

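Conceptually, this step talks to R2 through its S3-compatible API. The sketch below is illustrative, not the actual `download_from_r2.py` implementation: it assumes `boto3` as the client library and a flat copy of every object in the bucket into `db/`. The endpoint format follows Cloudflare's documented `<account-id>.r2.cloudflarestorage.com` scheme.

```python
import os


def r2_endpoint(account_id: str) -> str:
    """Build the S3-compatible endpoint URL for a Cloudflare R2 account."""
    return f"https://{account_id}.r2.cloudflarestorage.com"


def download_backups(dest_dir: str = "db") -> None:
    """Download every object in the configured R2 bucket into dest_dir."""
    import boto3  # deferred import so the endpoint helper stays dependency-free

    client = boto3.client(
        "s3",
        endpoint_url=r2_endpoint(os.environ["R2_ACCOUNT_ID"]),
        aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
    )
    bucket = os.environ["R2_BUCKET_NAME"]
    os.makedirs(dest_dir, exist_ok=True)
    # Paginate so buckets with more than 1000 backups are handled too
    for page in client.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            target = os.path.join(dest_dir, os.path.basename(obj["Key"]))
            client.download_file(bucket, obj["Key"], target)


# download_backups()  # requires the R2 credentials from .env in the environment
```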
### Step 2: Extract Vehicle Positions

**Script:** `extract_vehicle_positions.py`

Extracts vehicle position data from the SQLite database(s) to CSV format for ML training.

**Output:** `db/vehicle_positions.csv`

**Skip if:**

- You already have `db/vehicle_positions.csv`
- You're not retraining the model

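At its core this step is a SQLite-to-CSV dump. A minimal stdlib sketch of the idea (not the actual `extract_vehicle_positions.py`; the `vehicle_positions` table name and its columns are assumptions here):

```python
import csv
import sqlite3


def extract_positions(db_path: str, csv_path: str,
                      table: str = "vehicle_positions") -> int:
    """Dump every row of `table` to a CSV file; return the row count."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(f"SELECT * FROM {table}")  # table name is trusted input
        headers = [col[0] for col in cur.description]
        rows = cur.fetchall()
        with open(csv_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(headers)  # header row first, then the data
            writer.writerows(rows)
        return len(rows)
    finally:
        conn.close()


# extract_positions("db/database.db", "db/vehicle_positions.csv")
```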
### Step 3: Train Model

**Command:**

```bash
poetry run python -m src.model.cli train-universal \
  --vehicle-positions db/vehicle_positions.csv \
  --stops-data in/client_stops.json \
  --model-dir models
```

Trains the universal stop-to-stop prediction model using historical vehicle position data.

**Output:** Model files in the `models/` directory

**Skip if:**

- You already have a trained model in `models/`
- You're just updating input files without retraining

**Note:** Requires `in/client_stops.json` to exist. If it doesn't, you may need to:

1. Run step 4 first with existing data, OR
2. Manually create/copy `client_stops.json` from a previous run

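The directory structure later in this guide shows the trained artifact as `models/universal_model.pkl`, which suggests a pickled model. A hedged loading sketch, assuming pickle serialization (the helper and the exact format are illustrative, not the project's actual API):

```python
import pickle
from pathlib import Path


def load_model(path: str = "models/universal_model.pkl"):
    """Load a pickled model from disk, failing loudly if training hasn't run."""
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(
            f"No trained model at {p}; run the training step first"
        )
    with p.open("rb") as f:
        return pickle.load(f)


# model = load_model()  # after ./run_pipeline.sh has completed step 3
```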
### Step 4: Generate Input Files

**Script:** `generate_in_files.py`

Fetches the latest data from the BMTC API and generates input files for the GTFS feeds.

**Flags used:**

- `-s` - Generate client_stops.json with stop information
- `-r` - Generate routelines.json with route polylines
- `-t` - Generate timings.tsv with schedule data from the API
- `-tdb` - Generate times.json with ML predictions (requires a trained model)
- `-c` - Copy generated files to the `in/` directory (unless `--no-copy` is set)

**Output Files (in `generated_in/`):**

- `route_children_ids.json` - Route ID mappings
- `route_parent_ids.json` - Parent route ID mappings
- `client_stops.json` - Stop locations and information
- `routelines.json` - Encoded route polylines
- `timings.tsv` - Schedule timings
- `times.json` - ML-predicted stop-by-stop times

**Skip if:**

- You don't need to refresh data from the BMTC API
- You're only retraining the model

## Directory Structure

```
.
├── db/                        # Database files
│   ├── database.db            # Current/downloaded database
│   └── vehicle_positions.csv  # Extracted vehicle positions
├── generated_in/              # Generated input files (staging)
│   ├── route_children_ids.json
│   ├── route_parent_ids.json
│   ├── client_stops.json
│   ├── routelines.json
│   ├── timings.tsv
│   └── times.json
├── in/                        # Production input files
│   ├── routes_children_ids.json
│   ├── routes_parent_ids.json
│   ├── client_stops.json
│   ├── routelines.json
│   ├── times.json
│   └── helpers/
│       ├── construct_stops/
│       │   └── client_stops.json
│       └── construct_timings/
│           └── timings.tsv
└── models/                    # Trained ML models
    └── universal_model.pkl
```

## Troubleshooting

### Error: "pyproject.toml not found"

Make sure you're running the script from the project root directory:

```bash
cd /path/to/kia-live-serverside
./run_pipeline.sh
```

### Error: "Poetry is not installed"

Install Poetry first:

```bash
curl -sSL https://install.python-poetry.org | python3 -
```

### Error: "Missing required environment variables"

Set up your `.env` file with R2 credentials. See `R2_SETUP_GUIDE.md`.
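A quick way to see exactly which variables still need filling in is a small check like the following (illustrative only; `run_pipeline.sh` may report this error in its own words):

```python
import os

# The four variables required by the R2 download step (from .env.example)
REQUIRED_VARS = (
    "R2_ACCOUNT_ID",
    "R2_ACCESS_KEY_ID",
    "R2_SECRET_ACCESS_KEY",
    "R2_BUCKET_NAME",
)


def missing_vars(env=os.environ):
    """Return the names of required R2 variables that are unset or blank."""
    return sorted(name for name in REQUIRED_VARS
                  if not str(env.get(name) or "").strip())


# print(missing_vars())  # empty list once .env is fully populated
```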

### Error: "No database files found in db/"

Either:

1. Don't use `--skip-download` (let the script download from R2), OR
2. Manually place database files in the `db/` directory

### Error: "in/client_stops.json not found" (during model training)

This happens when training a model for the first time without existing input files. Solutions:

1. **Option A:** Run step 4 first with existing data:

   ```bash
   # Generate input files without model predictions
   poetry run python generate_in_files.py -s -r -t

   # Then run the full pipeline
   ./run_pipeline.sh --skip-download --skip-extract
   ```

2. **Option B:** Copy from a previous run:

   ```bash
   cp /path/to/old/client_stops.json in/client_stops.json
   ```

### Error: "Model training failed"

Check that:

- `db/vehicle_positions.csv` exists and has data
- `in/client_stops.json` exists
- You have enough disk space
- Dependencies are installed: `poetry install`

### Pipeline stops midway

The script uses `set -e`, so it stops on the first error. Check the error message, then:

1. Fix the issue
2. Re-run with the appropriate `--skip-*` flags to resume from where it failed

## Performance Tips

### Speed Up Model Training

- Use `--skip-download` and `--skip-extract` if the data hasn't changed
- Model training time depends on the CSV size (it can take 5-30 minutes)

### Speed Up File Generation

- API calls are rate-limited, so expect 5-10 minutes for a full generation
- Files are staged in `generated_in/` - review them before copying to `in/`

### Disk Space

Typical space requirements:

- Database backups: ~30-100 MB each
- `vehicle_positions.csv`: ~50-500 MB (depends on the time range)
- Models: ~10-50 MB
- Generated files: ~1-5 MB total

## Next Steps After Pipeline

Once the pipeline completes successfully:

1. **Review the generated files:**

   ```bash
   ls -lh generated_in/
   ```

2. **Start the server:**

   ```bash
   poetry run python -m src.main
   ```

3. **Access the endpoints:**

   - Static GTFS: `http://localhost:59966/gtfs.zip`
   - Real-time GTFS-RT: `http://localhost:59966/gtfs-rt.proto`
   - WebSocket stream: `ws://localhost:59966/ws/gtfs-rt`

4. **Monitor the logs** for any issues
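The two HTTP endpoints above can be smoke-tested from Python's standard library once the server is running (the WebSocket stream needs a dedicated client, so it's left out). The `check` helper below is an illustrative sketch, not part of the project:

```python
from urllib.parse import urljoin
from urllib.request import urlopen

BASE = "http://localhost:59966"  # default port from this guide


def endpoint_urls(base: str = BASE) -> dict:
    """Map each HTTP feed name to its full URL on the server."""
    return {
        "static_gtfs": urljoin(base, "/gtfs.zip"),
        "realtime_gtfs_rt": urljoin(base, "/gtfs-rt.proto"),
    }


def check(url: str, timeout: float = 10.0) -> int:
    """Fetch a URL and return the HTTP status code (200 means healthy)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.status


# for name, url in endpoint_urls().items():  # requires the server to be up
#     print(name, check(url))
```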

## Automation

### Cron Job for Regular Updates

Update the input files daily at 3 AM:

```bash
# Edit the crontab
crontab -e

# Add this line (adjust the path)
0 3 * * * cd /path/to/kia-live-serverside && ./run_pipeline.sh --skip-download --skip-extract --skip-train >> logs/pipeline.log 2>&1
```

### Systemd Service

For production deployment, consider creating a systemd service to run the server after the pipeline completes.
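A minimal unit file sketch for that idea. Everything here is hypothetical: the unit name, the Poetry path, the working directory, and the `kia` service user all need adjusting to your deployment.

```ini
# /etc/systemd/system/kia-live-server.service  (hypothetical name and paths)
[Unit]
Description=KIA Live Server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
WorkingDirectory=/path/to/kia-live-serverside
ExecStart=/usr/local/bin/poetry run python -m src.main
Restart=on-failure
User=kia

[Install]
WantedBy=multi-user.target
```

After installing the unit, enable it with `systemctl daemon-reload && systemctl enable --now kia-live-server`.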

## See Also

- `steps.txt` - Original manual steps
- `R2_SETUP_GUIDE.md` - R2 credentials setup
- `CLAUDE.md` - Project architecture and development guide
- `DATABASE_LOCK_ROOT_CAUSE_ANALYSIS.md` - Database optimization details
