The crypto-data-aggregator includes a comprehensive historical data archive system that collects, enriches, and indexes crypto news along with market context.

```bash
# Run archive collection once
npm run archive

# Run continuously (hourly)
npm run archive:watch

# Run as background daemon
npm run archive:daemon

# Check archive status
npm run archive:status

# Stop background daemon
npm run archive:stop
```

V1 archive:

- Format: JSON files per day
- Path: `archive/YYYY/MM/DD.json`
- Content: Raw article data with basic metadata
- Use case: Simple historical lookups

V2 enhanced archive:

- Format: JSONL files per month (more efficient for large datasets)
- Path: `archive/v2/articles/YYYY-MM.jsonl`
- Content: Enriched articles with sentiment, entities, market context
- Features:
  - Market data snapshots
  - On-chain event correlation
  - Social signal tracking
  - Prediction market data
  - Story clustering
  - Source reliability scoring
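
The two path layouts above can be derived from an article's publication date. The following is a minimal sketch, not the project's own code: the helper names `v1DailyPath` and `v2ArticlesPath` are hypothetical, and bucketing by UTC date is an assumption.

```typescript
// Hypothetical helpers. Only the path layouts come from the lists above;
// the function names and the use of UTC are assumptions.
const pad2 = (n: number): string => String(n).padStart(2, "0");

/** V1: one JSON file per day, e.g. archive/2026/01/08.json */
function v1DailyPath(date: Date, root = "archive"): string {
  const y = date.getUTCFullYear();
  const m = pad2(date.getUTCMonth() + 1); // getUTCMonth() is zero-based
  const d = pad2(date.getUTCDate());
  return `${root}/${y}/${m}/${d}.json`;
}

/** V2: one JSONL file per month, e.g. archive/v2/articles/2026-01.jsonl */
function v2ArticlesPath(date: Date, root = "archive"): string {
  const y = date.getUTCFullYear();
  const m = pad2(date.getUTCMonth() + 1);
  return `${root}/v2/articles/${y}-${m}.jsonl`;
}

const published = new Date("2026-01-08T14:30:00Z");
console.log(v1DailyPath(published));    // archive/2026/01/08.json
console.log(v2ArticlesPath(published)); // archive/v2/articles/2026-01.jsonl
```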

```
archive/
├── index.json              # V1 master index
├── 2026/                   # V1 daily archives
│   └── 01/
│       ├── 08.json
│       ├── 09.json
│       └── ...
└── v2/                     # V2 enhanced archive
    ├── articles/           # Enriched articles (JSONL)
    │   └── 2026-01.jsonl
    ├── market/             # Market data snapshots
    │   └── 2026-01.jsonl
    ├── onchain/            # On-chain events
    │   └── 2026-01.jsonl
    ├── social/             # Social signals
    │   └── 2026-01.jsonl
    ├── predictions/        # Prediction market data
    │   └── 2026-01.jsonl
    ├── snapshots/          # Hourly state snapshots
    │   └── 2026-01-12-21.json
    ├── index/              # Lookup indexes
    │   ├── by-date.json
    │   ├── by-source.json
    │   └── by-ticker.json
    └── meta/
        ├── stats.json      # Archive statistics
        ├── monthly/        # Monthly stats
        ├── schema.json     # Data schema
        └── runner.log      # Collection log
```
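
Most of the V2 files in this tree are JSONL, meaning one JSON document per line. As a rough sketch of how such a file's contents could be read once loaded into a string, the hypothetical `parseJsonl` helper below splits on newlines and skips blank lines. The sample records and the `ArticleStub` type are illustrative only and are not real archive data.

```typescript
// Minimal JSONL reader: one JSON document per line, blank lines ignored.
// A malformed line makes JSON.parse throw; a real reader might log and skip it.
function parseJsonl<T = unknown>(text: string): T[] {
  const records: T[] = [];
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    records.push(JSON.parse(trimmed) as T);
  }
  return records;
}

// Illustrative stand-in for the contents of archive/v2/articles/YYYY-MM.jsonl.
interface ArticleStub {
  id: string;
  title: string;
  tickers: string[];
}

const sample = [
  '{"id":"a1","title":"Example A","tickers":["BTC"]}',
  '{"id":"a2","title":"Example B","tickers":["ETH","BTC"]}',
  "",
].join("\n");

const rows = parseJsonl<ArticleStub>(sample);
console.log(rows.length);               // 2
console.log(rows[1].tickers.join(",")); // ETH,BTC
```

Because each line is independent, a monthly file can be appended to without rewriting earlier records, which is presumably why V2 uses this format for large datasets.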

```bash
# Single collection run
npm run archive

# With custom API URL
API_URL=https://your-api.com npm run archive

# With custom interval (30 minutes)
npm run archive:watch -- --interval 30
```

| Variable | Default | Description |
|---|---|---|
| `API_URL` | `http://localhost:3000` | News API base URL |
| `ARCHIVE_DIR` | `./archive` | Archive storage directory |
| `ARCHIVE_INTERVAL` | `60` | Minutes between collections |
| `FEATURE_MARKET` | `true` | Collect market data |
| `FEATURE_ONCHAIN` | `true` | Collect on-chain events |
| `FEATURE_SOCIAL` | `true` | Collect social signals |
| `FEATURE_PREDICTIONS` | `true` | Collect prediction markets |
| `FEATURE_CLUSTERING` | `true` | Enable story clustering |
| `FEATURE_RELIABILITY` | `true` | Track source reliability |
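
To make the defaults in the table concrete, here is a sketch of how these variables might be resolved into a typed configuration object. It is not the runner's actual parsing logic: `loadArchiveConfig` and `ArchiveConfig` are hypothetical names, and the rule that only the literal string `false` disables a feature is an assumption.

```typescript
const FEATURE_NAMES = [
  "MARKET", "ONCHAIN", "SOCIAL", "PREDICTIONS", "CLUSTERING", "RELIABILITY",
] as const;
type FeatureName = (typeof FEATURE_NAMES)[number];

interface ArchiveConfig {
  apiUrl: string;
  archiveDir: string;
  intervalMinutes: number;
  features: Record<FeatureName, boolean>;
}

type Env = Record<string, string | undefined>;

// Assumption: a feature is on unless its variable is exactly "false".
function loadArchiveConfig(env: Env): ArchiveConfig {
  const interval = Number(env["ARCHIVE_INTERVAL"] ?? "60");
  const features = {} as Record<FeatureName, boolean>;
  for (const name of FEATURE_NAMES) {
    features[name] = env[`FEATURE_${name}`] !== "false";
  }
  return {
    apiUrl: env["API_URL"] ?? "http://localhost:3000",
    archiveDir: env["ARCHIVE_DIR"] ?? "./archive",
    // Fall back to the documented default when the value is not a positive number.
    intervalMinutes: Number.isFinite(interval) && interval > 0 ? interval : 60,
    features,
  };
}

const cfg = loadArchiveConfig({ FEATURE_MARKET: "false", ARCHIVE_INTERVAL: "30" });
console.log(cfg.intervalMinutes, cfg.features.MARKET, cfg.features.SOCIAL); // 30 false true
```

In a Node.js script the same function could be called as `loadArchiveConfig(process.env)`.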

Enable or disable expensive operations:

```bash
# Minimal collection (articles only)
FEATURE_MARKET=false FEATURE_ONCHAIN=false npm run archive

# Full collection with all features
npm run archive
```

The collection services are located in `scripts/archive/services/`:

| Service | Description |
|---|---|
| `market-data.js` | BTC/ETH prices, fear/greed index, DeFi TVL |
| `onchain-events.js` | Whale alerts, gas prices, token transfers |
| `social-signals.js` | Reddit sentiment, trending topics |
| `prediction-markets.js` | Polymarket, Manifold predictions |
| `story-clustering.js` | Groups related articles |
| `source-reliability.js` | Tracks source accuracy |
| `ai-training-export.js` | Exports data for ML training |
| `analytics-engine.js` | Trend analysis and metrics |

Each article is enriched with:

```typescript
interface EnrichedArticle {
  // Core fields
  id: string;
  title: string;
  description: string;
  url: string;
  source_key: string;
  pub_date: string;

  // Enrichment
  sentiment: {
    score: number;      // -1 to 1
    label: string;      // bullish/bearish/neutral
    confidence: number;
  };
  tickers: string[];    // ['BTC', 'ETH', ...]
  entities: {
    people: string[];
    organizations: string[];
    protocols: string[];
  };

  // Market context at publish time
  market_context: {
    btc_price: number;
    eth_price: number;
    fear_greed_index: number;
    total_market_cap: number;
  };

  // Metadata
  first_seen: string;
  last_updated: string;
  fetch_count: number;
}
```

V1 endpoints:

```
// Get archive index
GET /api/archive
Response: { dates: string[], total: number }

// Get articles by date
GET /api/archive?date=2026-01-08
Response: Article[]

// Search archive
GET /api/archive?query=bitcoin&from=2026-01-01&to=2026-01-15
Response: { articles: Article[], total: number }
```

V2 endpoints:

```
// Get V2 stats
GET /api/archive/v2/stats
Response: ArchiveV2Stats

// Get monthly articles
GET /api/archive/v2/month/2026-01
Response: EnrichedArticle[]

// Search by ticker
GET /api/archive/v2/search?ticker=BTC&limit=50
Response: { articles: EnrichedArticle[], total: number }

// Get snapshot
GET /api/archive/v2/snapshot/2026-01-12-21
Response: ArchiveSnapshot
```

Programmatic access:

```typescript
import {
  getArchiveV2Stats,
  getArchiveV2Month,
  queryArchiveV2
} from '@/lib/archive-v2';

// Get stats
const stats = await getArchiveV2Stats();
console.log(`Total articles: ${stats.total_articles}`);

// Get a month's articles
const articles = await getArchiveV2Month('2026-01');

// Search articles
const results = await queryArchiveV2({
  ticker: 'BTC',
  sentiment: 'bullish',
  fromDate: '2026-01-01',
  limit: 100,
});
```

Run the archiver as a background process:

```bash
# Start daemon
npm run archive:daemon

# Check status
npm run archive:status

# View logs
tail -f archive/v2/meta/runner.log

# Stop daemon
npm run archive:stop
```

```
$ npm run archive:status

📊 Archive Runner Status

🟢 Daemon running (PID: 12345)

📁 Archive Stats:
   Total articles: 293
   Sources: 6
   Tickers tracked: 32
   Last fetch: 2026-01-12T21:30:30.762Z

📈 Latest Market:
   BTC: $91,273
   ETH: $3,096
   Fear/Greed: 27

📜 Recent log entries:
   [2026-01-12T21:30:00] [INFO] Starting archive collection...
   [2026-01-12T21:30:30] [SUCCESS] Collection completed in 30.2s
```

- Runner log: `archive/v2/meta/runner.log`
- Stats: `archive/v2/meta/stats.json`
- Monthly stats: `archive/v2/meta/monthly/YYYY-MM.json`
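
For automated monitoring, the runner log can be parsed line by line. The sketch below assumes every entry follows the `[timestamp] [LEVEL] message` shape seen in the sample status output; whether the real log contains other shapes or levels is not documented here. `parseLogLine` and `lastSuccess` are hypothetical helper names.

```typescript
interface LogEntry {
  timestamp: string;
  level: string;
  message: string;
}

// Matches "[timestamp] [LEVEL] message", the shape shown in the sample output.
const LOG_LINE = /^\[([^\]]+)\]\s+\[([A-Z]+)\]\s+(.*)$/;

function parseLogLine(line: string): LogEntry | null {
  const m = LOG_LINE.exec(line.trim());
  if (!m) return null; // not a recognizable entry
  return { timestamp: m[1], level: m[2], message: m[3] };
}

/** Returns the most recent SUCCESS entry, or null if there is none. */
function lastSuccess(logText: string): LogEntry | null {
  const entries = logText
    .split(/\r?\n/)
    .map(parseLogLine)
    .filter((e): e is LogEntry => e !== null && e.level === "SUCCESS");
  return entries.length > 0 ? entries[entries.length - 1] : null;
}

// The two entries from the sample status output above.
const logText = [
  "[2026-01-12T21:30:00] [INFO] Starting archive collection...",
  "[2026-01-12T21:30:30] [SUCCESS] Collection completed in 30.2s",
].join("\n");

console.log(lastSuccess(logText)?.message); // Collection completed in 30.2s
```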

```bash
# Edit crontab
crontab -e

# Run every hour
0 * * * * cd /path/to/project && npm run archive >> /var/log/archive.log 2>&1

# Run every 30 minutes
*/30 * * * * cd /path/to/project && npm run archive
```

Create `/etc/systemd/system/crypto-archive.service`:

```ini
[Unit]
Description=Crypto News Archive Collector
After=network.target

[Service]
Type=simple
User=www-data
WorkingDirectory=/path/to/crypto-data-aggregator
ExecStart=/usr/bin/node scripts/archive-runner.js --watch
Restart=always
RestartSec=10
Environment=NODE_ENV=production
Environment=API_URL=https://your-api.com

[Install]
WantedBy=multi-user.target
```

```bash
# Enable and start
sudo systemctl enable crypto-archive
sudo systemctl start crypto-archive

# Check status
sudo systemctl status crypto-archive

# View logs
sudo journalctl -u crypto-archive -f
```

- Start with minimal features: Disable expensive operations until they are needed
- Use daemon mode for production: More reliable than manually managed cron jobs
- Monitor disk usage: JSONL files grow over time
- Back up regularly: Archive data is a valuable historical record
- Set appropriate intervals: Hourly is usually sufficient

```bash
# Check if API is running
curl http://localhost:3000/api/health

# Use production API
API_URL=https://free-crypto-news.vercel.app npm run archive

# Disable expensive features
FEATURE_CLUSTERING=false FEATURE_SOCIAL=false npm run archive

# Check last fetch time
npm run archive:status

# Force a fresh collection
npm run archive
```
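
The last-fetch check can also be automated. The following sketch compares a last-fetch timestamp, such as the `Last fetch` value printed by `npm run archive:status`, against the collection interval. The helper name `isArchiveStale` and the grace factor of 2 (tolerating one missed run before reporting staleness) are assumptions, not part of the project.

```typescript
/**
 * Returns true when the last fetch is older than intervalMinutes * graceFactor.
 * An unparseable timestamp is treated as stale.
 */
function isArchiveStale(
  lastFetchIso: string,
  intervalMinutes: number,
  now: Date = new Date(),
  graceFactor = 2, // assumption: tolerate one missed run
): boolean {
  const last = Date.parse(lastFetchIso);
  if (Number.isNaN(last)) return true;
  const ageMinutes = (now.getTime() - last) / 60000;
  return ageMinutes > intervalMinutes * graceFactor;
}

// Timestamp taken from the sample status output; the "now" values are illustrative.
const lastFetch = "2026-01-12T21:30:30.762Z";
console.log(isArchiveStale(lastFetch, 60, new Date("2026-01-12T22:00:00Z"))); // false
console.log(isArchiveStale(lastFetch, 60, new Date("2026-01-13T02:00:00Z"))); // true
```

When the check reports stale data, running `npm run archive` forces a fresh collection.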