TL;DR: PostgreSQL full-text search with GIN indexing. Deployable via Docker or as a standalone Flask server. Sub-second results, Tor-compatible, Google-style operators.
Redd-Archiver uses PostgreSQL full-text search for lightning-fast, database-powered search capabilities:
Key Features:
- GIN Indexing: Instant lookup for large datasets
- Relevance Ranking: Intelligent result ordering with `ts_rank()`
- Highlighted Excerpts: Context with `ts_headline()`
- Advanced Filters: By subreddit, author, date, and score
- Concurrent Queries: Multiple simultaneous searches
- Constant Memory: Efficient for any dataset size
Why PostgreSQL FTS?
- Native PostgreSQL indexing (no separate search engine)
- Large-scale tested (hundreds of GB)
- Tor-compatible (server-side processing)
- Sub-second response times with GIN indexes
```
User Query
    ↓
Search Server (Flask)
    ↓
PostgreSQL FTS Engine
    ↓ ← GIN Index Lookup
    ↓ ← Relevance Ranking (ts_rank)
    ↓ ← Result Highlighting (ts_headline)
Results (JSON/HTML)
```
Advantages:
- Constant Memory: Streaming results from database
- Concurrent: Connection pooling handles multiple users
- Real-time: No separate indexing step required
- Scalable: Efficient for datasets of any size
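The constant-memory claim comes down to batch fetching: rather than loading an entire result set, the server pulls fixed-size batches from a database cursor and streams rows out as it goes. A minimal sketch of the pattern, using the standard DB-API `fetchmany()` call — the `DemoCursor` stand-in and function names here are illustrative, not Redd-Archiver's actual code:

```python
class DemoCursor:
    """Stand-in for a DB-API cursor, for demonstration only."""
    def __init__(self, rows):
        self._rows = rows

    def fetchmany(self, size):
        # Hand back the next `size` rows, like a real cursor would.
        batch, self._rows = self._rows[:size], self._rows[size:]
        return batch


def stream_results(cursor, batch_size=1000):
    """Yield rows one at a time, fetching fixed-size batches from the
    cursor, so memory use stays constant however large the result is."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield from batch
```

With a real driver such as psycopg2, a named (server-side) cursor gives the same effect: rows stay on the server until each batch is requested.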
```bash
# Search for a term
python postgres_search.py "machine learning"

# Search in a specific subreddit
python postgres_search.py "privacy" --subreddit technology

# Limit results
python postgres_search.py "python" --limit 50

# Search with score filter
python postgres_search.py "data science" --min-score 100 --limit 20

# Search by author
python postgres_search.py "tutorial" --author specific_user

# Search with date range
python postgres_search.py "announcement" --after 2024-01-01 --before 2024-12-31

# Combine filters
python postgres_search.py "machine learning" \
    --subreddit MachineLearning \
    --min-score 50 \
    --limit 100
```

```bash
# JSON output (default)
python postgres_search.py "query" --format json

# Pretty table output
python postgres_search.py "query" --format table

# Export to CSV
python postgres_search.py "query" --format csv > results.csv
```

```bash
# Start search server
docker compose up -d search-server

# Verify it's running
curl http://localhost:5000/health
# Expected: {"status":"healthy"}

# Access web interface
open http://localhost:5000
```

```bash
# Set database connection
export DATABASE_URL="postgresql://user:pass@localhost:5432/reddarchiver"

# Start search server
python search_server.py

# Server starts on port 5000
# Access at http://localhost:5000
```

Health Check:

```
GET /health
# Returns: {"status":"healthy"}
```

Search:

```
GET /search?q=query&subreddit=optional&limit=50
```

```bash
# Examples:
curl "http://localhost:5000/search?q=python&limit=10"
curl "http://localhost:5000/search?q=machine+learning&subreddit=MachineLearning"
curl "http://localhost:5000/search?q=privacy&min_score=100"
```

Features:
- RESTful JSON API
- Real-time search with PostgreSQL FTS
- Rate limiting (100 requests/minute)
- CSRF protection
- Result highlighting with `ts_headline()`
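The 100-requests-per-minute limit above is a classic sliding-window counter. A self-contained sketch of how such a limiter can work — the class name and structure are illustrative, not the project's actual implementation (which may use a library such as Flask-Limiter):

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client in any rolling window."""

    def __init__(self, max_requests=100, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # client id -> request timestamps

    def allow(self, client, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[client]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False  # over the limit: reject
        q.append(now)
        return True
```

A sliding window avoids the burst-at-the-boundary problem of fixed per-minute buckets: a client can never exceed the cap in any 60-second span.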
Redd-Archiver supports Google-style search operators:
| Operator | Syntax | Example | Description |
|---|---|---|---|
| Exact Phrase | `"phrase"` | `"machine learning"` | Match exact phrase |
| Boolean OR | `word1 OR word2` | `python OR javascript` | Match either term |
| Exclude | `-term` | `python -beginner` | Exclude term |
| Subreddit Filter | `sub:name` | `sub:technology` | Search in specific subreddit |
| Author Filter | `author:name` | `author:username` | Search by author |
| Score Filter | `score:N` | `score:100` | Minimum score |
| Type Filter | `type:post\|comment` | `type:post` | Result type |
| Sort | `sort:score\|date` | `sort:score` | Sort order |
Find high-scoring posts about machine learning:
`machine learning score:100 type:post`

Search excluding certain terms:
`python -javascript -ruby`

Search a specific subreddit:
`"data science" sub:MachineLearning`

Combine multiple filters:
`python tutorial score:50 type:post author:specific_user`

Boolean search:
`(python OR javascript) tutorial -beginner`
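Queries like those above get split into plain terms, phrases, exclusions, and filter key/value pairs before anything reaches PostgreSQL. A minimal sketch of such a tokenizer — illustrative only; Redd-Archiver's real parser (including its OR and parenthesis handling) may differ:

```python
import shlex

FILTER_KEYS = {"sub", "author", "score", "type", "sort"}


def parse_query(q):
    """Split a Google-style query string into phrases, exclusions,
    filters, and plain terms."""
    parsed = {"terms": [], "phrases": [], "exclude": [], "filters": {}}
    for tok in shlex.split(q, posix=False):  # posix=False keeps the quotes
        if len(tok) > 1 and tok.startswith('"') and tok.endswith('"'):
            parsed["phrases"].append(tok.strip('"'))
        elif tok.startswith("-"):
            parsed["exclude"].append(tok[1:])
        elif ":" in tok and tok.split(":", 1)[0] in FILTER_KEYS:
            key, val = tok.split(":", 1)
            parsed["filters"][key] = val
        elif tok != "OR":  # boolean OR is left to the FTS layer in this sketch
            parsed["terms"].append(tok)
    return parsed
```

For example, `parse_query('machine learning score:100 type:post')` yields terms `["machine", "learning"]` and filters `{"score": "100", "type": "post"}`; the filters become SQL `WHERE` clauses while terms and phrases feed the `tsquery`.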
Typical search performance with GIN indexes:
| Archive Size | Query Type | Response Time |
|---|---|---|
| <100K posts | Simple | <100ms |
| 100K-1M posts | Simple | <200ms |
| >1M posts | Simple | <500ms |
| Any size | Complex multi-term | <1 second |
Factors Affecting Speed:
- Query complexity (number of terms)
- Result set size
- Index quality (run `VACUUM ANALYZE` periodically)
- Hardware (CPU, RAM, SSD vs HDD)
1. Regular Index Maintenance:

```sql
-- Update statistics
VACUUM ANALYZE posts;
VACUUM ANALYZE comments;

-- Rebuild indexes if necessary
REINDEX TABLE posts;
REINDEX TABLE comments;
```

2. Tune PostgreSQL Settings:

```sql
-- Increase work_mem for complex queries
SET work_mem = '256MB';

-- Adjust effective_cache_size (~75% of RAM)
SET effective_cache_size = '6GB';
```

3. Use Field Selection (API):

```bash
# Only fetch needed fields (faster)
curl "http://localhost:5000/search?q=python&fields=id,title,score"
```

4. Limit Result Size:

```bash
# Smaller limits = faster responses
curl "http://localhost:5000/search?q=python&limit=25"
```

PostgreSQL `ts_rank()` calculates relevance scores from:
- Term frequency in document
- Term position (title weighted higher)
- Document length normalization
Results are automatically sorted by relevance.
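To make those three factors concrete, here is a toy scorer combining term frequency, a higher weight for title matches, and document-length normalization. This is an illustrative approximation only, not PostgreSQL's actual `ts_rank()` formula, and all names are hypothetical:

```python
def toy_rank(query_terms, title, body, title_weight=1.0, body_weight=0.4):
    """Toy relevance score: term frequency, weighted by field
    (title counts more than body), normalized by document length."""
    title_words = title.lower().split()
    body_words = body.lower().split()
    score = 0.0
    for term in query_terms:
        t = term.lower()
        score += title_weight * title_words.count(t)  # title matches weigh more
        score += body_weight * body_words.count(t)
    length = len(title_words) + len(body_words) or 1
    return score / length  # longer documents don't win by volume alone
```

Given documents of equal length, one with the query term in its title outranks one with the term only in the body — the same intuition behind PostgreSQL's `setweight()` labels A through D.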
`ts_headline()` shows matching content in context:

```json
{
  "title": "Introduction to **Machine Learning**",
  "excerpt": "This post covers **machine learning** basics including..."
}
```

Matching terms are highlighted (default: `**bold**` in JSON, `<mark>` in HTML).
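The highlighting step itself is simple to picture: wrap each matched term in markers and trim to an excerpt. A toy version — PostgreSQL's `ts_headline()` is far more sophisticated (stemming-aware matching, fragment selection), so treat this as illustration only:

```python
import re


def toy_headline(text, terms, start="**", stop="**", max_words=12):
    """Wrap matched terms in markers and trim to a short excerpt."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    highlighted = pattern.sub(lambda m: f"{start}{m.group(0)}{stop}", text)
    words = highlighted.split()
    excerpt = " ".join(words[:max_words])
    return excerpt + ("..." if len(words) > max_words else "")
```

Swapping `start`/`stop` for `"<mark>"`/`"</mark>"` gives the HTML variant mentioned above.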
Connection pooling supports multiple simultaneous searches:
- Default: 8 connections
- Configurable via `REDDARCHIVER_MAX_DB_CONNECTIONS`
- Each query gets its own connection from the pool
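Conceptually, such a pool is a bounded queue of reusable connections: queries block when all connections are checked out. A minimal sketch reading the environment variable above — the class itself is illustrative (a real deployment would use something like `psycopg2.pool`):

```python
import os
import queue


class SimplePool:
    """Bounded pool: pre-open `size` connections and hand them out."""

    def __init__(self, connect, size=None):
        # Default pool size comes from the documented env var (default 8).
        size = size or int(os.environ.get("REDDARCHIVER_MAX_DB_CONNECTIONS", "8"))
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect())

    def acquire(self, timeout=5):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)
```

Because each query acquires its own connection and releases it afterward, concurrent searches never share a connection mid-query.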
Search is fully integrated with the REST API:

Search Endpoint:

```
GET /api/v1/search?q=query&subreddit=optional&limit=50
```

Search with Export:

```
# Export results to CSV
GET /api/v1/search?q=python&format=csv

# Export to NDJSON
GET /api/v1/search?q=python&format=ndjson
```

Search Explain (debugging):

```
GET /api/v1/search/explain?q=python+tutorial
# Returns: Parsed query structure and operators used
```

See API.md for complete API documentation.
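From a client's side, calling these endpoints is just URL construction plus an HTTP GET. A small helper sketch — `build_search_url` and its keyword arguments are hypothetical convenience code, not part of the project; the path and `q` parameter mirror the endpoint documented above:

```python
from urllib.parse import urlencode


def build_search_url(base, q, **filters):
    """Build a search API URL, dropping any filters left as None."""
    params = {"q": q, **{k: v for k, v in filters.items() if v is not None}}
    return f"{base}/api/v1/search?{urlencode(params)}"
```

For example, `build_search_url("http://localhost:5000", "machine learning", limit=10)` produces a correctly percent-encoded query string, which matters for multi-word queries.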
Problem: Search takes >2 seconds

Solutions:

- Check that the indexes exist:

  ```sql
  SELECT indexname FROM pg_indexes WHERE tablename = 'posts';
  -- Should show: posts_search_vector_idx (GIN)
  ```

- Update statistics:

  ```sql
  VACUUM ANALYZE posts;
  VACUUM ANALYZE comments;
  ```

- Check query complexity:

  ```
  # Use explain to debug
  GET /api/v1/search/explain?q=your_complex_query
  ```

Problem: Search returns no results but content exists

Solutions:

- Check that FTS vectors were generated:

  ```sql
  SELECT COUNT(*) FROM posts WHERE search_vector IS NOT NULL;
  ```

- Test a direct FTS query:

  ```sql
  SELECT title FROM posts
  WHERE search_vector @@ to_tsquery('english', 'python')
  LIMIT 5;
  ```

- Rebuild search vectors:

  ```sql
  UPDATE posts SET search_vector =
      setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
      setweight(to_tsvector('english', coalesce(selftext, '')), 'B');
  ```

Problem: "Address already in use" error

Solution:

```bash
# Check what's using port 5000
sudo lsof -i :5000

# Kill the process or use a different port
export FLASK_RUN_PORT=5001
python search_server.py
```

- API.md - REST API search endpoints
- PERFORMANCE.md - Performance tuning
- QUICKSTART.md - Deploy search server
- FAQ - Common search questions
Last Updated: 2026-01-26