Home | Docs | API | FAQ

Search Guide

TL;DR: PostgreSQL full-text search with GIN indexing. Requires Docker deployment. Sub-second results, Tor-compatible, Google-style operators.

Overview

Redd-Archiver uses PostgreSQL full-text search for lightning-fast, database-powered search capabilities:

Key Features:

GIN Indexing: Instant lookup for large datasets
Relevance Ranking: Intelligent result ordering with ts_rank()
Highlighted Excerpts: Context with ts_headline()
Advanced Filters: By subreddit, author, date, score
Concurrent Queries: Multiple simultaneous searches
Constant Memory: Efficient for any dataset size

Why PostgreSQL FTS?

Native PostgreSQL indexing (no separate search engine)
Large-scale tested (hundreds of GB)
Tor-compatible (server-side processing)
Sub-second response times with GIN indexes

Search Architecture

Database-Powered Search

User Query
    ↓
Search Server (Flask)
    ↓
PostgreSQL FTS Engine
    ↓  ← GIN Index Lookup
    ↓  ← Relevance Ranking (ts_rank)
    ↓  ← Result Highlighting (ts_headline)
Results (JSON/HTML)

Advantages:

Constant Memory: Streaming results from database
Concurrent: Connection pooling handles multiple users
Real-time: No separate indexing step required
Scalable: Efficient for datasets of any size

Command-Line Interface

Basic Search

# Search for a term
python postgres_search.py "machine learning"

# Search in specific subreddit
python postgres_search.py "privacy" --subreddit technology

# Limit results
python postgres_search.py "python" --limit 50

Advanced Search

# Search with score filter
python postgres_search.py "data science" --min-score 100 --limit 20

# Search by author
python postgres_search.py "tutorial" --author specific_user

# Search with date range
python postgres_search.py "announcement" --after 2024-01-01 --before 2024-12-31

# Combine filters
python postgres_search.py "machine learning" \
  --subreddit MachineLearning \
  --min-score 50 \
  --limit 100

Output Formats

# JSON output (default)
python postgres_search.py "query" --format json

# Pretty table output
python postgres_search.py "query" --format table

# Export to CSV
python postgres_search.py "query" --format csv > results.csv

Web API Setup

Using Docker Compose (Recommended)

# Start search server
docker compose up -d search-server

# Verify it's running
curl http://localhost:5000/health
# Expected: {"status":"healthy"}

# Access web interface
open http://localhost:5000

Manual Setup

# Set database connection
export DATABASE_URL="postgresql://user:pass@localhost:5432/reddarchiver"

# Start search server
python search_server.py

# Server starts on port 5000
# Access at http://localhost:5000

Web API Endpoints

Health Check:

GET /health
# Returns: {"status":"healthy"}

Search:

GET /search?q=query&subreddit=optional&limit=50

# Examples:
curl "http://localhost:5000/search?q=python&limit=10"
curl "http://localhost:5000/search?q=machine+learning&subreddit=MachineLearning"
curl "http://localhost:5000/search?q=privacy&min_score=100"

Features:

RESTful JSON API
Real-time search with PostgreSQL FTS
Rate limiting (100 requests/minute)
CSRF protection
Result highlighting with ts_headline()

Search Operators

Google-Style Search Syntax

Redd-Archiver supports Google-style search operators:

Operator	Syntax	Example	Description
Exact Phrase	`"phrase"`	`"machine learning"`	Match exact phrase
Boolean OR	`word1 OR word2`	`python OR javascript`	Match either term
Exclude	`-term`	`python -beginner`	Exclude term
Subreddit Filter	`sub:name`	`sub:technology`	Search in specific subreddit
Author Filter	`author:name`	`author:username`	Search by author
Score Filter	`score:N`	`score:100`	Minimum score
Type Filter	`type:post\|comment`	`type:post`	Result type
Sort	`sort:score\|date`	`sort:score`	Sort order

Examples

Find high-scoring posts about machine learning:

machine learning score:100 type:post

Search excluding certain terms:

python -javascript -ruby

Search specific subreddit:

"data science" sub:MachineLearning

Combine multiple filters:

python tutorial score:50 type:post author:specific_user

Boolean search:

(python OR javascript) tutorial -beginner

Search Performance

Response Times

Typical search performance with GIN indexes:

Archive Size	Query Type	Response Time
<100K posts	Simple	<100ms
100K-1M posts	Simple	<200ms
>1M posts	Simple	<500ms
Any size	Complex multi-term	<1 second

Factors Affecting Speed:

Query complexity (number of terms)
Result set size
Index quality (run VACUUM ANALYZE periodically)
Hardware (CPU, RAM, SSD vs HDD)

Optimization Tips

1. Regular Index Maintenance:

-- Update statistics
VACUUM ANALYZE posts;
VACUUM ANALYZE comments;

-- Rebuild indexes if necessary
REINDEX TABLE posts;
REINDEX TABLE comments;

2. Tune PostgreSQL Settings:

-- Increase work_mem for complex queries
SET work_mem = '256MB';

-- Adjust effective_cache_size (75% of RAM)
SET effective_cache_size = '6GB';

3. Use Field Selection (API):

# Only fetch needed fields (faster)
curl "http://localhost:5000/search?q=python&fields=id,title,score"

4. Limit Result Size:

# Smaller limits = faster responses
curl "http://localhost:5000/search?q=python&limit=25"

Search Features

Relevance Ranking

PostgreSQL ts_rank() calculates relevance scores:

Term frequency in document
Term position (title weighted higher)
Document length normalization

Results automatically sorted by relevance.

Result Highlighting

ts_headline() shows matching content in context:

{
  "title": "Introduction to **Machine Learning**",
  "excerpt": "This post covers **machine learning** basics including..."
}

Matching terms are highlighted (default: **bold** in JSON, <mark> in HTML).

Concurrent Queries

Connection pooling supports multiple simultaneous searches:

Default: 8 connections
Configure via REDDARCHIVER_MAX_DB_CONNECTIONS
Each query gets its own connection from pool

Integration with REST API

Search is fully integrated with the REST API:

Search Endpoint:

GET /api/v1/search?q=query&subreddit=optional&limit=50

Search with Export:

# Export results to CSV
GET /api/v1/search?q=python&format=csv

# Export to NDJSON
GET /api/v1/search?q=python&format=ndjson

Search Explain (debugging):

GET /api/v1/search/explain?q=python+tutorial
# Returns: Parsed query structure and operators used

See API.md for complete API documentation.

Troubleshooting

Slow Search Queries

Problem: Search takes >2 seconds

Solutions:

Check indexes exist:

SELECT indexname FROM pg_indexes WHERE tablename = 'posts';
-- Should show: posts_search_vector_idx (GIN)

Update statistics:

VACUUM ANALYZE posts;
VACUUM ANALYZE comments;

Check query complexity:

# Use explain to debug
GET /api/v1/search/explain?q=your_complex_query

No Search Results

Problem: Search returns no results but content exists

Solutions:

Check FTS vectors generated:

SELECT COUNT(*) FROM posts WHERE search_vector IS NOT NULL;

Test direct FTS query:

SELECT title FROM posts 
WHERE search_vector @@ to_tsquery('english', 'python')
LIMIT 5;

Rebuild search vectors:

UPDATE posts SET search_vector = 
  setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
  setweight(to_tsvector('english', coalesce(selftext, '')), 'B');

Search Server Won't Start

Problem: Address already in use error

Solution:

# Check what's using port 5000
sudo lsof -i :5000

# Kill process or use different port
export FLASK_RUN_PORT=5001
python search_server.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Search Guide

Overview

Search Architecture

Database-Powered Search

Command-Line Interface

Basic Search

Advanced Search

Output Formats

Web API Setup

Using Docker Compose (Recommended)

Manual Setup

Web API Endpoints

Search Operators

Google-Style Search Syntax

Examples

Search Performance

Response Times

Optimization Tips

Search Features

Relevance Ranking

Result Highlighting

Concurrent Queries

Integration with REST API

Troubleshooting

Slow Search Queries

No Search Results

Search Server Won't Start

See Also

FilesExpand file tree

SEARCH.md

Latest commit

History

SEARCH.md

File metadata and controls

Search Guide

Overview

Search Architecture

Database-Powered Search

Command-Line Interface

Basic Search

Advanced Search

Output Formats

Web API Setup

Using Docker Compose (Recommended)

Manual Setup

Web API Endpoints

Search Operators

Google-Style Search Syntax

Examples

Search Performance

Response Times

Optimization Tips

Search Features

Relevance Ranking

Result Highlighting

Concurrent Queries

Integration with REST API

Troubleshooting

Slow Search Queries

No Search Results

Search Server Won't Start

See Also