A simple search engine implementation in Python, built for illustrative purposes to accompany this blog post.

Requires Python 3.10 or later, and uv.
Install dependencies:

```shell
uv sync
```

Run the full-text search from the command line. On first run, the Wikipedia dataset (~20 GB) will be downloaded from Hugging Face and cached automatically:
```shell
uv run python run.py
```

Run the semantic (vector) search:
```shell
uv run python run_semantic.py
```

On first run, this builds a vector index by embedding all 6.4M documents. Embeddings are checkpointed to `data/checkpoints/` so you can resume if interrupted. The finished index is saved to `data/vector_index.*` and memory-mapped on subsequent runs.
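For intuition about why memory-mapping matters here, below is a minimal, self-contained sketch. The file name, index layout, and `search` helper are invented for the example (the real `data/vector_index.*` format may differ): rows of a unit-normalized embedding matrix are mapped lazily from disk and queried by cosine similarity, so the full 6.4M-row matrix never has to be loaded into RAM at once.

```python
import os
import tempfile

import numpy as np

# Tiny stand-in index: 5 documents with 4-dim embeddings (illustrative only).
rng = np.random.default_rng(0)
vecs = rng.normal(size=(5, 4)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize rows

path = os.path.join(tempfile.mkdtemp(), "tiny_index.npy")
np.save(path, vecs)

# mmap_mode="r" maps the file instead of reading it into memory, so only
# the rows a query actually touches are paged in from disk.
index = np.load(path, mmap_mode="r")

def search(query_vec: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k most similar rows by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q  # dot product == cosine for unit-normalized rows
    return list(np.argsort(scores)[::-1][:k])

top = search(vecs[2])  # querying with document 2's own vector ranks it first
```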
To skip the multi-hour encoding step, download the pre-computed embeddings from Hugging Face, place the JSON and `.npy` files in `data/checkpoints/`, and run `uv run python run_semantic.py`.
If you'd like to download the dataset separately (e.g. before a demo):
```shell
uv run python download.py
```

To get higher download rate limits, set a Hugging Face token:
```shell
export HF_TOKEN=hf_...
```

Run from an interactive console:
```shell
uv run ipython
```

```python
In [1]: run run.py
In [2]: index.search('python programming language', rank=True)[:5]
```

Lint and type check:
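For a rough idea of what a ranked full-text query does under the hood, here is a toy TF-IDF-style scorer. It is purely illustrative — the corpus, `idf` smoothing, and `search` function below are invented for the sketch, and the repo's actual scoring in `search/` may differ:

```python
import math
from collections import Counter

# Toy corpus standing in for the Wikipedia dataset.
docs = [
    "python is a programming language",
    "the python snake is a reptile",
    "rust is a systems programming language",
]
tokenized = [d.split() for d in docs]
N = len(docs)

def idf(term: str) -> float:
    """Smoothed inverse document frequency: rarer terms weigh more."""
    df = sum(term in doc for doc in tokenized)
    return math.log((N + 1) / (df + 1)) + 1

def score(query: str, doc: list[str]) -> float:
    """Sum of term-frequency * IDF over the query terms."""
    tf = Counter(doc)
    return sum(tf[t] * idf(t) for t in query.split())

def search(query: str, k: int = 5) -> list[int]:
    """Return document indices sorted by descending score."""
    ranked = sorted(range(N), key=lambda i: score(query, tokenized[i]),
                    reverse=True)
    return ranked[:k]

search("python programming language")  # → [0, 2, 1]
```

Document 0 matches all three query terms, so it outranks document 2 (two matches) and document 1 (one match).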
```shell
uv run ruff check .
uv run mypy search/
```

Run tests:
```shell
uv run pytest -v
```