| 1 | +<div align="center"> |
| 2 | + |
| 3 | +<h1>BM25</h1> |
| 4 | + |
| 5 | +<i>A fast, simple, and high-level Python API and CLI for BM25, powered by `bm25s`.</i> |
| 6 | + |
| 7 | +<table> |
| 8 | + <tr> |
| 9 | + <td> |
| 10 | + <a href="https://github.com/xhluca/bm25s">💻 GitHub</a> |
| 11 | + </td> |
| 12 | + <td> |
| 13 | + <a href="https://pypi.org/project/bm25s/">📦 bm25s</a> |
| 14 | + </td> |
| 15 | + <td> |
| 16 | + <a href="https://bm25s.github.io">🏠 Homepage</a> |
| 17 | + </td> |
| 18 | + </tr> |
| 19 | +</table> |
| 20 | +</div> |
| 21 | + |
| 22 | +`BM25` is a wrapper package that installs `bm25s` with its optional core dependencies, providing a simple, high-level API and a command-line interface for fast and effective text retrieval. |
| 23 | + |
| 24 | +## Installation |
| 25 | + |
| 26 | +Install `BM25` using pip: |
| 27 | + |
| 28 | +```bash |
| 29 | +pip install BM25 |
| 30 | +``` |
| 31 | + |
| 32 | +This automatically installs the `bm25s` backend along with the dependencies needed for stemming (`PyStemmer`), parallelization, and the CLI (`rich`). |
| 33 | + |
| 34 | +## High-Level API |
| 35 | + |
| 36 | +To quickly search a local file, use the `BM25` module: |
| 37 | + |
| 38 | +```python |
| 39 | +import BM25 |
| 40 | + |
| 41 | +# Load a file (csv, json, jsonl, txt) |
| 42 | +# For csv/jsonl, you can specify the column/key to use as document text |
| 43 | +corpus = BM25.load("tests/data/dummy.csv", document_column="text") |
| 44 | +# Index the corpus |
| 45 | +retriever = BM25.index(corpus) |
| 46 | + |
| 47 | +# Search |
| 48 | +results = retriever.search(["your query here"], k=5) |
| 49 | +for result in results[0]: |
| 50 | + print(result) |
| 51 | +``` |
| 52 | + |
| 53 | +The `load` function handles file reading, while `index` handles tokenization and indexing and returns a retriever with a simple search interface. |
| 54 | + |
| 55 | +## Command-Line Interface |
| 56 | + |
| 57 | +The package provides a terminal-based CLI for quick indexing and searching without writing Python code. |
| 58 | + |
| 59 | +### Indexing Documents |
| 60 | + |
| 61 | +Create an index from a CSV, TXT, JSON, or JSONL file: |
| 62 | + |
| 63 | +```bash |
| 64 | +# Index a CSV file (uses first column by default) |
| 65 | +bm25 index documents.csv -o my_index |
| 66 | + |
| 67 | +# Index with a specific column |
| 68 | +bm25 index documents.csv -o my_index -c text |
| 69 | + |
| 70 | +# Index a text file (one document per line) |
| 71 | +bm25 index documents.txt -o my_index |
| 72 | + |
| 73 | +# Index a JSONL file |
| 74 | +bm25 index documents.jsonl -o my_index -c content |
| 75 | +``` |
| 76 | + |
| 77 | +If you don't specify an output directory with `-o`, the index will be saved to `<filename>_index`. |
| 78 | + |
| 79 | +### User Directory |
| 80 | + |
| 81 | +You can save indices to a central user directory (`~/.bm25s/indices/`) using the `-u` flag: |
| 82 | + |
| 83 | +```bash |
| 84 | +# Save index to ~/.bm25s/indices/my_docs |
| 85 | +bm25 index documents.csv -u -o my_docs |
| 86 | + |
| 87 | +# Search using the user directory |
| 88 | +bm25 search -u -i my_docs "your query" |
| 89 | +``` |
| 90 | + |
| 91 | +### Searching |
| 92 | + |
| 93 | +Search an existing index with a query using `-i` (or `--index`): |
| 94 | + |
| 95 | +```bash |
| 96 | +# Basic search (returns top 10 results) |
| 97 | +bm25 search -i my_index "what is machine learning?" |
| 98 | + |
| 99 | +# Search with full path |
| 100 | +bm25 search -i ./path/to/my_index "your query here" |
| 101 | + |
| 102 | +# Return more results |
| 103 | +bm25 search -i my_index "your query here" -k 20 |
| 104 | + |
| 105 | +# Save results to a JSON file |
| 106 | +bm25 search -i my_index "your query here" -s results.json |
| 107 | +``` |
| 108 | + |
| 109 | +### Interactive Index Picker |
| 110 | + |
| 111 | +When you pass `-u` without an index name, an interactive picker is displayed (this requires `bm25s[cli]`, which is installed by default with `BM25`): |
| 112 | + |
| 113 | +```bash |
| 114 | +# Interactive picker will show available indices |
| 115 | +bm25 search -u "your query" |
| 116 | +``` |
| 117 | + |
| 118 | +### Example Workflow |
| 119 | + |
| 120 | +**Basic usage** (index saved to current directory): |
| 121 | + |
| 122 | +```bash |
| 123 | +# 1. Create a simple text file with documents |
| 124 | +echo -e "Machine learning is a subset of AI\nDeep learning uses neural networks\nNatural language processing handles text" > docs.txt |
| 125 | + |
| 126 | +# 2. Index the documents |
| 127 | +bm25 index docs.txt -o my_index |
| 128 | + |
| 129 | +# 3. Search the index |
| 130 | +bm25 search -i my_index "what is AI?" |
| 131 | +``` |
| 132 | + |
| 133 | +**With user directory** (indices saved to `~/.bm25s/indices/`): |
| 134 | + |
| 135 | +```bash |
| 136 | +# Index to user directory |
| 137 | +bm25 index docs.txt -u -o ml_docs |
| 138 | + |
| 139 | +# Search from user directory |
| 140 | +bm25 search -u -i ml_docs "what is AI?" |
| 141 | + |
| 142 | +# Or use the interactive picker |
| 143 | +bm25 search -u "what is AI?" |
| 144 | +``` |
| 145 | + |
| 146 | +## Flexibility |
| 147 | + |
| 148 | +For more advanced use cases, including memory mapping, customized tokenization, Hugging Face integration, or different BM25 variants, use the underlying `bm25s` API directly. |
| 149 | + |
| 150 | +See the [bm25s documentation](https://github.com/xhluca/bm25s) for full details. |
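
As a rough illustration of the file handling the README attributes to `load` (CSV, JSON, JSONL, and TXT input, one document per line for text files, the first CSV column by default), here is a minimal standard-library sketch. The names `load_documents` and `_text_of` are hypothetical and are not part of `BM25` or `bm25s`. Skipping blank lines and expecting a top-level list in JSON files are assumptions made for the example, not documented behavior.

```python
import csv
import json
from pathlib import Path


def _text_of(record, key):
    """Return the document text from one JSON record (hypothetical helper)."""
    # A plain string is treated as the document itself.
    if isinstance(record, str):
        return record
    # Assumption: without an explicit key, fall back to the record's first key.
    if key is None:
        key = next(iter(record))
    return record[key]


def load_documents(path, document_column=None):
    """Hypothetical loader: read a file into a list of document strings."""
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".txt":
        # One document per line; blank lines are skipped (an assumption).
        with path.open(encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]

    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            # Use the first column when no column name is given.
            column = document_column or reader.fieldnames[0]
            return [row[column] for row in reader]

    if suffix == ".jsonl":
        # One JSON object per line.
        with path.open(encoding="utf-8") as f:
            records = [json.loads(line) for line in f if line.strip()]
        return [_text_of(record, document_column) for record in records]

    if suffix == ".json":
        # Assumption: the file holds a top-level list of records.
        with path.open(encoding="utf-8") as f:
            records = json.load(f)
        return [_text_of(record, document_column) for record in records]

    raise ValueError(f"Unsupported file type: {suffix}")
```

A call such as `load_documents("documents.csv", document_column="text")` would return the values of the `text` column as a list of strings, which is the kind of corpus the indexing step then tokenizes.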