Commit e3d4719

feat: setup BM25 library as a high-level wrapper of bm25s
1 parent 04efccb commit e3d4719

4 files changed: 198 additions, 2 deletions

.github/workflows/publish-python.yaml

Lines changed: 4 additions & 0 deletions
@@ -41,6 +41,10 @@ jobs:
       - name: Build package
         run: |
           python setup.py sdist bdist_wheel
+          cd bm25s/high_level
+          python setup.py sdist bdist_wheel
+          cp -r dist/* ../../dist/
+          cd ../..

       - name: Publish package distributions to PyPI
         uses: pypa/gh-action-pypi-publish@release/v1

bm25s/high_level/README.md

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
<div align="center">

<h1>BM25</h1>

<i>A fast, simple, and high-level Python API and CLI for BM25, powered by `bm25s`.</i>

<table>
<tr>
<td>
<a href="https://github.com/xhluca/bm25s">💻 GitHub</a>
</td>
<td>
<a href="https://pypi.org/project/bm25s/">📦 bm25s</a>
</td>
<td>
<a href="https://bm25s.github.io">🏠 Homepage</a>
</td>
</tr>
</table>
</div>

`BM25` is a wrapper package that installs `bm25s` with its optional core dependencies, providing a simple, high-level API and a command-line interface for fast and effective text retrieval.

## Installation

Install `BM25` using pip:

```bash
pip install BM25
```

This automatically installs the highly optimized `bm25s` backend, along with the dependencies needed for stemming (`PyStemmer`), parallelization, and the CLI (`rich`).

## High-Level API

If you want to quickly search a local file, you can use the `BM25` module:

```python
import BM25

# Load a file (csv, json, jsonl, txt)
# For csv/jsonl, you can specify the column/key to use as document text
corpus = BM25.load("tests/data/dummy.csv", document_column="text")
# Index the corpus
retriever = BM25.index(corpus)

# Search
results = retriever.search(["your query here"], k=5)
for result in results[0]:
    print(result)
```

The `load` function handles file reading, while `index` handles tokenization and indexing and returns an object with a simple search interface.
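
As a rough illustration of the kind of work a loader like `load` takes care of, the sketch below reads documents from a `.txt`, `.csv`, `.jsonl`, or `.json` file using only the standard library. The names `load_documents` and `_text_of` are hypothetical, and the fallback to the first CSV column and the assumption that a `.json` file holds a single array of records are guesses made for the example; this is not the package's actual implementation.

```python
import csv
import json
from pathlib import Path
from typing import Any, List, Optional


def _text_of(record: Any, column: Optional[str]) -> str:
    """Extract the document text from one parsed JSON record."""
    if isinstance(record, str):
        return record
    if column is None:
        raise ValueError("document_column is required for JSON objects")
    return str(record[column])


def load_documents(path: str, document_column: Optional[str] = None) -> List[str]:
    """Hypothetical loader: return one string per document in the file."""
    file_path = Path(path)
    suffix = file_path.suffix.lower()

    with file_path.open(encoding="utf8", newline="") as fp:
        if suffix == ".txt":
            # One document per non-empty line.
            return [line.strip() for line in fp if line.strip()]
        if suffix == ".csv":
            reader = csv.DictReader(fp)
            fieldnames = reader.fieldnames or []
            if not fieldnames:
                return []
            # Assumption: default to the first column when none is named.
            column = document_column or fieldnames[0]
            return [row[column] for row in reader]
        if suffix == ".jsonl":
            # One JSON value per non-empty line.
            return [
                _text_of(json.loads(line), document_column)
                for line in fp
                if line.strip()
            ]
        if suffix == ".json":
            # Assumption: the file holds a single JSON array of records.
            return [_text_of(record, document_column) for record in json.load(fp)]

    raise ValueError(f"Unsupported file type: {suffix}")
```

Returning a plain list of strings keeps the loading step independent of tokenization and indexing, so the same list could be handed to any indexer.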

## Command-Line Interface

The package provides a terminal-based CLI for quick indexing and searching without writing Python code.

### Indexing Documents

Create an index from a CSV, TXT, JSON, or JSONL file:

```bash
# Index a CSV file (uses first column by default)
bm25 index documents.csv -o my_index

# Index with a specific column
bm25 index documents.csv -o my_index -c text

# Index a text file (one document per line)
bm25 index documents.txt -o my_index

# Index a JSONL file
bm25 index documents.jsonl -o my_index -c content
```

If you don't specify an output directory with `-o`, the index will be saved to `<filename>_index`.

### User Directory

You can save indices to a central user directory (`~/.bm25s/indices/`) using the `-u` flag:

```bash
# Save index to ~/.bm25s/indices/my_docs
bm25 index documents.csv -u -o my_docs

# Search using the user directory
bm25 search -u -i my_docs "your query"
```

### Searching

Search an existing index with a query using `-i` (or `--index`):

```bash
# Basic search (returns top 10 results)
bm25 search -i my_index "what is machine learning?"

# Search with full path
bm25 search -i ./path/to/my_index "your query here"

# Return more results
bm25 search -i my_index "your query here" -k 20

# Save results to a JSON file
bm25 search -i my_index "your query here" -s results.json
```

### Interactive Index Picker

When you use `-u` without specifying an index name, an interactive picker is displayed. This requires `bm25s[cli]`, which is installed by default with `BM25`:

```bash
# Interactive picker will show available indices
bm25 search -u "your query"
```

### Example Workflow

**Basic usage** (index saved to current directory):

```bash
# 1. Create a simple text file with documents
echo -e "Machine learning is a subset of AI\nDeep learning uses neural networks\nNatural language processing handles text" > docs.txt

# 2. Index the documents
bm25 index docs.txt -o my_index

# 3. Search the index
bm25 search -i my_index "what is AI?"
```

**With user directory** (indices saved to `~/.bm25s/indices/`):

```bash
# Index to user directory
bm25 index docs.txt -u -o ml_docs

# Search from user directory
bm25 search -u -i ml_docs "what is AI?"

# Or use the interactive picker
bm25 search -u "what is AI?"
```

## Flexibility

For more advanced use cases, including memory mapping, customized tokenization, Hugging Face integration, or different BM25 variants, please use the underlying `bm25s` API directly.

See the [bm25s documentation](https://github.com/xhluca/bm25s) for full details.
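
For readers wondering what a BM25 variant changes, the pure-Python sketch below scores tokenized documents with one common formulation: a non-negative IDF and a saturating, length-normalized term-frequency component. The function name `bm25_scores` and the defaults `k1=1.5` and `b=0.75` are illustrative assumptions, and this code is independent of `bm25s`; it is not a description of how that library computes or stores scores.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str],
    corpus: List[List[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Score every tokenized document in `corpus` against a tokenized query."""
    n_docs = len(corpus)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in corpus) / n_docs or 1.0
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized through `b`.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "machine learning is a subset of ai".split(),
    "deep learning uses neural networks".split(),
    "natural language processing handles text".split(),
]
print(bm25_scores(["learning"], docs))
```

Variants typically differ in the IDF term or in how the term-frequency component is scaled, which is why the choice is left to the lower-level API.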
Lines changed: 2 additions & 2 deletions
@@ -9,8 +9,8 @@
 import json
 import csv
 from pathlib import Path
-from . import BM25
-from .tokenization import Tokenizer
+from bm25s import BM25
+from bm25s.tokenization import Tokenizer
 import Stemmer
 from typing import List

bm25s/high_level/setup.py

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
from setuptools import setup
import os
import sys

# Change to the current directory to avoid issues if run from elsewhere
current_dir = os.path.dirname(os.path.abspath(__file__))
os.chdir(current_dir)

package_name = "BM25"
version = {}
with open(os.path.join("..", "version.py"), encoding="utf8") as fp:
    exec(fp.read(), version)

with open("README.md", encoding="utf8") as fp:
    long_description = fp.read()

setup(
    name=package_name,
    version=version["__version__"],
    author="Xing Han Lù",
    author_email="bm25s@googlegroups.com",
    url="https://github.com/xhluca/bm25s/tree/main/bm25s/high_level",
    description="A simple high-level API and CLI for BM25.",
    long_description=long_description,
    long_description_content_type="text/markdown",
    packages=["BM25"],
    package_dir={"BM25": "."},
    install_requires=[
        f"bm25s[core,cli]=={version['__version__']}",
    ],
    entry_points={
        "console_scripts": [
            "bm25=bm25s.cli:main",
        ],
    },
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
    python_requires=">=3.8",
)
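
The script above reads `__version__` by executing `../version.py` and reuses it both as the wrapper's own version and as an exact pin on `bm25s`, so the two packages are published in lockstep. A minimal sketch of that mechanism follows; the helper name `pinned_requirement` and the version string `1.2.3` are placeholders chosen for illustration.

```python
def pinned_requirement(version_source: str) -> str:
    """Build the exact-version requirement from the text of a version file."""
    namespace = {}
    # Executing the file's text defines __version__ inside `namespace`.
    exec(version_source, namespace)
    return f"bm25s[core,cli]=={namespace['__version__']}"


# Placeholder version text standing in for the contents of version.py.
print(pinned_requirement('__version__ = "1.2.3"'))  # bm25s[core,cli]==1.2.3
```

Pinning with `==` means that installing a given release of the wrapper always resolves to the matching backend release, with the `core` and `cli` extras enabled.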
