
Commit 43545ca

Authored by xouyang1, atharva-tendle, AstraBert, and cpoerschke
Add Solr reader integration (#19843)
* integrations: Add Solr reader
  Co-authored-by: Atharva Tendle <[email protected]>
* Update pyproject.toml authors maintainers (#1)
* Update llama-index-integrations/readers/llama-index-readers-solr/pyproject.toml
  Co-authored-by: Clelia (Astra) Bertelli <[email protected]>
* Update Solr reader for review comments related to clean up and adding unit tests (#2)
* Update llama-index-integrations/readers/llama-index-readers-solr/llama_index/readers/solr/base.py
  Co-authored-by: Christine Poerschke <[email protected]>
* Add support for user provided fl and id
* Update llama-index-integrations/readers/llama-index-readers-solr/llama_index/readers/solr/base.py
  Co-authored-by: Christine Poerschke <[email protected]>

---------

Co-authored-by: Atharva Tendle <[email protected]>
Co-authored-by: Clelia (Astra) Bertelli <[email protected]>
Co-authored-by: Christine Poerschke <[email protected]>
1 parent 9be3a8b commit 43545ca

File tree

11 files changed: +4743 -0 lines changed

Lines changed (.gitignore): 153 additions & 0 deletions
@@ -0,0 +1,153 @@
llama_index/_static
.DS_Store
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
bin/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
etc/
include/
lib/
lib64/
parts/
sdist/
share/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
.ruff_cache

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints
notebooks/

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
pyvenv.cfg

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# Jetbrains
.idea
modules/
*.swp

# VsCode
.vscode

# pipenv
Pipfile
Pipfile.lock

# pyright
pyrightconfig.json
Lines changed (CHANGELOG.md): 1 addition & 0 deletions
@@ -0,0 +1 @@
# CHANGELOG
Lines changed (LICENSE): 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Bloomberg Finance L.P.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Lines changed (Makefile): 17 additions & 0 deletions
@@ -0,0 +1,17 @@
GIT_ROOT ?= $(shell git rev-parse --show-toplevel)

help: ## Show all Makefile targets.
	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[33m%-30s\033[0m %s\n", $$1, $$2}'

format: ## Run code autoformatters (black).
	pre-commit install
	git ls-files | xargs pre-commit run black --files

lint: ## Run linters: pre-commit (black, ruff, codespell) and mypy
	pre-commit install && git ls-files | xargs pre-commit run --show-diff-on-failure --files

test: ## Run tests via pytest.
	pytest tests

watch-docs: ## Build and watch documentation.
	sphinx-autobuild docs/ docs/_build/html --open-browser --watch $(GIT_ROOT)/llama_index/
Lines changed (README.md): 34 additions & 0 deletions
@@ -0,0 +1,34 @@
# LlamaIndex Readers Integration: Solr

## Overview

Solr Reader retrieves documents through an existing Solr index. These documents can then be used in a downstream LlamaIndex data structure.

### Installation

You can install Solr Reader via pip:

```bash
pip install llama-index-readers-solr
```

## Usage

```python
from llama_index.readers.solr import SolrReader

# Initialize SolrReader with the Solr URL. The Solr URL should include the path
# to the core (if single node) or collection (if Solr Cloud).
reader = SolrReader(endpoint="<Endpoint with full solr path>")

# Load data from Solr index
documents = reader.load_data(
    query={"q": "*:*", "rows": 10},  # Solr query parameters
    field="content_t",  # Only results with populated values in this field will be returned
    metadata_fields=["title_t", "category_s"],
)
```

This loader is designed to load data into
[LlamaIndex](https://github.com/run-llama/llama_index/tree/main/llama_index) and/or to be used
subsequently as a Tool in a [LangChain](https://github.com/hwchase17/langchain) Agent.
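
Beyond the defaults shown in the README above, this commit also adds support for a user-provided `fl` and a custom ID field. The snippet below is a minimal sketch of that usage based on the `load_data` signature in `base.py`; the endpoint URL and the field names (`doc_id_s`, `body_t`, `title_t`) are illustrative placeholders, not values taken from this commit.

```python
from llama_index.readers.solr import SolrReader

# Placeholder endpoint; point this at your own core or collection.
reader = SolrReader(endpoint="http://localhost:8983/solr/my_collection")

# When "fl" is present in the query it is passed through to Solr as-is,
# and id_field selects a document identifier other than the default "id".
# Field names here (doc_id_s, body_t, title_t) are illustrative only.
documents = reader.load_data(
    query={
        "q": "body_t:solr",
        "rows": 25,
        "fl": "doc_id_s,body_t,title_t",  # respected exactly as given
    },
    field="body_t",       # field used as document text
    id_field="doc_id_s",  # custom identifier field
    metadata_fields=["title_t"],
)
```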
Lines changed (llama_index/readers/solr/__init__.py): 3 additions & 0 deletions
@@ -0,0 +1,3 @@
from llama_index.readers.solr.base import SolrReader

__all__ = ["SolrReader"]
Lines changed (llama_index/readers/solr/base.py): 99 additions & 0 deletions
@@ -0,0 +1,99 @@
"""
Solr reader over REST api.
"""

from typing import Any, Optional

import pysolr

from llama_index.core.bridge.pydantic import Field, PrivateAttr
from llama_index.core.readers.base import BasePydanticReader
from llama_index.core.schema import Document


class SolrReader(BasePydanticReader):
    """
    Read documents from a Solr index.

    These documents can then be used in a downstream Llama Index data structure.
    """

    endpoint: str = Field(description="Full endpoint, including collection info.")
    _client: Any = PrivateAttr()

    def __init__(
        self,
        endpoint: str,
    ):
        """Initialize with parameters."""
        super().__init__(endpoint=endpoint)
        self._client = pysolr.Solr(endpoint)

    def load_data(
        self,
        query: dict[str, Any],
        field: str,
        id_field: str = "id",
        metadata_fields: Optional[list[str]] = None,
        embedding: Optional[str] = None,
    ) -> list[Document]:
        r"""
        Read data from the Solr index. At least one field argument must be specified.

        Args:
            query (dict): The Solr query parameters.
                - "q" is required.
                - "rows" should be specified or will default to 10 by Solr.
                - If "fl" is provided, it is respected exactly as given.
                  If "fl" is NOT provided, a default `fl` is constructed from
                  {id_field, field, embedding?, metadata_fields?}.
            field (str): Field in Solr to retrieve as document text.
            id_field (str): Field in Solr to retrieve as the document identifier. Defaults to "id".
            metadata_fields (list[str], optional): Fields to include as metadata. Defaults to None.
            embedding (str, optional): Field to use for embeddings. Defaults to None.

        Raises:
            ValueError: If the HTTP call to Solr fails.

        Returns:
            list[Document]: A list of retrieved documents where field is populated.

        """
        if "q" not in query:
            raise ValueError("Query parameters must include a 'q' field for the query.")

        fl_default = {}
        if "fl" not in query:
            fields = [id_field, field]
            if embedding:
                fields.append(embedding)
            if metadata_fields:
                fields.extend(metadata_fields)
            fl_default = {"fl": ",".join(fields)}

        try:
            query_params = {
                **query,
                **fl_default,
            }
            results = self._client.search(**query_params)
        except Exception as e:  # pragma: no cover
            raise ValueError(f"Failed to query Solr endpoint: {e!s}") from e

        documents: list[Document] = []
        for doc in results.docs:
            if field not in doc:
                continue

            doc_kwargs: dict[str, Any] = {
                "id_": str(doc[id_field]),
                "text": doc[field],
                **({"embedding": doc.get(embedding)} if embedding else {}),
                "metadata": {
                    metadata_field: doc[metadata_field]
                    for metadata_field in (metadata_fields or [])
                    if metadata_field in doc
                },
            }
            documents.append(Document(**doc_kwargs))
        return documents
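
As a side note on the `fl` handling above: when the caller's query omits `fl`, `load_data` assembles one from the id field, the text field, and the optional embedding and metadata fields. The standalone sketch below (not part of the commit) mirrors that logic so the resulting `fl` string is easy to see; the field names in the demo call are placeholders.

```python
from typing import Optional


def build_default_fl(
    id_field: str,
    field: str,
    embedding: Optional[str] = None,
    metadata_fields: Optional[list[str]] = None,
) -> str:
    """Mirror SolrReader.load_data: join id, text, optional embedding, and metadata fields."""
    fields = [id_field, field]
    if embedding:
        fields.append(embedding)
    if metadata_fields:
        fields.extend(metadata_fields)
    return ",".join(fields)


# Placeholder field names; prints "id,content_t,vector,title_t,category_s".
print(build_default_fl("id", "content_t", "vector", ["title_t", "category_s"]))
```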
