Commit f927a15

feat: move sources to core package (#496)
1 parent: 10ef831

73 files changed (+333 −305 lines)
docs/api_reference/core/sources.md (17 additions, 0 deletions)

````diff
@@ -0,0 +1,17 @@
+# Sources
+
+::: ragbits.core.sources.base.Source
+
+::: ragbits.core.sources.azure.AzureBlobStorageSource
+
+::: ragbits.core.sources.gcs.GCSSource
+
+::: ragbits.core.sources.git.GitSource
+
+::: ragbits.core.sources.hf.HuggingFaceSource
+
+::: ragbits.core.sources.local.LocalFileSource
+
+::: ragbits.core.sources.s3.S3Source
+
+::: ragbits.core.sources.web.WebSource
````

docs/api_reference/document_search/documents.md (0 additions, 25 deletions)

This file was deleted.

New file (9 additions, 0 deletions)

````diff
@@ -0,0 +1,9 @@
+# Documents
+
+::: ragbits.document_search.documents.document.Document
+
+::: ragbits.document_search.documents.document.TextDocument
+
+::: ragbits.document_search.documents.document.DocumentMeta
+
+::: ragbits.document_search.documents.document.DocumentType
````

New file (7 additions, 0 deletions)

````diff
@@ -0,0 +1,7 @@
+# Elements
+
+::: ragbits.document_search.documents.element.Element
+
+::: ragbits.document_search.documents.element.TextElement
+
+::: ragbits.document_search.documents.element.ImageElement
````

docs/how-to/document_search/ingest-documents.md (6 additions, 64 deletions)

````diff
@@ -2,9 +2,9 @@
 
 The Ragbits document ingest pipeline consists of four main steps: loading, parsing, enrichment, and indexing. All of these steps can be orchestrated using different strategies, depending on the expected load.
 
-## Loading sources
+## Loading dataset
 
-Before a document can be processed, it must be defined and downloaded. In Ragbits, there are a few ways to do this: you can specify the source URI, the source instance, the document metadata or the document itself.
+Before processing a document in Ragbits, it must first be defined and downloaded. This can be done in several ways: by specifying a source URI or using an instance of [`Source`][ragbits.core.sources.base.Source], [`DocumentMeta`][ragbits.document_search.documents.document.DocumentMeta] or [`Document`][ragbits.document_search.documents.document.Document].
 
 === "URI"
 
@@ -19,7 +19,7 @@ Before a document can be processed, it must be defined and downloaded. In Ragbit
 === "Source"
 
     ```python
-    from ragbits.document_search.documents.sources import WebSource
+    from ragbits.core.sources import WebSource
     from ragbits.document_search import DocumentSearch
 
     document_search = DocumentSearch(...)
@@ -49,65 +49,7 @@ Before a document can be processed, it must be defined and downloaded. In Ragbit
     await document_search.ingest([Document(...), ...])
     ```
 
-### Supported sources
-
-This is the list of currently supported sources by Ragbits.
-
-| Source | URI Schema | Class |
-|-|-|-|
-| Azure Blob Storage | `azure://https://account_name.blob.core.windows.net/<container-name>|<blob-name>` | [`AzureBlobStorageSource`][ragbits.document_search.documents.sources.AzureBlobStorageSource] |
-| Google Cloud Storage | `gcs://<bucket-name>/<prefix>` | [`GCSSource`][ragbits.document_search.documents.sources.GCSSource] |
-| Git | `git://<https-url>|<ssh-url>` | [`GitSource`][ragbits.document_search.documents.sources.GitSource] |
-| Hugging Face | `huggingface://<dataset-path>/<split>/<row>` | [`HuggingFaceSource`][ragbits.document_search.documents.sources.HuggingFaceSource] |
-| Local file | `file://<file-path>|<blob-pattern>` | [`LocalFileSource`][ragbits.document_search.documents.sources.LocalFileSource] |
-| Amazon S3 | `s3://<bucket-name>/<prefix>` | [`S3Source`][ragbits.document_search.documents.sources.S3Source] |
-| Web | `web://<https-url>` | [`WebSource`][ragbits.document_search.documents.sources.WebSource] |
-
-To define a new sources, extend the [`Source`][ragbits.document_search.documents.sources.Source] class.
-
-```python
-from ragbits.document_search.documents.sources import Source
-
-
-class CustomSource(Source):
-    """
-    Source that downloads file from the web.
-    """
-
-    protocol: ClassVar[str] = "custom"
-    source_url: str
-    ...
-
-    @property
-    def id(self) -> str:
-        """
-        Source unique identifier.
-        """
-        return f"{self.protocol}:{self.source_url}"
-
-    @classmethod
-    async def from_uri(cls, uri: str) -> list[Self]:
-        """
-        Create source instances from a URI path.
-
-        Args:
-            uri: The URI path.
-
-        Returns:
-            The list of sources.
-        """
-        return [cls(...), ...]
-
-    async def fetch(self) -> Path:
-        """
-        Download a file for the given url.
-
-        Returns:
-            The local path to the downloaded file.
-        """
-        ...
-        return Path(f"/tmp/{self.source_url}")
-```
+All sources supported by Ragbits are available [here](../sources/load-dataset.md#supported-sources).
 
 ## Parsing documents
 
@@ -290,7 +232,7 @@ Running an ingest pipeline can be time-consuming, depending on your expected loa
 --address http://<cluster_address>:8265 \
 --runtime-env '{"pip": ["ragbits-core", "ragbits-document-search[ray]"]}' \
 --working-dir . \
---python script.py
+-- python script.py
 ```
 
 There are also other ways to submit jobs to the Ray cluster. For more information, please refer to the [Ray documentation](https://docs.ray.io/en/latest/ray-overview/index.html).
@@ -300,7 +242,7 @@ To define a new ingest strategy, extend the [`IngestStrategy`][ragbits.document_
 ```python
 from ragbits.core.vector_stores import VectorStore
 from ragbits.document_search.documents.document import Document, DocumentMeta
-from ragbits.document_search.documents.sources import Source
+from ragbits.core.sources import Source
 from ragbits.document_search.ingestion.enrichers import ElementEnricherRouter
 from ragbits.document_search.ingestion.parsers import DocumentParserRouter
 from ragbits.document_search.ingestion.strategies import (
````
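As context for the loading options this diff touches, here is a minimal usage sketch (illustrative, not part of the commit). The `web://` URI is hypothetical, the `DocumentSearch` configuration is omitted, and the `ingest` call shape follows the snippets shown above.

```python
import asyncio

from ragbits.document_search import DocumentSearch


async def main() -> None:
    # DocumentSearch configuration (embedder, vector store, ...) omitted here;
    # see the how-to for the full setup.
    document_search = DocumentSearch(...)

    # Ingest by source URI (hypothetical web URL); equivalent calls accept
    # Source, DocumentMeta or Document instances, as the tabs above describe.
    await document_search.ingest(["web://https://example.com/handbook.pdf"])


asyncio.run(main())
```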
docs/how-to/sources/load-dataset.md (new file, 65 additions)

````diff
@@ -0,0 +1,65 @@
+# How-To: Load dataset with sources
+
+Ragbits provides an abstraction for handling datasets. The [`Source`][ragbits.core.sources.Source] component is designed to define interactions with any data source, such as downloading and querying.
+
+## Supported sources
+
+This is the list of currently supported sources by Ragbits.
+
+| Source | URI Schema | Class |
+|-|-|-|
+| Azure Blob Storage | `azure://https://account_name.blob.core.windows.net/<container-name>|<blob-name>` | [`AzureBlobStorageSource`][ragbits.core.sources.AzureBlobStorageSource] |
+| Google Cloud Storage | `gcs://<bucket-name>/<prefix>` | [`GCSSource`][ragbits.core.sources.GCSSource] |
+| Git | `git://<https-url>|<ssh-url>` | [`GitSource`][ragbits.core.sources.GitSource] |
+| Hugging Face | `hf://<dataset-path>/<split>/<row>` | [`HuggingFaceSource`][ragbits.core.sources.HuggingFaceSource] |
+| Local file | `file://<file-path>|<blob-pattern>` | [`LocalFileSource`][ragbits.core.sources.LocalFileSource] |
+| Amazon S3 | `s3://<bucket-name>/<prefix>` | [`S3Source`][ragbits.core.sources.S3Source] |
+| Web | `web://<https-url>` | [`WebSource`][ragbits.core.sources.WebSource] |
+
+## Custom source
+
+To define a new sources, extend the [`Source`][ragbits.core.sources.Source] class.
+
+```python
+from ragbits.core.sources import Source
+
+
+class CustomSource(Source):
+    """
+    Source that downloads file from the web.
+    """
+
+    protocol: ClassVar[str] = "custom"
+    source_url: str
+    ...
+
+    @property
+    def id(self) -> str:
+        """
+        Source unique identifier.
+        """
+        return f"{self.protocol}:{self.source_url}"
+
+    @classmethod
+    async def from_uri(cls, uri: str) -> list[Self]:
+        """
+        Create source instances from a URI path.
+
+        Args:
+            uri: The URI path.
+
+        Returns:
+            The list of sources.
+        """
+        return [cls(...), ...]
+
+    async def fetch(self) -> Path:
+        """
+        Download a file for the given url.
+
+        Returns:
+            The local path to the downloaded file.
+        """
+        ...
+        return Path(f"/tmp/{self.source_url}")
+```
````
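A minimal usage sketch of the `Source` interface added above (illustrative, not part of the commit). The `file://` glob URI and paths are hypothetical and follow the schema in the supported-sources table; `from_uri` and `fetch` are the async methods documented in the interface.

```python
import asyncio

from ragbits.core.sources import LocalFileSource


async def main() -> None:
    # Expand a hypothetical glob-style file URI into one source per matching file,
    # per the `file://<file-path>|<blob-pattern>` schema in the table above.
    sources = await LocalFileSource.from_uri("file:///data/docs/*.md")
    for source in sources:
        local_path = await source.fetch()  # resolve/download to a local path
        print(source.id, local_path)


asyncio.run(main())
```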

docs/quickstart/quickstart2_rag.md (1 addition, 1 deletion)

````diff
@@ -43,7 +43,7 @@ We first need to direct Ragbits to the location of the documents to load them. T
 
 ```python
 from pathlib import Path
-from ragbits.document_search.documents.sources import LocalFileSource
+from ragbits.core.sources import LocalFileSource
 
 # Path to the directory with markdown files to ingest
 documents_path = Path(__file__).parent / "pb-source/en"
````
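The quickstart snippet stops after defining `documents_path`; a possible continuation (illustrative, not part of the commit) could ingest the matching files, assuming the glob form of the `file://` URI from the supported-sources table and a `DocumentSearch` configured elsewhere.

```python
import asyncio
from pathlib import Path

from ragbits.core.sources import LocalFileSource
from ragbits.document_search import DocumentSearch

# Path to the directory with markdown files to ingest (from the quickstart above)
documents_path = Path(__file__).parent / "pb-source/en"


async def main() -> None:
    # Build one LocalFileSource per markdown file via a glob-pattern file URI.
    sources = await LocalFileSource.from_uri(f"file://{documents_path}/*.md")

    document_search = DocumentSearch(...)  # embedder/vector store configured elsewhere
    await document_search.ingest(sources)


asyncio.run(main())
```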

examples/document-search/multimodal.py (1 addition, 1 deletion)

````diff
@@ -34,12 +34,12 @@
 from pathlib import Path
 
 from ragbits.core.embeddings.vertex_multimodal import VertexAIMultimodelEmbedder
+from ragbits.core.sources import LocalFileSource
 from ragbits.core.vector_stores.base import EmbeddingType
 from ragbits.core.vector_stores.hybrid import HybridSearchVectorStore
 from ragbits.core.vector_stores.in_memory import InMemoryVectorStore
 from ragbits.document_search import DocumentSearch
 from ragbits.document_search.documents.document import DocumentMeta, DocumentType
-from ragbits.document_search.documents.sources import LocalFileSource
 from ragbits.document_search.ingestion.parsers.base import ImageDocumentParser
 from ragbits.document_search.ingestion.parsers.router import DocumentParserRouter
 
````

Source configuration file (1 addition, 1 deletion)

````diff
@@ -1,4 +1,4 @@
-type: ragbits.document_search.documents.sources.hf:HuggingFaceSource
+type: ragbits.core.sources.hf:HuggingFaceSource
 config:
   path: "micpst/hf-docs"
   split: "train[:5]"
````

examples/evaluation/document-search/advanced/evaluate.py (2 additions, 2 deletions)

````diff
@@ -1,8 +1,8 @@
 # /// script
 # requires-python = ">=3.10"
 # dependencies = [
-# "ragbits-core[chroma]",
-# "ragbits-document-search[huggingface]",
+# "ragbits-core[chroma,hf]",
+# "ragbits-document-search",
 # "ragbits-evaluate[relari]",
 # ]
 # ///
````
