Commit 99ce46c

feat: Add support for Markdown files (#496)
1 parent 83eb963 commit 99ce46c

40 files changed: +1028 -332 lines changed

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -2,11 +2,21 @@
 
 ## Next
 
+### Added
+
+- MarkdownLoader (experimental): added a Markdown loader to support `.md` and `.markdown` files.
+
+### Changed
+
+- SimpleKG pipeline (experimental): the `from_pdf` parameter is deprecated in favor of `from_file` (PDF and Markdown inputs). `from_pdf` still works but emits a deprecation warning and will be removed in a future version.
+- Data loaders (experimental): the `PdfDocument` type name is deprecated in favor of `LoadedDocument`; `PdfDocument` remains available as a backward-compatible alias with a deprecation warning.
+
 ## 1.14.1
 
 ### Added
 
 - `NodeType` and `RelationshipType` now reject labels and types that start or end with double underscores (`__`), e.g. `__Person__`. This convention is reserved for internal Neo4j GraphRAG labels. A `ValidationError` is raised on construction.
+- SimpleKG pipeline (experimental): Markdown inputs (`.md` / `.markdown`) are supported alongside PDF via the default extension-based file loader when building from a file path.
 
 ### Changed
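
Note on the `from_pdf` deprecation described in the changelog entry above: the sketch below shows one common way a legacy keyword can be kept working while steering callers to the new name. The helper `resolve_from_file`, its `default` argument, and the rule that `from_file` wins when both are given are assumptions made for illustration; they are not taken from the library's code.

```python
import warnings
from typing import Optional


def resolve_from_file(
    from_file: Optional[bool] = None,
    from_pdf: Optional[bool] = None,
    default: bool = True,  # assumed default, not confirmed by the commit
) -> bool:
    """Hypothetical helper: map the legacy `from_pdf` flag onto `from_file`."""
    if from_pdf is not None:
        # The legacy name still works, but the caller is told to migrate.
        warnings.warn(
            "`from_pdf` is deprecated; use `from_file` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        if from_file is None:
            return from_pdf
    if from_file is None:
        return default
    # Assumption: when both names are given, the new name takes precedence.
    return from_file
```

Calling `resolve_from_file(from_pdf=False)` returns `False` and emits a `DeprecationWarning`, while `resolve_from_file(from_file=False)` returns `False` silently.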

README.md

Lines changed: 1 addition & 1 deletion
@@ -138,7 +138,7 @@ kg_builder = SimpleKGPipeline(
         "patterns": patterns,
     },
     on_error="IGNORE",
-    from_pdf=False,
+    from_file=False,
 )
 
 # Run the pipeline on a piece of text

docs/source/api.rst

Lines changed: 8 additions & 2 deletions
@@ -18,13 +18,19 @@ Component
 DataLoader
 ==========
 
-.. autoclass:: neo4j_graphrag.experimental.components.pdf_loader.DataLoader
+.. autoclass:: neo4j_graphrag.experimental.components.data_loader.DataLoader
     :members: run, get_document_metadata
 
 PdfLoader
 =========
 
-.. autoclass:: neo4j_graphrag.experimental.components.pdf_loader.PdfLoader
+.. autoclass:: neo4j_graphrag.experimental.components.data_loader.PdfLoader
+    :members: run, load_file
+
+MarkdownLoader
+==============
+
+.. autoclass:: neo4j_graphrag.experimental.components.data_loader.MarkdownLoader
     :members: run, load_file
 
 TextSplitter
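
Note on the two loaders documented above: the changelog mentions a default extension-based file loader. A minimal sketch of that kind of dispatch follows, assuming a simple suffix-to-loader table. The names `LOADER_BY_SUFFIX` and `pick_loader_name` are hypothetical, the loaders are represented by their class names as strings so the snippet runs without the library installed, and the case-insensitive match is an assumption.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the loader class name.
LOADER_BY_SUFFIX = {
    ".pdf": "PdfLoader",
    ".md": "MarkdownLoader",
    ".markdown": "MarkdownLoader",
}


def pick_loader_name(file_path: str) -> str:
    """Return the loader name for a path, based only on its extension."""
    # Assumption: extensions are matched case-insensitively.
    suffix = Path(file_path).suffix.lower()
    try:
        return LOADER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None
```

Unknown extensions raise a `ValueError` here rather than falling back to a default loader; how the library itself handles that case is not shown in this commit excerpt.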

docs/source/types.rst

Lines changed: 5 additions & 0 deletions
@@ -44,6 +44,11 @@ DocumentInfo
 
 .. autoclass:: neo4j_graphrag.experimental.components.types.DocumentInfo
 
+LoadedDocument
+==============
+
+.. autoclass:: neo4j_graphrag.experimental.components.types.LoadedDocument
+
 
 TextChunk
 =========
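
Note on the `PdfDocument` to `LoadedDocument` rename: the sketch below shows one possible way to keep an old type name usable while warning on use. Both classes here are simplified stand-ins defined locally (the real `LoadedDocument` also carries document information), and the subclass-based approach is an assumption; the library may implement its alias differently.

```python
import warnings
from dataclasses import dataclass


@dataclass
class LoadedDocument:
    """Simplified stand-in for the new type name; only `text` is modeled."""

    text: str


class PdfDocument(LoadedDocument):
    """Stand-in for the deprecated name, kept for backward compatibility."""

    def __init__(self, *args, **kwargs):
        # Warn at the caller's line, then behave exactly like the new type.
        warnings.warn(
            "`PdfDocument` is deprecated; use `LoadedDocument` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)
```

Because the old name subclasses the new one, existing `isinstance(doc, LoadedDocument)` checks keep passing for objects built through the legacy name.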

docs/source/user_guide_kg_builder.rst

Lines changed: 15 additions & 11 deletions
@@ -54,10 +54,10 @@ is utilizing the `SimpleKGPipeline` interface:
         llm=llm, # an LLMInterface for Entity and Relation extraction
         driver=neo4j_driver, # a neo4j driver to write results to graph
         embedder=embedder, # an Embedder for chunks
-        from_pdf=True, # set to False if parsing an already extracted text
+        from_file=True, # set to False if parsing an already extracted text
     )
     await kg_builder.run_async(file_path=str(file_path))
-    # await kg_builder.run_async(text="my text") # if using from_pdf=False
+    # await kg_builder.run_async(text="my text") # if using from_file=False
 
 
 See:
@@ -216,9 +216,12 @@ instances of specific components to the `SimpleKGPipeline`. The components that
 customized at the moment are:
 
 - `text_splitter`: must be an instance of :ref:`TextSplitter`
-- `pdf_loader`: must be an instance of :ref:`PdfLoader`
+- `file_loader`: must be an instance of :ref:`PdfLoader` or :ref:`MarkdownLoader`
 - `kg_writer`: must be an instance of :ref:`KGWriter`
 
+The legacy names ``from_pdf`` and ``pdf_loader`` (in Python, YAML, or JSON) are still accepted
+with a deprecation warning; use ``from_file`` and ``file_loader`` instead.
+
 For instance, the following code can be used to customize the chunk size and
 chunk overlap in the text splitter component:

@@ -450,7 +453,7 @@ within the configuration file.
 .. code:: json
 
     {
-        "from_pdf": false,
+        "from_file": false,
         "perform_entity_resolution": true,
         "neo4j_database": "myDb",
         "on_error": "IGNORE",
@@ -502,7 +505,7 @@ or in YAML:
 
 .. code:: yaml
 
-    from_pdf: false
+    from_file: false
     perform_entity_resolution: true
     neo4j_database: myDb
     on_error: IGNORE
@@ -578,7 +581,7 @@ Each of these components can be run individually:
 .. code:: python
 
     import asyncio
-    from neo4j_graphrag.experimental.components.pdf_loader import PdfLoader
+    from neo4j_graphrag.experimental.components.data_loader import PdfLoader
     my_component = PdfLoader()
     asyncio.run(my_component.run("my_file.pdf"))
 
@@ -588,7 +591,7 @@ They can also be used within a pipeline:
 .. code:: python
 
     from neo4j_graphrag.experimental.pipeline import Pipeline
-    from neo4j_graphrag.experimental.components.pdf_loader import PdfLoader
+    from neo4j_graphrag.experimental.components.data_loader import PdfLoader
     pipeline = Pipeline()
     my_component = PdfLoader()
     pipeline.add_component(my_component, "component_name")
@@ -604,7 +607,7 @@ This package currently supports text extraction from PDFs:
 .. code:: python
 
     from pathlib import Path
-    from neo4j_graphrag.experimental.components.pdf_loader import PdfLoader
+    from neo4j_graphrag.experimental.components.data_loader import PdfLoader
 
     loader = PdfLoader()
     await loader.run(filepath=Path("my_file.pdf"))
@@ -614,12 +617,13 @@ To implement your own loader, use the `DataLoader` interface:
 .. code:: python
 
     from pathlib import Path
-    from neo4j_graphrag.experimental.components.pdf_loader import DataLoader, PdfDocument
+    from neo4j_graphrag.experimental.components.data_loader import DataLoader
+    from neo4j_graphrag.experimental.components.types import LoadedDocument
 
     class MyDataLoader(DataLoader):
-        async def run(self, filepath: Path, metadata: Optional[Dict[str, str]] = None) -> PdfDocument:
+        async def run(self, filepath: Path, metadata: Optional[Dict[str, str]] = None) -> LoadedDocument:
             # process file in `filepath`
-            return PdfDocument(
+            return LoadedDocument(
                 text="text",
                 document_info=DocumentInfo(
                     path=str(filepath),

examples/build_graph/automatic_schema_extraction/simple_kg_builder_schema_from_pdf.py

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ async def run_kg_pipeline_with_auto_schema() -> None:
         llm=llm,
         driver=driver,
         embedder=embedder,
-        from_pdf=True,
+        from_file=True,
     )
 
     print(f"Processing PDF file: {PDF_FILE}")

examples/build_graph/automatic_schema_extraction/simple_kg_builder_schema_from_text.py

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ async def run_kg_pipeline_with_auto_schema() -> None:
         llm=llm,
         driver=driver,
         embedder=embedder,
-        from_pdf=False, # Using raw text input, not PDF
+        from_file=False, # Using raw text input, not PDF
     )
 
     # Run the pipeline on the text

examples/build_graph/from_config_files/simple_kg_pipeline_config.json

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@
       }
     }
   },
-  "from_pdf": false,
+  "from_file": false,
   "schema": {
     "node_types": [
       "Person",

examples/build_graph/from_config_files/simple_kg_pipeline_config.yaml

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ embedder_config:
     api_key:
       resolver_: ENV
       var_: OPENAI_API_KEY
-from_pdf: false
+from_file: false
 schema:
   node_types:
     - label: Person

examples/build_graph/from_config_files/simple_kg_pipeline_config_url.json

Lines changed: 3 additions & 3 deletions
@@ -36,7 +36,7 @@
       }
     }
   },
-  "from_pdf": true,
+  "from_file": true,
   "schema": {
     "node_types": [
       "Person",
@@ -105,8 +105,8 @@
       "chunk_overlap": 10
     }
   },
-  "pdf_loader": {
-    "class_": "pdf_loader.PdfLoader",
+  "file_loader": {
+    "class_": "data_loader.PdfLoader",
     "run_params_": {
       "fs": "http"
     }
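
Note on migrating existing config files: the renames in the hunks above (`from_pdf` to `from_file`, `pdf_loader` to `file_loader`, and the `class_` path `pdf_loader.PdfLoader` to `data_loader.PdfLoader`) can be applied mechanically to an already-parsed config. The helper below is a hypothetical sketch, not part of the library; it assumes the legacy keys sit at the top level of the config dictionary and that a new-style key should win if both spellings are present.

```python
from typing import Any, Dict

# Legacy top-level key -> new key, as shown in the diff above.
LEGACY_KEYS = {"from_pdf": "from_file", "pdf_loader": "file_loader"}


def migrate_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `config` with legacy key names replaced."""
    # Start with renamed legacy entries, then let new-style keys override them.
    migrated = {LEGACY_KEYS[k]: v for k, v in config.items() if k in LEGACY_KEYS}
    migrated.update({k: v for k, v in config.items() if k not in LEGACY_KEYS})

    # Update the loader's class path if it still points at the old module.
    loader = migrated.get("file_loader")
    if isinstance(loader, dict) and loader.get("class_") == "pdf_loader.PdfLoader":
        migrated["file_loader"] = {**loader, "class_": "data_loader.PdfLoader"}
    return migrated
```

The input dictionary is left unmodified, so the function can be run over a config loaded with `json.load` and the result written back out for review before replacing the original file.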
