Open
Commits
80 commits
a2a4f86
implement md-header-splitter and add tests
OGuggenbuehl Jul 11, 2025
4337a5b
rework md-header splitter to rewrite md-header levels
OGuggenbuehl Jul 29, 2025
393cd53
remove deprecated test
OGuggenbuehl Jul 29, 2025
de6b0d9
Merge branch 'main' into feature/md-header-splitter
OGuggenbuehl Aug 11, 2025
0e9f955
Update haystack/components/preprocessors/markdown_header_splitter.py
OGuggenbuehl Sep 9, 2025
fad1ed7
use native types
OGuggenbuehl Sep 9, 2025
8910485
move to haystack logging
OGuggenbuehl Sep 9, 2025
b3114e6
Update haystack/components/preprocessors/markdown_header_splitter.py
OGuggenbuehl Sep 9, 2025
2abec16
docstrings improvements
OGuggenbuehl Sep 9, 2025
3917116
Merge branch 'feature/md-header-splitter' of https://github.com/OGugg…
OGuggenbuehl Sep 9, 2025
7f92dc9
fix CustomDocumentSplitter arguments
OGuggenbuehl Sep 9, 2025
6d75b58
remove header prefix from content
OGuggenbuehl Sep 9, 2025
c1bb05e
rework split_id assignment to avoid collisions
OGuggenbuehl Sep 9, 2025
970ec90
remove unneeded dese methods
OGuggenbuehl Sep 9, 2025
169cb06
cleanup
OGuggenbuehl Sep 9, 2025
3dc0504
cleanup
OGuggenbuehl Sep 9, 2025
bcbbf9a
add tests
OGuggenbuehl Sep 16, 2025
c7a8756
move initialization of secondary-splitter out of run method
OGuggenbuehl Sep 19, 2025
356ca73
move _custom_document_splitter to class method
OGuggenbuehl Sep 19, 2025
5dde973
removed the _CustomDocumentSplitter class. splitting logic is now enc…
OGuggenbuehl Sep 19, 2025
59c81c7
return to standard feed-forward character and add tests for page brea…
OGuggenbuehl Sep 19, 2025
8fc8281
quit exposing splitting_function param since it shouldn't be changed …
OGuggenbuehl Sep 19, 2025
191d98d
remove test section in module
OGuggenbuehl Sep 19, 2025
3e76544
add license header
OGuggenbuehl Sep 19, 2025
ed5dc6f
add release note
OGuggenbuehl Sep 19, 2025
38e04e7
minor refactor for type safety
OGuggenbuehl Sep 23, 2025
6518ce4
Update haystack/components/preprocessors/markdown_header_splitter.py
OGuggenbuehl Sep 23, 2025
21451f1
remove unneeded release notes entries
OGuggenbuehl Sep 23, 2025
a6028a0
improved documentation for methods
OGuggenbuehl Sep 23, 2025
aca4d4c
improve method naming
OGuggenbuehl Sep 23, 2025
7ef16a7
improved page-number assignment & added return in docstring
OGuggenbuehl Sep 23, 2025
876b244
Merge branch 'main' into feature/md-header-splitter
OGuggenbuehl Sep 23, 2025
5203603
unified page-counting
OGuggenbuehl Sep 24, 2025
debe17e
simplify conditional secondary-split initialization and usage
OGuggenbuehl Sep 24, 2025
fc2cc58
Merge branch 'feature/md-header-splitter' of https://github.com/OGugg…
OGuggenbuehl Sep 24, 2025
b74cefc
fix linting error
OGuggenbuehl Sep 24, 2025
e7e9872
clearly specify the use of ATX-style headers (#) only
OGuggenbuehl Sep 24, 2025
f66e77b
reference doc_id when logging no headers found
OGuggenbuehl Sep 24, 2025
445ffe8
initialize md-header pattern as private variable once
OGuggenbuehl Sep 24, 2025
1b2160b
add example for inferring header levels to docstring
OGuggenbuehl Sep 25, 2025
94218fa
improve empty document handling
OGuggenbuehl Sep 25, 2025
b6e2486
more explicit testing for inferred headers
OGuggenbuehl Sep 25, 2025
530eafa
fix linting issue
OGuggenbuehl Sep 25, 2025
44e0454
improved empty content handling test cases
OGuggenbuehl Sep 26, 2025
47e3b9e
remove all functionality related to inferring md-header levels
OGuggenbuehl Sep 29, 2025
12fbf8b
compile regex-pattern in init for performance gains
OGuggenbuehl Sep 30, 2025
393f13f
Update haystack/components/preprocessors/markdown_header_splitter.py
OGuggenbuehl Oct 13, 2025
85d9553
change all "none" to proper None values
OGuggenbuehl Oct 13, 2025
5aaec38
fix minor
OGuggenbuehl Oct 13, 2025
00799f6
explicitly test doc content
OGuggenbuehl Oct 13, 2025
645ec7f
rename parentheaders to parent_headers
OGuggenbuehl Oct 13, 2025
da16dd9
test split_id, doc length
OGuggenbuehl Oct 13, 2025
4769715
check meta content
OGuggenbuehl Oct 13, 2025
9b68b76
remove unneeded test
OGuggenbuehl Oct 13, 2025
020a2fe
make split_id testing more robust
OGuggenbuehl Oct 13, 2025
26c7825
more realistic overlap test sample
OGuggenbuehl Oct 14, 2025
b40036a
assign split_id globally to all output docs
OGuggenbuehl Oct 14, 2025
fb6ed86
test page numbers explicitly
OGuggenbuehl Oct 14, 2025
a00f758
cleanup pagebreak test
OGuggenbuehl Oct 14, 2025
186115f
minor
OGuggenbuehl Oct 14, 2025
88a0460
return doc unchunked if no headers have content
OGuggenbuehl Oct 14, 2025
07ff103
add doc-id to logging statement for unsplit documents
OGuggenbuehl Oct 16, 2025
83c7c07
remove unneeded logs
OGuggenbuehl Oct 16, 2025
3bd6176
minor cleanup
OGuggenbuehl Oct 16, 2025
51b093e
simplify page-number tracking method to not return count, just the up…
OGuggenbuehl Oct 16, 2025
e333d12
add dev comment to mypy check for doc.content is None
OGuggenbuehl Oct 16, 2025
6e348f8
Update haystack/components/preprocessors/markdown_header_splitter.py
OGuggenbuehl Oct 16, 2025
5a4c74f
Merge branch 'feature/md-header-splitter' of https://github.com/OGugg…
OGuggenbuehl Oct 16, 2025
a77253c
remove split meta flattening
OGuggenbuehl Oct 16, 2025
489bffd
keep empty meta return consistent
OGuggenbuehl Oct 16, 2025
7ac9338
remove unneeded content is none check
OGuggenbuehl Oct 16, 2025
39c0c17
update tests to reflect empty meta dict for unsplit docs
OGuggenbuehl Oct 16, 2025
2881178
clean up total_page counts
OGuggenbuehl Oct 16, 2025
3fe2882
remove unneeded meta check
OGuggenbuehl Oct 16, 2025
78083d2
Update test/components/preprocessors/test_markdown_header_splitter.py
OGuggenbuehl Oct 16, 2025
29b92a6
implement keep_headers parameter
OGuggenbuehl Oct 17, 2025
18ffc54
remove meta-dict flattening
OGuggenbuehl Oct 17, 2025
abb2b34
Update test/components/preprocessors/test_markdown_header_splitter.py
OGuggenbuehl Oct 21, 2025
2174cc2
add minor sanity checks
OGuggenbuehl Oct 21, 2025
35fc3ab
Merge branch 'feature/md-header-splitter' of https://github.com/OGugg…
OGuggenbuehl Oct 21, 2025