Open
Labels
bug (Something isn't working)
Description
Before You Report a Bug, Please Confirm You Have Done The Following...
- I have updated to the latest version of the packages.
- I have searched for both existing issues and closed issues and found none that matched my issue.
neo4j-graphrag-python's version
1.12.0
Python version
3.12.3
Operating System
Linux Mint 22.2
Dependencies
We run the pipeline on Markdown files (not PDFs).
These files can contain very large tables. FixedSizeSplitter can enter an infinite loop when chunk_size is smaller than the length of a table separator row (a long run of dashes with no spaces).
The bug only occurs when the approximate parameter is True and chunk_overlap is not 0. It is related to the logic that looks for a space character at which to split the chunk: in this example, the text contains a run of non-space characters longer than chunk_size.
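To make the suspected failure mode concrete, here is a minimal sketch of a whitespace-aware splitter with an explicit forward-progress guard. It is not the library's implementation: the function name `split_with_progress_guard` and its exact back-off rule are assumptions made for illustration. It shows one plausible way such a loop can stall (the chunk end backs off to an early whitespace, and subtracting the overlap puts the next start at or before the current start) and how a guard avoids it.

```python
def split_with_progress_guard(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Hypothetical splitter (not neo4j-graphrag code) that always moves forward."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    chunks: list[str] = []
    n = len(text)
    start = 0
    while start < n:
        hard_end = min(start + chunk_size, n)
        end = hard_end
        if end < n:
            # Back off until the cut sits next to a whitespace character.
            while end > start and not text[end - 1].isspace() and not text[end].isspace():
                end -= 1
            if end == start:
                # No whitespace anywhere in the window: fall back to a hard cut.
                end = hard_end
        chunks.append(text[start:end])
        if end >= n:
            break
        next_start = end - chunk_overlap
        # Guard: without this, a short chunk followed by a long unbroken run
        # can leave next_start <= start, and the loop never terminates.
        if next_start <= start:
            next_start = end
        start = next_start
    return chunks
```

With the guard in place, `start` strictly increases on every iteration, so the loop terminates even when an unbroken run of characters is longer than `chunk_size`. The cost is that the overlap is dropped for the chunk where the guard fires.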
Reproducible example
import asyncio

from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter


async def testcase_that_creates_an_infinite_loop():
    text = """
| xxx-xxxxxx xxxxx | |
|---------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| xxxxxxx xxxxxxx | xxxxxxx xxxxxxxxxxx xx xxx xxxxxxxx xxxxxxxxxxxx xx xxxxxxxxxx xxxxxxxxx xxxxxx xxx xxxxxxxxxxxxx xxxxxxxxxx xxxxxxx xxx+xx xx xxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxx xxxxxx xxxx xx xxxxxxxxxx xxxxxxxxxx xxxxxxxxx xxxxx xxx (x.xxxx) xxx xxxxxxxx xxxxxxxxxxxx xx xxxxxxxxxx xxxxxxxxx xxxxxx xxx xxxxxxxxxxxxx xxxxxxxxxx xxxxxxxxx xxxxx xxx (x.xxxx) xxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxx xxxxxx xxxx xx xxxxxxxxxx xxxxxxxxxx |
"""
    sp = FixedSizeSplitter(chunk_size=500, chunk_overlap=100, approximate=True)
    await sp.run(text)


if __name__ == "__main__":
    asyncio.run(testcase_that_creates_an_infinite_loop())
Relevant Log Output
No response
Expected Result
The call to sp.run should return the list of chunks.
What happened instead?
The code never returns from the call to sp.run and the program is stuck.
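Until the splitter handles this case, one possible workaround is to shorten table separator rows before splitting, since Markdown renderers generally treat `|---|` the same as a longer run of dashes. The helper below is a sketch only: the name `shorten_table_separators` is hypothetical, and it has not been verified against version 1.12.0. It would also not help with other long unbroken tokens such as URLs.

```python
import re

# Matches a Markdown table separator row such as "|------|:--------:|".
_SEPARATOR_ROW = re.compile(r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?\s*$")
# Any run of four or more dashes inside such a row.
_DASH_RUN = re.compile(r"-{4,}")


def shorten_table_separators(markdown: str) -> str:
    """Collapse long dash runs in table separator rows to three dashes."""
    lines = []
    for line in markdown.split("\n"):
        # Require a pipe so plain horizontal rules ("-----") are left alone.
        if "|" in line and _SEPARATOR_ROW.match(line):
            line = _DASH_RUN.sub("---", line)
        lines.append(line)
    return "\n".join(lines)
```

The cleaned text would then be passed to the splitter in place of the original, for example `await sp.run(shorten_table_separators(text))`.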
Additional Info
No response