[BUG]: FixedSizeSplitter goes to infinite loop for markdown input #471

@spoonless

Description

Before You Report a Bug, Please Confirm You Have Done The Following...

  • I have updated to the latest version of the packages.
  • I have searched for both existing issues and closed issues and found none that matched my issue.

neo4j-graphrag-python's version

1.12.0

Python version

3.12.3

Operating System

Linux Mint 22.2

Dependencies

We run the pipeline on Markdown files (not PDF), and these files can contain very large tables. The FixedSizeSplitter can enter an infinite loop when chunk_size is smaller than the length of a table separator row (a long run of dashes with no spaces).
This bug only happens when the approximate parameter is True and chunk_overlap is not 0. It is related to the algorithm that looks for the best index at which to split the chunk on a space character. In this particular example, the text contains a run of characters longer than chunk_size with no space character in it.
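
To make the suspected failure mode concrete, here is a minimal standalone sketch. It is not the library's code: snap_start, snap_end, split_approximate, the guard flag and the max_iterations cap are all hypothetical names made up for illustration, and the real FixedSizeSplitter may differ in its details. The sketch assumes a splitter that moves each chunk start back to the previous whitespace. When a run without whitespace is longer than chunk_size, the start keeps snapping back to the same index, so the loop makes no progress.

```python
def snap_start(text: str, approx_start: int) -> int:
    """Move a start index left to the first character of the word it lands in."""
    start = approx_start
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    return start


def snap_end(text: str, start: int, approx_end: int) -> int:
    """Move an end index left to a word boundary; fall back if there is none."""
    if approx_end >= len(text):
        return len(text)
    end = approx_end
    while end > start and not text[end - 1].isspace():
        end -= 1
    return end if end > start else approx_end


def split_approximate(text, chunk_size, chunk_overlap, guard, max_iterations=10_000):
    """Hypothetical whitespace-aware splitter (illustration only).

    With guard=False, a long run without whitespace makes the start index
    snap back to the same position forever. The iteration cap turns that
    stall into an exception so the sketch itself always terminates.
    """
    step = chunk_size - chunk_overlap
    chunks = []
    approx_start, prev_start, end = 0, -1, 0
    for _ in range(max_iterations):
        if end >= len(text):
            return chunks
        start = snap_start(text, approx_start)
        if guard and start <= prev_start:
            # Snapping would not move forward: keep the unadjusted start,
            # even though this may cut a "word" in the middle.
            start = approx_start
        end = snap_end(text, start, min(start + chunk_size, len(text)))
        chunks.append(text[start:end])
        prev_start = start
        approx_start = start + step
    raise RuntimeError("splitter made no progress")


text = "a " + "-" * 50  # a 50-character run with no whitespace
chunks = split_approximate(text, chunk_size=10, chunk_overlap=3, guard=True)
```

In this sketch, calling split_approximate with guard=False on the same text hits the iteration cap and raises RuntimeError, which mirrors the hang described here. The forward-progress check is only one possible way to guarantee termination; whether it fits the actual implementation is an open question.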

Reproducible example

from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
import asyncio


async def testcase_that_creates_an_infinite_loop():
    text = """
| xxx-xxxxxx xxxxx                |                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|---------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| xxxxxxx xxxxxxx                 | xxxxxxx xxxxxxxxxxx xx xxx xxxxxxxx xxxxxxxxxxxx xx xxxxxxxxxx xxxxxxxxx xxxxxx xxx xxxxxxxxxxxxx xxxxxxxxxx xxxxxxx xxx+xx xx xxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxx xxxxxx xxxx xx xxxxxxxxxx xxxxxxxxxx xxxxxxxxx xxxxx xxx (x.xxxx) xxx xxxxxxxx xxxxxxxxxxxx xx xxxxxxxxxx xxxxxxxxx xxxxxx xxx xxxxxxxxxxxxx xxxxxxxxxx xxxxxxxxx xxxxx xxx (x.xxxx) xxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxx xxxxxx xxxx xx xxxxxxxxxx xxxxxxxxxx |
"""

    sp = FixedSizeSplitter(chunk_size=500, chunk_overlap=100, approximate=True)
    await sp.run(text)


if __name__ == "__main__":
    asyncio.run(testcase_that_creates_an_infinite_loop())

Relevant Log Output

No response

Expected Result

The call to sp.run should return the list of chunks.

What happened instead?

The code never returns from the call to sp.run and the program is stuck.

Additional Info

No response

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working)
