
Commit 998c3ba

total bin coverage for default_transform() in Knowledge Graph transformations (#1950)
# Problem

default_transform() bins token lengths up to 100k (the 0-100k interval) into three bins. For documents with token length >100k (and for length 0), the function raises:

```python
raise ValueError(
    "Documents appears to be too short (ie 100 tokens or less). Please provide longer documents."
)
```

This message covers the case of empty documents, but it is wrong for documents that violate the >100k upper bound.

# Solution (currently implemented)

I'm not sure about this solution, but my first approach was to change the last bin's upper limit to `inf`. This solves the problem easily but could be inefficient for very large documents:

```python
bin_ranges = [(0, 100), (101, 500), (501, float("inf"))]
```

# Better solution proposal (let's discuss this)

If the given document is larger than 100k tokens, split it in half and restart the transformation, repeating until the pieces fit into the initial bin sizes.
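The halving proposal could be sketched as follows. This is a minimal sketch, not the actual ragas API: `split_to_fit`, `count_tokens`, and the whitespace tokenizer are all hypothetical stand-ins, and 100k is the threshold from the original bin ranges.

```python
# Sketch of the proposed halving strategy (hypothetical helper names,
# not the ragas implementation). Documents longer than MAX_TOKENS are
# split in half recursively until every piece fits the original bins.
MAX_TOKENS = 100_000

def count_tokens(doc: str) -> int:
    # Placeholder tokenizer: whitespace split stands in for a real one.
    return len(doc.split())

def split_to_fit(doc: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Recursively halve `doc` until each piece is <= max_tokens."""
    if count_tokens(doc) <= max_tokens:
        return [doc]
    words = doc.split()
    mid = len(words) // 2
    left, right = " ".join(words[:mid]), " ".join(words[mid:])
    return split_to_fit(left, max_tokens) + split_to_fit(right, max_tokens)
```

Each piece would then re-enter the transformation with the original `(0, 100)`, `(101, 500)`, `(501, 100000)` bins unchanged.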
1 parent b1c7f79 commit 998c3ba

File tree

1 file changed: +2 −2 lines changed


src/ragas/testset/transforms/default.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -74,13 +74,13 @@ def filter_docs(node):

     def filter_chunks(node):
         return node.type == NodeType.CHUNK

-    bin_ranges = [(0, 100), (101, 500), (501, 100000)]
+    bin_ranges = [(0, 100), (101, 500), (501, float("inf"))]
     result = count_doc_length_bins(documents, bin_ranges)
     result = {k: v / len(documents) for k, v in result.items()}

     transforms = []

-    if result["501-100000"] >= 0.25:
+    if result["501-inf"] >= 0.25:
         headline_extractor = HeadlinesExtractor(
             llm=llm, filter_nodes=lambda node: filter_doc_with_num_tokens(node)
         )
```
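To see why the changed bin produces the key `"501-inf"`, here is a minimal stand-in for `count_doc_length_bins` (the real ragas helper may differ in signature and key format; this sketch only assumes inclusive `(low, high)` bins keyed as `"low-high"`, matching the keys visible in the diff).

```python
def count_doc_length_bins(doc_lengths, bin_ranges):
    """Count how many token lengths fall into each inclusive (low, high) bin.

    Keys use the "low-high" format; float("inf") renders as "inf", which is
    why the diff switches the lookup from "501-100000" to "501-inf".
    """
    counts = {f"{low}-{high:g}": 0 for low, high in bin_ranges}
    for n in doc_lengths:
        for low, high in bin_ranges:
            if low <= n <= high:
                counts[f"{low}-{high:g}"] += 1
                break
    return counts

bin_ranges = [(0, 100), (101, 500), (501, float("inf"))]
lengths = [50, 300, 150_000]  # third document exceeds the old 100k cap
result = count_doc_length_bins(lengths, bin_ranges)
```

With the old `(501, 100000)` upper bound, the 150k-token document would fall into no bin at all; with `float("inf")` it lands in `"501-inf"` and is counted.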
