
Commit 998c3ba

total bin coverage for default_transform() in Knowledge Graph transformations (#1950)
# Problem

default_transform() bins token lengths up to 100k (the 0-100k interval) into three bins. For documents with token length >100k (and for length 0), the function raises:

```python
raise ValueError(
    "Documents appears to be too short (ie 100 tokens or less). Please provide longer documents."
)
```

This message covers the case of empty documents, but it is wrong for documents that violate the >100k upper bound.

# Solution (currently implemented)

I'm not sure about this solution, but my first approach was to change the last bin's upper limit to `inf`. This solves the problem easily but could be inefficient for very large documents:

```python
bin_ranges = [(0, 100), (101, 500), (501, float("inf"))]
```

# Better solution proposal (let's discuss this)

If the given document is larger than 100k tokens, split it in half and restart the transformation, repeating until the pieces fit into the initial bin sizes.
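The halving proposal could be sketched as follows. This is a minimal sketch, not the actual ragas API: `split_to_fit`, `count_tokens`, and the whitespace tokenizer are all hypothetical stand-ins, and 100k is the threshold from the original bin ranges.

```python
# Sketch of the proposed halving strategy (hypothetical helper names,
# not the ragas implementation). Documents longer than MAX_TOKENS are
# split in half recursively until every piece fits the original bins.
MAX_TOKENS = 100_000

def count_tokens(doc: str) -> int:
    # Placeholder tokenizer: whitespace split stands in for a real one.
    return len(doc.split())

def split_to_fit(doc: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Recursively halve `doc` until each piece is <= max_tokens."""
    if count_tokens(doc) <= max_tokens:
        return [doc]
    words = doc.split()
    mid = len(words) // 2
    left, right = " ".join(words[:mid]), " ".join(words[mid:])
    return split_to_fit(left, max_tokens) + split_to_fit(right, max_tokens)
```

Each piece would then re-enter the transformation with the original `(0, 100)`, `(101, 500)`, `(501, 100000)` bins unchanged.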
1 parent b1c7f79 commit 998c3ba

File tree

1 file changed: +2 −2 lines changed


src/ragas/testset/transforms/default.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -74,13 +74,13 @@ def filter_docs(node):

     def filter_chunks(node):
         return node.type == NodeType.CHUNK

-    bin_ranges = [(0, 100), (101, 500), (501, 100000)]
+    bin_ranges = [(0, 100), (101, 500), (501, float("inf"))]
     result = count_doc_length_bins(documents, bin_ranges)
     result = {k: v / len(documents) for k, v in result.items()}

     transforms = []

-    if result["501-100000"] >= 0.25:
+    if result["501-inf"] >= 0.25:
         headline_extractor = HeadlinesExtractor(
             llm=llm, filter_nodes=lambda node: filter_doc_with_num_tokens(node)
         )
```
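To see why the changed bin produces the key `"501-inf"`, here is a minimal stand-in for `count_doc_length_bins` (the real ragas helper may differ in signature and key format; this sketch only assumes inclusive `(low, high)` bins keyed as `"low-high"`, matching the keys visible in the diff).

```python
def count_doc_length_bins(doc_lengths, bin_ranges):
    """Count how many token lengths fall into each inclusive (low, high) bin.

    Keys use the "low-high" format; float("inf") renders as "inf", which is
    why the diff switches the lookup from "501-100000" to "501-inf".
    """
    counts = {f"{low}-{high:g}": 0 for low, high in bin_ranges}
    for n in doc_lengths:
        for low, high in bin_ranges:
            if low <= n <= high:
                counts[f"{low}-{high:g}"] += 1
                break
    return counts

bin_ranges = [(0, 100), (101, 500), (501, float("inf"))]
lengths = [50, 300, 150_000]  # third document exceeds the old 100k cap
result = count_doc_length_bins(lengths, bin_ranges)
```

With the old `(501, 100000)` upper bound, the 150k-token document would fall into no bin at all; with `float("inf")` it lands in `"501-inf"` and is counted.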
