Closed
Labels: :ml, :ml/Chunking, >enhancement, Feature:GenAI, Feature:NLP, Team:ML
Description
The purpose of this issue is to implement a basic chunking process for the Elastic reranker and evaluate it against the existing truncation process.
The proposed chunking process:
- Build a chunking strategy (see the Python sketch after this list):
  - `"strategy": "sentence"`
  - `"max_chunk_size": max(elastic_reranker_max_token_limit - query_token_count, elastic_reranker_max_token_limit / 2) * words_per_token`
  - `"sentence_overlap": 0`
- Note: The Elastic reranker is optimized for English text, and each document must be concatenated with the query before scoring.
- Note: Chunking in ES currently uses word count as the unit for `max_chunk_size`, and there is no way to calculate the exact token count for a given query. We therefore use a conversion rate of 1 token = ¾ word to convert between the two units.
- Chunk each document and send all chunks from all documents to the Elastic reranker.
- Parse the Elastic reranker response to return a single relevance score per document, corresponding to the highest relevance score among any of its chunks (sketched below).
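To make the chunk-size arithmetic and the max-score aggregation concrete, here is a minimal Python sketch. `MAX_TOKEN_LIMIT`, `WORDS_PER_TOKEN`, and the helper names are illustrative assumptions rather than actual Elasticsearch code; only the three configuration keys mirror the strategy above.

```python
# A minimal sketch of the proposed chunk-size calculation and per-document
# score aggregation. MAX_TOKEN_LIMIT, WORDS_PER_TOKEN, and the helpers are
# illustrative assumptions, not actual Elasticsearch code.

MAX_TOKEN_LIMIT = 512     # assumed token limit for the Elastic reranker
WORDS_PER_TOKEN = 0.75    # 1 token = 3/4 word, per the note above


def estimate_token_count(text: str) -> int:
    """Rough token estimate: invert the 1 token = 3/4 word conversion."""
    return round(len(text.split()) / WORDS_PER_TOKEN)


def build_chunking_settings(query: str) -> dict:
    """Build the proposed sentence-chunking configuration for one query."""
    query_tokens = estimate_token_count(query)
    # Leave room for the query, but never shrink below half the limit.
    chunk_tokens = max(MAX_TOKEN_LIMIT - query_tokens, MAX_TOKEN_LIMIT // 2)
    return {
        "strategy": "sentence",
        # max_chunk_size is measured in words, so convert tokens -> words.
        "max_chunk_size": int(chunk_tokens * WORDS_PER_TOKEN),
        "sentence_overlap": 0,
    }


def doc_scores(chunk_scores: dict[str, list[float]]) -> dict[str, float]:
    """Collapse per-chunk scores to one score per document: max over chunks."""
    return {doc_id: max(scores) for doc_id, scores in chunk_scores.items()}


print(build_chunking_settings("what is the capital of france"))
# e.g. {'strategy': 'sentence', 'max_chunk_size': 378, 'sentence_overlap': 0}
```

Taking the max over chunk scores ranks a document by its best-matching passage, which is the usual choice when a reranker only sees fixed-size windows of each document.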
The proposed evaluation process (to run with both truncation and chunking):
- Create an ECH cluster.
- Ingest a test data set.
- Run BM25 retrieval and identify some metrics to calculate its rerank accuracy.
- Run retrieval using the text similarity reranker and identify some metrics to calculate its rerank accuracy (see the sketch after this list).
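A hedged sketch of the two retrieval runs to compare. The index name, field name, and inference endpoint id are assumptions; the reranked request follows the `text_similarity_reranker` retriever shape from the `_search` API, which should be verified against the ES version under test.

```python
# Sketch of the two evaluation runs: BM25 baseline vs. reranked retrieval.
# ES URL, index, field names, and the inference endpoint id are assumed.
import requests

ES = "http://localhost:9200"
INDEX = "test-dataset"          # assumed index name
QUERY = "example user query"

# Baseline: plain BM25 retrieval.
bm25_body = {
    "query": {"match": {"text": QUERY}},
    "size": 10,
}

# Reranked: BM25 first stage + text_similarity_reranker second stage.
reranked_body = {
    "retriever": {
        "text_similarity_reranker": {
            "retriever": {"standard": {"query": {"match": {"text": QUERY}}}},
            "field": "text",
            "inference_id": ".rerank-v1-elasticsearch",  # assumed endpoint id
            "inference_text": QUERY,
            "rank_window_size": 100,
        }
    },
    "size": 10,
}

for name, body in [("bm25", bm25_body), ("reranked", reranked_body)]:
    hits = requests.post(f"{ES}/{INDEX}/_search", json=body).json()["hits"]["hits"]
    print(name, [h["_id"] for h in hits])
    # Feed each returned ranking plus relevance judgments into a metric
    # such as nDCG@10 or MRR to quantify rerank accuracy.
```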