Hello Hong,
Congratulations on your recent paper "Context Rot" - I found the work quite insightful, particularly the discussion of token utilization strategies.
I noticed in your tweet video (timestamp 2:46) that your team used 300 tokens for processing. This observation sparked a hypothesis I've been exploring regarding the relationship between token count and search granularity in embedding systems.
My Core Hypothesis:
I believe that using fewer tokens per chunk results in higher information density per token, which in turn improves granular-level search performance.
The Question That Started This:
While examining OpenAI's file search documentation, I noticed they recommend 800-token chunks with 400-token overlap. This puzzled me - why use only 800 tokens when the text-embedding-3-large model can handle 8,192 tokens? That's utilizing just 10% of the model's capacity.
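To make the chunking scheme concrete, here is a minimal sketch of the kind of windowing those defaults describe (my own illustration, not OpenAI's actual implementation): an 800-token window that slides forward 400 tokens at a time, so consecutive chunks share half their tokens.

```python
# Sketch of overlapping chunking (illustration only, not OpenAI's code):
# slide an 800-token window forward 400 tokens at a time, so consecutive
# chunks share half their tokens.

def chunk_tokens(tokens, chunk_size=800, overlap=400):
    """Split a token list into fixed-size windows with the given overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final window already reaches the end of the input
    return chunks

# Placeholder "tokens"; a real pipeline would produce IDs with a tokenizer.
tokens = list(range(2000))
chunks = chunk_tokens(tokens)
# -> 4 chunks, starting at token offsets 0, 400, 800, 1200
```

With these defaults every token (away from the edges) appears in two chunks, which gives a retriever two chances to surface it - a hedge against exactly the dilution effect described below.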
My Reasoning:
Consider a scenario where a chunk of exactly 8,192 tokens is embedded into a single 3,072-dimensional vector. If we need to retrieve one small piece of information from that chunk, the probability that the chunk's embedding ranks highly in a cosine-similarity search becomes quite low: the information we seek gets "diluted" across the larger token span.
My hypothesis suggests that with fewer tokens per chunk, the model can allocate higher information density per token, making retrieval more precise and effective.
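The dilution intuition can be made concrete with a deliberately stylized toy model (my own assumption for illustration - real embedding models are trained representations, not mean pools of orthogonal per-token vectors): represent each token as a distinct basis vector and the chunk embedding as the mean of its token vectors. The "needle" token's cosine similarity to its chunk's embedding then falls exactly as 1/sqrt(n) with chunk length n.

```python
import math

DIM = 1024  # toy dimension, large enough for one basis vector per token

def basis(i):
    """Unit vector along axis i - a stand-in for one token's representation."""
    v = [0.0] * DIM
    v[i] = 1.0
    return v

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def mean_pool(vectors):
    """Toy chunk embedding: the mean of the chunk's token vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(DIM)]

def needle_similarity(chunk_len):
    """Cosine similarity between a 'needle' token and a chunk containing it
    plus (chunk_len - 1) mutually orthogonal distractor tokens."""
    chunk = [basis(i) for i in range(chunk_len)]
    return cosine(chunk[0], mean_pool(chunk))

sims = {n: needle_similarity(n) for n in (4, 64, 1024)}
# sims -> {4: 0.5, 64: 0.125, 1024: 0.03125}, i.e. exactly 1/sqrt(n)
```

A real embedding model will not decay exactly as 1/sqrt(n), but the geometry shows why a single fixed-size vector summarizing more tokens leaves less "room" for any one fact - which is the effect my hypothesis predicts smaller chunks avoid.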
Request for Feedback:
I've documented this hypothesis in detail with supporting analysis and visualizations. Given your expertise in this area, I would greatly value your thoughts on:
- Whether this direction of research shows promise
- Any potential flaws in my reasoning
- Suggestions for further investigation
Document link: https://docs.google.com/document/d/1CpgLZAlwQ-q5v_d0g27GlxdpbHbGflcDTLC1YVodty4/edit?tab=t.0
Thank you for your time and consideration. I look forward to any insights you might share.