Paraphrased questions were not clustering together. The initial threshold of 0.82 was too high, and even after it was lowered to 0.75, semantically similar questions still failed to match: the tested paraphrases scored between 0.71 and 0.75.
- Added a `/debug` endpoint to inspect similarity scores
- Tested with paraphrased variations of a base question
- Analyzed score distribution for similar vs. different questions
- Set threshold based on empirical data
Base question: "How do I reset my password?"
| Question | Similarity Score |
|---|---|
| "Where do I change my password?" | 0.7487 |
| "I forgot my password, how do I get a new one?" | 0.7292 |
| "Password reset instructions?" | 0.7095 |
| Question | Similarity Score |
|---|---|
| "How do I update my email address?" | 0.5461 |
| Question | Similarity Score |
|---|---|
| "What time is the team meeting tomorrow?" | 0.1017 |
| Score | Meaning | Action |
|---|---|---|
| ≥ 0.70 | Paraphrased questions (same intent, different wording) | Clustered together |
| 0.55–0.70 | Related topic, different question (e.g., password vs email) | Not clustered |
| < 0.55 | Unrelated questions | Not clustered |
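
The bands in the table above can be expressed as a small lookup. This is a minimal sketch rather than the project's actual code: the function name `classify_score` and the label strings are hypothetical, and only the 0.70 and 0.55 cutoffs come from the table.

```python
def classify_score(score: float,
                   cluster_threshold: float = 0.70,
                   related_floor: float = 0.55) -> str:
    """Map a cosine similarity score to one of the three bands in the table.

    The function name and the returned labels are illustrative assumptions.
    """
    if score >= cluster_threshold:
        return "paraphrase"  # same intent, different wording: clustered together
    if score >= related_floor:
        return "related"     # related topic, different question: not clustered
    return "unrelated"       # not clustered


if __name__ == "__main__":
    # Scores taken from the measurements reported above.
    for s in (0.7487, 0.7095, 0.5461, 0.1017):
        print(s, classify_score(s))
```

Note that the email example (0.5461) sits just under the 0.55 floor, so this sketch labels it `unrelated` rather than `related`. The clustering outcome is the same either way, since only scores at or above 0.70 are clustered.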
Set threshold to 0.70
Reasoning:
- All three tested paraphrases scored ≥ 0.70 (0.7095–0.7487)
- Related but distinct questions fall in or just below the 0.55–0.70 band (the email example scored 0.5461): close enough to seem similar, but different enough to warrant separate FAQ entries
- Unrelated questions score well below 0.55
- 0.70 captures natural paraphrasing while avoiding false matches
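
As a quick check on this reasoning, the margins on either side of the threshold can be computed from the scores reported above. The variable names are illustrative; the numbers are the measured ones.

```python
# Scores reported above for the base question "How do I reset my password?"
paraphrase_scores = [0.7487, 0.7292, 0.7095]
other_scores = [0.5461, 0.1017]  # related-but-distinct and unrelated examples
THRESHOLD = 0.70

lowest_paraphrase = min(paraphrase_scores)
highest_other = max(other_scores)

# Headroom between the weakest paraphrase and the cutoff.
margin_above = round(lowest_paraphrase - THRESHOLD, 4)
# Distance between the cutoff and the strongest non-paraphrase.
margin_below = round(THRESHOLD - highest_other, 4)

print(f"lowest paraphrase {lowest_paraphrase}: {margin_above} above the threshold")
print(f"highest non-paraphrase {highest_other}: {margin_below} below the threshold")
```

On this small sample the threshold sits much closer to the weakest paraphrase (about 0.01) than to the nearest non-match (about 0.15), so a slightly weaker paraphrase could fall below the cutoff.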
Tested with 3 paraphrased questions:
- "How do I reset my password?" → new
- "Where do I change my password?" → matched (cluster_id: 6, count: 2)
- "I forgot my password, how do I get a new one?" → matched (cluster_id: 6, count: 3)
✅ All three questions clustered together correctly.
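
The matching behavior seen in this test (first question creates a cluster, later paraphrases join it and increment its count) can be sketched as a greedy assignment. This is an illustration under assumptions, not the project's implementation: the class and field names are hypothetical, each cluster is represented by the embedding of its first question, and the toy 2-D vectors stand in for real 1536-dimensional embeddings.

```python
import math
from dataclasses import dataclass


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class Cluster:
    cluster_id: int
    representative: list  # assumption: embedding of the first question in the cluster
    count: int = 1


class QuestionClusterer:
    """Hypothetical greedy clusterer: join the best cluster if it clears the threshold."""

    def __init__(self, threshold=0.70):
        self.threshold = threshold
        self.clusters = []

    def assign(self, embedding):
        best, best_score = None, -1.0
        for cluster in self.clusters:
            score = cosine_similarity(embedding, cluster.representative)
            if score > best_score:
                best, best_score = cluster, score
        if best is not None and best_score >= self.threshold:
            best.count += 1
            return {"status": "matched", "cluster_id": best.cluster_id, "count": best.count}
        # No cluster is similar enough: start a new one.
        cluster = Cluster(cluster_id=len(self.clusters) + 1, representative=list(embedding))
        self.clusters.append(cluster)
        return {"status": "new", "cluster_id": cluster.cluster_id, "count": 1}


if __name__ == "__main__":
    clusterer = QuestionClusterer(threshold=0.70)
    # Toy vectors only; real inputs would be embedding vectors from the model.
    for vec in ([1.0, 0.0], [0.9, 0.3], [0.0, 1.0]):
        print(clusterer.assign(vec))
```

Comparing only against the first question's embedding keeps the sketch simple; a centroid that is updated as questions join would be another reasonable choice.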
- OpenAI `text-embedding-3-small` (1536 dimensions)
- Cosine similarity metric
- The 0.70 threshold was calibrated against a small subset of common IT and HR questions (password resets, VPN access, PTO requests) to establish a quick working baseline
- Production use across other domains (engineering, legal, finance) would likely require additional testing and threshold adjustment
- This threshold may also shift based on:
  - Domain-specific terminology
  - Question length variations
  - Multi-language support
- Consider adding a `/tune` endpoint that suggests optimal thresholds based on labeled training data
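
One way such a suggestion could work is a simple sweep over candidate thresholds, scored against labeled pairs. The sketch below is an assumption about how this might be done, not an existing endpoint: `suggest_threshold` is a hypothetical name, F1 is one possible selection criterion among several, and the example data reuses the five scores measured above.

```python
def suggest_threshold(labeled_scores):
    """Suggest a threshold from (score, should_cluster) pairs by maximizing F1.

    Candidates are the midpoints between adjacent distinct scores. On ties,
    the lowest candidate wins because only strict improvements replace it.
    """
    if not labeled_scores:
        raise ValueError("labeled_scores must not be empty")
    scores = sorted({score for score, _ in labeled_scores})
    candidates = [(lo + hi) / 2 for lo, hi in zip(scores, scores[1:])] or scores

    best_threshold, best_f1 = None, -1.0
    for t in candidates:
        tp = sum(1 for s, same in labeled_scores if s >= t and same)
        fp = sum(1 for s, same in labeled_scores if s >= t and not same)
        fn = sum(1 for s, same in labeled_scores if s < t and same)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1


if __name__ == "__main__":
    # (similarity to the base question, whether it should share a cluster)
    labeled = [
        (0.7487, True),
        (0.7292, True),
        (0.7095, True),
        (0.5461, False),
        (0.1017, False),
    ]
    threshold, f1 = suggest_threshold(labeled)
    print(f"suggested threshold: {threshold:.4f} (F1 = {f1:.2f})")
```

With only these five scores, any cutoff between 0.5461 and 0.7095 separates the two groups perfectly, and the sweep picks the midpoint of that gap (roughly 0.63). A meaningful suggestion would need a much larger labeled set covering each target domain.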