Skip to content

feat: add camelCase-aware search tokenization#667

Merged
DaleSeo merged 3 commits intomainfrom
AMS-424
Mar 4, 2026
Merged

feat: add camelCase-aware search tokenization#667
DaleSeo merged 3 commits intomainfrom
AMS-424

Conversation

@DaleSeo
Copy link
Member

@DaleSeo DaleSeo commented Feb 26, 2026

While working on apollographql/rover#3020, I realized that the MCP server's schema index does not split camelCase identifiers when tokenizing, so searching for "post" fails to match types like PostConnection, createPost, or CreatePostInput.

@apollo-librarian
Copy link

apollo-librarian bot commented Feb 26, 2026

✅ Docs preview has no changes

The preview was not built because there were no changes.

Build ID: 8d64e1fda84313b3d8140c83
Build Logs: View logs


✅ AI Style Review — No Changes Detected

No MDX files were changed in this pull request.

Review Log: View detailed log

This review is AI-generated. Please use common sense when accepting these suggestions, as they may not always be accurate or appropriate for your specific context.

@DaleSeo DaleSeo self-assigned this Feb 26, 2026
@DaleSeo DaleSeo marked this pull request as ready for review February 26, 2026 20:02
@DaleSeo DaleSeo requested a review from a team as a code owner February 26, 2026 20:02
@github-actions
Copy link
Contributor

github-actions bot commented Feb 26, 2026

Changeset file added - thank you!

@claude
Copy link

claude bot commented Feb 26, 2026

Review Summary

This PR successfully adds camelCase-aware tokenization to the schema index, enabling searches like "post" to match types such as PostAnalytics, CreatePostInput, and createPost. The implementation is clean, well-tested, and follows Rust best practices.

Re-review Confirmation (2026-02-27)

Re-reviewed against Apollo Rust best practices chapters 1-5. All findings from the original review remain valid. No new issues identified.

Original Findings

No blocking or should-fix issues found

Strengths:

  • Clean implementation: The expand_identifiers and push_expanded_word functions are well-designed with clear doc comments and examples
  • Efficient allocation strategy: Pre-allocating with text.len() * 2 is reasonable, and minimizes unnecessary allocations
  • Comprehensive test coverage: Three new tests cover the main use cases (searching for camelCase types, camelCase query terms, and camelCase in descriptions), plus parameterized tests using rstest cover edge cases
  • Proper documentation: Fixed the intra-doc link for EnumSet (changed from invalid (EnumSet) to proper [EnumSet] format)
  • No clippy violations: Code follows Rust idioms with no unnecessary clones, unwraps, or panics in library code
  • Security: No unsafe code, no unwraps on external data, no security concerns
  • Error handling: Proper use of if let patterns, no unwraps in library code

Test Coverage Assessment

The PR adds comprehensive test coverage:

  • search_camel_case_splitting: Validates that searching "post" matches PostAnalytics and UpdatePostInput
  • search_camel_case_query_term: Ensures camelCase query terms are also split correctly
  • search_camel_case_in_description: Verifies camelCase splitting in descriptions
  • expand_identifiers_splits_at_word_boundaries: Parameterized test covering PascalCase, camelCase, acronyms, separators, and edge cases

The snapshot test update reflects expected behavior (reordered results with updated relevance scores due to additional word-boundary matches).

Final Recommendation

✅ Approve - This is a well-crafted feature addition that enhances search functionality without introducing performance regressions or code quality issues. Ready to merge.


Reviewed by Claude Code Sonnet 4.5

@@ -261,7 +314,7 @@ impl SchemaIndex {
),
_ => String::new(),
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @DaleSeo ! This one took me a little longer to parse as I'm new to this crate. But I noticed we’re running expand_identifiers() on query terms now (token_stream(&expanded)), and we also expand type names and the big fields string when indexing 👀

But I believe descriptions are still indexed raw (doc.add_text(description_field, ...) without expand_identifiers). If a camelCase identifier only shows up in the description (like “Use createPost to…”), it might get indexed as one token (createpost) while the query becomes create post, so description-only matches could get missed.

Is it maybe worth also expanding descriptions at index time (or storing both the raw and expanded values)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch, @gocamille! That was my oversight. I think expanding descriptions will help improve the search quality. I'll make sure to apply this to the Rover PR as well.

Copy link
Contributor

@gocamille gocamille left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thank you @DaleSeo !

@DaleSeo DaleSeo merged commit a9663dd into main Mar 4, 2026
16 checks passed
@DaleSeo DaleSeo deleted the AMS-424 branch March 4, 2026 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants