Skip to content

[BUG]: Inconsistent UUID string format across database models #1324

@jonpspri

Description

@jonpspri

Issue Description

The codebase uses UUID-based primary keys for 18 database models, but there is an inconsistency in how UUIDs are converted to strings. This creates two different representations that are incompatible with each other.

UUID String Format Inconsistency

Format 1: Hyphenated UUID (str(uuid.uuid4()))

Storage: 36 characters including hyphens (e.g., "550e8400-e29b-41d4-a716-446655440000")

Classes using this format:

  • Role (db.py:224)
  • UserRole (db.py:266)
  • PendingUserApproval (db.py:1276)
  • RegisteredOAuthClient (db.py:2829)
  • EmailApiToken (db.py:2902)
  • SSOAuthSession (db.py:3207)

Format 2: Hexadecimal UUID (uuid.uuid4().hex)

Storage: 32 characters, no hyphens (e.g., "550e8400e29b41d4a716446655440000")

Classes using this format:

  • EmailTeam (db.py:834)
  • EmailTeamMember (db.py:943)
  • EmailTeamMemberHistory (db.py:1009)
  • EmailTeamInvitation (db.py:1078)
  • EmailTeamJoinRequest (db.py:1181)
  • Tool (db.py:1565)
  • Server (db.py:2275)
  • Gateway (db.py:2431)
  • A2AAgent (db.py:2555)
  • GrpcService (db.py:2689)
  • OAuthToken (db.py:2780)
  • OAuthState (db.py:2805)

Note: EmailApiToken also has a jti field (db.py:2906) using hyphenated format.

Impact

1. Data Inconsistency

  • Two different string representations of the same UUID concept
  • Foreign key relationships may expect one format but receive another
  • API clients must handle different formats depending on the entity type

2. Storage Overhead

  • Hyphenated format: 36 bytes per ID
  • Hex format: 32 bytes per ID
  • Both formats are significantly larger than a 4-byte int or 8-byte bigint

3. Potential Performance Concerns

While not currently causing issues, using 32-36 byte strings as primary keys may introduce database performance problems:

  • Index Size: String-based indexes are much larger than integer indexes
    • Int (4 bytes) vs UUID hex (32 bytes) = 8x size increase
    • Int (4 bytes) vs UUID hyphenated (36 bytes) = 9x size increase
  • Comparison Speed: String comparisons are slower than integer comparisons
  • Memory Usage: Larger primary keys increase memory footprint for indexes and joins
  • Foreign Key Joins: Each foreign key relationship carries the storage overhead
  • B-tree Performance: Larger keys reduce B-tree fan-out, requiring more disk I/O

4. Cross-Database Compatibility

  • PostgreSQL has native UUID type (16 bytes) that would be more efficient
  • SQLite stores strings as-is, making the overhead more pronounced
  • Migration to native UUID type would be complex with two string formats

Recommendations

Option 1: Standardize on Hex Format (Quick Fix)

  • Convert all str(uuid.uuid4()) to uuid.uuid4().hex
  • Saves 4 bytes per ID in hyphenated classes
  • Maintains consistency across the codebase
  • Still retains UUID string overhead

Option 2: Use Native UUID Type (Best for PostgreSQL)

  • Use SQLAlchemy's UUID column type for PostgreSQL
  • Falls back to CHAR(32) or CHAR(36) for SQLite
  • Reduces storage to 16 bytes in PostgreSQL
  • Requires migration script

Option 3: Use Binary UUID Storage (Most Efficient)

  • Store UUIDs as 16-byte binary values
  • Convert to string only at API boundary
  • Most efficient storage and performance
  • Requires significant refactoring

Option 4: Migrate to Integer IDs (Maximum Performance)

  • Use auto-incrementing BigInt (8 bytes) for high-volume tables
  • Reserve UUIDs for distributed/federated scenarios
  • Best database performance
  • May complicate federation features

Additional Notes

Two classes still use integer auto-increment IDs and are unaffected by this issue:

  • Resource (db.py:1842): id: Mapped[int]
  • Prompt (db.py:2083): id: Mapped[int]
  • EmailAuthEvent (db.py:696): id: Mapped[int]

Suggested Action

At minimum, standardize the UUID string format across all models to ensure consistency. Consider the performance implications for high-volume tables and evaluate whether native UUID types or integer IDs would be more appropriate for the project's scale and federation requirements.

Metadata

Metadata

Assignees

Labels

SHOULDP2: Important but not vital; high-value items that are not crucial for the immediate releasebugSomething isn't workingdatabasepythonPython / backend development (FastAPI)

Type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions