Skip to content

Conversation

@mxsm
Copy link
Owner

@mxsm mxsm commented Jan 4, 2026

fix #72

Summary by CodeRabbit

  • New Features

    • Added safe validation methods for converting binary data to strings with proper error handling.
    • Introduced a new error handling system for library operations.
  • Tests

    • Added comprehensive test coverage for string functionality, including UTF-8 handling and conversions.

✏️ Tip: You can customize this high-level summary in your review settings.

Copilot AI review requested due to automatic review settings January 4, 2026 15:07
@coderabbitai
Copy link

coderabbitai bot commented Jan 4, 2026

📝 Walkthrough

Walkthrough

This pull request introduces UTF-8 validation pathways for CheetahString by adding safe constructor methods try_from_vec() and try_from_bytes(), creating a comprehensive error handling module with an Error enum, and marking existing unsafe conversions for deprecation. A full test suite validates the new functionality.

Changes

Cohort / File(s) Summary
Error Handling Module
src/error.rs
New public error module introducing Error enum with three variants (Utf8Error, IndexOutOfBounds, InvalidCharBoundary), Display and std::error::Error trait implementations, From<Utf8Error> conversion, and public Result<T> type alias.
Core API Extensions
src/cheetah_string.rs
Added safe constructors try_from_vec() and try_from_bytes() returning Result<Self, Utf8Error>. Marked unsafe From<&[u8]> and From<Vec<u8>> for deprecation with safety-focused documentation. Added unsafe blocks with safety invariant comments for ArcVecString and Bytes variants in conversions and as_str() extraction.
Module Exposure
src/lib.rs
Added mod error module declaration and re-exports Error and Result at crate root via pub use error::{Error, Result};.
Comprehensive Test Coverage
tests/basic.rs
New test suite covering basic constructors, conversions, equality operations, cloning, ordering/hashing, display/debug formatting, borrowing patterns, FromArc/FromIterator implementations, UTF-8 validation with error cases, and Unicode handling including multi-byte characters.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A safer path for strings we weave,
With UTF-8 that we believe,
No more unsafe conversions blind,
Try\_ methods, validation kind,
From bytes and vecs, with errors clear! 🎉

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly summarizes the main changes: adding UTF-8 validation methods (try_from_vec, try_from_bytes) and deprecating unsafe conversions in CheetahString.
Linked Issues check ✅ Passed The PR fully implements the requirements from issue #72: new UTF-8 validation methods (try_from_vec, try_from_bytes) are added with proper error handling, and documentation indicates deprecation of unsafe From implementations.
Out of Scope Changes check ✅ Passed All changes are directly related to issue #72: new error module, UTF-8 validation constructors, deprecation documentation, and comprehensive test coverage for the new functionality are all in scope.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@mxsm mxsm merged commit 750ab5d into main Jan 4, 2026
12 of 13 checks passed
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds UTF-8 validation methods and introduces safety warnings for existing unsafe byte-to-string conversions in CheetahString. It addresses issue #72 by providing safer alternatives (try_from_vec and try_from_bytes) that validate UTF-8 encoding before creating CheetahString instances.

Key Changes:

  • Introduces new error handling infrastructure with Error and Result types
  • Adds UTF-8-validating methods try_from_vec() and try_from_bytes()
  • Documents existing unsafe From<Vec<u8>> and From<&[u8]> implementations with safety warnings

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 11 comments.

File Description
src/error.rs New error module defining Error enum for UTF-8 validation errors with proper Display and std::error::Error trait implementations
src/lib.rs Exports new Error and Result types from the error module
src/cheetah_string.rs Adds try_from_vec and try_from_bytes validation methods; documents existing unsafe conversions; adds safety comments to unsafe code blocks
tests/basic.rs Comprehensive test suite covering new validation methods, invalid UTF-8 handling, and various CheetahString constructors and conversions
Comments suppressed due to low confidence (1)

src/cheetah_string.rs:322

  • The from_arc_vec method accepts Arc<Vec<u8>> without performing UTF-8 validation, but later code assumes the bytes are valid UTF-8 when using from_utf8_unchecked. This violates the safety invariant documented in the code and can lead to undefined behavior if invalid UTF-8 bytes are passed in.

This method should either:

  1. Add UTF-8 validation before creating the CheetahString, or
  2. Be marked as unsafe with clear documentation about the caller's responsibility to ensure valid UTF-8, or
  3. Be deprecated in favor of a safe alternative
    pub fn from_arc_vec(s: Arc<Vec<u8>>) -> Self {
        CheetahString {
            inner: InnerString::ArcVecString(s),
        }
    }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +225 to +234
fn test_try_from_vec_method() {
let bytes = vec![104, 101, 108, 108, 111];
let s = CheetahString::try_from_vec(bytes).unwrap();
assert_eq!(s, "hello");

let invalid = vec![0xFF, 0xFE];
assert!(CheetahString::try_from_vec(invalid).is_err());
}

#[test]
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests test_try_from_vec_method and test_try_from_valid_vec/test_try_from_invalid_vec are duplicating test coverage for the same functionality. Consider removing test_try_from_vec_method since the behavior is already covered by the more descriptive pair of tests that separate valid and invalid cases.

Suggested change
fn test_try_from_vec_method() {
let bytes = vec![104, 101, 108, 108, 111];
let s = CheetahString::try_from_vec(bytes).unwrap();
assert_eq!(s, "hello");
let invalid = vec![0xFF, 0xFE];
assert!(CheetahString::try_from_vec(invalid).is_err());
}
#[test]

Copilot uses AI. Check for mistakes.
Comment on lines +358 to +360
// SAFETY: ArcVecString is only created from validated UTF-8 sources.
// All constructors ensure this invariant is maintained.
unsafe { std::str::from_utf8_unchecked(s.as_ref()) }
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The safety comment claims "ArcVecString is only created from validated UTF-8 sources" and "All constructors ensure this invariant is maintained." However, the from_arc_vec method at line 318 does not validate UTF-8, creating a path for invalid UTF-8 to reach this unsafe code. This violates the documented safety invariant and can cause undefined behavior.

Suggested change
// SAFETY: ArcVecString is only created from validated UTF-8 sources.
// All constructors ensure this invariant is maintained.
unsafe { std::str::from_utf8_unchecked(s.as_ref()) }
// Enforce that ArcVecString only contains valid UTF-8.
std::str::from_utf8(s.as_ref())
.expect("ArcVecString must contain valid UTF-8")

Copilot uses AI. Check for mistakes.
Comment on lines +194 to +195
// SAFETY: Bytes variant should only be created from valid UTF-8 sources
unsafe { String::from_utf8_unchecked(b.to_vec()) }
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The safety comment claims "Bytes variant should only be created from valid UTF-8 sources", but the from_bytes method does not perform UTF-8 validation. This allows invalid UTF-8 to reach this unsafe code, which can cause undefined behavior when converting to String.

Suggested change
// SAFETY: Bytes variant should only be created from valid UTF-8 sources
unsafe { String::from_utf8_unchecked(b.to_vec()) }
String::from_utf8(b.to_vec())
.expect("Bytes variant must contain valid UTF-8")

Copilot uses AI. Check for mistakes.
Comment on lines +312 to +313
pub fn try_from_bytes(b: &[u8]) -> Result<Self, Utf8Error> {
let s = std::str::from_utf8(b)?;
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The try_from_bytes method returns Result<Self, Utf8Error> instead of using the new crate::Result<T> type alias (which is Result<T, crate::Error>). This creates an inconsistent error handling API. Consider changing the return type to crate::Result<Self> to maintain consistency with the error handling infrastructure introduced in this PR.

Suggested change
pub fn try_from_bytes(b: &[u8]) -> Result<Self, Utf8Error> {
let s = std::str::from_utf8(b)?;
pub fn try_from_bytes(b: &[u8]) -> crate::Result<Self> {
let s = std::str::from_utf8(b).map_err(Into::into)?;

Copilot uses AI. Check for mistakes.
/// if the bytes are not valid UTF-8. Consider using `CheetahString::try_from_bytes()`
/// for safe UTF-8 validation.
///
/// This implementation will be deprecated in a future version.
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation says "This implementation will be deprecated in a future version" but doesn't use the #[deprecated] attribute. Consider adding the #[deprecated] attribute with a message directing users to try_from_bytes() to provide compiler warnings and IDE hints for users of this API.

Suggested change
/// This implementation will be deprecated in a future version.
/// This implementation will be deprecated in a future version.
#[deprecated(
note = "Use CheetahString::try_from_bytes() for safe UTF-8 validation instead; this unsafe From<&[u8]> impl will be removed in a future version."
)]

Copilot uses AI. Check for mistakes.
/// if the bytes are not valid UTF-8. Consider using `CheetahString::try_from_vec()`
/// for safe UTF-8 validation.
///
/// This implementation will be deprecated in a future version.
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation says "This implementation will be deprecated in a future version" but doesn't use the #[deprecated] attribute. Consider adding the #[deprecated] attribute with a message directing users to try_from_vec() to provide compiler warnings and IDE hints for users of this API.

Suggested change
/// This implementation will be deprecated in a future version.
/// This implementation will be deprecated in a future version.
#[deprecated(note = "use CheetahString::try_from_vec() instead for safe UTF-8 validation")]

Copilot uses AI. Check for mistakes.
Comment on lines +363 to +367
InnerString::Bytes(b) => {
// SAFETY: Bytes variant is only created from validated UTF-8 sources.
// The from_bytes constructor ensures this invariant.
unsafe { std::str::from_utf8_unchecked(b.as_ref()) }
}
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The safety comment claims "Bytes variant is only created from validated UTF-8 sources" and "The from_bytes constructor ensures this invariant." However, the from_bytes method does not perform UTF-8 validation, which means this safety claim is incorrect and the use of from_utf8_unchecked here can cause undefined behavior if invalid UTF-8 is passed through that constructor.

Copilot uses AI. Check for mistakes.
Comment on lines +186 to +189
} => {
// SAFETY: ArcVecString should only be created from valid UTF-8 sources
unsafe { String::from_utf8_unchecked(s.to_vec()) }
}
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The safety comment claims "ArcVecString should only be created from valid UTF-8 sources", but the from_arc_vec method at line 318 does not validate UTF-8. This creates a path for invalid UTF-8 data to reach this unsafe code, which can cause undefined behavior when converting to String.

Copilot uses AI. Check for mistakes.
/// let invalid = vec![0xFF, 0xFE];
/// assert!(CheetahString::try_from_vec(invalid).is_err());
/// ```
pub fn try_from_vec(v: Vec<u8>) -> Result<Self, Utf8Error> {
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The try_from_vec and try_from_bytes methods return Result<Self, Utf8Error> instead of using the new crate::Result<T> type alias (which is Result<T, crate::Error>). This creates an inconsistent error handling API. Consider changing the return type to crate::Result<Self> to maintain consistency with the error handling infrastructure introduced in this PR.

Suggested change
pub fn try_from_vec(v: Vec<u8>) -> Result<Self, Utf8Error> {
pub fn try_from_vec(v: Vec<u8>) -> crate::Result<Self> {

Copilot uses AI. Check for mistakes.
Comment on lines +215 to +224
fn test_try_from_bytes_method() {
let bytes = b"hello world";
let s = CheetahString::try_from_bytes(bytes).unwrap();
assert_eq!(s, "hello world");

let invalid = &[0xFF, 0xFE];
assert!(CheetahString::try_from_bytes(invalid).is_err());
}

#[test]
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests test_try_from_bytes_method and test_try_from_valid_bytes/test_try_from_invalid_bytes are duplicating test coverage for the same functionality. Consider removing test_try_from_bytes_method since the behavior is already covered by the more descriptive pair of tests that separate valid and invalid cases.

Suggested change
fn test_try_from_bytes_method() {
let bytes = b"hello world";
let s = CheetahString::try_from_bytes(bytes).unwrap();
assert_eq!(s, "hello world");
let invalid = &[0xFF, 0xFE];
assert!(CheetahString::try_from_bytes(invalid).is_err());
}
#[test]

Copilot uses AI. Check for mistakes.
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/cheetah_string.rs (3)

262-266: Critical soundness bug: from_vec creates ArcVecString without UTF-8 validation.

This public method creates ArcVecString from a Vec<u8> without validating UTF-8. Later, as_str() (lines 357-361) uses from_utf8_unchecked on this data, which causes undefined behavior if the bytes aren't valid UTF-8.

Either:

  1. Mark this method as unsafe and document the UTF-8 requirement, or
  2. Deprecate it in favor of try_from_vec, or
  3. Add UTF-8 validation here
🔎 Recommended fix: Add UTF-8 validation
 #[inline]
 pub fn from_vec(s: Vec<u8>) -> Self {
+    // Validate UTF-8 to maintain safety invariants
+    let _ = std::str::from_utf8(&s).expect("from_vec requires valid UTF-8 bytes");
     CheetahString {
         inner: InnerString::ArcVecString(Arc::new(s)),
     }
 }

Or better, deprecate it:

+#[deprecated(since = "0.x.0", note = "Use `try_from_vec` for safe UTF-8 validation")]
 #[inline]
 pub fn from_vec(s: Vec<u8>) -> Self {
     CheetahString {
         inner: InnerString::ArcVecString(Arc::new(s)),
     }
 }

318-322: Critical soundness bug: from_arc_vec creates ArcVecString without UTF-8 validation.

Same issue as from_vec - this creates ArcVecString without UTF-8 validation, violating the safety invariant relied upon by as_str().

🔎 Recommended fix: Mark as unsafe or add validation

Option 1 - Mark as unsafe:

 #[inline]
-pub fn from_arc_vec(s: Arc<Vec<u8>>) -> Self {
+pub unsafe fn from_arc_vec(s: Arc<Vec<u8>>) -> Self {
     CheetahString {
         inner: InnerString::ArcVecString(s),
     }
 }

Option 2 - Add validation:

 #[inline]
 pub fn from_arc_vec(s: Arc<Vec<u8>>) -> Self {
+    let _ = std::str::from_utf8(&s).expect("from_arc_vec requires valid UTF-8 bytes");
     CheetahString {
         inner: InnerString::ArcVecString(s),
     }
 }

344-350: Critical soundness bug: from_bytes creates Bytes variant without UTF-8 validation.

The from_bytes method creates a Bytes variant without validating UTF-8. The as_str() method later uses from_utf8_unchecked on this data (lines 363-367), which assumes all Bytes are valid UTF-8. This can cause undefined behavior.

Additionally, the bytes::Bytes type does not guarantee UTF-8 encoding - it's just a byte buffer.

🔎 Recommended fix: Add UTF-8 validation or mark as unsafe
 #[inline]
 #[cfg(feature = "bytes")]
 pub fn from_bytes(b: bytes::Bytes) -> Self {
+    let _ = std::str::from_utf8(&b).expect("from_bytes requires valid UTF-8");
     CheetahString {
         inner: InnerString::Bytes(b),
     }
 }

Or provide a safe alternative:

#[cfg(feature = "bytes")]
pub fn try_from_bytes_buf(b: bytes::Bytes) -> Result<Self, Utf8Error> {
    std::str::from_utf8(&b)?;
    Ok(CheetahString {
        inner: InnerString::Bytes(b),
    })
}
🧹 Nitpick comments (2)
src/error.rs (1)

6-13: Consider deferring unused error variants.

The IndexOutOfBounds and InvalidCharBoundary variants are defined but not currently used in this PR. While forward-planning is good, you may want to consider adding these when they're actually needed to keep the API surface minimal.

tests/basic.rs (1)

1-296: Comprehensive test coverage, but missing tests for unsafe constructors.

The test suite provides excellent coverage of the CheetahString API, including the new try_from_bytes and try_from_vec methods with both valid and invalid UTF-8 inputs. However, there are no tests for the unsafe constructors that were identified as having soundness issues:

  • from_vec (with invalid UTF-8)
  • from_arc_vec (with invalid UTF-8)
  • from_bytes (with invalid UTF-8)

Consider adding tests that demonstrate the undefined behavior risk with these methods (though these tests would need to be marked as ignored or compile_fail since they trigger UB).

Example test to demonstrate the soundness issue
#[test]
#[should_panic] // or #[ignore] since this is UB
fn test_from_vec_invalid_utf8_causes_ub() {
    let invalid = vec![0xFF, 0xFE];
    let s = CheetahString::from_vec(invalid);
    // This call to as_str() triggers undefined behavior
    let _ = s.as_str();
}
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cece62f and b080f8d.

📒 Files selected for processing (4)
  • src/cheetah_string.rs
  • src/error.rs
  • src/lib.rs
  • tests/basic.rs
🧰 Additional context used
🧬 Code graph analysis (3)
src/lib.rs (1)
src/serde.rs (1)
  • cheetah_string (22-84)
src/cheetah_string.rs (1)
src/error.rs (1)
  • from (40-42)
tests/basic.rs (1)
src/cheetah_string.rs (24)
  • new (250-252)
  • is_empty (397-406)
  • len (385-394)
  • as_str (353-370)
  • default (20-24)
  • from_static_str (255-259)
  • from (29-31)
  • from (36-38)
  • from (43-45)
  • from (57-61)
  • from (81-85)
  • from (90-95)
  • from (100-105)
  • from (119-121)
  • from (162-164)
  • from (169-171)
  • from (176-201)
  • from (236-238)
  • from_string (332-336)
  • borrow (504-506)
  • try_from_bytes (312-315)
  • try_from_vec (286-292)
  • empty (243-247)
  • from_str (67-69)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Agent
🔇 Additional comments (6)
src/lib.rs (1)

22-28: LGTM! Clean error module integration.

The error module is properly declared and its public types are exposed at the crate root following idiomatic Rust patterns.

src/error.rs (3)

15-27: LGTM! Clear and informative error messages.

The Display implementation provides user-friendly error messages for all variants.


29-37: LGTM! Proper std::error::Error implementation.

The conditional implementation correctly exposes the underlying Utf8Error as the error source when the std feature is enabled.


39-46: LGTM! Standard error handling patterns.

The From implementation and Result type alias follow idiomatic Rust conventions for error handling.

src/cheetah_string.rs (2)

268-292: LGTM! Safe UTF-8 validation implemented correctly.

This method properly validates UTF-8 before creating the CheetahString, providing a safe alternative to the unsafe from_vec method. The documentation with examples is also excellent.


294-315: LGTM! Safe UTF-8 validation implemented correctly.

This method properly validates UTF-8 before construction. Good documentation and examples.

Comment on lines +48 to 62
/// # Safety Warning
///
/// This implementation uses `unsafe` code and may cause undefined behavior
/// if the bytes are not valid UTF-8. Consider using `CheetahString::try_from_bytes()`
/// for safe UTF-8 validation.
///
/// This implementation will be deprecated in a future version.
impl From<&[u8]> for CheetahString {
#[inline]
fn from(b: &[u8]) -> Self {
// SAFETY: This is unsafe and may cause UB if bytes are not valid UTF-8.
// This will be deprecated in favor of try_from_bytes in the next version.
CheetahString::from_slice(unsafe { std::str::from_utf8_unchecked(b) })
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: This From implementation is unsound and must be deprecated immediately.

This implementation creates ArcString via from_utf8_unchecked, but it's callable from safe code without any validation. While you've documented that it "will be deprecated in a future version," this should be:

  1. Marked with #[deprecated] attribute now
  2. Or removed entirely in favor of try_from_bytes

The safety warning in documentation doesn't prevent misuse from safe code.

🔎 Recommended fix: Add deprecation attribute
+#[deprecated(since = "0.x.0", note = "Use `CheetahString::try_from_bytes()` instead for safe UTF-8 validation")]
 impl From<&[u8]> for CheetahString {
     #[inline]
     fn from(b: &[u8]) -> Self {
-        // SAFETY: This is unsafe and may cause UB if bytes are not valid UTF-8.
-        // This will be deprecated in favor of try_from_bytes in the next version.
         CheetahString::from_slice(unsafe { std::str::from_utf8_unchecked(b) })
     }
 }

Comment on lines +72 to 86
/// # Safety Warning
///
/// This implementation uses `unsafe` code and may cause undefined behavior
/// if the bytes are not valid UTF-8. Consider using `CheetahString::try_from_vec()`
/// for safe UTF-8 validation.
///
/// This implementation will be deprecated in a future version.
impl From<Vec<u8>> for CheetahString {
#[inline]
fn from(v: Vec<u8>) -> Self {
// SAFETY: This is unsafe and may cause UB if bytes are not valid UTF-8.
// This will be deprecated in favor of try_from_vec in the next version.
CheetahString::from_slice(unsafe { std::str::from_utf8_unchecked(&v) })
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: This From implementation is unsound and must be deprecated immediately.

Same issue as From<&[u8]> - this uses from_utf8_unchecked from safe code without validation.

🔎 Recommended fix: Add deprecation attribute
+#[deprecated(since = "0.x.0", note = "Use `CheetahString::try_from_vec()` instead for safe UTF-8 validation")]
 impl From<Vec<u8>> for CheetahString {
     #[inline]
     fn from(v: Vec<u8>) -> Self {
-        // SAFETY: This is unsafe and may cause UB if bytes are not valid UTF-8.
-        // This will be deprecated in favor of try_from_vec in the next version.
         CheetahString::from_slice(unsafe { std::str::from_utf8_unchecked(&v) })
     }
 }
🤖 Prompt for AI Agents
In src/cheetah_string.rs around lines 72 to 86, the From<Vec<u8>> implementation
unsafely uses from_utf8_unchecked; mark this impl as deprecated and make it safe
by delegating to the fallible conversion: add a deprecation attribute (e.g.
#[deprecated(since = "x.y.z", note = "unsound: use CheetahString::try_from_vec
instead")]) to the impl, and replace the body to call
CheetahString::try_from_vec(v).expect("From<Vec<u8>> is deprecated and requires
valid UTF-8; use try_from_vec for fallible conversion") so the code no longer
performs unchecked UTF-8 conversion while signaling deprecation and guiding
users to the safe API.

Comment on lines +186 to +196
} => {
// SAFETY: ArcVecString should only be created from valid UTF-8 sources
unsafe { String::from_utf8_unchecked(s.to_vec()) }
}
#[cfg(feature = "bytes")]
CheetahString {
inner: InnerString::Bytes(b),
} => unsafe { String::from_utf8_unchecked(b.to_vec()) },
} => {
// SAFETY: Bytes variant should only be created from valid UTF-8 sources
unsafe { String::from_utf8_unchecked(b.to_vec()) }
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all code that constructs ArcVecString or Bytes variants

echo "=== Searching for ArcVecString construction ==="
rg -n -C3 'InnerString::ArcVecString'

echo -e "\n=== Searching for Bytes variant construction ==="
rg -n -C3 'InnerString::Bytes'

Repository: mxsm/cheetah-string

Length of output: 7539


🏁 Script executed:

# Let's look at the specific constructors mentioned
sed -n '262,266p' src/cheetah_string.rs
echo "---"
sed -n '318,322p' src/cheetah_string.rs
echo "---"
sed -n '346,350p' src/cheetah_string.rs
echo "---"
# Also check if there's a from_vec_checked constructor
sed -n '280,293p' src/cheetah_string.rs

Repository: mxsm/cheetah-string

Length of output: 934


Fix the UTF-8 safety invariant for ArcVecString and Bytes variants.

The unsafe conversions at lines 186-196 use from_utf8_unchecked assuming valid UTF-8, but public constructors violate this invariant:

  • from_vec (line 262) creates ArcVecString from arbitrary Vec<u8> without validation
  • from_arc_vec (line 318) creates ArcVecString from arbitrary Arc<Vec<u8>> without validation
  • from_bytes (line 346) creates Bytes variant from arbitrary bytes::Bytes without validation

Calling to_string() or as_str() on these unvalidated instances triggers undefined behavior. The safety comments claiming "all constructors ensure this invariant" are false.

Either validate UTF-8 in these constructors (like try_from_vec does at line 288) or mark them unsafe to explicitly document the responsibility falls to callers.

🤖 Prompt for AI Agents
In src/cheetah_string.rs around lines 186-196, the unsafe use of
String::from_utf8_unchecked for ArcVecString and Bytes assumes a UTF-8 invariant
that is not actually enforced by the public constructors at lines 262
(from_vec), 318 (from_arc_vec) and 346 (from_bytes) (only try_from_vec at 288
validates); fix by either (A) validating input in those constructors (use
std::str::from_utf8 or String::from_utf8 and return Result or panic on invalid
UTF-8) so the unchecked conversions are safe, or (B) mark those constructors
unsafe and update docs/comments to require callers ensure valid UTF-8; pick one
approach and make it consistent across all constructors and any places that rely
on the invariant (e.g., as_str/to_string) so no unchecked conversion can be
reached with unvalidated data.

Comment on lines +357 to +367
InnerString::ArcVecString(s) => {
// SAFETY: ArcVecString is only created from validated UTF-8 sources.
// All constructors ensure this invariant is maintained.
unsafe { std::str::from_utf8_unchecked(s.as_ref()) }
}
#[cfg(feature = "bytes")]
InnerString::Bytes(b) => std::str::from_utf8(b.as_ref()).unwrap(),
InnerString::Bytes(b) => {
// SAFETY: Bytes variant is only created from validated UTF-8 sources.
// The from_bytes constructor ensures this invariant.
unsafe { std::str::from_utf8_unchecked(b.as_ref()) }
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Unsafe code relies on violated invariant - undefined behavior risk.

The safety comments claim "ArcVecString is only created from validated UTF-8 sources" and "Bytes variant is only created from validated UTF-8 sources," but as flagged in other comments, the following constructors violate this invariant:

  • from_vec (line 262)
  • from_arc_vec (line 318)
  • from_bytes (line 346)

This makes the from_utf8_unchecked calls here unsound. The invariant must be maintained by ALL constructors, or this code will cause undefined behavior.

🤖 Prompt for AI Agents
In src/cheetah_string.rs around lines 357 to 367, the unsafe from_utf8_unchecked
calls rely on an invariant that some constructors do not maintain, making this
unsound; fix by ensuring the invariant is enforced or by removing the unsafe
conversion: either validate UTF-8 in every constructor that can produce
ArcVecString/Bytes (from_vec, from_arc_vec, from_bytes) and return a Result or
panic on invalid UTF-8, or replace the unchecked conversions here with safe
std::str::from_utf8 and propagate or handle the error; update the safety
comments to reflect the chosen strategy.

@mxsm mxsm deleted the feat-72 branch January 6, 2026 07:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add UTF-8 validation methods and deprecate unsafe conversions in CheetahString

3 participants