Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
90 changes: 86 additions & 4 deletions src/cheetah_string.rs
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
use core::fmt;
use core::str::Utf8Error;
use std::borrow::{Borrow, Cow};
use std::cmp::Ordering;
use std::fmt::Display;
Expand Down Expand Up @@ -44,9 +45,18 @@ impl<'a> From<&'a str> for CheetahString {
}
}

/// # Safety Warning
///
/// This implementation uses `unsafe` code and may cause undefined behavior
/// if the bytes are not valid UTF-8. Consider using `CheetahString::try_from_bytes()`
/// for safe UTF-8 validation.
///
/// This implementation will be deprecated in a future version.
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation says "This implementation will be deprecated in a future version" but doesn't use the #[deprecated] attribute. Consider adding the #[deprecated] attribute with a message directing users to try_from_bytes() to provide compiler warnings and IDE hints for users of this API.

Suggested change
/// This implementation will be deprecated in a future version.
/// This implementation will be deprecated in a future version.
#[deprecated(
note = "Use CheetahString::try_from_bytes() for safe UTF-8 validation instead; this unsafe From<&[u8]> impl will be removed in a future version."
)]

Copilot uses AI. Check for mistakes.
impl From<&[u8]> for CheetahString {
#[inline]
fn from(b: &[u8]) -> Self {
// SAFETY: This is unsafe and may cause UB if bytes are not valid UTF-8.
// This will be deprecated in favor of try_from_bytes in the next version.
CheetahString::from_slice(unsafe { std::str::from_utf8_unchecked(b) })
}
}
Comment on lines +48 to 62
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: This From implementation is unsound and must be deprecated immediately.

This implementation creates ArcString via from_utf8_unchecked, but it's callable from safe code without any validation. While you've documented that it "will be deprecated in a future version," this should be:

  1. Marked with #[deprecated] attribute now
  2. Or removed entirely in favor of try_from_bytes

The safety warning in documentation doesn't prevent misuse from safe code.

🔎 Recommended fix: Add deprecation attribute
+#[deprecated(since = "0.x.0", note = "Use `CheetahString::try_from_bytes()` instead for safe UTF-8 validation")]
 impl From<&[u8]> for CheetahString {
     #[inline]
     fn from(b: &[u8]) -> Self {
-        // SAFETY: This is unsafe and may cause UB if bytes are not valid UTF-8.
-        // This will be deprecated in favor of try_from_bytes in the next version.
         CheetahString::from_slice(unsafe { std::str::from_utf8_unchecked(b) })
     }
 }

Expand All @@ -59,9 +69,18 @@ impl FromStr for CheetahString {
}
}

/// # Safety Warning
///
/// This implementation uses `unsafe` code and may cause undefined behavior
/// if the bytes are not valid UTF-8. Consider using `CheetahString::try_from_vec()`
/// for safe UTF-8 validation.
///
/// This implementation will be deprecated in a future version.
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation says "This implementation will be deprecated in a future version" but doesn't use the #[deprecated] attribute. Consider adding the #[deprecated] attribute with a message directing users to try_from_vec() to provide compiler warnings and IDE hints for users of this API.

Suggested change
/// This implementation will be deprecated in a future version.
/// This implementation will be deprecated in a future version.
#[deprecated(note = "use CheetahString::try_from_vec() instead for safe UTF-8 validation")]

Copilot uses AI. Check for mistakes.
impl From<Vec<u8>> for CheetahString {
#[inline]
fn from(v: Vec<u8>) -> Self {
// SAFETY: This is unsafe and may cause UB if bytes are not valid UTF-8.
// This will be deprecated in favor of try_from_vec in the next version.
CheetahString::from_slice(unsafe { std::str::from_utf8_unchecked(&v) })
}
}
Comment on lines +72 to 86
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: This From implementation is unsound and must be deprecated immediately.

Same issue as From<&[u8]> - this uses from_utf8_unchecked from safe code without validation.

🔎 Recommended fix: Add deprecation attribute
+#[deprecated(since = "0.x.0", note = "Use `CheetahString::try_from_vec()` instead for safe UTF-8 validation")]
 impl From<Vec<u8>> for CheetahString {
     #[inline]
     fn from(v: Vec<u8>) -> Self {
-        // SAFETY: This is unsafe and may cause UB if bytes are not valid UTF-8.
-        // This will be deprecated in favor of try_from_vec in the next version.
         CheetahString::from_slice(unsafe { std::str::from_utf8_unchecked(&v) })
     }
 }
🤖 Prompt for AI Agents
In src/cheetah_string.rs around lines 72 to 86, the From<Vec<u8>> implementation
unsafely uses from_utf8_unchecked; mark this impl as deprecated and make it safe
by delegating to the fallible conversion: add a deprecation attribute (e.g.
#[deprecated(since = "x.y.z", note = "unsound: use CheetahString::try_from_vec
instead")]) to the impl, and replace the body to call
CheetahString::try_from_vec(v).expect("From<Vec<u8>> is deprecated and requires
valid UTF-8; use try_from_vec for fallible conversion") so the code no longer
performs unchecked UTF-8 conversion while signaling deprecation and guiding
users to the safe API.

Expand Down Expand Up @@ -164,11 +183,17 @@ impl From<CheetahString> for String {
} => s.to_string(),
CheetahString {
inner: InnerString::ArcVecString(s),
} => unsafe { String::from_utf8_unchecked(s.to_vec()) },
} => {
// SAFETY: ArcVecString should only be created from valid UTF-8 sources
unsafe { String::from_utf8_unchecked(s.to_vec()) }
}
Comment on lines +186 to +189
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The safety comment claims "ArcVecString should only be created from valid UTF-8 sources", but the from_arc_vec method at line 318 does not validate UTF-8. This creates a path for invalid UTF-8 data to reach this unsafe code, which can cause undefined behavior when converting to String.

Copilot uses AI. Check for mistakes.
#[cfg(feature = "bytes")]
CheetahString {
inner: InnerString::Bytes(b),
} => unsafe { String::from_utf8_unchecked(b.to_vec()) },
} => {
// SAFETY: Bytes variant should only be created from valid UTF-8 sources
unsafe { String::from_utf8_unchecked(b.to_vec()) }
Comment on lines +194 to +195
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The safety comment claims "Bytes variant should only be created from valid UTF-8 sources", but the from_bytes method does not perform UTF-8 validation. This allows invalid UTF-8 to reach this unsafe code, which can cause undefined behavior when converting to String.

Suggested change
// SAFETY: Bytes variant should only be created from valid UTF-8 sources
unsafe { String::from_utf8_unchecked(b.to_vec()) }
String::from_utf8(b.to_vec())
.expect("Bytes variant must contain valid UTF-8")

Copilot uses AI. Check for mistakes.
}
Comment on lines +186 to +196
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all code that constructs ArcVecString or Bytes variants

echo "=== Searching for ArcVecString construction ==="
rg -n -C3 'InnerString::ArcVecString'

echo -e "\n=== Searching for Bytes variant construction ==="
rg -n -C3 'InnerString::Bytes'

Repository: mxsm/cheetah-string

Length of output: 7539


🏁 Script executed:

# Let's look at the specific constructors mentioned
sed -n '262,266p' src/cheetah_string.rs
echo "---"
sed -n '318,322p' src/cheetah_string.rs
echo "---"
sed -n '346,350p' src/cheetah_string.rs
echo "---"
# Also check if there's a from_vec_checked constructor
sed -n '280,293p' src/cheetah_string.rs

Repository: mxsm/cheetah-string

Length of output: 934


Fix the UTF-8 safety invariant for ArcVecString and Bytes variants.

The unsafe conversions at lines 186-196 use from_utf8_unchecked assuming valid UTF-8, but public constructors violate this invariant:

  • from_vec (line 262) creates ArcVecString from arbitrary Vec<u8> without validation
  • from_arc_vec (line 318) creates ArcVecString from arbitrary Arc<Vec<u8>> without validation
  • from_bytes (line 346) creates Bytes variant from arbitrary bytes::Bytes without validation

Calling to_string() or as_str() on these unvalidated instances triggers undefined behavior. The safety comments claiming "all constructors ensure this invariant" are false.

Either validate UTF-8 in these constructors (like try_from_vec does at line 288) or mark them unsafe to explicitly document the responsibility falls to callers.

🤖 Prompt for AI Agents
In src/cheetah_string.rs around lines 186-196, the unsafe use of
String::from_utf8_unchecked for ArcVecString and Bytes assumes a UTF-8 invariant
that is not actually enforced by the public constructors at lines 262
(from_vec), 318 (from_arc_vec) and 346 (from_bytes) (only try_from_vec at 288
validates); fix by either (A) validating input in those constructors (use
std::str::from_utf8 or String::from_utf8 and return Result or panic on invalid
UTF-8) so the unchecked conversions are safe, or (B) mark those constructors
unsafe and update docs/comments to require callers ensure valid UTF-8; pick one
approach and make it consistent across all constructors and any places that rely
on the invariant (e.g., as_str/to_string) so no unchecked conversion can be
reached with unvalidated data.

CheetahString {
inner: InnerString::Empty,
} => String::new(),
Expand Down Expand Up @@ -240,6 +265,55 @@ impl CheetahString {
}
}

/// Creates a `CheetahString` from a byte vector with UTF-8 validation.
///
/// # Errors
///
/// Returns an error if the bytes are not valid UTF-8.
///
/// # Examples
///
/// ```
/// use cheetah_string::CheetahString;
///
/// let bytes = vec![104, 101, 108, 108, 111]; // "hello"
/// let s = CheetahString::try_from_vec(bytes).unwrap();
/// assert_eq!(s, "hello");
///
/// let invalid = vec![0xFF, 0xFE];
/// assert!(CheetahString::try_from_vec(invalid).is_err());
/// ```
pub fn try_from_vec(v: Vec<u8>) -> Result<Self, Utf8Error> {
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The try_from_vec and try_from_bytes methods return Result<Self, Utf8Error> instead of using the new crate::Result<T> type alias (which is Result<T, crate::Error>). This creates an inconsistent error handling API. Consider changing the return type to crate::Result<Self> to maintain consistency with the error handling infrastructure introduced in this PR.

Suggested change
pub fn try_from_vec(v: Vec<u8>) -> Result<Self, Utf8Error> {
pub fn try_from_vec(v: Vec<u8>) -> crate::Result<Self> {

Copilot uses AI. Check for mistakes.
// Validate UTF-8
std::str::from_utf8(&v)?;
Ok(CheetahString {
inner: InnerString::ArcVecString(Arc::new(v)),
})
}

/// Creates a `CheetahString` from a byte slice with UTF-8 validation.
///
/// # Errors
///
/// Returns an error if the bytes are not valid UTF-8.
///
/// # Examples
///
/// ```
/// use cheetah_string::CheetahString;
///
/// let bytes = b"hello";
/// let s = CheetahString::try_from_bytes(bytes).unwrap();
/// assert_eq!(s, "hello");
///
/// let invalid = &[0xFF, 0xFE];
/// assert!(CheetahString::try_from_bytes(invalid).is_err());
/// ```
pub fn try_from_bytes(b: &[u8]) -> Result<Self, Utf8Error> {
let s = std::str::from_utf8(b)?;
Comment on lines +312 to +313
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The try_from_bytes method returns Result<Self, Utf8Error> instead of using the new crate::Result<T> type alias (which is Result<T, crate::Error>). This creates an inconsistent error handling API. Consider changing the return type to crate::Result<Self> to maintain consistency with the error handling infrastructure introduced in this PR.

Suggested change
pub fn try_from_bytes(b: &[u8]) -> Result<Self, Utf8Error> {
let s = std::str::from_utf8(b)?;
pub fn try_from_bytes(b: &[u8]) -> crate::Result<Self> {
let s = std::str::from_utf8(b).map_err(Into::into)?;

Copilot uses AI. Check for mistakes.
Ok(CheetahString::from_slice(s))
}

#[inline]
pub fn from_arc_vec(s: Arc<Vec<u8>>) -> Self {
CheetahString {
Expand Down Expand Up @@ -280,9 +354,17 @@ impl CheetahString {
match &self.inner {
InnerString::ArcString(s) => s.as_str(),
InnerString::StaticStr(s) => s,
InnerString::ArcVecString(s) => std::str::from_utf8(s.as_ref()).unwrap(),
InnerString::ArcVecString(s) => {
// SAFETY: ArcVecString is only created from validated UTF-8 sources.
// All constructors ensure this invariant is maintained.
unsafe { std::str::from_utf8_unchecked(s.as_ref()) }
Comment on lines +358 to +360
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The safety comment claims "ArcVecString is only created from validated UTF-8 sources" and "All constructors ensure this invariant is maintained." However, the from_arc_vec method at line 318 does not validate UTF-8, creating a path for invalid UTF-8 to reach this unsafe code. This violates the documented safety invariant and can cause undefined behavior.

Suggested change
// SAFETY: ArcVecString is only created from validated UTF-8 sources.
// All constructors ensure this invariant is maintained.
unsafe { std::str::from_utf8_unchecked(s.as_ref()) }
// Enforce that ArcVecString only contains valid UTF-8.
std::str::from_utf8(s.as_ref())
.expect("ArcVecString must contain valid UTF-8")

Copilot uses AI. Check for mistakes.
}
#[cfg(feature = "bytes")]
InnerString::Bytes(b) => std::str::from_utf8(b.as_ref()).unwrap(),
InnerString::Bytes(b) => {
// SAFETY: Bytes variant is only created from validated UTF-8 sources.
// The from_bytes constructor ensures this invariant.
unsafe { std::str::from_utf8_unchecked(b.as_ref()) }
}
Comment on lines +363 to +367
Copy link

Copilot AI Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The safety comment claims "Bytes variant is only created from validated UTF-8 sources" and "The from_bytes constructor ensures this invariant." However, the from_bytes method does not perform UTF-8 validation, which means this safety claim is incorrect and the use of from_utf8_unchecked here can cause undefined behavior if invalid UTF-8 is passed through that constructor.

Copilot uses AI. Check for mistakes.
Comment on lines +357 to +367
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Unsafe code relies on violated invariant - undefined behavior risk.

The safety comments claim "ArcVecString is only created from validated UTF-8 sources" and "Bytes variant is only created from validated UTF-8 sources," but as flagged in other comments, the following constructors violate this invariant:

  • from_vec (line 262)
  • from_arc_vec (line 318)
  • from_bytes (line 346)

This makes the from_utf8_unchecked calls here unsound. The invariant must be maintained by ALL constructors, or this code will cause undefined behavior.

🤖 Prompt for AI Agents
In src/cheetah_string.rs around lines 357 to 367, the unsafe from_utf8_unchecked
calls rely on an invariant that some constructors do not maintain, making this
unsound; fix by ensuring the invariant is enforced or by removing the unsafe
conversion: either validate UTF-8 in every constructor that can produce
ArcVecString/Bytes (from_vec, from_arc_vec, from_bytes) and return a Result or
panic on invalid UTF-8, or replace the unchecked conversions here with safe
std::str::from_utf8 and propagate or handle the error; update the safety
comments to reflect the chosen strategy.

InnerString::Empty => EMPTY_STRING,
}
}
Expand Down
46 changes: 46 additions & 0 deletions src/error.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
use core::fmt;
use core::str::Utf8Error;

/// Errors that can occur during CheetahString operations
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Error {
/// UTF-8 validation failed
Utf8Error(Utf8Error),
/// Index out of bounds
IndexOutOfBounds { index: usize, len: usize },
/// Invalid character boundary
InvalidCharBoundary { index: usize },
}

impl fmt::Display for Error {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
match self {
Error::Utf8Error(e) => write!(f, "UTF-8 error: {}", e),
Error::IndexOutOfBounds { index, len } => {
write!(f, "index {} out of bounds (len: {})", index, len)
}
Error::InvalidCharBoundary { index } => {
write!(f, "index {} is not a char boundary", index)
}
}
}
}

#[cfg(feature = "std")]
impl std::error::Error for Error {
fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
match self {
Error::Utf8Error(e) => Some(e),
_ => None,
}
}
}

impl From<Utf8Error> for Error {
fn from(e: Utf8Error) -> Self {
Error::Utf8Error(e)
}
}

/// Result type for CheetahString operations
pub type Result<T> = core::result::Result<T, Error>;
2 changes: 2 additions & 0 deletions src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -19,8 +19,10 @@
//! ```
//!
mod cheetah_string;
mod error;

#[cfg(feature = "serde")]
mod serde;

pub use cheetah_string::CheetahString;
pub use error::{Error, Result};
Loading
Loading