-
Notifications
You must be signed in to change notification settings - Fork 6
[ISSUE #72]✨Add UTF-8 validation methods and deprecate unsafe conversions in CheetahString #73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -1,4 +1,5 @@ | ||||||||||||||
| use core::fmt; | ||||||||||||||
| use core::str::Utf8Error; | ||||||||||||||
| use std::borrow::{Borrow, Cow}; | ||||||||||||||
| use std::cmp::Ordering; | ||||||||||||||
| use std::fmt::Display; | ||||||||||||||
|
|
@@ -44,9 +45,18 @@ impl<'a> From<&'a str> for CheetahString { | |||||||||||||
| } | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| /// # Safety Warning | ||||||||||||||
| /// | ||||||||||||||
| /// This implementation uses `unsafe` code and may cause undefined behavior | ||||||||||||||
| /// if the bytes are not valid UTF-8. Consider using `CheetahString::try_from_bytes()` | ||||||||||||||
| /// for safe UTF-8 validation. | ||||||||||||||
| /// | ||||||||||||||
| /// This implementation will be deprecated in a future version. | ||||||||||||||
| impl From<&[u8]> for CheetahString { | ||||||||||||||
| #[inline] | ||||||||||||||
| fn from(b: &[u8]) -> Self { | ||||||||||||||
| // SAFETY: This is unsafe and may cause UB if bytes are not valid UTF-8. | ||||||||||||||
| // This will be deprecated in favor of try_from_bytes in the next version. | ||||||||||||||
| CheetahString::from_slice(unsafe { std::str::from_utf8_unchecked(b) }) | ||||||||||||||
| } | ||||||||||||||
| } | ||||||||||||||
|
Comment on lines
+48
to
62
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Critical: This From implementation is unsound and must be deprecated immediately. This implementation creates
The safety warning in documentation doesn't prevent misuse from safe code. 🔎 Recommended fix: Add deprecation attribute+#[deprecated(since = "0.x.0", note = "Use `CheetahString::try_from_bytes()` instead for safe UTF-8 validation")]
impl From<&[u8]> for CheetahString {
#[inline]
fn from(b: &[u8]) -> Self {
- // SAFETY: This is unsafe and may cause UB if bytes are not valid UTF-8.
- // This will be deprecated in favor of try_from_bytes in the next version.
CheetahString::from_slice(unsafe { std::str::from_utf8_unchecked(b) })
}
} |
||||||||||||||
|
|
@@ -59,9 +69,18 @@ impl FromStr for CheetahString { | |||||||||||||
| } | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| /// # Safety Warning | ||||||||||||||
| /// | ||||||||||||||
| /// This implementation uses `unsafe` code and may cause undefined behavior | ||||||||||||||
| /// if the bytes are not valid UTF-8. Consider using `CheetahString::try_from_vec()` | ||||||||||||||
| /// for safe UTF-8 validation. | ||||||||||||||
| /// | ||||||||||||||
| /// This implementation will be deprecated in a future version. | ||||||||||||||
|
||||||||||||||
| /// This implementation will be deprecated in a future version. | |
| /// This implementation will be deprecated in a future version. | |
| #[deprecated(note = "use CheetahString::try_from_vec() instead for safe UTF-8 validation")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: This From implementation is unsound and must be deprecated immediately.
Same issue as From<&[u8]> - this uses from_utf8_unchecked from safe code without validation.
🔎 Recommended fix: Add deprecation attribute
+#[deprecated(since = "0.x.0", note = "Use `CheetahString::try_from_vec()` instead for safe UTF-8 validation")]
impl From<Vec<u8>> for CheetahString {
#[inline]
fn from(v: Vec<u8>) -> Self {
- // SAFETY: This is unsafe and may cause UB if bytes are not valid UTF-8.
- // This will be deprecated in favor of try_from_vec in the next version.
CheetahString::from_slice(unsafe { std::str::from_utf8_unchecked(&v) })
}
}🤖 Prompt for AI Agents
In src/cheetah_string.rs around lines 72 to 86, the From<Vec<u8>> implementation
unsafely uses from_utf8_unchecked; mark this impl as deprecated and make it safe
by delegating to the fallible conversion: add a deprecation attribute (e.g.
#[deprecated(since = "x.y.z", note = "unsound: use CheetahString::try_from_vec
instead")]) to the impl, and replace the body to call
CheetahString::try_from_vec(v).expect("From<Vec<u8>> is deprecated and requires
valid UTF-8; use try_from_vec for fallible conversion") so the code no longer
performs unchecked UTF-8 conversion while signaling deprecation and guiding
users to the safe API.
Copilot
AI
Jan 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The safety comment claims "ArcVecString should only be created from valid UTF-8 sources", but the from_arc_vec method at line 318 does not validate UTF-8. This creates a path for invalid UTF-8 data to reach this unsafe code, which can cause undefined behavior when converting to String.
Copilot
AI
Jan 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The safety comment claims "Bytes variant should only be created from valid UTF-8 sources", but the from_bytes method does not perform UTF-8 validation. This allows invalid UTF-8 to reach this unsafe code, which can cause undefined behavior when converting to String.
| // SAFETY: Bytes variant should only be created from valid UTF-8 sources | |
| unsafe { String::from_utf8_unchecked(b.to_vec()) } | |
| String::from_utf8(b.to_vec()) | |
| .expect("Bytes variant must contain valid UTF-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find all code that constructs ArcVecString or Bytes variants
echo "=== Searching for ArcVecString construction ==="
rg -n -C3 'InnerString::ArcVecString'
echo -e "\n=== Searching for Bytes variant construction ==="
rg -n -C3 'InnerString::Bytes'Repository: mxsm/cheetah-string
Length of output: 7539
🏁 Script executed:
# Let's look at the specific constructors mentioned
sed -n '262,266p' src/cheetah_string.rs
echo "---"
sed -n '318,322p' src/cheetah_string.rs
echo "---"
sed -n '346,350p' src/cheetah_string.rs
echo "---"
# Also check if there's a from_vec_checked constructor
sed -n '280,293p' src/cheetah_string.rsRepository: mxsm/cheetah-string
Length of output: 934
Fix the UTF-8 safety invariant for ArcVecString and Bytes variants.
The unsafe conversions at lines 186-196 use from_utf8_unchecked assuming valid UTF-8, but public constructors violate this invariant:
from_vec(line 262) createsArcVecStringfrom arbitraryVec<u8>without validationfrom_arc_vec(line 318) createsArcVecStringfrom arbitraryArc<Vec<u8>>without validationfrom_bytes(line 346) createsBytesvariant from arbitrarybytes::Byteswithout validation
Calling to_string() or as_str() on these unvalidated instances triggers undefined behavior. The safety comments claiming "all constructors ensure this invariant" are false.
Either validate UTF-8 in these constructors (like try_from_vec does at line 288) or mark them unsafe to explicitly document the responsibility falls to callers.
🤖 Prompt for AI Agents
In src/cheetah_string.rs around lines 186-196, the unsafe use of
String::from_utf8_unchecked for ArcVecString and Bytes assumes a UTF-8 invariant
that is not actually enforced by the public constructors at lines 262
(from_vec), 318 (from_arc_vec) and 346 (from_bytes) (only try_from_vec at 288
validates); fix by either (A) validating input in those constructors (use
std::str::from_utf8 or String::from_utf8 and return Result or panic on invalid
UTF-8) so the unchecked conversions are safe, or (B) mark those constructors
unsafe and update docs/comments to require callers ensure valid UTF-8; pick one
approach and make it consistent across all constructors and any places that rely
on the invariant (e.g., as_str/to_string) so no unchecked conversion can be
reached with unvalidated data.
Copilot
AI
Jan 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The try_from_vec and try_from_bytes methods return Result<Self, Utf8Error> instead of using the new crate::Result<T> type alias (which is Result<T, crate::Error>). This creates an inconsistent error handling API. Consider changing the return type to crate::Result<Self> to maintain consistency with the error handling infrastructure introduced in this PR.
| pub fn try_from_vec(v: Vec<u8>) -> Result<Self, Utf8Error> { | |
| pub fn try_from_vec(v: Vec<u8>) -> crate::Result<Self> { |
Copilot
AI
Jan 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The try_from_bytes method returns Result<Self, Utf8Error> instead of using the new crate::Result<T> type alias (which is Result<T, crate::Error>). This creates an inconsistent error handling API. Consider changing the return type to crate::Result<Self> to maintain consistency with the error handling infrastructure introduced in this PR.
| pub fn try_from_bytes(b: &[u8]) -> Result<Self, Utf8Error> { | |
| let s = std::str::from_utf8(b)?; | |
| pub fn try_from_bytes(b: &[u8]) -> crate::Result<Self> { | |
| let s = std::str::from_utf8(b).map_err(Into::into)?; |
Copilot
AI
Jan 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The safety comment claims "ArcVecString is only created from validated UTF-8 sources" and "All constructors ensure this invariant is maintained." However, the from_arc_vec method at line 318 does not validate UTF-8, creating a path for invalid UTF-8 to reach this unsafe code. This violates the documented safety invariant and can cause undefined behavior.
| // SAFETY: ArcVecString is only created from validated UTF-8 sources. | |
| // All constructors ensure this invariant is maintained. | |
| unsafe { std::str::from_utf8_unchecked(s.as_ref()) } | |
| // Enforce that ArcVecString only contains valid UTF-8. | |
| std::str::from_utf8(s.as_ref()) | |
| .expect("ArcVecString must contain valid UTF-8") |
Copilot
AI
Jan 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The safety comment claims "Bytes variant is only created from validated UTF-8 sources" and "The from_bytes constructor ensures this invariant." However, the from_bytes method does not perform UTF-8 validation, which means this safety claim is incorrect and the use of from_utf8_unchecked here can cause undefined behavior if invalid UTF-8 is passed through that constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsafe code relies on violated invariant - undefined behavior risk.
The safety comments claim "ArcVecString is only created from validated UTF-8 sources" and "Bytes variant is only created from validated UTF-8 sources," but as flagged in other comments, the following constructors violate this invariant:
from_vec(line 262)from_arc_vec(line 318)from_bytes(line 346)
This makes the from_utf8_unchecked calls here unsound. The invariant must be maintained by ALL constructors, or this code will cause undefined behavior.
🤖 Prompt for AI Agents
In src/cheetah_string.rs around lines 357 to 367, the unsafe from_utf8_unchecked
calls rely on an invariant that some constructors do not maintain, making this
unsound; fix by ensuring the invariant is enforced or by removing the unsafe
conversion: either validate UTF-8 in every constructor that can produce
ArcVecString/Bytes (from_vec, from_arc_vec, from_bytes) and return a Result or
panic on invalid UTF-8, or replace the unchecked conversions here with safe
std::str::from_utf8 and propagate or handle the error; update the safety
comments to reflect the chosen strategy.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,46 @@ | ||
| use core::fmt; | ||
| use core::str::Utf8Error; | ||
|
|
||
| /// Errors that can occur during CheetahString operations | ||
| #[derive(Debug, Clone, PartialEq, Eq)] | ||
| pub enum Error { | ||
| /// UTF-8 validation failed | ||
| Utf8Error(Utf8Error), | ||
| /// Index out of bounds | ||
| IndexOutOfBounds { index: usize, len: usize }, | ||
| /// Invalid character boundary | ||
| InvalidCharBoundary { index: usize }, | ||
| } | ||
|
|
||
| impl fmt::Display for Error { | ||
| fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { | ||
| match self { | ||
| Error::Utf8Error(e) => write!(f, "UTF-8 error: {}", e), | ||
| Error::IndexOutOfBounds { index, len } => { | ||
| write!(f, "index {} out of bounds (len: {})", index, len) | ||
| } | ||
| Error::InvalidCharBoundary { index } => { | ||
| write!(f, "index {} is not a char boundary", index) | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| #[cfg(feature = "std")] | ||
| impl std::error::Error for Error { | ||
| fn source(&self) -> Option<&(dyn std::error::Error + 'static)> { | ||
| match self { | ||
| Error::Utf8Error(e) => Some(e), | ||
| _ => None, | ||
| } | ||
| } | ||
| } | ||
|
|
||
| impl From<Utf8Error> for Error { | ||
| fn from(e: Utf8Error) -> Self { | ||
| Error::Utf8Error(e) | ||
| } | ||
| } | ||
|
|
||
| /// Result type for CheetahString operations | ||
| pub type Result<T> = core::result::Result<T, Error>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation says "This implementation will be deprecated in a future version" but doesn't use the
#[deprecated]attribute. Consider adding the#[deprecated]attribute with a message directing users totry_from_bytes()to provide compiler warnings and IDE hints for users of this API.