-
Notifications
You must be signed in to change notification settings - Fork 6.1k
Update docs on strings not being null-terminated #28470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks @Sergio0694 I like this change. I tagged @stephentoub to review. I want to make sure we want to assert what you've stated about the |
| # Strings (C# Programming Guide) | ||
|
|
||
| A string is an object of type <xref:System.String> whose value is text. Internally, the text is stored as a sequential read-only collection of <xref:System.Char> objects. There is no null-terminating character at the end of a C# string; therefore a C# string can contain any number of embedded null characters ('\0'). The <xref:System.String.Length%2A> property of a string represents the number of `Char` objects it contains, not the number of Unicode characters. To access the individual Unicode code points in a string, use the <xref:System.Globalization.StringInfo> object. | ||
| A string is an object of type <xref:System.String> whose value is text. Internally, the text is stored as a sequential read-only collection of <xref:System.Char> objects. The <xref:System.String.Length%2A> property of a string represents the number of `Char` objects it contains, not the number of Unicode characters. To access the individual Unicode code points in a string, use the <xref:System.Globalization.StringInfo> object. The length of a C# string is stored in a dedicated field and it is not computed by iterating on the string data to find a null-terminator. Therefore, a C# string can contain any number of embedded null characters ('\0'). Note that C# strings are not just length-prefixed, but also internally null-terminated: this makes it safe to marshal them to native code expecting a null-terminated sequence of characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to include this in the docs, we should probably also include that ReadOnlySpan<char> is frequently used to represent strings or slices of strings and are not guaranteed to be null-terminated.
While it'd probably be a significant breaking change if we were to ever change that, I'm not sure we've ever technically guaranteed it, and I don't think the ECMA spec does. @jkotas? |
|
ECMA 334 says this in 23.7 (
These three combined do seem to imply that strings being null terminated is just guaranteed by the spec, given I know this is the C# spec and not the .NET spec, but these three kinda seem to suggest this should be true in general? |
|
I think an implementation of GetPinnableReference that allocated a new region of memory, copied the string data into it, null-terminated that, and then returned the address of that memory would still meet the letter of what's outlined above. We're obviously not going to do that given the current implementation of string, but imagine a hypothetical future where we changed string's representation to be UTF8 instead of UTF16... at that point GetPinnableReference returning a ref readonly char would effectively have to do something like I just outlined. |
|
Yeah I was thinking something like that as well, but there's this part in the spec that makes me wonder if that's allowed:
That "in the string" bit (and the fact it stresses the fact the string object is pinned) makes me think the spec is activel suggesting that the address has to be to the first character within the actual string object, not into another temporary buffer 🤔 |
|
It's stressing that you're getting a pointer back to the first character (as opposed to, say, the second character) and that the data won't move while fixed. I don't believe it's requiring that it be the exact underlying memory used for all other string operations by the runtime. I don't see anything here that prevents it from being a copy. |
Correct. We have explicitly added |
|
I see. To get to the issue with this PR, if the null-termination is only guaranteed by the spec to exist when accessing the buffer through a |
| # Strings (C# Programming Guide) | ||
|
|
||
| A string is an object of type <xref:System.String> whose value is text. Internally, the text is stored as a sequential read-only collection of <xref:System.Char> objects. There is no null-terminating character at the end of a C# string; therefore a C# string can contain any number of embedded null characters ('\0'). The <xref:System.String.Length%2A> property of a string represents the number of `Char` objects it contains, not the number of Unicode characters. To access the individual Unicode code points in a string, use the <xref:System.Globalization.StringInfo> object. | ||
| A string is an object of type <xref:System.String> whose value is text. Internally, the text is stored as a sequential read-only collection of <xref:System.Char> objects. The <xref:System.String.Length%2A> property of a string represents the number of `Char` objects it contains, not the number of Unicode characters. To access the individual Unicode code points in a string, use the <xref:System.Globalization.StringInfo> object. The length of a C# string is stored in a dedicated field and it is not computed by iterating on the string data to find a null-terminator. Therefore, a C# string can contain any number of embedded null characters ('\0'). Note that C# strings are not just length-prefixed, but also internally null-terminated: this makes it safe to marshal them to native code expecting a null-terminated sequence of characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the target audience for this document? The description says "Learn about strings in C# programming." that feels like L100. It does not sound right to go into details about strings interop in the first paragraph for L100 audience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's the case, should we remove that mention of the null-terminator at the end of the string entirely? As in, people getting started with C# likely wouldn't know or care about what embedded null characters even are anyway, right? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's the better change. (Historical note: This text has been around for a long time. I'm betting it exists because a large segment of the audience for the early docs were C++ developers. This note would have been important then.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
I'm closing this as very stale. It's been more than 2 years since the last comment, and there are now conflicts. If you want to reopen the PR, address the conflicts and the comments, we'll take another look. |
Summary
The current C# docs stated that:
This is completely wrong, or at the least, very poorly phrased. I get that it probably meant to say that there is no null-terminator in the slice of size
Length, but the fact strings are still internally null-terminated should still be clear, as it's a fundamental aspect that's critical for whoever's working with native interop. This PR updates the docs accordingly.