Update docs on strings not being null-terminated #28470

Sergio0694 · 2022-03-03T11:49:55Z

Summary

The current C# docs state that:

"There is no null-terminating character at the end of a C# string"

This is completely wrong, or at the very least poorly phrased. I get that it probably meant that there is no null terminator within the slice of size Length, but the fact that strings are still internally null-terminated should be made clear, since it's a fundamental detail that's critical for anyone working with native interop. This PR updates the docs accordingly.

BillWagner · 2022-03-03T14:42:41Z

Thanks @Sergio0694 I like this change.

I tagged @stephentoub to review. I want to confirm whether we want to assert what you've stated about the string type, or whether we should word it as an implementation detail that might change in some future release.

stephentoub · 2022-03-03T14:43:24Z

docs/csharp/programming-guide/strings/index.md

 # Strings (C# Programming Guide)

-A string is an object of type <xref:System.String> whose value is text. Internally, the text is stored as a sequential read-only collection of <xref:System.Char> objects. There is no null-terminating character at the end of a C# string; therefore a C# string can contain any number of embedded null characters ('\0'). The <xref:System.String.Length%2A> property of a string represents the number of `Char` objects it contains, not the number of Unicode characters. To access the individual Unicode code points in a string, use the <xref:System.Globalization.StringInfo> object.  
+A string is an object of type <xref:System.String> whose value is text. Internally, the text is stored as a sequential read-only collection of <xref:System.Char> objects. The <xref:System.String.Length%2A> property of a string represents the number of `Char` objects it contains, not the number of Unicode characters. To access the individual Unicode code points in a string, use the <xref:System.Globalization.StringInfo> object. The length of a C# string is stored in a dedicated field and it is not computed by iterating on the string data to find a null-terminator. Therefore, a C# string can contain any number of embedded null characters ('\0'). Note that C# strings are not just length-prefixed, but also internally null-terminated: this makes it safe to marshal them to native code expecting a null-terminated sequence of characters.


If we're going to include this in the docs, we should probably also mention that ReadOnlySpan<char> is frequently used to represent strings or slices of strings, and that it is not guaranteed to be null-terminated.
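To make the point about spans concrete, here is a minimal sketch, assuming a slice taken from the middle of a larger string. The names `SpanTermination` and `ToNullTerminated` are hypothetical and exist only for this illustration. The slice is a view over the original string's data, so the character that follows it is whatever comes next in that string, not a terminator.

```csharp
using System;

public static class SpanTermination
{
    // Hypothetical helper: copies a span into a fresh buffer with one extra
    // slot. New arrays are zero-initialized, so the last slot stays '\0'.
    public static char[] ToNullTerminated(ReadOnlySpan<char> text)
    {
        var buffer = new char[text.Length + 1];
        text.CopyTo(buffer);
        return buffer;
    }

    public static void Main()
    {
        string s = "hello world";
        ReadOnlySpan<char> slice = s.AsSpan(0, 5); // views "hello" in place, no copy

        // The char right after the slice is the space from the original
        // string, not '\0', so a reader that scans for a terminator would
        // run past the end of the slice.
        Console.WriteLine(s[slice.Length] == ' ');

        // An explicit copy is one way to get a terminated buffer.
        char[] terminated = ToNullTerminated(slice);
        Console.WriteLine(terminated[terminated.Length - 1] == '\0');
    }
}
```

The copy is only one option; the point is that code receiving a span cannot assume a terminator exists and has to add one itself before handing the data to anything that expects a C-style string.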

stephentoub · 2022-03-03T14:46:30Z

"...if we want to word it as an implementation detail that might change in some future release."

While it'd probably be a significant breaking change if we were to ever change that, I'm not sure we've ever technically guaranteed it, and I don't think the ECMA spec does. @jkotas?

Sergio0694 · 2022-03-03T15:32:00Z

ECMA-334 says this in section 23.7 (the fixed statement):

"An expression of type string, provided the type char* is implicitly convertible to the pointer type
given in the fixed statement. In this case, the initializer computes the address of the first character in
the string, and the entire string is guaranteed to remain at a fixed address for the duration of the
fixed statement. The behavior of the fixed statement is implementation-defined if the string
expression is null."

"A char* value produced by fixing a string instance always points to a null-terminated string. Within a fixed
statement that obtains a pointer p to a string instance s, the pointer values ranging from p to
p + s.Length - 1 represent addresses of the characters in the string, and the pointer value
p + s.Length always points to a null character (the character with value '\0')"

"[Note: The automatic null-termination of strings is particularly convenient when calling external APIs that
expect “C-style” strings. Note, however, that a string instance is permitted to contain null characters. If
such null characters are present, the string will appear truncated when treated as a null-terminated char*.
end note]"

Taken together, these three passages seem to imply that null-termination of strings is guaranteed by the spec, given that fixed is documented here to return the address of the first character in the string, and that the resulting char* buffer is null-terminated.

I know this is the C# spec and not the .NET spec, but these three kinda seem to suggest this should be true in general?
Unless there's still room for an ECMA-compliant .NET runtime to somehow implement string differently, internally? 🤔
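The behavior described in the quoted passages can be sketched as follows. This is an illustration only, assuming a non-null string and a project compiled with unsafe code enabled (`AllowUnsafeBlocks`); the class and method names are hypothetical.

```csharp
using System;

public static class FixedStringDemo
{
    // Counts chars up to the first '\0', the way a C-style consumer would.
    // Assumes s is not null.
    public static unsafe int CStyleLength(string s)
    {
        fixed (char* p = s)
        {
            int n = 0;
            while (p[n] != '\0') n++;
            return n;
        }
    }

    // Reads the char at p + s.Length, which the quoted spec text says is '\0'.
    public static unsafe char CharAfterEnd(string s)
    {
        fixed (char* p = s)
        {
            return p[s.Length];
        }
    }

    public static void Main()
    {
        string s = "ab\0cd"; // contains an embedded null character

        Console.WriteLine(s.Length);                // 5
        Console.WriteLine(CStyleLength(s));         // 2 (appears truncated)
        Console.WriteLine(CharAfterEnd(s) == '\0'); // True
    }
}
```

`Length` reports all five chars because it does not depend on a terminator, while the pointer-based scan stops at the embedded null, which is the truncation the spec note warns about.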

stephentoub · 2022-03-03T15:38:19Z

I think an implementation of GetPinnableReference that allocated a new region of memory, copied the string data into it, null-terminated that, and then returned the address of that memory would still meet the letter of what's outlined above. We're obviously not going to do that given the current implementation of string, but imagine a hypothetical future where we changed string's representation to be UTF8 instead of UTF16... at that point GetPinnableReference returning a ref readonly char would effectively have to do something like I just outlined.
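A rough sketch of the hypothetical described above, using pattern-based `fixed` on a made-up type. `Utf8BackedText` is not a .NET API and says nothing about how `string` is or will be implemented; it only shows that a `GetPinnableReference` that returns a reference into a null-terminated copy still gives the caller a terminated `char*`. It assumes unsafe code is enabled.

```csharp
using System;
using System.Text;

// Hypothetical type for illustration only; it is not a .NET API.
public sealed class Utf8BackedText
{
    private readonly byte[] _utf8;

    public Utf8BackedText(string value)
    {
        _utf8 = Encoding.UTF8.GetBytes(value);
    }

    // Pattern-based `fixed` calls this method. It decodes into a fresh UTF-16
    // buffer with one extra zeroed slot and returns a reference into the copy.
    public ref readonly char GetPinnableReference()
    {
        var copy = new char[Encoding.UTF8.GetCharCount(_utf8) + 1];
        Encoding.UTF8.GetChars(_utf8, 0, _utf8.Length, copy, 0);
        return ref copy[0];
    }
}

public static class PinnableCopyDemo
{
    // Scans for the terminator through the pointer produced by `fixed`.
    public static unsafe int CStyleLength(Utf8BackedText text)
    {
        fixed (char* p = text)
        {
            int n = 0;
            while (p[n] != '\0') n++;
            return n;
        }
    }

    public static void Main()
    {
        Console.WriteLine(CStyleLength(new Utf8BackedText("hello"))); // 5
    }
}
```

The pointer here refers to a temporary copy rather than the type's own storage, which is the distinction the rest of the thread debates.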

Sergio0694 · 2022-03-03T15:48:27Z

Yeah I was thinking something like that as well, but there's this part in the spec that makes me wonder if that's allowed:

"the initializer computes the address of the first character in the string, and the entire string is guaranteed to remain at a fixed address for the duration of the fixed statement"

That "in the string" bit (and the emphasis on the string object being pinned) makes me think the spec is actively suggesting that the address has to point to the first character within the actual string object, not into another temporary buffer 🤔
Am I reading this wrong?

stephentoub · 2022-03-03T15:51:59Z

It's stressing that you're getting a pointer back to the first character (as opposed to, say, the second character) and that the data won't move while fixed. I don't believe it's requiring that it be the exact underlying memory used for all other string operations by the runtime. I don't see anything here that prevents it from being a copy.

jkotas · 2022-03-03T15:53:20Z

"I think an implementation of GetPinnableReference that allocated a new region of memory, copied the string data into it, null-terminated that, and then returned the address of that memory would still meet the letter of what's outlined above."

Correct. We explicitly added the GetPinnableReference method and updated Roslyn to use it, to give us the option to experiment with alternative string implementations like that.

Sergio0694 · 2022-03-03T15:57:43Z

I see. To get to the issue with this PR, if the null-termination is only guaranteed by the spec to exist when accessing the buffer through a fixed statement, should we update the intro to reflect that? I guess I just feel like the docs stating that strings are "not null-terminated" without elaborating on what this actually means in practice might be a bit misleading for folks.

jkotas · 2022-03-03T16:06:28Z

docs/csharp/programming-guide/strings/index.md

 # Strings (C# Programming Guide)

-A string is an object of type <xref:System.String> whose value is text. Internally, the text is stored as a sequential read-only collection of <xref:System.Char> objects. There is no null-terminating character at the end of a C# string; therefore a C# string can contain any number of embedded null characters ('\0'). The <xref:System.String.Length%2A> property of a string represents the number of `Char` objects it contains, not the number of Unicode characters. To access the individual Unicode code points in a string, use the <xref:System.Globalization.StringInfo> object.  
+A string is an object of type <xref:System.String> whose value is text. Internally, the text is stored as a sequential read-only collection of <xref:System.Char> objects. The <xref:System.String.Length%2A> property of a string represents the number of `Char` objects it contains, not the number of Unicode characters. To access the individual Unicode code points in a string, use the <xref:System.Globalization.StringInfo> object. The length of a C# string is stored in a dedicated field and it is not computed by iterating on the string data to find a null-terminator. Therefore, a C# string can contain any number of embedded null characters ('\0'). Note that C# strings are not just length-prefixed, but also internally null-terminated: this makes it safe to marshal them to native code expecting a null-terminated sequence of characters.


What is the target audience for this document? The description says "Learn about strings in C# programming.", which feels like 100-level (introductory) content. It does not sound right to go into details about string interop in the first paragraph for a 100-level audience.

Sergio0694 · 2022-03-03

If that's the case, should we remove the mention of the null-terminator at the end of the string entirely? As in, people getting started with C# likely wouldn't know or care what embedded null characters even are anyway, right? 🤔

BillWagner · 2022-03-03

I think that's the better change. (Historical note: This text has been around for a long time. I'm betting it exists because a large segment of the audience for the early docs were C++ developers. This note would have been important then.)

BillWagner · 2024-10-16T17:27:53Z

@Sergio0694

I'm closing this as very stale. It's been more than 2 years since the last comment, and there are now conflicts. If you want to reopen the PR, address the conflicts and the comments, we'll take another look.

Update docs on strings not being null-terminated

502f536

Sergio0694 requested a review from BillWagner as a code owner March 3, 2022 11:49

dotnet-bot added fundamentals/subsvc dotnet-csharp/svc labels Mar 3, 2022

BillWagner requested a review from stephentoub March 3, 2022 14:41

stephentoub reviewed Mar 3, 2022

jkotas reviewed Mar 3, 2022

BillWagner added the rerun-action-opened label Jun 3, 2022

dotnet-bot removed the rerun-action-opened label Jun 3, 2022

dotnet-bot added this to the June 2022 milestone Jun 3, 2022

BillWagner closed this Oct 16, 2024

jkotas mentioned this pull request Oct 16, 2024

Delete mention of null characters from C# string intro #43097

Merged
