@Sergio0694
Contributor

Summary

The current C# docs state that:

"There is no null-terminating character at the end of a C# string"

This is completely wrong, or at the very least, very poorly phrased. I get that it probably meant to say that there is no null-terminator within the slice of size Length, but the fact that strings are still internally null-terminated should be made clear, as it's a fundamental aspect that's critical for anyone working with native interop. This PR updates the docs accordingly.

@BillWagner
Member

Thanks @Sergio0694 I like this change.

I tagged @stephentoub to review. I want to make sure we want to assert what you've stated about the string type, or if we want to word it as an implementation detail that might change in some future release.

# Strings (C# Programming Guide)

- A string is an object of type <xref:System.String> whose value is text. Internally, the text is stored as a sequential read-only collection of <xref:System.Char> objects. There is no null-terminating character at the end of a C# string; therefore a C# string can contain any number of embedded null characters ('\0'). The <xref:System.String.Length%2A> property of a string represents the number of `Char` objects it contains, not the number of Unicode characters. To access the individual Unicode code points in a string, use the <xref:System.Globalization.StringInfo> object.
+ A string is an object of type <xref:System.String> whose value is text. Internally, the text is stored as a sequential read-only collection of <xref:System.Char> objects. The <xref:System.String.Length%2A> property of a string represents the number of `Char` objects it contains, not the number of Unicode characters. To access the individual Unicode code points in a string, use the <xref:System.Globalization.StringInfo> object. The length of a C# string is stored in a dedicated field and it is not computed by iterating on the string data to find a null-terminator. Therefore, a C# string can contain any number of embedded null characters ('\0'). Note that C# strings are not just length-prefixed, but also internally null-terminated: this makes it safe to marshal them to native code expecting a null-terminated sequence of characters.
Member

If we're going to include this in the docs, we should probably also include that ReadOnlySpan<char> is frequently used to represent strings or slices of strings and are not guaranteed to be null-terminated.
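
To illustrate the point about slices, here is a minimal sketch. The `SpanSliceDemo` class and its method are hypothetical names, the code must be compiled with unsafe code enabled, and the pointer read exists only to inspect memory. It shows that a `ReadOnlySpan<char>` slice that ends inside a string is followed by the next character of that string, not by a null character:

```csharp
using System;

static class SpanSliceDemo
{
    // Returns the char stored immediately after the slice s[start..start+length).
    // Illustration only: reading at index slice.Length is outside the span's
    // bounds and is valid here solely because the slice is required to end
    // strictly inside the original string.
    public static unsafe char CharAfterSlice(string s, int start, int length)
    {
        ReadOnlySpan<char> slice = s.AsSpan(start, length);
        if (slice.IsEmpty || start + length >= s.Length)
            throw new ArgumentException("The slice must be non-empty and end before the end of the string.");

        fixed (char* p = slice)
        {
            return p[slice.Length];
        }
    }
}
```

For example, `SpanSliceDemo.CharAfterSlice("hello world", 0, 5)` returns `' '`, so native code that scans for `'\0'` would read past the end of the `"hello"` slice.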

@stephentoub
Member

> if we want to word it as an implementation detail that might change in some future release.

While it'd probably be a significant breaking change if we were to ever change that, I'm not sure we've ever technically guaranteed it, and I don't think the ECMA spec does. @jkotas?

@Sergio0694
Contributor Author

ECMA 334 says this in 23.7 (fixed statement):

"An expression of type string, provided the type char* is implicitly convertible to the pointer type
given in the fixed statement. In this case, the initializer computes the address of the first character in
the string, and the entire string is guaranteed to remain at a fixed address for the duration of the
fixed statement. The behavior of the fixed statement is implementation-defined if the string
expression is null."

"A char* value produced by fixing a string instance always points to a null-terminated string. Within a fixed
statement that obtains a pointer p to a string instance s, the pointer values ranging from p to
p + s.Length - 1 represent addresses of the characters in the string, and the pointer value
p + s.Length always points to a null character (the character with value '\0')"

"[Note: The automatic null-termination of strings is particularly convenient when calling external APIs that
expect “C-style” strings. Note, however, that a string instance is permitted to contain null characters. If
such null characters are present, the string will appear truncated when treated as a null-terminated char*.
end note]"

Taken together, these three passages do seem to imply that the spec guarantees strings are null-terminated, given that fixed is documented here to return the address of the first character in the string, and that the resulting char* buffer is null-terminated.

I know this is the C# spec and not the .NET spec, but these three kinda seem to suggest this should be true in general?
Unless there's still room for an ECMA-compliant .NET runtime to somehow implement string differently, internally? 🤔
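
The behavior described in the quoted passages can be sketched as follows. This is a minimal illustration, not text from the spec: the `FixedStringDemo` class and its method names are hypothetical, and the code must be compiled with unsafe code enabled. It relies only on the quoted guarantee that `p + s.Length` points to a null character inside a `fixed` statement:

```csharp
using System;

static class FixedStringDemo
{
    // Returns the char found at p + s.Length while s is pinned with `fixed`.
    public static unsafe char CharAfterEnd(string s)
    {
        if (s is null) throw new ArgumentNullException(nameof(s));

        fixed (char* p = s)
        {
            return p[s.Length];
        }
    }

    // Measures the string the way native code would: by scanning for '\0'.
    // An embedded null character makes the string appear truncated.
    public static unsafe int NativeStyleLength(string s)
    {
        if (s is null) throw new ArgumentNullException(nameof(s));

        fixed (char* p = s)
        {
            int n = 0;
            while (p[n] != '\0') n++;
            return n;
        }
    }
}
```

For example, `FixedStringDemo.CharAfterEnd("abc")` returns `'\0'`, while `FixedStringDemo.NativeStyleLength("ab\0cd")` returns 2 even though `"ab\0cd".Length` is 5, which is the truncation the spec's note warns about.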

@stephentoub
Member

I think an implementation of GetPinnableReference that allocated a new region of memory, copied the string data into it, null-terminated that, and then returned the address of that memory would still meet the letter of what's outlined above. We're obviously not going to do that given the current implementation of string, but imagine a hypothetical future where we changed string's representation to be UTF8 instead of UTF16... at that point GetPinnableReference returning a ref readonly char would effectively have to do something like I just outlined.
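
As a rough sketch of the copy-and-terminate idea in that hypothetical, the helper below decodes UTF-8 storage into a freshly allocated, null-terminated UTF-16 buffer. Everything here is an assumption made for illustration: `CopyingPinDemo` and `ToNullTerminatedUtf16` are invented names, and nothing in this thread says the runtime does or would do it this way.

```csharp
using System;
using System.Text;

static class CopyingPinDemo
{
    // Hypothetical: given UTF-8 backing storage, produce a separate UTF-16
    // copy with one extra trailing element that is left as '\0'.
    public static char[] ToNullTerminatedUtf16(byte[] utf8)
    {
        if (utf8 is null) throw new ArgumentNullException(nameof(utf8));

        int charCount = Encoding.UTF8.GetCharCount(utf8);
        char[] buffer = new char[charCount + 1]; // last element stays '\0'
        Encoding.UTF8.GetChars(utf8, 0, utf8.Length, buffer, 0);
        return buffer;
    }
}
```

A pointer into such a buffer would still point at the first character and be null-terminated, but it would not be the memory the string itself is stored in.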

@Sergio0694
Contributor Author

Yeah I was thinking something like that as well, but there's this part in the spec that makes me wonder if that's allowed:

"the initializer computes the address of the first character in the string, and the entire string is guaranteed to remain at a fixed address for the duration of the fixed statement"

That "in the string" bit (and the way it stresses that the string object is pinned) makes me think the spec is actively suggesting that the address has to point to the first character within the actual string object, not into another temporary buffer 🤔
Am I reading this wrong?

@stephentoub
Member

stephentoub commented Mar 3, 2022

It's stressing that you're getting a pointer back to the first character (as opposed to, say, the second character) and that the data won't move while fixed. I don't believe it's requiring that it be the exact underlying memory used for all other string operations by the runtime. I don't see anything here that prevents it from being a copy.

@jkotas
Member

jkotas commented Mar 3, 2022

> I think an implementation of GetPinnableReference that allocated a new region of memory, copied the string data into it, null-terminated that, and then returned the address of that memory would still meet the letter of what's outlined above.

Correct. We explicitly added the GetPinnableReference method and updated Roslyn to use it, to give us the option to experiment with alternative string implementations like that.

@Sergio0694
Contributor Author

I see. To get to the issue with this PR, if the null-termination is only guaranteed by the spec to exist when accessing the buffer through a fixed statement, should we update the intro to reflect that? I guess I just feel like the docs stating that strings are "not null-terminated" without elaborating on what this actually means in practice might be a bit misleading for folks.

Member

What is the target audience for this document? The description says "Learn about strings in C# programming", which feels like L100. It does not sound right to go into details about string interop in the first paragraph for an L100 audience.

Contributor Author

If that's the case, should we remove that mention of the null-terminator at the end of the string entirely? As in, people getting started with C# likely wouldn't know or care about what embedded null characters even are anyway, right? 🤔

Member

I think that's the better change. (Historical note: This text has been around for a long time. I'm betting it exists because a large segment of the audience for the early docs were C++ developers. This note would have been important then.)

@BillWagner
Member

@Sergio0694

I'm closing this as very stale. It's been more than 2 years since the last comment, and there are now conflicts. If you want to reopen the PR, address the conflicts and the comments, we'll take another look.
