LDM notes for 2019-09-16 and 2019-09-18 #2826

MgSam · 2019-09-26T14:55:29Z

MgSam
Sep 26, 2019

There hasn't been a topic for the design notes in a while. Opening this one so people can discuss the most recent ones.

https://github.com/dotnet/csharplang/blob/master/meetings/2019/LDM-2019-09-16.md

https://github.com/dotnet/csharplang/blob/master/meetings/2019/LDM-2019-09-18.md

MgSam · 2019-09-26T14:58:30Z

MgSam
Sep 26, 2019
Author

My 2 cents: I don't see the need for a new ustring keyword. Utf8String isn't really that onerous to type (especially with auto-complete; you could type "U8S" and completion should take care of the rest). We've lived with having to type DateTime for all these years (and with having no literal for it, despite that being an extremely common use case), so I think we can make do here as well.

It's also confusing in that it reads as "unsigned string", given the precedent for the numeric keywords.


alrz · 2019-09-26T17:41:41Z

alrz
Sep 26, 2019

🍝 if Int32 then String8


CyrusNajmabadi · 2019-09-26T19:05:22Z

CyrusNajmabadi
Sep 26, 2019
Collaborator

#bikeshedding.

I would just have it be utf or utf8.


MgSam · 2019-09-26T19:10:51Z

MgSam
Sep 26, 2019
Author

@CyrusNajmabadi I agree on the bike shedding part as far as the class name goes. However, I do think the bar to adding language keywords should be much, much higher than adding a new class to the framework and keywords should only be added if they enable new functionality not possible otherwise or substantially reduce verbosity in an extremely common use case.


CyrusNajmabadi · 2019-09-26T19:17:06Z

CyrusNajmabadi
Sep 26, 2019
Collaborator

> @CyrusNajmabadi I agree on the bike shedding part as far as the class name goes. However, I do think the bar to adding language keywords should be much, much higher than adding a new class to the framework and keywords should only be added if they enable new functionality not possible otherwise or substantially reduce verbosity in an extremely common use case.

I agree. And i think this is such a case.


Joe4evr · 2019-09-26T20:03:51Z

Joe4evr
Sep 26, 2019

Since I'm not seeing it mentioned, would there at least be a way to explicitly make a Utf8String literal?


alrz · 2019-09-26T20:27:03Z

alrz
Sep 26, 2019

could just use a well-known implicit conversion from string, as long as it's not necessary to make it work with var.

edit: however that would make it impossible/ugly to overload on string and utf8 string at the same time.


CyrusNajmabadi · 2019-09-26T20:39:03Z

CyrusNajmabadi
Sep 26, 2019
Collaborator

> edit: however that would make it impossible/ugly to overload on string and utf8 string at the same time.

Yeah, i'm on the fence here. I think i'd prefer a hybrid approach. Allow explicit u"whatever" as well as a target-typed conversion from a string literal to a utf8 string. That allows you to just be simple in many cases, but you can be explicit when necessary.

It's similar if you had this:

    void Foo(int v) { }
    void Foo(long v) { }

You can write Foo(1) and it will call the int version, but you can also call Foo(1L) to call the long version.

Similar stuff would happen with string literals and the different destination types. You could disambiguate in the cases where it was necessary.


alrz · 2019-09-26T20:59:12Z

alrz
Sep 26, 2019

alternatively we could just decide that utf8 is a "better" overload if that's always preferable when the framework offers both for binary compat.

I don't want to worry about utf when I pass a string to a framework method. if it can work with utf8 then great, prefer the new method. (note: this still can be considered as a breaking change for the framework if it depends on utf16 but it's only limited to direct literals)


CyrusNajmabadi · 2019-09-26T21:01:08Z

CyrusNajmabadi
Sep 26, 2019
Collaborator

@alrz Definitely an interesting idea. I think that could make a lot of sense. There would be no binary breaks. There would be source breaks for how literals were treated. But it would only be for APIs effectively stating they supported both, where the presumption would likely be that utf8 would be preferred.

So that sgtm.


alrz · 2019-09-26T21:07:57Z

alrz
Sep 26, 2019

this is similar to params Span proposal where it's preferred over params array so a mere recompilation would resolve all methods to the non allocating overload.


CyrusNajmabadi · 2019-09-26T21:10:20Z

CyrusNajmabadi
Sep 26, 2019
Collaborator

Yup. I like the idea that the language can add rules about new language/API features and say that if an API adds support for both, they'll prefer the new/faster thing.


yaakov-h · 2019-09-26T23:37:13Z

yaakov-h
Sep 26, 2019

Same. That was the thing that really annoyed me about FormattableString, hopefully ValueFormattableString will get to "fix" that.

I hope the team doesn't make the same decision with Utf8String.


qrli · 2019-09-27T04:10:27Z

qrli
Sep 27, 2019

Just to mention that not all utf-16 strings can be converted to utf-8 strings, which leads to non-standard wtf-8 encoding.
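To illustrate the point above: a System.String may contain an unpaired surrogate, which has no well-formed UTF-8 encoding. The following is a minimal sketch using the existing System.Text encoders, not the proposed Utf8String; the class name and sample string are made up for the example.

```csharp
using System;
using System.Text;

class LoneSurrogateDemo
{
    static void Main()
    {
        // A lone high surrogate: legal in a System.String, but not well-formed UTF-16.
        string s = "a\uD800b";

        // The default UTF-8 encoder substitutes U+FFFD for the unpaired surrogate,
        // so the round trip is lossy.
        byte[] lossy = Encoding.UTF8.GetBytes(s);
        Console.WriteLine(BitConverter.ToString(lossy));

        // A strict encoder refuses the conversion instead of substituting.
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        try
        {
            strict.GetBytes(s);
        }
        catch (EncoderFallbackException e)
        {
            Console.WriteLine("Cannot encode: " + e.Message);
        }
    }
}
```

Either way, the original string cannot be recovered from standard UTF-8 bytes, which is the gap that WTF-8 fills.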


orthoxerox · 2019-09-27T07:19:45Z

orthoxerox
Sep 27, 2019

Literals like u8"blah" and u16"blah" should help indicate the correct overload even if the LDT doesn't make one string type more equal than the other.

I also think a default indexer or enumerator on Utf8String is a bad idea. A good idea would be to bake in a special error code, so that instead of

    error CS1579: foreach statement cannot operate on variables of type 'Utf8String' because 'Utf8String' does not contain a public instance definition for 'GetEnumerator'

and

    error CS0021: Cannot apply indexing with [] to an expression of type 'Utf8String'

the compiler could produce something similar to

    Indexing into/Enumerating a UTF-8 encoded string is an ambiguous operation. Use 'AsBytes()', 'AsCodepoints()' or 'AsExtendedGraphemeClusters()' on this instance of 'Utf8String' to obtain the indexer/enumerator you need.


YairHalberstadt · 2019-09-27T07:27:00Z

YairHalberstadt
Sep 27, 2019
Collaborator

> A good idea would be to bake in a special error code

That could be done by implementing a throwing implementation of the enumerator/indexer and marking it as obsolete.


orthoxerox · 2019-09-27T07:41:46Z

orthoxerox
Sep 27, 2019

@YairHalberstadt Wouldn't that result in a warning instead of an error?


Joe4evr · 2019-09-27T09:01:08Z

Joe4evr
Sep 27, 2019

> @YairHalberstadt Wouldn't that result in a warning instead of an error?

Not if it's [Obsolete("foo", error: true)].
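A minimal sketch of that approach, assuming a hypothetical stand-in type (Utf8Text here is made up for the example and is not the proposed Utf8String): the indexer exists only to carry a helpful message, and error: true turns any use of it into a compile-time error (CS0619) rather than a warning.

```csharp
using System;

// Hypothetical stand-in for the proposed type; not the real Utf8String.
public sealed class Utf8Text
{
    private readonly byte[] _bytes;

    public Utf8Text(byte[] bytes) => _bytes = bytes;

    // Never meant to be called: the attribute makes every use a compile error
    // whose text points the caller at an unambiguous alternative.
    [Obsolete("Indexing into a UTF-8 encoded string is ambiguous. Use AsBytes() instead.", error: true)]
    public byte this[int index] => throw new NotSupportedException();

    // The explicit, unambiguous view over the underlying bytes.
    public ReadOnlySpan<byte> AsBytes() => _bytes;
}

class Program
{
    static void Main()
    {
        var text = new Utf8Text(new byte[] { 0x68, 0x69 });

        // byte b = text[0];   // would not compile: error CS0619 with the message above

        Console.WriteLine(text.AsBytes()[0]);
    }
}
```

The message is still delivered under the generic obsolete-member error code, so this is a workaround rather than the dedicated diagnostic suggested earlier.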


orthoxerox · 2019-09-27T09:45:57Z

orthoxerox
Sep 27, 2019

On an unrelated note, mixing declarations and variables in deconstruction is something that I would really love to see in the nearest release.


alrz · 2019-09-28T09:02:51Z

alrz
Sep 28, 2019

> Literals like u8"blah" and u16"blah" should help indicate the correct overload even if the LDT doesn't make one string type more equal than the other.

I agree, "str" would be utf neutral where it can convert to both variants. However, I think a postfix modifier is more appropriate if it doesn't affect parsing. plus we don't need to stack three literal modifiers in one place to have a utf8 verbatim interpolated string.


qrli · 2019-09-29T05:52:52Z

qrli
Sep 29, 2019

I would avoid special literals like u8"blah" and u16"blah", given the sad history of C++ string literals catching up with encoding changes over decades.
As long as such literals won't be used a lot (I hope so), they can easily be worked around with a cast:

    (Utf8String)"blah"

Which is not much longer. No new syntax either.

And this pattern can be applied to all other encodings in future, e.g. (Wtf8String)"blah", (Gb18030String)"blah", (UrlBase64String)"blah", (RegexString)"blah", etc.


alrz · 2019-10-03T09:44:05Z

alrz
Oct 3, 2019

Is there any experimentation to do utf8 work in runtime?

From a gitter discussion: this might result in fragmentation for quite some time, until perhaps everyone has moved to utf8 strings? And that's besides the fact that corefx has to support both, doubling the number of overloads.

I do agree with that concern as string is too fundamental to be fragmented in two types and this goes beyond mere adaptation - we'll be stuck with multiple variations of string forever.

Rust has the same problem with String and str which is a source of confusion for newcomers.


orthoxerox · 2019-10-03T10:27:16Z

orthoxerox
Oct 3, 2019

@alrz

Rust has it easy. &str is char*, String is std::string. I'd say Haskell is the language that suffers from fragmented string-processing ecosystem the most.


alrz · 2019-10-03T10:49:06Z

alrz
Oct 3, 2019

> Rust has it easy. &str is char*, String is std::string.

And that is justified by the different capabilities each type provides. My point is that as long as the only difference between string and Utf8String is the internal representation, there shouldn't be a separate type for each encoding - even if we have to have utf8 literals, it makes more sense to use the same string type.


YairHalberstadt · 2019-10-03T10:54:04Z

YairHalberstadt
Oct 3, 2019
Collaborator

A possibility might be that string could become a sort of DU over a normal string and a utf8string.

When it stores a utf8string, it can store the location of the last indexed char to make sequential indexing (the most common sort) O(1) rather than O(n).

I imagine the devil's in the details, and this will actually turn out to be impossible/ridiculously complex in practice.


svick · 2019-10-06T13:44:14Z

svick
Oct 6, 2019
Collaborator

@alrz

> My point is that as long as the only difference between string and Utf8String is the internal representation, there shouldn't be a separate type for each encoding - even if we have to have utf8 literals, it makes more sense to use the same string type.

Except that the different internal representation would be quite visible in terms of performance, and it would also be confusing, because the type would have to maintain compatibility with the current string.

Regarding performance:

    string s16 = "…";
    char c = s16[i]; // O(1)
    string s8 = u8"…";
    c = s8[i]; // O(n)

As for confusion, I'm talking about the fact that Length of a UTF-8 string would still have to return the number of UTF-16 code units.
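Both effects can be seen with today's APIs. The sketch below is only illustrative: ByteOffsetOfCodePoint is a made-up helper, it assumes well-formed UTF-8, and it counts code points, which matches UTF-16 code units only for text inside the Basic Multilingual Plane.

```csharp
using System;
using System.Text;

class Utf8IndexingDemo
{
    // Finds the byte offset of the code point at the given index by scanning
    // from the start. Continuation bytes have the bit pattern 10xxxxxx, so every
    // other byte starts a new code point. The scan is O(n) in the byte length.
    static int ByteOffsetOfCodePoint(byte[] utf8, int codePointIndex)
    {
        int seen = 0;
        for (int i = 0; i < utf8.Length; i++)
        {
            if ((utf8[i] & 0xC0) != 0x80)
            {
                if (seen == codePointIndex) return i;
                seen++;
            }
        }
        throw new ArgumentOutOfRangeException(nameof(codePointIndex));
    }

    static void Main()
    {
        string s16 = "h\u00E9llo";               // 'é' is one UTF-16 code unit
        byte[] s8 = Encoding.UTF8.GetBytes(s16);   // but two UTF-8 bytes

        Console.WriteLine(s16.Length);                    // 5 UTF-16 code units
        Console.WriteLine(s8.Length);                     // 6 UTF-8 bytes
        Console.WriteLine(ByteOffsetOfCodePoint(s8, 2));  // 3: the first 'l' starts at byte 3
    }
}
```

A string type backed by UTF-8 that kept the current indexer and Length contracts would need a scan like this (or a cache) behind every s8[i], while Length would report 5 even though 6 bytes are stored.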


alrz · 2019-10-27T08:57:23Z

alrz
Oct 27, 2019

@svick

If a codebase is using indexing extensively, with an opt-in approach you could just decide not to move to utf8. That could also make it possible to alter the behavior of Length, or at least to choose a sensible default and provide a way to get an encoding-aware value.

See #184 (comment)

