[Proposal] String deinterpolation ‒ interpolated string pattern #7580
-
I would call this outerpolation.
-
If my understanding is correct, you always guarantee a match out of all the possible ones just by using a wildcard match between the string literals. And because of that, you would advance further into the pattern the instant you encounter the character(s) of interest, as they match exactly in between the captured variables. Your optimizations can then include exiting early when the input is shorter than the statically known minimum length of the pattern.
-
I like the idea, but I'd say that something like string patterns or regex patterns is a better name for it, and I think it's nicer to have the syntax of pattern matching, so here are a few examples based on the OP:
-
Having read through this, I really feel like the "right" implementation would be for the language to have first-class support for regexes. Specifically, when you do a named capture in the regex, the value for the named capture should be promoted to a synthetic member of the Match object (much like how tuple names override Item# members). This would simplify matching and eliminate the hassle of magic strings with named captures (today you have to use the named-capture string both in the regex and when retrieving the value, which is clunky and error-prone).
-
Summary
This is an alternative approach to #7576, allowing enhanced string matching and parsing by using an interpolated string as a pattern. This feature provides the ability to split a string, extract values, and store them in new variables, similarly to other patterns, by reusing the existing and well-known interpolated string syntax.
Motivation
String parsing is a common operation in many C# programs. It often includes checking whether a string starts or ends with a specific substring, locating other delimiters, splitting the string, and storing the intermediate values in variables. This process can be time-consuming, error-prone, and can make the code more complex. This feature would bring additional symmetry to the language, often allowing one to naturally reverse formatting (string interpolation) into parsing (string deinterpolation), making the code more readable and efficient. Existing solutions often require one to consider other factors, such as when using regular expressions (normal, compiled, cached, culture-invariant, code-generated), leading one to revert to simpler checks using `StartsWith` etc., often missing out on performance due to not using `StringComparison.Ordinal`.

Design
The simplest use of string deinterpolation would be as follows:
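A minimal sketch of what this could look like (the concrete spelling is assumed from the description below):

```csharp
string str = "prefix-123-suffix";

// Matches when str starts with "prefix" and ends with "suffix";
// the part in between is captured into extractedVariable.
if (str is $"prefix{var extractedVariable}suffix")
{
    Console.WriteLine(extractedVariable); // "-123-"
}
```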
This checks if `str` starts with `"prefix"` and ends with `"suffix"`, then creates a new variable `extractedVariable` and stores the substring between those parts in it. This syntax is consistent with existing string interpolation syntax, just used as a pattern with "substituted" variable declarations.

Similarly in a `case` label:
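A sketch of the same pattern used in a `case` label (the exact spelling is assumed to mirror the `is` form above):

```csharp
switch (str)
{
    case $"prefix{var middle}suffix":
        Console.WriteLine(middle);        // the part between "prefix" and "suffix"
        break;
    case $"prefix{_}":                    // a discard instead of a variable declaration
        Console.WriteLine("prefix only");
        break;
}
```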
A discard `_` could also be used instead of a variable declaration.

Semantics
A pattern like `$"prefix{var1}infix1{var2}infix2…{varN}suffix"` should:
- Check that the compared value starts with `"prefix"` (if given), as-if by `StartsWith("prefix", StringComparison.Ordinal)`.
- Check that the compared value ends with `"suffix"` (if given), as-if by `EndsWith("suffix", StringComparison.Ordinal)`.
- Check that each of `"infix1"`, `"infix2"` … occurs in the compared value sequentially (and without overlaps), as-if by `IndexOf("infixK", StringComparison.Ordinal)` between the corresponding positions.
- Ensure that the variables `var1`, `var2` … `varN` are initialized with the substrings occurring between the matched parts of the string.
  - If a variable is declared as `string` (or `var`?), it is assigned a value obtained as-if by calling `Substring` with the appropriate positions.
  - If it is declared as `ReadOnlySpan<char>`, it is assigned as-if by calling `AsSpan().Slice` accordingly.
  - Similarly for `ReadOnlyMemory<char>` or `StringSegment`.
It will also be necessary to specify which strategy of infix matching is chosen ‒ lazy or greedy ‒ to arrive at a particular combination. To illustrate, `"[a,b,c,d]" is $"[{var a},{var b}]"` could initialize `(a, b)` with either `("a", "b,c,d")` (lazy) or `("a,b,c", "d")` (greedy). There are other combinatorial options in general (and `("a,b", "c,d")` as a third option here), but most would expect one of the two results.

I believe the lazy option (using `IndexOf` as opposed to `LastIndexOf`) better suits common cases, as structured information is usually placed at the beginning of a string rather than at the end, and it is also consistent with things like macro expansion in other languages, despite not being the default when using regular expressions.

This design permits any literal part of the interpolated string to be empty, including individual infixes. Regardless of the infix matching strategy, however, `{var1}{var2}` should always lead to one of the variables being initialized with an empty string, so it is redundant and such code should produce a warning (this may however be legitimate if there are sub-format restrictions, as outlined in the extensions).

Implementation
The behaviour of this feature should not depend upon the concrete implementation, but there are a few possibilities:
- The simplest patterns can be lowered to plain `StartsWith`, `EndsWith`, and `Substring` calls.
- Adding `IndexOf` (for the lazy approach) to the previous calls to split the string accordingly.
- `IndexOf` can be called sequentially for the infixes to determine the boundaries of the extracted substrings. For the greedy approach, `LastIndexOf` can be called starting from the last infix to determine the same thing, and as soon as one of those methods fails, it is known the pattern does not match.

A model implementation of this feature could use a regular expression for "simplicity", such as `@"^prefix(?<var1>.*?)infix1(?<var2>.*?)infix2…(?<varN>.*?)suffix$"`.

This feature as specified is conceived in the language, but it is also possible to start in the runtime, with methods like:
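The exact shape of such methods is left open; a sketch (reusing the `String.Unformat` name that appears later in this proposal, with everything else assumed) could be:

```csharp
using System;

// Hypothetical runtime helpers ‒ only the Unformat name is taken from this proposal;
// the signatures below are assumptions for illustration.
public static class StringUnformatExtensions
{
    // Matches 'input' against the literal parts of 'format' and extracts the pieces in between.
    public static bool Unformat(this string input, string format, out string var1, out string var2)
        => throw new NotImplementedException();

    // Overload that additionally parses one of the extracted pieces.
    public static bool Unformat<T>(this string input, string format, out string var1, out T parsed)
        where T : ISpanParsable<T>
        => throw new NotImplementedException();
}
```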
Or similarly to `FormattableString` or `CompositeFormat`.

The downside is that depending only on runtime support would lock the feature to particular runtimes, while compiler-generated code is compatible with older runtimes; therefore, the runtime implementation should be picked only if available for the target.
Possible optimizations
If the compiler parses the string on its own, it can perform some optimizations that would be cumbersome to write by hand. As pointed out by @Rekkonnect, the compiler can statically determine the minimum length necessary for the string to match, and exit early if the input is shorter than that. For example, with `$"prefix{var a}infix1{var b}infix2{var c}suffix"`:
- The generated code can ensure `str.Length >= 24` (the length of `"prefixinfix1infix2suffix"`).
- Once `"infix1"` is found and `a` is initialized, it can once again ensure that the part after all of it is at least 12 characters long (the length of `"infix2suffix"`). Technically, the suffix does not have to be a part of the length check since it was matched by `EndsWith` earlier, but omitting it does not save any instructions anyway.
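A hand-written equivalent of this lowering for the simpler pattern `$"prefix{var a}infix1{var b}suffix"` might look roughly like this (a sketch only; the exact generated shape is not part of this proposal):

```csharp
// Rough sketch of the checks described above, for: str is $"prefix{var a}infix1{var b}suffix"
static bool TryMatch(string str, out string a, out string b)
{
    a = b = string.Empty;

    // Minimum possible length: "prefix" + "infix1" + "suffix" = 18 characters.
    if (str.Length < 18)
        return false;

    if (!str.StartsWith("prefix", StringComparison.Ordinal) ||
        !str.EndsWith("suffix", StringComparison.Ordinal))
        return false;

    int start = "prefix".Length;
    int end = str.Length - "suffix".Length;

    // Lazy strategy: take the first occurrence of the infix after the prefix.
    int infix = str.IndexOf("infix1", start, end - start, StringComparison.Ordinal);
    if (infix < 0)
        return false;

    a = str.Substring(start, infix - start);
    b = str.Substring(infix + "infix1".Length, end - infix - "infix1".Length);
    return true;
}
```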
Potential future extensions
Other string-like types
This pattern was described only for the case of `string`, but other string-like types might be used, possibly using duck-typing, since only a handful of methods are needed. If a type supports `StartsWith`, `EndsWith`, `IndexOf`/`LastIndexOf` (taking either `string` or `ReadOnlySpan<char>`) and `Substring`/`Slice`, it could potentially be used the same way. The extension to UTF-8 literals also works without significant changes.

In such a case, a `{var x}` substitution should declare `x` to be the same type returned by `Substring`/`Slice`, for example a `Span<char>` if the compared value is a `Span<char>`.
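For example, a hypothetical span form of the pattern (syntax assumed to mirror the string case) could capture slices without allocating:

```csharp
ReadOnlySpan<char> line = "key=value".AsSpan();

// 'value' would be a ReadOnlySpan<char> sliced from 'line', never a new string.
if (line is $"key={ReadOnlySpan<char> value}")
{
    Console.WriteLine(value.Length); // 5
}
```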
Non-string types and format specifiers
This proposal only specifies the extraction of substrings, but the syntax permits extending the range of allowed types to potentially any parsable type. If it is determined that a variable requires parsing (i.e. it is not the natural type of `Substring`/`Slice` nor a span that could be obtained from the compared value), the (duck-typed) `IParsable`/`ISpanParsable` implementation on the variable's type could be used to attempt to parse it to the target type (accepting a format specifier using the traditional `:` syntax).

In the basic form, extracting strings cannot run into dead ends, but now it is possible that a particular match would require backtracking if a substring fails `TryParse`: `"[a,b,0]" is $"[{var a},{int b}]"` ‒ the lazy approach first initializes `a` with `"a"`, but `"b,0"` cannot be parsed as `int`, so the next option for `a` has to be chosen.

This is the case where a potential runtime implementation would be more beneficial, since all this complexity could be hidden under a call to `String.Unformat(str, "[{0},{1}]", out string a, out int b)` or the like.
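A sketch of the backtracking loop such a lowering (or a hypothetical `String.Unformat`) could use for `"[a,b,0]" is $"[{var a},{int b}]"` ‒ the shape is an assumption; only the behaviour is taken from the text above:

```csharp
// Lazy matching with backtracking when the typed part fails to parse ‒ illustration only.
static bool Match(string str, out string a, out int b)
{
    a = string.Empty;
    b = 0;
    if (!str.StartsWith("[", StringComparison.Ordinal) ||
        !str.EndsWith("]", StringComparison.Ordinal))
        return false;

    int start = 1;                    // after "["
    int end = str.Length - 1;         // before "]"
    int comma = str.IndexOf(',', start);
    while (comma >= 0 && comma < end)
    {
        // Candidate split: a = part before ',', b parsed from the part after it.
        if (int.TryParse(str.AsSpan(comma + 1, end - comma - 1), out b))
        {
            a = str.Substring(start, comma - start);
            return true;
        }
        comma = str.IndexOf(',', comma + 1);   // backtrack: try the next occurrence of the infix
    }
    return false;
}
```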
Conditions and sub-patterns
Extending the previous extension, it may also be beneficial to be able to specify postconditions on the extracted variables using the natural pattern syntax, or using `when`:
- `"[abc]" is $"[{ { Length: > 3 } str}]"` ‒ fails since `str` has incorrect length.
- `"[a,b,c]" is $"[{ { Length: > 1 } a},{var b}]"` ‒ matches `(a, b)` to `("a,b", "c")`.
- `"[a,b,c]" is $"[{var a},{var b}]" && a.Length > 1` ‒ fails since the deinterpolation is deterministic and can produce only one result for the code that follows.
- `"[a,b,c]" is $"[{var a},{var b}]" when a.Length > 1` ‒ succeeds since the `when` check could be incorporated into each step to ensure that other combinations are attempted.
- `",0,0,0,0,3,0,6,0,1,0,0," is $"{_}{int a},{_}{int b},{_}{int c}," when a + b + c == 10` ‒ finds the first 3 numbers that sum to 10, because why not.

Custom regular expression handling
In case a potential `String.Unformat` method is utilized, it might also be useful to be able to change it to another method in some way. For example, assuming the code generation sketched below, one could add a special syntax that changes the interpretation of the pattern:
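Purely as a hypothetical illustration, reusing the `String.Unformat` and `Regex.Unformat` names from the surrounding text (the opt-in spelling is invented here):

```csharp
// Assumed code generation for:  str is $"[{var a},{var b}]"
if (String.Unformat(str, "[{0},{1}]", out string a, out string b))
{
    // use a and b
}

// A hypothetical marker ‒ spelling purely illustrative ‒ that changes the interpretation
// of the pattern, e.g.:  str is Regex.Unformat $"[{var c},{var d}]"
// could instead generate:
if (Regex.Unformat(str, @"^\[(?<c>.*?),(?<d>.*?)\]$", out string c, out string d))
{
    // use c and d
}
```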
This approach is beneficial because it can potentially use any sort of pattern syntax if there is runtime support, such as wildcards or other regex dialects. For the case of `Regex.Unformat`, some code generator could potentially rewrite it completely in the next step, but the compiler doesn't have to understand regex in order to make this possible.

There are other syntactic possibilities for this extension as well.
Going further with this, perhaps it would be possible to use something more in line with the existing custom interpolated string handlers, where the handler is an actual object with methods like `MatchLiteral` or `MatchFormatted` supplanting the earlier `StartsWith`, `IndexOf`, etc.
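A rough sketch of what such a handler-style matcher could look like, mirroring how `DefaultInterpolatedStringHandler` exposes `AppendLiteral`/`AppendFormatted`; everything beyond the `MatchLiteral`/`MatchFormatted` names is an assumption:

```csharp
// Illustration only ‒ a matcher object consuming the pattern piece by piece.
public ref struct UnformatHandler
{
    private ReadOnlySpan<char> _remaining;

    public UnformatHandler(ReadOnlySpan<char> input) => _remaining = input;

    // Consume a literal part (the counterpart of StartsWith/IndexOf in the basic lowering).
    public bool MatchLiteral(string literal)
    {
        if (!_remaining.StartsWith(literal, StringComparison.Ordinal))
            return false;
        _remaining = _remaining.Slice(literal.Length);
        return true;
    }

    // Consume a formatted hole up to the next occurrence of the following literal (lazy).
    public bool MatchFormatted(out ReadOnlySpan<char> value, string nextLiteral)
    {
        int index = _remaining.IndexOf(nextLiteral, StringComparison.Ordinal);
        if (index < 0) { value = default; return false; }
        value = _remaining.Slice(0, index);
        _remaining = _remaining.Slice(index);
        return true;
    }
}
```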
Alternatives
Basically whatever the implementation could be lowered to anyway ‒ checking the individual parts by hand, using a regular expression, or #7576.
Conclusion
The proposed feature would enhance the language's capabilities, making it more powerful and useful for developers. Even though it adds new syntax to the language, it will make code more readable, concise, and potentially more efficient.