[Proposal] String deinterpolation ‒ interpolated string pattern #7580
-
I would call this outerpolation.
-
If my understanding is correct, you always guarantee a match out of all the possible ones just by using a wildcard match between the string literals. And because of that, you would advance further into the pattern the instant you encounter the character(s) of interest, as they match exactly in between the captured variables. Your optimizations can then include exiting early when the input is shorter than the statically known minimum length of the pattern.
-
I like the idea, but I'd say that something like string patterns or regex patterns is a better name for it, and I think it's nicer to have the syntax of pattern matching, so here are a few examples based on the OP:
-
Having read through this, I really feel like the "right" implementation would be for the language to have first-class support for regexes. Specifically, when you do a named capture in the regex, the value for the named capture should be promoted to a synthetic member of the Match object (much like how tuple names override Item# members). This would simplify matching and eliminate the hassle of magic strings with named captures (today you have to use the named-capture string both in the regex and when retrieving the value, which is clunky and error-prone).
-
Summary
This is an alternative approach to #7576, allowing enhanced string matching and parsing by using an interpolated string as a pattern. This feature provides the ability to split a string, extract values, and store them in new variables, similarly to other patterns, by reusing the existing and well-known interpolated string syntax.
Motivation
String parsing is a common operation in many C# programs. It often includes checking whether a string starts or ends with a specific substring, locating other delimiters, splitting the string, and storing the intermediate values in variables. This process can be time-consuming, error-prone, and can make the code more complex. This feature would bring additional symmetry to the language, often allowing one to naturally reverse formatting (string interpolation) into parsing (string deinterpolation), making the code more readable and efficient. Existing solutions often require one to consider other factors, such as when using regular expressions (normal, compiled, cached, culture-invariant, code-generated), leading one to revert to simpler checks using `StartsWith` etc., often missing out on performance due to not using `StringComparison.Ordinal`.

Design
The simplest use of string deinterpolation would be as follows:
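A minimal sketch of what this could look like (the concrete spelling is assumed from the description below):

```csharp
string str = "prefix-123-suffix";

// Matches when str starts with "prefix" and ends with "suffix";
// the part in between is captured into extractedVariable.
if (str is $"prefix{var extractedVariable}suffix")
{
    Console.WriteLine(extractedVariable); // "-123-"
}
```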
This checks if `str` starts with `"prefix"` and ends with `"suffix"`, then creates a new variable `extractedVariable` and stores the substring between those parts in it. This syntax is consistent with existing string interpolation syntax, just used as a pattern with "substituted" variable declarations.

Similarly in a `case` label:
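A sketch of the same pattern used in a `case` label (the exact spelling is assumed to mirror the `is` form above):

```csharp
switch (str)
{
    case $"prefix{var middle}suffix":
        Console.WriteLine(middle);        // the part between "prefix" and "suffix"
        break;
    case $"prefix{_}":                    // a discard instead of a variable declaration
        Console.WriteLine("prefix only");
        break;
}
```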
A discard `_` could also be used instead of a variable declaration.

Semantics
A pattern like `$"prefix{var1}infix1{var2}infix2…{varN}suffix"` should:
- Check that the compared value starts with `"prefix"` (if given), as-if by `StartsWith("prefix", StringComparison.Ordinal)`.
- Check that the compared value ends with `"suffix"` (if given), as-if by `EndsWith("suffix", StringComparison.Ordinal)`.
- Check that each of `"infix1"`, `"infix2"` … occurs in the compared value sequentially (and without overlaps), as-if by `IndexOf("infixK", StringComparison.Ordinal)` between the corresponding positions.
- Ensure that the variables `var1`, `var2` … `varN` are initialized with the substrings occurring between the matched parts of the string.
  - If a variable is declared as `string` (or `var`?), it is assigned a value obtained as-if by calling `Substring` with the appropriate positions.
  - If it is declared as `ReadOnlySpan<char>`, it is assigned as-if by calling `AsSpan().Slice` accordingly.
  - Similarly for `ReadOnlyMemory<char>` or `StringSegment`.
It will also be necessary to specify which strategy of infix matching is chosen ‒ lazy or greedy ‒ to arrive at a particular combination. To illustrate, `"[a,b,c,d]" is $"[{var a},{var b}]"` could initialize `(a, b)` with either `("a", "b,c,d")` (lazy) or `("a,b,c", "d")` (greedy). There are other combinatorial options in general (and `("a,b", "c,d")` as a third option here), but most would expect one of the two results.

I believe the lazy option (using `IndexOf` as opposed to `LastIndexOf`) better suits common cases, as structured information is usually placed at the beginning of a string rather than at the end, and it is also consistent with things like macro expansion in other languages, despite not being the default when using regular expressions.

This design permits any literal part of the interpolated string to be empty, including individual infixes. Regardless of the infix matching strategy, however, `{var1}{var2}` should always lead to one of the variables being initialized with an empty string, so it is redundant and such code should produce a warning (this may however be legitimate if there are sub-format restrictions, as outlined in the extensions).

Implementation
The behaviour of this feature should not depend upon the concrete implementation, but there are a few possibilities:
- The simplest patterns can be lowered to plain `StartsWith`, `EndsWith`, and `Substring` calls.
- Adding `IndexOf` (for the lazy approach) to the previous calls to split the string accordingly.
- `IndexOf` can be called sequentially for the infixes to determine the boundaries of the extracted substrings. For the greedy approach, `LastIndexOf` can be called starting from the last infix to determine the same thing, and as soon as one of those methods fails, it is known the pattern does not match.

A model implementation of this feature could use a regular expression for "simplicity", such as `@"^prefix(?<var1>.*?)infix1(?<var2>.*?)infix2…(?<varN>.*?)suffix$"`.

This feature as specified is conceived in the language, but it is also possible to start in the runtime, with methods like:
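The exact shape of such methods is left open; a sketch (reusing the `String.Unformat` name that appears later in this proposal, with everything else assumed) could be:

```csharp
using System;

// Hypothetical runtime helpers ‒ only the Unformat name is taken from this proposal;
// the signatures below are assumptions for illustration.
public static class StringUnformatExtensions
{
    // Matches 'input' against the literal parts of 'format' and extracts the pieces in between.
    public static bool Unformat(this string input, string format, out string var1, out string var2)
        => throw new NotImplementedException();

    // Overload that additionally parses one of the extracted pieces.
    public static bool Unformat<T>(this string input, string format, out string var1, out T parsed)
        where T : ISpanParsable<T>
        => throw new NotImplementedException();
}
```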
Or similarly to `FormattableString` or `CompositeFormat`.

The downside is that depending only on runtime support would lock the feature to particular runtimes, while compiler-generated code is compatible with older runtimes; therefore, the runtime implementation should be picked only if available for the target.
Possible optimizations
If the compiler parses the string on its own, it can perform some optimizations that would be cumbersome to write by hand. As pointed out by @Rekkonnect, the compiler can statically determine the minimum length necessary for the string to match, and exit early if the input is shorter than that. For example, with `$"prefix{var a}infix1{var b}infix2{var c}suffix"`:
- The generated code can ensure `str.Length >= 24` (the length of `"prefixinfix1infix2suffix"`).
- Once `"infix1"` is found and `a` is initialized, it can once again ensure that the part after all of it is at least 12 characters long (the length of `"infix2suffix"`). Technically, the suffix does not have to be a part of the length check since it was matched by `EndsWith` earlier, but omitting it does not save any instructions anyway.
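A hand-written equivalent of this lowering for the simpler pattern `$"prefix{var a}infix1{var b}suffix"` might look roughly like this (a sketch only; the exact generated shape is not part of this proposal):

```csharp
// Rough sketch of the checks described above, for: str is $"prefix{var a}infix1{var b}suffix"
static bool TryMatch(string str, out string a, out string b)
{
    a = b = string.Empty;

    // Minimum possible length: "prefix" + "infix1" + "suffix" = 18 characters.
    if (str.Length < 18)
        return false;

    if (!str.StartsWith("prefix", StringComparison.Ordinal) ||
        !str.EndsWith("suffix", StringComparison.Ordinal))
        return false;

    int start = "prefix".Length;
    int end = str.Length - "suffix".Length;

    // Lazy strategy: take the first occurrence of the infix after the prefix.
    int infix = str.IndexOf("infix1", start, end - start, StringComparison.Ordinal);
    if (infix < 0)
        return false;

    a = str.Substring(start, infix - start);
    b = str.Substring(infix + "infix1".Length, end - infix - "infix1".Length);
    return true;
}
```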
Potential future extensions
Other string-like types
This pattern was described only for the case of `string`, but other string-like types might be used, possibly using duck-typing, since only a handful of methods are needed. If a type supports `StartsWith`, `EndsWith`, `IndexOf`/`LastIndexOf` (taking either `string` or `ReadOnlySpan<char>`) and `Substring`/`Slice`, it could potentially be used the same way. The extension to UTF-8 literals also works without significant changes.

In such a case, a `{var x}` substitution should declare `x` to be the same type returned by `Substring`/`Slice`, for example a `Span<char>` if the compared value is a `Span<char>`.
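For example, a hypothetical span form of the pattern (syntax assumed to mirror the string case) could capture slices without allocating:

```csharp
ReadOnlySpan<char> line = "key=value".AsSpan();

// 'value' would be a ReadOnlySpan<char> sliced from 'line', never a new string.
if (line is $"key={ReadOnlySpan<char> value}")
{
    Console.WriteLine(value.Length); // 5
}
```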
Non-string types and format specifiers
This proposal only specifies the extraction of substrings, but the syntax permits extending the range of allowed types to potentially any parsable type. If it is determined that a variable requires parsing (i.e. it is not the natural type of `Substring`/`Slice` nor a span that could be obtained from the compared value), the (duck-typed) `IParsable`/`ISpanParsable` implementation on the variable's type could be used to attempt to parse it to the target type (accepting a format specifier using the traditional `:` syntax).

In the basic form, extracting strings cannot run into dead ends, but now it is possible that a particular match would require backtracking if a substring fails `TryParse`: `"[a,b,0]" is $"[{var a},{int b}]"` ‒ the lazy approach first initializes `a` with `"a"`, but `"b,0"` cannot be parsed as `int`, so the next option for `a` has to be chosen.

This is the case where a potential runtime implementation would be more beneficial, since all this complexity could be hidden under a call to `String.Unformat(str, "[{0},{1}]", out string a, out int b)` or the like.
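A sketch of the backtracking loop such a lowering (or a hypothetical `String.Unformat`) could use for `"[a,b,0]" is $"[{var a},{int b}]"` ‒ the shape is an assumption; only the behaviour is taken from the text above:

```csharp
// Lazy matching with backtracking when the typed part fails to parse ‒ illustration only.
static bool Match(string str, out string a, out int b)
{
    a = string.Empty;
    b = 0;
    if (!str.StartsWith("[", StringComparison.Ordinal) ||
        !str.EndsWith("]", StringComparison.Ordinal))
        return false;

    int start = 1;                    // after "["
    int end = str.Length - 1;         // before "]"
    int comma = str.IndexOf(',', start);
    while (comma >= 0 && comma < end)
    {
        // Candidate split: a = part before ',', b parsed from the part after it.
        if (int.TryParse(str.AsSpan(comma + 1, end - comma - 1), out b))
        {
            a = str.Substring(start, comma - start);
            return true;
        }
        comma = str.IndexOf(',', comma + 1);   // backtrack: try the next occurrence of the infix
    }
    return false;
}
```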
Conditions and sub-patterns
Extending the previous extension, it may also be beneficial to be able to specify postconditions on the extracted variables using the natural pattern syntax, or using `when`:
- `"[abc]" is $"[{ { Length: > 3 } str}]"` ‒ fails since `str` has incorrect length.
- `"[a,b,c]" is $"[{ { Length: > 1 } a},{var b}]"` ‒ matches `(a, b)` to `("a,b", "c")`.
- `"[a,b,c]" is $"[{var a},{var b}]" && a.Length > 1` ‒ fails since the deinterpolation is deterministic and can produce only one result for the code that follows.
- `"[a,b,c]" is $"[{var a},{var b}]" when a.Length > 1` ‒ succeeds since the `when` check could be incorporated into each step to ensure that other combinations are attempted.
- `",0,0,0,0,3,0,6,0,1,0,0," is $"{_}{int a},{_}{int b},{_}{int c}," when a + b + c == 10` ‒ finds the first 3 numbers that sum to 10, because why not.

Custom regular expression handling
In case a potential `String.Unformat` method is utilized, it might also be useful to be able to change it to another method in some way. For example, assuming the code generation sketched below, one could add a special syntax that changes the interpretation of the pattern:
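Purely as a hypothetical illustration, reusing the `String.Unformat` and `Regex.Unformat` names from the surrounding text (the opt-in spelling is invented here):

```csharp
// Assumed code generation for:  str is $"[{var a},{var b}]"
if (String.Unformat(str, "[{0},{1}]", out string a, out string b))
{
    // use a and b
}

// A hypothetical marker ‒ spelling purely illustrative ‒ that changes the interpretation
// of the pattern, e.g.:  str is Regex.Unformat $"[{var c},{var d}]"
// could instead generate:
if (Regex.Unformat(str, @"^\[(?<c>.*?),(?<d>.*?)\]$", out string c, out string d))
{
    // use c and d
}
```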
This approach is beneficial because it can potentially use any sort of pattern syntax if there is runtime support, such as wildcards or other regex dialects. For the case of `Regex.Unformat`, some code generator could potentially rewrite it completely in the next step, but the compiler doesn't have to understand regex in order to make this possible.

There are other syntactic possibilities for this extension as well.
Going further with this, perhaps it would be possible to use something more in line with the existing custom interpolated string handlers, where the handler is an actual object with methods like `MatchLiteral` or `MatchFormatted` supplanting the earlier `StartsWith`, `IndexOf`, etc.
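A rough sketch of what such a handler-style matcher could look like, mirroring how `DefaultInterpolatedStringHandler` exposes `AppendLiteral`/`AppendFormatted`; everything beyond the `MatchLiteral`/`MatchFormatted` names is an assumption:

```csharp
// Illustration only ‒ a matcher object consuming the pattern piece by piece.
public ref struct UnformatHandler
{
    private ReadOnlySpan<char> _remaining;

    public UnformatHandler(ReadOnlySpan<char> input) => _remaining = input;

    // Consume a literal part (the counterpart of StartsWith/IndexOf in the basic lowering).
    public bool MatchLiteral(string literal)
    {
        if (!_remaining.StartsWith(literal, StringComparison.Ordinal))
            return false;
        _remaining = _remaining.Slice(literal.Length);
        return true;
    }

    // Consume a formatted hole up to the next occurrence of the following literal (lazy).
    public bool MatchFormatted(out ReadOnlySpan<char> value, string nextLiteral)
    {
        int index = _remaining.IndexOf(nextLiteral, StringComparison.Ordinal);
        if (index < 0) { value = default; return false; }
        value = _remaining.Slice(0, index);
        _remaining = _remaining.Slice(index);
        return true;
    }
}
```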
Alternatives
Basically whatever the implementation could be lowered to anyway ‒ checking the individual parts by hand, using a regular expression, or #7576.
Conclusion
The proposed feature would enhance the language's capabilities, making it more powerful and useful for developers. Even though it adds new syntax to the language, it will make code more readable, concise, and potentially more efficient.