-
Notifications
You must be signed in to change notification settings - Fork 3.5k
Fix Regex.split/2 edge case with empty chunks #14468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
lib/elixir/lib/regex.ex
Outdated
| iex> Regex.split(~r//, "abc") | ||
| ["", "a", "b", "c", ""] | ||
| ["a", "b", "c", ""] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned here, this might be a breaking change, but is consistent with :re.split/2
|
Perhaps we can add a clause where we only check the length if offset != 0? |
|
@josevalim do you mean e212e3f ? Also, I added a test for c22e459, but I'm not sure what's the correct expectation here? (with the fix, we remove one |
| assert Regex.split(~r/[Ei]/, "Elixir", include_captures: true, parts: 3, trim: true) == | ||
| ["E", "l", "i", "xir"] | ||
|
|
||
| assert Regex.split(~r/b\K/, "abab", include_captures: true) == ["ab", "", "ab", "", "", ""] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this expected? There are no captures, is there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, nevermind, I got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure TBH 😅
Do you have a good idea of a good test for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is a good test, because it tests the very specific behaviour of \K which forgets the previous match as part of the capture. Another good test would be with b\Kc matching on "abcabc" that shows only c is included. Although, comparatively, it seems the test above is included one extra "":
iex(1)> Regex.split(~r/b\Kc/, "abcabc", include_captures: true)
["ab", "c", "ab", "c", ""]
So maybe there is more work to be done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words, I would say:
Yes, it is a good test, because it tests the very specific behaviour of \K which forgets the previous match as part of the capture. Another good test would be with b\Kc matching on "abcabc" that shows only c is included. Although, comparatively, it seems the test above is included one extra "":
Regex.split(~r/b\Kc/, "abcabc", include_captures: true)
Regex.split(~r/b\K/, "abab", include_captures: true)
Should behave the same, except in one the capture is "" and the other it is "c".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed answer, this is really helpful 💜
|
@josevalim OK to backport after merge? |
|
@sabiwara I would not backport it because it is not an urgent fix (it has been here for several years). |
lib/elixir/lib/regex.ex
Outdated
| string | ||
|
|
||
| if keep == 0 and trim do | ||
| if keep == 0 and (trim or (offset != 0 and length == 0)) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may need this to be a cond:
cond do
keep == 0 and offset != 0 and length == 0 ->
do_split([h | t], string, new_offset, counter - 1, trim, true)
keep == 0 and trim ->
[match | do_split([h | t], string, new_offset, counter - 1, trim, true)]
true ->
[part, match | do_split([h | t], string, new_offset, counter - 1, trim, true)]
end
I haven't tested it but it may fix my additional reports below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Closes #14467