Skip to content

Conversation

@sabiwara
Copy link
Contributor

@sabiwara sabiwara commented May 4, 2025

Closes #14467

iex> Regex.split(~r//, "abc")
["", "a", "b", "c", ""]
["a", "b", "c", ""]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned here, this might be a breaking change, but is consistent with :re.split/2

@josevalim
Copy link
Member

Perhaps we can add a clause where we only check the length if offset != 0?

@sabiwara
Copy link
Contributor Author

sabiwara commented May 4, 2025

@josevalim do you mean e212e3f ?

Also, I added a test for c22e459, but I'm not sure what's the correct expectation here? (with the fix, we remove one "")

assert Regex.split(~r/[Ei]/, "Elixir", include_captures: true, parts: 3, trim: true) ==
["E", "l", "i", "xir"]

assert Regex.split(~r/b\K/, "abab", include_captures: true) == ["ab", "", "ab", "", "", ""]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this expected? There are no captures, is there?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, nevermind, I got it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure TBH 😅
Do you have a good idea of a good test for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is a good test, because it tests the very specific behaviour of \K which forgets the previous match as part of the capture. Another good test would be with b\Kc matching on "abcabc" that shows only c is included. Although, comparatively, it seems the test above is included one extra "":

iex(1)> Regex.split(~r/b\Kc/, "abcabc", include_captures: true)
["ab", "c", "ab", "c", ""]

So maybe there is more work to be done?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other words, I would say:

Yes, it is a good test, because it tests the very specific behaviour of \K which forgets the previous match as part of the capture. Another good test would be with b\Kc matching on "abcabc" that shows only c is included. Although, comparatively, it seems the test above is included one extra "":

Regex.split(~r/b\Kc/, "abcabc", include_captures: true)
Regex.split(~r/b\K/, "abab", include_captures: true)

Should behave the same, except in one the capture is "" and the other it is "c".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed answer, this is really helpful 💜

@sabiwara
Copy link
Contributor Author

sabiwara commented May 4, 2025

@josevalim OK to backport after merge?

@josevalim
Copy link
Member

@sabiwara I would not backport it because it is not an urgent fix (it has been here for several years).

string

if keep == 0 and trim do
if keep == 0 and (trim or (offset != 0 and length == 0)) do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may need this to be a cond:

    cond do
      keep == 0 and offset != 0 and length == 0 ->
        do_split([h | t], string, new_offset, counter - 1, trim, true)

      keep == 0 and trim ->
        [match | do_split([h | t], string, new_offset, counter - 1, trim, true)]

      true ->
        [part, match | do_split([h | t], string, new_offset, counter - 1, trim, true)]
    end

I haven't tested it but it may fix my additional reports below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sabiwara sabiwara changed the title Fix Regex.split/2 to be consistent with :re.split/2 Fix Regex.split/2 edge case with empty chunks May 4, 2025
@sabiwara sabiwara merged commit 5c928c9 into elixir-lang:main May 4, 2025
11 checks passed
@sabiwara sabiwara deleted the re-split branch May 4, 2025 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

Regex.split with trailing \K works incorrectly

2 participants