Skip to content

Conversation

@rruuaanng
Copy link
Contributor

@rruuaanng rruuaanng commented Sep 26, 2024

@rruuaanng
Copy link
Contributor Author

I think this feature may be merged as soon as possible, or before release, because this will cause many systems that use uuid1 to generate uuids that may not be unique.

@JacobCoffee
Copy link
Member

@rruuaanng the issue has been open for some time, and 3.13 is in rc2 and the final is due next week according to PEP 719 so I wouldn't expect this to be expedited

JacobCoffee

This comment was marked as resolved.

@rruuaanng
Copy link
Contributor Author

@rruuaanng the issue has been open for some time, and 3.13 is in rc2 and the final is due next week according to PEP 719 so I wouldn't expect this to be expedited

Okay, Maybe it doesn't need to be handled immediately, but I think this PR is worth merging to resolve this lingering issue.

@rruuaanng
Copy link
Contributor Author

would we want an entry for .. versionchanged in Doc/library/uuid.rst? Some might rely on past behavior maybe and look to the docs to see changes

Regarding the documentation, perhaps no changes are needed since users won't notice that the UUID has changed.

@picnixz
Copy link
Member

picnixz commented Sep 26, 2024

Note that RFC 4122/9562 does NOT require the clock sequence to be incremented (emphasis mine):

If the previous value of the clock sequence is known, it MAY be incremented; otherwise it SHOULD be set to a random or high-quality pseudorandom value.

RFC 4122 originally said something similar:

If the previous value of the clock sequence is known, it can just be incremented; otherwise it should be set to a random or high-quality pseudo-random value.

The Python implementation decided not to increment the clock sequence, probably to avoid a stable storage (as specified by the comment) and to keep a bit of unpredictability. As such, I'm not sure that this change should be done (this specific behaviour has been there for years).


If you want to change this behaviour, tests must be added and this would only be a feature rather than a bug fix IMO.

@rruuaanng
Copy link
Contributor Author

want

Oh, it seems I missed updating the current clock sequence with the result of the previous increment.

@rruuaanng
Copy link
Contributor Author

And I don't think we need to add tests for this because there are already tests for uuid1. My changes just make it more random rather than changing the result format entirely.

@picnixz
Copy link
Member

picnixz commented Sep 26, 2024

We still need tests. A change of implementation needs to be tested. I'm also worried about the fact that the counter is never reset. The counter should be reset at some point when you change the timestamp.

(And I'm sorry but nothing is more random. It's actually more predictable since the clock sequence is incremented deterministically when possible).

@rruuaanng
Copy link
Contributor Author

We still need tests. A change of implementation needs to be tested. I'm also worried about the fact that the counter is never reset. The counter should be reset at some point when you change the timestamp.

(And I'm sorry but nothing is more random. It's actually more predictable since the clock sequence is incremented deterministically when possible).

Hmmm, Maybe you're right, I'm thinking.

@rruuaanng
Copy link
Contributor Author

rruuaanng commented Sep 26, 2024

Note that RFC 4122/9562 does NOT require the clock sequence to be incremented (emphasis mine):

If the previous value of the clock sequence is known, it MAY be incremented; otherwise it SHOULD be set to a random or high-quality pseudorandom value.

RFC 4122 originally said something similar:

If the previous value of the clock sequence is known, it can just be incremented; otherwise it should be set to a random or high-quality pseudo-random value.

The Python implementation decided not to increment the clock sequence, probably to avoid a stable storage (as specified by the comment) and to keep a bit of unpredictability. As such, I'm not sure that this change should be done (this specific behaviour has been there for years).

If you want to change this behaviour, tests must be added and this would only be a feature rather than a bug fix IMO.

But without making changes, it simply increments the timestamp, which makes no sense.

timestamp = _last_timestamp + 1

@rruuaanng
Copy link
Contributor Author

@picnixz I have make change, please review!

@rruuaanng
Copy link
Contributor Author

rruuaanng commented Sep 26, 2024

However, I remain skeptical about testing this, since there seems to be a check for uuid1 already. And my changes pass the existing test. The existing test includes a check on the duplication of large quantities of UUID.

@picnixz
Copy link
Member

picnixz commented Sep 26, 2024

But without making changes, it simply increments the timestamp, which makes no sense.

Without this change, there is a random clock sequence if you don't specify one so it's not an issue. The implementation IS compliant, it's just that it does not use all the suggestions from the RFC (some lines below, there is a if clock_seq is None: clock_seq = random.getrandbits(14) so it is ok).

However, I remain skeptical about testing this, since there seems to be a check for uuid1 already

Yes but this doesn't check anything new. You MUST check that the counter is correctly incremented. In addition, it must be initialized to a RANDOM value (this is what the RFC states):

The clock sequence MUST be originally (i.e., once in the lifetime of a system) initialized to a random number to minimize the correlation across systems.

I will add a DO-NOT-MERGE label until all the interrogations have been answered. Note that regenerating the clock sequence for every non-specified clock sequence is fine and does not require all this. So I'm not convinced at all.

@rruuaanng
Copy link
Contributor Author

rruuaanng commented Sep 26, 2024

But without making changes, it simply increments the timestamp, which makes no sense.

Without this change, there is a random clock sequence if you don't specify one so it's not an issue. The implementation IS compliant, it's just that it does not use all the suggestions from the RFC (some lines below, there is a if clock_seq is None: clock_seq = random.getrandbits(14) so it is ok).

However, I remain skeptical about testing this, since there seems to be a check for uuid1 already

Yes but this doesn't check anything new. You MUST check that the counter is correctly incremented. In addition, it must be initialized to a RANDOM value (this is what the RFC states):

The clock sequence MUST be originally (i.e., once in the lifetime of a system) initialized to a random number to minimize the correlation across systems.

I will add a DO-NOT-MERGE label until all the interrogations have been answered. Note that regenerating the clock sequence for every non-specified clock sequence is fine and does not require all this. So I'm not convinced at all.

For the suggestions mentioned above, I will make the following improvements:

  1. Add a check to increment the clock sequence.
  2. Check if the clock sequence is a random number (but I can't prove if the clock sequence is random, you know why).

@picnixz
Copy link
Member

picnixz commented Sep 26, 2024

Check if the clock sequence is a random number (but I can't prove if the clock sequence is random, you know why).

There is a simple way to do it, namely mocking the interface. But honestly, I think all this change is really not needed and I am really not convinced by that. There is no issue with the current generation of the clock sequence. Even when you use the same timestamp, you will have different clock sequences so you will not have uniqueness issues (well, the entropy is only 14 bits so you can have collisions, but with a low probablity).

Finally, UUIDv1 are historic and UUIDv6 should be used as a replacement now that RFC 9562 has been accepted. I do have a PR that implements v6 and I think v1 should not be used anymore nowawadays. If people are concerned about the uniqueness, then they should generate their own clock sequence (hence the reason why we allow them to pass the clock sequence explicitly).

@rruuaanng
Copy link
Contributor Author

Check if the clock sequence is a random number (but I can't prove if the clock sequence is random, you know why).

There is a simple way to do it, namely mocking the interface. But honestly, I think all this change is really not needed and I am really not convinced by that. There is no issue with the current generation of the clock sequence. Even when you use the same timestamp, you will have different clock sequences so you will not have uniqueness issues (well, the entropy is only 14 bits so you can have collisions, but with a low probablity).

Finally, UUIDv1 are historic and UUIDv6 should be used as a replacement now that RFC 9562 has been accepted. I do have a PR that implements v6 and I think v1 should not be used anymore nowawadays. If people are concerned about the uniqueness, then they should generate their own clock sequence (hence the reason why we allow them to pass the clock sequence explicitly).

I agree with your point of view. Perhaps I just want to put an end to this legacy issue, as Jacob mentioned, this problem has been open for a long time.

@rruuaanng
Copy link
Contributor Author

rruuaanng commented Sep 27, 2024

I almost forgot, I need to respond to your first point again.

Without this change, there is a random clock sequence if you don't specify one so it's not an issue. The implementation IS compliant, it's just that it does not use all the suggestions from the RFC (some lines below, there is a if clock_seq is None: clock_seq = random.getrandbits(14) so it is ok).

Regarding the above, As mentioned in the RFC.

If the previous value of the clock sequence is known, it can just be incremented; otherwise it should be set to a random or high-quality pseudo-random value.

It's not that a new one is generated when the current clock sequence is empty, but rather that the current clock sequence being passed should be modified when the previous one does not exist.

Regarding the tests, I think you are right. I am working on writing the relevant tests. Maybe you can help me remove the tags once they are completed.

@rruuaanng rruuaanng requested a review from picnixz September 27, 2024 03:04
@rruuaanng
Copy link
Contributor Author

rruuaanng commented Sep 27, 2024

Although I’m reluctant, it seems that the existing test_uuid1 does not check timestamp = _last_timestamp + 1 whether the original timestamp is incrementing (which I found to be difficult during implementation).

So I think it seems that there is no need to add corresponding tests, because the original one does not implement this approach.
And when the rollback is triggered, if the auto-increment does not run, it may be a problem with the interpreter.

@picnixz
Copy link
Member

picnixz commented Sep 27, 2024

Although I’m reluctant, it seems that the existing test_uuid1 does not check timestamp = _last_timestamp + 1 whether the original timestamp is incrementing (which I found to be difficult during implementation).

Then this is an issue. The coverage is therefore insufficient. You can complete this test if you want but checking the clock sequence increment must be done in a test.

So I think it seems that there is no need to add corresponding tests, because the original one does not implement this approach.

Not having tests does not mean that new behaviour should not be tested.

It's not that a new one is generated when the current clock sequence is empty, but rather that the current clock sequence being passed should be modified when the previous one does not exist.

The RFC suggests to either increment the value or to always randomly generate a new clock sequence. The current implementation chooses this second alternative and thus is compliant with the standards.


Perhaps I just want to put an end to this legacy issue, as Jacob mentioned, this problem has been open for a long time.

I must disagree (again). The only issue I see with the current implementation is that the probabilty of having a collision within the same timestamp on the clock sequence is not that low (50% of chance of having a collision if you generate 128 UUIDs with the same timestamp).

The implementation follows one recommendation, maybe not what the original author expected, yet it is a correct implementation. One could always reduce the possibility of a UUIDv1 collision when generating many UUIDs within the same timestamp but once the clock sequence reaches $2^{14}$, you will have an overflow on the clock sequence and you will get the an already-used clock sequence, independently of the timestamp value.

Example:

  • last_clock_seq = 0
  • Timestamp = 0
  • Generate $N_1$ UUIDs: your clock sequence will be last_clock_seq, ..., last_clock_seq + N_1.
  • Timestamp = 1
  • Generate $N_2$ UUIDs: your clock sequence will be last_clock_seq + N_1 + 1, ..., last_clock_seq + N_1 + N_2.

Since only the 14 lowest bits of the clock sequence are kept when generating the UUID, once $N_1 + N_2 > 2^{14}$, you will have the same clock sequences for UUIDs with different timestamps but they will be entirely predictable. You do NOT want predictability in general.
Similarly, if $N_1 > 2^{14}$, you'll also generate UUIDs with the same timestamp and with the same clock sequence. This is the reason why you should re-randomize last_clock_seq if the timestamp changed.

So, the only reason why we would (but not necessarily should) change the current implementation is to avoid such issue. But UUIDv1 is historical and honestly no more recommended for production (UUIDv6 being the one being promoted (see #120650) or v4 if you want pure randomness). As such, I don't see the point of improving such legacy component (and for now, no one ever complained on this specific matter).

Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really want tests. In addition, you need to take care of the existing code L694-696 which explicitly says that we do not use a stable storage.

# if there is a previous clock sequence, then increment
# otherwise, the clock sequence will be regenerated
if _last_clock_seq is not None:
clock_seq = _last_clock_seq + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should not replace this one if clock_seq has been given by the user.

Copy link
Contributor Author

@rruuaanng rruuaanng Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you're right about this. If the clock_seq goes past the max value for a long, it'll reset to 0. This means there's a chance it could clash with a UUID that was made when it was 0 (though for the clock to roll back, it would have to go back to the exact same time as that initial 0 moment).

@rruuaanng
Copy link
Contributor Author

rruuaanng commented Sep 27, 2024

I will close this PR. Thank @picnixz for your attention to this incident during this period.

@rruuaanng rruuaanng closed this Sep 27, 2024
@rruuaanng rruuaanng reopened this Sep 27, 2024
@rruuaanng rruuaanng closed this Sep 27, 2024
@rruuaanng rruuaanng deleted the gh88786 branch September 27, 2024 09:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants