Skip to content

Conversation

@loverobean
Copy link

ue to the "birthday problem," the randomly generated link IDs have an unexpectedly high probability of collisions, exceeding 50% when the transactions per day exceed 255:

100 10%
255 50%
464 90%

These changes use a guaranteed unique random string generator that raises StopIteration when the number of available random strings is exhausted (default: 4097).

These changes also allow configuration of the link format, including changing the number of random characters and changing the date format.

@redstreet
Copy link
Owner

redstreet commented Feb 12, 2025

I like the overall idea, and thanks for the great analysis. I would like for a much simpler solution: to increase the default number of random chars so the chance of collision is acceptably low.

Your changes to configure the link format are great. In addition to that, a simple:

secrets.token_hex(RANDOM_LEN)

should suffice. Thoughts?

Originally posted by @redstreet in #34 (comment)

@loverobean
Copy link
Author

loverobean commented Feb 12, 2025

I would really prefer this solution:

  1. secrets.token_hex has no magic to avoid the birthday problem: even with 30 characters, there is still a nonzero possibility of a collision.
  2. Your README advertises the links as "human-readable": given guaranteed uniqueness, I'm considering using RANDOM_LEN: 2, to give me 256/day. I don't like the aesthetics of a long hex string.
  3. The number of values to be "Acceptably low" means different things to different users (a student with minimal transactions vs a heavy trader).
  4. Now that the date format can be adjusted, we cannot guarantee that we have different date prefixes (i.e., '' is a valid date format). A user who chooses a truncated or empty date format. would need many more characters than a user who chooses a full date format.
  5. It is better for the plugin to inform the user that the chosen RANDOM_LEN is too short (via StopIteration) rather than silently failing. Not all users understand the birthday problem, hexidecimal, etc.
  6. Now that it's written, the code to generate the unique sequences is relatively simple: rand_counter is only 9 lines. Almost all of the PR was test cases.

In short, there is no such thing as a collision-free hash. Providing a randomized sequence of strings provides good performance and efficiently uses the available strings. Guaranteeing uniqueness seems like a better practice than hoping for it.

@loverobean
Copy link
Author

I may be able to simplify my changes with some refactoring. I will leave this PR open to preserve the discussion, but please hold off on merging.

@loverobean
Copy link
Author

Closing; replacing with refactored #37

@loverobean loverobean closed this Feb 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants