I’ve gotten this question in two interviews. It’s likely popular because it has a simple solution that you can’t get right by producing a word salad of buzzwords. However, its ubiquity raises the risk of memorization.

Xu, vol. 1, ch 8 (shorten as much as possible)

In Xu’s version, the following requirements are introduced, in addition to the usual availability/performance requirements (which boil down to, “it needs to have nice performance performance characteristics”):

  • If you give it the same long URL, you need to get the same short URL; and
  • It must be as short as possible.

Solution 1: Hashing

His first solution is to hash the URL, then truncate it to a minimum hash length. The resulting partial hash code is more likely to have a hash collision than the original hash code. Therefore, we must perform hash collision resolution, whose worst-case performance is in the number of previously stored codes. He notes that Bloom filters can improve on this time complexity. By that point, however, we’ve over-engineered.

Solution 2: Monotonic identifiers

His preferred solution is to generate a globally monotonic identifier, then encoding it into an efficient digit set. He chooses base-62 ([0-9,a-z,A-Z]), presumably because it is much shorter than the native base-10 but renderable in all locales. This is an elegant solution, though I see a couple of issues:

  1. If two people shorten the same URL, they will get different shortcodes. People expect their shortcodes to work forever, so may eventually be a problem. On the other hand, shortcode generators are often offered for free as a way of collecting tracking information for ads. In this case, unique IDs for the same URL are absolutely preferable!
  2. The bigger problem is that pretty much anyone interested in cryptography would hypothesize that this was the encoding, reverse-engineer it, and recover other people’s shortcodes.

Prioritizing security over optimality

If we relax the requirement that the code be as short as possible, we can create a secure, deterministic solution.

  1. If your database of existing shortcodes already has a code for this website, use that code.
  2. If not, generate a hash using a hashing function known for short hash codes. Check for and handle collisions. (Collisions would be vanishingly rare, but it’s possible. This will basically be a prefactor.)
  3. Encode your hash in base-62.
  4. Store the shortcode in the database.

This solution operates in a non-sequential identifier space, removing the security issue.

There’s also something called a collision-resistant compression function, which could potentially ensure both the shortest possible code and independence from one code to the next.