I’ve gotten this question in two interviews. It’s likely popular because it has a simple solution that you can’t get right by producing a word salad of buzzwords. However, its ubiquity raises the risk of memorization.
Xu, vol. 1, ch 8 (shorten as much as possible)
In Xu’s version, the following requirements are introduced, in addition to the usual availability/performance requirements (which boil down to, “it needs to have nice performance performance characteristics”):
- If you give it the same long URL, you need to get the same short URL; and
- It must be as short as possible.
Solution 1: Hashing
His first solution is to hash the URL, then truncate it to a minimum hash length. The resulting partial hash code is more likely to have a hash collision than the original hash code. Therefore, we must perform hash collision resolution, whose worst-case performance is
Solution 2: Monotonic identifiers
His preferred solution is to generate a globally monotonic identifier, then encoding it into an efficient digit set. He chooses base-62 ([0-9,a-z,A-Z]), presumably because it is much shorter than the native base-10 but renderable in all locales. This is an elegant solution, though I see a couple of issues:
- If two people shorten the same URL, they will get different shortcodes. People expect their shortcodes to work forever, so may eventually be a problem. On the other hand, shortcode generators are often offered for free as a way of collecting tracking information for ads. In this case, unique IDs for the same URL are absolutely preferable!
- The bigger problem is that pretty much anyone interested in cryptography would hypothesize that this was the encoding, reverse-engineer it, and recover other people’s shortcodes.
Prioritizing security over optimality
If we relax the requirement that the code be as short as possible, we can create a secure, deterministic solution.
- If your database of existing shortcodes already has a code for this website, use that code.
- If not, generate a hash using a hashing function known for short hash codes. Check for and handle collisions. (Collisions would be vanishingly rare, but it’s possible. This will basically be a prefactor.)
- Encode your hash in base-62.
- Store the shortcode in the database.
This solution operates in a non-sequential identifier space, removing the security issue.
There’s also something called a collision-resistant compression function, which could potentially ensure both the shortest possible code and independence from one code to the next.