
Sunday, January 23, 2022

How to Do UUID as Primary Keys the Right Way

 Source: https://dzone.com/articles/uuid-as-primary-keys-how-to-do-it-right?edition=598292

UUID V4 and its COMB variant are great for mitigating various security, stability, and architectural issues, but be aware of several pitfalls when using them.


Why Do We Need Technical IDs in the First Place?

Any properly designed relational database table has a Primary Key (PK) that lets you uniquely and stably identify each record. Even if the primary key can be composite (built from several columns), it is a widespread good practice to dedicate a special column (often named id or id_<table name>) to this end. This special column is used to technically identify records and can be referenced by foreign keys in relations.

NOTE: Do not confuse technical (also named "surrogate") keys with functional keys. The most important tables (so-called entities in domain-driven design) may contain an alternate human-readable ID column (like the customer ID "G2F6D"). In this article, we focus only on technical PKs. They should be processed and read only by machines, not humans.

How to Choose Good IDs

Good IDs are:

  • Unique (no collisions). This is enforced by the UNIQUE constraint automatically added by the RDBMS on every PK.
  • Not reusable. Reusing a deleted row's PK is technically possible in relational databases but is a very bad idea because it generates confusion (for example, an older log file can reference an ID reused in the meantime by a new entity, thus leading to false deductions).
  • Meaningless. Developers should not parse IDs for the wrong reasons. Imagine an ID starting with the current year: some developers may ignore that a date_creation column exists and rely only on the PK's first four digits. If the ID format changes or is buggy (because of bad timezone handling, for instance), subtle issues may arise. Even if this is widely debated, I would warn against using natural keys as PKs altogether. It may limit your options in the future. For example, if you use the e-mail address as PK, you implicitly forbid changing it in future releases: never say "never." Another problem with natural keys is the difficulty of ensuring uniqueness, due to functional issues, even when everything has been done to avoid them. I once worked for a French governmental agency and observed both issues in different projects: 1) legacy code relied on the first digit of the NIR (social identity number) to determine the person's type, thus ignoring possible type reassignments (even though the current type was available as a dedicated column); 2) we recently discovered that this supposedly unique ID was not so unique (for example, an ID shared temporarily by several members of an immigrant family, or collisions following city mergers). The real world is just too complex to make any assumption about uniqueness.

Which ID Format to Choose?

The two formats that match these rules are, AFAIK:

  1. Auto-incremented integers (starting at 1 or any larger value: 1, 100, 154555). Most RDBMS, like PostgreSQL or Oracle, provide a SEQUENCE object that auto-increments a value while respecting the ACID principles. MySQL provides the AUTO_INCREMENT attribute.
  2. Using a text-based random UUID V4 (universally unique identifier), also referred to as a GUID (globally unique identifier) by Microsoft. Example: 9d17210c-2d5f-41ea-978f-2e728ce88125.

When working on existing projects, I often observe that PKs are designed as auto-incremented integers. Though this may seem an obvious, no-brainer choice, it may be a bad idea in the long run...

Let's consider both options. Each argument is provided along with an importance weight (from 1=minor to 5=major).

Why You Should Use Auto-Incremented Integers

  • [importance: 3] Known, understood, and simple solution. Leverages sequences on modern RDBMS (a sketch follows this list). Very little risk of performance issues due to bad design.

  • [importance: 1] Pretty easy to read and verbalize when the number of digits remains reasonable (but, as stated before, technical PKs should not be used by humans anyway; use a functional key instead).
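
For illustration, here is a minimal JPA/Hibernate sketch of the sequence-backed approach; the entity, generator, and sequence names are hypothetical, and jakarta.persistence is assumed (older stacks use javax.persistence):

    import jakarta.persistence.*;

    // Minimal sketch: the PK is produced by a database sequence at insert time.
    @Entity
    public class Customer {

        @Id
        @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "customer_seq")
        @SequenceGenerator(name = "customer_seq", sequenceName = "customer_id_seq", allocationSize = 1)
        private Long id;

        private String name;

        public Long getId() { return id; }
    }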

Why You Should Avoid Auto-Incremented Integers

Risks of Introducing Bugs

  • [importance 2] In some RDBMS, like PostgreSQL, a nextval operation is not truly transactional: in case of rollback, the value stays incremented. It is hence possible to get "holes" (missing values) in PK sequences. This is not an issue in itself, as uniqueness is preserved, but it can induce subtle bugs if developers rely on the PK to count items instead of using a proper COUNT query.
  • [importance 2] Likewise, developers may rely on PK values to sort items chronologically instead of using a dedicated date column. In case of sequence reset and ID reuse, this leads to wrong sort orders.
  • [importance 1] The key can grow huge, and if developers map the PK to an int variable instead of a long one, you can encounter silent overflow errors. For instance, in Java, if you map a PK to an Integer or a primitive int and the PK gets larger than 2,147,483,647, the value silently wraps around to a negative number (see the sketch below).
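
A minimal Java sketch of that silent overflow; the PK value is hypothetical:

    public class PkOverflowDemo {
        public static void main(String[] args) {
            // Hypothetical PK value just past Integer.MAX_VALUE (2,147,483,647).
            long pkFromDatabase = 2_147_483_648L;

            // Narrowing it to an int wraps around silently: no exception is thrown.
            int mappedPk = (int) pkFromDatabase;

            System.out.println(mappedPk); // prints -2147483648
        }
    }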

Security Risks When Using Auto-Incremented Integers as PK

Using them clearly makes your application an easier target:

  • [importance 3] Auto-incremented integers leak the number of items processed per unit of time. A competitor can easily deduce how many sales you made in a month, and an attacker can get a good idea of how many requests your system is supposed to handle and finely tune a DDoS attack.
  • [importance 5] Auto-incremented integers are predictable. If used in URLs, they become a directory traversal (also referred to as "path traversal") exploit vector. For example, an HTTP GET URL can easily be forged from a regular URL path (https://.../1234/...) to another one (like https://.../1235/...). If the application implements proper access management, the attacker gets a 403 code (as expected), but if it does not, or if some endpoints have been forgotten, he can access sensitive data. The defense-in-depth principle promotes several layers of controls and never relies on a single one.
  • [importance 4] Similarly, auto-incremented integers make large-scale bulk data scraping possible (in case of bad access controls). It is trivial to write scripts looping over IDs (like a curl inside a for loop in bash; see the sketch after this list).
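
To illustrate how trivial such enumeration is, here is a minimal Java sketch; the endpoint and ID range are hypothetical, and a properly protected API would answer 403/404 for every guessed ID:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Minimal sketch: with predictable integer PKs, scraping is just a loop.
    public class IdEnumerationDemo {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            for (int id = 1000; id <= 1010; id++) { // hypothetical ID range
                HttpRequest request = HttpRequest.newBuilder(
                        URI.create("https://shop.example.com/api/orders/" + id)).GET().build();
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(id + " -> HTTP " + response.statusCode());
            }
        }
    }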

Architectural Issues

  • [importance 4] Auto-incremented integers make the integration of two systems more difficult. Imagine that your company buys another one and you have to merge an existing customer database into yours using an ETL. If both systems use auto-incremented integers as PKs, you will have to avoid collisions by resetting sequences to new, not-yet-used high values. All foreign keys (FK) will have to be recomputed.
  • [importance 2] I think that, from an architectural viewpoint, a database should only store data, not create it. With sequences, we delegate part of that creation logic to the RDBMS.
  • [importance 1] With sequences, we mix inserts and data generation (insert into ... values (nextval('id_seq'))) and we have to keep the new value returned by the INSERT statement if we want to use it in subsequent queries. A creation statement that returns a value does not seem very logical to me. It is also possible to perform a SELECT nextval('id_seq') followed by an INSERT, but having to read something in order to create something doesn't seem any more logical (both patterns are sketched after this list).
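
For reference, here is what both patterns look like in plain JDBC; a minimal sketch, assuming PostgreSQL, a hypothetical customer table, a sequence named id_seq, and made-up connection settings:

    import java.sql.*;

    // Minimal JDBC sketch of the two sequence patterns discussed above.
    public class SequenceInsertPatterns {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/demo", "demo", "demo")) {

                // Pattern 1: the INSERT generates the id; we must capture the
                // returned value to reuse it in subsequent queries.
                long idFromInsert;
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO customer(id, name) VALUES (nextval('id_seq'), ?) RETURNING id")) {
                    ps.setString(1, "Alice");
                    try (ResultSet rs = ps.executeQuery()) {
                        rs.next();
                        idFromInsert = rs.getLong(1);
                    }
                }

                // Pattern 2: read the next sequence value first, then insert it.
                long idFromSelect;
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery("SELECT nextval('id_seq')")) {
                    rs.next();
                    idFromSelect = rs.getLong(1);
                }
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO customer(id, name) VALUES (?, ?)")) {
                    ps.setLong(1, idFromSelect);
                    ps.setString(2, "Bob");
                    ps.executeUpdate();
                }

                System.out.println(idFromInsert + " / " + idFromSelect);
            }
        }
    }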

Operations Risks

  • [importance 1] When using integers as keys for every entity, it is much easier to confuse one item with another (for instance, confusing request_id=10 with article_id=10).
  • [importance 1] When deleting an item, an operator can confuse one value with another (delete ... where id=4 instead of delete ... where id=40, for instance). This problem doesn't affect UUIDs, as it is virtually impossible to type a matching UUID by accident.

The Other Way: Random UUID

The alternative approach is to use UUID (RFC 4122, ISO/IEC 9834-8:2005) version 4 or variants.

UUID Pitfalls

  • Using UUID V1 or V2: only the V4 (random) version of UUID is acceptable. UUIDs based on timestamps and MAC addresses (V1, V2) may lead to collisions at very high generation frequencies (within the same millisecond) but, worse, they leak important data (the time of generation and machine identification data). That could help attackers or give developers bad ideas (see above why IDs should be meaningless).
  • Using the wrong database type: most modern RDBMS come with a UUID type. In PostgreSQL, a UUID uses 128 bits of storage, not the 288 bits we might naively infer from its 36-character textual format.
  • Changing your mind: if you went with integers, stick with them on existing projects.
  • Not using a cryptographically secure pseudorandom number generator (CSPRNG): you will encounter collisions and create security flaws. With a low-quality or buggy pseudorandom generator, the collision risk is very high, and collisions may occur several times per day or even per hour. Under Linux, use /dev/random and not /dev/urandom (in Java, see the sketch after this list).
  • Using a CSPRNG but blocking your application when entropy is exhausted: if you use /dev/random under Linux, a great solution is to run the haveged daemon to feed the entropy pool.
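
In Java, for instance, the standard java.util.UUID.randomUUID() factory already takes care of the version and randomness concerns: it produces version-4 UUIDs from a cryptographically strong SecureRandom source. A minimal check:

    import java.util.UUID;

    public class RandomUuidCheck {
        public static void main(String[] args) {
            // randomUUID() is backed by SecureRandom and always yields version 4.
            UUID id = UUID.randomUUID();
            System.out.println(id + " (version " + id.version() + ")"); // version 4
        }
    }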

UUID Misconceptions

  • "Using UUID requires that you check for collisions." As explained on this Wikipedia page, the risk of collision is so infinitesimal that it can be ignored. There is a 50% collision probability every 2.71E18 generations (if you generate 10 IDs per second without stopping, you can expect a 50% probability of collision every 8.5 billion years; the math is sketched after this list). The sole control I would advise is to correctly trap SQL errors, as a collision would throw a UNIQUE constraint violation error. Any good code would handle this type of technical error and retry anyway. Real-world production databases already throw erratic SQL errors on a regular basis (like ObjectOptimisticLockingFailureException when using Hibernate, for instance), so the work is probably already done, or it should be.
  • "UUID is more difficult to read and verbalize." As explained before, UUID is by no way meant for humans. Instead, use additional functional values for this. When well designed, they would be better than long integers. But UUID is often read by developers as well (when working on test doubles for instance). I observed in several projects that even then, UUID readiness is not an issue and no developers complained about it. We even figured out that transmitting UUID between developers (by instant messaging for instance) is safer than transmitting integers because nobody would type them and copy/paste prevents typos.

NOTE: NoSQL databases do not rely on integers as keys but on UUID (see MongoDB _id or CouchDB id attributes on documents for instance). This is due to their distributed nature, but I have never heard developers complain about it.

UUID V4 Advantages

  • [importance 5] Ensures total non-significance. URLs containing PKs are totally unpredictable, which prevents various exploits like path traversal or mass data downloads.
  • [importance 3] Greatly reduces the complexity of integration between databases.
  • [importance 2] Prevents all the potential bugs and operational errors listed above.
  • [importance 1] No sequence required anymore: the business code can generate UUIDs by itself, without involving the database.
  • [importance 2] The code is easier to test because it is trivial to mock UUIDs without any RDBMS or its sequence features (a sketch follows this list).
  • [importance 2] Most languages, frameworks, and tools support them.
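
As a minimal sketch of what this looks like with JPA/Hibernate (assuming a recent Hibernate version that maps java.util.UUID to PostgreSQL's native uuid column type; the entity name is hypothetical):

    import jakarta.persistence.Entity;
    import jakarta.persistence.Id;
    import java.util.UUID;

    // Minimal sketch: the application assigns the PK itself, no sequence involved,
    // which also makes the id trivial to control in tests.
    @Entity
    public class PurchaseOrder {

        @Id
        private UUID id = UUID.randomUUID();

        public UUID getId() { return id; }
    }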

UUID Real But Negligible Issues

  • UUID uses more space on disk and in memory (buffers). On most databases, a long uses 64 bits whereas a UUID takes 128 bits. On a large database, that only adds 8 MB per million rows.
  • There can be an impact on INSERT latency. Inserting one million rows into a PostgreSQL database takes about 25 seconds using UUID V4 versus 6 seconds with integers. This is noticeable only for very write-heavy workloads.
  • SQL queries require more CPU cycles because of the key size (comparing 128 bits instead of 64). In practice, the overhead is negligible.
  • In some very rare cases (when they contain only digits), UUIDs could be confused with credit card numbers by a badly configured WAF (web application firewall). Think about it when using F5 ASM, for instance.

UUID V4 Real Issues and How to Fix Them

  • [importance 3] UUID V4 looks fairly simple to implement but requires a minimal amount of skill and knowledge. If your team lacks tech leads/software architects and has no idea how to get a good source of randomness, or what the difference between UUID versions is, go for auto-incremented integers; it may save you from painful refactorings.
  • [importance 4] On most RDBMS, using genuine UUID V4 values on large databases is not appreciated by DBAs because they fragment indexes, slowing them down both when refreshing and during queries. If too fragmented, indexes have to be loaded entirely into memory, which causes serious performance issues if they don't fully fit into RAM and disks have to be accessed.
  • [importance 2] Another performance issue concerns the journals affected by fragmentation. Defragmentation (REINDEX or VACUUM) can become much slower, and data replication (when enabled) can be impacted if it relies on journals. On PostgreSQL, DBAs call this phenomenon "WAL write amplification." Note, however, that the storage hardware has a large impact here: SSD and NVMe disks, designed for random access, greatly mitigate this issue.

These last two performance issues can easily be fixed using a UUID V4 variant: the short prefix COMB UUID. COMB means "combined" because it mixes UUID V4 randomness with a hint of time. Its principle is to "sacrifice" two bytes of randomness and use them as a rolling sequence based on the current time (epoch value) with minute-wide precision. Every minute, the prefix is incremented (it thus runs through all values from 0000 to FFFF in about 45 days), so all UUIDs generated within the same minute share the same two-byte prefix. A minimal sketch of the idea follows.
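
This is an illustrative Java sketch of the principle, not the API of any particular library: the two most significant bytes of a random UUID V4 are overwritten with the current epoch minute truncated to 16 bits.

    import java.time.Instant;
    import java.util.UUID;

    // Minimal sketch of a "short prefix COMB" UUID: two bytes of randomness are
    // traded for a minute-based rolling prefix; the rest stays a genuine UUID V4.
    public final class ShortPrefixCombUuid {

        public static UUID generate() {
            UUID random = UUID.randomUUID(); // CSPRNG-backed V4

            // Rolling 16-bit prefix: epoch minutes, wrapping every 65,536 minutes (~45 days).
            long prefix = (Instant.now().getEpochSecond() / 60) & 0xFFFF;

            // Overwrite the two most significant bytes with the prefix; the version
            // and variant bits live elsewhere in the UUID and remain untouched.
            long msb = (prefix << 48) | (random.getMostSignificantBits() & 0x0000FFFFFFFFFFFFL);
            return new UUID(msb, random.getLeastSignificantBits());
        }

        public static void main(String[] args) {
            // UUIDs created in the same minute share their first four hex characters.
            for (int i = 0; i < 3; i++) {
                System.out.println(generate());
            }
        }
    }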

This way, UUID PKs induce far less fragmentation, and index performance is similar to what is observed with auto-incremented integers.

Several implementations exist. If you use Java, check this library to generate PKs from your application code.

If you prefer letting the RDBMS create the UUID itself, several implementations exist (like this PostgreSQL extension), but this adds a bit of complexity to installing and configuring the RDBMS.

We observed a few minor drawbacks with this method though:

  • It is a bit more difficult for developers to tell UUIDs apart, as they start with the same bytes when created within a short period of time. They have to check the last bytes instead.
  • The loss of entropy slightly increases the chance of collisions, as only 12 of the remaining 14 bytes are now truly random. However, the two-byte rolling prefix still adds non-negligible, time-based entropy. Even in the worst case, assuming the prefix adds no entropy at all, the collision risk remains negligible: you now have a 50% chance of a collision every 1.05E16 generated UUIDs. If you continuously generate 10 UUIDs per second, you have a 50% chance of a collision every 33.5 million years.
  • If PKs have to be generated by an ETL (typically during a migration process), replacing the built-in standard UUID V4 generator with a short prefix COMB one may require a few lines of code and/or some integration work. For instance, with PENTAHO, we had to integrate a Java library into the stream.