Sunday, January 23, 2022

How to Do UUID as Primary Keys the Right Way

 Source: https://dzone.com/articles/uuid-as-primary-keys-how-to-do-it-right?edition=598292

UUID V4 or its COMB variant are great to mitigate various security, stability, and architectural issues, but be aware of various pitfalls when using them.

Why Do We Need Technical IDs in the First Place?

Any properly designed relational database table has a Primary Key (PK) that allows you to uniquely and stably identify each record. Even though a primary key can be composite (built from several columns), it is a widespread good practice to dedicate a special column (often named id or id_<table name>) to this end. This special column is used to technically identify records and can be referenced by foreign keys in relations.

NOTE: Do not confuse technical (also named "surrogate") keys with functional keys. The most important tables (so-called entities in domain-driven design) may contain an alternate human-readable ID column (like the customer ID "G2F6D"). In this article, we will focus only on technical PKs. They should be processed and read only by machines, not humans.

How to Choose Good IDs

Good IDs are:

  • Unique (no collisions). This is enforced by the UNIQUE constraint automatically added by the RDBMS on every PK.
  • Not reusable. Reusing a deleted row's PK is technically possible in relational databases but is a very bad idea because it creates confusion (for example, an older log file can reference an ID reused in the meantime by a new entity, thus leading to false deductions).
  • Meaningless. Developers should not parse IDs for the wrong reasons. Imagine an ID starting with the current year: some developers may ignore that a date_creation column exists and rely only on the PK's first four digits. If the ID format changes or is buggy (because of bad timezone handling, for instance), subtle issues may arise. Even though this is widely debated, I would warn against using natural keys as PKs altogether: it may limit your options in the future. For example, if you use the e-mail address as PK, you implicitly forbid changing it in future releases; never say "never." Another problem with natural keys is the difficulty of ensuring uniqueness, due to functional issues, even when everything has been done to avoid them. I once worked for a French governmental agency and observed both issues in different projects: 1) legacy code relied on the first digit of the NIR (social identity number) to determine the person type, thus ignoring possible type reassignments (even though the current type was available as a dedicated column); 2) we recently discovered that this unique ID was not so unique (for example, an ID shared temporarily by several members of an immigrant family, or collisions following city mergers). The real world is just too complex to make any assumption about uniqueness.

Which ID Format to Choose?

The two formats matching these rules are AFAIK:

  1. Auto-incremented integers (starting at 1 or any larger value: 1, 100, 154555). Most RDBMS, like PostgreSQL or Oracle, provide the SEQUENCE object, which allows you to auto-increment a value while respecting the ACID principles; MySQL provides the AUTO_INCREMENT attribute (a sketch of both options follows this list).
  2. Using a text-based random UUID V4 (universally unique identifier), also referred to as GUID (globally unique identifier) by Microsoft. Example: 9d17210c-2d5f-11ea-978f-2e728ce88125.
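
To make the contrast concrete, here is a minimal JPA/Hibernate sketch of the two strategies (the entity, column, and sequence names are illustrative assumptions; older Hibernate versions may need an explicit type mapping for UUID columns):

    import java.util.UUID;
    import javax.persistence.*;

    @Entity
    class CustomerWithSequence {
        @Id
        @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "customer_seq")
        @SequenceGenerator(name = "customer_seq", sequenceName = "customer_id_seq")
        private Long id;                         // value produced by the database sequence
    }

    @Entity
    class CustomerWithUuid {
        @Id
        private UUID id = UUID.randomUUID();     // value produced by the application itself
    }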

When working on existing projects, I often observe that PKs are designed as auto-incremented integers. Though this may seem an obvious, no-brainer choice, it can be a bad idea in the long run...

Let's consider both options. Each argument is provided along with an importance weight (from 1=minor to 5=major).

Why You Should Use Auto-Incremented Integers

  • [importance: 3] Known, understood, and simple solution. Leverages sequences on modern RDBMS. Comes with very little possibility of performance issues due to bad design.

  • [importance: 1] Pretty easy to read and verbalize when the number of digits remains reasonable (but, as stated before, technical PKs should not be used by humans anyway; use a functional key instead).

Why You Should Avoid Auto-Incremented Integers

Risks to Introduce Bugs

  • [importance 2] In some RDBMS like PostgreSQL, a nextval operation is not truly transactional: in rollback cases, the value stays incremented. It is hence possible to get "holes" (missing values) in PK sequences. This is not an issue by itself, as uniqueness is preserved, but it can induce subtle bugs if developers rely on the PK to count the number of items instead of using a proper COUNT query.
  • [importance 2] Likewise, developers may rely on PK values to sort items chronologically instead of using a dedicated date column. In case of a sequence reset and ID reuse, this may lead to incorrect sort orders.
  • [importance 1] The key can grow huge, and if developers map the PK to an int variable instead of a long one, you can encounter silent overflow errors. For instance, in Java, if you map a PK to an Integer or a primitive int and the PK gets larger than 2,147,483,647, the value silently wraps around to a negative number (see the sketch below).
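
A minimal Java illustration of this silent wrap-around (the explicit narrowing cast below mirrors what a careless mapping can do; the exact behavior of an ORM mapping depends on the framework and driver):

    long pk = 2_147_483_648L;    // one past Integer.MAX_VALUE (2,147,483,647)
    int mapped = (int) pk;       // narrowing conversion: no exception is thrown
    System.out.println(mapped);  // prints -2147483648, a silent overflow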

Security Risks When Using Auto-Incremented Integers as PK

Using them clearly makes your application an easier target:

  • [importance 3] Auto-incremented integers leak the number of items processed per unit of time. A competitor can easily deduce how many sales you made in a month, and an attacker can get a good idea of how many requests your system is supposed to handle and finely tune a DDoS attack.
  • [importance 5] Auto-incremented integers are predictable. If used in URLs, they become a directory traversal (also referred to as "path traversal") exploit vector. For example, an HTTP GET URL can easily be forged from a regular URL path (https://.../1234/...) to another one (like https://.../1235/...). If the application implements proper access management, the attacker gets a 403 code (as expected), but if it does not, or if some endpoints have been forgotten, they can retrieve sensitive data. The defense-in-depth principle promotes several layers of controls and never relies on a single one.
  • [importance 4] Similarly, auto-incremented integers make large-scale bulk data scraping possible (in case of bad access controls). It is trivial to write scripts that loop over IDs (like a curl inside a for loop in bash).

Architectural Issues

  • [importance 4] Auto-incremented integers make the integration of two systems more difficult. Imagine that your company buys another one and you have to merge an existing customer database into yours using an ETL. If both systems use auto-incremented integers as PKs, you will have to avoid collisions by resetting sequences to new, not-yet-used values, and all foreign keys (FK) will have to be recomputed.
  • [importance 2] I think that, from an architectural viewpoint, a database should only store data, not create it. With sequences, we delegate data-creation logic to the RDBMS.
  • [importance 1] With sequences, we mix inserts and data generation (insert into ... values (nextval('id_seq'))) and we have to capture the new value returned by the INSERT clause if we want to use it in subsequent queries. A creation statement that returns a value does not seem very logical to me. It is also possible to perform a SELECT nextval('id_seq') followed by an INSERT clause, but having to read something before being able to create a row does not seem any more logical (see the sketch after this list).
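
As a sketch of the two patterns just described (plain JDBC; the customer table and customer_id_seq sequence names are illustrative assumptions), the sequence approach forces a read before the write, while the UUID approach lets the application create the key itself:

    import java.sql.*;
    import java.util.UUID;

    class CustomerDao {
        // Sequence approach: read the next value, then reuse it in the INSERT
        // and in any follow-up queries.
        long insertWithSequence(Connection conn, String name) throws SQLException {
            long id;
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT nextval('customer_id_seq')")) {
                rs.next();
                id = rs.getLong(1);
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO customer (id, name) VALUES (?, ?)")) {
                ps.setLong(1, id);
                ps.setString(2, name);
                ps.executeUpdate();
            }
            return id;
        }

        // UUID approach: the application generates the key, no read-before-write.
        UUID insertWithUuid(Connection conn, String name) throws SQLException {
            UUID id = UUID.randomUUID();
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO customer (id, name) VALUES (?, ?)")) {
                ps.setObject(1, id);
                ps.setString(2, name);
                ps.executeUpdate();
            }
            return id;
        }
    }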

Operations Risks

  • [importance 1] When using integers as keys for every entity, it is much easier to confuse an item with another (for instance confusing request_id=10 with article_id=10).
  • [importance 1] When deleting an item, an operator can confuse a value with another (delete ... where id=4 instead of delete ... where id=40 for instance). This problem doesn't affect UUID as it is virtually impossible to type a matching UUID by chance.

The Other Way: Random UUID

The alternative approach is to use UUID (RFC 4122, ISO/IEC 9834-8:2005) version 4 or variants.

UUID Pitfalls

  • Using UUID V1 or V2: only the V4 (random value) version of UUID is acceptable. UUIDs based on timestamps (V1, V2) and the MAC address may lead to collisions at very high generation frequencies (in the same millisecond), but worse, they leak important data (generation time and machine identification data). That could help attackers or give bad ideas to developers (see above why IDs should be meaningless).
  • Using the wrong database type: most modern RDBMS come with a native UUID type. In PostgreSQL, a UUID uses 128 bits of storage, not the 288 bits we might naively infer from its textual format.
  • Changing your mind: if you went with integers on existing projects, stick with them.
  • Not using a cryptographically secure pseudorandom number generator (CSPRNG): you will encounter collisions and create security flaws. With a low-quality or buggy pseudorandom generator, the collision risk is very high and collisions may occur several times per day or even per hour. Under Linux, use /dev/random and not /dev/urandom (see the generation sketch after this list).
  • Using a CSPRNG but blocking your application when entropy is exhausted: if you use /dev/random under Linux, a good solution is to run the haveged daemon to feed the entropy pool.
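
For reference, here is roughly what a correct V4 generation looks like in Java. Note that java.util.UUID.randomUUID() already does this internally with a SecureRandom, so this manual version is only a sketch of the version/variant bit handling:

    import java.nio.ByteBuffer;
    import java.security.SecureRandom;
    import java.util.UUID;

    final class RandomUuid {
        // CSPRNG; never use java.util.Random for identifiers
        private static final SecureRandom RANDOM = new SecureRandom();

        static UUID v4() {
            byte[] bytes = new byte[16];
            RANDOM.nextBytes(bytes);
            bytes[6] = (byte) ((bytes[6] & 0x0f) | 0x40); // version 4 (random)
            bytes[8] = (byte) ((bytes[8] & 0x3f) | 0x80); // IETF variant
            ByteBuffer buf = ByteBuffer.wrap(bytes);
            return new UUID(buf.getLong(), buf.getLong());
        }
    }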

UUID Misconceptions

  • "Using UUID requires that you check for collisions." As explained on this Wikipedia page, the risk of collision is so infinitesimal that it can be ignored: there is a 50% collision probability after about 2.71E18 generated UUIDs (if you generate 10 IDs per second non-stop, you can expect a 50% probability of collision after roughly 8.5 billion years). The only control I would advise is to correctly trap SQL errors, as a collision would throw a UNIQUE constraint violation error. Any good code would handle this type of technical error and retry anyway (a minimal retry sketch follows this list). Real-world production databases already throw sporadic SQL errors on a regular basis (like ObjectOptimisticLockingFailureException when using Hibernate, for instance), so the work is probably already done, or it should be.
  • "UUID is more difficult to read and verbalize." As explained before, UUID is in no way meant for humans; use additional functional values for this instead. When well designed, they will be more readable than long integers. UUIDs are of course still read by developers (when working on test doubles, for instance), but I observed in several projects that, even then, UUID readability is not an issue and no developer complained about it. We even figured out that transmitting a UUID between developers (by instant messaging, for instance) is safer than transmitting an integer, because nobody would type it and copy/paste prevents typos.

NOTE: NoSQL databases do not rely on integers as keys but on UUID-like generated identifiers (see MongoDB's ObjectId _id or CouchDB's document id, for instance). This is due to their distributed nature, but I have never heard developers complain about it.

UUID V4 Advantages

  • [importance 5] Ensures total non-significance. URLs containing PKs are totally unpredictable, which prevents various exploits like path traversal or mass data downloads.
  • [importance 3] Greatly reduces the complexity of integration between databases.
  • [importance 2] Prevents all potential bugs and operations errors listed above.
  • [importance 1] No more sequence required: the business code can generate UUID by itself without using the database.
  • [importance 2] The code is easier to test because it is trivial to mock UUIDs without any RDBMS and its sequence features (see the sketch after this list).
  • [importance 2] Most languages, frameworks, and tools support them.
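
On the testability point, a common approach (a sketch with illustrative names, not a prescribed API) is to inject the UUID source so tests can make it deterministic without any database:

    import java.util.UUID;
    import java.util.function.Supplier;

    class OrderService {
        private final Supplier<UUID> idSupplier;

        OrderService() { this(UUID::randomUUID); }                 // production default
        OrderService(Supplier<UUID> idSupplier) { this.idSupplier = idSupplier; }

        String createOrder(String label) {
            UUID id = idSupplier.get();                            // no database round-trip
            return id + ":" + label;
        }
    }

    // In a test, pass a fixed supplier to get reproducible IDs:
    // new OrderService(() -> UUID.fromString("00000000-0000-4000-8000-000000000001"));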

UUID Real But Negligible Issues

  • UUID uses more space on disk and in memory (buffers). On most databases, a long uses 64 bits whereas a UUID takes 128 bits. On a large database, this only adds 8 MB per million rows.
  • There can be an impact on INSERT latency. Inserting one million rows into a PostgreSQL database takes about 25 seconds using UUID V4 and about 6 seconds with integers. This is noticeable only for very write-intensive workloads.
  • SQL queries require more CPU cycles because of the key size (two cycles for 128 bits vs. a single one for 64-bit integers). In practice, the overhead is negligible.
  • In some very rare cases (when it happens to contain only digits), a UUID could be confused with a credit card number by a badly configured WAF (web application firewall). Think about it when using F5 ASM, for instance.

UUID V4 Real Issues and How to Fix Them

  • [importance 3] UUID V4 looks fairly simple to implement but requires a minimum of skill and knowledge. If your team lacks tech leads/software architects and has no idea how to get a good source of randomness or what the difference between UUID versions is, go for auto-incremented integers: it may save you from painful refactorings.
  • [importance 4] On most RDBMS, using genuine UUID V4 on large databases is not appreciated by DBAs because it fragments indexes, slowing them down both when they are refreshed and during queries. If too fragmented, indexes have to be loaded entirely into memory, generating serious performance issues when they don't fully fit into RAM and disks have to be accessed.
  • [importance 2] Another performance issue concerns the write-ahead journals inflated by fragmentation. Defragmentation (REINDEX or VACUUM) can become much slower, and data replication (when enabled) can be impacted if it relies on journals. On PostgreSQL, DBAs call this phenomenon "WAL write amplification." Note, however, that the storage hardware has a large impact here: SSD and NVMe disks, which perform random access by design, greatly mitigate this issue.

These last two performance issues can easily be fixed by using a UUID V4 variant: the short prefix COMB UUID. COMB means "combined" because it mixes UUID V4 randomness with a hint of time. Its principle is to "sacrifice" two bytes of randomness and use them as a rolling sequence based on the current time (epoch value) with one-minute precision. Every minute, the prefix is incremented (it thus runs through all values from 0000 to FFFF in about 45 days), so UUIDs generated close together in time share nearby prefixes. A minimal generator sketch is shown below.
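
As an illustration of the principle only (a sketch, not one of the production-ready libraries mentioned below), a short prefix COMB generator can overwrite the first two bytes of a random V4 UUID with an epoch-minute counter:

    import java.security.SecureRandom;
    import java.time.Instant;
    import java.util.UUID;

    final class ShortPrefixCombUuid {
        private static final SecureRandom RANDOM = new SecureRandom();

        static UUID generate() {
            byte[] bytes = new byte[16];
            RANDOM.nextBytes(bytes);
            // Rolling two-byte prefix from the epoch minute; wraps about every 45 days.
            int prefix = (int) ((Instant.now().getEpochSecond() / 60) & 0xFFFF);
            bytes[0] = (byte) (prefix >>> 8);
            bytes[1] = (byte) prefix;
            // Keep the standard version (4) and variant bits so the value stays a valid UUID.
            bytes[6] = (byte) ((bytes[6] & 0x0f) | 0x40);
            bytes[8] = (byte) ((bytes[8] & 0x3f) | 0x80);
            long msb = 0, lsb = 0;
            for (int i = 0; i < 8; i++)  msb = (msb << 8) | (bytes[i] & 0xff);
            for (int i = 8; i < 16; i++) lsb = (lsb << 8) | (bytes[i] & 0xff);
            return new UUID(msb, lsb);
        }
    }

UUIDs generated within the same minute then share the same four leading hexadecimal characters, which keeps index insertions mostly ordered.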

This way, UUID PKs induce far less fragmentation, and index performance is similar to what is observed with auto-incremented integers.

Several implementations exist. If you use Java, check this library to generate PKs from your application code.

If you prefer letting the RDBMS create the UUID itself, several implementations exist (like this PostgreSQL extension) but this adds a bit of complexity to install and configure the RDBMS.

We observed a few minor drawbacks with this method though:

  • It is a bit more difficult for developers to distinguish UUIDs, since those created within a short time span start with the same bytes; they have to check the last bytes instead.
  • The loss of entropy slightly increases the chance of collisions, since only 12 of the 14 random bytes are now truly shuffled. However, the two-byte rolling prefix still adds non-negligible, time-based entropy. Even with the conservative assumption that both prefix bytes are lost entropy, the collision risk remains negligible: you now have a 50% chance of a collision every 1.05E16 generated UUIDs. If you continuously generate 10 UUIDs per second, you can expect a 50% chance of a collision every 33.5 million years.
  • If PKs have to be generated by an ETL (typically during a migration process), replacing the built-in standard UUID V4 generator with a short prefix COMB one may require a few lines of code and/or some integration work. For instance, with Pentaho, we had to integrate a Java library into the stream.

Friday, January 14, 2022

Raising the Bar on Security by Purging Credentials From the Cloud

 Source: https://dzone.com/articles/raising-the-bar-on-security-by-purging-credentials?edition=457210

In this post, dig into elemental cloud security challenges, such as a centralized native cloud-only model for identity verification and authentication.


The Cloud is ubiquitous: any company looking to ramp up quickly will provision its compute, networking, and storage with its preferred cloud provider and get started rolling out its products. That makes total sense from a business perspective. The Cloud has simplified development and automation exponentially over the years, and emerging tech such as AI and IoT will only accelerate this.

However, the catch is that the very same foundational architectures that drive the Cloud’s efficiency, flexibility, and cost benefits are ultimately also its weakest links from a security perspective. The result is the daily march of headlines we all read about: ever larger and deeper breaches of data and systems.

There are three fundamental drivers that serve as the basis for the large majority of access hacks, data security breaches, and privacy vulnerabilities:

  1. A centralized native cloud-only model for identity verification and authentication

  2. A fundamentally flawed architecture for data security and privacy  

  3. The complexity of the native cloud-based solutions and active security measures that are layered over this inherently flawed foundation

This series of articles will dig into each of these elemental cloud security challenges that need to be addressed if we really want to improve things. Today, we will focus on the first. 

Ok, What’s the Problem With Existing Password-Less Authentication?

Credentials-based authentication - that is, username and password - is notoriously insecure and the source of 80%+ of data breaches today.  Yet most end users still gain access to their websites and applications this way.  Even worse, so do many DevOps and cloud engineers to their sensitive production cloud environments.   Thankfully, a long-overdue migration to password-less authentication is now underway, with dozens of tools vying for supremacy, and more and more sites like GitHub are deprecating password-based authentication entirely.  

While this is an important step forward, it still only solves part of the problem: not all password-less authentication solutions are created equal.  The new wave of solutions eliminates the password, but all still have one or several intrinsic weaknesses derived from the same foundational flaw.  

They Rely on the Cloud. . . And the Cloud Is Compromised

The benefits of the cloud - its flexibility, scalability, and distributed access - are indisputable. However, the practice of having all of your much-needed services available centrally, from the data lakes to the databases, to the IAM and much more, has essentially created a security monster.

At AWS re:Invent just a couple of weeks ago, AWS CTO Werner Vogels said that there is virtually no service their IAM framework doesn’t touch. If that doesn’t raise the hairs on the back of your neck, I don’t know what will.

Therefore, as any security engineer knows, the Cloud’s strengths are ultimately also the Cloud’s main weaknesses.  The scale and complexity of the Cloud make it practically impossible to defend, while the amount and value of data and capabilities that are concentrated into enormous cloud-based data lakes make attacks relatively easy and incredibly profitable.  

So, if you really want to avoid being the next big headline, you should start with the working premise that the Cloud is essentially compromised. With this assumption in mind, here is where the current crop of password-less authentication solutions remains vulnerable: 

  1. They still rely on cloud databases of user credentials, which remain as vulnerable as ever. In fact, most are aggregating even more data on users' identities, their activities, their history, and so on in a single place, making a hacker’s job easier than ever.

  2. They rely heavily on user self-identification and authorization, ignoring the vulnerability of the means used to execute this: SMS and email, two extremely spoofable channels.

  3. They add friction for the end user, which inevitably reduces compliance. Let’s face it: users are lazy. Additional steps with authenticator apps, hardware tokens, and even SMS-based 2FA are too much for many of them.

Given these weaknesses, attacks can be executed literally from anywhere in the world, from any device, and by any hacker, whether an amateur or a more sophisticated professional. Even worse, it doesn’t even require a human attacker: software, especially bots, malware, ransomware, and even AI software, can probe and drive attacks that are often one step ahead of the AI monitoring being used to stop them.

Yikes, that's scary. Until we change things, the attackers will always have the advantage. 

So What Can We Do?

We can design and build a new approach to cybersecurity based on one simple premise: 

The Cloud is compromised, so eliminate the attack surface in the Cloud.




Based on that premise, here are the core pillars of the approach that can be architected to address the problem of identity security via credential-free authentication.

1.  Enforce Access to Resources With Cryptographic Keys, Not Credentials

As long as credentials exist, they can and will be stolen if the resources they access have value.  Encryption, especially AES 256 encryption, works.  Use it.  

2.  Bind Cryptographic Access to Devices, and Devices to Users

The good news is that both human and machine users have devices, and devices can be made secure with two simple steps.

  1. Bind keys to devices.  In addition to the Trusted Platform Module, there are other methods to uniquely restrict the use of encryption keys to the device on which they were created and authorized.
  2. Bind devices to their authenticated users.  Once the device can truly be trusted, now authentication methods that bind a device to the unique human that is authorized to use or control it can be added.  

3.  Decentralize and Bind MFA to Devices   

All MFA today ultimately relies on cloud-based databases and servers, and a Chain of Trust to them for execution. To take the Cloud off the table, we can decentralize and distribute MFA out to users that are bound to devices, and eliminate the role of any central authority. 

The beauty of the first three steps is that while immensely enhancing the security of access, they simultaneously reduce both friction for end users and support overhead for IT staff. The user's device is the log-in: passwords, second device authentications, one-time codes, push notifications, or hardware tokens can all be avoided.

Instead, we can leverage cryptographic device-based authentication with the user's device to deliver two strong factors to assure identity:

  1. something you “own” - the possession of a private AES 256 key; and

  2. something you “are” - leveraging biometrics in tandem with cloud- and device-based monitoring.

4.  When It Really Matters, Introduce Zero Trust via Human in the Loop

The steps above will bring a new level of security - and privacy - to the vast majority of authentication use cases and needs. Still, there are times when the resources accessed by a person and their devices are valuable enough to warrant continuous identity verification to ensure no wrongful access. In those use cases, the only way to be truly sure of the identity of the user of a device requesting access is to build a direct human-to-human interaction, with someone who knows them, directly into the authentication flow in an automated way. We have developed a way to do exactly that.

What Would That Mean?

If these pillars could be applied to truly reduce the attack surface for authentication and identity verification in the Cloud to ZERO, what would that mean for cybersecurity? There are lots of wins, but these are the three most important:

1.  We Can Eliminate Credentials-Based Attacks   

Phishing, spoofing, theft and loss, social-engineering attacks, you name it: all futile if there is no password, no credentials to steal.

2.  We Can Eliminate Cloud-Based Company and System-Wide Attacks   

If the only way to gain access to a resource is to go one by one after each endpoint device, we have raised the bar for attacks beyond the reach of the vast majority of hackers.

3.  We Can Escalate and Enforce a Human-In-The-Loop Zero Trust Identity Verification When It Really Matters   

At the end of the day, every cloud-based authentication today relies upon software or the endpoint device to verify identity.  When your context demands real-time reauthentication of a device or a user, another authorized human that knows the user can do this job based on a pre-configured automated software-driven authentication flow.

Will this stop every possible attack? Of course not. However, it can for the first time in a long time change the equation in favor of the defense.

A Revolution Is Coming

As long as the industry continues to cling to fundamentally vulnerable, centralized native cloud-only security architectures and layer on ever more complex solutions on top of a cracked foundation, we can not only expect but guarantee that we will continue to get the same outcomes.  The endless game of whack-a-mole, chasing down the last breach rather than preventing the next one, will go on and on.

Unless we do a hard reset.  It’s possible.