
UUID as a Random String

Okay, to be more precise, the issue is with UUID4 specifically. Recently, I’ve been playing around with distributed tracing and the packages and methodologies behind it.


One of the core ideas of distributed tracing is to attach a random identifier to a request as it bounces between different systems, so that the information collected along the way can be correlated by that identifier at the end. Simple.

Another concept that comes up is sampling. Distributed tracing is not meant for analysing every single request; especially on systems under high load, tracing everything might slow your entire cluster down. To resolve this, sampling was introduced. The sampling decision has to come from an algorithm shared by all systems, so that every one of them knows whether to add trace data to the request at hand.
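
To make this concrete, here is a minimal sketch of what a deterministic, ID-based sampling decision could look like in Go. The sampled helper, the FNV hash, and the 10% rate are illustrative assumptions of mine, not taken from any particular tracing library; the only point is that the decision depends solely on the trace ID, so every service that sees the same ID reaches the same verdict.

package main

import (
	"fmt"
	"hash/fnv"
)

// sampled decides whether a trace should be recorded based only on its ID,
// so every service that sees the same ID makes the same decision.
// The hash function and the rate are illustrative choices.
func sampled(traceID string, rate float64) bool {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	// Map the 32-bit hash into the 0-1 range and compare it to the rate.
	return float64(h.Sum32())/float64(^uint32(0)) < rate
}

func main() {
	fmt.Println(sampled("9f86d081-884c-4d65-9afb-34c8993f2d42", 0.1))
}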

UUID4 is not (that) random

As you might have figured by now, I noticed an issue in the system described above. The identifier used to identify traces needs to be unique. According to the UUID4 Wikipedia entry:

Randomly generated UUIDs have 122 random bits. One’s annual risk of being hit by a meteorite is estimated to be one chance in 17 billion, which means the probability is about 0.00000000006 (6 × 10⁻¹¹), equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate.

Sounds pretty convincing and intuitive to use, right? It solves the problem just about perfectly. However, if you read the entire entry and RFC 4122, you will soon learn that some parts of the string are, in fact, not so random. Let’s see this in action using the Go programming language and the uuid package provided by Google.

package main

import (
	"log"

	"github.com/google/uuid"
)

func main() {
	// NewRandom generates a version 4 (random) UUID.
	id, err := uuid.NewRandom()
	if err != nil {
		log.Fatalln(err)
	}
	log.Println(id.String())
}

The above code is a fairly simple way to generate a version 4 UUID according to the specification. Let’s see how the individual hex characters behave in different parts of the string.

package main

import (
	"log"

	"github.com/google/uuid"
)

func main() {
	// A UUID string is 36 characters long; scanning the first 32 indices is
	// enough to see the pattern, since everything after index 23 is random hex.
	for uuidIndex := 0; uuidIndex < 32; uuidIndex++ {

		// letterMap counts how often each character appears at this index.
		letterMap := make(map[string]int)

		for i := 0; i < 100000; i++ {
			id, err := uuid.NewRandom()
			if err != nil {
				log.Fatalln(err)
			}

			letterAtIndex := string(id.String()[uuidIndex])
			letterMap[letterAtIndex]++
		}

		// Print the index and the number of distinct characters seen there.
		log.Println(uuidIndex, len(letterMap))
	}
}

After running the code above, you should see something like this:

~/go/src/tuuid ᐅ go run main.go
0 16
1 16
2 16
3 16
4 16
5 16
6 16
7 16
8 1
9 16
10 16
11 16
12 16
13 1
14 1
15 16
16 16
17 16
18 1
19 4
20 16
21 16
22 16
23 1
24 16
25 16
26 16
27 16
28 16
29 16
30 16
31 16

What you see above is the index within the UUID string that we are looking at, followed by the number of distinct characters that appeared at that index (the code also counts how often each character occurs, but we will ignore that for now).

As you can see, most positions show 16 distinct characters. We also see a handful of 1s: indices 8, 13, 18, and 23 are the - separators, and index 14 is always 4, because that character encodes the UUID version. However, at index 19 we see a strange number: 4. That character carries the variant bits of the UUID, so only 8, 9, a, or b can ever appear there. In other words, a UUID4 string always has the shape xxxxxxxx-xxxx-4xxx-Nxxx-xxxxxxxxxxxx, where N is one of 8, 9, a, or b. If you want to hash or sample based on these characters, you will have a bad time.

What could be a solution?

We have an issue: we need a convenient way to generate a unique identifier for our traces, but we also need to base the sampling calculation on something that is not as uniformly random as we initially assumed. There are two paths we can take:

1. Change the unique-identifier algorithm. We can generate a truly random string with a different algorithm. A benefit of this is that we only need to update the entry points of our systems, meaning that not all services need to be updated.
2. Change the sampling algorithm. We can write a new sampling algorithm that stabilizes (or simply avoids) the not-so-random characters we may have been using in the calculation. However, we have to make sure this sampling algorithm stays tied to the trace ID we went with, otherwise we will lose valuable trace information.
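
As a rough illustration of the first path, here is a minimal sketch that generates a 128-bit identifier in which every hex character is uniformly random. The newTraceID helper is a made-up name for this example, not part of the uuid package or of any tracing library.

package main

import (
	"crypto/rand"
	"encoding/hex"
	"log"
)

// newTraceID returns a 32-character hex string in which every character is
// drawn from the full 0-f range, unlike a UUID4 where the version and
// variant positions are fixed or constrained.
func newTraceID() (string, error) {
	b := make([]byte, 16) // 128 random bits, the same size as a UUID
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

func main() {
	id, err := newTraceID()
	if err != nil {
		log.Fatalln(err)
	}
	log.Println(id)
}

Because every character of such an identifier behaves the same way, any part of it can safely feed the sampling calculation.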

Conclusion

UUID version 4 is an excellent algorithm for generating unique identifiers that are extremely unlikely to clash. However, we need to be mindful of its structure if we want to use the identifier in any calculations.

That’s all I wanted to share today. Until next time!