Why string identifiers (rather than UUID)?

Jimit_Ndiaye · September 14, 2011, 12:05am

I’ve noticed that identifiers in events have changed from UUID to string. What was the reasoning behind this?

Allard · September 14, 2011, 6:56am

Hi Jimit,

the reason is quite simple: the exact format doesn’t really matter, as long as Axon can compare identifiers. I am working on a high performance command bus (> 1M commands per second) and noticed that UUID generation is limited to 250k per second. Changing the identifier to a String allows developers to define a custom IdentifierFactory. A machine-identified sequential UUID generation strategy is much faster and doesn’t need as much concurrency control as random UUID generation.
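To make the idea of a pluggable generation strategy concrete, here is a minimal sketch. The `IdGenerator` interface and both class names are hypothetical and are not Axon's actual IdentifierFactory API; the sequential variant also produces a plain prefixed counter rather than a real time-based UUID.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical interface for illustration; not Axon's actual IdentifierFactory API.
interface IdGenerator {
    String nextId();
}

// Random strategy: version 4 UUIDs from java.util.UUID.
class RandomIdGenerator implements IdGenerator {
    @Override
    public String nextId() {
        return UUID.randomUUID().toString();
    }
}

// Sequential strategy: a fixed per-machine prefix plus an atomic counter.
// Assumption: every node is given a distinct nodeId, and the start timestamp
// in the prefix keeps ids from colliding across restarts of the same node.
class SequentialIdGenerator implements IdGenerator {
    private final String prefix;
    private final AtomicLong counter = new AtomicLong();

    SequentialIdGenerator(String nodeId) {
        this.prefix = nodeId + "-" + Long.toHexString(System.currentTimeMillis()) + "-";
    }

    @Override
    public String nextId() {
        // Only a single atomic increment is needed per id.
        return prefix + Long.toHexString(counter.incrementAndGet());
    }
}
```

Because both strategies return a String, the rest of the system only needs to compare and store Strings, whichever generator is plugged in.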

Cheers,

Allard

Jimit_Ndiaye · September 14, 2011, 9:32am

Were the performance gains from storage/retrieval or purely from generation? If the latter, may I ask which generation strategy you used?

Allard · September 14, 2011, 1:53pm

It’s the latter. I did a PoC with Johann Burkard’s implementation: http://johannburkard.de/software/uuid/. That one uses the sequential UUID version, which is drastically faster. In that PoC, I process 250K commands per second, each generating 1 or 2 events, which are also stored to a disk-based event store.

Jimit_Ndiaye · September 14, 2011, 3:08pm

Interesting. I’ll have to do a PoC to compare that algorithm with Jimmy Nilson’s guid.comb algorithm: http://www.informit.com/articles/article.aspx?p=25862. Johann’s should be faster, since the latter first generates a regular, random GUID and then rearranges the bytes to make it sequential, whereas Johann’s builds one from scratch, based on the MAC address I presume.
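For reference, the "generate a random GUID, then overwrite some bytes" approach could look roughly like this in Java. This is a simplified sketch and not the exact algorithm from the linked article: it assumes the low 48 bits are replaced with epoch milliseconds, and the class and method names are made up for illustration.

```java
import java.util.UUID;

class CombUuid {
    private static final long LOW_48_BITS = 0x0000FFFFFFFFFFFFL;

    // Starts from a random UUID and overwrites its low 48 bits with a timestamp.
    // The version bits (in the most significant half) and the variant bits
    // (top of the least significant half) are left untouched.
    static UUID create(long epochMillis) {
        UUID random = UUID.randomUUID();
        long lsb = (random.getLeastSignificantBits() & ~LOW_48_BITS)
                 | (epochMillis & LOW_48_BITS);
        return new UUID(random.getMostSignificantBits(), lsb);
    }

    // Reads the embedded timestamp back out of the low 48 bits.
    static long timestampOf(UUID comb) {
        return comb.getLeastSignificantBits() & LOW_48_BITS;
    }

    public static void main(String[] args) {
        UUID id = create(System.currentTimeMillis());
        System.out.println(id + " created at " + timestampOf(id));
    }
}
```

Note that every id still pays the full cost of a random UUID before the timestamp is applied, which is the reason given above for expecting it to be slower.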

Jimit_Ndiaye · September 16, 2011, 12:03pm

After some investigation it seems Johann’s UUID algorithm is indeed faster (for a sufficiently large sample size) though it will be initially slower due to the initial scanning for MAC address in the static constructor of UUIDGen. But that’s a one time call and if you eliminate that it is indeed way faster than randomly generated UUIDs. And they have the advantage of being sequential.
However I found that the above was true only when sequentially generating UUIDs (single thread). When generating the same number of UUIDs distributing the work to multiple threads, I found random UUIDs to be faster for some reason. Did you encounter anything similar?

Allard · September 16, 2011, 1:42pm

Hi Jimit,

I didn’t do a pure UUID-generation benchmark. In my benchmark, UUID was just one of the steps taken. Most of the ID’s were generated by a single thread.
It does make sense, though, because sequential ID’s require coordination between cores. If only a single core generates the id’s, the path prediction algorithms will make sure everything runs fast. Random UUID generation probably requires less coordination (maybe even none at all).
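A rough way to compare the single-threaded and multi-threaded cases is a small timing harness like the one below. It is only a sketch: the class name is hypothetical, the counter-based generator is a stand-in for a sequential strategy, and there is no JIT warm-up or statistical treatment, so the numbers are indicative at best.

```java
import java.util.UUID;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

class IdGenerationTimer {
    // Runs `threads` workers that each call the generator `perThread` times
    // and returns the wall-clock time in nanoseconds until all have finished.
    static long timeNanos(Supplier<String> generator, int threads, int perThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(threads);
        AtomicLong sink = new AtomicLong(); // consumes results so the work is not optimized away
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    start.await(); // all workers begin together
                    long local = 0;
                    for (int i = 0; i < perThread; i++) {
                        local += generator.get().length();
                    }
                    sink.addAndGet(local);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }
        try {
            long begin = System.nanoTime();
            start.countDown();
            done.await();
            return System.nanoTime() - begin;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while timing", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        AtomicLong counter = new AtomicLong();
        long random = timeNanos(() -> UUID.randomUUID().toString(), 4, 100_000);
        long counted = timeNanos(() -> Long.toHexString(counter.incrementAndGet()), 4, 100_000);
        System.out.printf("random: %d ms, counter-based: %d ms%n",
                random / 1_000_000, counted / 1_000_000);
    }
}
```

Running it with `threads` set to 1 and then to the number of cores gives the four scenarios discussed in this thread.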

Cheers,

Allard

Jimit_Ndiaye · September 18, 2011, 5:53pm

Actually, after removing extraneous influences from the benchmark, parallel generation of sequential UUIDs was the fastest of the four scenarios, with parallel generation of random UUIDs being the slowest.

Nils_Kilden-Pedersen · November 7, 2011, 4:03pm

It makes sense to move away from UUID for the reasons you describe, but why to a String then? Why not simply to an Object?

Allard · November 7, 2011, 8:20pm

Hi,

I considered allowing any object as identifier at the time. There is a problem, though. Axon, as well as some backing technologies, relies on the ability to use sorting, hashing etc. on the identifiers. A String can do all of that. Besides, I am convinced that any sane identifier can be uniquely and consistently represented as a String.

In the end, you can still use any object you like as an identifier. Axon will use the asString() method to perform quick lookups in e.g. the Event Store. But the AggregateIdentifier you provide will be used to construct it.

If you don’t share this opinion, let me know. I’m open to suggestions.
Cheers,

Allard

Nils_Kilden-Pedersen · November 7, 2011, 9:46pm

I don’t have any performance numbers on string creation and comparison, but was merely concerned that it wasn’t necessarily the fastest possible way to do lookup, e.g. a custom object comparing on one or more primitives would be faster.

Chad_Wilson · November 7, 2011, 11:20pm

How about a simple <T> on the classes/interfaces that need to identify
the ID value? That way the application developer can choose their own
implementation.

Also, I use JUG (http://wiki.fasterxml.com/JugHome), specifically I
use variant 1 (time + ethernet MAC) UUID's for events, and internal
objects, and clients use variant 4 (random) UUID's to communicate with
the server.

Chad_Wilson · November 7, 2011, 11:31pm

Also, with UUID generation, when using random id's you can get a
significant performance boost if you pre-generate a pool of ID's per
server (that way you aren't blocking for your PRNG/RNG). This can
also be important to do with the time based variant if you find that
you are generating more id's than the time window allows (I think it's
100 per microsecond, but don't quote me on that!)
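A pre-generated pool can be sketched with a bounded queue that a background thread keeps topped up. The class name and the fallback behaviour when the pool runs dry are assumptions made for this example, not something taken from JUG.

```java
import java.util.UUID;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class PooledUuidSource implements AutoCloseable {
    private final BlockingQueue<UUID> pool;
    private final Thread filler;

    PooledUuidSource(int capacity) {
        pool = new ArrayBlockingQueue<>(capacity);
        filler = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    pool.put(UUID.randomUUID()); // blocks while the pool is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop filling on close()
            }
        }, "uuid-pool-filler");
        filler.setDaemon(true);
        filler.start();
    }

    // Takes a pre-generated id without blocking; if the pool is momentarily
    // empty, falls back to generating one inline.
    UUID next() {
        UUID id = pool.poll();
        return id != null ? id : UUID.randomUUID();
    }

    @Override
    public void close() {
        filler.interrupt();
    }
}
```

Callers on the hot path only touch the queue, so the cost of the random number generator is paid by the filler thread instead.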

Allard · November 8, 2011, 2:17pm

Chad, you said:

How about a simple <T> on the classes/interfaces that need to identify
the ID value? That way the application developer can choose their own
implementation.

Can you give an example of what you mean here?

The current solution allows you to choose any identifier that you like. Whether it’s a random UUID, a sequential UUID or a combination of 3 integers, Axon doesn’t care. However, many subsystems rely on the ability to do lookups based on that identifier. Therefore, I have chosen to force all identifiers to be representable as a String. This String is only used by e.g. the Event Store to find the correct entries and return them. For the rest, your own identifier implementation is used.
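As an illustration of an identifier that is a rich object but still representable as a String, consider the sketch below. The `StringIdentifier` interface is a hypothetical stand-in for the contract described here (the real Axon interface may differ), and the store is a toy that keys purely on the String form.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the identifier contract; the real Axon interface may differ.
interface StringIdentifier {
    String asString();
}

// An identifier made of three integers.
final class TripleId implements StringIdentifier {
    private final int region;
    private final int shard;
    private final int sequence;

    TripleId(int region, int shard, int sequence) {
        this.region = region;
        this.shard = shard;
        this.sequence = sequence;
    }

    @Override
    public String asString() {
        // The separator keeps (1, 23, 4) distinct from (12, 3, 4).
        return region + ":" + shard + ":" + sequence;
    }
}

// Toy store that performs lookups on the String form only.
class StringKeyedStore<V> {
    private final Map<String, V> entries = new HashMap<>();

    void put(StringIdentifier id, V value) {
        entries.put(id.asString(), value);
    }

    V get(StringIdentifier id) {
        return entries.get(id.asString());
    }
}
```

The store never needs to know how `TripleId` is built; two instances with the same three values resolve to the same entry because their String forms match.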

While Axon defaults to a java.util.UUID.randomUUID() based identifier, it is easy to plug in your own IdentifierGenerator. The sequential approach is tremendously faster than the random version. Again, Axon doesn’t care where the identifier comes from, or what it looks like.

Cheers,

Allard

Nils_Kilden-Pedersen · November 8, 2011, 2:29pm

So, you don’t use a generic Object, because you don’t trust equals/hashCode being correctly implemented, is that correct?

Allard · November 8, 2011, 3:05pm

Not quite. I don’t use a generic Object because many systems don’t work with them. The JPA Event Store for example, will have a hard time finding events based on a generic object which needs to be available in the aggregateId column.
At the time (even when the identifier was a hard-coded UUID), the Event Store used a String to store the identifier.

Note that I am only mentioning why the identifier needed to be representable as a String at the time. If anyone finds a way to get rid of this, please let me know.

Cheers,

Allard

Chad_Wilson · November 8, 2011, 6:29pm

Instead of having a concrete (or interface) class that defines an
identifier (AggregateIdentifier), simply use generics on all the
classes that need to have the ID class on their API. An example would
be AggregateRoot<I> and the getIdentifier method would be defined as:

I getIdentifier();

For the repositories, instead of being defined as Repository<T extends
AggregateRoot>,

you'd have it be Repository<T extends AggregateRoot, I>

and the same would apply for the methods as above (use the generic
type instead of the current AggregateIdentifier).
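Spelled out, the generics proposal might look like the sketch below. These signatures are illustrative only and are not Axon's actual interfaces; the sketch also binds the aggregate's identifier type to the repository's `I`, which goes slightly beyond the raw `AggregateRoot` bound written above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Illustrative signatures only; not Axon's actual interfaces.
interface AggregateRoot<I> {
    I getIdentifier();
}

interface Repository<T extends AggregateRoot<I>, I> {
    void add(T aggregate);

    T load(I identifier);
}

// Minimal in-memory implementation; relies on I having sound equals/hashCode.
class InMemoryRepository<T extends AggregateRoot<I>, I> implements Repository<T, I> {
    private final Map<I, T> aggregates = new HashMap<>();

    @Override
    public void add(T aggregate) {
        aggregates.put(aggregate.getIdentifier(), aggregate);
    }

    @Override
    public T load(I identifier) {
        return aggregates.get(identifier);
    }
}

// Example aggregate that chooses UUID as its identifier type.
class Order implements AggregateRoot<UUID> {
    private final UUID id;

    Order(UUID id) {
        this.id = id;
    }

    @Override
    public UUID getIdentifier() {
        return id;
    }
}
```

With this shape, `Repository<Order, UUID>` accepts only UUIDs in `load`, and no framework-specific identifier class appears in the application's API.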

Since we already have a toString() method on Object and you're keying
off of String values for lookups, just use the toString() method call,
and the application developers can make their own wrapper classes if
the standard object toString() method does not work for their
purposes. This would work for UUID values (most JPA implementations
do not allow for UUID typed column values anyway).

In regards to JPA, I'm not quite following what the impedance mismatch
is. Can you provide an example? You can always have the JPA
implementation of the event store just call toString on whatever
identifier that the app dev is using, and have your internal ORM
object use String for the column mapping.

Chad

Allard · November 8, 2011, 8:11pm

The generics part makes sense. I was under the impression that you meant it would solve the ID formatting issue. Haven’t seen generics do that yet.

At the time, I chose to make my expectations of the String representation explicit by enforcing an asString() method. This way, a developer could not forget to implement it.
The downside of allowing “an object” (which is the case, even if you use generics), is that there is no way to enforce the availability of a “good” toString() method.
On the other hand, I am charmed by the decoupling of app-specific code from Axon. No more Axon-specific classes needed in the public API components.

Just thinking out loud: some components (JPA Event Store is an example) will work a lot better if it can search events based on a single column type, such as String. That means the JPA Event Store will need to use toString() to convert the identifier to a String. It can do a check to see whether the toString() method from the identifier is “declared” on Object. If so, it has not been overridden, and is most likely unsuitable as an identifier to use with the JpaEventStore.
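The check described here can be done with plain reflection. A minimal sketch, with hypothetical class names:

```java
import java.util.UUID;

class ToStringCheck {
    // Returns true when the identifier's class, or one of its superclasses
    // other than Object, declares its own toString().
    static boolean overridesToString(Object identifier) {
        try {
            return identifier.getClass()
                    .getMethod("toString")
                    .getDeclaringClass() != Object.class;
        } catch (NoSuchMethodException e) {
            // Cannot happen: every class inherits a public toString().
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(overridesToString(UUID.randomUUID()));
        System.out.println(overridesToString(new BareId(7)));
    }
}

// A custom identifier whose author forgot to override toString().
class BareId {
    final int value;

    BareId(int value) {
        this.value = value;
    }
}
```

Core classes such as UUID, Long and String declare their own toString(), so they pass the check without special handling; only types that inherit `Object.toString()` are rejected. The check cannot tell whether an overridden toString() is actually unique and stable.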

Chad, you said you had created a JDBC event store. What do you use to search for relevant events? In other words, how do you map the identifier object to query parameters?

Cheers,

Allard

Chad_Wilson · November 9, 2011, 2:39am

In terms of the dev forgetting to implement toString(), I think that
more often than not, ID's will be either UUID's or longs, both of
which will be properly implemented. I hear what you're saying there,
but there's never a guarantee on a well formatted string if they're
implementing it themselves either. Also on your idea to see if it's
overridden, how can you tell the difference between a core class (UUID,
Long, Integer, etc.) and a custom class? Personally I think that's a
bit too much automagic as I don't think the instances of application
developers using custom types will be higher than 5%.

On the JDBC Event Store, it's currently hard coded to my
implementations, so that's the first reason why I haven't sent anything
to github yet. The other reason is that it's currently a thin wrapper
around Spring's JdbcTemplate.

For ID's I'm using UUID's and because my underlying storage mechanism
is PostgreSQL, its JDBC driver will take UUID objects directly (it
also stores them as byte arrays instead of Strings like most other
db's so performance win there, sadly it still calls toString() in the
driver). In regards to searches, that's also a little bit tricky
because my event store does not have a concept of sequence like Axon
does. My columns are String event, Long when, String type, UUID
aggregateId, UUID eventId (these correspond to text, bigint, varchar,
uuid, uuid in PostgreSQL). I have a simple RowMapper class that maps
the event values to the columns.

For lookups, I only have two queries, one that gets all for an
aggregate, and another that gets all for an aggregate up to a certain
value of when. For my application, I need the ability to Point in
Time grab any aggregate, and as you probably noticed, I use a 64 bit
integer value for my dates to circumvent all those nasty time
translating Java date class issues with JDBC, nor do I want my data
store doing temporal arithmetic.
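The two lookups can be illustrated without a database. The sketch below is an in-memory stand-in, not the JDBC implementation described here: the field names follow the columns listed above, while the class names and the inclusive "up to" bound are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;

// One row, mirroring the columns: event, when, type, aggregateId, eventId.
final class StoredEvent {
    final String event;
    final long when; // 64-bit timestamp instead of a Java date type
    final String type;
    final UUID aggregateId;
    final UUID eventId;

    StoredEvent(String event, long when, String type, UUID aggregateId, UUID eventId) {
        this.event = event;
        this.when = when;
        this.type = type;
        this.aggregateId = aggregateId;
        this.eventId = eventId;
    }
}

class InMemoryEventLog {
    private final List<StoredEvent> rows = new ArrayList<>();

    void append(StoredEvent row) {
        rows.add(row);
    }

    // Query 1: every event for an aggregate, oldest first.
    List<StoredEvent> findAll(UUID aggregateId) {
        return findUpTo(aggregateId, Long.MAX_VALUE);
    }

    // Query 2: events for an aggregate up to and including a point in time.
    List<StoredEvent> findUpTo(UUID aggregateId, long when) {
        List<StoredEvent> result = new ArrayList<>();
        for (StoredEvent row : rows) {
            if (row.aggregateId.equals(aggregateId) && row.when <= when) {
                result.add(row);
            }
        }
        result.sort(Comparator.comparingLong((StoredEvent r) -> r.when));
        return result;
    }
}
```

Because `when` is a plain long, the point-in-time filter is a simple numeric comparison, which is the property the 64-bit date column is meant to provide.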

...

In thinking about this a little more, I'd like to frame the discussion
another way. Do we need to provide users the ability to have custom
ID values on a standard Event Store (e.g. a JDBC event store that can
store many types of ID)? If that's the case, I think that will be a
fairly troublesome, monumental undertaking. My personal take is
provide users with a few standard options (UUID, Integer, Long) and
have them implement a solution for any other custom ID's.

There is also the issue of the underlying database with the JDBC
solution as well. Even with JPA, the automatic table generation
facilities always screw up the column types for PostgreSQL. I'm a
firm believer in having the database be generated by script for the
specific database in question, and the UUID on PostgreSQL is a good
example of why (JPA creates a varchar entry for UUID's).

Thoughts?

Chad

Allard · November 9, 2011, 8:03am

Chad, you said:

In thinking about this a little more, I’d like to frame the discussion
another way. Do we need to provide users the ability to have custom
ID values on a standard Event Store (e.g. a JDBC event store that can
store many types of ID)? If that’s the case, I think that will be a
fairly troublesome, monumental undertaking. My personal take is
provide users with a few standard options (UUID, Integer, Long) and
have them implement a solution for any other custom ID’s.

There is also the issue of the underlying database with the JDBC
solution as well. Even with JPA, the automatic table generation
facilities always screw up the column types for PostgreSQL. I’m a
firm believer in having the database be generated by script for the
specific database in question, and the UUID on PostgreSQL is a good
example of why (JPA creates a varchar entry for UUID’s).

Thoughts?

The JPA Event Store as I have currently implemented it already allows for quite a bit of customization. The default implementation will just rely on calling toString() on the identifier to perform the lookup for relevant events. It’s up to the app developer to either use an ID type with a suitable toString() method, such as UUID, String, Number, etc., or make sure that the event store is customized to be able to use the custom identifier.

This problem only exists for event stores. When using relational storage for aggregates, any valid JPA primary key can be used to perform the lookup. That’s because the table format is defined per aggregate type.

I’m just going to take a go at this, and see where it lands. We’ll see what trouble we come across.

Cheers,

Allard