Migrating event table from 2.x to 3.x (PostgreSQL)

Axon 3 adds a new autoincrement primary key (globalindex) to the DomainEventEntry table and removes the aggregate type column from the unique index that serves as the primary key under Axon 2. That presents a bit of a migration challenge for nontrivial data sets in applications that have to minimize downtime.

Anyone have any migration tools/scripts to share? Here's the SQL script (with a tiny bit of pseudocode) I came up with. I'd run all of this while the application is still on Axon 2, which will simply ignore the new column, so that the migration stays out of the critical path for deploying the Axon 3 version of the code. The goal is to migrate while the Axon 2 app is live, even if that takes longer than it would with the app offline; it could be done arbitrarily far in advance of rolling out Axon 3. Critiques/comments appreciated!

-Steve

-- Add the column in nullable form first.
ALTER TABLE domainevententry ADD COLUMN globalindex BIGINT;

-- Update the existing rows.
WITH time_ordered AS (
    SELECT eventidentifier, ROW_NUMBER() OVER (ORDER BY timestamp, sequencenumber) AS seq
    FROM domainevententry
)
UPDATE domainevententry
SET globalindex = time_ordered.seq
FROM time_ordered
WHERE domainevententry.eventidentifier = time_ordered.eventidentifier;

-- Index the column. Newly-inserted rows are null at this point, but that's okay.
-- We'll be renaming this index later.
CREATE UNIQUE INDEX CONCURRENTLY tmp_idx ON domainevententry (globalindex);

-- The above two steps probably took a long time; update newly-inserted rows.
WITH time_ordered AS (
    SELECT eventidentifier, ROW_NUMBER() OVER (ORDER BY timestamp, sequencenumber) AS seq
    FROM domainevententry
)
UPDATE domainevententry
SET globalindex = time_ordered.seq
FROM time_ordered
WHERE domainevententry.eventidentifier = time_ordered.eventidentifier
AND globalindex IS NULL;

-- Create the sequence, leaving a small gap above the current maximum.
SELECT MAX(globalindex) FROM domainevententry;

-- The value 50 here is kind of arbitrary; it just needs to be greater than
-- the number of events that could possibly be inserted between the SELECT
-- above and the ALTER COLUMN below, but smaller than the largest gap the
-- tracking event processor will skip over.
-- (Pseudocode: substitute the MAX value from the SELECT above.)
CREATE SEQUENCE domainevententry_globalindex_seq
    START WITH <max + 50>
    OWNED BY domainevententry.globalindex;
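If you'd rather not substitute the value by hand, something like this DO block should do the same thing in one shot (same table/column/sequence names as above; the 50-row gap is the one discussed in the comment):

DO $$
DECLARE
    next_start BIGINT;
BEGIN
    -- Leave the same 50-row safety gap above the current maximum.
    SELECT COALESCE(MAX(globalindex), 0) + 50 INTO next_start
    FROM domainevententry;
    -- CREATE SEQUENCE can't take a subquery as START WITH, hence the EXECUTE.
    EXECUTE format(
        'CREATE SEQUENCE domainevententry_globalindex_seq'
        || ' START WITH %s OWNED BY domainevententry.globalindex',
        next_start);
END
$$;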

-- Start assigning numbers to newly-inserted rows.
ALTER TABLE domainevententry
    ALTER COLUMN globalindex SET DEFAULT NEXTVAL('domainevententry_globalindex_seq');

-- Fill in the numbers for any rows that were inserted while creating the sequence
-- (same backfill query as above).
WITH time_ordered AS (
    SELECT eventidentifier, ROW_NUMBER() OVER (ORDER BY timestamp, sequencenumber) AS seq
    FROM domainevententry
)
UPDATE domainevententry
SET globalindex = time_ordered.seq
FROM time_ordered
WHERE domainevententry.eventidentifier = time_ordered.eventidentifier
AND globalindex IS NULL;
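It's probably worth verifying that the backfill really did catch everything before going on:

-- Should return 0; if not, re-run the backfill above.
SELECT COUNT(*) FROM domainevententry WHERE globalindex IS NULL;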

-- Now there should be no null values, so we can make the column NOT NULL.
ALTER TABLE domainevententry
    ALTER COLUMN globalindex SET NOT NULL;

-- At this point the primary key is still on (aggregateidentifier, sequencenumber, type),
-- whereas on Axon 3 the PK is on globalindex and there is a unique constraint on
-- (aggregateidentifier, sequencenumber). First create the Axon 3-style unique index.
CREATE UNIQUE INDEX CONCURRENTLY domainevententry_aggregateidentifier_sequencenumber_key
    ON domainevententry (aggregateidentifier, sequencenumber);

-- Make it a constraint to keep the table metadata tidy.
ALTER TABLE domainevententry
    ADD UNIQUE USING INDEX domainevententry_aggregateidentifier_sequencenumber_key;

-- Drop the primary key and its associated index. The unique constraint we just added
-- will keep Axon 2 sufficiently fast.
ALTER TABLE domainevententry
    DROP CONSTRAINT domainevententry_pkey CASCADE;

-- Rename the new index so it follows the PostgreSQL index naming convention.
ALTER INDEX tmp_idx RENAME TO domainevententry_pkey;

-- Promote the globalindex index to primary key.
ALTER TABLE domainevententry
    ADD PRIMARY KEY USING INDEX domainevententry_pkey;

Hi Steven,

in one of the Axon 2-based projects that I was involved in, we added an auto-increment primary key for performance reasons. Apparently it made the indexes smaller, because secondary indexes refer to the primary key, which in turn had been a combination of fields. That wasn't on a Postgres database, but it might increase performance a bit.

Just wondering: instead of the arbitrary "jump" of 50, couldn't you take a table lock, count the rows that are still empty, and use that count as the "jump"? It would cause a very short stop-the-world, which your application might or might not notice. At least you'd be able to generate a gapless sequence, and more importantly, you don't run the risk of jumping short. You could even do that at the very beginning of your migration.
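Something like this, perhaps (untested; it would replace your SELECT MAX / CREATE SEQUENCE / second-backfill steps, reusing your names):

BEGIN;
-- Stop-the-world: blocks readers and writers until COMMIT, so keep it short.
LOCK TABLE domainevententry IN ACCESS EXCLUSIVE MODE;

-- Backfill whatever is still null; nothing new can be inserted meanwhile.
WITH time_ordered AS (
    SELECT eventidentifier, ROW_NUMBER() OVER (ORDER BY timestamp, sequencenumber) AS seq
    FROM domainevententry
)
UPDATE domainevententry
SET globalindex = time_ordered.seq
FROM time_ordered
WHERE domainevententry.eventidentifier = time_ordered.eventidentifier
AND globalindex IS NULL;

-- Continue numbering directly after the now-gapless maximum.
CREATE SEQUENCE domainevententry_globalindex_seq
    OWNED BY domainevententry.globalindex;
SELECT setval('domainevententry_globalindex_seq',
              (SELECT MAX(globalindex) FROM domainevententry));

ALTER TABLE domainevententry
    ALTER COLUMN globalindex SET DEFAULT NEXTVAL('domainevententry_globalindex_seq');
COMMIT;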

Also, I am wondering how Postgres deals with null values in unique columns. MySQL will only allow a single entry to be null in this case.

Dropping the aggregate type from the index has been subject to a lot of debate in the Axon 3 design phase. In the end, we decided that an aggregate identifier should be unique within the context that the application runs in. If more fine-grained scoping is required, then that must be part of the identifier. For example: ORDER-<uuid_here>. We have left the column in the DomainEventEntry table, as it does help when debugging. Axon doesn’t use that column anymore operationally, however.

Hope this helps.
Cheers,

Allard

Thanks for the quick response! I’m not a Postgres guru but I think I know the answers to a few of the questions you raise. If anyone knows differently, please correct me.

Integer primary keys: A single-column unique index is faster than a multi-column index or a single-column index on a text column, for any query where the planner decides to use that particular index. But the query planner doesn't treat primary keys as anything special; a primary key is just one of the available indexes it can choose from, so for any query whose criteria don't include the primary key column, it is completely irrelevant to read performance. As far as I know, PostgreSQL doesn't refer to the primary key from other indexes; it uses an internally-generated row identifier.
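(A quick way to sanity-check this on real data: EXPLAIN shows which index, if any, the planner picks for a given query. The ID below is just a placeholder.)

EXPLAIN
SELECT *
FROM domainevententry
WHERE aggregateidentifier = 'some-aggregate-id'
ORDER BY sequencenumber;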

Null values: A unique index allows multiple null values in PostgreSQL, because the SQL standard says null isn't equal to anything, including another null. MySQL's behavior here is arguably wrong.

db=# create table null_test (val integer, unique (val));
CREATE TABLE
db=# insert into null_test values (1);
INSERT 0 1
db=# insert into null_test values (2);
INSERT 0 1
db=# insert into null_test values (null);
INSERT 0 1
db=# insert into null_test values (null);
INSERT 0 1

Locking the table to fill in the last few missing IDs is a possibility and should be quick enough to not cause application problems. Good idea; it’s preferable to have no gaps in the sequence if possible. Thanks.

I would have come down on the side of keeping the type as part of the aggregate's identity, just because from an application-semantics point of view I don't immediately see what new functionality dropping it enables, and it forces the application to preprocess IDs to prevent collisions. I expect I'm simply failing to think of the upside, though; what benefit do non-type-scoped IDs bring?

Unscoped IDs present a couple of Axon 2 migration issues that I haven't fully thought through how to solve. First of all, we have some duplicate IDs in our database, since there was no reason to avoid them in Axon 2. At the very least, that means updating those specific IDs to disambiguate them.
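Finding the offenders is easy enough, at least; against the Axon 2 schema, something like this should list the IDs that are shared across aggregate types:

SELECT aggregateidentifier
FROM domainevententry
GROUP BY aggregateidentifier
HAVING COUNT(DISTINCT type) > 1;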

Adding type tags to the duplicates, or to all the existing aggregates, is one option, but that means there will be two ID formats in play for some amount of time. Should the TargetAggregateIdentifier value in a command include the type tag or not? Depending on whether the aggregate is pre- or post-migration, either choice could be wrong. My guess is that I’m going to have to add logic to try the tagged ID first and retry without the tag if that fails. Updating all the IDs to include type tags would take long enough that the application would have to be able to cope with a database with a mix of ID formats, at least temporarily. But like I said, I haven’t fully thought that through yet.

-Steve

Hi Steven,

to answer this question (and in my best attempt to help you further), let's separate a few concepts first.

The @TargetAggregateIdentifier annotated field must contain the identifier of the aggregate, as it is provided to the Repository. An EventSourcingRepository uses that identifier to load an EventStream from the EventStore.
The “problem” of duplicate identifiers is one in the EventStore, not so much in the Repository, because the latter is already scoped per aggregate.

I basically see 2 approaches from here (apart from generating new, unique identifiers for the duplicates, which I assume is not feasible):

  • create a subclass of the Repository that basically adds a prefix or suffix to identifiers, translating an AggregateIdentifier to the identifier used in the Event Stream. This would require you to override the doLoadWithLock(…) method and call super.doLoadWithLock(…) with the pre/suffixed identifier. You will also need to wrap the EventStore to add these pre/suffixes to the aggregate identifier as events are published.
  • add “aggregateType” to the unique index (roughly the SQL sketched after this list). This will allow multiple events with the same ID and sequence number to co-exist. It also means that when loading an event stream, you will receive messages that don't purely belong to the aggregate being queried. Subclassing the EventSourcingRepository and filtering out the events that have an unexpected aggregateType would be a way to read only the events you're actually looking for.
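In SQL terms, that second option just reinstates the Axon 2-style uniqueness on the Axon 3 table; roughly (the index name here is only a suggestion):

CREATE UNIQUE INDEX domainevententry_aggid_seqnr_type_key
    ON domainevententry (aggregateidentifier, sequencenumber, type);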

These solutions don’t excel in elegance, I admit. At the time we designed Axon 3, we noticed that all the parties we knew about were using (mostly) client-generated, unique UUIDs for aggregates.
Keep me posted on your approach.

Cheers,

Allard

Coming back to this after a while. One question: aside from the cost of calling String.equals() on each event, is there a reason not to make the standard EventSourcingRepository discard events with a mismatched aggregate type? It seems like a harmless sanity check in the expected no-colliding-IDs case, and obviously it's preferable not to have to maintain custom subclasses of Axon components. I'd be happy to submit a pull request for this if it'd be accepted.

The reason this is a problem for us is that our application has an external-facing batch API, and clients pass in a batch ID as well as an ID for each transaction in the batch. Transactions and batches are modeled as separate aggregates since they have distinct lifecycles. The aggregate ID is just the client-supplied ID with a prefix identifying which client it's from. A couple of customers use the ID of the first transaction in the batch as the batch ID, since that is perfectly legal in our API (and causes no problems in Axon 2). As a result, we have a number of batch and transaction aggregates with the same IDs.

Keeping the scoping by aggregate type seems like the cleaner solution to me, if only because it follows DRY; it'd be kind of ugly to have the aggregate type in its own column in the events table and then also need to include the type as part of the aggregate ID. We could change our API to forbid duplicate IDs, but that horse has kind of already left the barn, since we have existing requests in our database.

-Steve

Hi Steven,

I also commented on the PR, but for visibility, let me post here too. I like the idea of being able to filter based on aggregate type. However, adding this as new default behavior may break existing implementations if they have changed the aggregate name over time, something that could easily be done until now without any consequences.

Instead of making it the default, it would be nice to make it easier to configure. In the current situation, you'd have to create a subclass of the EventSourcingRepository and override the readEvents method to apply filters. I would suggest adding new methods to the builder that let you define a filter to apply to incoming events, or perhaps just a boolean to indicate whether to apply "strict rules" when reading aggregate events.

Cheers,

Allard