Cascading commands create gaps in Event Store

Lars_Karschen · October 16, 2018, 9:29am

Hi,

we are currently struggling to properly clean up after deleting a tenant in our architecture. As it is now, deleting a tenant creates a cascade of commands and events throughout our services, each of which handles specific aggregates.

Our original idea was straightforward: DeleteTenantCommand -> TenantDeletedEvent, which is picked up by the Bounded Contexts related to a tenant. But when we pick up a TenantDeletedEvent in the EventListener of a related Organisation service and send the corresponding DeleteOrganisationByTenantDeletionCommand, which results in an OrganisationDeletedByTenantDeletionEvent, we see gaps in our globalIndex. Since this is just the very top of the cascade, we are worried about gaps that would halt our TrackingProcessors when thousands of Asset aggregates are deleted depending on the Organisation, and so on.

Deleting a single Organisation in this context does not create a gap, nor does deleting a Tenant that has only one Organisation. The gap only appears when there is more than one Organisation.

Is there anything wrong with the approach? Should the process be wrapped in a Saga, and if so, how would that work?

Sincerely,
Lars Karschen

allardbz · October 26, 2018, 9:56am

Hi Lars,

gaps in themselves should not be a big problem, at least not when they’re temporary and don’t come in the thousands at the same time. I would definitely not redesign the process just to avoid these gaps.
Note that gaps occur because of a difference between insert-time (when sequence numbers are generated) and commit-time (when sequence numbers become visible to other processes). Built-for-purpose event stores do not have this “problem”. Do note that the EventStorageEngine implementations are built to work around these gaps and hide them as much as possible. There is a limit, though: gaps may be temporary or permanent, and the client is unable to distinguish between the two. Therefore, gaps are considered permanent under certain conditions (based on age, for example). You’d have to double check whether these settings are reliable enough for your setup.
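
To make the insert-time versus commit-time difference concrete, here is a minimal in-memory sketch. It does not use Axon or a real database; the class GapDemo and its methods are hypothetical and only model a sequence that hands out an index at insert time while the row becomes visible at commit time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical model, not Axon code: a sequence plus a set of visible rows.
public class GapDemo {
    private long nextIndex = 1;                               // advanced at insert time
    private final TreeSet<Long> committed = new TreeSet<>();  // rows visible to readers

    /** Insert inside an open transaction: the index is taken immediately. */
    public long insert() {
        return nextIndex++;
    }

    /** Commit makes the row with the given index visible to readers. */
    public void commit(long index) {
        committed.add(index);
    }

    /** Indexes below the highest visible index that are not (yet) visible. */
    public List<Long> gaps() {
        List<Long> result = new ArrayList<>();
        if (committed.isEmpty()) {
            return result;
        }
        for (long i = 1; i < committed.last(); i++) {
            if (!committed.contains(i)) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        GapDemo store = new GapDemo();
        long a = store.insert();          // transaction A takes index 1
        long b = store.insert();          // transaction B takes index 2
        store.commit(b);                  // B commits first
        System.out.println(store.gaps()); // prints [1]: a temporary gap
        store.commit(a);                  // A commits later
        System.out.println(store.gaps()); // prints []: the gap has closed
    }
}
```

If transaction A rolled back instead of committing, index 1 would never appear and the gap would be permanent. A reader that only sees the committed rows cannot tell these two cases apart, which is why age-based rules are used.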

Cheers,

Allard

Lars_Karschen · October 30, 2018, 9:09pm

Hi Allard,

I’m not entirely sure I explained the issue completely, as it will indeed create gaps on committing cascaded events to the (JDBC) event store. The order itself is consecutive, meaning the globalIndex has no gaps while the events are being added, but after insertion, the autoincrement value is higher than the globalIndex of the last inserted event. Also, I fail to see a pattern, other than that it seems to be -1 in the test cases in which we are able to reproduce the issue.

I tried to remove as many problem areas as possible: I reworked the flow to use a saga wrapping the whole child deletion process, with the saga, all other tracking processors and the CommandGateway using the same TransactionManager. I suspected the missing UnitOfWorkConnectionProvider and some specific Hikari config parameters, but nothing works. I also tried publishing the events directly instead of sending Commands, to the same effect. To me it looks as if there are transactions pending and being rolled back at the end of the multiple Commands handled that we are just not aware of, possibly caused by some sort of misconfiguration. Right now I’m at a loss, as there are no errors and trace logging does not show anything wrong.
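
The suspicion about rolled-back transactions can be illustrated with a small sketch. This is only a model of the hypothesis, not a diagnosis of the actual setup: the class AutoIncrementDemo is hypothetical and assumes an auto-increment counter that, as in many databases, is not reset when a transaction rolls back.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Axon or JDBC code.
public class AutoIncrementDemo {
    private long nextAutoIncrement = 1;
    private final List<Long> rows = new ArrayList<>(); // committed globalIndex values

    /** Inserts n rows in one transaction, then commits or rolls back. */
    public void transaction(int n, boolean commit) {
        List<Long> pending = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            pending.add(nextAutoIncrement++); // the value is consumed immediately
        }
        if (commit) {
            rows.addAll(pending);
        }
        // On rollback the rows vanish, but the counter keeps its new value.
    }

    public long nextAutoIncrement() {
        return nextAutoIncrement;
    }

    public long maxGlobalIndex() {
        return rows.isEmpty() ? 0 : rows.get(rows.size() - 1);
    }

    public static void main(String[] args) {
        AutoIncrementDemo table = new AutoIncrementDemo();
        table.transaction(3, true);   // three events committed: 1, 2, 3
        table.transaction(1, false);  // one insert rolled back: value 4 is lost
        System.out.println(table.maxGlobalIndex());    // prints 3
        System.out.println(table.nextAutoIncrement()); // prints 5, not 4
    }
}
```

In this model, a next autoincrement value that is more than one above the highest stored globalIndex means that at least one insert was started and never committed.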

Regards,
Lars

allardbz · November 16, 2018, 8:05am

Hi Lars,

despite the gaps, do you see all the events you would expect from the process, or is something rolled back? Is the globalIndex column using a sequence that is perhaps shared with another column?
Again, gaps themselves are not a problem, and nothing to worry about if they happen occasionally. The JDBC EventStorageEngine does have an issue when there are more consecutive gaps than the batch size.
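
One way to check whether a store is at risk of this is to look for the longest run of consecutive missing globalIndex values and compare it with the batch size. The sketch below is a hypothetical helper, not part of Axon. It assumes the globalIndex values have already been read from the event table in ascending order, and the batch size of 100 in the example is an assumed value, so substitute whatever your engine is configured with.

```java
import java.util.List;

// Hypothetical helper, not part of Axon.
public class GapRunCheck {

    /** Longest run of consecutive missing values in an ascending list of indexes. */
    public static long longestGapRun(List<Long> sortedIndexes) {
        long longest = 0;
        for (int i = 1; i < sortedIndexes.size(); i++) {
            long missing = sortedIndexes.get(i) - sortedIndexes.get(i - 1) - 1;
            longest = Math.max(longest, missing);
        }
        return longest;
    }

    /** True when some run of consecutive gaps is longer than the batch size. */
    public static boolean exceedsBatchSize(List<Long> sortedIndexes, int batchSize) {
        return longestGapRun(sortedIndexes) > batchSize;
    }

    public static void main(String[] args) {
        // Indexes 3..5 and 8..109 are missing; the longest run has 102 values.
        List<Long> indexes = List.of(1L, 2L, 6L, 7L, 110L);
        System.out.println(longestGapRun(indexes));         // prints 102
        System.out.println(exceedsBatchSize(indexes, 100)); // prints true
    }
}
```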

Allard

Lars_Karschen · January 2, 2019, 2:01pm

Hi Allard,

sorry for not replying for some time, but we had to set the problem aside for a while and I only just got around to digging a little more. It turned out we had some misconfiguration: we were using our own autoconfiguration for certain parts of the Axon Framework configuration, overriding and ignoring parts of the default axon-autoconfig. The services ran into problems because we had apparently instantiated two CommandBuses. Once I found that, I was able to fix the configuration, and everything is now behaving as intended. After testing a cascading delete of about 550 aggregates, the globalIndex column now shows the next autoincrement value immediately following the globalIndex of the last event.

Regards,
Lars