Deleting lots of aggregates

Hi
My system consists of multiple bounded contexts. Most BCs depend on a “customer” object, as it is a multi-tenant system. The aggregates in the different BCs always use a customerId value object to define the ownership of the objects. In some contexts there may be many thousands of aggregates for each customer. We are struggling to find the right approach to delete a customer with all its data from the system.

We are using Axon Server SE in the latest versions (4.5.x).

Our current approach is outlined as follows:
In the Customer BC the DeleteCustomerCommand fires a CustomerDeletedEvent. Other contexts handle the CustomerDeletedEvent and create Delete…Commands in their own context.

For example, the Equipment context has the following event handlers (beware, it’s Kotlin :-)

@EventHandler
fun handle(evt: CustomerDeletedEvent) {
    equipmentRepository.findByCustomerId(evt.customerId.id()).forEach {
        // use sendAndWait to avoid AXONIQ-4003 errors (too many commands);
        // note: the type parameter of sendAndWait is the expected result type, not the command type
        commandGateway.sendAndWait<Any?>(
            DeleteEquipmentCommand(EquipmentId(it.equipmentId), CustomerId(it.customerId))
        )
    }
}

@EventHandler
fun handle(evt: EquipmentDeletedEvent) {
    equipmentRepository.findById(evt.equipmentId.id()).ifPresent { entity: EquipmentProjection ->
        equipmentRepository.delete(entity)
    }
}

The CustomerDeletedEvent initiates a “storm” of DeleteEquipmentCommands, which are then handled by the EquipmentAggregate; the aggregate of course applies the EquipmentDeletedEvent, and so on.
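
The aggregate side looks roughly like this (a simplified sketch: creation handlers and most state are omitted, and the real event payload may carry more fields):

import org.axonframework.commandhandling.CommandHandler
import org.axonframework.eventsourcing.EventSourcingHandler
import org.axonframework.modelling.command.AggregateIdentifier
import org.axonframework.modelling.command.AggregateLifecycle
import org.axonframework.spring.stereotype.Aggregate

@Aggregate
class Equipment {

    @AggregateIdentifier
    private lateinit var equipmentId: EquipmentId

    @CommandHandler
    fun handle(cmd: DeleteEquipmentCommand) {
        // the deletion itself happens in the event sourcing handler below
        AggregateLifecycle.apply(EquipmentDeletedEvent(cmd.equipmentId))
    }

    @EventSourcingHandler
    fun on(evt: EquipmentDeletedEvent) {
        // mark the aggregate as deleted so it no longer accepts commands
        AggregateLifecycle.markDeleted()
    }
}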

This works fine so far, but we encountered weird errors when deleting customers with lots of objects in other contexts, and the approach simply does not feel stable. Axon Server seems to have problems with too many commands sent in a short time (which is why we tried sendAndWait() in the loop above).
The event handler with the loop above can take “a while” if there are tens of thousands of commands to be sent.
I am not sure, but it feels like it is not such a good idea to have event handlers that are busy for a long time.

Are there any recommended patterns for implementing this kind of operation in Axon Framework?
Would it be a better approach to use sagas to handle this mass deletion?

Thanks
Klaus

Hello again,
does nobody have any comments on my questions?

Klaus

It would be interesting to know what these “weird errors” actually are! That would help us guide you to a workable solution, Klaus.

To the problem at hand, though: there are many knobs and dials you can turn to optimize for big bursts of commands. As I am uncertain which to use (since the errors are unknown to me), I’ll give a list of things you can try out to optimize your setup.

But first, let me tackle your last two questions:

There is no one-size-fits-all solution to this when it comes to Axon. An answer that doesn’t really help is “it depends,” but it does fit the bill here. So I believe the approach is fine; we just need to make some adjustments to optimize it.

Sagas tend to be more complex to deal with, and they can generate additional overhead as they are essentially serialized and stored transactions. So my gut tells me Sagas are not the solution you want.

Now, on to the list I wanted to give you:

  1. Have you defined a different local segment? The local segment is the local part of the CommandBus that does the actual command handling. This local segment defaults to a SimpleCommandBus, which handles a single command at a time. Adjusting this to the AsynchronousCommandBus already increases throughput, as more threads will deal with commands. If that doesn’t cut it, you can try out the DisruptorCommandBus. (A configuration sketch follows after this list.)
  2. Axon Server has a set cache size for the commands it receives. That cache will fill up if the command handlers cannot handle the commands fast enough. You can change the command-cache-capacity property to raise the default capacity of 10,000 to a higher number. This page, by the way, shows the configuration options of Axon Server.
  3. Have you tried spinning up an additional command handling node? Adding more instances that can handle commands automatically parallelizes the load. Axon does this by using a consistent hashing algorithm to route commands consistently to command handling nodes. You can even make more performant nodes receive a bigger share of the load with the load factor setting.
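
To the first point, a minimal sketch of such a configuration (assuming Spring Boot auto-configuration, where the Axon Server connector wraps whatever CommandBus bean carries the “localSegment” qualifier; the class name is just an example):

import org.axonframework.commandhandling.AsynchronousCommandBus
import org.axonframework.commandhandling.CommandBus
import org.axonframework.common.transaction.TransactionManager
import org.springframework.beans.factory.annotation.Qualifier
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

@Configuration
class CommandBusConfiguration {

    // Replaces the default SimpleCommandBus local segment with an asynchronous one,
    // so several threads handle commands on each node.
    @Bean
    @Qualifier("localSegment")
    fun localSegment(transactionManager: TransactionManager): CommandBus =
        AsynchronousCommandBus.builder()
            .transactionManager(transactionManager)
            .build()
}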

Let’s see if any of them help. If not, you know where to ask further questions!

Hi Steven,
thanks for your answers so far.
Good advice not to go with sagas here. I will try the other CommandBus variants, as that sounds promising; we are currently using the SimpleCommandBus. I will also increase the command-cache-capacity.
Regarding scaling: currently we run 3 nodes without separating command and query handling, so it is (for now) a monolith serving both commands and projections. I have to do some testing to see whether horizontal scaling changes the behavior.

Regarding the “weird” errors:
I wrote “weird” errors because I could not fully describe or reproduce them, and I didn’t want to blow up the original post with too much other stuff. Let me try to explain:
If I perform such a customer delete as described in my initial post on a system without additional load, everything works fine and correctly. All commands and events are processed, thousands of aggregates get deleted (markAsDeleted), and the projections finally catch up and delete tens of thousands of rows in Postgres.
If I run such a delete task while other operations that also produce load are running on the system (e.g. an import job that creates and modifies lots of aggregates), I encounter “hiccups”:

  • In another posting I described one observation (TEP switching).
  • Projections get slow: our batch processes rely on queries (e.g. for validations) and therefore depend on the TEPs handling events and storing the results in the database. As we have not yet found a better way to determine when an event has finally been stored, we do stupid retries to check whether the data is there. While a customer delete is running, the TEPs get much slower. Unfortunately, it looks as if the TEP load is not distributed across multiple nodes; I have no proper metrics, but it does not feel faster with 3 nodes than with only 1 node (see the sketch below).
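
For what it’s worth, my current understanding is that TEP parallelism has to be configured explicitly with multiple segments and threads, roughly as in the sketch below (the processor name and counts are just examples); I have not verified yet whether this is the missing piece:

import org.axonframework.config.EventProcessingConfigurer
import org.axonframework.eventhandling.TrackingEventProcessorConfiguration
import org.springframework.beans.factory.annotation.Autowired
import org.springframework.context.annotation.Configuration

@Configuration
class ProcessorConfiguration {

    @Autowired
    fun configure(configurer: EventProcessingConfigurer) {
        // with the default single segment only one node actively processes the stream;
        // more segments allow the work to be spread over threads and nodes
        configurer.registerTrackingEventProcessorConfiguration("equipment-projection") {
            TrackingEventProcessorConfiguration
                .forParallelProcessing(4)   // threads per node
                .andInitialSegmentsCount(4) // segments that can be claimed by different nodes
        }
    }
}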

Maybe you have some more ideas?

Klaus