UOW blocking when processing a batch of nested commands

Hey there Axon Community,

We are currently dealing with a problem when processing a batch of nested commands.

We are developing an application based on Axon 4.4 that consists of several separate aggregate roots related to each other in 1:n relationships. Each relationship is expressed by storing the aggregate identifier of the single-sided aggregate root as a parent id in the multi-sided aggregate root.
We have deliberately chosen distinct aggregate roots over aggregate members for scalability reasons, mainly to avoid huge numbers of events being stored for a single aggregate instance of a hierarchical object graph (see the example below).

[Diagram: the Project → Image → Annotation aggregate hierarchy]
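To make the relationship concrete, here is a minimal sketch of the multi-sided aggregate root (the names ImageAggregate, imageId and projectId are illustrative, not our actual code):

import org.axonframework.modelling.command.AggregateIdentifier;
import org.axonframework.spring.stereotype.Aggregate;

@Aggregate
public class ImageAggregate {

    @AggregateIdentifier
    private String imageId;

    // identifier of the single-sided parent aggregate, kept as plain state
    private String projectId;

    protected ImageAggregate() {
        // required by Axon to instantiate the aggregate
    }

    // command and event sourcing handlers omitted
}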

There are some use cases where we need to propagate an action through the whole tree of connected aggregates. To support such operations we maintain a command model that stores the parent-child relationships. Propagating such an action is usually done in the following way (the example use case is granting access to a project and all of its subitems):

  1. publish ProjectGrantAccessCommand
  2. ProjectAggregate:handle(ProjectGrantAccessCommand)
    a. publish ProjectAccessGrantedEvent
    b. CommandModel:getDescendants
    c. for each descendant image root: publish ImageGrantAccessCommand

This pattern is repeated at the next level of the hierarchy for each of the descendants:

  1. ImageAggregate:handle(ImageGrantAccessCommand)
    a. publish ImageAccessGrantedEvent
    b. CommandModel:getDescendants
    c. for each descendant annotation root: publish AnnotationGrantAccessCommand

… and so forth for the further levels of the hierarchy (a sketch of the project-level handler follows below).
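For illustration, here is a minimal sketch of what the project-level handler (steps 2a-2c above) could look like. The command, event and CommandModel names follow the steps above; the userId payload field and the injection of CommandModel and CommandGateway as handler parameters (which works e.g. when they are Spring beans resolvable by Axon's parameter resolvers) are assumptions:

import org.axonframework.commandhandling.CommandHandler;
import org.axonframework.commandhandling.gateway.CommandGateway;
import org.axonframework.eventsourcing.EventSourcingHandler;
import org.axonframework.modelling.command.AggregateIdentifier;
import org.axonframework.spring.stereotype.Aggregate;

import static org.axonframework.modelling.command.AggregateLifecycle.apply;

@Aggregate
public class ProjectAggregate {

    @AggregateIdentifier
    private String projectId;

    @CommandHandler
    public void handle(ProjectGrantAccessCommand command,
                       CommandModel commandModel,
                       CommandGateway commandGateway) {
        // 2a. publish the event for this aggregate
        apply(new ProjectAccessGrantedEvent(projectId, command.getUserId()));
        // 2b./2c. look up the descendants and fan out the child commands
        commandModel.getDescendants(projectId).forEach(imageId ->
                commandGateway.send(new ImageGrantAccessCommand(imageId, command.getUserId())));
    }

    @EventSourcingHandler
    public void on(ProjectAccessGrantedEvent event) {
        // update access-related state here
    }
}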

With relationships of 1 Project → 80 Images/Project → 1000 Annotations/Image this produces ~80k commands. We are using Axon Server EE and the DisruptorCommandBus as the local segment of the distributed command bus. Since we experienced extremely slow processing of the ~80k commands (~8 hrs), we set up two Micrometer timers, one on the command handlers and one on the Unit of Work, like below:

import java.util.Optional;

import io.micrometer.core.instrument.LongTaskTimer;
import io.micrometer.core.instrument.MeterRegistry;

import org.axonframework.commandhandling.CommandMessage;
import org.axonframework.messaging.InterceptorChain;
import org.axonframework.messaging.MessageHandlerInterceptor;
import org.axonframework.messaging.unitofwork.UnitOfWork;

public class CommandTimingInterceptor implements MessageHandlerInterceptor<CommandMessage<?>> {

    private final MeterRegistry registry;

    public CommandTimingInterceptor(MeterRegistry registry) {
        this.registry = registry;
    }

    @Override
    public Object handle(UnitOfWork<? extends CommandMessage<?>> unitOfWork, InterceptorChain interceptorChain) throws Exception {
        // times the full Unit of Work: the sample is only stopped in the cleanup phase
        final Optional<LongTaskTimer.Sample> uowSample =
            buildUowTimer(unitOfWork.getMessage().getCommandName()).map(LongTaskTimer::start);
        unitOfWork.onCleanup(uow -> uowSample.ifPresent(this::stopTimer));
        // times only the command handler invocation itself
        final Optional<LongTaskTimer.Sample> cmdSample =
            buildCommandTimer(unitOfWork.getMessage().getCommandName()).map(LongTaskTimer::start);
        try {
            return interceptorChain.proceed();
        } finally {
            cmdSample.ifPresent(this::stopTimer);
        }
    }

   // ... 
}

and the pattern we see is the following:

  • the commands finish really fast (in under a second, not visible in the graph), but the related Unit of Work (red line for the ProjectGrantAccessCommand) seems to block until all of the child commands have been processed
  • the pattern repeats for the ImageGrantAccessCommand and its Unit of Work (blue lines / spikes)

Since we are using the async CommandGateway::send, which returns immediately and lets the command handler finish, I'm a bit puzzled about what may be blocking the Unit of Work until its cleanup phase (where we stop the Micrometer timer). It even looks as if all of the subsequent commands somehow “block” the Unit of Work of the ProjectGrantAccessCommand handling.
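For reference, the dispatching inside the handlers boils down to something like this (identifiers are placeholders):

import java.util.concurrent.CompletableFuture;

import org.axonframework.commandhandling.gateway.CommandGateway;

class ChildCommandDispatch {

    // CommandGateway::send returns a CompletableFuture immediately, so the
    // dispatching handler method can return long before the child command
    // has actually been handled.
    CompletableFuture<Object> dispatch(CommandGateway commandGateway,
                                       String imageId, String userId) {
        return commandGateway.send(new ImageGrantAccessCommand(imageId, userId));
    }
}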

I started some serious debugging of the framework to better understand the Unit of Work handling, but it's quite hard to track down. Maybe someone here has an idea or has experienced similar behaviour - I appreciate every hint on where to look.

Best Regards,
Jakob


Some more investigation showed that we actually have two different problems here that slow down the command processing:

a) We have a subscribing event processor that operates on the …AccessGranted events on each level, and this processor has a performance problem of its own that is not related to the Axon Framework. Since the event handlers of a subscribing processor are executed in the same thread as the command handler (https://docs.axoniq.io/reference-guide/v/4.4/axon-framework/events/event-processors#event-processors), this slow handler directly delays the command processing.
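For anyone running into the same issue: one way to take such a slow handler off the command handling thread (assuming Spring Boot auto-configuration; the processing group name is made up) is to register its processing group as a tracking processor:

import org.axonframework.config.EventProcessingConfigurer;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Configuration;

@Configuration
public class EventProcessingConfig {

    @Autowired
    public void configure(EventProcessingConfigurer configurer) {
        // run the handlers of this group on the processor's own thread(s)
        // instead of the thread that publishes the events
        configurer.registerTrackingEventProcessor("access-projection");
    }
}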

b) The assumption that using CommandGateway::send to send nested commands inside a command handler lets the Unit of Work of the current command handler end on its own turned out to be wrong. Every time a command is sent from within an existing Unit of Work, the cleanup phase of the new command's Unit of Work is attached to the parent Unit of Work, and (if the child command is routed to the same application instance, which in our case with a single app instance is always true) the parent Unit of Work only finishes when all child Units of Work / commands have finished. This nesting of Units of Work also keeps the parent aggregate locked until all child commands are processed, which means no other commands can be handled for these locked aggregates.
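The nesting is easy to observe, since the UnitOfWork API exposes its parent; a small debugging probe (sketch) looks like this:

import org.axonframework.messaging.unitofwork.CurrentUnitOfWork;

class UnitOfWorkNestingProbe {

    // Callable from inside a command handler: if the current Unit of Work
    // has a parent, this handler was invoked from within another handler's
    // Unit of Work, and its cleanup phase is tied to that parent.
    static void logNesting() {
        if (CurrentUnitOfWork.isStarted()) {
            CurrentUnitOfWork.get().parent().ifPresent(parent ->
                    System.out.println("Nested under a Unit of Work for: "
                            + parent.getMessage().getPayloadType().getSimpleName()));
        }
    }
}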

Our solution strategy is now to move the sending of the commands to the child aggregates out of the aggregates and into an external command handler or a tracking event processor that sends the initial and subsequent …GrantAccess commands. This avoids nesting Units of Work inside aggregates and locking the parent aggregate.
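As a sketch of that strategy (the event and command names follow the ones above; getProjectId, getUserId and the processing group name are assumptions), an event handler in its own tracking processing group dispatches the child commands from outside the aggregate, so each child command starts a fresh root Unit of Work:

import org.axonframework.commandhandling.gateway.CommandGateway;
import org.axonframework.config.ProcessingGroup;
import org.axonframework.eventhandling.EventHandler;
import org.springframework.stereotype.Component;

@Component
@ProcessingGroup("access-propagation") // configured as a tracking processor
public class AccessPropagationHandler {

    private final CommandModel commandModel;
    private final CommandGateway commandGateway;

    public AccessPropagationHandler(CommandModel commandModel,
                                    CommandGateway commandGateway) {
        this.commandModel = commandModel;
        this.commandGateway = commandGateway;
    }

    @EventHandler
    public void on(ProjectAccessGrantedEvent event) {
        // runs on the tracking processor's thread, outside any aggregate's
        // Unit of Work, so the ProjectAggregate is not kept locked while
        // the child commands are being processed
        commandModel.getDescendants(event.getProjectId()).forEach(imageId ->
                commandGateway.send(new ImageGrantAccessCommand(imageId, event.getUserId())));
    }
}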


Thanks for sharing that with all of us here, @Jakob_Hatzl.
I am wondering, is there still something the community can help out with?

I currently don't have anything in mind beyond what you've proposed.
But maybe I'm missing something obvious, so please let me know whether I/we can still be of guidance.