Best approach for batching requests to external services?

Part of my application involves talking to an external service with a high per-interaction overhead that accepts work in batches. The work is typically triggered by client requests that themselves arrive in batches, either literally (a list of work items in a single message over the network) or in the form of multiple individual requests in rapid succession.

My intended solution is to publish a “queue a job” command to the command bus for each aggregate root that needs to talk to the external service. The command handler adds each command to a work queue, writes it to persistent storage so it’ll survive server restarts, and then returns a “success” result. A separate piece of code listens for additions to the work queue. As soon as it sees its first piece of work, it waits until either a certain amount of time has passed or the queue has reached a maximum batch size, then pulls the commands off the queue and fires off the interaction with the external service, publishing a “sent as part of a batch” event to Axon’s event bus for each queued command so the sagas associated with the aggregate roots can transition to “waiting for the external service” states. When the external service finishes, the results are published as events on the event bus, and the sagas pick those up and issue additional commands to trigger appropriate business logic.
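For concreteness, the timeout-or-max-size drain loop I have in mind could be sketched in plain Java like this (the `BatchCollector` name, the queue type, and the parameters are all illustrative, not anything Axon provides):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Minimal sketch of "wait for the first item, then flush on timeout or max size".
class BatchCollector<T> {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();
    private final int maxBatchSize;
    private final long maxWaitMillis;

    BatchCollector(int maxBatchSize, long maxWaitMillis) {
        this.maxBatchSize = maxBatchSize;
        this.maxWaitMillis = maxWaitMillis;
    }

    void submit(T item) {
        queue.add(item);
    }

    /**
     * Blocks until the first item arrives, then keeps collecting until
     * either the batch is full or the time window has elapsed.
     */
    List<T> nextBatch() throws InterruptedException {
        List<T> batch = new ArrayList<>();
        batch.add(queue.take());                       // wait for the first item
        long deadline = System.currentTimeMillis() + maxWaitMillis;
        while (batch.size() < maxBatchSize) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) break;                 // window elapsed
            T item = queue.poll(remaining, TimeUnit.MILLISECONDS);
            if (item == null) break;                   // timed out waiting
            batch.add(item);
        }
        return batch;
    }
}
```

The real version would pull persisted commands rather than in-memory items, and would publish the “sent as part of a batch” events after each flush.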

Axon’s built-in support for queuing of messages seems to be tied to the distributed event bus and to require setting up a separate message broker such as RabbitMQ, which is something this application doesn’t otherwise call for and which seems like it’d add a bunch of unnecessary complexity.

Is there another approach I should be looking at?

One Axon-centric implementation I considered was to define an “interact with the external service” saga that’s not associated directly with any of my existing aggregate roots (it instead maintains a collection of pending work items, though each of those items does contain an aggregate root reference) and uses the built-in persistence and timer features of sagas to do its work. So basically I’d be using Axon’s saga management to implement a persistent timed job queue. This seems like it might work but feels like it violates the spirit of sagas in that it isn’t really representing a business process over a related set of aggregate roots. But I’m new enough to CQRS and Axon that my intuition about the spirit of sagas is not well-developed yet, and maybe this is actually a completely appropriate use of them.

Thanks in advance for any ideas or pointers to existing discussions of this topic!

-Steve

I wouldn’t publish a “queue a job” command and do the queuing in the command handler; I’d publish an “execute a job” command instead. The sender generally doesn’t have to be aware that the job is queued to satisfy the requirements of the service. The job of the command handler is just to check whether the command is legitimate; it shouldn’t do any queuing.

I would personally go with your saga approach: the association property in this case is not the aggregate identifier, as it usually is, but a unique constant that all the aggregates concerned and the saga know about.

-Rob

Hi Steve,

The most important thing to keep in mind is that the Commands and Events you use should have a functional meaning. If the notions of a Batch and Scheduling Jobs are terms your product owner uses, it’s fine to use them in your commands as well. Otherwise, proceed with caution.
Generally, Sagas are a good means of performing actions that involve external data, especially if accessing that data has a large overhead cost.

In your case, you don’t seem to have a Saga that really associates with anything. As Rob suggested, you can use an association that has a fixed value, to ensure there is only one instance of a Saga running. The advantage is that you’ll have test fixtures, persistence and event scheduling out of the box. However, using Sagas does have a small overhead. In your case, you could also use a “normal” @EventHandler that uses some persistent storage to store the entries to fetch. You can still use the EventScheduler, but you may want to consider using Quartz directly.
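The “normal” event handler alternative might look roughly like this, sketched in plain Java: queued entries are appended to a store, and the first entry arms a one-shot timer that flushes the batch. Here a `ScheduledExecutorService` stands in for the EventScheduler or Quartz, the names (`PendingWorkStore`, `onWorkItemQueued`) are made up for illustration, and a real version would also persist the pending entries:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Entries accumulate until a one-shot timer fires and flushes them as a batch.
class PendingWorkStore {
    private final List<String> pending = new ArrayList<>();
    private final Consumer<List<String>> batchSender;
    private final long flushDelayMillis;
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r);
                t.setDaemon(true);          // don't keep the JVM alive
                return t;
            });
    private ScheduledFuture<?> scheduledFlush;

    PendingWorkStore(Consumer<List<String>> batchSender, long flushDelayMillis) {
        this.batchSender = batchSender;
        this.flushDelayMillis = flushDelayMillis;
    }

    // In Axon this would be an @EventHandler method; plain Java here.
    synchronized void onWorkItemQueued(String item) {
        pending.add(item);
        if (scheduledFlush == null) {       // the first item arms the timer
            scheduledFlush = timer.schedule(
                    this::flush, flushDelayMillis, TimeUnit.MILLISECONDS);
        }
    }

    synchronized void flush() {
        if (pending.isEmpty()) return;
        batchSender.accept(new ArrayList<>(pending));
        pending.clear();
        scheduledFlush = null;
    }
}
```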

Hope this helps.
Cheers,

Allard

We’ve been using Sagas and Quartz to schedule jobs for integration with external systems. I agree with you and Allard: this is more natural if there’s some relation to your domain.

Quartz is a pretty heavyweight framework. It works pretty well for big, slow jobs, not so well if you have thousands of small, quick jobs. Also, if you use DistributedCommandBus, beware that Quartz will happily fire its trigger on any of the servers in your cluster (randomly), not necessarily the one that originally processed the saga. So you either have to turn off saga caching (a performance hit), or make your Quartz-triggered event trigger a command, just so it gets routed to the correct server, which triggers the “real” event, which triggers the saga to do the work – pretty ugly. (I understand distributed sagas are a very hard problem in general.)

I personally would prefer your messaging option if it would work for you, despite your concerns. However I think I would make an event handler send the message directly rather than using a distributed event bus terminal. Your MQ system will take care of the queueing, durability, delivery, batching, and gives you some tools to retry after errors. You could run ActiveMQ embedded in your app if you don’t like the idea of setting up RabbitMQ; this is comparable to running Quartz. ActiveMQ even has distributed and discovery capabilities. Axon can’t integrate the EventBus directly with ActiveMQ out of the box, but that’s OK if you roll your own message publisher as I suggested. Saga + Quartz could be a more natural fit if each job must be stateful and integrates tightly with your Axon aggregates. But integration through messaging is a tried and true method.

-Peter

Thanks for the advice. One point of confusion I have about your suggestion: I thought it was generally considered bad practice to perform actions in event handlers (other than updating internal data structures) because then you lose the ability to replay the event stream if you need to generate a new view of your data to run queries against. How would you deal with that? The obvious solution that comes to mind is to set an “I’m doing a replay” flag that the event handler checks before sending its message to the external service.
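For concreteness, the flag idea could be sketched like this in plain Java (the class name, the flag setter, and the `Runnable` side effect are hypothetical stand-ins, not Axon API): internal state is always rebuilt, but the external call is skipped during a replay.

```java
// Event handler that rebuilds its state on replay but suppresses side effects.
class ReplayAwareHandler {
    private volatile boolean replaying = false;
    private int itemsSeen = 0;              // internal read-model state
    private final Runnable externalCall;    // side effect to suppress on replay

    ReplayAwareHandler(Runnable externalCall) {
        this.externalCall = externalCall;
    }

    void setReplaying(boolean replaying) {  // set before/after a replay run
        this.replaying = replaying;
    }

    void onWorkItemQueued() {
        itemsSeen++;                        // always update internal state
        if (!replaying) {
            externalCall.run();             // only contact the service live
        }
    }

    int itemsSeen() {
        return itemsSeen;
    }
}
```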

-Steve


Hi Steve,

event handlers (the beans, not the aggregate) can do anything they need to. It is, however, bad practice to combine both state updating handlers and those communicating with external components in the same cluster. In a replay, you generally don’t want to invoke all these external components again.

Cheers,

Allard


Question: sagas are nearly always both stateful and side-effecting event handlers, so are they bad practice?

One approach for combining state and side effects in a replayable event handler – other than the “am I replaying” flag – is to make it idempotent by building deduplication logic into the handler. For example, a saga can record the sequence number of the last event it handled for each aggregate, and ignore any event whose sequence number it has already processed. We’ve been doing this to build replayable sagas, which aren’t otherwise possible. It’s a bit ugly to have this deduplication code everywhere, but I’ve found it hard to make generic without sacrificing flexibility. This technique has allowed us to repair bugs in sagas in production. It’s also pretty much mandatory if you want to use a distributed or asynchronous event bus, since no messaging technology can guarantee “exactly once” delivery.
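The per-aggregate sequence-number check could be sketched like this (names are illustrative; in a real saga the map would be part of the saga’s persisted state so it survives restarts along with everything else):

```java
import java.util.HashMap;
import java.util.Map;

// Skips any event whose sequence number has already been seen for its
// aggregate, making the handler idempotent under replays and redelivery.
class DeduplicatingHandler {
    private final Map<String, Long> lastHandled = new HashMap<>();
    private int sideEffects = 0;

    /** Returns true if the event was processed, false if it was a duplicate. */
    boolean handle(String aggregateId, long sequenceNumber) {
        Long last = lastHandled.get(aggregateId);
        if (last != null && sequenceNumber <= last) {
            return false;                   // replayed or redelivered: skip
        }
        lastHandled.put(aggregateId, sequenceNumber);
        sideEffects++;                      // the external side effect goes here
        return true;
    }

    int sideEffects() {
        return sideEffects;
    }
}
```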

-Peter
