Saga scaling

Prem_C1 · January 3, 2018, 11:07am

Hello,

We have been using Axon for the last few years and it’s been a very pleasant experience. One feature we chose not to use was Sagas. The main issue we had was around scaling of sagas beyond a single instance. When we deployed multiple instances, we noticed that the same saga events would get handled in more than one instance and result in optimistic concurrency exceptions resulting in the saga never completing. This was true in cases where events for the same saga happened concurrently.

One way to avoid the problem is by using a dispatch and handle mechanism similar to commands. I understand this is not the case (yet). Is this in the plan to be enhanced/fixed in the near to medium term?

Or are there ideas to enable deployment of multiple saga managers? Thoughts welcome!

allardbz · January 3, 2018, 8:00pm

Hi Prem,

since Axon 3.1, there is support for distributed Sagas. With the current implementation, they are limited to TrackingProcessors. Based on the “segment” each processor has claimed, it will process a certain portion of the Saga instances. This claim also ensures that no two processors will ever process the same saga in two different nodes.

Actually, for Tracking Processors, when running in multi-threaded mode, it doesn’t matter whether the threads are running in the same JVM, or in different ones.
See the reference guide: https://docs.axonframework.org/part3/event-processing.html#parallel-processing

Cheers,

Allard

Steven_Grimm · January 3, 2018, 11:50pm

In our Axon 2.4 application we use sagas extensively in a multi-host environment. We use the DistributedCommandBus to spread work among hosts, and each host runs saga code locally as it generates events. Row-level locks are the key: you don’t want two hosts to load the same saga at the same time.

We made some changes to the saga manager code, which were merged into Axon’s 2.4.x branch (see PRs #411 and #427) to avoid deadlocks by imposing an update order on sagas. Then it’s a matter of locking each saga row as it’s read from the repository, which we do with a custom saga schema class:

public class LockingSagaSqlSchema extends PostgresSagaSqlSchema {
  public static boolean shouldLock = true;

  private final SagaSchema sagaSchema;

  public LockingSagaSqlSchema(SagaSchema sagaSchema) {
    super(sagaSchema);
    this.sagaSchema = sagaSchema;
  }

  @Override
  public PreparedStatement sql_loadSaga(Connection connection, String sagaId) throws SQLException {
    if (shouldLock) {
      final String sql =
          "SELECT serializedSaga, sagaType, revision"
              + " FROM "
              + sagaSchema.sagaEntryTable()
              + " WHERE sagaId = ?"
              + " FOR UPDATE";
      PreparedStatement preparedStatement = connection.prepareStatement(sql);
      preparedStatement.setString(1, sagaId);
      return preparedStatement;
    } else {
      return super.sql_loadSaga(connection, sagaId);
    }
  }
}

I’m working on porting our application to Axon 3.1 at the moment and am planning to use the same setup; only a subset of our application’s events get persisted to an event store, so we can’t use tracking saga managers. But if all your events are persisted, Axon 3.1’s distributed saga support would probably be a cleaner and easier solution.

-Steve

Prem_C1 · January 4, 2018, 10:15am

Thanks Allard! Sounds good! Will take a look! Will keep this group posted on how we move forward.

Prem

Prem_C1 · January 4, 2018, 12:41pm

Thanks for sharing your approach Steve. Locking the database row for the saga instance seems like a simple way to fix it. I think that should fix most of our problems with being able to distribute saga managers across multiple instances.

We are using the DistributedCommandBus as well to scale command handling. All our production deployments are on Axon 2.x Our next gen services are 3.x but are yet to be deployed in production. We’ll upgrade to the latest 2.x version to get the benefit of using Sagas.

Thanks for your help!
Prem

Steven_Grimm · January 5, 2018, 5:04am

The row-locking technique seems to work with Axon 3.1 (not tested extensively yet, just a quick smoke test) but a small change to the code is required. The JDBC connection is set to autocommit by default which means the row lock is released as soon as it’s acquired.

@Override
public PreparedStatement sql_loadSaga(Connection connection, String sagaId) throws SQLException {
  if (shouldLock) {
    final String sql =
        "SELECT serializedSaga, sagaType, revision"
            + " FROM "
            + sagaSchema.sagaEntryTable()
            + " WHERE sagaId = ?"
            + " FOR UPDATE";

    // If the connection is set to autocommit, the lock will be released immediately.
    if (connection.getAutoCommit()) {
      CurrentUnitOfWork.get()
          .afterCommit(Unchecked.consumer(ignore -> connection.setAutoCommit(true)));
      CurrentUnitOfWork.get()
          .onRollback(Unchecked.consumer(ignore -> connection.setAutoCommit(true)));
      connection.setAutoCommit(false);
    }

    PreparedStatement preparedStatement = connection.prepareStatement(sql);
    preparedStatement.setString(1, sagaId);
    return preparedStatement;
  } else {
    return super.sql_loadSaga(connection, sagaId);
  }
}

The “Unchecked” call there is from the jOOL library and just rethrows checked exceptions; replace it with a “try” block if you like.

Axon 3.1 has an internal locking mechanism for sagas but it is local to the JVM so doesn’t help with multiple hosts. An alternative to the row-locking solution would be to provide a network-aware implementation of Axon’s LockFactory interface, perhaps using the JGroups lock service.

-Steve