Scalability of Axon Multitenancy support

Hello everyone,

I highly appreciate that you provide a multitenancy extension for Axon Framework.

It’s a little bit unfortunate that you need an Axon Server Enterprise license to give the demo app a try, but I was able to inspect it nonetheless.

As far as I can see, Axon starts a separate thread for each EventProcessor and each tenant. Given 4 event processors and 1,000 tenants, this would result in 4,000 threads.

Besides the large number of threads, I assume there is a lot of interaction with the database to claim and update tokens.

Is there a maximum number of tenants you aim for? Do you want to support thousands of tenants, or do you focus on supporting a small number (10? 50? 100?)?

Thanks for clarifying
Oliver


Good observation. The extension scales the components that each tenant uses, which means it scales event processors linearly too. How many tenants a single app can hold is bounded by the resources given to the application; also, I believe the maximum number of connections to a single database ranges from roughly 100 up to 200.

Currently, we don’t have an answer for how to support 1,000 tenants in a single app, but one idea is to try Project Loom and virtual threads. It could be set up like this:


config.registerTrackingEventProcessorConfiguration(configuration ->
        TrackingEventProcessorConfiguration.forParallelProcessing(4)
                .andInitialSegmentsCount(4)
                // Back the processor's threads with virtual threads instead of platform threads
                // (requires a JDK with the Loom virtual-thread preview APIs).
                .andThreadFactory(name -> Thread.ofVirtual()
                        .allowSetThreadLocals(true)
                        .inheritInheritableThreadLocals(false)
                        .name(name + "-", 0)
                        .factory())
);

Also, I would be happy to provide you with a trial Axon Server license to try out the demo app.

Thanks for your answer. Virtual threads might be an option… I will check how the application behaves with them.

On a similar note, but for the PooledStreamingEventProcessor:
You can configure the ScheduledExecutorService used for coordination and for event handling.
That way, you can easily share a single ScheduledExecutorService across a number of PooledStreamingEventProcessors.
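For illustration, a minimal sketch of what that could look like (assuming Axon 4.6-style configuration APIs; `configurer` is the application's Axon Configurer, and the pool size and thread-factory name are arbitrary examples):

// Sketch: one shared executor for all PooledStreamingEventProcessors instead of a
// dedicated thread pool per processor and tenant. Pool size (4) is an arbitrary example.
ScheduledExecutorService sharedExecutor =
        Executors.newScheduledThreadPool(4, new AxonThreadFactory("shared-psep"));

configurer.eventProcessing()
          .usingPooledStreamingEventProcessors()
          .registerPooledStreamingEventProcessorConfiguration(
                  (config, builder) -> builder.coordinatorExecutor(sharedExecutor)
                                              .workerExecutor(sharedExecutor)
          );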


I played with the PooledStreamingEventProcessor using the multi-tenant demo-app and it looks quite good to me…

That solves the issue of having a lot of threads. What I still have to investigate is the number of token updates, as they might cause a lot of load on an idle system.

As a positive side-effect two small reference-guide PRs popped up:


Just spotted the pull requests, @Oliver_Libutzki. Thanks for the fixes; they’ve been merged. :slight_smile:

I would just like to share some numbers regarding the number of queries in the demo application:
For a single tenant and a single EventProcessor (still using the PooledStreamingEventProcessor with 1 coordinator and 16 workers), 33 queries are executed every 5 seconds:

2022-10-28 11:39:33.772 DEBUG 16472 --- [  Coordinator-0] org.hibernate.SQL                        : select tokenentry0_.processor_name as processo1_4_, tokenentry0_.segment as segment2_4_, tokenentry0_.owner as owner3_4_, tokenentry0_.timestamp as timestam4_4_, tokenentry0_.token as token5_4_, tokenentry0_.token_type as token_ty6_4_ from token_entry tokenentry0_ where tokenentry0_.processor_name=? order by tokenentry0_.segment ASC
2022-10-28 11:39:33.772 DEBUG 16472 --- [       Worker-7] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [       Worker-6] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [      Worker-11] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [      Worker-14] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [      Worker-13] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [       Worker-8] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [       Worker-1] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [       Worker-4] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [      Worker-15] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [      Worker-12] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [      Worker-13] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [      Worker-11] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [       Worker-7] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [       Worker-8] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [      Worker-10] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [       Worker-9] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [      Worker-15] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [       Worker-6] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [      Worker-13] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [       Worker-5] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [       Worker-1] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [       Worker-2] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [       Worker-7] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [      Worker-10] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [       Worker-6] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [      Worker-11] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [       Worker-8] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [       Worker-0] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [       Worker-9] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.772 DEBUG 16472 --- [      Worker-14] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.779 DEBUG 16472 --- [      Worker-12] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?
2022-10-28 11:39:33.779 DEBUG 16472 --- [       Worker-3] org.hibernate.SQL                        : update token_entry set timestamp=? where processor_name=? and segment=? and owner=?

That means there are 6.6 queries per second on average. Scaling seems to be linear: adding a second EventProcessor doubles the queries per second, and adding a second tenant doubles it as well.

That being said, assuming we have 1,000 tenants, we would have 6,600 queries per second just to update the tokens. So besides the number of threads (which can be solved using the PooledStreamingEventProcessor), the number of queries seems to be the main limiting factor in the multi-tenancy scenario.

Experimenting with the event availability timeout, the token claim interval or the segment count might mitigate the issue, but I’m afraid we could run into situations where events are handled with huge latency.

Do you have any suggestions on this one?

The solution might be to create a custom token store that would either batch token updates or use a write-through cache for both tokens and projections. It’s an interesting problem indeed :slight_smile:
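For illustration, a rough sketch of what a throttling delegate around an existing TokenStore could look like (ThrottlingTokenStore is a hypothetical name, not part of the framework or extension; a real implementation would also need to delegate the remaining default methods such as initializeSegment and deleteToken):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.axonframework.eventhandling.TrackingToken;
import org.axonframework.eventhandling.tokenstore.TokenStore;

// Hypothetical sketch: wrap an existing TokenStore and only forward idle "keep-alive"
// claim extensions every few seconds, coalescing the steady stream of timestamp updates.
// The interval must stay well below the delegate's claimTimeout, or claims get stolen.
public class ThrottlingTokenStore implements TokenStore {

    private final TokenStore delegate;
    private final long minExtensionIntervalMillis;
    private final Map<String, Long> lastUpdate = new ConcurrentHashMap<>();

    public ThrottlingTokenStore(TokenStore delegate, long minExtensionIntervalMillis) {
        this.delegate = delegate;
        this.minExtensionIntervalMillis = minExtensionIntervalMillis;
    }

    @Override
    public void extendClaim(String processorName, int segment) {
        String key = processorName + "[" + segment + "]";
        long now = System.currentTimeMillis();
        Long last = lastUpdate.get(key);
        if (last == null || now - last >= minExtensionIntervalMillis) {
            delegate.extendClaim(processorName, segment);
            lastUpdate.put(key, now);
        }
    }

    @Override
    public void storeToken(TrackingToken token, String processorName, int segment) {
        // Actual progress always goes straight through; only idle extensions are throttled.
        delegate.storeToken(token, processorName, segment);
        lastUpdate.put(processorName + "[" + segment + "]", System.currentTimeMillis());
    }

    @Override
    public TrackingToken fetchToken(String processorName, int segment) {
        return delegate.fetchToken(processorName, segment);
    }

    @Override
    public void releaseClaim(String processorName, int segment) {
        delegate.releaseClaim(processorName, segment);
        lastUpdate.remove(processorName + "[" + segment + "]");
    }

    @Override
    public int[] fetchSegments(String processorName) {
        return delegate.fetchSegments(processorName);
    }
}

Note that this only coalesces the idle keep-alive updates per segment; actual token progress would still be written whenever events are handled.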

In turn, updating the tokens in a batch would increase the need to synchronize the workers… and that’s exactly what we don’t want, right?

The high number of token updates is the main issue we currently have with Axon. It causes a lot of load even if the system is idling.

It’s ok in case you have a single tenant, but the multi-tenancy idea amplifies this effect.

I am not sure how to address this issue. I don’t want to start with a solution like “do batch token updates” or “use advanced caching”. On the other hand, a ticket saying “Reduce the number of token updates” might be a little too generic. :slight_smile:

In case the system is mostly idle, you can increase both the claimExtensionThreshold on the PooledStreamingEventProcessor and the claimTimeout on the TokenStore.
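For illustration, a minimal sketch of both knobs (assuming Axon 4.6 with a JPA-based token store; `configurer`, `entityManagerProvider` and `serializer` are assumed to exist elsewhere, and the values are arbitrary examples):

// Extend claims less often so an idle processor touches the token table less frequently.
// The claimExtensionThreshold must stay well below the store's claimTimeout.
configurer.eventProcessing()
          .registerPooledStreamingEventProcessorConfiguration(
                  (config, builder) -> builder.claimExtensionThreshold(30_000)); // in ms

// Allow claims to live longer before other nodes may steal them.
TokenStore tokenStore = JpaTokenStore.builder()
        .entityManagerProvider(entityManagerProvider)
        .serializer(serializer)
        .claimTimeout(Duration.ofMinutes(5))
        .build();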

Most probably a stupid question, but I’m curious: would it be possible to create a TokenStore per tenant, or to create a tenant column and wire that up somehow?

It sounds quite bad that Axon will create so many threads and queries with an increasing number of processing groups and tenants.

We currently create processing groups quite liberally (around 40 for a single tenant across two applications). From what I read in this thread, the number of threads and queries just to update tokens would explode if we added more tenants and applications in the future.

What is planned to address this and do we need to rethink how we create processing groups? @stefand

@Oliver_Libutzki the multitenancy extension was never designed to deal with such a large number of tenants (1,000); it was designed for more like 30-50 per app. Unfortunately, we don’t have a solution for this case. Even if you decided to implement a plain old multi-tenant CRUD app, you probably wouldn’t be able to host this number of tenants, as it would require 1,000 separate databases.

Maybe PgBouncer and Pgpool can help with a large number of DB connections.

Just for brainstorming: in theory, if you implemented your own event store to use a table per aggregate and considered different aggregates as tenants, you would have at least some separation of data, and you could delete aggregates. But you would have a single token store, and replaying events independently for different tenants would not be possible.

@JohT it’s possible to have either a tenant column in a single token store or a token store per tenant; I’m not sure what you mean by wiring that up somehow?

@danstoofox I would advise reducing the number of processing groups and grouping wisely. As @Steven_van_Beelen mentioned, you can reduce the number of threads, but that’s not much different from reducing the number of processing groups. I would advise you to experiment with what works best for you: start increasing the number of tenants and try grouping more active tenants with less active ones. “Premium” tenants that require more resources should have a dedicated app, or at least share an app with just a few other tenants, while “free plan” or “cheaper plan” tenants should be grouped in larger numbers in a single application to save hosting costs.


Thanks for being transparent regarding what it was built for and which limitations exist.

I highly appreciate your open communication. It’s a strength to say “We don’t have a solution for this particular use case.”


@stefand I didn’t know that. Thanks for the explanation. With “wire it up” I was thinking about implementing a custom TokenStore.

That’s possible too :slight_smile:

You could create a custom TokenStore. But even in the best case, when there are no new events, that is still a call to the database for every tenant, segment and processing-group combination every claimExtensionThreshold. So that easily becomes thousands of calls per second when there are a thousand tenants.
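For example, with purely illustrative numbers: 1,000 tenants × 4 processing groups × 4 segments = 16,000 token claims, and extending each one every 5 seconds already means about 3,200 keep-alive queries per second before a single event is handled.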


Very interesting topic! We are not using the multitenancy extension, but we are seeing the mentioned issues with constant token updates in the database in a standard multi-user application.

Our application creates a database per tenant for user data, but the token store still remains the same for all tenants.

With more than 500 tenants, our token store (a PostgreSQL database) goes into complete turtle mode, which drastically increases latency for all Axon Server operations. It becomes even worse if we increase the number of microservices using Axon. If we decrease the tenant count to 20 or 30, it is much more manageable.

@stefand Is the mentioned 30-50 tenant limit only applicable to the multitenancy extension, or do similar token-tracking limits also exist in Axon Framework and/or Axon Server itself?

I would say that limit is defined by the token store and not by the extension or the framework. For example, PostgreSQL by default starts complaining about a high number of connections, which can be changed via a parameter, but things get slow. I’m not sure what the status is with other databases; maybe some are more optimized for a high number of concurrent connections.

Also, it looks like tools such as PgBouncer might help you. I’m currently running tests with these tools to see how much they help with scaling a multi-tenant system.


Yes, I understand that the physical limits are more in the infrastructure itself. What I was asking about is the approximate limits you had in mind when designing the framework.