Multitenancy: Dangerous AssociationValueEntry entity definition

KaeF · September 13, 2024, 6:16am

AssociationValueEntry entity has id field annotated with @GeneratedValue without any parameters (AssociationValueEntry) which means that the AUTO strategy is selected. In combination with Postgres db it means that actually SEQUENCE strategy is selected. That creates a sequence with a default increment value of 50. Hibernate uses a pooled optimization strategy by default with allocationSize 50 which causes it to ask the database about nextval from the sequence and next values increment by itself until exceeds allocationSize then asks the database for a next val from the sequence again.

It is a good optimization in a standard environment but dangerous in a multitenant environment because e.g. imagine the first operation triggered by tenant1 causes retrieving nextval from tenant1 sequence then the next operation is in triggered by tenant2 and hibernate will not ask db again for nextval from sequence because not used allocationSize pool. That can cause a problem with the uniqueness of primary keys.

The other thing is that if hibernate optimization is disabled in configuration and sequence increment size is changed to 1 and the service uses schema validation then another issue can happen (if it will be changed in one service and others will remain the same).

The sequence can be invalidated in case if we have many axon services that use the same database but different schemas. Hibernate uses naive validation - gets all sequences from the database (from all schemas) and puts them to a map where a key is just a sequence name which is the same in that case in all schemas so they are overridden. So at the end, it can compare entity with sequence from different service (schema)

Hibernate - DatabaseInformationImpl

I think that the definition should be changed which at least should cause there will be no optimization used so all the time next val is taken from the database.

But actually, probably it would be better to not use sequence idis at all and instead use e.g. UUID

Steven_van_Beelen · September 16, 2024, 9:44am

I think that the definition should be changed which at least should cause there will be no optimization used so all the time next val is taken from the database.

We fully agree with this statement, @KaeF. The AssociationValueEntry class is, however, in its current format, something all current users of Sagas depend on.

If we suddenly changed the sequence generation within a patch or minor release of either Axon Framework or the Multi-tenancy Extension, we would likely be bombarded with many new bug issues.

As such, there isn’t much we can do within Axon Framework 4.
However, I can confirm the approach will change for Axon Framework 5.

Added, although suboptimal, there are workarounds for multi-tenant-specific environments, such as overriding the generation defaults in your environment.
Note that I prefer not giving this suggestion, but given our intent NOT to introduce breaking changes, it is sadly all I can do for the time being.

KaeF · September 16, 2024, 11:36am

Actually I workarounded it by

disabling hibenate optimization

spring:
  jpa:
    properties:
      hibernate:
        id:
          optimizer:
            pooled:
              preferred: none

introduce customer hibernate naming strategy

@Component
class MultitenantNamingStrategy(private val environment: Environment) : CamelCaseToUnderscoresNamingStrategy() {

    override fun toPhysicalSequenceName(name: Identifier, context: JdbcEnvironment): Identifier {
        return super.toPhysicalSequenceName(name, context)
            .let { Identifier("${environment.getProperty("spring.application.name")}_${it.text}", it.isQuoted) }
    }
}

alter sequence definition by

ALTER SEQUENCE association_value_entry_seq INCREMENT BY 1;

-- additionally set broken current val (this query has to be customized to reach service tenants databases )
SELECT setval('association_value_entry_seq', (SELECT max(nextval(sequence_schema ||'.' ||sequence_name)) as nextval
                                           FROM information_schema.sequences
                                           WHERE sequence_schema ilike '%[APP_NAME]'));

ALTER SEQUENCE association_value_entry_seq RENAME TO [APP_NAME]_association_value_entry_seq;

So disabling optimization and changing INCREMENT BY with setting proper curr value (in case it was already used) caused the next val is always retrieved from the database.

Changing name (adding service prefix) fixed schema validation - check only sequences for the current service and not all in db.

Anyway even if you cannot do any not backward compatible changes you should at least put some documentation in multitenancy how it should be configured or do a patch only for multitenant environments (providing multitenant generator/strategy).

If you do nothing then multitenancy extension is buggy and can cause a real problem for users.

lbilger · September 20, 2024, 11:30am

I think this is related to my post here, although my issue was with domain_event_entry.

Note that as a workaround, you can also override the sequence generator in orm.xml, which might or might not be easier than your approach:

<?xml version="1.0" encoding="UTF-8"?>
<entity-mappings xmlns="http://java.sun.com/xml/ns/persistence/orm" version="2.0">
    <mapped-superclass class="org.axonframework.eventhandling.AbstractSequencedDomainEventEntry" access="FIELD">
        <attributes>
            <id name="globalIndex">
                <generated-value generator="domainEventGenerator" strategy="SEQUENCE"/>
            </id>
        </attributes>
    </mapped-superclass>
    <entity class="org.axonframework.eventsourcing.eventstore.jpa.DomainEventEntry" access="FIELD">
        <sequence-generator name="domainEventGenerator" sequence-name="domain_event_entry_seq" allocation-size="1"/>
    </entity>
    <entity class="org.axonframework.modelling.saga.repository.jpa.AssociationValueEntry" access="FIELD">
        <sequence-generator name="associationValueEntryGenerator"
                            sequence-name="association_value_entry_seq"
                            allocation-size="1" />
        <attributes>
            <id name="id">
                <generated-value generator="associationValueEntryGenerator" strategy="SEQUENCE" />
            </id>
        </attributes>
    </entity>
</entity-mappings>

Gabriel_Shanahan · November 21, 2024, 2:39pm

Hey everyone, my two cents.

First, @Steven_van_Beelen I can appreciate that this is not the only thing on your plate, but this is a huge bug that affects anyone who’s running Axon + Hibernate in a multi-tenant environment, which is bound to be a significant number of people. I honestly can’t think of a more fundamental problem than “silently and randomly not running event handlers”. I understand that you’re going to solve this in Axon 5, but this thread is over 2 months old, and (unless I’m missing something), there’s not a word about this in the docs. This should be plastered in big red letters at the top of the documentation, and it takes 5 seconds to do, saving others man-days, even man-weeks, trying to diagnose the problem. We were lucky to encounter this topic, others might not have been. I don’t mean to be disrespectful, but this an absolute no-brainer and should’ve been done immediately when you found out this was an issue. Please prioritize it, and let’s get it into the docs in big red letters ASAP.

Moving on, the workarounds described above do not actually fix the issue - they just make the window for the race condition smaller. Setting the allocation size to 1 just causes Hibernate to query the database each time before it attempts to save a record, as can be seen on the attached screenshot. Apart from this doubling the load on the DB, imagine Tenant A and Tenant B - Tenant A asks the DB for an ID, gets 40, but before it manages to persist the event, Tenant B also asks for an ID, gets 41, and manages to persist its event before Tenant A does, leading to out of order globalIndex.

The way to solve this is to switch the generation strategy to identity, and migrating the columns to BIGINT GENERATED ALWAYS AS IDENTITY. This can be done using an AdditionalMappingContributor, which I’m attaching bellow. Note that you also need to migrate the columns in the database as well.

@AutoService(AdditionalMappingContributor.class)
public class AxonModelReplaceGeneratedValueIdsWithAutoIncrementAdditionalMappingContributor implements AdditionalMappingContributor
{

    @Override
    public String getContributorName()
    {
        return AxonModelReplaceGeneratedValueIdsWithAutoIncrementAdditionalMappingContributor.class.getSimpleName();
    }

    @Override
    public void contribute(
        final AdditionalMappingContributions contributions,
        final InFlightMetadataCollector metadata,
        final ResourceStreamLocator resourceStreamLocator,
        final MetadataBuildingContext buildingContext
    )
    {
        for (PersistentClass entityBinding : metadata.getEntityBindings()) {
            if (AbstractSequencedDomainEventEntry.class.isAssignableFrom(entityBinding.getMappedClass())) {
                remapToAutoincrement(entityBinding, "globalIndex");
            }

            if (AssociationValueEntry.class.isAssignableFrom(entityBinding.getMappedClass())) {
                remapToAutoincrement(entityBinding, "id");
            }
        }
    }

    private static void remapToAutoincrement(
        final PersistentClass entityBinding,
        final String fieldName
    )
    {
        var value = (SimpleValue) entityBinding.getProperty(fieldName).getValue();
        value.setIdentifierGeneratorStrategy("identity");

        Column column = CollectionUtils.extractSingleton(value.getColumns());
        column.setSqlType("bigint");
    }

}

Example Postgres migration (be sure to review and edit/substitute to match your environment):

ALTER TABLE domain_event_entry DROP CONSTRAINT domain_event_entry_pkey;
ALTER TABLE domain_event_entry
  ALTER COLUMN global_index DROP DEFAULT,
  ALTER COLUMN global_index SET DATA TYPE BIGINT,
  ALTER COLUMN global_index ADD GENERATED ALWAYS AS IDENTITY;
ALTER TABLE domain_event_entry ADD PRIMARY KEY (global_index);
DROP SEQUENCE IF EXISTS domain_event_entry_SEQ;

ALTER TABLE association_value_entry DROP CONSTRAINT association_value_entry_pkey;
ALTER TABLE association_value_entry
  ALTER COLUMN id DROP DEFAULT,
  ALTER COLUMN id SET DATA TYPE BIGINT,
  ALTER COLUMN id ADD GENERATED ALWAYS AS IDENTITY;
ALTER TABLE association_value_entry ADD PRIMARY KEY (id);
DROP SEQUENCE IF EXISTS association_value_entry_SEQ;

Steven_van_Beelen · November 22, 2024, 10:22am

@Gabriel_Shanahan, you’re right here. We at AxonIQ are aware of this, so we nudge people accordingly, of course. Especially those close in the network.

However, we do not talk with all users directly, so this should exist in the documentation, somewhere.

To be frank, I think I know just the place. But, I would greatly appreciate your (and other readers’) input on this. Guessing how somebody is going to traverse the documentation is, well, pretty tough. So, hearing from y’all how_ you go about the documentation would help uncover what the best place is. Or, the best places are.

That said, I was thinking of adjusting the Auto-Increment and Sequences page to accompany a (massive) warning concerning more recent versions of Hibernate and their sequencing behavior.

I do not think just this is sufficient, though. For one, I want to add it as a section to the Tuning section of the Axon Framework-specific pages.

Concluding, that leaves me with two pointers for readers of this thread:

Please give your 2 cents on the suggested locations!
Not just the code but also the documentation is open source. Hence, anybody can contribute. Thus, if your opinion is particularly strong (like yours, @Gabriel_Shanahan), the best help you can provide is to construct either (1) an issue or (2) do a PR (at this part of the Tuning guide, for example).

Given that I think you constructed this account with the forum for the reason to provide this comment, @Gabriel_Shanahan, I can tell you I would greatly appreciate your help

Everybody benefits here from the work AxonIQ puts into Axon Framework, the extensions, and all the documentation. Contributing to that here on the forum, but also on the repositories, ensures we keep it being great and useful.

Gabriel_Shanahan · November 22, 2024, 10:55am

Hey @Steven_van_Beelen , thanks for the prompt reply. I think both of the proposed locations are a good idea, but above that, I would consider creating a completely separate, toplevel “Gotchas” section, and including this and any other serious/significant issues you know of.

Other contenders:

SubscriptionQueries don’t work in multitenant unless you use Axon Server
SubscriptionQueries are buggy in general
- I don’t know to what degree you’re aware of this, but they cause memory leaks in certain common usage scenarios. I have test cases that demonstrate this that I’ve been meaning to share for months now, but time is unfortunately very short
Various unintuitive/non-obvious aspects of interceptors
- dispatch interceptors only runafter event sourcing handlers (makes sense when you think hard about it, but can be easy to trip over if you don’t)
  - Especially confusing when combined with the fact that handler interceptors do run on event sourcing handlers, which leads to the counter-intuitive behavior of "event handler interceptors running before event dispatch interceptors
- dispatch interceptors don’t always run on the same thread the event was dispatched on
  - I seem to recall there’s a deterministic way to ascertain if it will or won’t, but I’m not 100% on the exact rule - I believe it was “if the emission is done in the context of an existing UoW, the interceptor will be called on a different thread”
Axon uses LOB’s by default, which are problematic for multiple reasons (at least on Postgres) - since not everyone is a data expert, those aspect might be worth mentioning, as well as demonstrating how to work around this and change to bytea (I can share this code, it’s the same principle as above)

I understand that having a “Gotchas” section visible like this might feel counter-intuitive from a marketing standpoint, but any senior developer worth their salt knows that all software is buggy, known bugs >>> unknown bugs, and respects any software that makes their job easier by admitting and communicating its shortcomings.

ajisrael · November 25, 2024, 2:15pm

I feel really lucky to have found this thread. We’re about to launch our first Axon application which relies heavily on Sagas. Knowing this before we launch into production is extremely helpful.

To echo @Gabriel_Shanahan’s reply to your suggestion @Steven_van_Beelen, I feel that both of those places in the documentation would be helpful for finding the ways to resolve this particular issue.

That being said, I also agree that something like a “Gotchas” section would provide a lot of value as well. I don’t know if “Gotchas” is the ideal term to use, but having something within the docs to help surface bugs that have a solution, impact a large set of users, and cannot be resolved with a patch (such as this issue) would be extremely valuable. I would imagine that these would only exist in the docs for the version that is affected, so if this is resolved via Axon 5.0 then this issue would no longer be present on the page.

Maybe instead of gotchas we use “Known Issues and Workarounds”.

This is obviously just my 2 cents, and I’d love to hear what others have to say.

Gabriel_Shanahan · November 25, 2024, 2:28pm

+1 for “Known Issues and Workarounds”

Steven_van_Beelen · November 27, 2024, 10:27am

Although there is indeed marketing value in AxonIQ Docs, it is intended to help our users first and foremost. So, although it’s arguably counterintuitive, that argument is of zero importance to me. Differently put, it makes sense to have it.

Having said that…

I do really like the suggestion here, @ajisrael. So, thanks for that
Given your thumbs up on it, @Gabriel_Shanahan, I assume you’re on the same page regarding naming.

As it stands, I want to round up some other tasks first today. I just didn’t want to leave you guys hanging longer than I already did.

Once I have a PR, I would like to ask both of you to give your 2 cents as well. Would you be up for that?

Gabriel_Shanahan · November 27, 2024, 10:37am

Sure, feel free to ping me anytime

ajisrael · November 27, 2024, 2:12pm

I’d be happy to. Thanks @Steven_van_Beelen

Steven_van_Beelen · November 29, 2024, 10:04am

If you guys are curious, the PR is over here.

Note that I have not added everything from the list you’ve provided, @Gabriel_Shanahan. Not because they’re invalid, but mainly to share the load here.

Differently put, I hope that having the scaffolding in place will make it easy for anyone (inside or outside of AxonIQ) to add known issues.

KaeF · December 2, 2024, 2:20pm

There is no race condition between tenants if tenants have separate schemas or dbs - there are separate sequences then so tenantA and tenantB query different sequences.

Steven_van_Beelen · December 3, 2024, 1:48pm

I want to notify all readers here that the Known Issues and Workarounds page is online on AxonIQ Docs. I don’t think it’s complete by any means, but we can start adding things from there.