Hi Lukáš,
We handle a very similar use case. Our design is as follows:
- An HTTP POST is submitted to /{dataType}/uploadData with the file in the body (multipart/form-data)
- The HTTP controller takes the byte[] data and stores it in an 'uploadRepository'. The repository returns a unique ID ("U1") for this file
- The controller then takes the remaining posted fields and submits a command: SubmitUploadCommand(uploadId = U1, fileName = "data.xlsx", …)
- The aggregate is created by the SubmitUploadCommand and fires an UploadSubmittedEvent
- The command handler also has access to a ‘spreadsheet validator / parser’ service:
    Set errors = spreadsheetValidator.validateUpload("U1")
- If the validator/parser service returns any errors, those are recorded as additional events, and the upload moves into a 'failed' state (UploadFailedEvent)
- The aggregate never handles the byte[] data directly; instead it passes the uploadId to a service that can retrieve the data from the repository.
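To make the flow concrete, here's a minimal sketch of the controller side. All names (UploadRepository, SubmitUploadCommand, handleUpload) are placeholders for illustration, not our actual classes, and the repository is in-memory rather than database-backed:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// In-memory stand-in for the 'uploadRepository'; ours is backed by a database BLOB.
class UploadRepository {
    private final Map<String, byte[]> files = new HashMap<>();

    String store(byte[] data) {
        String id = "U" + UUID.randomUUID();  // unique ID, like "U1" in the example
        files.put(id, data);
        return id;
    }

    byte[] load(String uploadId) {
        return files.get(uploadId);
    }
}

// The command carries only the upload ID and metadata, never the byte[] payload.
record SubmitUploadCommand(String uploadId, String fileName) {}

public class UploadController {
    private final UploadRepository uploadRepository = new UploadRepository();

    // Simplified handler for POST /{dataType}/uploadData
    SubmitUploadCommand handleUpload(byte[] fileData, String fileName) {
        String uploadId = uploadRepository.store(fileData);  // 1. store bytes, get ID
        return new SubmitUploadCommand(uploadId, fileName);  // 2. command references the ID only
    }

    public static void main(String[] args) {
        UploadController controller = new UploadController();
        SubmitUploadCommand cmd = controller.handleUpload(new byte[]{1, 2, 3}, "data.xlsx");
        System.out.println("Dispatching: " + cmd);
    }
}
```

The key point is that the command stays small: it references the stored file by ID, so it serializes and routes cheaply.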
We’ve found that some of the data processing can take several seconds, so we created a saga for this workflow:
Upload Aggregate: UploadSubmittedEvent
Upload Processor Saga: create saga, schedule validation task
Upload Processor Saga: scheduled task runs and validates the file (asynchronously from the user's command)
Upload Processor Saga: Send ‘success’ or ‘failed’ command to Upload Aggregate
Upload Processor Saga: saga ends
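The saga round-trip above could be sketched roughly like this. It is framework-agnostic on purpose (a plain Consumer stands in for the command gateway, and the event/command names are invented), so treat it as a shape, not our actual implementation:

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Events and commands for the saga round-trip; names are illustrative.
record UploadSubmittedEvent(String uploadId) {}
record MarkUploadSucceededCommand(String uploadId) {}
record MarkUploadFailedCommand(String uploadId, Set<String> errors) {}

interface SpreadsheetValidator {
    Set<String> validateUpload(String uploadId);  // returns validation errors, empty if OK
}

public class UploadProcessorSaga {
    private final SpreadsheetValidator validator;
    private final Consumer<Object> commandBus;  // stand-in for a real command gateway

    UploadProcessorSaga(SpreadsheetValidator validator, Consumer<Object> commandBus) {
        this.validator = validator;
        this.commandBus = commandBus;
    }

    // Started by UploadSubmittedEvent; validation runs asynchronously,
    // so the user's original command returns immediately.
    CompletableFuture<Void> on(UploadSubmittedEvent event) {
        return CompletableFuture.runAsync(() -> {
            Set<String> errors = validator.validateUpload(event.uploadId());
            if (errors.isEmpty()) {
                commandBus.accept(new MarkUploadSucceededCommand(event.uploadId()));
            } else {
                commandBus.accept(new MarkUploadFailedCommand(event.uploadId(), errors));
            }
            // The saga ends here; a new saga would be created for any further step.
        });
    }

    public static void main(String[] args) {
        UploadProcessorSaga saga = new UploadProcessorSaga(
                uploadId -> Set.of(),                        // validator that finds no errors
                cmd -> System.out.println("Sent: " + cmd));  // print instead of dispatching
        saga.on(new UploadSubmittedEvent("U1")).join();
    }
}
```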
We built a whole workflow around this: the upload aggregate's state model goes from “submit” through “validate” and “process” to “success” (or “failed”). The aggregate only tracks the file ID (“U1”); the saga or another infrastructure service handles the actual file data.
I would avoid placing large data on the command itself. In our case, we serialize commands (for logging/auditing) and route them via JGroups (clustered command bus); large amounts of binary data there would break that infrastructure.
We ended up storing the data as a BLOB in the database, so it’s accessible by any VM, but shared file systems, S3 or any other storage solution would work equally well.
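Because the rest of the system only ever sees upload IDs, the storage choice can sit behind a narrow interface. Here's a sketch (UploadStorage and the implementation are invented names) with a shared-file-system variant; a BLOB- or S3-backed implementation would have the same shape:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// The aggregate and saga only see upload IDs; where the bytes live is an
// implementation detail behind this interface (database BLOB, shared FS, S3, ...).
interface UploadStorage {
    String store(byte[] data);
    byte[] load(String uploadId);
}

// Shared-file-system variant: any VM mounting the same directory can read the file.
class FileSystemUploadStorage implements UploadStorage {
    private final Path directory;

    FileSystemUploadStorage(Path directory) {
        this.directory = directory;
    }

    @Override
    public String store(byte[] data) {
        String id = UUID.randomUUID().toString();
        try {
            Files.write(directory.resolve(id), data);
        } catch (IOException e) {
            throw new RuntimeException("Failed to store upload", e);
        }
        return id;
    }

    @Override
    public byte[] load(String uploadId) {
        try {
            return Files.readAllBytes(directory.resolve(uploadId));
        } catch (IOException e) {
            throw new RuntimeException("Failed to load upload " + uploadId, e);
        }
    }

    public static void main(String[] args) throws IOException {
        UploadStorage storage = new FileSystemUploadStorage(Files.createTempDirectory("uploads"));
        String id = storage.store("hello".getBytes());
        System.out.println(new String(storage.load(id)));  // prints "hello"
    }
}
```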
Taking the actual file processing out of the aggregate and into a saga works great for giving users feedback on their potentially long-running processes. Just account for it in your state model (Upload Submitted -> Upload Validated -> Processing Started -> Upload Processed -> Deactivate Requested -> Upload Deactivated).
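That state model can be captured as a small state machine so that illegal transitions are rejected in one place. A sketch (state and method names are my own, and I've added FAILED as a terminal state reachable from the early stages):

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Upload lifecycle from the state model above; each state lists its legal successors.
enum UploadState {
    SUBMITTED, VALIDATED, PROCESSING, PROCESSED, DEACTIVATE_REQUESTED, DEACTIVATED, FAILED;

    private static final Map<UploadState, Set<UploadState>> TRANSITIONS =
            new EnumMap<>(UploadState.class);
    static {
        TRANSITIONS.put(SUBMITTED, EnumSet.of(VALIDATED, FAILED));
        TRANSITIONS.put(VALIDATED, EnumSet.of(PROCESSING, FAILED));
        TRANSITIONS.put(PROCESSING, EnumSet.of(PROCESSED, FAILED));
        TRANSITIONS.put(PROCESSED, EnumSet.of(DEACTIVATE_REQUESTED));
        TRANSITIONS.put(DEACTIVATE_REQUESTED, EnumSet.of(DEACTIVATED));
        TRANSITIONS.put(DEACTIVATED, EnumSet.noneOf(UploadState.class));  // terminal
        TRANSITIONS.put(FAILED, EnumSet.noneOf(UploadState.class));       // terminal
    }

    boolean canTransitionTo(UploadState next) {
        return TRANSITIONS.get(this).contains(next);
    }

    public static void main(String[] args) {
        System.out.println(SUBMITTED.canTransitionTo(VALIDATED));  // true
        System.out.println(PROCESSED.canTransitionTo(SUBMITTED));  // false
    }
}
```

The aggregate's command handlers can then reject any command whose target state isn't a legal successor of the current one.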
If you do use sagas for asynchronous processing, I would recommend keeping them as short-lived as possible (end the saga after every step, and create a new saga if another step is needed).
We have separate aggregates for the “Upload” and the actual objects that are being created from the uploaded data.
~Patrick