CQRS/ES stands for Command Query Responsibility Segregation and Event Sourcing. These are different techniques and can be used separately. But when used together, they enable us to build systems that are more resilient and elastic. As always, we need to make trade-offs when we plan to use these techniques.
In the traditional style of data persistence, every update in the database overwrites the older state of the data with the current state. So the previous state is no longer available and we only have the current version of the data. This approach is called State Based Persistence. It captures the current state of the data, but doesn't tell us how we got there. If we find an issue with the current state that was caused by some change in the past, we have no history to trace it back and fix it.
A workaround for the problems of state based persistence is to use an audit log. The audit log captures all the events in history that led to the current state. So we can check which event in the past caused the issue and fix the current state. But an audit log has its own issues. First, the audit log and the state can go out of sync. Second, if we store the audit log in a file, we can't update the data and the audit file in the same transaction. Also, we now have to decide which of the two is the source of truth: the data or the audit log.
Event Sourcing (ES) is an alternative approach that addresses all these issues. ES eliminates persistence of state and only persists the events that cause changes in state. The event log is append-only: we never modify or delete an existing event. So ES is basically about capturing the intent rather than the destination. It records the journey of all the events that led to the current state of the system, and to determine that state we just replay the events. We need to take care that the side effects of the original execution are not repeated when we replay. For example, suppose that when an event was originally captured, it also sent a notification. While replaying the event to determine the current state, we must make sure the notification is not sent again.
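As a minimal sketch of the replay idea, assume a hypothetical bank account whose only events are `Deposited` and `Withdrawn` (these names, the `Account` class and the in-memory `sent` list standing in for a notification service are illustrative and not from any particular framework). The state transition is kept in a pure function, so replaying the log never triggers the notification again.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deposited:  # hypothetical event type
    amount: int


@dataclass(frozen=True)
class Withdrawn:  # hypothetical event type
    amount: int


def apply_event(balance, event):
    """Pure state transition: safe to call any number of times."""
    if isinstance(event, Deposited):
        return balance + event.amount
    if isinstance(event, Withdrawn):
        return balance - event.amount
    raise ValueError(f"unknown event: {event!r}")


class Account:
    """Hypothetical aggregate whose state is derived only from its events."""

    def __init__(self):
        self.log = []   # append-only event log; entries are never changed
        self.sent = []  # stands in for a real notification service

    def record(self, event):
        # Live path: append the event, then perform the side effect once.
        self.log.append(event)
        self.sent.append(f"notify: {event}")

    def balance(self):
        # Replay path: fold over the log without touching notifications.
        total = 0
        for event in self.log:
            total = apply_event(total, event)
        return total


account = Account()
account.record(Deposited(100))
account.record(Withdrawn(30))
print(account.balance())
```

Calling `balance()` repeatedly replays the same two events each time, while `sent` keeps exactly one entry per recorded event.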
In most cases, the number of events we need to replay to determine the current state is not very high. But for some use cases it may run to hundreds or thousands of events, and replaying that many can hurt system performance. For such scenarios we can use a snapshot based approach: we take a snapshot of the state at regular intervals, which records the state of the system at that point in time. When we need the current state, we start from the latest snapshot and replay only the events that occurred after it.
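The snapshot approach can be sketched roughly as follows. Everything here is an assumption made for illustration: events are plain signed amounts, the state is a running balance, a snapshot is taken every `SNAPSHOT_EVERY` events, and snapshots live in a dictionary keyed by the number of events they cover.

```python
SNAPSHOT_EVERY = 4  # hypothetical interval, chosen only for the example


def apply_event(balance, amount):
    """State transition for the toy model: add the signed amount."""
    return balance + amount


def take_snapshots(events):
    """Return {event_count: state} for every SNAPSHOT_EVERY-th event."""
    snapshots = {}
    state = 0
    for count, event in enumerate(events, start=1):
        state = apply_event(state, event)
        if count % SNAPSHOT_EVERY == 0:
            snapshots[count] = state
    return snapshots


def current_state(events, snapshots):
    """Start from the latest snapshot and replay only the later events."""
    if snapshots:
        count = max(snapshots)
        state = snapshots[count]
    else:
        count, state = 0, 0
    replayed = 0
    for event in events[count:]:
        state = apply_event(state, event)
        replayed += 1
    return state, replayed


events = [10, -3, 5, 20, -7, 1]
snapshots = take_snapshots(events)
state, replayed = current_state(events, snapshots)
print(state, replayed)
```

With six events and a snapshot after the fourth, only the last two events are replayed; without any snapshot the same function falls back to replaying all six.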
ES provides several advantages in system building:
- Built-in audit log
- Append-only database operations are generally more efficient than update operations
- It is easy to rectify errors made in the past: fix the bug, discard the state derived from the events, and replay the events
It is common for the data model to change over time. With state based persistence, this is not much of an issue: we can add or remove columns in the tables and provide default values for records already in the system. With ES it is a bit more complicated, because the event log is sacrosanct. An event in the log can't be modified or deleted. So if the data model evolves, we need to add a new version of the event format. This adds complexity in the application layer, as we need to continue supporting events written in the older versions. A best practice is to use a flexible format for events, like protobuf. On the other hand, something like Java serialization is highly inflexible.
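One common way to keep supporting older event versions is to convert them to the newest shape at read time, often called upcasting. The text above does not prescribe this technique, so treat the following as one possible sketch. JSON is used instead of protobuf only to keep the example self-contained, and the `ReservationMade` event, its fields and the default of one guest are all hypothetical.

```python
import json

# Stored events are never rewritten; the first one is still in version 1.
stored = [
    '{"type": "ReservationMade", "version": 1, "name": "Ada Lovelace"}',
    '{"type": "ReservationMade", "version": 2, "first_name": "Alan",'
    ' "last_name": "Turing", "guests": 2}',
]


def upcast(event):
    """Return a version 2 view of the event without changing the stored record."""
    if event["version"] == 1:
        first, _, last = event["name"].partition(" ")
        return {
            "type": event["type"],
            "version": 2,
            "first_name": first,
            "last_name": last,
            "guests": 1,  # assumed default for events written before the field existed
        }
    return event


events = [upcast(json.loads(raw)) for raw in stored]
for event in events:
    print(event["first_name"], event["last_name"], event["guests"])
```

The rest of the application then only has to understand version 2, while the log itself stays untouched.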
In Domain Driven Design, we use concepts like aggregate and aggregate root. For example, in a reservation system, we can have the reservation as the aggregate root, and it fits well with writing reservations to the database. But if we want to fetch the reservations for a customer, the customer becomes a better aggregate root. This shows a conflict between the read and write models: a single aggregate root cannot serve all possible queries well. Moreover, systems generally have different loads for reads and writes. So supporting both reads and writes with the same model is often not a good idea.
Command Query Responsibility Segregation (CQRS) is a technique to tackle this problem. It is based on having separate models for write and read operations. When the system receives a command, it goes through the write model, and when a query arrives, it goes through a read model. We can have multiple read models, one per use case. For example, in a reservation system we can have one read model to fetch reservations by location (using location as the aggregate root) and another to fetch reservations by customer (using customer as the aggregate root). And because the models are separate, we can optimize each as needed: the write model is optimized for writing while the read models are optimized for reading. For example, we can use separate schemas, such as a denormalized read schema that needs no joins and so makes reads fast. We can use different data stores (polyglot persistence; for instance, Elasticsearch has better read efficiency while Cassandra writes are much more efficient). Or we can use different scaling schemes for reads and writes according to the load requirements.
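A rough sketch of this separation for the reservation example, using in-memory dictionaries in place of real data stores. The `MakeReservation` command, the `WriteModel` and `ReadModels` classes and the synchronous update between them are illustrative assumptions; a real system would typically update read models asynchronously.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MakeReservation:  # hypothetical command
    reservation_id: str
    customer: str
    location: str


class WriteModel:
    """Validates commands and stores reservations keyed by their own id."""

    def __init__(self):
        self.reservations = {}

    def handle(self, command):
        if command.reservation_id in self.reservations:
            raise ValueError("duplicate reservation")
        self.reservations[command.reservation_id] = command
        return command


class ReadModels:
    """Two denormalized views, each shaped for exactly one query."""

    def __init__(self):
        self.by_customer = defaultdict(list)
        self.by_location = defaultdict(list)

    def update(self, reservation):
        self.by_customer[reservation.customer].append(reservation.reservation_id)
        self.by_location[reservation.location].append(reservation.reservation_id)

    def reservations_for_customer(self, customer):
        return list(self.by_customer.get(customer, []))

    def reservations_at_location(self, location):
        return list(self.by_location.get(location, []))


write_model = WriteModel()
read_models = ReadModels()
commands = [
    MakeReservation("r1", "alice", "Paris"),
    MakeReservation("r2", "alice", "Rome"),
    MakeReservation("r3", "bob", "Paris"),
]
for command in commands:
    # Commands go through the write model; the read side is updated afterwards.
    read_models.update(write_model.handle(command))

print(read_models.reservations_for_customer("alice"))
print(read_models.reservations_at_location("Paris"))
```

Queries never touch `WriteModel`, so each side could later be moved to its own store or scaled on its own.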
Though ES and CQRS can be used independently, they are often used together as they complement each other very well. The write model is used just to persist events. But raw events are not a good fit for reads. So we use a process called projection, which consumes events and transforms them into a more read-friendly form, usually a denormalized one. The read model then reads from this projected form rather than from the original events.
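A projection can be sketched as a small consumer that folds events into a query-shaped structure. The event types `ReservationMade` and `ReservationCancelled` and the class `CustomerReservationsProjection` are hypothetical names for illustration, and the event log is a plain list rather than a real event store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReservationMade:  # hypothetical event
    reservation_id: str
    customer: str
    location: str


@dataclass(frozen=True)
class ReservationCancelled:  # hypothetical event
    reservation_id: str


class CustomerReservationsProjection:
    """Denormalized view: customer -> {reservation_id: location}."""

    def __init__(self):
        self.rows = {}
        self._owner = {}  # reservation_id -> customer, needed for cancellations

    def on(self, event):
        if isinstance(event, ReservationMade):
            bookings = self.rows.setdefault(event.customer, {})
            bookings[event.reservation_id] = event.location
            self._owner[event.reservation_id] = event.customer
        elif isinstance(event, ReservationCancelled):
            customer = self._owner.pop(event.reservation_id, None)
            if customer is not None:
                del self.rows[customer][event.reservation_id]


event_log = [
    ReservationMade("r1", "alice", "Paris"),
    ReservationMade("r2", "alice", "Rome"),
    ReservationMade("r3", "bob", "Oslo"),
    ReservationCancelled("r2"),
]

projection = CustomerReservationsProjection()
for event in event_log:
    projection.on(event)

print(projection.rows)
```

A query for one customer's reservations becomes a single dictionary lookup, with no need to scan or interpret the events at read time.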
A CQRS/ES based system also makes it easy to add new read queries. We just need to add a new projection that transforms the events into the desired form. The read and write models can evolve independently as needs change.
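To illustrate adding a query after the fact, the sketch below builds a new view, active reservations per location, purely by replaying an existing log. The dictionary-shaped events and the function name `project_counts_by_location` are assumptions made for this example.

```python
from collections import Counter

# An existing log, written long before the new query was needed.
event_log = [
    {"type": "ReservationMade", "id": "r1", "location": "Paris"},
    {"type": "ReservationMade", "id": "r2", "location": "Paris"},
    {"type": "ReservationMade", "id": "r3", "location": "Oslo"},
    {"type": "ReservationCancelled", "id": "r2"},
]


def project_counts_by_location(events):
    """New read model built only from events that are already stored."""
    location_of = {}
    counts = Counter()
    for event in events:
        if event["type"] == "ReservationMade":
            location_of[event["id"]] = event["location"]
            counts[event["location"]] += 1
        elif event["type"] == "ReservationCancelled":
            location = location_of.pop(event["id"], None)
            if location is not None:
                counts[location] -= 1
    # Drop locations whose reservations were all cancelled.
    return {location: n for location, n in counts.items() if n > 0}


print(project_counts_by_location(event_log))
```

Nothing on the write side changes: the new view is derived entirely from events that were already there.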
Cost of CQRS/ES
CQRS/ES has various advantages as discussed above, but it also incurs some costs, the major ones being:
- A CQRS model is more complex than a single shared model
- May need to maintain multiple data stores
- More classes/objects (commands, events, queries)
- More storage needed due to the event log and data duplication in read models