Best Practices for Message Queue Architecture
Implement Robust and Reliable Messaging Queue Architecture
Recently, the use of message queues has boomed with growing interest in architecture patterns such as microservices, CQRS and event sourcing. A message queue is an important component for coordinating decoupled services and for publish-subscribe communication. Apache Kafka, ActiveMQ, RabbitMQ, ZeroMQ and SQS are a few of the well-known messaging technologies available. The basic concepts and design remain the same for most of them, irrespective of which one you choose for your application. For enterprise systems, data is the source of truth. With many decoupled services coordinating to complete one task, it is important to make the messaging architecture robust and reliable enough to overcome the problems of distributed systems.
Events in Queues
In message queues, producers publish events to the queues and all interested consumers are notified about the event for processing. Decoupling your application in this manner allows you to scale your producers and consumers independently of each other. You can use the Event Notification or the Event-Carried State Transfer technique to design the communication contract between your services. In Event Notification, only the minimal data needed to fetch event-related information is sent along with the event, while in Event-Carried State Transfer the complete data, including the old and new state, is part of the event. If you want to read more about them, please check this article.
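To make the difference between the two contracts concrete, here is a minimal sketch of what each payload might look like. The event name, the order fields and the `to_message` helper are hypothetical and chosen only for illustration; they are not taken from any particular broker or library.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class OrderStatusNotification:
    """Event Notification: just enough data for a consumer to fetch the rest."""
    event_type: str
    order_id: str


@dataclass
class OrderStatusStateTransfer:
    """Event-Carried State Transfer: the event carries the old and new state."""
    event_type: str
    order_id: str
    old_state: dict
    new_state: dict


def to_message(event) -> str:
    # Serialize an event to a JSON string, as a producer might before publishing.
    return json.dumps(asdict(event), sort_keys=True)


notification = OrderStatusNotification("order.status_changed", "order-42")
state_transfer = OrderStatusStateTransfer(
    "order.status_changed",
    "order-42",
    old_state={"status": "PENDING"},
    new_state={"status": "PAID"},
)
```

With the first shape, a consumer has to call back to the owning service to learn what changed. With the second, it can act on the event alone, at the cost of larger messages.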
When an event is pushed to the queue, all interested consumers poll and fetch the new events for processing. A properly working system with no bugs will process those events without issue. But no application or infrastructure is 100% failure-proof. There are many ways our service can go down or misbehave: unknown bugs, or things developers can’t control, such as network outages or brokers crashing unexpectedly. It is our job to make sure that our service can withstand such incidents or recover from an unexpected crash.
Transactions in Producers
Applications often coordinate with multiple systems such as databases, messaging queues, search engines and many more. Idempotency is important for keeping the application state consistent across all the decoupled services coordinating over the network. An idempotent implementation makes sure that retrying the same operation does not change the application state if that action has already been executed. To make sure that the application state is not changed by a failed operation, we should roll back all its actions if an exception is thrown.
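One common way to get idempotency is to remember which events have already been applied and skip repeats. The sketch below assumes every event carries a unique ID; the `AccountService` class and its in-memory set are hypothetical stand-ins for a real service and a persistent store.

```python
class AccountService:
    """Toy service that applies each deposit event at most once."""

    def __init__(self):
        self.balance = 0
        # In a real system this would live in persistent storage.
        self._processed_event_ids = set()

    def apply_deposit(self, event_id: str, amount: int) -> bool:
        """Apply the deposit and return True, or return False for a repeat."""
        if event_id in self._processed_event_ids:
            # Already executed: a retry must not change the state again.
            return False
        self.balance += amount
        self._processed_event_ids.add(event_id)
        return True
```

Delivering the same event twice then leaves the balance unchanged after the first call.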
Most programming languages have libraries that provide transactional support across multiple services. We should use transactions to make sure that all changes made to the application state are rolled back on an exception. Use the transactional support provided by queuing technologies like ActiveMQ, and by client libraries for Apache Kafka, to make your service and its operations transactional. Also, to avoid message duplication, it is best to produce events at the end of the operation, so that an earlier failure never leaves behind an event for work that was rolled back.
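The rollback-and-publish-last idea can be sketched with Python's built-in `sqlite3` module, whose connection object commits on success and rolls back on an exception when used as a context manager. The table layout, the `place_order` function and the `publish` callback (here just a list's `append`) are assumptions made for this example, not a real broker integration.

```python
import sqlite3


def setup_db() -> sqlite3.Connection:
    # In-memory database with one item in stock, for demonstration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, item TEXT, quantity INTEGER)")
    conn.execute("CREATE TABLE stock (item TEXT PRIMARY KEY, available INTEGER)")
    conn.execute("INSERT INTO stock (item, available) VALUES ('book', 5)")
    conn.commit()
    return conn


def place_order(conn, publish, order_id, item, quantity):
    with conn:  # commits on success, rolls back if anything inside raises
        conn.execute(
            "INSERT INTO orders (id, item, quantity) VALUES (?, ?, ?)",
            (order_id, item, quantity),
        )
        updated = conn.execute(
            "UPDATE stock SET available = available - ? "
            "WHERE item = ? AND available >= ?",
            (quantity, item, quantity),
        ).rowcount
        if updated == 0:
            raise RuntimeError("insufficient stock")
        # Publish as the last step, so no event exists for a rolled-back order.
        publish({"event_type": "order.placed", "order_id": order_id})
```

This only orders the steps; it does not make the publish atomic with the database commit. For that, the broker's own transactional API would be needed.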
Use Acknowledgement Wisely
In queues, if we are not storing events in persistent storage and an exception occurs, we might lose the event. Almost all queueing technologies support acknowledgements, a way for the consumer to notify the broker that an event was consumed successfully. Until the consumer acknowledges the event, it will keep getting the same message every time it pulls events from the queue. Acknowledgements can therefore be used as a mechanism to track and control event processing.
This works fine when only one consumer is processing the events, but it limits how far we can scale our consumers as load increases. With more than one consumer, the same event may be processed multiple times, or we may lose the order of event processing. For example, suppose our consumer acknowledges an event only after processing it successfully. If another consumer polls the same queue in the meantime, the message that the first consumer has not yet acknowledged will be delivered to the second consumer as well. Such scenarios lead to duplicate processing or loss of execution order. Although acknowledgements can be used to track the last processed event, we should be careful about acknowledging only after successful processing, especially when multiple consumers read from the same queue. It is advisable to acknowledge the event as soon as the consumer fetches it from the broker. Note that an event acknowledged this early is not redelivered if processing then fails, so the consumer itself has to handle the failure.
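The duplicate-delivery scenario can be reproduced with a toy in-memory queue that redelivers a message on every poll until it is acknowledged. `InMemoryQueue` and `fetch_and_ack` are hypothetical names, and the class models only the behaviour described above, not how any real broker is implemented.

```python
class InMemoryQueue:
    """Toy broker: the head message is redelivered on every poll until acked."""

    def __init__(self, messages):
        self._pending = list(messages)

    def poll(self):
        # Return the oldest unacknowledged message without removing it.
        return self._pending[0] if self._pending else None

    def ack(self, message):
        # Acknowledging removes the message, so it is never delivered again.
        if message in self._pending:
            self._pending.remove(message)


def fetch_and_ack(queue):
    """Acknowledge immediately on fetch, before any processing happens."""
    message = queue.poll()
    if message is not None:
        queue.ack(message)
    return message


# Late acknowledgement: two consumers polling back to back see the same event.
late = InMemoryQueue(["m1", "m2"])
consumer_a_event = late.poll()
consumer_b_event = late.poll()

# Early acknowledgement: each consumer gets a different event, in order.
early = InMemoryQueue(["m1", "m2"])
first_event = fetch_and_ack(early)
second_event = fetch_and_ack(early)
```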
Retrying Event Processing
If a consumer fails to process an event on a stable system where no change has been introduced recently, the cause is often a network or system-level failure. Such issues don’t require a bug fix on the application or consumer side. If you have multiple load-balanced instances of the application running, retrying the event will fix the problem most of the time. Modern queuing libraries commonly provide support for retrying event processing on failure. For example, in the Spring Framework you can use RetryTemplate to reprocess failed Kafka events. Adding some simple retry logic can save you from many customer support contacts during such network glitches or minor system issues.
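For consumers without library support, in-place retry takes only a few lines. This is a minimal sketch with exponential backoff; `process_with_retry`, its parameters and the injectable `sleep` argument are assumptions for illustration. It retries on any exception, whereas a real consumer would usually retry only errors known to be transient.

```python
import time


def process_with_retry(handler, event, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call handler(event), retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(event)
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: let the caller decide what happens next.
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.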
Stop Processing New Events on Continuous Failures
Sometimes we start getting alerts for multiple failing actions, perhaps because of a bug or a temporary outage spanning a few hours. In such scenarios, you observe that all events, or all events affecting a specific flow, are failing, and you need some time to fix the issue. But we need to keep in mind that our system is live in production, and if we keep our consumer running we will lose all the events generated during this period. The retry logic above won’t safeguard us against such incidents, because they require manual intervention and may take some time to fix. To make our consumer resilient against them, we can stop it from consuming one particular type of event. We might need to add custom logic to keep track of failures, and once the number of continuous failures crosses a threshold, we should notify the concerned stakeholders and stop the consumer from consuming events from the queue. This way we won’t lose the events generated during the outage, and they will be processed properly once the issue is identified and fixed.
Such a strategy can be implemented in multiple ways. For example, Kafka provides APIs to manage the state of the consumer, which means we can start and stop the consumer programmatically. So in Kafka consumers, we can keep track of the offsets of failed events, increase a counter on each consecutive failure, and reset it on successful processing. Once the counter of continuous failures crosses the predefined threshold, we can use the Kafka-provided API to stop the consumer until it is told to start processing again. (We can expose REST APIs from the consumer to pause and resume consumption from one or more queues.)
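The counting logic described above is independent of the broker, so it can be sketched without one. `FailureTracker`, `consume` and the `on_threshold` alert callback are hypothetical names. In a real Kafka consumer, the pause and resume steps would call the client's own pause and resume API instead of setting a flag.

```python
class FailureTracker:
    """Counts consecutive failures and pauses consumption at a threshold."""

    def __init__(self, threshold, on_threshold):
        self.threshold = threshold
        self.on_threshold = on_threshold  # e.g. a function that alerts the team
        self.consecutive_failures = 0
        self.paused = False

    def record_success(self):
        self.consecutive_failures = 0

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.paused:
            self.paused = True
            self.on_threshold(self.consecutive_failures)

    def resume(self):
        # Called manually (for example from an admin endpoint) after the fix.
        self.paused = False
        self.consecutive_failures = 0


def consume(events, handler, tracker):
    """Process events in order until paused; return the unprocessed events."""
    remaining = list(events)
    while remaining and not tracker.paused:
        try:
            handler(remaining[0])
        except Exception:
            # The failed event stays at the head so it is not lost.
            tracker.record_failure()
        else:
            tracker.record_success()
            remaining.pop(0)
    return remaining
```

In this sketch a failing event is retried in place until the threshold is reached, which keeps ordering intact but lets one bad event pause the whole consumer. Whether that is acceptable depends on the flow.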
Requeue Failed Events and Dead Letter Queue
For simpler use cases, the business can tolerate the loss of a few messages, and implementing the two strategies above will save you most of the time. But critical businesses, such as payment services, stock markets, order services and almost all financial services, can’t afford to lose any message. For such businesses, losing one message can be hugely expensive, so it is important for those organisations to design their applications to be resilient to all outages and issues. By resilient, I mean that customer data should not be lost, or at least that the failure must be reported to some system so it can be tracked later.
When working with queues, you can implement requeue logic and a dead letter queue for this purpose. If the consumer fails to process an event even after in-place retries, it can push the same message to another queue so that it is processed after some fallback period. You will have to implement logic, or another consumer, that polls this queue and checks whether each event is eligible for requeue (has the requeue threshold been reached? is it useful to process this event again after some time?). If the event is eligible, it is pushed back to the origin queue after the predefined fallback period for re-processing. Such an implementation gives your system some time to recover from the issue and reprocess all failed events successfully.
When you have messages that can’t be processed, such as wrongly formatted events, events for a non-existent resource and similar scenarios, you should notify the concerned teams so they can look at such events and take care of them if required. If the requeue threshold is reached for an event, or the event is not useful to process in the future, it is pushed to another queue, often called a dead letter queue, for notification and persistence purposes. When an event is pushed to the dead letter queue, you can consume it and notify the concerned teams in chat groups, via email or through any other alerting system.
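The requeue and dead letter decisions can be sketched as a small routing function. Plain Python lists stand in for the retry, origin and dead letter queues, and `Envelope`, `UnprocessableEvent`, `route_failed_event` and `release_due_events` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


class UnprocessableEvent(Exception):
    """Raised for events that can never succeed, e.g. a malformed payload."""


@dataclass
class Envelope:
    payload: dict
    requeue_count: int = 0
    not_before: float = 0.0  # earliest time the event may be re-processed


def route_failed_event(envelope, error, retry_queue, dead_letter_queue, now,
                       max_requeues=3, fallback_seconds=60.0):
    """Send a failed event to the retry queue or to the dead letter queue."""
    if isinstance(error, UnprocessableEvent) or envelope.requeue_count >= max_requeues:
        # Not worth retrying: park it for notification and manual follow-up.
        dead_letter_queue.append(envelope)
        return "dead_letter"
    envelope.requeue_count += 1
    envelope.not_before = now + fallback_seconds
    retry_queue.append(envelope)
    return "requeued"


def release_due_events(retry_queue, origin_queue, now):
    """Move events whose fallback period has elapsed back to the origin queue."""
    due = [e for e in retry_queue if e.not_before <= now]
    retry_queue[:] = [e for e in retry_queue if e.not_before > now]
    origin_queue.extend(due)
    return len(due)
```

A separate consumer on the dead letter queue could then forward each parked event to chat, email or another alerting system.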
Messaging queues add one more important tool to your toolbox for serving your customers and users better and for improving an architecture built on decoupled services. With decoupled services, we have to consider many scenarios to make our architecture foolproof. It is not practical to implement and maintain all the solutions discussed above for all your consumers. How you should design your applications with queues depends heavily on the business requirements. For example, you can skip requeuing failed events and the dead letter queue if the business is not concerned about intermittent data loss or your application can tolerate losing a few messages. But for critical business requirements, such as financial systems, it is important to consider the solutions above to reduce inconsistency and data loss in your application. In short, before designing your architecture with queues, first evaluate business criticality and then implement the solutions based on the requirements and granularity of the consumers.