At Coremetrix we recently made some big changes to our architecture, and that is where the inspiration for this article came from. We are fortunate enough not to have to deal with a monolithic application but rather with microservices. We run in the cloud (AWS) and we are big fans of automation, from infrastructure as code (Terraform, Ansible) to our tests (CDC tests, synthetic monitoring). While a microservice architecture has a lot of benefits, as your architecture evolves you can end up with quite a lot of dependencies between your services.
This article highlights some of the benefits of moving towards an asynchronous architecture and some of the lessons we learned along the way.
Synchronous vs. Asynchronous
A synchronous operation is when you ask someone else for something and you wait for them to respond. The operation occurs in real time. As illustrated in the diagram below, after making a request, Alice has to actively wait for Bob to respond. Both Alice and Bob have to work together at the same time; in other words, they are synchronised.
An asynchronous operation is when you send someone a message and you do not wait for them to reply. You can go about your business and only react once you receive a reply. Email is one example of asynchronous communication. The diagram below illustrates Alice and Bob working in an asynchronous way.
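The difference between the two styles can be sketched in a few lines of Scala using only the standard library. This is an illustration, not code from our services: `askBob`, the question text and the 200 ms delay are all made up for the example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object AliceAndBob {
  // Hypothetical stand-in for Bob: answering takes a little while.
  def askBob(question: String): String = {
    Thread.sleep(200)
    s"Bob's answer to '$question'"
  }

  // Synchronous: Alice is blocked until Bob has answered.
  def synchronousAlice(): List[String] = {
    val answer = askBob("lunch?")
    List(answer, "Alice did other work")
  }

  // Asynchronous: Alice sends the question, carries on with her own work,
  // and only deals with the reply once it has arrived.
  def asynchronousAlice(): List[String] = {
    val reply: Future[String] = Future(askBob("lunch?"))
    val otherWork = "Alice did other work" // done while Bob is still busy
    // Blocking here is only for the demo, so the result can be returned.
    val answer = Await.result(reply, 2.seconds)
    List(otherWork, answer)
  }
}
```

In the synchronous version nothing else happens until `askBob` returns. In the asynchronous version Alice's other work is finished before the reply is collected.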
Microservices is an architectural style in which a software system is built from independently deployable, loosely coupled services. Rather than creating one large application that does everything (in other words, a monolith), you create a suite of modular services, where each service has its own well-defined function. Microservices integrate via well-defined interfaces, most commonly REST over HTTP.
Another important aspect of a microservice is data encapsulation. Each service owns its own data and this data is only accessed via its interface.
I’ve definitely seen places where so-called microservices shared the same database. This is a major anti-pattern: you pay the cost of a distributed application and lose a lot of the benefits of a microservice architecture. Going back and fixing things like this is time-consuming.
A microservice architecture allows you to choose the right technology for each specific problem. Services are also easier to maintain because each codebase is a lot smaller.
Our journey started like this: we followed best practices, some of them mentioned above, and we took into account things like the twelve-factor app. We took tests seriously and introduced Consumer-Driven Contracts (CDC) and synthetic monitoring. CDC tests were relatively simple to implement: each test was written from the point of view of a consumer and described how that consumer expected an API to behave. Tests created their own data and cleaned up after themselves.
We also made sure that we decoupled legacy services that were integrated at the database level. We ended up with a microservice architecture where each service could be safely deployed multiple times a day. Each service owned only the data it needed to do its job.
As the application evolved, more and more microservices were created, and introducing certain features became a bit harder as some of the services inevitably became quite coupled to one another. Furthermore, if one service became unavailable, this immediately affected the other services that depended on it.
The diagram below illustrates a simplified version of how the architecture looked at that point in time. You can easily imagine that if service A becomes unavailable for a period of time, service I cannot do its work.
Apart from the fact that a calling service was impacted by errors in the services it called, we also lost some flexibility because the services knew about one another.
Asynchronous Microservice architecture
Introducing a message queue like RabbitMQ and having services interact with each other through it solved a lot of the problems described above. The number of dependencies was minimised (less coupling), so each service became more autonomous. More importantly, a service can continue to work even if its downstream dependencies are down.
In this new architecture none of these services know about each other, and availability is increased. Imagine that service I scores certain events and service A produces those events. If service I becomes unavailable for a while, A can continue to function. In the meantime the messages accumulate on the queue, and when the scoring service becomes available again they are consumed.
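A toy model makes this behaviour concrete. The sketch below replaces the broker with an in-memory queue, so it says nothing about RabbitMQ itself. The `Event` type, the function names and the scoring rule are hypothetical.

```scala
import scala.collection.mutable

object QueueDecoupling {
  final case class Event(id: Int)

  // Hypothetical in-memory stand-in for a broker queue such as RabbitMQ.
  val queue: mutable.Queue[Event] = mutable.Queue.empty

  // Service A only knows about the queue, not about who consumes from it.
  def produce(event: Event): Unit = {
    queue.enqueue(event)
    ()
  }

  // Service I drains whatever has accumulated once it is available again.
  def consumeAll(score: Event => Int): List[Int] = {
    val scores = List.newBuilder[Int]
    while (queue.nonEmpty) scores += score(queue.dequeue())
    scores.result()
  }
}
```

While the consumer is "down", calls to `produce` keep succeeding and the queue simply grows. A later call to `consumeAll` processes the backlog in the order the events were published.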
This new setup comes with some new problems that you will need to take into account. What happens if a message cannot be consumed? How do you make sure you do not lose messages? What if your message queue fills up? At Coremetrix, having chosen RabbitMQ as our message queue, we answered these questions by making sure every queue has an equivalent dead letter queue where rejected messages end up. These messages can be manually inspected and even replayed later on. We also mirrored our queues across multiple nodes for high availability. Finally, we set up alerts on queue sizes.
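As a rough sketch of the dead letter part: in RabbitMQ, dead-lettering is enabled per queue through the optional arguments `x-dead-letter-exchange` and `x-dead-letter-routing-key`. The helper below only builds that argument map. The `.dlq` suffix, the exchange name and the object itself are assumptions made for illustration, not our actual naming.

```scala
object DeadLetterConfig {
  // Hypothetical naming convention: queue "x" gets a dead letter queue "x.dlq".
  def deadLetterQueueName(queue: String): String = s"$queue.dlq"

  // Arguments for declaring `queue` so that rejected messages are re-routed
  // through `deadLetterExchange`, using the dead letter queue's name as the
  // routing key.
  def queueArguments(queue: String, deadLetterExchange: String): Map[String, AnyRef] =
    Map(
      "x-dead-letter-exchange" -> deadLetterExchange,
      "x-dead-letter-routing-key" -> deadLetterQueueName(queue)
    )
}
```

With the RabbitMQ Java client, a map like this would be converted to a Java map and passed as the last parameter of `Channel.queueDeclare`. The dead letter queue itself still has to be declared and bound to the dead letter exchange, otherwise rejected messages have nowhere to go.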
Another complication that we had to overcome was testing. Automated testing is a lot more problematic when it comes to asynchronous systems. Our tests consist of unit tests and integration tests that run against Docker containers and stubs when we do a build. On top of that we have CDC tests and synthetic tests after deployment to each environment.
One of the first things we had to do was to write integration tests for the integration with the queues. These tests give us the confidence that we are using the infrastructure correctly.
Secondly, after the new changes were introduced, the majority of our CDC tests started failing. In these tests we had to replicate the behaviour of the producer and poll the downstream API to check for the response. Our tests are written in Scala; a typical test publishes a message the way the producer would and then retries a call to the downstream API until the expected response is served.
You also need to set up a PatienceConfig to define how long you tolerate unsuccessful attempts and how long to sleep between attempts.
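A test of this shape might be sketched as follows, assuming ScalaTest 3.1 or later and its `Eventually` trait. `FakeScoringSystem` is a hypothetical in-memory stand-in for the real producer, queue and API, and the 5 second timeout and 100 ms interval are illustrative values rather than our real settings.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

import org.scalatest.concurrent.Eventually
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import org.scalatest.time.{Millis, Seconds, Span}

// Hypothetical stand-in for the system under test: a published event only
// becomes visible through the "API" after a delay, as it would when a
// consumer picks it up from a queue.
object FakeScoringSystem {
  private val scores = TrieMap.empty[String, Int]

  def publishEvent(eventId: String, value: Int): Unit = {
    Future {
      Thread.sleep(300) // simulated queueing and processing time
      scores.put(eventId, value * 2)
    }
    ()
  }

  def fetchScore(eventId: String): Option[Int] = scores.get(eventId)
}

class ScoringCdcSpec extends AnyFlatSpec with Matchers with Eventually {
  // Keep retrying for up to 5 seconds, sleeping 100 ms between attempts.
  implicit override val patienceConfig: PatienceConfig =
    PatienceConfig(timeout = scaled(Span(5, Seconds)), interval = scaled(Span(100, Millis)))

  "the scoring API" should "eventually serve the score for a published event" in {
    // Act as the producer would.
    FakeScoringSystem.publishEvent("event-1", 21)

    // Poll the downstream API until the assertion passes or patience runs out.
    eventually {
      FakeScoringSystem.fetchScore("event-1") shouldBe Some(42)
    }
  }
}
```

The first few attempts inside `eventually` fail because the score is not there yet. ScalaTest retries the block until the assertion passes, and fails the test only if the timeout elapses first.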
These tests give us the confidence that messages can be published/consumed and served as expected via the tested API.
Moving to an asynchronous architecture made our system more robust and less coupled, and improved performance and scalability. It also improved reliability by making our system more tolerant of errors. There are definitely a lot of new things to learn and new problems to take into account. Testing can be more complicated, but that is a price worth paying.
Head of Engineering