Tradeoffs in messaging oriented architectures

Messaging systems are a very important component of the data pipeline in any large scale system.   Consider a simple web application that serves all its requests online (ie directly from the source of truth data store without any preprocessing or offline index generation):


Pretty soon as load increases and horizontal scaling schemes (partitioning, replication, LB etc) have been exhausted strict consistency has to be relaxed by separating the read and write paths, which necessitates the employment of message passing (or event dispatching) across the differents components (services) in your ecosystem, resulting in (a broader version of) the following (each tier is actually Load Balanced in a cluster for high availability):screen-shot-2016-12-10-at-2-56-51-pm

Message passing in a service-oriented architecture (SOA) is very critical for scalability and availability (from enabling eventual consistency to offline processing and analytics to buffering of event firehoses and so on).   In the simplest and most abstract term, message passing can be just thought of as firing off events (user action, metrics, errors, conditions, sensor data, stock prices etc) for asynchronous consumption and processing by one or more consumers depending on application needs.    Several systems enable messaging, eg Kafka, Google PubSub, Amazon SQS, RabbitMQ, ZeroMQ, and others.  While the overarching goal is asynchronous message delivery and consumption, these systems vary greatly on delivery semantics, persistence, ordering, access pattern optimization and so on.

Delivery Semantics

When messages are sent/fired, their delivery is never guaranteed in a distributed system (due to failures, corruption, unbounded network partitions etc).   At the highest level there are 3 delivery guarantees:

  1. exactly-once – Messages are delivered exactly once, processed exactly once and “verified/acknowledged” exactly once despite the level of resiliency in a system in a scalable and cost-efficient manner.  In practice, this requires a lot of application of idempotency constraints/tricks and expensive infrastructure and coordination.
  2. at-most-once: Messages are delivered, processed, verified at most once (including never).   Unfortunately if an acknowledgment is never received by the event processor or the receiver terminated during the processing the message is lost with no traces of its status.
  3. at-least-once: Messages can be delivered repeated until a successful verification – by far the simplest guarantee to provide.

Most systems provide at-least-once guarantee, and really, it takes us a very long way.   By easing the strictness of these guarantees, implementation costs are lowered and other semantics can be built *on top* of this.  A simple and widespread of this in practise is TCP/IP which provides ordering and reliability by building on top of IP (which is unreliable and unordered).

Persistence/Durability and Consumer state

Persistence and durability of messages affect delivery and processing requirements for consumers.   This is a major tradeoff between write-only queues and message brokers.  For example write-only queues like Kafka (and Amazon Kinesis and IBM MessageHub and others) are extremely producer/publisher-centric in that messages are published (reliably and ordered) at very high rates, but ultimately it is up-to the design of the consumers (in the architecture) to ensure a meaningful processing of messages and offset advancements in the message logs/queues.  ie Messages write scaling is independent of the number (or online/offline state) of consumers.  Adversely, higher durability and persistence guarantees are required since messages may be consumed offline at a different time to when the messages were actually published (which puts an upward pressure on write latencies).

In comparison typical message brokers (like RabbitMQ and *MQ) are great for enabling complicated topic and routing strategies but come at expense of state management across the consumers (by the broker).   Additionally depending on persistence configurations, the extra cost of state management affects how inexpensively writes can be scaled (hint: it is expensive) so are suited for cases with low write rates and suited more for “online” consumers rather than offline consumption and processing of messages.   If the requirement for asynchronicity is for online consumers for immediate consumption of messages then persistence constraints can be relaxed as the need and significance of “historical” messages may be low (eg in online learning systems).

Interface to a messaging system

Despite the fact that a messaging system is essentially an abstraction for storing messages for asynchronous consumption, the tradeoffs above influence different usage scenarios.   It so happens that building on top of a simpler abstraction (like a high throughput durable ordered queue) can lead to interesting routing strategies and workflows like those possible by a more complicated message broker.   What this boils down is to let the topology of consumers (and subsequent publishers) guide your guarantees at different parts of the system which requires more fine tunning and manual approach in topology selection and design.   There are ways to build this in a more reasoned approach, which is the topic of another blog (Spoiler Alert – Stream Processing).




Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s