As someone who has used RabbitMQ in production for many years, I would suggest considering NATS [1] for RPC instead.
RabbitMQ's high availability support is, frankly, terrible [2]. It's a single point of failure no matter how you turn it, because it cannot merge conflicting queues that result from a split-brain situation. Partitions can happen not just during network outages, but also under high load.
NATS is also a lot faster [3], and its client network protocol is so simple that you can implement a client in a couple hundred lines in any language. Compare that to AMQP, which is complex, often implemented wrong, and requires a lot of setup on the client side (at the very least: declare exchanges, declare queues, then bind them). NATS does topic-based pub/sub out of the box, no schema required.
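To illustrate the setup overhead, here's roughly the minimum an AMQP consumer does before it can receive anything, as a sketch using the streadway/amqp Go client; the exchange, queue and routing key names are made up, and error handling is panicky for brevity:

    package main

    import "github.com/streadway/amqp"

    func main() {
        // AMQP setup before a service can receive a single request:
        // connect, open a channel, declare an exchange, declare a queue,
        // bind them, then start consuming.
        conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        ch, err := conn.Channel()
        if err != nil {
            panic(err)
        }

        if err := ch.ExchangeDeclare("services", "topic", true, false, false, false, nil); err != nil {
            panic(err)
        }
        q, err := ch.QueueDeclare("service.user.get", true, false, false, false, nil)
        if err != nil {
            panic(err)
        }
        if err := ch.QueueBind(q.Name, "user.get", "services", false, nil); err != nil {
            panic(err)
        }
        msgs, err := ch.Consume(q.Name, "", true, false, false, false, nil)
        if err != nil {
            panic(err)
        }
        for m := range msgs {
            _ = m // handle the request here
        }
    }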
(Re performance, relying on ACK/NACK with RPC is a bad idea. The better solution is to move retrying into the client side and rely on timeouts, and of course error replies.)
RabbitMQ is one of the better message queue implementations for scenarios where you need the bigger features it provides: durability (on-disk persistence), transactions, cross-data center replication (shovel/federation plugins), hierarchical topologies and so on.
While a simple HTTP interface is easy to code around, I quite like the AMQP protocol. It's fast, efficient, reliable and powerful.
We currently use it to send hundreds of thousands of messages per second, large and small, around the world to different data centers and it always works smoothly.
Note that NATS is not HTTP, it's its own very simple text-based protocol.
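For a flavour of how simple it is, an RPC-style exchange over NATS looks roughly like this on the wire (subject and inbox names are made up, and the \r\n line endings are omitted):

    INFO {"server_id":"...","max_payload":1048576}   <- server greeting
    CONNECT {"verbose":false}                        <- client handshake
    SUB user.get 1                                   <- service subscribes to a subject
    PUB user.get _INBOX.abc123 10                    <- requester publishes with a reply-to inbox
    {"id": 42}
    MSG user.get 1 _INBOX.abc123 10                  <- server delivers to the subscriber
    {"id": 42}
    PUB _INBOX.abc123 15                             <- service replies to the requester's inbox
    {"name": "bob"}
    PING                                             <- periodic keepalive, answered with PONG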
AMQP is nice, but I don't think anyone would categorize it as simple. It's binary, for one. And to read and write it, you have to deal with framing, which is not always easy to get right. For example, for a long time RabbitMQ had an issue with dead-letter exchanges (aka DLX), where each bounce would add a header to the message envelope. DLX is great for retries, but after a bunch of retries a message could get quite large. Some clients (the Node.js client in particular) have a small limit on frame sizes and will throw on such messages rather than grow the buffer. (Fortunately, this header growth was fixed in a recent RabbitMQ version.)
Didn't say AMQP is simple though. It's definitely not, but that's where the features and functionality come from. It seems the issues you described are with the broker and clients, not the protocol itself, which I find to be pretty solid.
AMQP also specifies the data model (exchanges, queues, bindings and so forth), which dictates the implementation of it. I find that data model a bit heavy-handed, but it's not terrible. However, it's not very suitable for RPC.
I spoke to Kyle Kingsbury (Aphyr) about 9 months ago and at that time, the only queue or pub/sub system he thought was relatively safe from partition errors was Kafka. Not sure if his position has changed recently.
Indeed, though this is irrelevant to this particular use case.
NATS doesn't have replication, sharding or total ordering. Consistency is a challenge for clustered messaging brokers that need those guarantees.
With NATS, queues are effectively sharded by node. If a node dies, its messages are lost. Incoming messages to the live nodes will still go to connected subscribers, and subscribers are expected to reconnect to the pool of available nodes. Once a previously dead node rejoins, it will start receiving messages.
NATS in this case replaces something like HAProxy: a simple in-memory router of requests to backends.
This is also my personal experience with message queues, even though I haven't had a chance to work with NATS yet. Kafka is just a really solid piece of engineering when you need 5-50 servers. With that many servers you can handle millions of messages per second, which is usually enough for a mid-size company. I am not sure about higher scale, but I believe LinkedIn has much larger clusters.
Initially when developing Alchemy we had a look at NSQ http://nsq.io and found it a little difficult (that was a year or so ago, so it might be better now). Then we started looking at RabbitMQ and thought it fit our requirements better. I have not heard of NATS but will definitely have a look. At the moment Alchemy is tied to RabbitMQ pretty tightly, but abstracting and supporting many queue solutions would be good.
I don't like pushing the retry into the client side. Since microservices by necessity have lots of communication between them, that can be quite a bit of code across all the services. I would rather the architecture deal with it and just ensure that endpoints are idempotent so calls can be retried without adding client complexity. This is a personal preference, and in some cases clients do need to deal with retries, but I just like it not to be the default.
1. cluster_partition_handling set to autoheal so that it will do its best to recover (a few lost messages are infinitely better than a broken system)
2. queue_master_locator is min-masters, so queues are mastered on nodes where the fewest other queues are mastered. This will balance the queues across the cluster, meaning that if a node goes down there will be a minimal number of queues to recreate
3. A mirror policy to mirror every queue (this will only mirror service queues, because response queues are exclusive). This will make the system a bit slower, but much more robust.
This is enough to handle split brain (although this is difficult to test) and nodes going down and coming back (much easier to test).
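For anyone wanting to replicate this, a rough sketch of that configuration, using the newer ini-style rabbitmq.conf plus a mirroring policy; the policy name and pattern are just examples, so adjust for your RabbitMQ version:

    # rabbitmq.conf
    cluster_partition_handling = autoheal
    queue_master_locator = min-masters

    # mirror every queue (exclusive response queues are skipped automatically)
    rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}' --apply-to queues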
Destructive cluster recovery, patched with mirroring, raises the question of how suitable RabbitMQ is for the job in the first place. Mirroring for RPC requests, think about it for a second! Mirroring. For RPC!
RabbitMQ's autohealing just solves your problem in the wrong way. Yes, it will usually fix itself (if it doesn't die with a Mnesia inconsistent_database error), but it will discard messages, and you won't know which ones.
Meanwhile, NATS will forward messages to subscribers as long as there's a clear path. There are no network partition issues because the queues don't have RabbitMQ's strict, total ordering.
Note that RabbitMQ is notoriously sensitive to partitions; one small blip and it gets its knickers in a twist. This is why I recommend increasing the net_ticktime option to something like 180 so you're less exposed.
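For reference, net_ticktime is an Erlang kernel parameter, so (as far as I remember) it goes into the Erlang-term config file (advanced.config / rabbitmq.config) rather than the regular RabbitMQ settings, roughly:

    [
      {kernel, [{net_ticktime, 180}]}
    ].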
Having done this for a long time, my advice is that making the client more intelligent is always the better option. If you're reluctant, consider a sidecar proxy like Linkerd [1] which can handle the gritty details for you.
HTTP/2 definitely makes it faster for browsers to load assets in a webpage; however, I'm not sure how much it would speed up individual REST requests, since most of the time is bound to the request/response round trip.
Just clarifying the comment about the benchmark: even though there are both Go and Node.js components, the actual bit being benchmarked was HTTP versus NATS for the inter-service communication. All the code is available on GitHub if anybody wants to rerun the benchmarks.
HTTP/2's main feature is that it's multiplexed, which a client written in a suitably async-friendly language can exploit to pipeline parallel requests, or at least better reuse connections. There's presumably not too much performance gain (though header compression helps) if your client can't exploit the multiplexing.
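To make the multiplexing point concrete, here's a minimal Go sketch (the URL is a placeholder): against an HTTPS server that supports HTTP/2, Go's default client runs all of these concurrent requests over one multiplexed connection; against an HTTP/1.1 server the same code falls back to multiple connections.

    package main

    import (
        "fmt"
        "net/http"
        "sync"
    )

    func main() {
        client := &http.Client{} // Go negotiates HTTP/2 automatically for HTTPS servers that support it

        var wg sync.WaitGroup
        for i := 0; i < 10; i++ {
            wg.Add(1)
            go func(n int) {
                defer wg.Done()
                // All ten requests can share one multiplexed HTTP/2 connection.
                resp, err := client.Get("https://example.com/api/resource")
                if err != nil {
                    fmt.Println("request", n, "failed:", err)
                    return
                }
                resp.Body.Close()
                fmt.Println("request", n, "proto:", resp.Proto) // "HTTP/2.0" when multiplexed
            }(i)
        }
        wg.Wait()
    }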
Setting RabbitMQ's "cluster_partition_handling" to "pause_minority" in theory ameliorates split-brain issues in clusters of 3 or more (odd numbers), where the majority would ignore the minority nodes.
NSQ and NATS are my go-to tools for messaging, though NSQ seems more flexible to me because it supports message persistence and also provides NATS-like ephemeral channels for when persistence is not a hard requirement. And it comes with a shiny admin dashboard, which NATS lacks. NATS is useful when raw performance is a priority.
One thing I do find lacking in both of these queues is support for per-message TTL, though, for pruning time-sensitive messages. I'm not sure what the performance overhead would be for supporting something like that.
Would this library be better off utilizing NATS instead of RabbitMQ, barring any requirements a person might have for the persistence and other features you mentioned?
No, Kafka is completely unsuitable for RPC, for several reasons.
First, its data model shards queues into partitions, each of which can be consumed by just a single consumer (within a consumer group). Assume we have partitions 1 and 2. P1 is empty, P2 has a ton of messages. You will now have one consumer C1 which is idle, while C2 is doing work. C1 can't take any of C2's work because it can only process its own partition. In other words: a single slow consumer can block a significant portion of the queue. Kafka is designed for fast (or at least evenly performant) consumers.
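A toy Go sketch of why that happens (this is not Kafka's real partitioner, which uses murmur2 on the key; it just shows the shape of the problem): every message with the same key lands on the same partition, and each partition belongs to exactly one consumer in the group, so one hot key or slow consumer backs up its partition while the other consumer idles.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // partitionFor mimics keyed partitioning: same key -> same partition -> same consumer.
    // (Illustrative only; Kafka's default partitioner uses murmur2, not FNV.)
    func partitionFor(key string, numPartitions int) int {
        h := fnv.New32a()
        h.Write([]byte(key))
        return int(h.Sum32() % uint32(numPartitions))
    }

    func main() {
        const partitions = 2 // consumed by C1 (partition 0) and C2 (partition 1)
        counts := make([]int, partitions)

        // A burst of requests for one hot tenant all hash to the same partition,
        // so only one consumer does the work while the other sits idle.
        for i := 0; i < 1000; i++ {
            counts[partitionFor("tenant-42", partitions)]++
        }
        fmt.Println("messages per partition:", counts)
    }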
Kafka's queues are also persisted on disk, which is terrible for RPC.
Think of Kafka as a linear database that you can append to and read from sequentially. Its main use case is for data that can fan out into multiple parallel processing steps. For example, a log processing system that extracts metrics: You feed the Kafka queue into something like Apache Storm, which churns the data and emits counts into an RDBMS (for example).
Kafka stores everything to disk, which may not be what you are looking for for your RPC calls (which you would usually make as direct service-to-service HTTP calls). Moreover, Kafka "topics" are statically declared (i.e. by admin scripts instead of a public API), and creating one is a heavyweight operation. So it's not the best fit for having microservices register themselves and dynamically create "topics" for each method.
IMO using Rabbit to connect "microservices" is an emerging anti-pattern. It often introduces a single point of failure to your "distributed" system and has problems with network partitions. If critical functionality stops working when Rabbit is down, you're probably doing it wrong.
Most real-world microservice projects (I've worked on several) already have many single points of failure. Often there is one service that needs to be up for the system to be up (such as the one that processes your customers' orders); you don't realise some VMs are sharing a physical disk, or that everything depends on a single router somewhere you've never heard of that will one day run out of memory and drop TCP connections. This is not to mention the risks posed to availability by third-party tracking software that pushes changes that break web forms (#1 cause of long outages in my experience).
Message brokers like RabbitMQ give you a lot of benefit and introduce only a small number of failure modes. You can obviate tricky service discovery boot orders, do RPC without caring about whether your server could be restarted, and of course you get a good implementation of pub/sub too. If you stay away from poorly considered high availability schemes, I am absolutely fine with recommending it for inter-service communication.
Great comment. One thing I don't like very much about microservices is simply that my service often will have some such hard dependency. E.g. if the logging service is down, I lose all log-based statistics. If the authentication microservice is down, I'm screwed. The parent comment made the excellent point about SPOFs, but it seems like for microservices to work correctly, there will always be some SPOFs.
Maybe I'm being too pessimistic. I use microservices, but without significant engineering rigor, I think it's a recipe for disaster.
> If the authentication microservice is down, I'm screwed.
Well, make sure your service is not dependent on those services then. Use signed tokens so that you only rely on the authenticator for logins; everything afterwards can work it out by itself.
Logging: keep your stuff in a queue or log file until the log service is back up.
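A minimal sketch of the signed-token idea in Go, using a plain HMAC over a payload with a shared secret (a real setup would use JWTs with expiry and key rotation, and the secret here is obviously made up): any service can verify the token locally, so the authenticator only needs to be up at login time.

    package main

    import (
        "crypto/hmac"
        "crypto/sha256"
        "encoding/base64"
        "fmt"
        "strings"
    )

    var secret = []byte("shared-secret") // distributed to services out of band (hypothetical)

    // sign is what the auth service does at login time.
    func sign(payload string) string {
        mac := hmac.New(sha256.New, secret)
        mac.Write([]byte(payload))
        sig := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
        return payload + "." + sig
    }

    // verify is what every other service does locally, without calling the auth service.
    func verify(token string) (string, bool) {
        i := strings.LastIndex(token, ".")
        if i < 0 {
            return "", false
        }
        payload, sig := token[:i], token[i+1:]
        mac := hmac.New(sha256.New, secret)
        mac.Write([]byte(payload))
        expected := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
        return payload, hmac.Equal([]byte(sig), []byte(expected))
    }

    func main() {
        token := sign(`{"user":"alice"}`)
        payload, ok := verify(token)
        fmt.Println(ok, payload)
    }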
It does require some up-front work, but I've generally moved over to Kafka for most of my stuff. Rabbit has some nice aspects, but as noted its HA options start at "dire" and escalate smoothly to "oh dear god no". Throughput with Kafka is very, very good and, in my experience, it's remarkably difficult to kill. If you want fully ephemeral topics, you can write the broker data to a tmpfs. But I really like keeping data around, because I can replay my topics later both for DR and for debugging.
Kafka makes reliable durable messaging not just possible; it also solves all the usual attendant problems you try to design around.
* Durable messages cause a slow-down when publishing, but not with Kafka because it uses the Linux kernel's page cache to write.
* Durable messages are slow to read if they have to come from disk. Kafka is optimized to load sequential blocks into memory and push them out through a socket with very few copies. This makes for very fast reads.
* Slow consumers can bog down the broker. Kafka stores all messages and keeps them on the topic for a time horizon. No back-pressure from slow consumers.
* Disconnected but subscribed consumers cause messages to back up on disk and eventually clog the broker. Kafka stores all messages. There's no clogging or backup, that's just how it works.
* Brokers must track whether a consumer actually received the message, failures can cause missed messages or clogs. Kafka clients may read from a given point on the topic forward. If they fail during a read, they just back up and read again. The messages will be there for hours/days/weeks as configured.
With a rock-steady durable messaging system based on commit logs, all of those problems that arose from attempting to avoid durable messaging go away.
Now you build microservices that emit and respond to events. Microservices that can "rehydrate" their state from private checkpoints and topic replays. And all of this with partition tolerance and simple mirroring.
Although it isn't usually necessary, if you want, you can make all of that elastic with Mesos, too.
Kafka saved my "life" a few times. We had TTL set to 168h and somebody pushed a change to production that silently ignored a type of message. We realized it a few days later. Luckily we could replay all of the messages after fixing the code. I know there are so many things wrong with this, yet Kafka is excellent at storing data for the medium term, and that can be a real bliss.
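For anyone who hasn't done this: replaying is just reading the topic again from an earlier offset, e.g. with the console consumer that ships with Kafka (topic name made up; older releases take --zookeeper instead of --bootstrap-server):

    kafka-console-consumer.sh --bootstrap-server localhost:9092 \
        --topic orders --from-beginning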
I didn't read that web page carefully, but it seems to describe a transaction log, which is what Kafka excels at, but it has precious little to do with RPC.
RPC is point-to-point communication based on requests and replies. Kafka's strictly sequential requirement would be terrible for this because a single slow request would hold up its entire partition — no other consumer would be able to process the pending upstream events. Kafka is also persistent (does it have in-memory queues?), which is pointless for RPC.
Message queues, period, aren't particularly good for RPC. HTTP, as an online protocol, has such huge advantages that trying to replace it doesn't make sense to me. However, for comms between services where a consumer isn't waiting on the other end, a message queue is plenty appropriate--and Kafka is much, much better at that than RabbitMQ is in terms of throughput and data sanity.
I also quite like NATS, but Kafka provides similar performance characteristics in the general case (generally higher latency being the exception, though I have never encountered latency-sensitive processes where a message queue made sense in the first place) and means babysitting fewer systems.
To be clear, NATS is not a heavy messaging broker like RabbitMQ. For one, it's in-memory only, and queues only exist when there are consumers: If you publish and there are no subscribers, the message doesn't go anywhere. NATS is closer to ZeroMQ than RabbitMQ or Kafka.
A lot of people use HAProxy to route messages via HTTP to microservices — what's HAProxy if not a glorified message queue?
If you don't use an intermediary — meaning you do point-to-point HTTP between one microservice and another — you have to find a way to discover peers, perform health checks, load-balance between them, and so on. Which you can do — services like etcd and Consul exist for this — but using an intermediary such as NATS or Linkerd [1] is also a great, possibly simpler solution.
See above comment; in the file https://github.com/LoyaltyNZ/alchemy-framework/blob/master/s... is a description of creating a cluster of RabbitMQ nodes that can auto-heal if an individual node goes down. We are running CoreOS, which will occasionally shut down a node and update it. When this happens we see zero downtime and no error messages.
Even if you have a cluster of Rabbit nodes, the logical RabbitMQ cluster is still a single point of failure.
This is a fundamental problem with message bus architectures that advocates seem to ignore. It's even more problematic in a microservices architecture, where you (presumably) do domain-driven design in order to allow overall forward progress in the face of partial service/component/datastore outages. To throw all that out the window by coupling everything to a message bus... I still don't really understand.
In my experience failure occurs more frequently when you use more and more systems in more complex ways. e.g. using HAProxy for load balancing, with Consul for service discovery and Consul template for configuration. Each of these is a single point of failure as they are all required for the system to work.
If you define a single point of failure as any single computer whose failure takes the system down with it, then RabbitMQ is not a single point of failure.
I am not sure how domain driven design helps solve this.
> HAProxy for load balancing, with Consul for service discovery and Consul template for configuration. Each of these is a single point of failure as they are all required for the system to work.
Not necessarily. I don't know anything about Consul, but if you use something like ZooKeeper to discover services and write those into an HAProxy config, include a failsafe in whatever writes the HAProxy config on ZK updates, such that if the delta is "too large" it will refuse to rewrite the config.
Then if ZK becomes unavailable, what you lose is the ability to easily _make changes_ to what's in the service list. If your service instances come and go relatively infrequently, this might be fine while the ZK fire gets put out.
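A rough Go sketch of that failsafe (the threshold and addresses are made up): whatever regenerates the HAProxy backend list on ZK updates refuses to apply a change that would drop a large fraction of the current backends, on the assumption that such a delta is more likely a discovery outage than a real change.

    package main

    import "fmt"

    // safeToApply refuses updates that would remove more than maxDropFraction
    // of the currently known backends in one go.
    func safeToApply(current, proposed []string, maxDropFraction float64) bool {
        if len(current) == 0 {
            return true
        }
        keep := make(map[string]bool, len(proposed))
        for _, b := range proposed {
            keep[b] = true
        }
        dropped := 0
        for _, b := range current {
            if !keep[b] {
                dropped++
            }
        }
        return float64(dropped)/float64(len(current)) <= maxDropFraction
    }

    func main() {
        current := []string{"10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080", "10.0.0.4:8080"}
        proposed := []string{"10.0.0.1:8080"} // ZK just "lost" three backends
        fmt.Println(safeToApply(current, proposed, 0.5)) // false: don't rewrite the HAProxy config
    }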
Service instances in a continuous deployment environment are coming and going all day. If your service discovery and config breaks then everything stops; nothing can be developed or deployed until the broken stuff is fixed.
If SD or config mgmt dies, you can't deploy new stuff, but the existing services continue to work. When your message bus dies, everything dies. It's a fundamentally different failure.
Depends on the application, but as one possible general solution: have microservices talk to each other directly when it makes sense to rather than communicating over RabbitMQ / central message bus out of laziness/convenience.
Doesn't this make the architecture a lot more complex though? I mean, if every service uses a common messaging broker, the number of connections is O(n). Whereas if every microservice needs to talk to every other microservice, it's O(n^2). And analysis is much harder, unless all the services send their logs/metrics to a common logging/metrics system.
If you don't have every service talk to every other service then you have either a deployment problem or a load-balancing problem.
If a service only talks to, say, instances of another service that are local, then every node in a cluster must contain ALL services. If the local instance is overloaded but a remote one isn't, then the service you are talking to will be slow, regardless of any front-end load balancing.
Every service must be able to talk to every other service, because if it cannot then you cannot load balance or deploy without it being an n^2 problem. So the question is how to implement that WITHOUT complicating the services.
profilesvc may only need to talk to (depend on) usersvc and accountsvc to do its job. usersvc and accountsvc may only need to talk to (depend on) their data stores.
Yes, every instance of every service needs to manage its communication paths to other services. That's N connections, rather than just 1 to the message bus. But this is a pretty well-understood problem. We have connection pools and circuit breakers and so on. And the risks are distributed, isolated, heterogeneous.
Service Discovery often removes the need for load balancers. Let the clients discover where all the instances of Service X are and build the clients to handle failures to connect to individual instances.
Service discovery does not remove the need for load balancing.
For example, if you had three nodes, each with every service, and round-robin service discovery, then overloading the system is just a matter of receiving a difficult request every third query. No matter how good your front-end load balancing is in a microservice system, if your service-to-service requests are not load balanced you can have problems with overloading one node while others are idle.
Service discovery can remove the need for load balancing, if you move load balancing logic into the client services. Have them architect their own load balancing over available instances of their dependent services.
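A bare-bones Go sketch of what that looks like in the client (the instance addresses are placeholders): take the instance list from service discovery, round-robin over it, and skip instances that fail.

    package main

    import (
        "errors"
        "fmt"
        "sync/atomic"
    )

    // Balancer does naive client-side round-robin over instances from service discovery.
    type Balancer struct {
        instances []string // refreshed from SD (etcd/Consul/ZK) out of band
        next      uint64
    }

    // Call tries each instance at most once, starting from the round-robin cursor.
    func (b *Balancer) Call(do func(addr string) error) error {
        n := len(b.instances)
        if n == 0 {
            return errors.New("no instances available")
        }
        start := atomic.AddUint64(&b.next, 1)
        for i := 0; i < n; i++ {
            addr := b.instances[(start+uint64(i))%uint64(n)]
            if err := do(addr); err == nil {
                return nil
            }
        }
        return errors.New("all instances failed")
    }

    func main() {
        b := &Balancer{instances: []string{"usersvc-1:8080", "usersvc-2:8080"}}
        _ = b.Call(func(addr string) error {
            fmt.Println("calling", addr) // real code would make an HTTP/RPC call here
            return nil
        })
    }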
Unless you're deploying a SmartStack-esque LB strategy, where each physical node hosts its own load balancer, the details of what's deployed where are mostly irrelevant. You use your SD system to abstract away the physical dimension of the problem, and address logical clusters of service instances. And you rely on your scheduler to distribute service instances evenly among nodes.
Looks pretty cool, but doing RPC over RMQ is quite expensive. I would rather see good abstractions over ZMQ - that would be really cool.
Also, as many people here have pointed out, RMQ is a single point of failure, which might be acceptable in some cases. To me, the problem is that by design this architecture has a bottleneck which is really hard to get rid of.
OpenStack has had this as a common design pattern across a lot of the services for a while.
It works quite well for us now - there was a period 2-3 years ago when RabbitMQ was the bane of my existence and the cause of many a page, but newer versions have been fine.
We do assume that everything in the queue can go away and only in-flight calls will fail - and (at least in the service I work on) we do retries if something falls through the cracks.
Yes, I have worked with the OpenStack messaging queue, and it's really amazing how it just works. Also, being able to grab analytics about service usage by querying the queue directly is great.
I had not heard of these before, but they look really interesting. I think they are pretty similar, as it is a common pattern and a straightforward solution to many problems.
This looks cool. I wrote a less developed (and unreleased) RabbitMQ-based microframework that combines RPC and one-way event streams to integrate an ecommerce site in one datacenter with a back office in another. At another company I used ZeroMQ to make a job processing cluster.
Comments by others suggesting the use of HTTP are kind of missing the point of message queues. The latency added by HTTP(S) made interactive RPC sluggish, and made batch back-office updates take hours instead of the minutes they took with AMQP.
If I had Alchemy two years ago I might have used this instead of rolling my own.
This is helpful. We already use RabbitMQ for our Go services. This would ease the task of setting up publishers and subscribers in the Node APIs we're building.
I've been working on something similar but decided to go with Kafka as the message bus because I want the messages to persist. This allows for more error recovery solutions and auditing. Having all that data around also allows me to come up with new ways to use it days after the fact.
I feel like this is a bit silly. It feels like one could just speak HTTP or Thrift, and use etcd or ZooKeeper for discovery. RabbitMQ just complicates the stack. Also, its HA options are generally... disappointing.
I do not understand the question. In production we use an ELB hitting multiple routers, on multiple nodes. Not sure how hashring applies to load balancing.
As someone who's used RabbitMQ in a SaaS app to run millions of tasks, heed my warning: RabbitMQ is great, but if you are low on memory the overhead is too much to deal with. Celery also contributed to this, but if I could travel back in time, I would stop myself from using RabbitMQ. Instead, Redis is a much better alternative for microservices.
Redis pub/sub (which I assume you are talking about as an alternative) is fine, but you have to be listening when the message is sent, otherwise you miss it.
Another problem is working out whether there are any listeners where you sent it, which RabbitMQ deals with via return queues. This is important for 404s when a client is trying to talk to a service that does not exist. In Redis you would just post to a pub/sub channel, wait till nothing happens and time out, rather than immediately knowing through a return queue.
Actually Redis returns the number of consumers that were subscribed during your publish and received the message, so you can detect that no service is available and return your 404 status. The real limitation for RPC over Redis pub/sub, to me, is that you can only broadcast your messages to all consumers, and not have them load-balanced between the subscribed applications. So you end up following a discover/unicast call pattern where Redis doesn't bring much to the table for the actual RPC call. Maybe that's an opportunity for a cool addition to Redis...
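For anyone who hasn't seen it: PUBLISH's reply is the number of clients that received the message, so a zero tells you up front that nobody is listening (channel name made up):

    127.0.0.1:6379> PUBLISH rpc.user.get "get user 42"
    (integer) 0    <- no subscribers: return a 404 instead of waiting for a timeout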
Cheers, I didn't know that about Redis. I actually really like Redis and its simplicity; it would be cool to get more options for passing messages over Redis.
[1] http://nats.io
[2] https://aphyr.com/posts/315-jepsen-rabbitmq
[3] http://bravenewgeek.com/dissecting-message-queues/