Microservice Communication — Every Tool You Need and When to Use Each One

When you split a monolith into microservices, the first real problem you hit is not deployment or scaling. It is communication. How do these services talk to each other? The answer is not one tool — it is knowing which tool to use for which situation.

I have built microservice architectures across fintech platforms, marketplace systems, and AI-powered backends. Every time, the communication layer is where teams either get it right and move fast, or get it wrong and spend months debugging cascading failures.

This is the guide I wish I had when I started. Every major tool, when to use it, when to avoid it, and what it actually looks like in production.

The Two Fundamental Patterns

Before diving into tools, you need to understand the two core communication styles.

Synchronous — Service A sends a request to Service B and waits for a response. Like a phone call. Simple, immediate, but the caller is blocked until the other side responds.

Asynchronous — Service A sends a message and moves on. Service B processes it whenever it can. Like sending a text. The sender is not blocked, but you need infrastructure to hold and deliver that message.
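Stripped to their essence, the two styles look like this in-process sketch (toy stand-ins for real services; the function and queue here are illustrative, not a real framework):

```python
import queue

# Synchronous: the caller blocks until it has an answer (the "phone call").
def charge_payment(amount: int) -> str:
    return "approved" if amount <= 1000 else "declined"

result = charge_payment(250)   # caller waits right here for the response

# Asynchronous: the caller drops a message on a queue and moves on (the "text").
jobs: queue.Queue = queue.Queue()
jobs.put({"type": "send_email", "to": "user@example.com"})
# ...later, some worker drains the queue at its own pace:
job = jobs.get()
```

Everything that follows is, in one way or another, a production-grade version of one of these two lines: a blocking call or a queue.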

Most production systems use both. The question is knowing which pattern fits which interaction.

REST APIs (HTTP/JSON)

REST is the default. If two services need a simple request-response interaction and you do not have a strong reason to use something else, REST is the answer.

How it works: Service A makes an HTTP request (GET, POST, PUT, DELETE) to Service B’s endpoint. Service B returns a JSON response. Standard HTTP semantics apply.

When to use it:

  • CRUD operations between services
  • Simple request-response where latency is acceptable
  • Public-facing APIs that external clients consume
  • When your team is small and you need something everyone already understands

When to avoid it:

  • High-throughput internal communication (thousands of calls per second)
  • When you need streaming or bidirectional communication
  • When payload size and serialization overhead matter

Real-world example: An order service calling a payment service to charge a customer. The order service needs a definitive yes-or-no response before proceeding. REST works perfectly here.
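To make that interaction concrete, here is a self-contained toy version: a stand-in payment service on a local port, and the blocking HTTP call the order service would make. Service names, paths, and fields are invented for illustration; only the stdlib is used.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in payment service: a single POST /charge endpoint.
class PaymentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        result = {"status": "approved" if body["amount"] <= 1000 else "declined"}
        payload = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), PaymentHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The order service's side: POST the charge and block until the answer arrives.
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/charge",
    data=json.dumps({"order_id": "o-1", "amount": 250}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    answer = json.loads(resp.read())

server.shutdown()
```

The key property is right there in the client code: execution stops at the request until the payment service answers. That is exactly what you want for a charge, and exactly what you do not want for a background job.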

The catch: REST over HTTP/1.1 handles one request at a time per connection, so high concurrency means either head-of-line blocking or maintaining a large connection pool. At high scale between internal services, this overhead adds up. And JSON serialization is verbose: you are sending field names as strings in every single payload.

gRPC

gRPC is what you reach for when REST becomes the bottleneck. It is built on HTTP/2, uses Protocol Buffers for serialization, and supports streaming natively.

How it works: You define your service contract in a .proto file. Both the client and server generate strongly-typed code from that definition. Communication is binary (not JSON), compact, and fast. HTTP/2 multiplexing means multiple requests share a single TCP connection.
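A sketch of what such a .proto contract might look like, using the fraud-scoring scenario from later in this section (every name here is hypothetical, invented for illustration):

```proto
syntax = "proto3";

package fraud.v1;

// Hypothetical contract between a transaction service and a fraud service.
service FraudDetection {
  // Unary call: one transaction in, one score out.
  rpc ScoreTransaction (Transaction) returns (FraudScore);
  // Server streaming: a live feed of flagged transactions.
  rpc StreamAlerts (AlertFilter) returns (stream FraudAlert);
}

message Transaction {
  string transaction_id = 1;
  int64  amount_cents   = 2;
  string currency       = 3;
}

message FraudScore {
  string transaction_id = 1;
  float  score          = 2;  // 0.0 = clean, 1.0 = almost certainly fraud
}

message AlertFilter { float min_score = 1; }
message FraudAlert  { Transaction txn = 1; FraudScore score = 2; }
```

Both sides generate typed client and server stubs from this file, which is what makes gRPC contracts strict: a field rename is a compile-time event, not a runtime surprise.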

When to use it:

  • High-frequency internal service-to-service calls
  • When you need strict API contracts between teams
  • Streaming data (server-side, client-side, or bidirectional)
  • Polyglot environments where services are written in different languages
  • Low-latency requirements where every millisecond matters

When to avoid it:

  • Browser-to-server communication (browsers have limited gRPC support without a proxy)
  • Simple CRUD where REST is good enough
  • When your team has no experience with Protocol Buffers and the learning curve is not justified

Real-world example: In a fintech platform I worked on, the transaction processing service communicated with the fraud detection service via gRPC. Every transaction needed a fraud score in under 50ms. REST could not keep up at 5,000 transactions per second. gRPC with connection pooling and binary serialization brought the p99 latency under 20ms.

The catch: Debugging is harder. You cannot curl a gRPC endpoint. You need tools like grpcurl or Postman’s gRPC support. The .proto file management across teams requires discipline.

RabbitMQ

RabbitMQ is a message broker built on the AMQP protocol. It is the workhorse of asynchronous messaging — reliable, mature, and battle-tested.

How it works: Service A publishes a message to an exchange. RabbitMQ routes that message to one or more queues based on routing rules. Service B consumes messages from its queue. Messages are persisted until acknowledged.

Key patterns:

  • Direct exchange — message goes to a specific queue (point-to-point)
  • Fanout exchange — message goes to all bound queues (broadcast)
  • Topic exchange — message routed based on pattern matching on routing keys
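The topic-exchange matching rules (a routing key is dot-separated words; in a binding pattern, `*` matches exactly one word and `#` matches zero or more) can be illustrated with a small stand-alone matcher. This reimplements the semantics in plain Python for clarity; in reality RabbitMQ does this inside the broker:

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """AMQP-style topic match: '*' = exactly one word, '#' = zero or more."""
    def match(p: list, k: list) -> bool:
        if not p:
            return not k
        if p[0] == "#":
            # '#' absorbs zero words, or one word while staying in place.
            return match(p[1:], k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False
        if p[0] == "*" or p[0] == k[0]:
            return match(p[1:], k[1:])
        return False
    return match(pattern.split("."), routing_key.split("."))
```

So a queue bound with `order.*.created` receives `order.eu.created` but not `order.eu.west.created`, while a queue bound with `order.#` receives both.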

When to use it:

  • Task queues (distribute work across multiple workers)
  • Decoupling services that do not need immediate responses
  • When you need guaranteed delivery (messages survive broker restarts)
  • Complex routing logic (different messages go to different consumers)
  • Background job processing (sending emails, generating reports, processing images)

When to avoid it:

  • High-throughput event streaming (millions of events per second)
  • When you need message replay (once consumed, messages are gone)
  • Log aggregation or event sourcing (Kafka is better for this)

Real-world example: In a marketplace platform, when a seller lists a new product, the listing service publishes a message. The search indexing service, the notification service, and the analytics service each have their own queue and process the event independently. If the notification service goes down, messages queue up and get processed when it recovers. No data loss.

The catch: RabbitMQ is a traditional message broker — it deletes messages after they are consumed and acknowledged. If you need to replay events or maintain a history, RabbitMQ is the wrong tool.

Apache Kafka

Kafka is not a message broker in the traditional sense. It is a distributed event streaming platform. The difference matters.

How it works: Producers write events to topics. Topics are split into partitions for parallelism. Consumers read from partitions at their own pace using offsets. Crucially, Kafka retains events for a configurable period (days, weeks, or forever). Multiple consumer groups can read the same events independently.
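The offset mechanics are the heart of the difference from a traditional broker. Here is a toy single-partition model (ignoring partitioning, replication, and rebalancing entirely) that shows how two consumer groups read the same retained events at independent paces:

```python
from collections import defaultdict

class MiniLog:
    """Toy single-partition topic: an append-only list plus per-group offsets."""
    def __init__(self):
        self.events: list = []              # retained, never deleted on consume
        self.offsets = defaultdict(int)     # consumer group -> next offset to read

    def produce(self, event) -> None:
        self.events.append(event)

    def consume(self, group: str, max_events: int = 10) -> list:
        start = self.offsets[group]
        batch = self.events[start:start + max_events]
        self.offsets[group] += len(batch)   # commit the new offset for this group
        return batch

topic = MiniLog()
for e in ["txn-1", "txn-2", "txn-3"]:
    topic.produce(e)

balances = topic.consume("balance-service")   # this group reads all three
fraud = topic.consume("fraud-service", 2)     # independent pace: first two only
```

Consuming never mutates the log itself, only the group's offset. That is why replay is trivial: reset a group's offset and it rereads history.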

When to use it:

  • Event sourcing (recording every state change as an immutable event)
  • High-throughput data pipelines (millions of events per second)
  • When multiple services need the same event stream
  • Log aggregation across services
  • Real-time analytics and stream processing
  • When you need message replay (reprocess events from any point in time)

When to avoid it:

  • Simple task queues (Kafka is overkill and operationally complex)
  • When you need complex routing logic (RabbitMQ is better)
  • Low-volume messaging where RabbitMQ or even Redis would suffice
  • If your team cannot handle the operational overhead

Real-world example: In a fintech system, every transaction is written to a Kafka topic. The account balance service consumes it to update balances. The fraud detection service consumes it to run real-time risk scoring. The reporting service consumes it to build daily summaries. The audit service consumes it to maintain a complete transaction history. All from the same event stream, each at their own pace.

The catch: Kafka is operationally complex. ZooKeeper (or KRaft in newer versions), partition management, consumer group rebalancing, offset management — there is a lot to get right. If you are a small team with low throughput, Kafka will create more problems than it solves.

NATS

NATS is the lightweight alternative that does not get enough attention. It is fast, simple, and designed for cloud-native environments.

How it works: At its core, NATS is a pub/sub messaging system. Publishers send messages to subjects, and subscribers receive them. NATS JetStream adds persistence, at-least-once delivery, and stream processing on top.
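The request-reply pattern is worth a sketch, because it shows how NATS makes an RPC-shaped interaction out of pure pub/sub: the requester subscribes to a unique inbox subject and the responder publishes its answer there. This toy is synchronous and in-process (real NATS is a networked, asynchronous server); names are invented:

```python
import uuid

class MiniNats:
    """Toy in-process stand-in for NATS pub/sub with request-reply."""
    def __init__(self):
        self.subs = {}                     # subject -> handler(msg, reply_to)

    def subscribe(self, subject, handler):
        self.subs[subject] = handler

    def publish(self, subject, msg, reply_to=None):
        if subject in self.subs:           # real NATS also drops unmatched messages
            self.subs[subject](msg, reply_to)

    def request(self, subject, msg):
        """Publish with a unique inbox subject, collect the answer from it."""
        inbox, box = f"_INBOX.{uuid.uuid4().hex}", []
        self.subscribe(inbox, lambda m, _: box.append(m))
        self.publish(subject, msg, reply_to=inbox)
        return box[0] if box else None

nats = MiniNats()
# A responder service: replies to the requester's inbox subject.
nats.subscribe("svc.greet", lambda msg, reply: nats.publish(reply, f"hello, {msg}"))
answer = nats.request("svc.greet", "orders")
```

The caller gets an RPC-like experience, but nothing in the middle is a dedicated RPC channel — it is all subjects and messages, which is why the same infrastructure handles fan-out events and commands too.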

When to use it:

  • Lightweight service-to-service messaging
  • IoT and edge computing (tiny footprint, low latency)
  • Request-reply patterns that feel like RPC but are async
  • When you want something simpler than Kafka but more capable than Redis Pub/Sub
  • Kubernetes-native service mesh communication

When to avoid it:

  • When you need Kafka-level throughput and ecosystem (Kafka Connect, ksqlDB)
  • When your team is already invested in RabbitMQ and the migration is not justified
  • Complex routing logic with dead letter queues and priority queues

Real-world example: In a microservice architecture with 20+ services running on Kubernetes, NATS handles internal event distribution. Service discovery events, configuration change notifications, and lightweight command messages all flow through NATS. The overhead is minimal — a single NATS server handles the entire cluster’s messaging needs.

The catch: NATS is less battle-tested in enterprise environments than RabbitMQ or Kafka. The ecosystem is smaller, and finding developers with NATS experience is harder.

Redis Pub/Sub and Redis Streams

Redis is not just a cache. Its Pub/Sub and Streams features make it a viable messaging tool for specific use cases.

Redis Pub/Sub — fire-and-forget messaging. If no subscriber is listening when a message is published, that message is gone. No persistence, no replay.

Redis Streams — a persistent, append-only log (similar to Kafka in concept but simpler). Messages are stored, consumer groups are supported, and you can replay from any point.
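The retention difference between the two models is easiest to see side by side. These are toy in-memory stand-ins, not redis-py code (the real Redis commands are PUBLISH/SUBSCRIBE versus XADD/XREAD); they exist only to show the semantics:

```python
class PubSub:
    """Fire-and-forget: a message published with no subscriber is lost."""
    def __init__(self):
        self.subscribers = []
    def publish(self, msg):
        for deliver in self.subscribers:
            deliver(msg)

class Stream:
    """Append-only log: entries are retained and readable from any point."""
    def __init__(self):
        self.entries = []
    def add(self, msg):
        self.entries.append(msg)
        return len(self.entries) - 1          # entry position / id
    def read_from(self, start_id):
        return self.entries[start_id:]

channel, received = PubSub(), []
channel.publish("update-1")                   # nobody listening: gone forever
channel.subscribers.append(received.append)
channel.publish("update-2")                   # delivered

stream = Stream()
stream.add("update-1")
stream.add("update-2")
late_reader = stream.read_from(0)             # full history still available
```

A subscriber that connects late sees nothing in Pub/Sub and everything in Streams. That single property decides which one fits your use case.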

When to use it:

  • Pub/Sub: Real-time notifications, chat messages, live updates where losing a message is acceptable
  • Streams: Lightweight event streaming when Kafka is overkill but you need persistence
  • When Redis is already in your stack and you want to avoid adding another infrastructure component

When to avoid it:

  • Mission-critical messaging where losing a message is unacceptable (use RabbitMQ)
  • High-volume event streaming at scale (use Kafka)
  • When Redis is already under memory pressure from caching

Real-world example: In a real-time dashboard for a marketplace, Redis Pub/Sub pushes live order updates to connected WebSocket clients. If a client disconnects and reconnects, it fetches the current state via REST. The transient nature of Pub/Sub is fine here because the source of truth lives in the database, not in the messages.

GraphQL Federation

When your API gateway needs to aggregate data from multiple microservices into a single response, GraphQL Federation is the answer.

How it works: Each microservice exposes its own GraphQL subgraph. A gateway (like Apollo Router) composes them into a single unified graph. Clients query the gateway, which routes and resolves across services.

When to use it:

  • Client-facing APIs that need data from multiple services in a single request
  • Mobile applications where reducing network round trips matters
  • When different teams own different parts of the data model

When to avoid it:

  • Internal service-to-service communication (use gRPC or messaging)
  • Simple APIs where REST is sufficient
  • When your team does not have GraphQL experience

Real-world example: A mobile app’s home screen needs user profile data (from the user service), recent orders (from the order service), and recommended products (from the recommendation service). Instead of three separate REST calls, a single GraphQL query through the federated gateway returns everything in one round trip.
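That single query might look something like this — all field and argument names are illustrative, not a real schema:

```graphql
# One round trip for the whole home screen.
query HomeScreen {
  me {                              # resolved by the user service's subgraph
    name
    avatarUrl
  }
  recentOrders(limit: 5) {          # resolved by the order service's subgraph
    id
    status
  }
  recommendedProducts(limit: 10) {  # resolved by the recommendation service
    id
    title
    price
  }
}
```

The gateway fans this out to the three subgraphs, stitches the results, and returns one response. The client never knows, or cares, how many services were involved.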

Service Mesh (Istio, Linkerd)

A service mesh handles the communication infrastructure so your application code does not have to. It is not a messaging tool — it is a layer that manages how services communicate.

How it works: A sidecar proxy (like Envoy) runs alongside each service. All network traffic flows through the proxy. The mesh handles load balancing, retries, circuit breaking, mutual TLS, and observability — without changing your application code.

When to use it:

  • Large-scale microservice deployments (50+ services)
  • When you need mTLS between every service without modifying code
  • Canary deployments and traffic splitting
  • Centralized observability across all service communication

When to avoid it:

  • Small deployments (under 10 services)
  • When the operational complexity of running a mesh is not justified
  • If your team is already struggling with Kubernetes basics
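As one concrete taste of what the mesh buys you, here is a sketch of an Istio VirtualService doing a canary traffic split — 90% of requests to v1, 10% to v2 — with no application code involved. Service and subset names are hypothetical:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments
spec:
  hosts:
    - payments.default.svc.cluster.local
  http:
    - route:
        - destination:
            host: payments
            subset: v1
          weight: 90
        - destination:
            host: payments
            subset: v2          # the canary
          weight: 10
```

Shifting more traffic to the canary is a config change, not a deploy — which is precisely the appeal, and precisely the operational surface area you take on.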

The Decision Matrix

Scenario → Best Tool

  • Simple CRUD between two services → REST
  • High-frequency, low-latency internal calls → gRPC
  • Background job processing → RabbitMQ
  • Distribute tasks across workers → RabbitMQ
  • Event sourcing / audit log → Kafka
  • Multiple consumers need the same event → Kafka
  • Real-time analytics pipeline → Kafka
  • Lightweight pub/sub in a cloud-native setup → NATS
  • Real-time notifications (lossy is OK) → Redis Pub/Sub
  • Simple event streaming without Kafka overhead → Redis Streams
  • Aggregate data from multiple services for clients → GraphQL Federation
  • Infrastructure-level traffic management → Service Mesh

What I Actually Use in Production

In every microservice system I have built, the stack looks something like this:

  • gRPC for synchronous internal communication where latency matters
  • REST for external-facing APIs and simple internal CRUD
  • RabbitMQ or Kafka for async messaging (RabbitMQ for task queues, Kafka when I need event streaming and replay)
  • Redis Pub/Sub for real-time features like notifications and live updates

I rarely use just one. The services that process payments talk over gRPC. The services that send emails consume from RabbitMQ. The services that build analytics dashboards read from Kafka. Each tool earns its place by solving the specific problem it was designed for.

The worst mistake I see teams make is picking one tool and forcing every communication pattern through it. REST for everything leads to synchronous coupling nightmares. Kafka for everything leads to operational overhead on simple interactions. The right answer is almost always a combination, matched to the actual requirements of each interaction.

Do not start with the tool. Start with the question: does this interaction need an immediate response, or can it be processed later? That single question will guide you to the right pattern, and the right pattern will guide you to the right tool.
