Saga Hq

Distributed systems struggle with atomic operations that span multiple bounded contexts. Traditional two-phase commit (2PC) holds locks on resources across network boundaries while participants vote, which limits throughput. Pure choreography (event chains) offers high decoupling but hides the global state of a transaction, which makes observability and failure recovery hard. Saga Hq provides a hybrid architecture: a decentralized execution model guided by a lightweight protocol that combines the visibility of orchestration with the resilience of event-driven systems.

The Hybrid Saga Protocol

Instead of a monolithic coordinator owning every step, each participant in a Saga Hq transaction is responsible for its own execution and failure recovery. The protocol uses a two-channel model:

Forward Channel: Events drive progress. Each step publishes a completion event that the next participant consumes.
State Channel: Each participant maintains a local saga state and registers a compensating action before executing the forward work.

When a step succeeds, the forward channel advances. When a step fails, the participant publishes a failure event, and the protocol unwinds the registered compensations in reverse order of registration. This gives developers the end-to-end visibility of a centralized orchestrator without the coordinator becoming a single point of failure or a bottleneck.
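
The two-channel flow described above can be sketched in a few lines of Python. This is a minimal, single-process illustration under stated assumptions, not Saga Hq's implementation: `SagaRun`, `execute`, `unwind`, `run_saga`, and the step names are hypothetical, and in-memory lists stand in for the message bus and the durable store. The sketch also assumes a failed step leaves no partial effects, so only previously completed steps are compensated.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Action = Callable[[], None]


@dataclass
class SagaRun:
    """Hypothetical in-memory stand-in for one saga instance."""
    events: List[str] = field(default_factory=list)  # forward channel
    compensations: List[Tuple[str, Action]] = field(default_factory=list)  # state channel

    def execute(self, name: str, forward: Action, compensate: Action) -> bool:
        # State channel: register the undo action before doing the forward work.
        self.compensations.append((name, compensate))
        try:
            forward()
        except Exception:
            # Assumption: the failed step did nothing that needs undoing,
            # so its own compensation is dropped before unwinding.
            self.compensations.pop()
            self.events.append(f"{name}.failed")
            self.unwind()
            return False
        # Forward channel: publish the completion event for the next participant.
        self.events.append(f"{name}.completed")
        return True

    def unwind(self) -> None:
        # Pop the compensation stack so the most recent step is undone first.
        while self.compensations:
            name, undo = self.compensations.pop()
            undo()
            self.events.append(f"{name}.compensated")


def run_saga(steps: List[Tuple[str, Action, Action]]) -> SagaRun:
    run = SagaRun()
    for name, forward, compensate in steps:
        if not run.execute(name, forward, compensate):
            break
    return run


def noop() -> None:
    pass


def decline() -> None:
    raise RuntimeError("card declined")


demo = run_saga([
    ("reserve_stock", noop, noop),
    ("create_shipment", noop, noop),
    ("charge_card", decline, noop),
])
```

In this run the third step fails, so `demo.events` records two completions, one failure, and then the two compensations with `create_shipment` undone before `reserve_stock`.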

Core Components

The Saga Coordinator (Lightweight Per Saga)

There is no global coordinator. Instead, each transaction instance has a lightweight, short-lived coordinator that exists only for the duration of that transaction's forward execution. It holds the contract definition (the ordered list of steps and their compensation logic) but does not execute the work. The contract is the source of truth for what the transaction is supposed to do, and it is passed as metadata between participants.
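
One way to picture a contract that travels as metadata is a small serializable value object. The sketch below is an assumption about shape only: `SagaContract`, `StepSpec`, the JSON encoding, and the action names are hypothetical and are not taken from Saga Hq itself.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StepSpec:
    name: str          # identifier of the forward action
    compensation: str  # identifier of the matching undo action


@dataclass(frozen=True)
class SagaContract:
    saga_id: str
    steps: List[StepSpec]

    def to_metadata(self) -> str:
        # Serialize the contract so it can ride along in event headers.
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_metadata(raw: str) -> "SagaContract":
        data = json.loads(raw)
        return SagaContract(
            saga_id=data["saga_id"],
            steps=[StepSpec(**step) for step in data["steps"]],
        )

    def next_step(self, completed: str) -> Optional[StepSpec]:
        # Look up which step follows the one that just completed.
        names = [step.name for step in self.steps]
        index = names.index(completed) + 1
        return self.steps[index] if index < len(self.steps) else None


contract = SagaContract(
    saga_id="saga-42",
    steps=[
        StepSpec("reserve_stock", "release_stock"),
        StepSpec("charge_card", "refund_card"),
    ],
)
restored = SagaContract.from_metadata(contract.to_metadata())
```

Because the contract only describes the steps and never runs them, any participant can rebuild it from the metadata and work out what comes next without calling a central service.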

The State Store

Each participant stores its local view of the saga state in a durable store. This includes the current step index, the execution status of each step, and the registered compensation actions. Idempotency keys are enforced at the storage layer: every event is deduplicated against a saga-scoped key, so a retried or redelivered event never produces a duplicate side effect.
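
Storage-layer deduplication can be illustrated with a uniqueness constraint. The following is a rough sketch using Python's built-in `sqlite3` module as a stand-in for the durable store. The `StateStore` class, the table layout, and the choice of `(saga_id, step_index)` as the idempotency key are assumptions made for the example.

```python
import sqlite3
from typing import List, Tuple


class StateStore:
    """Hypothetical local saga store for a single participant."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            """CREATE TABLE IF NOT EXISTS saga_steps (
                   saga_id      TEXT    NOT NULL,
                   step_index   INTEGER NOT NULL,
                   status       TEXT    NOT NULL,
                   compensation TEXT    NOT NULL,
                   PRIMARY KEY (saga_id, step_index)
               )"""
        )

    def record_once(self, saga_id: str, step_index: int,
                    status: str, compensation: str) -> bool:
        """Return True on first delivery, False if the event is a duplicate."""
        try:
            with self.db:  # commits on success, rolls back on error
                self.db.execute(
                    "INSERT INTO saga_steps VALUES (?, ?, ?, ?)",
                    (saga_id, step_index, status, compensation),
                )
            return True
        except sqlite3.IntegrityError:
            # The primary key already exists: this event was seen before.
            return False

    def status(self, saga_id: str) -> List[Tuple[int, str]]:
        rows = self.db.execute(
            "SELECT step_index, status FROM saga_steps "
            "WHERE saga_id = ? ORDER BY step_index",
            (saga_id,),
        )
        return rows.fetchall()


store = StateStore()
first = store.record_once("saga-42", 0, "completed", "release_stock")
duplicate = store.record_once("saga-42", 0, "completed", "release_stock")
```

A handler would perform its side effect only when `record_once` returns `True`. In a real participant the side effect and the insert would need to share a transaction or use an outbox, which this sketch leaves out.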

Compensating Actions

Every forward step must register a compensating action: the undo operation that reverses the work. These actions are executed in LIFO order during failure unwinding. Saga Hq guarantees that if step N fails, the completed steps are compensated most recent first, from step N-1 down to step 0.

Failure Handling and Resilience

Failure is a first-class citizen in Saga Hq. The system handles three failure modes:

Transient Failures: The forward channel retries with exponential backoff and jitter. The participant continues the forward path once the external dependency recovers.
Permanent Failures: When a step is unrecoverable, the failure event triggers the compensation chain. Each compensation is itself retried until it succeeds, so compensations must be idempotent; this ensures the system eventually reaches a consistent "undone" state.
Human-in-the-Loop: For complex failure modes where automated compensation is risky, the protocol can emit a "halt" event. The saga enters a paused state, and a human operator can manually trigger the next step or force compensation.
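
The three failure modes above can be sketched as a small step runner. This is an illustrative sketch, not Saga Hq's API: the exception classes, the `Outcome` enum, `run_step`, and the default delays are hypothetical. The backoff uses the common "full jitter" variant, and treating exhausted retries as a permanent failure is an assumption.

```python
import random
import time
from enum import Enum
from typing import Callable


class TransientError(Exception):
    """The dependency may recover; retry the forward step."""


class PermanentError(Exception):
    """The step cannot succeed; start compensation."""


class NeedsOperator(Exception):
    """Automated compensation is risky; pause for a human."""


class Outcome(Enum):
    COMPLETED = "completed"
    COMPENSATE = "compensate"
    HALTED = "halted"


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    # Full jitter: a random delay between 0 and the capped exponential bound.
    return random.random() * min(cap, base * (2 ** attempt))


def run_step(forward: Callable[[], None], max_attempts: int = 5,
             sleep: Callable[[float], None] = time.sleep) -> Outcome:
    for attempt in range(max_attempts):
        try:
            forward()
            return Outcome.COMPLETED
        except TransientError:
            # Wait before the next attempt, but not after the last one.
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
        except NeedsOperator:
            return Outcome.HALTED
        except PermanentError:
            return Outcome.COMPENSATE
    # Assumption: once retries are exhausted the failure is treated as permanent.
    return Outcome.COMPENSATE
```

Passing `sleep` as a parameter keeps the runner testable without real delays. A caller would publish the failure event on `COMPENSATE` and the "halt" event on `HALTED`.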

Key Advantages

Fault Tolerance: Because there is no global coordinator, a failure in one participant cannot block the entire system; only the transactions that involve that participant are affected.
Observability: The shared protocol ensures the global state of every transaction is discoverable by querying the participants' local stores.
Developer Experience: Developers define a linear list of steps rather than a sprawling graph of events, making the happy path and the failure path equally explicit.
Idempotent Execution: Every operation is guarded by a saga-scoped idempotency key, so duplicate events from the message bus never cause double billing or duplicate emails.
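
As an illustration of the observability point, a global view of one transaction can be assembled by merging the participants' local views. The sketch below assumes each local store can be read as a nested mapping; the `global_state` function, the participant names, and the "pending" status are hypothetical.

```python
from typing import Dict, List

# participant name -> saga ID -> step name -> status
LocalStores = Dict[str, Dict[str, Dict[str, str]]]


def global_state(saga_id: str, steps: List[str],
                 local_stores: LocalStores) -> Dict[str, str]:
    # Start from the contract's step list so unreported steps show as pending.
    view = {step: "pending" for step in steps}
    for store in local_stores.values():
        for step, status in store.get(saga_id, {}).items():
            if step in view:
                view[step] = status
    return view


stores: LocalStores = {
    "inventory": {"saga-42": {"reserve_stock": "completed"}},
    "payments": {"saga-42": {"charge_card": "failed"}},
    "shipping": {},
}
snapshot = global_state(
    "saga-42", ["reserve_stock", "charge_card", "create_shipment"], stores
)
```

Here `snapshot` lists every step from the contract in order, with the status each owning participant reported and "pending" for steps nobody has recorded yet.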

Saga Hq turns the messy reality of distributed failure into a structured, reversible protocol: a durable transaction model for the event-driven era.