Joel Amoako

Stop Letting Cron Run Your Go Jobs

I built this project to understand event-driven architecture in practice, not just from blog posts. I wanted to see jobs move through a system, fail, retry, and recover in a way I could watch in real time.

The result is a small event-driven job engine written in Go. Here is how it works.

Why I cared about events in the first place

Most real apps have background work:

  • send an email after a signup
  • charge a card at the end of the month
  • generate a report once a day

When you are in a hurry, this work accumulates in ad-hoc ways:

  • a cron script here
  • a table with a status flag there
  • a queue that only one part of the app knows about

After a while, you have a mess of special cases.

I wanted one consistent model:

Treat every piece of background work as an event that moves through a clear set of states.

That model looks like this:

  • awaiting processing
  • dispatched
  • being processed
  • succeeded
  • failed (with retry rules)
  • suspended
  • expired

The engine’s job is to move events safely between these states.

Seeing the engine in action

The code lives here:

  • https://github.com/xceejay/event-driven-golang

Once you have Go, MySQL, Redis, and NATS running, you can start the engine with:

make engine

Then open the UI:

open http://localhost:8080

You will see a small web page that lists events and their current state. It looks like this:

Event engine UI

When the engine is running against a real database and broker, the UI connects over server-sent events and WebSocket and shows you:

  • when a new event is created
  • when it is picked up for work
  • when it succeeds or fails
  • when it is retried or expires

That turns “event-driven architecture” from a diagram into something you can watch.

When the Railway deployment is up you can hit the same UI at https://eda-engine-production.up.railway.app.

The main pieces

The engine has four main parts:

  1. A server that exposes a small API
  2. A lifecycle service that knows how events are allowed to move
  3. Storage for events, payloads, and configuration
  4. Workers that move events forward on a schedule

The server

The server is a Go HTTP service. It:

  • accepts new events
  • lets workers ask for work
  • records when events succeed or fail
  • streams live updates to the web UI

It also exposes a set of tools using MCP (Model Context Protocol) so AI agents can drive the engine, but you can ignore that detail if you just care about Go and queues.

The lifecycle service

This part of the code knows the rules for state changes. For example:

  • from awaiting processing you can go to dispatched
  • from dispatched you can go to being processed
  • from being processed you can go to succeeded or failed

Every event type gets a configuration row in MySQL. That config says things like:

  • how many attempts are allowed
  • how long the event can live before it expires
  • how long to wait before the first and later attempts

When you publish an event the lifecycle service looks up that config, calculates the first scheduled time and expiry, and writes both the event row and its payload. When a worker reports success or failure, the lifecycle service checks that the move is valid, updates the row, and decides whether to retry or stop based on that same configuration.

Storage

The engine uses:

  • MySQL to store events and configuration
  • Redis to store payloads (or an in-memory map in development)
  • Prometheus metrics to track queue sizes and processing

You could swap these pieces out, but they are enough to keep the engine simple and real.

The workers and NATS

For actual processing the engine uses two things:

  • scheduled jobs inside the Go process
  • a NATS message broker on the edge

The scheduled jobs are just goroutines with time.Ticker. They:

  • look for events that are ready to be processed
  • schedule future events when their time comes
  • expire events that have been suspended too long
  • update queue metrics

When an event is ready, the engine publishes it to a NATS subject. A small adapter subscribes to that subject, does the real work (for example, send an email or hit an external API), and then tells the engine whether it succeeded or failed.

I chose NATS because:

  • it is small and fast
  • the Go client is solid
  • you get per-message time-to-live support for free

How one event moves through the system

Take a single RIDE_REQUESTED event.

  1. The UI calls POST /api/events with:
    • event_type = "RIDE_REQUESTED"
    • flow_type = "FLOW_B" for rides
    • a flow_id that identifies the ride
    • a JSON payload with pickup and destination
  2. The server validates the body and asks the lifecycle service to publish the event.
  3. The lifecycle service loads the config for RIDE_REQUESTED, calculates when the first attempt should run and when the event should expire, and inserts two rows:
    • one in the event table with state awaiting processing
    • one in the event_payload table with the JSON payload
  4. A scheduled job wakes up, finds events that are due, and marks them as dispatched.
  5. For each dispatched event it publishes a message to NATS on a subject like event-engine.flow_b with a copy of the event ID and payload.
  6. The ride adapter receives that message and calls POST /api/events/{id}/acquire to lock the event and fetch the payload.
  7. After doing its work the adapter calls either:
    • POST /api/events/{id}/success with optional spinoff events, or
    • POST /api/events/{id}/failure with an error message.
  8. The lifecycle service updates the event row based on that call and, if there were spinoff events, publishes those as brand new events in the same flow.

The same pattern works for email, payouts, or anything else that moves through a few clear states and needs retries.

A concrete example: the ride-hailing demo

The easiest way to see this engine in real life is the small ride-hailing demo I built on top of it. One ride request becomes six separate events:

  • RIDE_REQUESTED
  • DRIVER_MATCHED
  • DRIVER_ARRIVED
  • TRIP_STARTED
  • TRIP_COMPLETED
  • PAYMENT_PROCESSED

Each step has its own configuration for retries and timing. When one step succeeds, the worker reports success and asks the engine to publish the next event in the chain. If a step fails, the engine can retry just that step without losing the overall ride.

On the live demo the HTML UI connects to the engine and shows one ride at a time moving through those stages. You can request a ride, watch the events appear, fail, retry, and eventually complete.

Where this pattern shows up in payments

Payment systems follow a similar pattern. A single purchase might involve:

  • creating a payment attempt
  • sending it to a card network or PSP
  • waiting for a callback
  • capturing funds
  • issuing a receipt or notification

Each step needs to be idempotent and recoverable so that if a request or callback is retried, money is not charged twice and no payments are lost.

An engine like this gives you:

  • a table that records the current state of each payment attempt
  • retry rules that live in one place instead of being scattered in handlers
  • a single API for workers to report success or failure, with a clear version and status

You still need to design the payment events carefully, but the mechanics of “move between states and do not lose work” are the same as the ride example.

How this helped me understand event-driven systems

A few things clicked for me while building this:

  • States matter more than functions. Once I wrote down the event states, many design decisions became obvious.
  • Centralising the lifecycle removes a lot of guesswork. Instead of each feature inventing its own retry logic, the engine owns it.
  • Simple tools are enough. Goroutines, tickers, and a small broker like NATS cover a lot of background job needs.
  • A visual UI is worth the time. Seeing events move made it easier to debug than reading logs.

If you want to learn from it

A simple way to use this project is:

  1. Clone the repo
  2. Skim the PLAN.md to see the intended architecture
  3. Run make engine and open the UI
  4. Trigger a few events and watch how they move
  5. Read the lifecycle and worker code with the UI open

You do not have to use this engine in production to learn from it. The goal was to build a concrete example that shows how jobs can move through a system in a controlled way without a lot of framework magic.