Why We Built a Distributed Workflow Monitor from Scratch
The limitations of existing SaaS monitoring tools that led us to build Monitrics—a distributed workflow monitoring platform with real-time alerting and multi-region execution.
Published: February 2026 | Reading Time: 8 minutes
What if your monitoring system costs more than the downtime it's supposed to prevent? That's the uncomfortable question many organizations face as observability bills climb while the problems they solve remain unchanged.
Like most engineering teams, we started with the obvious choices. Standard uptime monitoring tools, APM platforms, synthetic check services. These tools worked when we were small. But as we scaled, cracks appeared—and the cracks revealed fundamental limitations rather than configuration problems.
The Cost Problem
Observability costs have grown substantially year over year, often outpacing the growth of the infrastructure being monitored. The pricing models that seemed reasonable at small scale become challenging at larger scale.
Many monitoring platforms use modular pricing that appears economical initially. APM priced separately from infrastructure monitoring, synthetic checks priced separately from log analysis. Each module seems affordable. But organizations typically need multiple modules, and the cumulative cost grows quickly.
Data volume compounds the problem. More services generate more metrics, more logs, more traces. The "centralize-then-analyze" model—shipping all observability data to a central platform—works at small scale but costs scale with data volume regardless of value delivered.
Burst traffic creates billing surprises. A traffic spike multiplies metrics, logs, and traces alike, and all of it lands on the bill. The cost scales with activity, not with the value of the insights.
The result: significant engineering time spent not on building features but on optimizing observability costs. Adjusting retention policies, reducing metric cardinality, sampling more aggressively—all to control bills rather than improve visibility.
The Workflow Gap
Beyond cost, we encountered a capability gap. Our business runs on complex workflows: multi-step API orchestrations, conditional processing pipelines, sequential operations with dependencies between steps.
Standard monitoring tools excel at checking whether individual endpoints respond. They can tell you that server-01 returned 200 OK in 150ms. What they can't tell you is whether a complete business process executed successfully—whether an order flowed from placement through payment validation through inventory check through fulfillment.
We needed to monitor workflows, not just endpoints.
A workflow check validates that a sequence of operations completes successfully. Step one calls the authentication service. If that succeeds, step two queries the inventory system. Depending on the result, step three either processes payment or returns an error. The workflow either completes successfully or identifies exactly where it failed.
This is fundamentally different from checking individual endpoint health. Every service might report healthy while the workflow that depends on their correct sequencing fails. Synthetic transaction monitoring approaches this but typically lacks the conditional logic and multi-step orchestration that complex workflows require.
The Perspective Problem
Single-region monitoring creates blind spots. If your monitoring runs from the same infrastructure as your service, it shares the same failure modes. Regional outages affect both your service and your monitoring—resulting in dashboards showing green while users experience failures.
We experienced this directly during regional cloud outages. Our monitoring, running in the affected region, showed acceptable metrics. External reports told a different story. The monitoring system couldn't see the problem because it was inside the problem.
Effective monitoring requires external perspective—probes running in different regions, on different networks, seeing your service the way users see it rather than the way your infrastructure sees itself.
The Build Decision
We evaluated alternatives extensively. Open-source monitoring stacks offered flexibility but required significant operational investment. Enterprise platforms offered capability but at substantial cost. Point solutions addressed specific needs but created fragmented visibility.
None provided what we needed: workflow-native monitoring with multi-region execution at predictable cost.
So we built it.
The decision wasn't obvious. Building monitoring infrastructure is a substantial investment. But the build-versus-buy calculation depends on your specific needs. For us, the recurring cost of SaaS tools exceeded what self-built infrastructure would cost to operate. The capability gaps meant we'd need custom development regardless. And the data sovereignty requirements for certain use cases made centralized SaaS difficult.
The Architecture We Chose
We designed around the limitations we'd experienced.
Distributed by default. Rather than centralizing data for analysis, we distribute processing to regional nodes. Each region runs its own monitoring infrastructure, reducing both latency and cross-region data transfer costs. Regional failures affect regional monitoring, not global visibility.
Workflow-native modeling. Workflows are first-class concepts, not sequences of independent checks stitched together by external logic. Define workflows as multi-step processes with HTTP, TCP, ICMP, DNS, and browser-based steps. Conditional logic, variable passing between steps, and structured failure handling are built in.
Multi-database strategy. Different data types have different access patterns. Workflow definitions are relational—they fit PostgreSQL naturally. Time-series metrics need specialized storage optimized for append-heavy workloads. We use the right database for each data type rather than forcing everything into one system.
Predictable costs. Fixed infrastructure costs scale with our deployment choices, not with data volume. We know what monitoring will cost next month regardless of traffic patterns.
When This Makes Sense
Custom monitoring infrastructure makes sense in specific situations.
When scale creates meaningful cost savings. If observability bills are substantial and growing, self-built alternatives become economically attractive. The crossover point varies, but building typically becomes compelling once SaaS spending exceeds what it would cost to operate equivalent infrastructure yourself.
When standard tools can't model your requirements. If your business processes involve workflow orchestration that monitoring tools can't express, custom development happens regardless—either as external glue code around standard tools or as purpose-built monitoring.
When data sovereignty matters. Regulated industries, government contracts, and certain international operations require data to stay within specific boundaries. Self-operated infrastructure provides control that SaaS solutions often can't.
When real-time requirements are stringent. SaaS monitoring adds network latency by design—checks run remotely, results travel to central analysis. Self-operated monitoring can achieve lower detection latency for time-critical operations.
When It Doesn't
Building monitoring infrastructure doesn't make sense in other situations.
Small teams should focus on their core product, not on operating monitoring infrastructure. The operational burden exceeds the benefits at small scale. SaaS tools trade dollars for operational simplicity—a good trade when dollars are more available than engineering time.
Standard monitoring needs are well-served by existing tools. If your requirements fit what off-the-shelf solutions provide, there's no reason to build custom infrastructure.
Limited budgets, paradoxically, sometimes favor SaaS. Building requires upfront investment; SaaS spreads cost over time. Organizations with constrained cash flow might prefer predictable monthly bills to development investment.
The Results
Since deploying our own monitoring infrastructure, we've seen measurable improvements.
Observability costs decreased substantially compared to our previous SaaS stack. The savings continue to compound as we scale.
Detection latency improved dramatically. Workflow failures that previously took minutes to detect now surface in seconds. Multi-region execution means regional issues are detected from outside the affected region.
Workflow visibility transformed operations. Instead of correlating dozens of individual checks to understand business process health, single workflow monitors provide end-to-end visibility.
Data control eliminated compliance concerns for sensitive workloads. We know exactly where our monitoring data lives and who can access it.
The Broader Question
The question isn't "build or buy"—it's "what combination of build and buy addresses your needs at acceptable cost."
For commodity monitoring needs, buy. For unique requirements at meaningful scale, consider building. For most organizations, a hybrid approach makes sense: SaaS for standard infrastructure monitoring, custom solutions for business-specific workflows.
The economics and capabilities of monitoring continue to evolve. What required custom development five years ago might be commodity today. What's possible with SaaS today might be inadequate for your needs tomorrow. The key is evaluating your specific requirements against available options, rather than defaulting to either extreme.
We built Monitrics because we needed it. If you need it too, we'd love to have you try it.
Interested in distributed workflow monitoring? Monitrics provides multi-step health checks with HTTP, TCP, ICMP, DNS, and browser-based monitoring across multiple regions. Learn more at Monitrics.
Related Articles
From Cron to Distributed: Scaling Scheduled Workflows Beyond a Single Server
Cron jobs don't scale. Learn when and how to migrate from traditional cron to distributed workflow schedulers, and why the transition matters for growing teams.
Why We Chose Go Microservices Over a Monolith for Workflow Monitoring
Go's concurrency model, single binary deployment, and fast compilation make it ideal for distributed systems. Learn why we built Monitrics with Go microservices instead of a monolith.
Multi-Region Health Checks: Why Your Uptime Monitoring Needs Geographic Diversity
Single-region monitoring creates blind spots that can leave your users affected by outages while your dashboards show green. Learn why geographic diversity is critical for modern distributed systems.