CQRS-based Integration

The Question

You architected a CQRS-based integration using Azure Service Bus for command processing with GraphQL read models. Walk me through a specific challenge you faced with eventual consistency and how you solved it. What monitoring did you put in place to detect consistency lag?

The Problem

Every app we had was building its own integration with Salesforce, our CRM. The SFOps team was only two people, and they were already overtaxed.

So we built a system that sat between Salesforce and everything else—basically a cache of the data the other apps needed. It had three main parts:

  • A process written in Apex (Salesforce's proprietary language) that detected changes to the target records and published them to a message queue on Azure Service Bus
  • A worker process written in Go that subscribed to that queue, pushed the data into Azure CosmosDB, and then published an event to Azure EventHubs once the record was successfully stored (see the sketch after this list)
  • A query-only GraphQL API written in .NET that dependent apps used to pull the cached Salesforce data

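The worker's job was deliberately simple: take a change message off the queue, upsert the record into CosmosDB, and only then publish the "stored" event to EventHubs. Here is a minimal sketch of that flow; the QueueConsumer, DocumentStore, and EventPublisher interfaces are hypothetical stand-ins for the Azure SDK clients, not our actual code.

```go
// A minimal sketch of the worker's consume -> store -> publish flow.
// QueueConsumer, DocumentStore, and EventPublisher are hypothetical
// stand-ins for the Azure Service Bus, CosmosDB, and EventHubs clients.
package worker

import (
	"context"
	"log"
)

// Message is the change notification produced by the Apex process.
type Message struct {
	RecordID string
	Payload  []byte
}

type QueueConsumer interface {
	Receive(ctx context.Context) (Message, error)
	Complete(ctx context.Context, m Message) error
	DeadLetter(ctx context.Context, m Message, reason string) error
}

type DocumentStore interface {
	Upsert(ctx context.Context, recordID string, payload []byte) error
}

type EventPublisher interface {
	PublishStored(ctx context.Context, recordID string) error
}

// Run processes one message at a time: store the record first, publish the
// "stored" event only after the write succeeds, then settle the message.
func Run(ctx context.Context, q QueueConsumer, db DocumentStore, ev EventPublisher) error {
	for {
		msg, err := q.Receive(ctx)
		if err != nil {
			return err // receive errors (including ctx cancellation) stop the worker
		}
		if err := db.Upsert(ctx, msg.RecordID, msg.Payload); err != nil {
			log.Printf("upsert failed for %s: %v", msg.RecordID, err)
			_ = q.DeadLetter(ctx, msg, "cosmos upsert failed")
			continue
		}
		if err := ev.PublishStored(ctx, msg.RecordID); err != nil {
			// The record is stored; log the publish failure rather than dead-lettering.
			log.Printf("event publish failed for %s: %v", msg.RecordID, err)
		}
		_ = q.Complete(ctx, msg)
	}
}
```
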
The business could live with data being up to 24 hours stale, which is effectively unbounded from an operations standpoint. And since Salesforce updates peaked at only a few hundred per day, the cache consistently converged in under 60 seconds in practice.

The Consistency Challenge

Our biggest consistency issue came from a Salesforce schema change. The SFOps team modified a monitored table's schema late one evening, which caused our Apex code to fail on type mismatches when processing new or updated records.

The tricky part was that this failure was silent initially—no Salesforce activity occurred overnight, so no errors were generated. The next morning, when a business user created a new record, the Apex code immediately failed, and our alerts fired within minutes.

When we investigated, we discovered the integration had been broken for 12-14 hours. During that window, no updates could flow from Salesforce to CosmosDB, so our GraphQL API was serving stale data to dependent systems for anything that had changed.

How We Detected and Resolved It

We set up monitors and alerts to make sure the system was actually delivering on its promise—a combination of metrics, distributed tracing, and logging.

Metrics we monitored:

  • Number of messages the Apex code produced
  • Number of messages the worker process consumed (successfully and unsuccessfully), and their processing latency
  • Size of the dead letter queue
  • Input and output message counts in Azure EventHubs
  • Standard HTTP endpoint metrics for the GraphQL API

We built Datadog monitors on these metrics and routed their alerts through PagerDuty. Since the system didn't process high message volumes, we didn't set "expected minimum" throughput thresholds the way we would for a higher-volume system.
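
To make the message counts, latency, and dead letter depth concrete, here's a hedged sketch of how a Go worker could push them to the local Datadog agent over DogStatsD using the datadog-go client; the metric names, tags, and wrapper type are illustrative assumptions, not our actual code.

```go
// A sketch of emitting the counts and latency above to the local Datadog
// agent over DogStatsD. Metric names, tags, and this wrapper are
// illustrative, not the ones we actually used.
package worker

import (
	"time"

	"github.com/DataDog/datadog-go/v5/statsd"
)

// Metrics wraps the DogStatsD client so the processing loop can record
// consumed/failed message counts, per-message latency, and DLQ depth.
type Metrics struct {
	client *statsd.Client
}

func NewMetrics(agentAddr string) (*Metrics, error) {
	c, err := statsd.New(agentAddr) // e.g. "127.0.0.1:8125"
	if err != nil {
		return nil, err
	}
	return &Metrics{client: c}, nil
}

// MessageConsumed records one processed message and how long it took,
// tagged by outcome so Datadog monitors can split success from failure.
func (m *Metrics) MessageConsumed(ok bool, elapsed time.Duration) {
	outcome := "success"
	if !ok {
		outcome = "failure"
	}
	tags := []string{"outcome:" + outcome}
	_ = m.client.Incr("sf_cache.worker.messages_consumed", tags, 1)
	_ = m.client.Timing("sf_cache.worker.processing_latency", elapsed, tags, 1)
}

// DeadLetterDepth reports the current dead letter queue size as a gauge.
func (m *Metrics) DeadLetterDepth(depth float64) {
	_ = m.client.Gauge("sf_cache.worker.dead_letter_depth", depth, nil, 1)
}
```

Tagging a single metric by outcome keeps success and failure on one series, which keeps the monitor queries simple for a low-volume pipeline like this one.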

We also used distributed tracing for GraphQL requests, with correlated logs, and placed alerts on error logs. Jira automatically scheduled recurring operational tasks for the on-call person to review the Datadog logs and any PagerDuty incidents. Since the business could tolerate delays, we treated these as business-hours issues rather than urgent incidents.
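
The tracing itself lived on the GraphQL side, but the correlation idea is easiest to show from the worker: every log line carries an ID that ties it back to the originating message, so an error-log alert can be traced end to end. Below is a minimal sketch using Go's standard log/slog; the correlation ID field, the storeFunc stand-in, and the helper name are assumptions for illustration.

```go
// A sketch of log correlation on the worker side, assuming the queue message
// carries a correlation ID set upstream. Field names and this helper are
// assumptions, not our actual code.
package worker

import (
	"context"
	"log/slog"
)

// storeFunc stands in for the CosmosDB upsert from the main processing loop.
type storeFunc func(ctx context.Context, recordID string, payload []byte) error

// processWithLogging derives a per-message logger that carries the correlation
// ID, so an error-log alert in Datadog can be traced back to the originating
// Salesforce change.
func processWithLogging(ctx context.Context, base *slog.Logger, correlationID, recordID string, payload []byte, store storeFunc) {
	logger := base.With(
		slog.String("correlation_id", correlationID),
		slog.String("record_id", recordID),
	)
	logger.Info("processing salesforce change")

	if err := store(ctx, recordID, payload); err != nil {
		// Error-level entries like this are what the error-log alerts key on.
		logger.Error("cosmos upsert failed", slog.Any("error", err))
		return
	}
	logger.Info("record stored")
}
```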

Prevention

We didn't have much control over the Salesforce environment and had no test instance, so technical solutions like synthetic monitoring were off the table. Instead, we worked out a schema change coordination process with SFOps.

Now, any changes to target records in Salesforce go through my team for impact review before deployment. We assess whether changes will affect our integration, update the Apex code if needed, and coordinate deployment timing with SFOps. That eliminated similar incidents going forward and kept things clean between our teams.

Lessons Learned

In hindsight, we should have established that coordination process from day one. We already had a working relationship with SFOps from identifying the data requirements, so extending it to cover schema change reviews would have been natural. We learned that lesson the hard way, but once the incident happened we put the right process in place and haven't had similar issues since.