
8,000% Throughput Improvement


The Question

You achieved an 8,000% throughput improvement at SmartDrive. That's a massive gain. What was the bottleneck you identified, and how did you validate the solution before implementing it?

The Context

This was back in 2013-2014, several years before the ground-up rearchitecture. The system was processing IoT data from smart recorders installed in fleet vehicles, and there was a well-known performance problem—as load increased, processing throughput dropped dramatically. We were handling about 40,000 messages per day on a single instance, and the system was struggling.

The thing about this workload was that data came in waves. Most smart recorders would transmit their data when vehicles returned to the depot and connected to WiFi, which meant we'd see massive spikes in traffic at predictable times rather than steady load throughout the day.

Identifying the Bottleneck

When I took over the team—seven engineers based in Lviv, Ukraine, working remotely—they walked me through how the software was implemented, and the problem became immediately obvious to me: the database was processing business logic, not just storing data.

The architecture had some serious problems:

  1. Stored procedures hundreds of lines long containing business logic that belonged in the application tier
  2. Synchronous processing of work that naturally lent itself to asynchronous processing
  3. Heavy reliance on SQL Server Service Broker, which added further overhead
  4. Single SQL Server instance handling everything—read/write contention, dirty reads, and blocking were rampant

The organization had been throwing more hardware at the SQL Server instance, but it wasn't helping. This was a classic sign of an architectural problem, not a resource problem.
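
To make the anti-pattern concrete, here is a minimal sketch of the kind of synchronous, database-centric flow we inherited. The class, stored procedure name, and parameters are illustrative, not the actual SmartDrive code; the point is that the HTTP caller blocks while SQL Server runs the business logic.

    using System.Data;
    using System.Data.SqlClient;

    // Illustrative "before" picture (hypothetical names, not the real code):
    // every incoming message is handed straight to a large stored procedure,
    // so the request blocks while SQL Server executes the business logic.
    public class RecorderIngestHandler
    {
        private readonly string _connectionString;

        public RecorderIngestHandler(string connectionString)
        {
            _connectionString = connectionString;
        }

        public void Ingest(string recorderId, byte[] payload)
        {
            using (var connection = new SqlConnection(_connectionString))
            using (var command = new SqlCommand("dbo.usp_ProcessRecorderMessage", connection))
            {
                command.CommandType = CommandType.StoredProcedure;
                command.Parameters.AddWithValue("@RecorderId", recorderId);
                command.Parameters.AddWithValue("@Payload", payload);

                connection.Open();

                // Hundreds of lines of T-SQL business logic run here, inside
                // the single SQL Server instance, before the caller gets a reply.
                command.ExecuteNonQuery();
            }
        }
    }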

The Solution

Once I identified that the database shouldn't be the processing engine, the solution became clear. We needed to move to asynchronous processing with business logic in the application tier (a sketch of the new flow follows this list):

  1. Immediate response: When a request came in, we saved the data as-is and immediately returned a 202 Accepted response. No processing blocking the HTTP request.

  2. Business logic migration: We extracted all the business logic from the stored procedures—aside from simple CRUD operations—and moved it into service classes in the application. These service classes processed the data we'd saved to the database.

  3. ThreadPool processing: We processed these requests using the .NET ThreadPool which, if I recall correctly, could handle up to 32 threads simultaneously by default. This allowed us to process multiple requests concurrently without overwhelming the system.

  4. Load balancing: Our systems engineers also implemented a load balancing strategy to scale the application out horizontally, though in reality the core issue wasn't application capacity—it was database contention.
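
Here is the sketch of that new flow, assuming an ASP.NET Web API-style endpoint of the era. The type names (RecorderMessage, IMessageStore, MessageProcessingService) are hypothetical, not the original code; they just show the shape of the change: save the raw payload, acknowledge with 202 Accepted, and let the ThreadPool run the extracted business logic out of band.

    using System.Net;
    using System.Net.Http;
    using System.Threading;
    using System.Web.Http;

    // Hypothetical shape of the reworked ingest path (not the original code).
    public class RecorderMessage
    {
        public string RecorderId { get; set; }
        public byte[] Payload { get; set; }
    }

    public interface IMessageStore
    {
        // Simple CRUD persistence; no business logic in the database.
        long SaveRaw(RecorderMessage message);
    }

    public class MessageProcessingService
    {
        // Business logic extracted from the stored procedures lives here:
        // load the saved row, apply the rules, write the results back.
        public void Process(long messageId) { /* ... */ }
    }

    public class MessagesController : ApiController
    {
        private readonly IMessageStore _store;
        private readonly MessageProcessingService _processor;

        // Wired up by the application's dependency resolver (assumed).
        public MessagesController(IMessageStore store, MessageProcessingService processor)
        {
            _store = store;
            _processor = processor;
        }

        [HttpPost]
        public HttpResponseMessage Post([FromBody] RecorderMessage message)
        {
            // 1. Save the payload as-is; nothing heavy runs on the request path.
            long messageId = _store.SaveRaw(message);

            // 2. Hand processing off to the .NET ThreadPool.
            ThreadPool.QueueUserWorkItem(_ => _processor.Process(messageId));

            // 3. Acknowledge receipt immediately.
            return Request.CreateResponse(HttpStatusCode.Accepted);
        }
    }

The key property is that a request's latency no longer depends on how long the business logic takes; on the hot path, the database only sees a simple insert per message.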

How We Validated the Solution

We validated this approach through multiple methods:

Load Testing: We tested as thoroughly as we could in lower environments, simulating the wave-based load patterns we saw in production when vehicles returned to depots.
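
As a rough illustration of that kind of wave simulation (a hypothetical harness, not the tooling we actually used), the idea is to fire a depot-sized burst of posts at once and confirm the service acknowledges all of them quickly:

    using System;
    using System.Linq;
    using System.Net;
    using System.Net.Http;
    using System.Text;
    using System.Threading.Tasks;

    // Hypothetical burst-load harness: simulate a "vehicles back at the depot"
    // wave by posting many messages at roughly the same time.
    public static class WaveLoadTest
    {
        public static async Task RunWaveAsync(string endpoint, int messagesInWave)
        {
            using (var client = new HttpClient())
            {
                var started = DateTime.UtcNow;

                // All messages arrive together, like recorders hitting depot WiFi.
                var posts = Enumerable.Range(0, messagesInWave).Select(i =>
                    client.PostAsync(endpoint, new StringContent(
                        "{\"recorderId\":\"R" + i + "\",\"payload\":\"\"}",
                        Encoding.UTF8, "application/json")));

                var responses = await Task.WhenAll(posts);
                var accepted = responses.Count(r => r.StatusCode == HttpStatusCode.Accepted);

                Console.WriteLine("Wave of " + messagesInWave + ": " + accepted +
                    " accepted in " + (DateTime.UtcNow - started));
            }
        }
    }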

Logging and Metrics: We had logging in place and some rudimentary metrics that backed up our results and let us compare before and after performance.

Database Profiling: This was the most important validation. Our DBAs had profiled database performance before we made the changes, capturing CPU usage, memory consumption, disk I/O, and contention metrics. After our changes, they ran the same profiling and could objectively compare the results. They verified that database contention dropped, along with CPU, memory, and disk I/O during processing periods.

This before-and-after profiling gave us confidence that we'd actually solved the root cause, not just moved the bottleneck somewhere else.
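
For context on what a before-and-after comparison like that involves, here is a minimal sketch of sampling host-level counters with Windows performance counters. The DBAs' actual profiling was done with SQL Server tooling, which also captured contention metrics, so treat this only as an adjacent illustration of capturing a baseline and re-running it after the change.

    using System;
    using System.Diagnostics;
    using System.Threading;

    // Illustrative sampler for host-level CPU, memory, and disk I/O counters;
    // run it before and after a change to get comparable numbers.
    public static class BaselineSampler
    {
        public static void Sample(TimeSpan duration, TimeSpan interval)
        {
            using (var cpu = new PerformanceCounter("Processor", "% Processor Time", "_Total"))
            using (var memoryFree = new PerformanceCounter("Memory", "Available MBytes"))
            using (var diskReads = new PerformanceCounter("PhysicalDisk", "Disk Reads/sec", "_Total"))
            using (var diskWrites = new PerformanceCounter("PhysicalDisk", "Disk Writes/sec", "_Total"))
            {
                cpu.NextValue(); // rate counters return 0 on the first read; prime them
                diskReads.NextValue();
                diskWrites.NextValue();

                var until = DateTime.UtcNow + duration;
                while (DateTime.UtcNow < until)
                {
                    Thread.Sleep(interval);
                    Console.WriteLine(
                        "cpu=" + cpu.NextValue().ToString("F1") + "% " +
                        "memFreeMB=" + memoryFree.NextValue().ToString("F0") + " " +
                        "diskReads/s=" + diskReads.NextValue().ToString("F1") + " " +
                        "diskWrites/s=" + diskWrites.NextValue().ToString("F1"));
                }
            }
        }
    }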

Risk Management

We had a rollback plan in place, though the specific details escape me after 12 years. I know we had database backups before making the changes, and I believe we may have deployed the new service as a separate IIS site with DNS-based traffic switching so we could revert quickly if needed. The exact mechanism is fuzzy, but I know we didn't just deploy and hope for the best—we had a way to revert if something went wrong.

The Results

The impact was dramatic. We went from processing 5,000 messages per day per instance to over 400,000 messages per day per instance, an 80-fold jump, or roughly an 8,000% improvement in throughput.

The system also became more stable. We never had this performance issue again, even as the business continued to grow. The wave-based load patterns that had previously crushed the database were now handled smoothly. The DBAs were happy because their SQL Server instance was no longer constantly on fire, and the business could scale without constantly buying bigger database hardware.

Over time, as more smart recorders transmitted data over cellular networks rather than waiting for depot WiFi, the size of these waves gradually decreased, which further smoothed out the load patterns. But the architectural fix we'd made meant the system could handle either pattern effectively.

Eventually, the DBAs implemented SQL Server Always On availability groups, dedicating one server for writes and directing reads to read-only replicas. But that was outside my wheelhouse and came later as an additional optimization. The core architectural change we made solved the immediate crisis and gave the system room to grow.