Identity Platform Improvements

The Question
You achieved 99.995% uptime on your identity platform—that's less than 30 minutes of downtime in 30 months. What was your architectural approach to achieving this level of reliability? What were the main failure modes you designed against?
What MIDS Was
The Identity Platform was a custom-built identity provider written in .NET that implemented open standards such as OAuth2, OpenID Connect, and SAML2, making it similar in function to providers like Google or Okta. It was built on a library called IdentityServer. Originally, every MCG solution implemented its own authentication and authorization mechanisms; our clients needed single sign-on (SSO) instead of logging into each application separately.
This made the MCG Identity Platform a single point of failure for many of our products: if it went down, customers couldn't access any of our applications. That's why achieving 99.995% uptime (less than 30 minutes of downtime in 30 months) was critical to our business.
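To make the phrase "custom-built identity provider built on IdentityServer" concrete, here is a minimal sketch of how such a service is typically wired up in ASP.NET Core with IdentityServer4. The client, scopes, redirect URI, and signing credential are illustrative placeholders rather than MIDS's actual configuration, and SAML2 support (which requires a separate add-on) is not shown.

```csharp
// Minimal sketch of an IdentityServer4-based identity provider in ASP.NET Core.
// Client IDs, scopes, and URLs are hypothetical placeholders, not MIDS's real config.
using IdentityServer4.Models;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddIdentityServer()
    // In-memory stores keep the sketch self-contained; a production IdP would
    // load clients and resources from a database or configuration store.
    .AddInMemoryIdentityResources(new IdentityResource[]
    {
        new IdentityResources.OpenId(),   // the OpenID Connect "openid" scope
        new IdentityResources.Profile(),
    })
    .AddInMemoryApiScopes(new[] { new ApiScope("product.api") })
    .AddInMemoryClients(new[]
    {
        new Client
        {
            ClientId = "example-product",                    // hypothetical client application
            AllowedGrantTypes = GrantTypes.Code,             // OAuth2 authorization code flow
            RequirePkce = true,
            RedirectUris = { "https://example-product.contoso.com/signin-oidc" },
            AllowedScopes = { "openid", "profile", "product.api" },
        },
    })
    .AddDeveloperSigningCredential();  // dev-only key material; production uses a real signing key

var app = builder.Build();
app.UseIdentityServer();  // exposes /.well-known/openid-configuration, /connect/authorize, /connect/token
app.Run();
```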
The Challenge I Inherited
When I became Senior Software Architect in early 2021, I took responsibility for the Identity Platform—a system we called MIDS (MCG Identity Service) that had been in production since 2017 and was running at 99.7% uptime (about 22 hours of downtime per year).
Since this was a single point of failure for our entire product suite, we needed to do a lot better. I built a three-person team—two Staff Engineers and a Senior Engineer, sourced through internal transfer, direct hire, and contract-to-hire. I looked for engineers who could handle ambiguous problems independently and contribute to technical decisions, regardless of their language background.
Over the next 18 months, we improved uptime to 99.995%, reducing annual downtime from 22 hours to less than 30 minutes.
Main Failure Modes We Addressed
When we dug in, we found four things dragging down the uptime:
- Ingress failures - Azure Front Door point-of-presence outages caused latency and complete unavailability
- Capacity saturation - No autoscaling meant performance degraded under load
- Deployment disruptions - Weekly maintenance windows created planned downtime
- Silent failures - Poor logging and monitoring meant issues went undetected until customers reported them
Here's how we tackled each:
Ingress Failures
When we inherited MIDS, it used Azure Front Door as its main entry point. At the time, Front Door had reliability issues, with points of presence going offline unpredictably. At best this added significant latency; at worst it made MIDS completely unavailable, and the worst case was the more frequent one.
We consulted with Microsoft to see if we could proactively monitor Front Door's points of presence or receive maintenance notifications, but those capabilities weren't available at the time.
Solution: We switched to Azure Application Gateway, a regional service rather than Front Door's global approach. Since all our applications were deployed to the same region, this made sense architecturally and nearly eliminated our ingress issues.
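The write-up stops at the infrastructure change, but one application-side detail worth noting when an ASP.NET Core identity provider sits behind a reverse proxy like Application Gateway is forwarded headers: the app only sees the gateway's address and plain HTTP, so it must trust X-Forwarded-For and X-Forwarded-Proto to generate correct redirect and issuer URLs. A minimal sketch, not taken from MIDS:

```csharp
// Sketch: trusting forwarded headers when hosted behind a reverse proxy
// (e.g., Azure Application Gateway). Not MIDS's actual configuration.
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.HttpOverrides;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

builder.Services.Configure<ForwardedHeadersOptions>(options =>
{
    // Use the client IP and original scheme supplied by the gateway instead of
    // the gateway's own address and the backend's plain-HTTP scheme.
    options.ForwardedHeaders = ForwardedHeaders.XForwardedFor | ForwardedHeaders.XForwardedProto;

    // Only loopback proxies are trusted by default; clear the lists so the
    // gateway's (non-loopback) address is accepted, or add it explicitly.
    options.KnownNetworks.Clear();
    options.KnownProxies.Clear();
});

var app = builder.Build();
app.UseForwardedHeaders();  // must run early, before authentication and redirect logic
app.Run();
```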
Capacity Saturation
MIDS was hosted as a Docker container within Azure App Service—not ideal, but serviceable. The real problem was nobody had set up autoscaling. Two instances ran at all times regardless of load.
Solution: We experimented with various scaling approaches and settled on capacity-based autoscaling—starting with 5 instances and scaling to 10 as demand increased, with a 10-minute cooldown period. This single change decreased latency significantly and had the biggest impact on our availability improvement.
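The autoscaling itself was platform configuration on the App Service plan rather than application code, so there is nothing from MIDS to show here. One application-side prerequisite worth noting when an ASP.NET Core identity provider scales out, though, is a shared Data Protection key ring, so that cookies and tokens protected by one instance can be read by any other. The sketch below assumes Azure Blob Storage and Key Vault; the storage URI and key identifier are hypothetical.

```csharp
// Sketch: sharing the ASP.NET Core Data Protection key ring across instances so
// that any of the scaled-out instances can unprotect cookies issued by another.
// The blob SAS URI and Key Vault key below are hypothetical placeholders.
using System;
using Azure.Identity;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.DataProtection;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddDataProtection()
    // Persist keys to a blob every instance can reach
    // (requires Azure.Extensions.AspNetCore.DataProtection.Blobs).
    .PersistKeysToAzureBlobStorage(
        new Uri("https://examplestorage.blob.core.windows.net/keys/keyring.xml?sv=<sas-token>"))
    // Encrypt the key ring at rest with Key Vault
    // (requires Azure.Extensions.AspNetCore.DataProtection.Keys).
    .ProtectKeysWithAzureKeyVault(
        new Uri("https://example-vault.vault.azure.net/keys/dataprotection"),
        new DefaultAzureCredential());

var app = builder.Build();
app.Run();
```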
Deployment Disruptions
The org had always run one maintenance window per week, and every application, including MIDS, followed that schedule. This meant scheduled downtime every week.
Solution: We reengineered our CI/CD pipeline to enable zero-downtime, blue-green deployments. A key part of this was separating database deployments from application deployments. This allowed us to deploy updates to MIDS at any time without disrupting customers, eliminating the need for 1.5 hours of scheduled downtime per month.
We also developed a deployment checklist and process that included rollback procedures. We always tested this process in lower environments before production. Azure App Service deployment slots enabled us to deploy the new version, switch traffic over, and quickly roll back if we encountered any unforeseen issues. We communicated upcoming deployments to dependent teams so they could monitor their applications for anomalies.
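The swap mechanics live in the pipeline and in App Service configuration rather than in application code, but the application can help: App Service can be pointed at a warm-up path (the WEBSITE_SWAP_WARMUP_PING_PATH setting) so a slot only takes production traffic once it answers healthily. Below is a minimal health endpoint sketch; the /healthz path and the single self-check are assumptions, not MIDS's actual checks.

```csharp
// Sketch: a health endpoint the deployment slot swap can probe before traffic
// is switched. The /healthz path and the trivial check are assumptions.
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHealthChecks()
    // A trivial liveness check; real checks would verify the database,
    // signing-key material, and any downstream dependencies the IdP needs.
    .AddCheck("self", () => HealthCheckResult.Healthy());

var app = builder.Build();

// With WEBSITE_SWAP_WARMUP_PING_PATH=/healthz, App Service waits until the
// staging slot responds successfully on this endpoint before completing the swap.
app.MapHealthChecks("/healthz");

app.Run();
```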
Silent Failures
MIDS had no real monitoring and alerting strategy. It sent logs to Datadog, but logged at DEBUG verbosity and generated a high volume of "ERRORS" that weren't actual errors. For example, it logged an error when someone forgot their password or provided an invalid username—the system doing exactly what it should.
Solution: We cleaned up the logging systematically. First, we reclassified business rule violations from "errors" to "informational." Then we reevaluated all warning log entries: if a warning didn't require action or investigation, we downgraded it to informational. If a warning indicated something that required correction, we reclassified it as an error and investigated the root cause.
By the end of this process, we had eliminated almost all warning messages produced by MIDS itself (warnings from third-party libraries were outside that effort). We set the production logging level to "warn," which drastically reduced noise and allowed real errors to surface. We could then proactively correct issues, usually within the day they were discovered.
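To illustrate the reclassification, here is a minimal sketch in the spirit of the changes described; SignInService and its member names are invented for the example and are not MIDS's actual types.

```csharp
// Sketch of the logging discipline described above. SignInService and its member
// names are invented for illustration; they are not MIDS's real types.
using System;
using Microsoft.Extensions.Logging;

public class SignInService
{
    private readonly ILogger<SignInService> _logger;

    public SignInService(ILogger<SignInService> logger) => _logger = logger;

    public void RecordFailedSignIn(string username)
    {
        // A wrong password or unknown username is the system working as designed:
        // log it as informational so it no longer pollutes the error signal.
        _logger.LogInformation("Sign-in failed for {Username}: invalid credentials", username);
    }

    public void RecordTokenSigningFailure(Exception ex)
    {
        // A failure to sign a token is a genuine fault that needs investigation.
        _logger.LogError(ex, "Token signing failed");
    }
}
```

Raising the production floor is then a single host-level setting, for example `builder.Logging.SetMinimumLevel(LogLevel.Warning)` or the equivalent configuration entry.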
We also added distributed tracing, which allowed us to trace errors back to client applications. This enabled us to work proactively with those teams to improve their integration with MIDS.
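The write-up doesn't say which tracing stack was used, so the sketch below uses OpenTelemetry purely as a stand-in to show the shape of the change: instrument incoming requests and tag each trace with the OAuth2 client_id so a failing request can be attributed to the calling application. The service name, tag key, and exporter are assumptions.

```csharp
// Sketch: distributed tracing with OpenTelemetry as a stand-in for whatever
// tracing stack was actually used. Service name and tag key are assumptions.
using System.Diagnostics;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(r => r.AddService("identity-platform"))  // hypothetical service name
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()   // spans for incoming requests
        .AddHttpClientInstrumentation()   // spans for outbound calls
        .AddOtlpExporter());              // export to the tracing backend

var app = builder.Build();

// Tag the current span with the OAuth2 client_id (present on the authorize
// endpoint's query string) so errors can be traced back to the client app.
app.Use(async (context, next) =>
{
    var clientId = context.Request.Query["client_id"].ToString();
    if (!string.IsNullOrEmpty(clientId))
    {
        Activity.Current?.SetTag("oauth.client_id", clientId);
    }
    await next();
});

app.Run();
```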
Operational Improvements
Beyond addressing technical failure modes, we shifted from reactive to proactive operations. We added a daily operational task managed through Jira automation, where the on-call engineer would review logs and PagerDuty alerts and create Jira incidents for triage during our team standup.
That discipline mattered just as much as the technical changes—we were catching issues before customers even noticed.
Results and Impact
All told, MIDS went from a reactive, maintenance-heavy system to a reliable one. We eliminated planned downtime, cut incident response from days to hours, and hit 99.995% uptime.