Charter.

government · 2010–2014 · U.S. Centers for Medicare & Medicaid Services (CMS)

Healthcare.gov launch (2013)

How a federal program that hit every milestone shipped a system that didn't work — and what the rescue team did to fix it.

13 min read · 5 sources cited

// background

The Affordable Care Act, signed in March 2010, required a federal health insurance marketplace to be operational by October 1, 2013. The marketplace needed to allow consumers in 36 states (the states that hadn't built their own exchanges) to compare plans, verify identity and income against IRS and DHS records, calculate subsidies, and enrol — all in one online flow. The program was managed by the Centers for Medicare & Medicaid Services (CMS) within the Department of Health and Human Services.

CMS built the system as a federation of contractors. CGI Federal built the front-end web portal. QSSI built the data services hub that connected to IRS, DHS, the Social Security Administration, and a dozen other federal data sources. Multiple other vendors handled identity proofing, plan management, the back-end eligibility engine, and the call center infrastructure. CMS itself was the systems integrator — the role that, on a normal commercial program, would belong to a single accountable prime.

The deadline was statutory. The scope was set by law. Both were treated as fixed inputs by the program. By the spring of 2013, internal CMS testing showed serious problems with how the federated system would behave under real load and with real eligibility verification calls. A McKinsey assessment commissioned by CMS in March 2013 used unusually direct language for a consulting deliverable: the program was "highly likely to miss critical milestones" and the "current path would not result in a launch ready system."

The system launched on October 1, 2013 anyway. Six users completed enrolment on the first day. The site crashed within hours. By mid-month, press coverage had turned the launch into a national political event. What's less remembered, and more instructive for PMs, is the recovery: a "surge team" of senior engineers (drawn from Google and elsewhere, including people who would later form the nucleus of the U.S. Digital Service) was inserted in mid-October. Within ten weeks they had taken the system from uptime of less than 50% to over 95%, and from a failed-enrolment rate of ~80% to under 1%.

The Government Accountability Office and HHS Office of Inspector General both produced detailed post-mortems that are still required reading in public-sector PM courses. The lessons are sharper than the popular narrative ("government can't ship") suggests.

// the decisions

1. Whether to delay launch when March 2013 testing surfaced systemic issues

March 2013. McKinsey's review identifies major architectural and integration risks. CMS leadership has six months to launch. The contracts have been signed. The political environment is openly hostile to any delay; an HHS-initiated postponement would be read as political failure.

options on the table

  • A. Delay launch until end-to-end load testing showed the system working; this would have required statutory or regulatory cover.
  • B. Reduce launch scope (e.g., launch with paper-application fallback, fewer states, no tax-credit calculation in the initial release).
  • C. Hold the deadline and spend the remaining six months trying to harden the system in place.

what they actually did

Hold the deadline. CMS leadership did not formally raise the launch-risk question to the level where statutory delay or scope reduction could be considered. The contractors continued to deliver to the original SOWs. End-to-end integration testing happened in the final two weeks before launch.

consequence

Launch failed. By the end of Day 2, only 248 people had successfully enrolled across all 36 federal-exchange states. The administration spent the next six weeks in active crisis-management mode. The OIG post-mortem found that 'CMS made many missteps throughout development and implementation' and specifically called out 'the absence of clear leadership' on the integration question.

lesson

On any program with a statutory deadline, the PM's first job is to make the trade between scope and date legible to the people who can change the law or the scope. Hiding the trade — or letting it stay hidden because no one wants to surface it — converts a scope-or-date decision into a do-or-fail decision. CMS had options that weren't formally considered because surfacing them would have created political work nobody wanted.

2. The systems-integrator gap

Healthcare.gov had ~55 contractors. On a typical program of comparable complexity, one prime is the accountable systems integrator. CMS held that role itself. Its staff had managed individual contracts; they had not run a federated systems integration of this scale. The OIG report found this was 'a fundamental misalignment between role and capability.'

options on the table

  • A. Designate one prime contractor (most likely CGI or QSSI) as systems integrator with end-to-end accountability.
  • B. Bring in a separate independent systems-integration firm whose only job was to own integration.
  • C. Build CMS's own systems-integration capability quickly, hiring senior staff against the deadline.

what they actually did

None of the above. CMS retained the systems-integrator role internally and continued to manage the contractors as separate workstreams. There was no single accountable owner for end-to-end behaviour of the system.

consequence

When integration testing surfaced cross-vendor issues in the final weeks, no single party had the authority — or the technical breadth — to make trade-off calls across the contract lines. Issues bounced between vendor teams. The launch-day failure was an integration failure: each component largely worked at its interface; they didn't work together at scale.

lesson

Unless exactly one person is accountable for end-to-end behaviour, no one is. RACI's 'single A per row' rule scales up to 'single A per program', and Healthcare.gov is the canonical case study of what happens when that rule is broken at the program level. The structural fix isn't a process; it's an org chart.

3. How the surge team turned the system around in ten weeks

Mid-October 2013. The site is functionally broken. Headlines are devastating. The administration brings in a 'surge team' of senior engineers, headed by Mikey Dickerson (formerly Google) reporting effectively to Jeff Zients. The team finds a system with no single dashboard, no shared SLOs, no on-call rotation that crossed contract boundaries, and a culture where contractors reported their own status without an independent integrator validating it.

options on the table

  • A. Continue managing the existing contractor structure with tighter oversight.
  • B. Replace the contractors mid-flight (politically catastrophic and technically slower than fixing in place).
  • C. Insert engineers across contract boundaries with explicit authority to override individual vendor priorities, install shared metrics dashboards, and run a single daily incident-style stand-up.

what they actually did

The third. The surge team installed a 'war room' culture with a single daily 9 a.m. stand-up across all vendors, ran live performance dashboards on the wall, instituted a single SLO for end-to-end enrolment success, and gave Dickerson the authority to override vendor priorities. They explicitly didn't try to redesign the system; they triaged in priority order based on the dashboards.

consequence

By December 1, 2013, the end-to-end enrolment success rate was over 90%. By the close of open enrolment on March 31, 2014, more than 7M people had enrolled. The surge team's working pattern became the kernel of the U.S. Digital Service (founded 2014) and the General Services Administration's 18F (founded 2014). The 'surge team' playbook is now codified in federal IT procurement and is taught in the State Department's program management curriculum.

lesson

Most program rescues look like this: install one accountable owner with cross-cutting authority, install shared dashboards that everyone sees the same numbers on, and triage. The work isn't redesign; it's restoring the management feedback loop. When a program is failing, the fastest fix is usually not technical — it's structural.

// what to take away

  • 01. The systemic failure was not technical; it was a program-management void. McKinsey's March 2013 memo was unusually direct, and the GAO and OIG reports both name 'the absence of an integrator with end-to-end accountability' as the proximate cause.
  • 02. Statutory deadlines are not 'fixed'. They feel fixed because changing them is politically expensive, but the cost of pretending they are fixed when the program can't hit them is always higher than the cost of surfacing the trade. Healthcare.gov's launch failure cost more than any imaginable launch delay would have.
  • 03. The surge team's intervention is the most-studied IT-rescue case in the federal government. The pattern (insert one accountable owner, install shared dashboards, run a war-room cadence, triage by impact) is portable. Most software-program rescues use a variant of it.
  • 04. The surge team's after-action led directly to USDS and 18F. Modern federal-IT acquisition rules (e.g., the TechFAR Handbook, agile-friendly contract vehicles) trace to lessons learned from this single program.
  • 05. The popular framing, 'government can't ship software', is wrong both empirically and as a PM lesson. The right framing is 'large multi-vendor programs without a single accountable integrator usually fail to ship'. That's true in the private sector too. The Healthcare.gov story just made it newsworthy.

// timeline

  • Mar 23, 2010: Affordable Care Act signed. Statutory deadline: federal exchange operational October 1, 2013.
  • Sep 2011: CGI Federal awarded the front-end portal contract.
  • Mar 2013: McKinsey assessment warns CMS the system is unlikely to launch ready.
  • Jul 2013: Initial integration tests fail; bug counts rise across vendors.
  • Oct 1, 2013: Launch. Six users complete enrolment on day one. Site crashes within hours.
  • Oct 22, 2013: Surge team (Mikey Dickerson et al.) inserted by HHS.
  • Dec 1, 2013: End-to-end enrolment success rate above 90%.
  • Mar 31, 2014: Open enrolment closes; >7M people enrolled across the federal and state exchanges.
  • Aug 2014: U.S. Digital Service founded by executive order, with surge-team alumni at its core.
  • Feb 2016: OIG report on Healthcare.gov launch released.

// sources

  • HHS Office of Inspector General. Healthcare.gov: Case Study of CMS Management of the Federal Marketplace (OEI-06-14-00350). U.S. Department of Health and Human Services, OIG, 2016.
  • Government Accountability Office. Healthcare.gov: Ineffective Planning and Oversight Practices Underscore the Need for Improved Contract Management (GAO-14-694). U.S. Government Accountability Office, 2014.
  • McKinsey Briefing on the Federally Facilitated Marketplace (released through congressional inquiry). McKinsey & Company, 2013.
  • Healthcare.gov: A Case Study for Effective Software Engineering Management. Mikey Dickerson, talk at SREcon, 2014.
  • United States Digital Service Charter. Executive Office of the President, 2014.
