medical · 1982–1987 · Atomic Energy of Canada Limited (AECL)
Therac-25 radiation therapy accidents
How removing physical interlocks on the strength of software validation arguments produced six massive radiation overdoses, three of them fatal.
12 min read · 5 sources cited
// background
The Therac-25 was a medical linear accelerator built by Atomic Energy of Canada Limited (AECL); its predecessors had been joint products of AECL and the French firm Compagnie Générale de Radiologie (CGR), but the Therac-25 was developed by AECL alone after that partnership ended. It delivered radiation therapy to cancer patients in two modes: a low-energy electron mode, used for surface and shallow tumours, and a high-energy X-ray mode (25 MeV), in which a tungsten target is positioned in the beam path so the electrons produce X-rays. A turntable positions the mode-specific equipment (the target, a flattening filter, and beam-spreading magnets), and its correct position for each mode is essential: firing the high-energy electron beam without the target in place delivers a radiation dose roughly 100 times the prescribed therapeutic level directly to the patient.
The Therac-25 was the third in a product family. Its predecessors, the Therac-6 and Therac-20, used hardware interlocks: physical mechanical linkages that made it impossible to fire the electron beam at high energy unless the target was correctly positioned. The Therac-25 was re-architected around a single PDP-11 computer running custom real-time software, and the hardware interlocks were removed in favour of software checks. AECL's certification argument to regulators rested in part on the predecessors' clean field record — a record that had, unbeknown to AECL, masked at least one software bug in the Therac-20 because the hardware interlocks had caught it.
Between June 1985 and January 1987, six patients received massive radiation overdoses from Therac-25 machines at four clinical sites in the United States and Canada. Three of them died of the resulting radiation injuries; the other three suffered severe permanent injuries. The accidents were caused by two distinct software defects: a race condition in the operator-interface state machine, and an arithmetic overflow in a one-byte counter used during machine setup. Either defect could leave the machine positioned to fire the high-energy electron beam without the target in place.
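The second defect is mechanically simple and worth sketching. Below is a minimal illustration in Python (the real code was PDP-11 assembly; the variable name class3 comes from Leveson and Turner's account, everything else is invented): a one-byte flag is incremented on every pass of the setup loop, nonzero means "run the position check", and on every 256th pass the byte wraps to zero and the check silently disappears.

```python
# Minimal sketch of the setup-counter overflow (illustrative Python;
# the Therac-25's real code was PDP-11 assembly). "class3" is the
# variable name reported by Leveson & Turner; the rest is invented.

def setup_loop(passes: int, collimator_ok: bool) -> bool:
    """Return True if the beam would be permitted to fire."""
    class3 = 0  # one-byte shared flag: nonzero = "run the position check"
    for _ in range(passes):
        # The bug: the flag is incremented rather than set to a constant,
        # so every 256th pass the byte wraps to zero and the check below
        # is silently skipped.
        class3 = (class3 + 1) % 256
        if class3 != 0 and not collimator_ok:
            continue  # inconsistency detected; hold the beam and retry
        return True   # check passed -- or bypassed by the wraparound
    return False

# 255 passes with a mispositioned collimator: the check holds the beam.
print(setup_loop(255, collimator_ok=False))  # False
# On the 256th pass the counter wraps and the beam is enabled anyway.
print(setup_loop(256, collimator_ok=False))  # True
```

In the real machine the unsafe window was one setup pass in 256, and only if the operator pressed the set button at that instant: rare enough to survive testing, certain to occur eventually in the field.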
AECL's response across the first several accidents was inadequate. The company's initial position — communicated to operators and regulators — was that the machine could not have caused the injuries described. The fault codes the operators saw on their terminals were undocumented; the operator manual instructed users to override 'Malfunction' codes by pressing a single key, which simply re-armed the beam. The U.S. Food and Drug Administration eventually issued a notice of intent to remove Therac-25s from service in February 1987, more than 18 months after the first accident.
The case was investigated and documented in detail by Nancy Leveson and Clark Turner; their 1993 IEEE Computer paper is the canonical account and is reprinted in Leveson's 1995 book *Safeware*. The FDA regulatory record is also public. The case is the foundational reference in software-safety engineering.
// the decisions
1. Removing hardware interlocks in favour of software checks
Late 1970s to early 1980s. AECL is designing the Therac-25 as a successor to the Therac-20. Computing has become cheap enough to put a dedicated PDP-11 in every machine. Hardware interlocks, physical linkages that make incompatible mode selections mechanically impossible, add cost, weight, and maintenance complexity. The team's argument is that software can perform the same checks more flexibly, with full audit logging, at lower cost. The Therac-20's clinical record is excellent; the team treats this as evidence that the underlying control logic (much of which the Therac-25 inherits) is correct.
options on the table
- A. Retain hardware interlocks alongside software checks (defence in depth, higher cost).
- B. Replace hardware interlocks with software checks, supported by additional software-validation testing and explicit hazard analysis.
- C. Replace hardware interlocks with software checks, justified by the predecessor's clinical safety record.
what they actually did
Option C. The Therac-25 shipped with software interlocks only. Hazard analysis of the integrated hardware-software system, in the form practiced today, was not performed; AECL's regulatory submission did not include a fault-tree or system-safety analysis covering the software's failure modes. Leveson and Turner specifically note that AECL appears to have assumed software-driven systems do not exhibit the random-failure modes that hardware fault analysis is designed to surface, and so treated software safety as a software-quality problem rather than a system-safety problem.
consequence
When the software defects were eventually triggered in the field — by an experienced operator typing quickly enough to interleave commands in a way the state machine didn't cover — there was no physical defence. The 'evidence' that the predecessor's control logic was safe turned out to be evidence that the predecessor's hardware interlocks had been catching bugs all along, including a software bug in the Therac-20 that was only diagnosed after the Therac-25 accidents.
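The failure shape is easy to reproduce in miniature. Here is a hedged sketch in Python (task structure and names are invented for illustration; the actual Therac-25 tasks were different): a setup task configures the machine from a snapshot of shared state over a long window (the real magnet-setting took roughly eight seconds), and an operator edit that lands inside that window is silently lost, so the display and the machine configuration diverge.

```python
# Deterministic miniature of the race-condition class of defect
# (illustrative Python; not AECL's task structure or names).

import threading
import time

prescription = {"mode": "xray", "energy_mev": 25}  # shared, unsynchronized

def operator_edit():
    time.sleep(0.1)  # a fast operator, editing mid-setup
    prescription["mode"] = "electron"  # the edit the setup task never sees

def setup_task(snapshot: dict) -> dict:
    time.sleep(0.5)  # stands in for the ~8 s magnet-setting window
    return snapshot  # machine configured from the stale snapshot

snapshot = dict(prescription)  # setup starts from a copy of shared state
editor = threading.Thread(target=operator_edit)
editor.start()
configured = setup_task(snapshot)
editor.join()

print("screen shows:", prescription["mode"])  # electron
print("machine set to:", configured["mode"])  # xray -- silent divergence
```

The fix is not "type slower": the setup task has to lock out edits for the window or restart when shared state changes, and at the system level a physical interlock has to sit behind whichever choice the software makes.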
lesson
Interlock removal under software-validation arguments is the canonical case where the validating evidence didn't actually cover the failure mode. The Therac-20's clean record was strong evidence its hardware interlocks worked; it was not evidence its software was correct. Mistaking outcome evidence for component evidence is one of the deepest traps in safety reasoning. PMs evaluating 'we can simplify because the previous version was reliable' should ask explicitly which subsystem the evidence speaks to.
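For contrast, here is a minimal sketch of what option A's defence in depth looks like when written out as control logic (hypothetical; the Therac-20's interlocks were electromechanical circuits, not code). The fire decision requires the software's model of the machine and an independent hardware reading to agree, so a defect in the software check alone cannot enable the unsafe state.

```python
# Hypothetical defence-in-depth gate (illustrative Python). The hardware
# interlock here models a physical circuit that opens whenever high energy
# is selected while the target position switch reads open.

from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    ELECTRON = "electron"  # beam fired directly at the patient
    XRAY = "xray"          # 25 MeV beam; tungsten target must be in the path

@dataclass
class MachineState:
    commanded_mode: Mode        # what the software believes was selected
    energy_is_high: bool        # actual state of the energy selector
    target_switch_closed: bool  # physical switch: target in beam path

def software_check(s: MachineState) -> bool:
    # The software-only posture: electron mode is assumed safe as commanded.
    return s.commanded_mode is Mode.ELECTRON or s.target_switch_closed

def hardware_interlock(s: MachineState) -> bool:
    # The interlock posture: veto high energy with no target in the path,
    # regardless of anything the software believes.
    return not (s.energy_is_high and not s.target_switch_closed)

def beam_permitted(s: MachineState) -> bool:
    return software_check(s) and hardware_interlock(s)

# The accident shape: software believes electron mode (the edit was
# accepted), but the energy is high and the target is out of the path.
s = MachineState(Mode.ELECTRON, energy_is_high=True, target_switch_closed=False)
print(software_check(s))   # True  -- the software check alone would fire
print(beam_permitted(s))   # False -- the hardware interlock vetoes it
```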
2. How the operator interface handled error conditions
The Therac-25's operator console displayed cryptic 'Malfunction' codes (Malfunction 1 through 64) when the control software detected an inconsistency. The codes were not documented in the operator manual. The standard operator response, codified in the manual itself, was to press 'P' to proceed. Doing so cleared the malfunction state and re-armed the machine. Many of the accidents involved the machine displaying a malfunction immediately after a high-energy electron firing in an unsafe configuration, the operator pressing 'P', and the machine firing again — sometimes multiple times — into the same patient.
options on the table
- A. Treat malfunction codes as unrecoverable: require service intervention before the machine could fire again.
- B. Document the codes, classify them by severity, and require explicit acknowledgement of severe codes before proceeding.
- C. Allow operator override on all codes via a single keypress, on the theory that frequent malfunction codes (the machine produced them often, mostly for benign reasons) made stricter handling impractical for clinical workflow.
what they actually did
Option C. The malfunction-code handling reflected a workflow consideration (the machine generated frequent benign malfunctions, and clinicians needed to keep treatment moving) that was not balanced against any classification of the severity of the underlying conditions. Leveson and Turner note that 'the manual gave no indication that these malfunctions could indicate situations in which the patient had received an unsafe dose.'
consequence
At the Tyler, Texas accident in March 1986 (the first definitively linked to the software), the machine displayed 'Malfunction 54' (undocumented, with no severity indicator) after the first overdose. The operator pressed 'P' to retry. The machine fired again. The patient received a second overdose. This pattern recurred at multiple sites. The Tyler patient died of radiation injuries five months later.
lesson
Error reporting that gives the operator no information beyond a code, and a workflow that defaults the operator to 'continue', is a system that has decided in advance which side of the false-positive / false-negative tradeoff to favour. PMs designing operator interfaces for safety-critical systems should classify error conditions before specifying the operator's available actions. A malfunction code with no severity attached is not an error report; it is a request for confirmation, and the request should not be granted by default.
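A minimal sketch of the shape option B implies (hypothetical throughout; the real Malfunction codes were never documented or classified, which is the point): each code carries a severity, and the operator's available actions are derived from the severity rather than from a uniform single-key override.

```python
# Hypothetical severity-classified malfunction handling (illustrative
# Python). The codes and severity assignments are invented; the real
# Therac-25 offered the same single-key 'P' override for every code.

from enum import Enum

class Severity(Enum):
    BENIGN = 1    # e.g. transient sensor noise: operator may proceed
    SERIOUS = 2   # dose-relevant inconsistency: needs physicist sign-off
    CRITICAL = 3  # possible unsafe exposure: console cannot re-arm the beam

MALFUNCTION_SEVERITY = {12: Severity.BENIGN, 54: Severity.CRITICAL}

def allowed_operator_actions(code: int) -> list[str]:
    # Unclassified codes default to the worst case, not the most convenient.
    sev = MALFUNCTION_SEVERITY.get(code, Severity.CRITICAL)
    if sev is Severity.BENIGN:
        return ["proceed", "abort"]
    if sev is Severity.SERIOUS:
        return ["abort", "escalate_to_physicist"]
    return ["abort"]  # CRITICAL: no console path re-arms the beam

print(allowed_operator_actions(54))  # ['abort'] -- no single-key override
print(allowed_operator_actions(99))  # ['abort'] -- unknown defaults safe
```

The design choice worth noting is the default: an unclassified condition maps to the most restrictive action set, so workflow pressure has to be answered by classifying codes, not by weakening the override.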
3. AECL's response after the first accidents
June 1985 to early 1986. After the first Therac-25 accident at the Kennestone Regional Oncology Center (Marietta, Georgia), the patient sued, and AECL was notified. AECL's investigation concluded the machine could not have caused the described injury. After the second accident at Hamilton (Ontario) in July 1985, AECL again concluded the machine had not delivered an overdose; this conclusion was based on the machine's operating logs (which did not capture the relevant state) and on AECL's own analysis without independent review. AECL did not notify other Therac-25 sites of the accidents at this stage, and the company's communication with the FDA was limited to the affected machines.
options on the table
- A. Notify all Therac-25 sites of the reported accidents, suspend operation pending root-cause analysis, and engage independent experts.
- B. Notify the regulator (FDA in the U.S., the equivalent in Canada) and let regulators decide on fleet-wide notification.
- C. Investigate internally and respond to the affected sites only; treat the reports as isolated until proven otherwise.
what they actually did
Option C through most of 1985 and into 1986. After the third accident (Yakima, Washington, December 1985), AECL again concluded the machine was operating correctly. Only after the Tyler accident in March 1986, and a second Tyler accident in April 1986, after which the site's physicist and operator reproduced Malfunction 54 by re-entering the treatment data at speed, did AECL identify the race-condition defect. Even then, the company's interim corrective notice instructed sites to disable the edit (up-arrow) key, a measure that addressed only the first of the two software defects; a further accident at Yakima in January 1987 was traced to the second, separate defect.
consequence
The FDA issued a notice of intent in February 1987 to declare the Therac-25 defective and remove it from service. AECL produced a corrective action plan; deployment of the corrected software was completed in 1987. By then six patients had been overdosed and three had died. Leveson and Turner's subsequent investigation, drawing on the FDA record, lawsuit documents, and AECL correspondence, is the basis of the public account.
lesson
Incident response is a structural problem before it is a technical one. AECL's pattern — investigate internally, conclude the machine is fine, don't notify the fleet — is recognisable in many later cases (the Boeing 737 MAX case is structurally similar). The PM lesson is that the reporting and notification protocol for safety-relevant field reports has to be set before any incident, has to specify thresholds at which the fleet is notified independent of the manufacturer's preliminary investigation, and has to be enforceable by an authority outside the program. Manufacturers investigating their own field accidents in the absence of independent oversight will, on average, take too long to identify the defect.
// what to take away
- 01 The Therac-25 case is not principally a story about coding bugs. The race condition and the counter overflow were real, but Leveson and Turner emphasise repeatedly that the system-level decisions (removing hardware interlocks, treating predecessor reliability as software evidence, ignoring the human-factors design of the operator interface, and a slow incident-response posture) are what allowed the bugs to reach patients. 'Software safety is a system property, not a software property' is the case's central lesson.
- 02 Hardware interlocks were defence in depth against precisely the failure modes that occurred. Their removal was justified on the basis of the predecessor's clinical record, but that record was a record of the hardware interlocks doing their job, not a record of software correctness.
- 03 Operator interfaces in safety-critical systems are part of the safety case. Undocumented malfunction codes with a single 'press P to continue' override default the operator's behaviour to 'proceed' on every condition, including conditions that should never proceed. The interface design was a safety decision and was not analysed as such.
- 04 Slow, internal-only incident response after the first accidents enabled the subsequent accidents. The pattern (investigate internally, conclude the field report is wrong, don't notify other sites) is recognisable in later medical-device, automotive, and aerospace cases. The structural fix is independent oversight with mandatory notification thresholds.
- 05 The case is the foundational reference for software-safety engineering. Modern safety-critical software standards (DO-178C in aerospace, IEC 62304 for medical software, ISO 26262 in automotive) all trace lineage to the lessons formalised by Leveson and others in the wake of Therac-25. PMs working in safety-relevant domains who don't know this case are working without the field's foundational vocabulary.
// timeline
- 1976: Therac-6 enters clinical use, with hardware interlocks.
- Early 1980s: Therac-25 designed; software replaces hardware interlocks. First clinical installations begin.
- Jun 3, 1985: Kennestone Regional Oncology Center (Marietta, GA): first overdose. Patient sues; AECL concludes the machine is operating correctly.
- Jul 26, 1985: Ontario Cancer Foundation (Hamilton, Ontario): second overdose. Patient dies of cancer in November 1985.
- Dec 1985: Yakima Valley Memorial Hospital (Yakima, WA): third overdose.
- Mar 21, 1986: East Texas Cancer Center (Tyler, TX): fourth overdose. Operator presses 'P' on Malfunction 54; second exposure delivered. Patient dies five months later.
- Apr 11, 1986: Second Tyler accident; fifth overdose. The site's physicist and operator reproduce Malfunction 54; AECL identifies the race-condition defect. Patient dies three weeks later.
- Jan 1987: Second Yakima accident; sixth overdose, traced to a second, separate software defect (counter overflow). Patient dies April 1987.
- Feb 1987: FDA issues notice of intent to declare the Therac-25 defective.
- 1987: Final corrective action plan completed; remaining machines updated.
- Jul 1993: Leveson & Turner, 'An Investigation of the Therac-25 Accidents,' published in IEEE Computer.
- 1995: Leveson, *Safeware: System Safety and Computers*, the definitive book-length account.
// sources
- An Investigation of the Therac-25 Accidents — Nancy G. Leveson and Clark S. Turner, IEEE Computer (vol. 26, no. 7), 1993
- Safeware: System Safety and Computers — Nancy G. Leveson (Addison-Wesley), 1995
- FDA Letter to AECL, Notice of Intent to Declare Therac-25 Defective — U.S. Food and Drug Administration, Center for Devices and Radiological Health, 1987
- Medical Device Reporting (MDR) records — Therac-25 — U.S. Food and Drug Administration, 1987
- Engineering a Safer World: Systems Thinking Applied to Safety — Nancy G. Leveson (MIT Press), 2011
Practice this kind of decision
The simulator runs scenarios that exercise these same lessons under time pressure. Pick a chapter that exercises quality + risk.