Verizon Terremark Outages Take Down Obamacare Site, But Who's To Blame?
HealthCare.gov took another hit over the weekend as outages knocked the already struggling site offline.
Verizon Terremark, which hosts the database hub for the site, experienced a "failure in a networking component" Sunday, and subsequent attempts to fix the problem through regular maintenance took the system down, according to a press release from the Connecticut exchange, which was hit by the outages along with other sites.
Verizon Terremark, the cloud infrastructure division of Verizon, did not respond to multiple CRN requests for an update on, or explanation of, the situation.
[Related: Obamacare Site Disaster: 10 Steps Solution Providers Would Take To Fix It ]
The HealthCare.gov site is back up and running as of 7 a.m. ET Monday, but the question remains: What caused the outages? Even without access to the back-end systems, solution providers told CRN what they believe went wrong, based on Verizon's explanation, and who should be blamed for falling short.
"There's certainly no details out yet, but it looks inexcusable," said Robin Purohit, CEO of Clustrix.
Network problems are "inevitable," Purohit said, but it is something that every e-commerce site already deals with, which means the HealthCare.gov site should have been prepared.
"It's bizarre to me that a network component can fail and take the entire system offline," Purohit said.
Verizon's explanation that a networking component caused the outage points to a network architecture problem, said Andrew Pryfogle, senior vice president and general manager of cloud services and complex bids at Intelisys. Redundancies should have been built in, he said, so that no single malfunction could take the entire system down.
"It could be either an ingress or egress to their network that had an issue. But, ... why would a single network failed component cause this outage? Why was there not diversity built in if that's the case?" Pryfogle said. "It could have been a database failure, a storage failure, or a router or a switch component that failed anywhere on the network, but if any single failure caused that outage, that brings you right back to the big question of why wasn't that designed around on the front end."
Redundancy is a best practice, Purohit said, and the fact that the outage happened on Verizon's watch suggests someone along the way told the company not to put the redundancies in place.
"It's hard to tell whether Verizon is at fault here or whether they were asked to do the wrong thing," Purohit said.
Purohit suggested the culprit was the Centers for Medicare & Medicaid Services (CMS), the government agency in charge of creating the site. He said he wouldn't chalk the issue up to budgetary problems, as the agency has spent hundreds of millions of dollars on the site, but rather to the passivity and lack of accountability in such large contracted projects.
More Than Just Bad Architecture?
Jamie Shepard, regional vice president at Lumenate, agreed with Intelisys' Pryfogle and Clustrix's Purohit that Verizon should have put redundancies in place: removing one piece should not be able to bring the entire system down. The outage, however, also points to a bigger problem, he said.
"I'm assuming they had architecture built in a program. You have to. It's the government," Shepard said.
Government audits make an architecture problem unlikely, Shepard said, because that is one of the major areas that would have been checked. It was far more likely, he said, that the problem stemmed from a process change rather than a technical one. Shepard compared the situation to electricity in a house, which a generator can keep running during a power outage.
"The problem was the generator didn't kick in," Shepard said. "It was architected right, but the process is broken."
What Verizon failed to do, Shepard said, was put in place processes such as risk assessment and mitigation that would have established plans of action in case of network problems. Verizon, he said, should have gone through the plan for the site and created contingency plans for possible problematic scenarios at each turn.
"You can't tell me that the tech is always at fault here; ... it's the risk mitigation. You have to openly state that if we miss a step here, here's the risk. I can guarantee that didn't happen here," Shepard said. He said he speculates someone didn't look at the process enough and looked only at the surface-layer architecture.
Regardless of whose fault it is and how easy the problem is to fix, the reputational damage to Verizon is done, the solution providers told CRN.
"What I don't know are the details around what was dictated to them, the requirements that were dictated to them, what they did around redundancy, data base replication -- I don't know those details. But, I can say that, although [Verizon] Terremark has a strong reputation, even Terremark is vulnerable to outages when the root cause is basically a poorly architected solution," Pryfogle said.
PUBLISHED OCT. 28, 2013