Microsoft Apologizes For Another Azure, Teams Outage
The company blamed the outage on DNS issues that prevented access to a number of cloud services, and it was the second major Microsoft outage in three weeks.
Microsoft reports that DNS issues led to an outage that affected cloud services including Azure, Teams and Dynamics 365 on Thursday evening.
The peak of the issues occurred between about 5:30 p.m. and 6:30 p.m. Eastern Time on Thursday, and the problem was fully mitigated as of about 10:30 p.m., Microsoft said.
“We apologize for the impact caused by this outage,” Microsoft said on its Azure status history page. “We are continuing to investigate to establish the root cause and additional preliminary details will be published in the next 24 hours.”
It was the second major outage for Azure and other key Microsoft services -- such as the widely used Teams collaboration app -- in the past three weeks. On March 15, a worldwide outage impacted Microsoft services as the result of “authentication errors” across multiple cloud services, the company said at the time.
On Thursday, Microsoft said the problems stemmed from an unexpected increase in DNS (Domain Name System) traffic. The Domain Name System provides the directory that’s used to match domain names with their associated IP addresses.
“We are continuing to investigate the underlying cause for the DNS outage but we have observed that Microsoft DNS servers saw a spike in DNS traffic,” Microsoft said.
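For readers less familiar with DNS, the name-to-address lookup described above can be sketched in a few lines of Python. This is only an illustration of what a client does when it resolves a name, using the standard library's `socket.getaddrinfo`; the helper name `resolve` is hypothetical and has nothing to do with Microsoft's own DNS infrastructure.

```python
import socket
from typing import List


def resolve(hostname: str) -> List[str]:
    """Return the sorted, de-duplicated IP addresses that hostname maps to.

    Raises socket.gaierror when the name cannot be resolved, which is
    roughly what clients run into when DNS servers are unreachable.
    """
    # Restrict to TCP so each address is reported once rather than once
    # per socket type.
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); the IP
    # address is the first element of sockaddr.
    return sorted({info[4][0] for info in infos})
```

When name resolution fails, applications generally cannot reach a service even if the service itself is healthy, which is consistent with the "intermittent issues" reported across otherwise unrelated products.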
During the outage, some users experienced “intermittent issues” with accessing Azure, Dynamics, Xbox Live and other cloud services, Microsoft said.
“We’re investigating an issue in which users may be unable to access Microsoft 365 services and features. We’ll provide additional information as soon as possible,” the Microsoft 365 Status account (@MSFT365Status) tweeted on April 1, 2021.
In order to mitigate the impact of the issues, Microsoft “engaged resilient DNS capabilities to absorb the spike in DNS traffic,” the company said.
CRN has reached out to Microsoft seeking further details.
Ryan Loughran, reactive service manager at Valiant Technology, a New York-based MSP, said his team was holding its daily standup meeting over the Teams app on Thursday when the issues occurred.
“The meeting started normally and as it was going on I was logging into a client’s O365 Admin Portal to check something out. It let me log in, but then going to the actual admin console kept failing. I tried a different browser and got the same thing,” Loughran said. “Then, everyone on Teams froze for me. I tried to re-join the meeting, but couldn’t. Then, I got the ‘we ran into some issues’ banner in Teams and I figured everything was down.”
Loughran said that he sends out a message to clients any time there is an outage for a platform that Valiant manages.
“Thankfully, email was still working -- so at around 5:40 p.m. EST I sent one out to our clients. I would say the majority of our clients are on O365 and use Teams heavily,” he said. “I got quite a few replies back stating that they were experiencing the issue and appreciated the communication.”
Fortunately, the outage occurred at the end of the business day for East Coast clients -- though several clients with operations on the West Coast lost communication and had to revert to texting or using Slack, Loughran said.
Ultimately, “it’s disappointing that there has been a string of outages lately, much more than usual,” he said. “I really hope Microsoft steps up its communication efforts though. I had to rely on open source intel rather than an update on the O365 status page -- which was down -- or one of their Twitter accounts.”
At Atlanta-based solution provider ProArch, clients and services “experienced intermittent disruption from this DNS outage that was geo-specific” -- with different geographies seeing varying issues, said Ben Wilcox, ProArch’s senior vice president of solution architecture.
One takeaway is that it’s important to monitor cloud services using a third-party monitoring service, Wilcox said.
“This will help with SLA adherence and credits requests for missing SLAs,” he said. “These services typically can provide faster outage notifications than Microsoft. And you can use these services too for building notification plans for outages of the cloud, based on your business impact analysis or disaster recovery plan.”
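As a rough illustration of the kind of independent monitoring Wilcox describes, the sketch below probes a host and signals an alert after several consecutive failures. It is a minimal example under stated assumptions, not a description of any particular monitoring product: the names `tcp_probe` and `OutageMonitor`, the port, the timeout and the failure threshold are all hypothetical choices.

```python
import socket
from dataclasses import dataclass, field
from typing import Callable, Dict


def tcp_probe(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if the name resolves and a TCP connection succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures (socket.gaierror), timeouts and refusals.
        return False


@dataclass
class OutageMonitor:
    """Tracks consecutive probe failures per host (hypothetical helper)."""

    probe: Callable[[str], bool] = tcp_probe
    threshold: int = 3  # assumed: alert after 3 failed checks in a row
    failures: Dict[str, int] = field(default_factory=dict)

    def check(self, host: str) -> bool:
        """Probe once; return True exactly when an alert should fire."""
        if self.probe(host):
            self.failures[host] = 0
            return False
        self.failures[host] = self.failures.get(host, 0) + 1
        # Fire once, at the moment the threshold is reached, so a long
        # outage does not produce a notification on every check.
        return self.failures[host] == self.threshold
```

In practice, a scheduler would call `check` at a fixed interval for each endpoint the business depends on, and a `True` result would trigger whatever notification plan the disaster recovery plan calls for. Requiring several consecutive failures is one simple way to avoid alerting on a single transient error.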
The issues on Thursday followed Microsoft’s March 15 outage that had impacted “any service” that uses Azure Active Directory, Microsoft’s widely used identity authentication solution.
During that outage, Microsoft indicated that the biggest impact was on Teams -- an essential tool for countless businesses with its chat, audio and video calling, and document-sharing functionality.
Thursday’s outage marks at least the fourth time that major issues have affected Teams since the beginning of February.
Teams -- which is part of Microsoft’s Office 365 suite of productivity apps -- had reached 115 million daily active users worldwide as of late October, Microsoft disclosed at the time. That was a more than 5X increase from roughly a year earlier, when Teams had 20 million daily active users, according to Microsoft.
An IT director, who was impacted by the outage Thursday and did not want to be identified, said he believes Microsoft is struggling to keep pace with the rapid growth of Teams.
“It looks to me like a breakdown in their software code change control procedures as a result of rapid Teams usage and growth,” said the IT director. “The DNS issue seems like another example of that.”
The IT director said the recent uptick in Azure outages is troubling. “They are going to have to figure this out real quick,” he said. “I know they have a lot on their product road map with Teams, but I would rather have them slow things down and get it right rather than put out updates that could be causing issues.”