
Why AWS S3 Went Down: A Programmer Mistyped One Letter, Deleted the Wrong Servers, and Caused a Four-Hour Outage!

MarvinZhang / 885 reads

Abstract: On Thursday, AWS said that a mistyped command caused the hours-long Amazon Web Services outage. At 9:37 AM Pacific Standard Time, an authorized team member, using an established playbook, executed a command intended to remove a small number of servers from one of the subsystems used by the billing process.

AWS has explained how the S3 storage service in its sprawling US-EAST-1 geographic region came to be disrupted, and what it is doing to keep it from happening again.

On Thursday, AWS said that a mistyped command caused the hours-long Amazon Web Services (AWS) outage that knocked well-known websites offline on Tuesday and caused problems for several others.

The cloud infrastructure provider offered the following explanation:

The Amazon Simple Storage Service (S3) team was debugging an issue that was causing the S3 billing system to progress more slowly than expected. At 9:37 AM Pacific Standard Time (PST), an authorized S3 team member, using an established playbook, executed a command intended to remove a small number of servers from one of the S3 subsystems used by the S3 billing process. Unfortunately, one character of the command was entered incorrectly, and a much larger set of servers was removed than intended.

The mistake inadvertently took down two subsystems that are critical to every S3 object in US-EAST-1, a sprawling data center region that also happens to be Amazon's oldest. Both systems required a full restart, and Amazon noted that this process, along with running the necessary safety checks, "took longer than expected."

While they were restarting, S3 was unable to service requests. Other AWS services in the region that rely on S3 for storage were also affected, including the S3 console, new Amazon Elastic Compute Cloud (EC2) instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from an S3 snapshot), and AWS Lambda.

Amazon noted that the index subsystem was fully recovered by 1:18 PM, and the placement subsystem returned to normal operation at 1:54 PM. At that point, S3 was operating normally.

AWS said that, as a result of this event, it is "making several changes," including safeguards to prevent an incorrect input from triggering this kind of problem in the future.

The official post explains: "While removal of capacity is a key operational practice, in this instance the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."
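
The two safeguards AWS describes, slower removal and a minimum-capacity floor, are easy to picture in code. The sketch below is a hypothetical illustration of that pattern, not AWS's actual tooling; the Subsystem class, the remove_capacity function, and the batch size and delay values are all assumed names and numbers.

```python
import time
from dataclasses import dataclass


@dataclass
class Subsystem:
    """Hypothetical model of a subsystem's server fleet (not AWS's real tooling)."""
    name: str
    servers: list
    min_required: int  # capacity floor the subsystem must never drop below


def remove_capacity(subsystem, hosts_to_remove, batch_size=2, delay_s=30.0):
    """Remove servers slowly, refusing any request that would breach the capacity floor."""
    remaining = len(subsystem.servers) - len(hosts_to_remove)
    if remaining < subsystem.min_required:
        # Safeguard: an oversized (e.g. mistyped) request fails before any server is touched.
        raise ValueError(
            f"refusing removal: {subsystem.name} would drop to {remaining} servers, "
            f"below its minimum of {subsystem.min_required}"
        )
    for i in range(0, len(hosts_to_remove), batch_size):
        for host in hosts_to_remove[i:i + batch_size]:
            subsystem.servers.remove(host)   # drain one small batch...
        time.sleep(delay_s)                  # ...then pause so operators can abort


if __name__ == "__main__":
    index = Subsystem("index", [f"host-{i}" for i in range(10)], min_required=6)
    remove_capacity(index, ["host-0", "host-1"], batch_size=1, delay_s=0.1)  # allowed
    # remove_capacity(index, index.servers[:8])  # would raise: breaches the capacity floor
```

The important property is that an oversized request, however it was produced, fails before any server is touched, and even a valid removal proceeds slowly enough for an operator to intervene.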

Other notable steps AWS has already taken: it has begun work on partitioning parts of the index subsystem into smaller cells. The company has also changed the administration console for the AWS Service Health Dashboard (SHD) so that the dashboard can run across multiple AWS regions. Ironically, Tuesday's typo knocked out the dashboard itself, leaving AWS to fall back on Twitter to keep customers informed of its progress.
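
Partitioning a service into cells, as described above, bounds how much of the keyspace a single failure can take with it. The sketch below is only an assumed illustration of the general pattern, not the S3 index subsystem: object keys are hashed to one of a fixed number of independent cells, so a failure in one cell affects roughly 1/N of requests while the rest keep serving.

```python
import hashlib

NUM_CELLS = 8  # illustrative cell count; more cells means a smaller blast radius


def cell_for_key(key: str, num_cells: int = NUM_CELLS) -> int:
    """Map an object key to a cell with a stable hash so routing is deterministic."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


class CellRouter:
    """Route metadata lookups to per-cell stores; a failed cell only fails its own keys."""

    def __init__(self, num_cells: int = NUM_CELLS):
        self.stores = {i: {} for i in range(num_cells)}   # stand-in metadata stores
        self.healthy = {i: True for i in range(num_cells)}

    def put(self, key: str, metadata: dict) -> None:
        cell = cell_for_key(key, len(self.stores))
        if not self.healthy[cell]:
            raise RuntimeError(f"cell {cell} unavailable")
        self.stores[cell][key] = metadata

    def get(self, key: str) -> dict:
        cell = cell_for_key(key, len(self.stores))
        if not self.healthy[cell]:
            raise RuntimeError(f"cell {cell} unavailable")
        return self.stores[cell][key]


if __name__ == "__main__":
    router = CellRouter()
    router.put("photos/cat.jpg", {"size": 1024})
    failed = (cell_for_key("photos/cat.jpg") + 1) % NUM_CELLS
    router.healthy[failed] = False            # simulate one cell going down
    print(router.get("photos/cat.jpg"))       # keys in the surviving cells keep working
```

Isolation is only half the benefit: each cell can also be restarted and validated on its own, which is what shortens recovery for the largest subsystems.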

Below is AWS's full summary of the Amazon S3 service disruption in the Northern Virginia (US-EAST-1) region:

 


Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region

We’d like to give you some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28th. The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.  One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.  

S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact. We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes. While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected. The index subsystem was the first of the two affected subsystems that needed to be restarted. By 12:26PM PST, the index subsystem had activated enough capacity to begin servicing S3 GET, LIST, and DELETE requests. By 1:18PM PST, the index subsystem was fully recovered and GET, LIST, and DELETE APIs were functioning normally.  The S3 PUT API also required the placement subsystem. The placement subsystem began recovery when the index subsystem was functional and finished recovery at 1:54PM PST. At this point, S3 was operating normally. Other AWS services that were impacted by this event began recovering. Some of these services had accumulated a backlog of work during the S3 disruption and required additional time to fully recover.

We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks. We will also make changes to improve the recovery time of key S3 subsystems. We employ multiple techniques to allow our services to recover from any failure quickly. One of the most important involves breaking services into small partitions which we call cells. By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem. As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.

From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3. Instead, we used the AWS Twitter feed (@AWSCloud) and SHD banner text to communicate status until we were able to update the individual services’ status on the SHD.  We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions.

Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.
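
The Service Health Dashboard change described above amounts to removing a single-region dependency from the status page itself. As a rough illustration of that pattern (not the actual SHD implementation), the sketch below publishes a status document to buckets in several regions using boto3, the AWS SDK for Python; the bucket names are hypothetical.

```python
import json

import boto3  # AWS SDK for Python

# Hypothetical bucket names -- one per region, all serving the same status document.
STATUS_BUCKETS = {
    "us-east-1": "example-status-us-east-1",
    "us-west-2": "example-status-us-west-2",
    "eu-west-1": "example-status-eu-west-1",
}


def publish_status(status: dict) -> list:
    """Write the status document to every region, tolerating individual failures."""
    body = json.dumps(status).encode("utf-8")
    updated = []
    for region, bucket in STATUS_BUCKETS.items():
        try:
            s3 = boto3.client("s3", region_name=region)
            s3.put_object(Bucket=bucket, Key="status.json", Body=body,
                          ContentType="application/json")
            updated.append(region)
        except Exception as exc:  # one region being down must not block the others
            print(f"could not update status in {region}: {exc}")
    return updated


if __name__ == "__main__":
    publish_status({"s3-us-east-1": "increased error rates",
                    "updated": "2017-02-28T17:37:00Z"})
```

Serving the page from whichever regional copy is reachable is one way to remove the circular dependency that forced AWS onto Twitter during this event.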
