Software postmortem

Unfortunately, many developers and team leads regard it as a genuine waste of time. Yet it is part of the continuous improvement we should practice as a team. We need to identify concrete findings and write them down without self-censorship.

Avoid excuses. It is important to state the context, the purpose, and the reason this postmortem is being written.

And I haven't seen it happen again since! The full post-mortem can be found here. Microsoft has Fabric Controllers, which are computers that each control around 1,000 servers. A Fabric Controller manages its servers' life cycles, monitors their health, and restarts them when they go down. Isolating the fleet into roughly 1,000-server clusters preserves modularity and keeps a failure localized to one Fabric Controller rather than taking down all of their servers.

Each server in the cluster has something called a Host Agent, which the Fabric Controller uses to do whatever work it deems necessary. To determine when the Host Agents' SSL certificates should expire, the system takes the next day at midnight and adds one year.

So if the Fabric Controller is creating a new certificate for a server on March 19, 2012, it will expire on March 20, 2013. Do you see where this is going? On leap day, these servers attempted to generate a one-year certificate with a valid-from date of February 29, 2012, so the expiry date came out as February 29, 2013 — a date that does not exist. When certificate generation fails on a server, the server terminates. And if it terminates three times in a row, the failure is assumed to be hardware, and the server reports itself to the Fabric Controller as faulty.
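The failure mode is easy to reproduce. Below is a hypothetical Python sketch — not Azure's actual code — of both the naive "add one to the year" computation, which blows up for a leap-day valid-from date, and a safer variant that clamps the day of month to the length of the target month:

```python
import calendar
from datetime import date, timedelta

def naive_expiry(issued: date) -> date:
    """Mimic the buggy scheme: the certificate is valid from the next
    day, and expiry is that valid-from date with the year incremented."""
    valid_from = issued + timedelta(days=1)
    # date.replace raises ValueError when the result is not a real
    # date, e.g. February 29 of a non-leap year.
    return valid_from.replace(year=valid_from.year + 1)

def clamped_expiry(issued: date) -> date:
    """A safer variant: clamp the day to the last valid day of the
    target month, so Feb 29 plus one year becomes Feb 28."""
    valid_from = issued + timedelta(days=1)
    year = valid_from.year + 1
    last_day = calendar.monthrange(year, valid_from.month)[1]
    return valid_from.replace(year=year, day=min(valid_from.day, last_day))

print(naive_expiry(date(2012, 3, 19)))    # 2013-03-20, as expected
try:
    naive_expiry(date(2012, 2, 28))       # valid-from is Feb 29, 2012
except ValueError as exc:
    print("certificate generation fails:", exc)
print(clamped_expiry(date(2012, 2, 28)))  # 2013-02-28
```

The clamping rule is only one possible policy; the point is that "same date next year" is not a total function, and whatever policy you choose must be applied deliberately rather than left to crash.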

The Fabric Controller, in an attempt to "heal" the failed server, hands the work over to another server. One by one, all the servers error out while trying to generate these certificates. This eventually takes down the Fabric Controller along with its servers!

This disaster resulted from faulty date-handling code; there are better ways to handle problems like leap years and time-zone differences. In a separate incident, from Thursday, August 13 to Monday, August 17, 2015, some Google Cloud services returned errors, along with permanent data loss on a tiny fraction of storage. Four successive lightning strikes on the local electrical grid hit the power feeding the computers running those Google Cloud services. Battery backup systems immediately began replacing the lost power.

Alongside manual intervention from Google engineers, service was restored with minimal permanent loss. Google has an ongoing program of upgrading its infrastructure to be less susceptible to issues like this; after the incident, it ran an analysis covering electrical distribution, the software controlling the Cloud services, and the Cloud hardware in use. In another case, Flowdock instant messaging became unavailable for roughly 32 hours between April 21st and 22nd.

Whatever the incident, the result must be a set of action items that will be implemented to fix or improve on the issues raised.

Are you adopting or looking to improve your Agile practices? Is your team remote? We focus on making communication more effective and easier for remote teams. Topics to check: key differences between post-mortems and Agile retrospective sessions, how to get started with Agile retrospectives, and the best tools for useful retrospective sessions. They might sound similar, but be careful not to confuse them: post-mortems are not the same as retrospectives.

Getting started with retrospectives: it all starts with keeping regular sessions with your team. A standard format that might be useful for you is:

Set the stage. Greet each attendee and be mindful of their reactions.

Gather data. The team needs to understand its own view of the last Sprint. This can be done by asking each attendee the following questions: What went well? What did not go well? What could we improve?

Brainstorm ideas. Once you have heard everyone's ideas, several exercises can help to unify and prioritize them for further discussion, so that the team arrives at concrete follow-up actions and next steps.

Pick the ideal solutions. You know what to mitigate or solve, so you have action items to chase in the next sprint, or at least you acknowledge them and plan to address them in another phase of the project.

In that way, anyone at Loon could contribute to a postmortem, see how an incident was handled, and learn about the breadth of challenges that Loon was solving. While everyone agreed that postmortems were an important practice, in a fast-moving start-up culture it was a struggle to comprehensively follow through on action items.

Ideally, we would have prioritized postmortems that focused on best practices and learnings that were applicable to multiple generations of the platform, but those weren't easy to identify at the time of each incident. Even though the company was not especially large, the novelty of Loon's platform and interconnectedness of its operations made determining which team was responsible for writing a postmortem and investigating root causes difficult.

For example, a 20 minute service disruption on the ground might be caused by a loss of connectivity from the balloon to the backhaul network, a pointing error with the antennae on the payload, insufficient battery levels, or wind that temporarily blew the balloon out of range.

Actual causes could be quite nuanced, and often were attributable to interactions between multiple sub-systems. Thus, we had a chicken-and-egg problem: which team should start the postmortem and investigation, and when should they hand off the postmortem to the teams that likely owned the faulty system or process? Not all teams had a culture of postmortems, so the process could stall depending on the system where the root cause originated.

Much of how Loon used postmortems, especially in software development and Prod Team, was in line with SRE industry standards. Sharing the postmortems openly and widely across Loon was critical to building a culture of continuous improvement and addressing root causes.

To increase cross-team awareness of incidents, we instituted a Postmortem Working Group. In addition to reading and discussing recent postmortems from across the company, the goals of the working group were to make it easier to write postmortems, promote the practice of writing postmortems, increase sharing across teams, and discuss the findings of these incidents in order to learn the patterns of failure. The use of postmortems became a standardizing factor across Loon's teams — from avionics and manufacturing, to flight operations, to software platforms and network service.

Many industries have adopted the use of postmortems — they are fairly common in high-risk fields where mistakes can be fatal or extremely expensive. As the original SRE book states, blameless postmortems are key to "an environment where every 'mistake' is seen as an opportunity to strengthen the system."

To facilitate learning, SRE's postmortem format includes both what went well — acknowledging the successes that should be maintained and expanded — and what went poorly and needs to be changed. The Prod Team had three primary goals, among them owning the mission of fielding and providing a reliable commercial service in the real world.

Seeking to complement the anomaly resolution system, the Flight Operations Team incorporated the SRE software team's postmortem format for incidents that needed further investigation — for example, failure to avoid a storm system, deviations from the simulated expected flight path that led to an incident, and flight operator actions that directly or indirectly caused an incident.

Their motto, "Own our Safety", brought a commitment to continually improving safety performance and building a positive safety culture across the company. This was one of the strengths of Loon's culture: all the organizations were aligned not just on our audacious vision to "connect people everywhere," but also on doing so safely and effectively. This probably comes as no surprise to developers in similar environments — when the platform or services that require investment are rapidly changing or being replaced, it's hard to spend resources on not repeating the same mistakes.

The working group's founding goal was to "cultivate a postmortem culture in Loon to encourage thoughtful risk taking, to take advantage of mistakes, and to provide structure to support improvement over time."

Prod Team and several other teams' meetings had a standing agenda item to review postmortems of interest from across the company, and we sent a semi-annual email celebrating Loon's "best-of" recent incidents: the most interesting or educational outages. Once we had a standardized postmortem template in place, we could adopt and reuse it to document commercial service field tests. By recording a timeline and incidents, defining a process and space to determine root causes of problems, recording measurements and metrics, and providing the structure for action item tracking, we brought the benefits of postmortem retrospectives to prospective tasks.
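As an illustration of that kind of structure — the field names below are hypothetical, not Loon's actual template — a standardized postmortem record might capture impact, root causes, a timeline, and tracked action items, with the open items queryable for follow-through:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ActionItem:
    description: str
    owner: str
    done: bool = False

@dataclass
class Postmortem:
    """A minimal sketch of a postmortem template as structured data."""
    title: str
    impact: str
    what_went_well: List[str] = field(default_factory=list)
    what_went_poorly: List[str] = field(default_factory=list)
    root_causes: List[str] = field(default_factory=list)
    timeline: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

    def open_items(self) -> List[ActionItem]:
        """Action items still awaiting follow-through."""
        return [a for a in self.action_items if not a.done]

# Hypothetical example record, loosely based on the disruption
# scenario described above.
pm = Postmortem(
    title="20-minute ground service disruption",
    impact="Loss of connectivity for one service area",
    root_causes=["Loss of balloon-to-backhaul connectivity"],
    action_items=[ActionItem("Add alerting on backhaul link loss", "netops")],
)
print(len(pm.open_items()))  # 1
```

Keeping action items as data rather than prose is what makes the follow-through problem mentioned earlier tractable: open items can be listed, assigned, and reviewed in standing meetings.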

When Loon began commercial trials in countries like Peru and Kenya, we conducted numerous field tests.


