Postmortem of the failure of one hosting storage unit at LU-BI1 on January 8, 2020

On January 8, at 16: UTC, one of the storage units used for our hosting services at LU-BI1 went down.

We managed to restore the data and bring services back online the morning of January .

The incident impacted, at most, 728 customers hosted at our Luxembourg facility.

First we’ll explain the situation with a bit more context.
Next, we’ll give you the technical timeline.
And third, we’ll provide the full post mortem.

1. The situation and our reflections on it

Gandi provides IaaS and PaaS services.

To store your data, we use a file system called ZFS, running on either Nexenta or FreeBSD as the operating system. Nexenta is used on the old version of our storage system, and we will be scheduling its migration.

To protect your data against hard drive failures, we use ZFS with triple mirroring: every block is stored on three disks, so the pool can lose up to two thirds of its disks, provided that each mirror keeps at least one healthy disk.
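As a small illustration of what triple mirroring tolerates: a pool built from three-way mirrors stays available as long as every mirror keeps at least one healthy disk. The sketch below is hypothetical (the function name and the input format are assumptions, not Gandi tooling). It takes the number of failed disks in each mirror and reports whether the pool would survive.

```shell
#!/bin/sh
# Hypothetical helper: each argument is the number of failed disks in one
# three-way mirror. The pool survives only if no mirror has lost all 3 disks.
pool_survives() (
    disks_per_mirror=3
    for failed in "$@"; do
        if [ "$failed" -ge "$disks_per_mirror" ]; then
            echo "lost"
            exit 0
        fi
    done
    echo "ok"
)

pool_survives 2 2 2   # two of three disks lost in every mirror
pool_survives 0 3 0   # one mirror lost completely
```

Losing two thirds of the disks is therefore a best case: it is survivable only when the failures are spread so that no single mirror loses all three of its disks.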

ZFS allows customers to take snapshots, which are images of their disk at a given moment. Snapshots let customers roll back changes after a mistake, such as deleting a file.

But contractually, we don’t provide a backup product for customers. That may not have been explained clearly enough in our V5 documentation.

Throughout the incident, Gandi made every effort to communicate as close to real time as possible in order to keep its customers informed.

The page https://status.gandi.net/timeline/events/ 01575879 was updated at each stage and shared via our gandinoc, gandi_net, and gandibar Twitter accounts.

At the same time, a situation update was published on our blog, news.gandi.net, on January 9 and updated afterwards.

We have identified internal areas for improvement so that our communication is smoother and more responsive in near real time.

2. Technical timeline

The storage unit affected uses ZFS on Nexenta (Solaris / Illumos kernel).

– January 8 : UTC: One of our storage units hosted at LU-BI1 goes down.

– January 8 : UTC: We engage our failover procedure.

– January 8 : UTC: The usual procedure does not restore service: the pool is FAULTED, which indicates metadata corruption. We investigate.

– January 8 : UTC: The problem may be hardware-related. A team is sent to the facility.

– January 8 : UTC: We change the hardware: no improvement.

– January 8 : UTC: zpool import -f doesn’t work. We try zpool import -fFX to let ZFS search for a good earlier transaction group (txg), that is, a coherent state of the pool.
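The escalation described in this step can be sketched as follows. The function only prints the commands instead of running them. The pool name tank is a placeholder, and the -n dry-run step is an addition that does not appear in the timeline. In current OpenZFS, -f forces the import, -F enables recovery mode by rewinding to an earlier txg, and -X allows a much deeper (and riskier) rewind.

```shell
#!/bin/sh
# Sketch only: prints the import attempts in order of increasing risk.
# "tank" below is a placeholder pool name, not the pool from this incident.
print_import_attempts() {
    pool="$1"
    echo "zpool import -f $pool"        # force import of a pool that looks in use
    echo "zpool import -f -F -n $pool"  # dry run: would a rewind make it importable?
    echo "zpool import -f -F $pool"     # recovery mode: discard the last few txgs
    echo "zpool import -f -F -X $pool"  # extreme rewind: last resort, can be very slow
}

print_import_attempts tank
```

Printing the commands first makes it easier to review each step before anything touches the pool, which matters because the later steps discard recent transactions.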

– January 8 : UTC: zpool import -fFX is running, but slowly. At this rate, it will take days.

– January 8 : UTC: The US team comes online and takes a fresh look at the problem.

– January 8 : 11 UTC: We decide to stop the import to see if there is a way to speed it up.

– January 8 : UTC: We don’t find a solution. We re-run the import.

Data are read at 3M/s; we estimate that the operation could take up to 370 hours. We continue to look for solutions.
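The estimate follows from simple arithmetic on the read rate. The sketch below assumes that "3M / s" means 3 MB/s and uses integer arithmetic; the helper names are made up for illustration.

```shell
#!/bin/sh
# Hours needed to read a given amount of data at a constant rate.
# Arguments: size in MB, rate in MB/s. Integer division rounds down.
estimate_hours() {
    size_mb="$1"
    rate_mb_s="$2"
    echo $(( size_mb / rate_mb_s / 3600 ))
}

# Amount of data (in MB) read in a given number of hours at a constant rate.
data_read_mb() {
    hours="$1"
    rate_mb_s="$2"
    echo $(( hours * 3600 * rate_mb_s ))
}

data_read_mb 370 3          # 3996000 MB, roughly 4 TB
estimate_hours 3996000 3    # 370
```

Under that assumption, 370 hours at 3 MB/s corresponds to reading roughly 4 TB.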

– January 8 23:51 UTC: We have no guarantee that we will be able to restore the data, nor any estimate of how long the process will take. We choose to warn customers that they should use their backups if they have them. The import with the rewind option is still running.

At the same time, we dig through the available documentation and source code repositories. We find that some parameters affect the behavior of zpool import -fFX, so we decide to use mdb to change some values related to spa_load_verify*.

But our version of ZFS is too old and its code does not implement those capabilities. We try to find the right txg manually, but that does not avoid the long scan of the pool.

– January 9 : UTC: We decide to use a recent version of ZFS on Linux (ZOL). A team goes to the facility, where we already have a server configured to use ZOL. We prepare to swap the server handling the JBOD with the one running ZOL.

– January 9 : UTC: We start the import using zpool import -fFX, now with the possibility of avoiding the scan of the whole pool:

echo 0 > /sys/module/zfs/parameters/spa_load_verify_metadata

To speed up the scrub, we modify the following variable:

echo 2020 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active

and modify other variables:

/sys/module/zfs/parameters/zfs_vdev_async_read_max_active
/sys/module/zfs/parameters/zfs_vdev_queue_depth_pct
/sys/module/zfs/parameters/zfs_scan_mem_lim_soft_fact
/sys/module/zfs/parameters/zfs_scan_vdev_limit
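The parameter changes above all follow the same pattern: write a value into a file under /sys/module/zfs/parameters. A minimal sketch of a helper that records the previous value before changing it is shown below. The function name and the optional directory argument (useful for trying it out against a scratch directory) are assumptions, and writing to the real directory requires root and a loaded ZFS on Linux module.

```shell
#!/bin/sh
# Hypothetical helper: set one ZFS module parameter and print the old value.
# Usage: set_zfs_param NAME VALUE [PARAM_DIR]
set_zfs_param() (
    name="$1"
    value="$2"
    dir="${3:-/sys/module/zfs/parameters}"
    file="$dir/$name"

    if [ ! -w "$file" ]; then
        echo "cannot write $file" >&2
        exit 1
    fi
    old=$(cat "$file")
    # Keep the space before '>': "echo 0> file" would redirect fd 0 instead
    # of writing the value 0.
    echo "$value" > "$file"
    echo "$name: $old -> $value"
)

# Example (as root): skip metadata verification during import.
# set_zfs_param spa_load_verify_metadata 0
```

Printing the old value gives a record of what to restore once the recovery is over, since these settings trade safety checks for speed.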

– January 9 : UTC: The import is done in “read only” mode in order not to alter the retrieved data. But in that mode, we can’t take any snapshots. We then redo the import without the “read only” option.

– January 9 : UTC: The second import is done. We take a global snapshot of the pool and copy it to another storage unit for safekeeping.

– January 9 : 35 UTC: We encounter some errors during the copying of the data, so we have to proceed manually.

– January 9 22:45 UTC: We transfer the snapshot with a script; we estimate it will take hours. During the night, the US team monitors the transfer and restarts it if needed.
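Restarting a long transfer by hand is easy to automate. The sketch below is a generic retry wrapper, not the script used during the incident: it reruns any command until it succeeds or a maximum number of attempts is reached. The dataset and host names in the commented example are placeholders.

```shell
#!/bin/sh
# Generic retry wrapper (illustrative).
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() (
    max="$1"
    delay="$2"
    shift 2
    attempt=1
    while true; do
        if "$@"; then
            echo "succeeded after $attempt attempt(s)"
            exit 0
        fi
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempt(s)" >&2
            exit 1
        fi
        attempt=$(( attempt + 1 ))
        sleep "$delay"
    done
)

# Example with placeholder names; a pipeline has to be wrapped in sh -c:
# retry 5 60 sh -c 'zfs send pool/data@rescue | ssh backup-host zfs receive -F backup/data'
```

A plain retry starts the stream again from the beginning. Recent OpenZFS releases also support resumable streams (zfs receive -s on the destination, then zfs send -t with the resume token), which may be a better fit for very large transfers.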

– January : UTC: We have transferred half of the data.

– January 22: UTC: The transfer is still running, but more slowly than expected.

– January : UTC: The transfer is done, but it missed a lot of snapshots.
The dependencies between snapshots and their origins blocked many of the transfers. We need to delete a lot of destination targets and transfer them again.

– From January until : 06 UTC: The data transfer is ongoing.

– January : UTC: Manual transfer is done.

– January : UTC: We launch an integrity check on the pool.

– January : UTC: The integrity check is OK.
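The timeline does not say how the integrity check was run. With ZFS, a common approach is a scrub (zpool scrub) followed by a look at zpool status. The sketch below assumes that approach and only parses status text passed on standard input, so it can be tried without a pool; the function name is hypothetical.

```shell
#!/bin/sh
# Reads `zpool status` output on stdin; succeeds only if ZFS reports
# no known data errors for the pool.
pool_is_clean() {
    grep -q "errors: No known data errors"
}

# Example against a real pool (placeholder name "tank"):
# zpool scrub tank            # start the scrub, which runs in the background
# zpool status tank | pool_is_clean && echo "integrity check ok"
```

The status output should only be checked once the scrub has finished, because a scrub still in progress can report no errors simply because it has not reached the damaged blocks yet.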

– January : UTC: Almost everything is ready to bring the data online. We page the infrastructure, hosting, and customer care teams to be ready for : UTC on January 15.

– January : 07 UTC: We begin the procedure of bringing the data back online.

– January : UTC: Data are online.

– January : UTC: We start PAAS instances.

3. Postmortem: what happened?

Based on the logs and the commands run before the crash, we have ruled out a human cause.

We have no clear explanation of the problem, only theories.

We think it may be due to a hardware problem linked to the server RAM.

We acknowledge the main problem was the duration.

Mid-term corrective actions plan:

Upgrade the ZFS version of the remaining storage units, as planned during the year.
Accurately document the data recovery procedure in case of metadata corruption.
