On January 8, at 16: UTC, one of the storage units used for our hosting services at LU-BI1 went down.
We managed to restore the data and bring services back online on the morning of January 15.
The incident impacted at most 728 customers hosted at our Luxembourg facility.
- First we’ll explain the situation with a bit more context.
- Next, we’ll give you the technical timeline.
- And third, we’ll provide the full post mortem.
- zpool import -f doesn’t work
- We try zpool import -fFX to let ZFS find an earlier good transaction group (txg), that is, a coherent state of the pool.
- Upgrade the ZFS version of the remaining storage units, as planned during the year.
- Accurately document the data recovery procedure in case of metadata corruption.
1. The situation and our reflections on it
Gandi provides IaaS and PaaS services.
To store your data, we use a file system called ZFS, running on either Nexenta or FreeBSD as the operating system. Nexenta is used on the old version of our storage system, and we will be scheduling its migration.
To protect your data against hard drive failure, we use ZFS with triple mirroring, meaning the pool can lose up to two thirds of its disks, as long as at least one disk survives in every mirror.
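As a rough illustration of the layout described above, a pool built from three-way mirrors can be declared as in the sketch below. The pool name `tank`, the device names, and the helper `mirror_pool_cmd` are placeholders for illustration, not Gandi's actual configuration; the command is only printed, not run.

```shell
# Sketch: print the zpool command for a pool made of 3-way mirrors.
# Pool and device names are hypothetical placeholders.
mirror_pool_cmd() {
    pool=$1; shift
    cmd="zpool create $pool"
    # Consume the device list three at a time: one mirror vdev per triple.
    while [ "$#" -ge 3 ]; do
        cmd="$cmd mirror $1 $2 $3"
        shift 3
    done
    printf '%s\n' "$cmd"
}

mirror_pool_cmd tank da0 da1 da2 da3 da4 da5
```

Each mirror keeps working after losing two of its three disks, but the pool is lost if all three disks of any single mirror fail.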
ZFS allows customers to take snapshots, which are images of their disk at a given moment. Snapshots let customers roll back changes after a mistake, such as deleting a file.
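For readers unfamiliar with ZFS snapshots, the underlying commands look roughly like the sketch below. The dataset and snapshot names are hypothetical, and the helpers `run`, `take_snapshot`, and `rollback_to` are illustrative; they print the commands unless `RUN=1` is set.

```shell
# Sketch: snapshot and rollback helpers; dataset and snapshot names are hypothetical.
# Commands are printed rather than executed unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi
}
take_snapshot() { run zfs snapshot "$1@$2"; }
rollback_to()   { run zfs rollback "$1@$2"; }

take_snapshot tank/customer-disk before-cleanup
rollback_to   tank/customer-disk before-cleanup
```

Note that `zfs rollback` targets the most recent snapshot by default; going back further requires `-r`, which destroys the later snapshots.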
But contractually, we don’t provide a backup product for customers. That may not have been explained clearly enough in our V5 documentation.
Gandi made every effort during the incident to communicate as close to real time as possible in order to keep its customers informed.
The page https://status.gandi.net/timeline/events/ 01575879 was updated at each stage and shared via our gandinoc, gandi_net, and gandibar Twitter accounts.
At the same time, a situation update was published on our blog, news.gandi.net, on January 9 and updated again later in January.
We have identified areas for improvement internally in order to be even more fluid and responsive in near-real time.
2. Technical timeline
The storage unit affected uses ZFS on Nexenta (Solaris / Illumos kernel).
– January 8 : UTC: One of our storage units hosted at LU-BI1 goes down.
– January 8 : 823 UTC: We engage our failover procedure.
– January 8 : UTC: The usual procedure does not permit service recovery: the pool is FAULTED, indicating metadata corruption. We investigate.
– January 8 : UTC: The problem may be related to a hardware problem. A team is sent to the facility.
– January 8 : UTC: We change the hardware: no improvement.
– January 8 : UTC:
– January 8 : UTC: zpool import -fFX
– January 8 : 01575879 UTC: The US team is online and takes a fresh look at the problem.
– January 8 : 11 UTC: We decide to stop the import to see if there is a way to speed it up.
– January 8 : UTC: We don’t find a solution. We re-run the import.
Disks are read at 3M/s; we estimate the duration of the operation to be up to 370 hours. We continue to look for solutions.
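The 370-hour figure is consistent with simple arithmetic. The sketch below assumes that "3M/s" means 3 MB/s and back-calculates a hypothetical amount of data to read (about 4 TB); neither the unit nor the volume is stated in the timeline, and `estimate_hours` is an illustrative helper.

```shell
# Sketch: estimated scan duration in whole hours, rounded up.
# $1 = data to read in MB (hypothetical figure), $2 = read rate in MB/s.
estimate_hours() {
    seconds=$(( $1 / $2 ))
    echo $(( (seconds + 3599) / 3600 ))
}

estimate_hours 3996000 3   # 3,996,000 MB read at 3 MB/s
```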
– January 8 23:51 UTC: We have no guarantee that we will be able to restore the data, nor any estimate of how long the process will take. We choose to warn customers that they should use their backups if they have them. The import with the rewind option is still running.
At the same time, we dig through the available documentation and source code repositories. We identify parameters related to zpool import -fFX that we could change, so we decide to modify some values related to spa_load_verify* using mdb.
But our version of ZFS is too old and its code does not implement those capabilities. We try to find the right txg manually, but that does not shorten the long pool scan.
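Finding a candidate txg by hand usually means reading the uberblocks stored in a device label. The sketch below assumes a ZFS release whose `zdb -lu <device>` prints lines of the form `txg = N` and whose `zpool import` accepts `-T <txg>`; the Nexenta version involved in this incident may not offer either in this form. The helper `list_txgs`, the device path, and the pool name are hypothetical.

```shell
# Sketch: list distinct uberblock txgs from `zdb -lu` output, newest first.
# Reads zdb output on stdin so it can also be tried on saved output.
list_txgs() {
    awk '$1 == "txg" && $2 == "=" { print $3 }' | sort -rnu
}

# Typical use (not run here; device and pool names are hypothetical):
#   zdb -lu /dev/disk/by-id/EXAMPLE-part1 | list_txgs
#   zpool import -o readonly=on -f -T <txg> tank
```

Importing read-only while testing a candidate txg avoids writing anything to the damaged pool.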
– January 9 : UTC: We decide to use a recent version of ZFS on Linux (ZOL). A team goes to the facility, where we already have a server configured to use ZOL. We prepare to swap the server handling the JBOD with the one running ZOL.
– January 9 : UTC: We start the import using zpool import -fFX.
To speed up the scrub, we set:
echo 2020 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
and we modify other variables:
/sys/module/zfs/parameters/zfs_vdev_async_read_max_active
/sys/module/zfs/parameters/zfs_vdev_queue_depth_pct
/sys/module/zfs/parameters/zfs_scan_mem_lim_soft_fact
/sys/module/zfs/parameters/zfs_scan_vdev_limit
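On ZFS on Linux these tunables are plain files, so they can be inspected and changed with a few lines of shell. The sketch below is an assumption about how one might do this carefully, not the exact commands used during the incident: `ZFS_PARAM_DIR` and both function names are hypothetical, and apart from the value shown above for `zfs_vdev_scrub_max_active`, the timeline does not give the values that were used.

```shell
# Sketch: read and set ZFS module parameters through sysfs.
# ZFS_PARAM_DIR is a hypothetical override, handy for dry runs in a scratch directory.
ZFS_PARAM_DIR=${ZFS_PARAM_DIR:-/sys/module/zfs/parameters}

show_zfs_params() {
    for name in "$@"; do
        printf '%s=%s\n' "$name" "$(cat "$ZFS_PARAM_DIR/$name")"
    done
}

set_zfs_param() {
    # Print the previous value so the change can be reverted after the import.
    printf 'previous %s=%s\n' "$1" "$(cat "$ZFS_PARAM_DIR/$1")"
    printf '%s\n' "$2" > "$ZFS_PARAM_DIR/$1"
}

# Typical use as root (not run here):
#   show_zfs_params zfs_vdev_scrub_max_active zfs_scan_vdev_limit
#   set_zfs_param zfs_vdev_scrub_max_active 2020
```

Values written this way last only until the module is reloaded or the machine reboots, which suits a one-off recovery.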
– January 9 : 2019 UTC: The import is done in read-only mode so as not to alter the retrieved data. But in that mode we can’t take any snapshots, so we redo the import without the read-only option.
– January 9 : UTC: The second import is done. We take a global snapshot of the pool and copy it to another storage unit for safekeeping.
– January 9 : 35 UTC: We encounter some errors during the copying of the data, so we have to proceed manually.
– January 9 22:45 UTC: We transfer the snapshot with a script; we estimate it will take hours. During the night the US team monitors the transfer and restarts it if needed.
– January : UTC: We have transferred half of the data.
– January 22: UTC: Still transferring but it is slower than expected.
– January : UTC: The transfer is done, but it missed a lot of snapshots.
The dependencies between snapshots and their origins prevent many of the transfers. We need to delete many destination datasets in order to transfer them again.
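A clone can only be received after the snapshot it was created from, so one way to avoid failed transfers is to order datasets so that every origin comes before its clones. The sketch below is one possible approach, not the script used during the incident; it relies on the `origin` property reported by `zfs get` and on the standard `tsort` utility. The function name `transfer_order` and the pool name `tank` are placeholders.

```shell
# Sketch: order datasets so that each clone comes after its origin dataset.
# Input: "name<TAB>origin" lines, as printed by
#   zfs get -H -r -t filesystem,volume -o name,value origin tank
# where origin is "-" for datasets that are not clones.
# Limitation: dataset names containing spaces are not handled.
transfer_order() {
    awk -F '\t' '{
        if ($2 == "-") { print $1, $1 }            # no dependency: just declare the node
        else { sub(/@.*/, "", $2); print $2, $1 }  # origin dataset must come first
    }' | tsort
}

# Typical use (not run here):
#   zfs get -H -r -t filesystem,volume -o name,value origin tank | transfer_order
```

This only handles clone-to-origin ordering; parent and child datasets, and the snapshots within each dataset, still have to be sent in their own order.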
– From January until : 06 UTC: The data transfer is ongoing.
– January : UTC: Manual transfer is done.
– January : UTC: We launch an integrity check on the pool.
– January : UTC: Integrity check is OK.
– January : UTC: Everything is almost ready to bring the data online. We page the infrastructure, hosting, and customer care teams to be ready for : UTC on January 15.
– January : 07 UTC: We begin the procedure of bringing the data back online.
– January : UTC: Data are online.
– January : UTC: We start PaaS instances.
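The timeline does not describe how the integrity check was performed. On ZFS such a check is commonly a scrub followed by a look at `zpool status`; the sketch below assumes that approach, and the helper `pool_is_clean` and the pool name are hypothetical.

```shell
# Sketch: succeed only if `zpool status` output (on stdin) reports no data errors.
pool_is_clean() {
    grep -q '^errors: No known data errors$'
}

# Typical use (not run here; the pool name is hypothetical):
#   zpool scrub tank                                   # start the scrub
#   zpool status tank | pool_is_clean && echo "pool is clean"
```

The check is only meaningful once the scrub has finished, which `zpool status` also reports.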
3. Postmortem: what happened?
Based on the logs and the commands run before the crash, we have ruled out human error.
We have no clear explanation of the problem, only theories.
We think it may be due to a hardware problem linked to the server RAM.
We acknowledge the main problem was the duration.
Why was a storage unit down?
Due to a software or hardware crash leading to metadata corruption and a prolonged interruption of services.
Why was a storage unit down for so long?
We were unable to import the pool.
Why was the import of the pool not possible?
Due to metadata corruption. The data recovery procedure could not be completed in a short period of time.
Why was the import not possible in a short period of time?
The version of ZFS we are using on this filer does not implement the option to skip a full scan of the pool. And for safety reasons, we chose to duplicate the data.
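For reference, recent ZFS on Linux releases expose the `spa_load_verify_*` settings mentioned in the timeline as module parameters. The sketch below prints, rather than runs, one plausible command sequence for a rewind import that skips the verification scan. The function name `fast_rewind_plan` and the pool name are placeholders, and the exact sequence used during the incident is not documented here.

```shell
# Sketch: print a rewind-import plan that skips the full verification scan.
# Nothing is executed; review the commands before running them as root.
fast_rewind_plan() {
    params=/sys/module/zfs/parameters
    printf 'echo 0 > %s/spa_load_verify_metadata\n' "$params"
    printf 'echo 0 > %s/spa_load_verify_data\n' "$params"
    # Import read-only first so nothing is written to the damaged pool.
    printf 'zpool import -o readonly=on -fFX %s\n' "$1"
}

fast_rewind_plan tank
```

Skipping verification trades safety for speed: the import no longer confirms that the chosen transaction group is fully readable, which is one more reason to copy the data elsewhere before relying on it.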
Why was this option not implemented?
This option is not available on this version of zfs.
Why is the version of zfs on this filer too old?
Because the unit is part of the last batch of filers to be migrated to a newer version.
Mid-term corrective actions plan: