Thursday, February 25, 2021

crates.io incident report for 2020-02-20 | Inside Rust Blog, Hacker News

      

On 2020-02-20 we received a report from a crates.io user that their crate was still not available on the index 12 minutes after the upload. The cause was a bug in the crates.io webapp, exposed by a GitHub outage.

Root cause of the outage

In some corner cases, the code that uploads new commits to the GitHub repository of the index returned a successful status even though the push itself had failed. As a result, the job scheduler treated the upload as successful and removed the job from the queue, causing data loss.

The outage was caused by that bug, triggered by an unexpected response during the GitHub outage happening at the same time.
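The failure mode described above can be illustrated with a minimal sketch. This is not the crates.io code: `Job`, `PushFailed`, and `run_next` are hypothetical names, and the queue is a plain in-memory `VecDeque`. The only point is that a job should leave the queue only when the push reports success, so a failed push is retried instead of being lost.

```rust
use std::collections::VecDeque;

/// Hypothetical background job: push one crate's index entry.
#[derive(Debug, Clone, PartialEq)]
pub struct Job {
    pub crate_name: String,
}

/// Hypothetical error returned when the push to the index fails.
#[derive(Debug, PartialEq)]
pub struct PushFailed(pub String);

/// Runs the job at the front of the queue with the given `push` function.
/// The job is removed only if `push` returns `Ok`; on `Err` it stays queued.
/// Returns `None` when the queue is empty.
pub fn run_next<F>(queue: &mut VecDeque<Job>, push: F) -> Option<Result<(), PushFailed>>
where
    F: FnOnce(&Job) -> Result<(), PushFailed>,
{
    let result = push(queue.front()?);
    if result.is_ok() {
        // Only a confirmed push may drop the job from the queue.
        queue.pop_front();
    }
    Some(result)
}
```

In this sketch, a `push` that wrongly returns `Ok(())` after a failed upload would reproduce the bug: the job is dropped and never retried.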

Resolution

The team analyzed the code of the background job that uploads commits to the index and found a possible cause of the misreported success. A team member wrote the fix, another one reviewed it, and we then deployed the patch directly to production.

At the same time, once we saw the index started to be updated again, we removed the broken entries in the database manually and asked the reporter to upload their crates again.

Affected crates

Postmortem

Deploying the change took way longer than expected: some changes had landed in master but were still waiting to be deployed to production, which increased both the length of the build process and the risk of the deploy. In the future we should deploy hotfixes by branching off the currently deployed commit and cherry-picking the fix on top of that. We should also strive to reduce the amount of time PRs sit in master without being live.

Nobody was paged due to this incident, as our monitoring and alerting system wasn’t able to catch the problem: we have monitoring in place for jobs failing to execute, but in this case the job was mistakenly marked as successful. We should implement periodic checks that ensure the database and the index are correctly synchronized.
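One possible shape for such a periodic check is sketched below. It assumes the published releases in the database and the entries in the index can each be loaded as a set of (crate name, version) pairs; the `Release` alias and `missing_from_index` are hypothetical names, not part of crates.io.

```rust
use std::collections::BTreeSet;

/// Hypothetical release identifier: (crate name, version).
pub type Release = (String, String);

/// Returns every release recorded in the database that has no matching
/// entry in the index, in ascending order.
pub fn missing_from_index(
    database: &BTreeSet<Release>,
    index: &BTreeSet<Release>,
) -> Vec<Release> {
    database.difference(index).cloned().collect()
}
```

A non-empty result would be a reason to alert. A real check would probably need to ignore releases published within the last few minutes, since their background jobs may legitimately still be queued.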

We were lucky that two members of the team with access to both the support email and the production environment were online during the outage: without paging available we could’ve noticed it way later than we did.

During the incident investigation we also found that our logging was not good enough to properly diagnose the problem: no message is logged when a commit is pushed to the index, nor when a background job is executed. Also, the log line for the API call that publishes new crates doesn’t include the crate name. We should enhance our logging capabilities so that we can find the root cause of issues quickly during future incidents.
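As a small illustration of the last point, a publish log line that carries the crate name could be built as follows. The field names and the `key=value` layout are assumptions made for this sketch, not the format crates.io uses.

```rust
/// Builds a hypothetical log line for a publish request that includes
/// the crate name and version, so a single upload can be traced later.
pub fn publish_log_line(crate_name: &str, version: &str, status: u16) -> String {
    format!(
        "event=publish crate={} version={} status={}",
        crate_name, version, status
    )
}
```

With a line like this, searching the logs for `crate=<name>` would show whether and when a given upload reached the API.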

Timeline of events

It took more than 1 hour from the start of the incident to the deploy of the fix.

2020-02-20
