
Diffing coronaviruses


We can use standard UNIX tools to investigate the origins of the Wuhan coronavirus! I read on Wikipedia that “2019-nCoV has been reported to have a genome sequence […]% to […]% identical to the SARS-CoV and to have more similarities to several bat coronaviruses.” We can use diff to see those similarities:

This reports the percentage similarity between bat CoV (accession MG….1) and human nCoV (accession MN988713.1). More precisely, it reports the length of the subsequence they share (29802 bases here) relative to the total length of the genome.
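To make the "shared subsequence" idea concrete, here is a minimal sketch that computes a similar percentage for two short strings. It is not the genome_diff script: it swaps wdiff for plain diff by writing one base per line, and the function name common_percent and the sample strings are made up for illustration. It assumes the first sequence is non-empty and that grep supports -o (GNU and BSD grep both do).

```shell
#!/bin/sh
# common_percent SEQ_A SEQ_B
# Prints the percentage of SEQ_A's bases that diff treats as shared with SEQ_B.
# Hypothetical helper for illustration; assumes SEQ_A is non-empty.
common_percent() (
  a=$(mktemp) && b=$(mktemp) || exit 1
  # Write one base per line, so that line-based diff compares single bases.
  printf '%s\n' "$1" | grep -o . > "$a"
  printf '%s\n' "$2" | grep -o . > "$b"
  total=$(grep -c . "$a")
  # In diff's default output, lines starting with "<" occur only in the first file.
  removed=$(diff "$a" "$b" | grep -c '^<' || true)
  rm -f "$a" "$b"
  # Integer arithmetic, so the result is rounded down.
  echo $(( 100 * (total - removed) / total ))
)

# Two made-up 8-base sequences that differ in one position.
common_percent ATTAAAGG ATTACAGG   # prints 87
```

On very large inputs diff may settle for a non-minimal set of changes, so treat the result as an approximation rather than an exact figure.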

That genome_diff script looks like this:

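    #!/bin/bash

    # Download one genome by accession number and save it, one base per word,
    # in a file named after the accession.
    fetch_genome() {
      curl -s "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=${1}" \
        | grep -v '^>' \
        | tr -d -C 'ATGC' \
        | sed 's/\(.\)/\1 /g' \
        > "$1"
    }

    fetch_genome "$1"
    fetch_genome "$2"

    # -s prints statistics; -1, -2 and -3 suppress deleted, inserted and common words.
    wdiff -s -123 "$1" "$2"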

This script works by fetching each genome from the NCBI database. The two strings beginning “MG” and “MN” (for example, MN988713.1) are accession numbers. The API returns the RNA sequence in FASTA format, which looks like:

    $ curl -s 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1'
    >MN988713.1 Wuhan seafood market pneumonia virus isolate 2019-nCoV/USA-IL1/…, complete genome
    ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
    CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
    TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
    TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC
    ...

The FASTA format needs a bit of “massaging” before we can diff it. The first line, starting with >, describes the sequence that follows. We don’t need this metadata, so we strip it with grep -v '^>'. Next, we don’t need those newline characters, so we strip them with tr -d -C 'ATGC'. Finally, because diff works on lines rather than characters, we’ll instead use wdiff, after separating the characters into separate words using sed 's/\(.\)/\1 /g'. This gives us genomes that look like A T T A A A G G ... .
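The three massaging steps can be tried offline on a tiny record. This is only a sketch: the function name fasta_to_words and the two-line sample record are made up for illustration, and the pipeline inside is the grep, tr and sed sequence described above.

```shell
#!/bin/sh
# fasta_to_words: read a FASTA record on stdin, write its bases as
# space-separated words on stdout. Hypothetical helper name.
fasta_to_words() {
  # 1. grep drops the description line that starts with ">".
  # 2. tr deletes every character that is not A, T, G or C, newlines included.
  # 3. sed appends a space to each base, so wdiff sees one word per base.
  grep -v '^>' \
    | tr -d -C 'ATGC' \
    | sed 's/\(.\)/\1 /g'
}

# A made-up record, not a real genome.
printf '>made-up example record\nATTA\nAAGG\n' | fasta_to_words
echo
```

Because tr removes the newlines, the whole genome ends up on a single line, with a trailing space after the last base.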

Finally, we can call wdiff -s -123 on these genomes, which gives us some statistics about their similarity. If we omit -s -123, we get the actual base differences between the sequences. For example, check out the end of the sequences:

We can see that both sequences end in a long run of As, but the bat CoV's tail is significantly longer. This is known as a “poly(A) tail”.
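If we want a number rather than eyeballing the diff, the tail length can be measured directly. This is a minimal sketch, assuming the sequence is passed as a single string of bases with no spaces; poly_a_tail_length is a made-up name and the sample input is not a real genome.

```shell
#!/bin/sh
# poly_a_tail_length SEQ
# Prints how many consecutive A bases SEQ ends with. Hypothetical helper.
poly_a_tail_length() (
  # Delete everything up to and including the last base that is not A;
  # what remains is the trailing run of As (possibly empty).
  run=$(printf '%s\n' "$1" | sed 's/.*[^A]//')
  echo "${#run}"
)

poly_a_tail_length GATTACAAAA   # prints 4
```

Running it on each genome would give two tail lengths to compare.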

A different way to see similarities is to use NCBI's BLAST tool. Enter the accession number MN988713.1, and you’ll get a list of other sequences, ranked by “percent identity”. The most similar are several recent sequences of 2019-nCoV, followed by the “Bat SARS-like coronavirus”, followed by many SARS coronavirus sequences.

      

Tagged #programming, #bioinformatics. All content copyright James Fisher. This post is not associated with my employer.

