Diffing coronaviruses

We can use standard UNIX tools to investigate the origins of the Wuhan coronavirus! I read on Wikipedia that “2019-nCoV has been reported to have a genome sequence …% to …% identical to the SARS-CoV and to have more similarities to several bat coronaviruses.” We can use diff to see those similarities:

    $ ./genome_diff MG….1 MN988713.1
    MG….1: … words  …% common  3% deleted  8% changed
    MN988713.1: … words  …% common  3% inserted  29800 8% changed

This says that there’s a …% similarity between bat CoV (MG….1) and human nCoV (MN988713.1): the “% common” figure. More precisely, they share a subsequence of 29802 bases, in a total genome of only ~… bases.
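To make the “% common” idea concrete, here is a minimal sketch on toy sequences. It is not the genome_diff script: it swaps wdiff for plain diff on one-base-per-line files, and the names common_bases and percent_common are made up for illustration. It assumes non-empty sequences that contain only base letters.

```shell
#!/bin/bash
# Illustrative sketch only: common_bases and percent_common are hypothetical
# helpers, and plain diff stands in for wdiff.

# Print how many bases of sequence $1 diff can match, in order, in sequence $2.
# The body runs in a subshell, so its variables stay local.
common_bases() (
  tmp=$(mktemp -d) || exit 1
  # One base per line, so that line-based diff compares base by base.
  printf '%s\n' "$1" | fold -w 1 > "$tmp/old"
  printf '%s\n' "$2" | fold -w 1 > "$tmp/new"
  # Lines starting with "<" are bases of $1 that diff could not match in $2.
  old_only=$(diff "$tmp/old" "$tmp/new" | grep -c '^<' || true)
  rm -rf "$tmp"
  echo $(( ${#1} - old_only ))
)

# Shared bases as a percentage of the length of $1, rounded down.
percent_common() {
  echo $(( 100 * $(common_bases "$1" "$2") / ${#1} ))
}

common_bases ATGCA ATCA      # the second toy sequence lacks the G
percent_common ATGCA ATCA
```

On these toy inputs the first call reports 4 shared bases out of 5, and the second reports 80.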

That genome_diff script looks like this:

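    #!/bin/bash

    # Fetch one genome as FASTA, reduce it to space-separated bases,
    # and save it in a file named after its accession number.
    fetch_genome() {
      curl -s "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=$1" \
        | grep -v '^>' \
        | tr -d -C 'ATGC' \
        | sed 's/\(.\)/\1 /g' \
        > "$1"
    }

    fetch_genome "$1"
    fetch_genome "$2"

    # -s prints statistics; -123 hides deleted, inserted and common words.
    wdiff -s -123 "$1" "$2"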

This script works by fetching the genome from the NCBI database. The strings “MG….1” and “MN988713.1” are accession numbers. The API returns the RNA sequence in FASTA format, which looks like:

    $ curl -s 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1'
    >MN988713.1 Wuhan seafood market pneumonia virus isolate 2019-nCoV/USA-IL1/…, complete genome
    ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
    CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
    TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
    TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC
    ...

The FASTA format needs a bit of “massaging” before we can diff it. The first line, starting with >, describes the sequence that follows. We don’t need this metadata, so we strip it with grep -v '^>'. Next, we don’t need the newline characters, so we strip them with tr -d -C 'ATGC', which deletes every character that is not one of the four base letters. Finally, because diff works on lines rather than characters, we instead use wdiff, which works on words, after putting each base in its own word with sed 's/\(.\)/\1 /g'. This gives us genomes that look like A T A T T A G G ... .
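The three filters can be tried without touching the network. The sketch below wraps them in a function and feeds it a made-up two-line FASTA record; the name massage_fasta and the record toy.1 are invented for this example and are not part of genome_diff.

```shell
#!/bin/bash
# Illustrative sketch: massage_fasta is a hypothetical name for the three
# filters described above, and the FASTA record is a made-up toy input.

# Read FASTA on stdin, write space-separated bases on stdout.
massage_fasta() {
  grep -v '^>' |          # drop the ">" description line
    tr -d -C 'ATGC' |     # delete everything except the four base letters
    sed 's/\(.\)/\1 /g'   # put a space after every base: one word per base
}

printf '>toy.1 made-up record\nATTA\nAAGG\n' | massage_fasta
echo    # the filters may leave no trailing newline, so end the line here
```

The toy record comes out as A T T A A A G G, with a trailing space after the last base, which is the word-per-base shape that wdiff needs.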

Finally, we can call wdiff -s -123 on these genomes, which gives us some statistics about their similarity: -s prints the statistics, and -123 suppresses the deleted, inserted and common words themselves. If we omit -s -123, we get the actual base differences between the sequences. For example, check out the end of the sequences:

    $ ./genome_diff MG….1 MN988713.1 | fold | tail -2
    [-A A C C A C-] T [-C G A C A-] {+T+} A G {+G+} A {+G+} A A {+T G+} A [-A A A A A A A A A A-] {+C+} A A A A A A A A A A A

We can see that both sequences end in a long run of As, but the bat CoV's tail is significantly longer. This is known as a “poly(A) tail”.
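To compare tail lengths directly, rather than reading them off the diff, one could count the trailing As of each sequence. This is a small sketch under the assumption that the sequence is a plain string of bases without spaces (that is, before the sed step); poly_a_length is a hypothetical helper and the inputs are toy strings, not the real genomes.

```shell
#!/bin/bash
# Illustrative sketch: poly_a_length is a hypothetical helper, and the
# sequences below are toy strings rather than real genome data.

# Print the number of consecutive A bases at the end of sequence $1.
# The body runs in a subshell, so its variable stays local.
poly_a_length() (
  # Remove the longest prefix that ends in a non-A character. What remains
  # is the trailing run of As (or the whole string if it is all As).
  tail_as=${1##*[!A]}
  echo "${#tail_as}"
)

poly_a_length GATTACAAAA   # ends in four As
poly_a_length GATTAC       # ends in C, so the tail is empty
```

The two calls report 4 and 0.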

A different way to see similarities is to use NCBI's BLAST tool. Enter the accession number MN988713.1, and you’ll get a list of other sequences, ranked by “percent identity”. The most similar are several recent sequences of 2019-nCoV, followed by the “Bat SARS-like coronavirus”, followed by many SARS coronavirus sequences.


Tagged #programming, #bioinformatics. All content copyright James Fisher. This post is not associated with my employer.
