We can use standard UNIX tools to investigate the origins of the Wuhan coronavirus! I read on Wikipedia that the genome sequence of 2019-nCoV has been reported to be partly identical to that of SARS-CoV and “to have more similarities to several bat coronaviruses.” We can use `diff` to see those similarities:
    $ ./genome_diff MG….1 MN988713.1
    MG….1: … words  … …% common  … 3% deleted  … 8% changed
    MN988713.1: … words  … …% common  … 3% inserted  … 8% changed
With 3% of the bases deleted or inserted and 8% changed, this says that there’s roughly an 89% similarity between bat CoV (MG….1) and human nCoV (MN988713.1). More precisely, they share a common subsequence that covers most of a total genome of only ~30,000 bases.
That `genome_diff` script looks like this:
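    #!/bin/bash
    # Download a genome by accession number and save it, one base per word,
    # in a file named after the accession number.
    fetch_genome() {
      curl -s "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=$1" \
        | grep -v '^>' \
        | tr -d -C 'ATGC' \
        | sed 's/\(.\)/\1 /g' \
        > "$1"
    }
    fetch_genome "$1"
    fetch_genome "$2"
    # -s prints statistics; -1, -2 and -3 hide deleted, inserted and common words
    wdiff -s -123 "$1" "$2"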
This script works by fetching the genome from the NCBI database. The strings “MG….1” and “MN988713.1” are accession numbers. The API returns the RNA sequence in FASTA format, which looks like:
    $ curl -s 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1'
    >MN988713.1 Wuhan seafood market pneumonia virus isolate 2019-nCoV/USA-IL1/…, complete genome
    ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
    CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
    TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
    TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC
    ...

The FASTA format needs a bit of “massaging” before we can `diff` it. The first line, starting with `>`, describes the sequence that follows. We don’t need this metadata, so we strip it with `grep -v '^>'`. Next, we don’t need those newline characters, so we strip them with `tr -d -C 'ATGC'`. Finally, because `diff` works on lines rather than characters, we’ll instead use `wdiff`, after separating the characters into separate words using `sed 's/\(.\)/\1 /g'`. This gives us genomes that look like:

    A T A T T A G G ...
Finally, we can call `wdiff -s -123` on these genomes, which gives us some statistics about their similarity. If we omit `-s -123`, we get the actual base differences between the sequences. For example, check out the end of the sequences:

    $ ./genome_diff MG….1 MN988713.1 | fold | tail -2
    [-A A C C A C-] T [-C G A C A-] {+T+} A G {+G+} A {+G+} A A {+T G+} A [-A A A A
    A A A A A A-] {+C+} A A A A A A A A A A A

We can see that the sequences both have a long sequence of `A`s at the end, but the bat CoV’s tail is significantly longer. This is known as a “poly(A) tail”.

A different way to see similarities is to use NCBI’s BLAST tool. Enter the accession number `MN988713.1`, and you’ll get a list of other sequences, ranked by “percent identity”. The most similar are several recent sequences of 2019-nCoV, followed by the “Bat SARS-like coronavirus”, followed by many SARS coronavirus sequences.
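Going back to the poly(A) tails seen in the diff output, the tail lengths can also be measured directly rather than read off the diff. The following is a rough sketch: `poly_a_length` is a hypothetical helper name, and it assumes the input is a sequence on stdin, either raw or in the space-separated form that `genome_diff` writes to disk.

```shell
#!/bin/bash
# poly_a_length is a hypothetical helper: it prints how many 'A's end the sequence on stdin.
poly_a_length() {
  tr -d -C 'ATGC' |     # keep only base letters, so spaces and newlines don't matter
    grep -o 'A*$' |     # extract the run of 'A's at the very end, if there is one
    tr -d '\n' |        # drop the newline grep adds after the match
    wc -c |             # count the remaining characters
    tr -d ' '           # some wc implementations pad the number with spaces
}

# Made-up example sequence ending in five 'A's.
printf 'GATTACAAAAA\n' | poly_a_length
```

The example prints `5`. Since `genome_diff` saves each genome in a file named after its accession number, something like `poly_a_length < MN988713.1` should work on those files too.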
Tagged #programming, #bioinformatics. All content copyright James Fisher. This post is not associated with my employer.