Tuesday, April 23, 2013

Genome on a Hard Drive

I ignored all of 23andMe's security warnings and downloaded my raw SNP-level genome data from their servers onto my hard drive. It's a 10-megabyte text file which starts at chromosome 1 and ends with the X and Y chromosomes and my mitochondrial DNA. Here is how the file starts.

# This data file generated by 23andMe at: Tue Apr 23 09:13:29 2013
# Below is a text version of your data.  Fields are TAB-separated
# Each line corresponds to a single SNP.  For each SNP, we provide its identifier
# (an rsid or an internal id), its location on the reference human genome, and the
# genotype call oriented with respect to the plus strand on the human reference sequence.
# We are using reference human assembly build 37 (also known as Annotation Release 104).
# Note that it is possible that data downloaded at different times may be different due to ongoing
# improvements in our ability to call genotypes. More information about these changes can be found at:
# https://www.23andme.com/you/download/revisions/
# More information on reference human assembly build 37 (aka Annotation Release 104):
# http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=9606
# rsid chromosome position genotype
rs4477212 1       82154 AA
rs3094315 1       752566 AA
rs3131972 1       752721 GG
rs12124819 1       776546 AA
rs11240777 1       798959 GG
rs6681049 1       800007 CC
rs4970383 1       838555 AC
rs4475691 1       846808 CT
rs7537756 1       854250 AG
rs13302982 1       861808 GG
rs1110052 1       873558 GT

... and so on for 16,563 pages ...

In Microsoft Word it takes ten minutes to open due to its vast size and occupies 20 Megabytes.

As you can see, to the human eye this enormous heap of data is both incomprehensible and useless, but there are obviously tools (programs) which can make sense of it. 23andMe itself used such tools to profile my health risks, inherited traits and ethnicity/ancestry.
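As a rough illustration of the first step any such program has to take, here is a minimal Python sketch that reads the format shown above: it skips the `#` comment lines, splits each remaining line on TABs into rsid, chromosome, position and genotype, and counts SNPs per chromosome. The function names `parse_raw_genotypes` and `count_by_chromosome` are my own inventions for this example, not anything 23andMe provides, and the sample is just the first three data lines from the excerpt.

```python
import io
from collections import Counter


def parse_raw_genotypes(lines):
    """Yield (rsid, chromosome, position, genotype) from 23andMe-style lines.

    Assumes the layout described in the file header: '#' comment lines,
    then four TAB-separated fields per SNP.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comments
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # ignore anything that does not look like a SNP row
        rsid, chromosome, position, genotype = fields
        yield rsid, chromosome, int(position), genotype


def count_by_chromosome(records):
    """Count how many SNPs were reported on each chromosome."""
    return Counter(chromosome for _, chromosome, _, _ in records)


# The first few data lines from the excerpt above, with real TABs.
sample = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs4477212\t1\t82154\tAA\n"
    "rs3094315\t1\t752566\tAA\n"
    "rs3131972\t1\t752721\tGG\n"
)

records = list(parse_raw_genotypes(io.StringIO(sample)))
counts = count_by_chromosome(records)
print(counts)
```

To run it over a real download, open the file and pass the file object to `parse_raw_genotypes` in place of the `io.StringIO` sample. The chromosome is kept as a string because the file also uses labels such as X and Y.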

Some people in the genetics trade (Razib Khan, for example) have in fact pledged to make their genomes publicly available on the Internet.

So: boring or useful?

The insurance company argument is that insurers would run their programs over your genome and raise your premiums (or deny you cover) for heritable conditions. This is meant to be illegal in some places, but that won't stop them.

The police/security argument is that these agencies would like nothing more than everyone's full genotype publicly available (on Facebook?) because (i) it makes DNA matching so much more powerful; and (ii) as 23andMe show, you can get a lot of phenotype even from today's restricted genotyping (i.e. 23andMe know a scary amount about me just from running their analysis).
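To give a feel for why a public genotype makes matching so easy, here is a deliberately naive Python sketch that compares two genotype tables at the SNPs they share and reports the fraction that agree. This is only an illustration: it is not how 23andMe or any police laboratory actually does matching, the `concordance` function is hypothetical, and the second person's genotypes are made up. It assumes missing calls are written with dashes and that a genotype such as AC means the same as CA.

```python
def normalise(genotype):
    """Treat a genotype as an unordered pair of alleles, so 'CA' == 'AC'."""
    return "".join(sorted(genotype))


def concordance(a, b):
    """Return (shared SNP count, fraction of shared SNPs with equal genotype).

    a and b map rsid -> genotype string. SNPs with a no-call (assumed to be
    marked with '-') in either table are left out of the comparison.
    """
    shared = [
        rsid for rsid in a
        if rsid in b and "-" not in a[rsid] and "-" not in b[rsid]
    ]
    if not shared:
        return 0, 0.0
    matches = sum(
        1 for rsid in shared if normalise(a[rsid]) == normalise(b[rsid])
    )
    return len(shared), matches / len(shared)


# Genotypes for person_a come from the excerpt above; person_b is invented.
person_a = {"rs4970383": "AC", "rs4475691": "CT", "rs7537756": "AG", "rs1110052": "GT"}
person_b = {"rs4970383": "CA", "rs4475691": "CC", "rs7537756": "AG", "rs9999999": "TT"}

shared_count, fraction = concordance(person_a, person_b)
print(shared_count, fraction)
```

With several hundred thousand SNPs per person rather than four, even a crude comparison like this separates an identical sample from a relative or a stranger very sharply.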

The personal identity argument is that, assuming a benign prenatal and childhood environment, most of the key facts about personal identity are gene-encoded (how could they not be - we were built by these things). So personality, intelligence, appearance, height and even many social attitudes are heavily influenced by the genome: see this surprising graphic from here where genetic contribution is to the right in blue.

Graphics like this are built from twin studies, not genomic analysis, but there is intense ongoing research into the specific gene variants (alleles) which drive such phenotypic characteristics. At some stage after the research is in, there will be tools which can take your genome or mine and read off this kind of rather personal information.

I guess those nice folk from the security and police services will be first in line, followed by employers and then potential life partners. Actually, the line could be in any order!

I am ignoring the really futuristic options, open perhaps to our great-great-grandchildren, of cloning their genome-publicising ancestor either in virtuality or in the flesh! (I wrote about this at science-fiction.com.)