Look into my eyes

What color are your eyes?

There are many ways to answer that question. Statistically speaking, they’re probably brown. Unless you’re from northern Europe, in which case they’re probably blue.

The apparent source of most light colored eyes on the planet.

But what about green eyes? Or hazel, amber, grey, or shades in between? If you’ve ever looked deeply into someone’s eyes, you know for a fact that calling eyes “brown” or “blue” is as reductionist as a government form.

Human diversity has a fantastic ability to defy standardized forms. You don’t often see a checkbox for heterochromia iridum.

Eye color can be described quantitatively. This study from December 2015 examined high resolution photos of the irises of 1465 people. They interpolated a single color value from a 256×256 pixel square of each iris, and plotted the results on a scatter plot in CIELAB color space with shape representing the participant’s country of origin.

Scatter plot of the average eye color of 1465 people. Triangles represent East Asian participants, squares are European, and circles are South Asian.
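
To make "a single color value in CIELAB" a little more concrete, here is a minimal sketch of how a patch of iris pixels could be reduced to one Lab coordinate. It is not the study's actual pipeline: the function names `srgb_to_lab` and `mean_lab` are made up for illustration, the input is assumed to be 8-bit sRGB, the white point is assumed to be D65, and a plain per-pixel mean stands in for whatever interpolation the authors used.

```python
from typing import Iterable, Tuple

# Assumption: D65 reference white, the usual pairing with sRGB.
WHITE = (0.95047, 1.0, 1.08883)


def srgb_to_lab(r: int, g: int, b: int) -> Tuple[float, float, float]:
    """Convert one 8-bit sRGB pixel to CIELAB (L*, a*, b*)."""

    def linearize(c: int) -> float:
        # Undo the sRGB gamma curve to get linear light in [0, 1].
        v = c / 255.0
        return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)

    # Linear sRGB to CIE XYZ (standard sRGB/D65 matrix).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    def f(t: float) -> float:
        # Cube root with the linear toe defined by the CIELAB standard.
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / WHITE[0]), f(y / WHITE[1]), f(z / WHITE[2])
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def mean_lab(pixels: Iterable[Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Average a patch of sRGB pixels in Lab space (a simplifying assumption)."""
    labs = [srgb_to_lab(*p) for p in pixels]
    n = len(labs)
    return tuple(sum(lab[i] for lab in labs) / n for i in range(3))


if __name__ == "__main__":
    # A tiny stand-in for a 256x256 iris patch: two brownish pixels.
    print(mean_lab([(139, 69, 19), (120, 60, 20)]))
```

In this space L* is lightness, a* runs green to red, and b* runs blue to yellow, so a brown iris lands at positive b* and a blue one at negative b*.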

The study shows that eye color is highly correlated with the participant’s origin (and, by extension, likely ancestry) and goes on to look at the genes responsible for color variation.

Can genetics accurately predict the color of an individual’s eyes?

Mendelian eye color is a recessive theory

Eye color was once believed to be due to simple Mendelian inheritance. Brown indicated a dominant trait, blue was recessive, and any other color was hand-wavingly explained as a mix between the two.

But it is possible (though not common) for blue-eyed parents to have a child with non-blue eyes. Simple Mendelian inheritance can’t explain this.

Genetic sequencing has shown that the real story is, as usual, a lot more complex.

Eye color (as well as hair and skin color) seems to be determined by the concentration and type of melanins present. The melanins responsible for eye color include two flavors of eumelanin (brown and black) and pheomelanin, which appears pink-to-red. The mix and concentration of these pigments determine your eye color. And your genes determine how much of each is likely to be produced by your body.

But which genes? And why?

While research is still ongoing, this area was heavily investigated in 2008 (mostly in European populations).

Here is a study that identifies several SNPs, which are also correlated with skin and hair color.

This paper demonstrates that two SNPs (rs12913832 and rs1129038) show a perfect association with blue eye color in a large Danish family. They go on to show that the region in which the SNPs occur is highly conserved (even in horses, cows, cats, dogs, rhesus monkeys, mice, and rats), possibly indicating a founder mutation. But these two SNPs alone can’t account for the wide spectrum of iris color variation.

Another study implicates two additional SNPs (rs916977 and rs1667394) as being essential to identifying eye color.

23andMe uses rs12913832 as the definitive SNP, giving relative percentages of the likelihood of brown, green, or blue eyes on this call alone.

I signed up for 23andMe a couple of years ago (before their trouble with the FDA, which has thankfully passed). I have those results, and I can correlate their call with my recent WGS results. Fortunately the calls agree in this case. (This isn’t always true, because of technical differences in how each analysis is performed.)

Here are my calls for several SNPs from the above studies. Depth refers to allelic depth. Humans are diploid, with two copies of each chromosome (allowing for variation in zero, one, or both copies). The first number indicates the number of reads supporting a call for the reference base, the other for the alternate (ROB) base. A zero in either place is a homozygous call; non-zero numbers in both places are heterozygous.

Chr  Position  rsID        Ref  Alt  Depth  Gene
15   28365618  rs12913832  A    G    0,34   HERC2
15   28356859  rs1129038   C    T    0,22   HERC2
15   28513364  rs916977    T    C    19,19  HERC2
15   28530182  rs1667394   C    T    0,30   HERC2
15   28230318  rs1800407   C    T    17,16  OCA2
14   92773663  rs12896399  G    T    15,13  SLC24A4
5    33951693  rs16891982  C    G    40,0   SLC45A2
11   89011046  rs1393350   G    G    33,0   TYR
6    396321    rs12203592  C    T    14,11  IRF4
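
The zero/non-zero rule of thumb for reading the depth column can be written down in a few lines. This is only a sketch of that rule as described here; the function name `call_from_depth` is hypothetical, and real variant callers weigh base qualities and genotype likelihoods rather than just checking for zeros.

```python
def call_from_depth(depth: str) -> str:
    """Classify a "ref,alt" allelic depth string using the zero/non-zero rule."""
    ref_reads, alt_reads = (int(n) for n in depth.split(","))
    if ref_reads == 0 and alt_reads == 0:
        return "no call"  # no reads support either allele
    if alt_reads == 0:
        return "homozygous reference"
    if ref_reads == 0:
        return "homozygous alternate"
    return "heterozygous"


if __name__ == "__main__":
    # Three of the depth values from the table above.
    depths = {"rs12913832": "0,34", "rs916977": "19,19", "rs16891982": "40,0"}
    for rsid, depth in depths.items():
        print(rsid, depth, call_from_depth(depth))
```

A noisy read or two would break this naive version (a depth of 1,33 would come out heterozygous), which is one reason production callers use a statistical model instead.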

The verdict: Almost certainly blue.

Like 8% of the rest of the world, my eyes are commonly described as “blue”. Also, slightly bloodshot due to overcaffeination.

The genetics of eye color are a well-traveled path of research, but there is clearly still a lot of work to be done. The European bias of current research probably helps in my case, since I happen to be of European descent. But you can expect an even more complex story to unfold as we study Africa, China, the Pacific Rim, South America, and the rest of the melanin-rich world.

An immense amount of work has been done to tie genetics to something as easily observable as eye color. Now try to imagine the effort necessary to understand the genetic basis of more complex conditions like autism, cancer, schizophrenia, Alzheimer’s, aging… Especially if we’re not even certain that the dominant factor is genetic.

Computational genomics is certainly going to help extend human life and cure genetic disease. But the problem is vast, and we’re in a race for our lives. It’s going to be a long and tough fight.

My genome: Let me show you it

tl;dr: download Rob’s source code

In October 2015 I signed up as a beta tester for Arivale, a Seattle-based “scientific wellness” company. The service is something like nutritional-coach-meets-quantified-self.

In their words:

Our systems approach gathers, connects, and analyzes your data to create a complete picture of you.

And that it does.

Once a month I have a chat with a nutritional coach about my current diet, life stresses, and exercise habits. Over the course of a year they take multiple blood samples and plot an extensive panel of blood chemistry trends over time. They collect multiple saliva samples, measuring cortisol at four points throughout the day. They perform a gut microbiome sequencing (gross, yet fascinating!) to measure the impact of diet on microbial population diversity. They supply a Fitbit to track steps, sleep, and heart rate. They take a DNA sample and run a SNP panel looking for several variations linked to nutrition and exercise.

And last (but certainly not least), they perform whole genome sequencing. This sets it solidly apart from services like 23andMe that can only detect specific SNPs. While the whole genome is specifically excluded from the coaching process, it is used (with consent) as a basis for further genomic study.

Most importantly: Arivale provides a copy of the data, including a VCF and the raw reads.

Your own genome on a hard drive. If you’re into computational genomics, this is the ultimate unboxing experience.

After anxiously waiting for several months, I finally received an encrypted hard drive containing a VCF file and an aligned BAM file. Tech specs for the reads:

  • Ran on an Illumina HiSeq X Ten
  • 106 GB of compressed BAM data
  • 150 bp paired reads
  • Just under 600 million reads total
  • 30x average coverage
  • Uses hs37d5 for a reference
  • FastQC indicates that the read quality is quite good:
These reads look nice and clean, all the way to the end. Excellent.
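
As a sanity check, the read count and read length above are roughly consistent with the stated coverage. The sketch below is back-of-the-envelope arithmetic only: the genome size of 3.1 billion bases is my own round-number assumption, 600 million is rounded up from "just under 600 million", and the function name `estimate_coverage` is made up for illustration.

```python
def estimate_coverage(reads: float, read_length: int, genome_size: float) -> float:
    """Average depth = total sequenced bases / genome length."""
    return reads * read_length / genome_size


if __name__ == "__main__":
    reads = 600_000_000   # "just under 600 million reads", rounded up
    read_length = 150     # bp per read
    genome_size = 3.1e9   # assumption: approximate length of the reference
    print(f"{estimate_coverage(reads, read_length, genome_size):.1f}x")
```

That works out to a little under 30x, in line with the quoted average, before accounting for duplicates or unmapped reads.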

The VCF calls about 4.5 million variants, including standard rs IDs. The longest called deletion is 231 bases, and the longest insertion is 524 bases.
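
Numbers like these can be pulled out of a VCF with very little code. The following is a minimal sketch, not the method I actually used: `indel_extremes` is a hypothetical helper, it reads only the REF and ALT columns, it skips symbolic alleles such as `<DEL>`, and it measures indel length as the simple difference in allele lengths.

```python
from typing import Iterable, Tuple


def indel_extremes(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (record count, longest deletion, longest insertion) for VCF text lines."""
    records = longest_del = longest_ins = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the column header
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        records += 1
        for alt in alts:
            if alt.startswith("<") or alt in ("*", "."):
                continue  # symbolic or missing allele: no literal sequence to measure
            diff = len(alt) - len(ref)
            if diff > 0:
                longest_ins = max(longest_ins, diff)
            elif diff < 0:
                longest_del = max(longest_del, -diff)
    return records, longest_del, longest_ins


if __name__ == "__main__":
    # A made-up three-record sample; the first record mirrors a SNP from this post.
    sample = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "15\t28365618\trs12913832\tA\tG\t.\tPASS\t.",
        "1\t100\t.\tACGT\tA\t.\tPASS\t.",
        "1\t200\t.\tA\tACC,AC\t.\tPASS\t.",
    ]
    print(indel_extremes(sample))
```

For a real, compressed file the same function can be fed from `gzip.open(path, "rt")`, where `path` is wherever your own VCF lives.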

But what does it all mean?

That, my friends, is an ongoing and evolving field of study.

The human genome itself was first sequenced in 2003 (coincidentally, just after I moved to Seattle).  But 13 years later, we do not yet have a simple database where you can look up “what a gene does” or “what a genetic variation means”.

The current state of the art includes databases like dbSNP, dbVar, and ClinVar that attempt to tie genetic samples together with studies of specific phenotypes and conditions. It’s new science, and still tough going.

It’s not clear that we will ever have a database that tells us “what this gene does”, because life is clearly much more complex than that. DNA is Layer 1 of the stack that runs this program called life. Epigenetics and microbiota and environment and poor life choices clearly have a significant impact on the health of any given organism.

And yet, DNA provides the ground rules of what any organism could aspire to. Cats beget cats. Plants beget plants. Bacteria beget bacteria. People beget people (who host a colony of bacteria at least as big as they are).

Your DNA is not your body, but it does set the parameters for what can be made with locally available materials.

As a hacker, I’d like to help document and debug Layer 1. Now that I have a copy of my source code, I intend to share the code review process with you.

Responsible disclosure

There are already many online sources of human genetic data available for analysis (see 1000 Genomes, the NCBI Sequence Read Archive, the European Nucleotide Archive, etc.). Researchers benefit from large and factually complete databases that make it possible to perform genome-wide association studies, linking genetic traits to phenotype and disease risk in a way that would not otherwise be possible.

But our genetic data tells possibly the most intimate story about ourselves, including our ancestral background, inherited disease risks, and direct family relations. Data mining can turn up many unexpected patterns. Some happy, some not so happy.

For that reason, public genetic databases take personal privacy (and HIPAA compliance) seriously. And I’m sure they don’t want to be sued.

Ideally I’d like my genetic data to be studied as widely and thoroughly as possible. To alleviate all possible privacy concerns, I hereby release my own genome under Creative Commons CC-BY-SA. You may reuse or remix my genetic data on a non-commercial basis any way you like. Please share your findings!

And I’d appreciate an introduction to any evil clones you might produce. Just don’t forget to credit the original author. (Spoiler alert: they’re all evil.)

My data is up on the SRA with ID SRR3990320. It’s also referenced by BioProject PRJNA335906.  To download the data, it’s best to use sratools or ascp; a slow and often unreliable ftp link should also be up shortly. The BAM is 106 GB.

While it’s apparent that people of European descent are already overrepresented in modern genomics, nobody else’s genome is mine to give. I expect this gap to close sharply as the cost of sequencing continues to plummet and it becomes a standard test covered by insurance. In the meantime, I hope one more white dude’s data is useful to somebody.

Curious about how your genes determine your eye color?  Look Into My Eyes.