January 27, 2013
Using only publicly available online information, a team of researchers from the Whitehead Institute showed that it is possible to discern the identities of people who have submitted personal genetic information to scientific studies, even when that information is supposedly stored anonymously. The findings were published in the journal Science.
“This is an important result that points out the potential for breaches of privacy in genomics studies,” Whitehead Fellow Yaniv Erlich said.
The study was performed in the spirit of “vulnerability research” studies, in which experts assess the security of data by attempting to break through the measures put in place to protect it. In the current study, it was not even necessary for the researchers to hack into any servers or crack any codes: all the information was freely available for the taking.
The researchers began by analyzing the Y-chromosomes of people who had participated in the 1,000 Genomes Project at the Center for the Study of Human Polymorphism, because the entire genomes of all participants had been sequenced and made publicly available. Taking advantage of the fact that the Y-chromosome is inherited only from the father, the researchers analyzed the men’s Y-chromosomes for unique genetic markers known as short tandem repeats (Y-STRs).
The researchers then referred to public databases that record known links between Y-STR data and various surnames – linkages that are possible because, like Y-chromosomes, surnames tend to be inherited from fathers. With this data, the researchers were able to infer the surnames of a large number of participants in the 1,000 Genomes Project. They then used other public information sources, including genealogical websites, Internet search engines, obituaries, and data from the National Institute of General Medical Sciences (NIGMS) Human Genetic Cell Repository, to identify almost 50 male and female 1,000 Genomes Project participants by name.
Out of respect for the privacy of the individuals involved, the researchers did not release the names that they had uncovered. They did, however, notify both NIGMS and the National Human Genome Research Institute (NHGRI) of the security flaw prior to the study’s publication. Both organizations said they had responded by removing certain demographic data from the public databases.
Relatives also at risk
Notably, the method of analysis used in the study could also have allowed the researchers to identify the participants’ relatives, both living and deceased, even if those people had never themselves submitted a genetic sample to anybody.
“We show that if, for example, your Uncle Dave submitted his DNA to a genetic genealogy database, you could be identified,” first author Melissa Gymrek said. “In fact, even your fourth cousin Patrick, whom you’ve never met, could identify you if his DNA is in the database, as long as he is paternally related to you.”
“Yaniv’s work is a timely reminder that in this era in which massive amounts of genomic data are being generated rapidly and shared in the interest of scientific advancement, there is an increasing likelihood of privacy breaches,” Whitehead Institute Director David Page said.