Are Easy Access and Security in Genome Analysis Compatible? (Yes!)

Human genome data can be analyzed to derive immense healthcare benefits, and equally misused with disastrous consequences for victims of its theft. While we want to analyze large quantities of genetic data to maximize our knowledge, we need to keep this data completely secure. A new framework for genomic data analysis using brand-new mathematical techniques for data encryption gives both freedom and security.

Every Body is Different — Medically Speaking

New discoveries are rapidly being made linking genetic variants with drug interactions and with disease risks in healthcare. These discoveries drive the field of precision medicine, which means managing or treating patients based on their metabolism and risk factors rather than applying uniform solutions to all patients.

We know that certain drugs that successfully treat illness in some people are ineffective for other people. The root cause of these variations in effectiveness may lie in genetic differences from person to person. Current findings in pharmacogenomics, the study of how variants in genes determine drug metabolism, tell us that about half of all primary care patients are exposed to drugs whose metabolism is affected by genes. Studies have found that 18% of the 4 billion prescriptions written in the US per year are affected by such genetic variations.

Sequence data are also used to predict disease risk. For instance, the American College of Medical Genetics and Genomics (ACMG) recommends reporting secondary findings in 56 genes. As many as 7% of patients harbor one of nearly 19,000 pathogenic or likely-pathogenic variants in these 56 ACMG genes. Early detection of the conditions caused by these genes will allow clinicians to act on them and reduce the risk to the patient.

The cost of whole genome sequencing has decreased dramatically in recent years and is expected to continue to decrease. Currently, next-generation sequencing of a whole genome costs a fraction of what an MRI does, which makes it practical for routine clinical use. Consequently, the use of sequence data in clinical practice is increasing rapidly and may be routine in just a few years. Genome data are also increasingly used in eligibility criteria for clinical trials. Currently, there are over 34,000 active clinical trials in the United States. Algorithms and software tools are being developed to automatically match patients to trials based on genomic data and so help increase enrollment.

The Criminal Imagination and the Theft of Genomic Information

As shown above, the benefits of genomic data analysis in medical care are immense. But preserving privacy must go hand in hand with using sequence data. Human genomic data are highly sensitive due to their uniqueness and predictive value. Any leak is irrevocable because, unlike credit card or social security numbers, these data are permanently attached to the individual. Once the information is revealed, it cannot be controlled and can be misused. A person whose genetic information becomes publicly available can be denied insurance, employment or loans if the determining authority decides the application is risky, regardless of the merit of the assessment. Paternity, hereditary illnesses and other conditions can be made public, which can cause social embarrassment or attract superstition. Because relatives share DNA, discrimination can affect them as well. The victim of stolen genetic data can also be extorted. It has even been suggested that DNA can be synthesized and planted to frame someone for a crime.

Maintaining security is often at the discretion of a few scientists, and a single individual can be responsible for a breakdown of privacy. Even de-identification of data may not suffice. Merely 75 single-nucleotide polymorphisms (SNPs) out of billions are sufficient to uniquely re-identify an individual, and a few dozen database queries can determine whether a victim is a member of the database.

A challenge in securing genome data is its volume. A whole genome sequence may occupy gigabytes of storage. Most healthcare institutions cannot store, manage and encrypt data of this size for tens of thousands of their patients. A patient’s data also need to be shared with other healthcare institutions where the patient may be receiving care. A cloud server can store and manage the data, but the involvement of a third party increases the possibility of data breaches. Millions of patient and healthcare records are breached every year. DNA-testing services have inadvertently exposed patient data through insufficient software protection, and some have purposefully sold customer data.

Old Methods for a New Frontier

Until recently, the ease of access to genomic data for analysis was at odds with data security. Common encryption methods provide encryption at rest: they keep the encrypted data (“ciphertext”) secure only as long as it sits in storage and is not being analyzed. But analyzing the data, such as checking whether a patient has a variant in a gene, requires the data to be decrypted. Once the data are decrypted for computation, the “plaintext” data are vulnerable again.

In the past few years, privacy and cryptographic techniques for secure computation have been extensively studied. Multi-party computation (MPC) is considered a promising method of secure computation. In this approach, multiple parties maintain local data and communicate only intermediate results. However, MPC can be inefficient, and its guarantees can fail when the computing parties collude, so it is poorly suited to long-term storage and outsourced computation.
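To make the MPC idea concrete, here is a minimal sketch of one common building block, additive secret sharing, used to compute a joint sum. It is an illustration only, not a protocol taken from any particular MPC system; the function names, the modulus, and the carrier counts are all assumptions made for this example.

```python
import secrets
from typing import List

# Assumed modulus for the example; shares are integers modulo this value.
MODULUS = 2**61 - 1


def make_shares(value: int, n_parties: int) -> List[int]:
    """Split a value into n random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(local_values: List[int]) -> int:
    """Simulate all parties in one process: each deals shares, then partial sums are combined."""
    n = len(local_values)
    # dealt[i][j] is the share that party i sends to party j.
    dealt = [make_shares(v, n) for v in local_values]
    # Each party j publishes only the sum of the shares it received.
    partials = [sum(dealt[i][j] for i in range(n)) % MODULUS for j in range(n)]
    return sum(partials) % MODULUS


# Hypothetical local data: number of variant carriers at three institutions.
carrier_counts = [12, 7, 30]
total = secure_sum(carrier_counts)
```

No single party sees another party's count, only random-looking shares. If all the other parties pool the shares they hold, however, they can reconstruct the remaining party's value, which is the collusion weakness mentioned above.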

Computation on Ciphertext

Homomorphic encryption (HE) methods encrypt the data in a way that allows mathematical operations to be done directly on ciphertext. The results, when decrypted, are the same as the results of the operations done on plaintext. This allows us not only to store data on a cloud server, but also to outsource computations on it without compromising privacy. Since the cloud server never needs the plaintext or the secret key, the data stay encrypted throughout storage and computation. The idea of HE dates to the 1970s, but early schemes supported only one kind of operation, such as addition alone or multiplication alone. In 2009, Craig Gentry described the first fully homomorphic scheme, which supports both addition and multiplication on ciphertext. Security rests on cryptographic hardness assumptions that are believed to resist attacks by quantum computers.
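The property of computing on ciphertext can be seen in a very small example. The sketch below uses unpadded "textbook" RSA, a 1970s scheme that happens to be homomorphic for multiplication only. It is not the scheme used in the work described in this article, and the tiny primes make it completely insecure; the values are chosen purely for illustration.

```python
# Toy textbook RSA with tiny primes. Insecure; for illustration only.
p, q = 61, 53
n = p * q                    # public modulus, 3233
phi = (p - 1) * (q - 1)      # 3120
e = 17                       # public exponent
d = pow(e, -1, phi)          # private exponent (modular inverse, Python 3.8+)


def encrypt(m: int) -> int:
    """Encrypt an integer 0 <= m < n with the public key."""
    return pow(m, e, n)


def decrypt(c: int) -> int:
    """Decrypt a ciphertext with the private key."""
    return pow(c, d, n)


a, b = 7, 12

# The multiplication happens on ciphertexts only; no plaintext is involved.
c_product = (encrypt(a) * encrypt(b)) % n

# Decrypting the product of ciphertexts yields the product of the plaintexts
# (as long as a * b is smaller than n).
assert decrypt(c_product) == a * b
```

Schemes that support both addition and multiplication, and that work with real numbers, are considerably more involved, but the principle is the same: the party doing the arithmetic never holds the private key.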

Until recently, HE was considered too slow to be useful in commercial systems. However, researchers at the University of Texas Health Center (UTH), Miran Kim and Xiaoqian Jiang, have made HE with real numbers more efficient. They have also developed a new algorithm to efficiently multiply ciphertext matrices, which has sped up computations enough to be practically useful.

What remains for genomics-based decision support is this: how do we pose our questions of sequencing data in its ciphertext space? The data and questions must have a mathematical representation to allow mathematical operations on ciphertext. If we can encrypt the questions as well as the data they are applied to, then the cloud server will be unable to interpret the data, the results, or even the question.

This idea is at the center of Elimu’s NIH-sponsored project to provide clinical decision support with genomic data, in partnership with Drs. Kim and Jiang at UTH. We are developing a framework to represent variant data and genomics questions. In this framework, gene variants form a basis set of vectors, and a question is a linear transformation applied to those vectors, so each question is represented by a matrix. The patient vectors and question matrices are all homomorphically encrypted. The matrix multiplication of the HE data is efficiently carried out using UTH’s algorithms.
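The vector-and-matrix representation can be sketched in plaintext, before any encryption is involved. The genotype labels, the question matrix, and its two answer categories below are invented for illustration and do not correspond to any real gene or to the project's actual encoding.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product: rows of a times columns of b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical basis: one coordinate per genotype of a single made-up gene.
GENOTYPES = ["A/A", "A/B", "B/B"]


def patient_vector(genotype: str) -> Matrix:
    """One-hot column vector (3 x 1) marking the patient's genotype."""
    return [[1 if g == genotype else 0] for g in GENOTYPES]


# Hypothetical question matrix mapping genotypes to answer categories.
# Row 0 = "normal response", row 1 = "reduced response".
QUESTION: Matrix = [
    [1, 0, 0],   # A/A maps to normal
    [0, 1, 1],   # A/B and B/B map to reduced
]

# Asking the question is a matrix-vector product.
answer = matmul(QUESTION, patient_vector("A/B"))
```

Here `answer` is a 2 x 1 column whose nonzero entry marks the category the patient falls into. In the encrypted setting the same product would be carried out on ciphertext, so the server would learn neither the genotype nor the category.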

In this project we endeavor to answer genomics questions for clinical trials matching, pharmacogenomics and gene reanalysis. Currently, we can answer questions related to alleles and genotypes, and questions with mappings to alleles or genotypes. For instance, we can answer “is patient X a good metabolizer of clopidogrel?” We can efficiently answer questions for populations of patients by horizontally concatenating patient vectors. We can also answer probabilistic questions when genotypes are ambiguous.
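A plaintext sketch of the population and probabilistic cases follows. It assumes a hypothetical three-genotype encoding and an invented question matrix; an ambiguous genotype call is modeled simply as a vector of probabilities, which is one plausible reading of the approach rather than a description of the project's implementation.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product: rows of a times columns of b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hconcat(columns: List[Matrix]) -> Matrix:
    """Place n x 1 column vectors side by side to form an n x k matrix."""
    return [[col[i][0] for col in columns] for i in range(len(columns[0]))]


# Coordinates follow a hypothetical genotype order: A/A, A/B, B/B.
patient_1 = [[1.0], [0.0], [0.0]]   # certainly A/A
patient_2 = [[0.0], [0.0], [1.0]]   # certainly B/B
patient_3 = [[0.5], [0.5], [0.0]]   # ambiguous call: A/A or A/B, equally likely

# One column per patient.
population = hconcat([patient_1, patient_2, patient_3])

# Hypothetical question: row 0 = "normal response", row 1 = "reduced response".
QUESTION: Matrix = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
]

# A single product answers the question for every patient at once.
answers = matmul(QUESTION, population)
```

Each column of `answers` belongs to one patient. For the third patient both entries are 0.5, meaning the two categories are equally likely given the ambiguous call.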

We have developed a client-server model. A genome sequencing laboratory sends patient sequence files in variant call format (VCF) to the client, such as a healthcare institution. The client obtains a secret key, homomorphically encrypts the data and sends it to the cloud server for storage. Subsequently, a clinician can pose questions using the client. The client generates a question matrix, encrypts it, and sends it to the server. The server does the computation and sends the ciphertext result back to the client, which decrypts it for the clinician using the secret key.
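The message flow can be sketched end to end with a toy additively homomorphic scheme (Paillier) standing in for the real encryption. This is a simplification under stated assumptions: the primes are tiny and insecure, the class and method names are invented, and the question matrix is left in plaintext on the server because Paillier cannot multiply two ciphertexts. The framework described above encrypts the question as well, which requires a scheme that supports ciphertext-by-ciphertext multiplication.

```python
import math
import secrets
from typing import Dict, List


class ToyPaillier:
    """Toy Paillier keypair held by the client. Tiny primes; insecure."""

    def __init__(self, p: int = 1009, q: int = 1013) -> None:
        self.n = p * q
        self.n_sq = self.n * self.n
        self._lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # secret
        self._mu = pow(self._lam, -1, self.n)                    # secret

    def encrypt(self, m: int) -> int:
        while True:
            r = secrets.randbelow(self.n - 1) + 1
            if math.gcd(r, self.n) == 1:
                break
        # With generator g = n + 1, g**m mod n**2 equals 1 + m*n.
        return (1 + m * self.n) * pow(r, self.n, self.n_sq) % self.n_sq

    def decrypt(self, c: int) -> int:
        u = pow(c, self._lam, self.n_sq)
        return (u - 1) // self.n * self._mu % self.n


class Server:
    """Stores ciphertext vectors; holds no key and sees no plaintext genotype."""

    def __init__(self, n_sq: int) -> None:
        self.n_sq = n_sq
        self.records: Dict[str, List[int]] = {}

    def store(self, patient_id: str, enc_vector: List[int]) -> None:
        self.records[patient_id] = enc_vector

    def answer(self, patient_id: str, question: List[List[int]]) -> List[int]:
        """Return one ciphertext per question row, encrypting row . patient."""
        enc = self.records[patient_id]
        results = []
        for row in question:
            acc = 1  # a valid encryption of 0
            for c, w in zip(enc, row):
                # Multiplying ciphertexts adds plaintexts; raising a ciphertext
                # to the power w multiplies its plaintext by w.
                acc = acc * pow(c, w, self.n_sq) % self.n_sq
            results.append(acc)
        return results


# Client side: encrypt a hypothetical one-hot genotype vector and upload it.
key = ToyPaillier()
server = Server(key.n_sq)
patient = [0, 1, 0]
server.store("patient-x", [key.encrypt(v) for v in patient])

# Later: pose a hypothetical question and decrypt the server's reply.
question = [[1, 0, 0], [0, 1, 1]]
enc_result = server.answer("patient-x", question)
result = [key.decrypt(c) for c in enc_result]
```

After decryption `result` holds one number per question row, here the indicator of the category the patient belongs to. Only the client, which holds the secret key, can read it.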

As HE continues to become more efficient, and as more mathematical operations are supported, it will become possible to apply machine learning algorithms to the encrypted data. This will give researchers access to large, secure databases of genomic data that previously could not be shared because of the risk of data exposure. The availability of large datasets will in turn help immensely to advance precision medicine.

Footnote: This research is being funded by Grant 1R41HG010978-01 from the National Human Genome Research Institute of the National Institutes of Health.

Curious how secure genomic analysis could fit into your precision-medicine workflow? Let’s start a conversation.