Overview

Damson is an A*STAR-funded project that seeks to eliminate significant privacy risks that currently impede researchers’ ability to analyze biomedical data about individuals. Immense amounts of valuable data now exist that are unusable by the research community due to the lack of an effective method for concealing individuals’ identities. The new ADSC work generates new publication schemes for the results of data analyses, thus making detailed summaries of health data available that can offer unprecedented insight into a vast range of medical conditions and provide useful input for urban planners, public health officials, and researchers.

The widespread availability of biomedical data, ranging from reports of the locations of new cases of dengue fever to individuals’ genomic variations, appears to offer researchers a tremendous opportunity. Statistical analysis of such data can help researchers and public health officials better understand a disease and its transmission patterns, gain new insights into the human body, and develop new treatments and services that can improve the quality of life of millions of people.

Unfortunately, privacy concerns make it infeasible to provide researchers with unlimited access to biomedical information. Previous attempts to solve this problem have tried to anonymize data by removing personally identifiable information from medical records, but this does not provide sufficient protection. The main problem is that external knowledge can be used to re-identify individuals whose data appear in supposedly anonymized data sets. Many ideas for mitigating the problem have been proposed, but all of them make the unrealistic assumption that adversaries have limited prior knowledge.

“In fact, this has been shown to be a fundamental barrier,” explains Winslett. “An anonymized database will either reveal private information, given certain external knowledge — or will be useless for answering some questions.”

To the extent that databases of patient information have already been made available, they have made many lifesaving discoveries possible. For example, a University of San Antonio study involving data collected from over 9,000 breast cancer patients showed that amplification of the HER-2 oncogene was a significant predictor of both overall survival and time to relapse in patients with breast cancer. This information subsequently led to the development of Herceptin (trastuzumab), a targeted therapy that is effective for many women with HER-2-positive breast cancer. Likewise, it was medical records research that led to the discovery that supplementing folic acid during pregnancy can prevent neural tube birth defects (NTDs), and population-based surveillance systems later showed that the number of NTDs decreased 31 percent after mandatory fortification of cereal grain food products. No one doubts that additional valuable findings would follow if a way to tackle the privacy limitations could be found, so that far more patient data could be made available to researchers.

To that end, medical studies funded by the National Institutes of Health (NIH) in the U.S. are required to make the data they collect, as well as summaries of analysis results, available to other researchers. Originally, the statistical summaries were freely available to other researchers via NIH’s dbGaP database (http://www.ncbi.nlm.nih.gov/gap), while access to the detailed patient records required researchers to undergo a rigorous and fairly arduous approval process with their Institutional Review Boards (IRBs). Privacy concerns subsequently led NIH to restrict dbGaP access, so that today many of the statistical summaries cannot be viewed without IRB approval. The need for IRB approval is a significant hurdle for researchers who want to access the summary statistics from old studies to help them plan their future work.

To find a practical solution, the ADSC team is using the recently developed concept of “differential privacy.” Differential privacy works by adding a small, carefully calibrated amount of random noise to the results of statistical analyses of sensitive data sets. Under differential privacy, the contribution of any one individual’s data towards the outcome of an analysis is negligible; analysis results are essentially identical regardless of whether a particular person’s data are included. This should not limit the usefulness of the results, since in a large and well-designed medical study, the history of a single individual should not have a significant impact on overall results. When analysis of a data set begins, its owners decide on a total “privacy budget” for the entire data set. Each published analysis result uses up a portion of the privacy budget, and once the budget has been exhausted, no more results can be published, because further releases could open the possibility of at least one individual’s data having a non-negligible impact on overall results.
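The noise-and-budget idea described above can be illustrated with a short sketch. It is not the Damson project's method. It assumes one standard technique from the differential privacy literature, the Laplace mechanism applied to a counting query, and it assumes the privacy budget is measured by a parameter conventionally called epsilon, where the epsilons of successive releases add up. The names PrivacyBudget, laplace_noise, and private_count are hypothetical and chosen only for illustration.

```python
import random


class PrivacyBudget:
    """Tracks how much of a total epsilon budget has been spent (hypothetical helper)."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        # Refuse any release that would push spending past the agreed total.
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


def laplace_noise(scale, rng):
    # The difference of two independent exponential samples with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, budget, rng):
    """Release a noisy count of the records that match `predicate`."""
    budget.charge(epsilon)  # pay for this release before computing anything
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one person changes a count by at most 1
    # (sensitivity 1), so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    budget = PrivacyBudget(total_epsilon=1.0)
    records = [{"age": age} for age in range(20, 80)]  # made-up toy data
    noisy = private_count(records, lambda r: r["age"] >= 50, 0.5, budget, rng)
    print("noisy count:", noisy, "budget spent:", budget.spent)
```

In this sketch a smaller epsilon means more noise per answer but a slower drain on the budget, which is the trade-off the data owners weigh when they set the total. Once the spent amount reaches the total, further calls raise an error instead of publishing another result.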

“Differential privacy offers us the tantalizing possibility of being able to do privacy-preserving data analysis that is both useful and secure,” says Winslett. “It’s such a new concept, but the implications are immense. Whoever comes up with a practical approach to differentially private access to biomedical data — which is what we aim to develop with this new project — will set off a free-for-all. It will open up so many new opportunities to revolutionize treatments and reduce health care costs.”

The project is investigating ways to re-enable open access to the summary information in dbGaP by making the summary tables differentially private. The researchers also target other custodians and users of health-related statistics in Singapore. That work is projected to include applications in pharmacoeconomics and in analysis of hospital records to reveal the effectiveness of different treatments for a disease. Some of the research results are summarized in the Accomplishments section.

Winslett is quick to point out that several fundamental research challenges remain before differentially private analysis becomes practical, but she is optimistic that ADSC has advantages that make it an ideal location for this research. In particular, Singapore is unique in its close cooperation among the government, the medical fraternity, and research institutes. This gives the ADSC researchers exceptionally good access to the parties who have a vested interest in broader dissemination of health data summaries. This concerted effort to bring together medical researchers, computer scientists, and medical records could one day enable Singapore to be a world leader in technologies for analyzing sensitive data.