There are often compelling reasons to share sensitive data online. For example, repositories of anonymized medical data have been published to aid epidemiological research. While the release of such data can benefit society, it is critical that appropriate privacy protections are in place. Unfortunately, they often are not.
This project seeks to demonstrate an attack on the privacy protections of a real-world online data repository. The goal is to design the attack, execute it, and evaluate its effectiveness. If the attack proves effective, the findings of the project will be shared with the custodians of the online repository.
Attack demonstrations are an important part of research in data privacy. They highlight risks and motivate research into remedies.
The project aims to explore the following questions:
- How vulnerable is this repository to attack?
- What are the performance characteristics of the attack algorithm? For example, what computational resources does it require?
- How does this attack problem differ from other attacks explored in the privacy literature?
Components of the project include:
- Writing programs to scrape, clean, and organize data from the repository
- Designing and implementing attack algorithms
- Surveying literature on attacks
- Investigating attack remedies (such as differential privacy)
- Writing a report, which may be shared with the data custodians
More information about attacks can be found in this survey article: Dwork, Smith, Steinke, and Ullman, "Exposed! A Survey of Attacks on Private Data," Annual Review of Statistics and Its Application, 2017.
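To give a flavor of the kind of attack and remedy involved, here is a toy sketch of a differencing attack on exact counting queries, followed by the same queries answered with Laplace noise, a standard mechanism for differential privacy. It is not the attack planned for this project, which targets a specific repository. The records, the helper names (`count_query`, `laplace_noise`, `noisy_count`), and the choice of epsilon are all hypothetical and chosen purely for illustration.

```python
import random


def count_query(records, predicate):
    """Return the exact number of records matching the predicate."""
    return sum(1 for r in records if predicate(r))


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng):
    """Counting query plus Laplace noise; a count has sensitivity 1."""
    return count_query(records, predicate) + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy records: (name, age, has_condition).
records = [
    ("alice", 34, True),
    ("bob", 51, False),
    ("carol", 51, True),
    ("dave", 29, False),
]


def has_condition(r):
    return r[2]


def has_condition_not_carol(r):
    return r[2] and r[0] != "carol"


# Differencing attack: two aggregate queries that differ by one person
# reveal that person's sensitive attribute exactly.
everyone = count_query(records, has_condition)
everyone_but_carol = count_query(records, has_condition_not_carol)
carol_has_condition = (everyone - everyone_but_carol) == 1

# With noisy answers, the difference no longer pins down Carol's value.
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
epsilon = 0.5           # illustrative privacy parameter (assumption)
noisy_diff = (noisy_count(records, has_condition, epsilon, rng)
              - noisy_count(records, has_condition_not_carol, epsilon, rng))
```

In the exact case the difference of the two counts is 1, so the attacker learns Carol's attribute even though only aggregates were released. In the noisy case the difference is a real number perturbed by noise of scale 1/epsilon per query, which is the basic trade-off between accuracy and privacy that remedies such as differential privacy try to manage.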
Different components of the project demand different skills.
For the programming components:
- Experience programming in Python is desired.
- Experience with web scraping, regular expressions, data munging, and databases is beneficial, but students who lack this experience and are eager to learn will also be considered.
For the algorithm design and literature survey components:
- Strong mathematical aptitude is desired.
- Knowledge of linear algebra, probability, and statistics is beneficial but not required.
Of course, the ideal student will possess the skills to contribute to all components of the project.