Twinify solves data privacy issues

This story is part of the FCAI success stories video series, which explains why fundamental research in AI is needed and how research results create solutions to the needs of people, society and companies.

Researchers at FCAI have developed a machine learning-based method that can produce research data synthetically. The application is based on a method that allows academics and companies to share data with each other without compromising the privacy of the individuals involved in the study.

Data-driven technologies are revolutionizing many industries. However, in many areas of research – including health and drug development – there is too little data available due to its sensitive nature and the strict protection of individuals.

“When a person gets sick, of course, they want to get the best possible care. Then it would be important to have the best possible methods of personalized healthcare available”, says Samuel Kaski, Academy Professor and the Director of the Finnish Center for Artificial Intelligence FCAI.

However, developing such methods of personalized healthcare requires a lot of data, which is difficult to obtain because of ethical and privacy issues surrounding the large-scale gathering of personal data. “For example, I myself would not like to give insurance companies my own genomic information, unless I can decide very precisely what the insurance company will do with the information,” says Professor Kaski.

Many industries want to protect their own data so that they do not reveal trade secrets and inventions to their competitors. This is especially true in drug development, which requires big investments with high financial risk, and as a result, development of new drugs has stalled. If pharmaceutical companies could share their data with other companies and researchers without disclosing their own inventions, everyone would benefit.

The ability to produce data synthetically solves these problems. FCAI researchers found that statistical conclusions drawn from synthetic data can be as reliable as those drawn from the original data.

“The strong privacy guarantee provided by differential privacy allows conducting an unlimited number of future analyses on the synthetic data without further privacy concerns, which was not possible with previous approaches”, says Joonas Jälkö, a doctoral student in Professor Kaski’s group.

The application works like this: the researcher enters the original dataset into the application, which builds a synthetic dataset from it. The researcher can then share the synthetic dataset with other researchers and companies in a secure way.
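The workflow above (original data in, privacy-protected synthetic data out) can be illustrated with a deliberately simple sketch. This is not Twinify's actual method or API: it swaps in a much cruder technique, a histogram with Laplace noise, for a single categorical column. The function name `dp_synthetic_sample`, the blood-type example data and the choice of `epsilon` are all hypothetical, and the list of categories is assumed to be fixed in advance rather than read from the data.

```python
import random


def dp_synthetic_sample(values, categories, epsilon, n_synthetic, rng):
    """Toy sketch: Laplace-noised histogram, then sampling from it.

    Hypothetical helper for illustration only; not part of Twinify.
    """
    # Count how often each category occurs in the original data.
    counts = {c: 0 for c in categories}
    for v in values:
        counts[v] += 1

    # Adding or removing one record changes exactly one count by 1,
    # so Laplace noise with scale 1/epsilon makes the counts epsilon-DP.
    scale = 1.0 / epsilon
    noisy = {}
    for c in categories:
        # The difference of two exponential draws is Laplace distributed.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        # Clamping at zero is post-processing and does not weaken privacy.
        noisy[c] = max(counts[c] + noise, 0.0)

    # Fall back to uniform weights if every noisy count was clamped to zero.
    weights = [noisy[c] for c in categories]
    if sum(weights) == 0:
        weights = [1.0] * len(categories)

    # Synthetic records are drawn only from the noisy counts,
    # never directly from the original records.
    return rng.choices(categories, weights=weights, k=n_synthetic)


rng = random.Random(0)
original = ["A"] * 60 + ["B"] * 30 + ["O"] * 10
synthetic = dp_synthetic_sample(
    original, ["A", "B", "AB", "O"], epsilon=1.0, n_synthetic=100, rng=rng
)
print(len(synthetic), "synthetic records generated")
```

Because the synthetic records depend on the original data only through the noisy counts, any number of later analyses on them costs no additional privacy, which is the property described in the quote above. A smaller `epsilon` gives stronger privacy but noisier synthetic data.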

Researchers are continuing to improve the application to make it easier to use and to add further functionality.

“We released the application early, to contribute to solving data scarcity during the pandemic. But the method is widely usable for other types of data as well. We welcome others to join in developing it further!” says Kaski.

Twinify your dataset and share the synthetic twin without sacrificing privacy!