Labelia (ex Substra Foundation)


How to enhance privacy in Data Science projects?

This article was written by the Substra Foundation team.

In this blog article we present the most important Privacy-Enhancing Techniques (PETs) that are currently being developed and used by various tech actors. We briefly explain their principles and discuss their potential complementarities with the Substra Framework. The aim of this article is to present potential Substra Framework developments and possible integrations with other technologies. For a more complete discussion of each PET, an excellent open-access review produced by the Royal Society is available.


Anonymization

Brief description

Anonymization is the most widely used and mature PET. It consists of deleting information, by removing fields, approximating or aggregating values, in order to make it more difficult to trace values back to individuals. A perfectly anonymized dataset can be securely released to any scientific partner, and is out of the scope of the GDPR. Although perfect anonymization sometimes seems like the ideal solution, it has often been broken by clever re-identification techniques (cf. Netflix). Moreover, anonymization is often unworkable when high-dimensional tabular data are considered (patient follow-up, mobile phone metadata, etc.). In that case perfect anonymization requires the deletion of too much information, making the resulting datasets useless.
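To make this concrete, here is a minimal sketch (using pandas, with entirely made-up column names and values) of the typical transformations involved: dropping direct identifiers, pseudonymizing an ID with a keyed hash, and generalizing quasi-identifiers such as zip code and age.

```python
import hashlib

import pandas as pd

# Illustrative patient-level dataset; columns and values are made up.
df = pd.DataFrame({
    "patient_name": ["Alice Martin", "Bob Durand", "Carol Petit"],
    "national_id":  ["1850375123456", "2900492654321", "1781133987654"],
    "zip_code":     ["75012", "69003", "75011"],
    "age":          [34, 58, 41],
    "diagnosis":    ["flu", "diabetes", "flu"],
})

# 1. Remove direct identifiers.
df = df.drop(columns=["patient_name"])

# 2. Pseudonymize the remaining identifier with a keyed hash
#    (the key must be kept secret by the data curator).
SECRET_SALT = "replace-with-a-secret-value"
df["pseudo_id"] = df["national_id"].apply(
    lambda x: hashlib.sha256((SECRET_SALT + x).encode()).hexdigest()[:16]
)
df = df.drop(columns=["national_id"])

# 3. Generalize quasi-identifiers: coarsen zip codes and bucket ages.
df["zip_code"] = df["zip_code"].str[:2] + "xxx"
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 70, 120],
                        labels=["<30", "30-49", "50-69", "70+"])
df = df.drop(columns=["age"])

print(df)
```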

Complementarity with Substra

New PETs have been under development for the past twenty years due to the mathematical limitations of anonymization techniques. Although perfect anonymization is often unworkable for ML studies on high-dimensional data, imperfect anonymization (called pseudonymization in the GDPR) already greatly reduces the re-identification risk and should systematically be applied before registering data on Substra nodes (performing anonymization directly on a Substra node provides no advantage). Open source tools such as ARX are available.

Secure Multi-Party Computation

Brief description

Secure Multi-Party Computation (SMPC) refers to a family of cryptographic primitives that make it possible to compute the result of a function f(x, y, z) when different actors hold the data x, y and z and do not want to share them. Using SMPC, data curators exchange encrypted versions of x, y and z in such a way that they can collaboratively compute the final result f(x, y, z) without revealing their individual contributions.

The first deployment of SMPC was carried out in Denmark in 2009 to collaboratively compute an optimal market price without revealing each actor's production. Although many different functions f may be computed using SMPC, efficient and scalable solutions are currently limited to additions (“secure aggregation”).
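As an illustration of the secure-aggregation idea, here is a toy sketch of additive secret sharing: each party splits its private value into random shares that sum to it modulo a large prime, so only the aggregate can be reconstructed. Real protocols add communication, authentication and dropout handling on top of this principle; party names and values are made up.

```python
import random

PRIME = 2**61 - 1  # arithmetic is done modulo a large prime

def share(secret, n_parties):
    """Split a secret into n additive shares that sum to it modulo PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

# Each hospital holds a private value (e.g. a local count).
private_values = {"hospital_a": 120, "hospital_b": 75, "hospital_c": 301}

# 1. Every party splits its value into one share per participant.
all_shares = {name: share(v, 3) for name, v in private_values.items()}

# 2. Party i receives the i-th share from every other party and adds them up.
partial_sums = [sum(all_shares[name][i] for name in all_shares) % PRIME
                for i in range(3)]

# 3. The partial sums are public; their sum reveals only the aggregate.
total = sum(partial_sums) % PRIME
print(total)  # 496, without any party revealing its own value
```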

Complementarity with Substra

SMPC appears highly complementary to the Substra Framework, which explains why it has already been put forward in the MELLODDY project. SMPC can be used for different applications:

  • To improve privacy of a statistical result, the aggregation principle states that the statistical analysis should be applied to many individuals before releasing its result (“hidden by the crowd”). SMPC can be used to securely aggregate model updates before they are openly shared between partners of a Substra network. Exchanged models are statistical results, and should therefore be released only once they have been trained on a minimal number of individuals. SMPC can be used to create indistinguishable batches of data from individual records curated by different hospitals, and therefore ensure that exchanged models are released only after having been updated on such minimal inter-hospital batches (a toy sketch of this masking idea is given after this list).

  • SMPC can be used to compare identification numbers between centers, without directly releasing them. When national patient IDs are used (such as the NIR in France, which should become attached to each medical record), lists of patient IDs can be compared to deduplicate records among centers. This technique can be used to clean databases and perform a train/test split of data that avoids data leakage. The Sharemind platform has been deployed to deduplicate genetic data in a distributed network.
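As a toy illustration of the secure aggregation of model updates mentioned in the first bullet, the sketch below uses pairwise cancelling masks (in the spirit of secure-aggregation protocols): each pair of centers agrees on a random mask that one adds and the other subtracts, so an aggregator only ever sees masked updates, yet their sum is exact. Center names and update values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_params = 4

# Local model updates held by three centers (illustrative values).
updates = {
    "center_a": np.array([0.10, -0.20, 0.05, 0.00]),
    "center_b": np.array([0.30,  0.10, -0.15, 0.20]),
    "center_c": np.array([-0.05, 0.25, 0.10, -0.10]),
}
centers = sorted(updates)

# Each ordered pair (i < j) agrees on a shared random mask;
# i adds it to its update, j subtracts it, so masks cancel in the sum.
masked = {c: updates[c].copy() for c in centers}
for i, ci in enumerate(centers):
    for cj in centers[i + 1:]:
        mask = rng.normal(size=n_params)
        masked[ci] += mask
        masked[cj] -= mask

# The aggregator only ever sees the masked updates...
aggregate = sum(masked.values())

# ...yet their sum equals the sum of the true updates.
print(np.allclose(aggregate, sum(updates.values())))  # True
```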


Homomorphic Encryption

Brief description

Homomorphic Encryption (HE) refers to cryptographic techniques that can be used to run computations on encrypted data, releasing encrypted results that only the data holder can decrypt. Compared to SMPC, HE does not require a collaborative setting with many (>2) actors (although the boundary between SMPC and HE is blurred). An important theoretical result showed in 2009 that a large class of computations could in principle be realized using HE, raising great interest in the academic community. Although promising in the long term, HE techniques are often considered too computationally expensive and of low maturity.
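As a small illustration, the sketch below uses the open-source python-paillier package (phe), assuming it is installed; Paillier encryption is additively homomorphic, so an untrusted server can add encrypted values and multiply them by plaintext scalars without ever decrypting them.

```python
# pip install phe  (python-paillier, an additively homomorphic scheme)
from phe import paillier

# The data holder generates a key pair and keeps the private key.
public_key, private_key = paillier.generate_paillier_keypair()

# Sensitive values are encrypted before being sent to an untrusted server.
encrypted = [public_key.encrypt(x) for x in [12.5, 3.0, 7.25]]

# The server computes on ciphertexts only: additions and scalar products.
encrypted_sum = encrypted[0] + encrypted[1] + encrypted[2]
encrypted_scaled = encrypted_sum * 2

# Only the data holder can decrypt the results.
print(private_key.decrypt(encrypted_sum))     # 22.75
print(private_key.decrypt(encrypted_scaled))  # 45.5
```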

Complementarity with Substra

There is no obvious complementarity with Substra. HE would be a solution enabling centralized processing of encrypted data, removing the need for a distributed/federated learning setting.

Differential Privacy

Brief description

Differential Privacy (DP) refers both to a statistical theory and to a blurring technique. Initiated by Cynthia Dwork in 2005, DP provides a formal statistical definition of privacy-preserving processing (using two “privacy parameters”, epsilon and delta). Although mathematically sound, this definition is somewhat difficult to explain to non-engineers (for a simple presentation, see the Udacity course). Moreover, in order to obtain “epsilon-delta private” processings, C. Dwork proposes adding finely calibrated noise to the result of statistical queries. A drawback of this noise-addition technique is that it alters result reproducibility and is therefore not well appreciated by researchers. The US Census Bureau is expected to apply DP to its statistical releases in 2020, and Google has released DP libraries for TensorFlow.
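As an illustration of the noise-addition technique, here is a minimal sketch of the Laplace mechanism applied to a counting query: a count has sensitivity 1 (adding or removing one individual changes it by at most 1), so adding Laplace noise of scale 1/epsilon yields an epsilon-differentially-private release. The data and query are made up.

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(data, predicate, epsilon):
    """Release a count with the Laplace mechanism.

    A counting query has sensitivity 1, so Laplace noise of scale
    1/epsilon gives epsilon-differential privacy.
    """
    true_count = sum(1 for record in data if predicate(record))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Illustrative query: how many patients are over 60?
ages = [34, 72, 65, 41, 80, 58, 67]
print(dp_count(ages, lambda age: age > 60, epsilon=0.5))
```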

Complementarity with Substra

DP may be complementary to Substra, although its integration is less obvious than with SMPC. 

  • DP may be associated with a dashboard for data curators, allowing them to finely tune their permission regimes and privacy levels (this is the objective of the company Privitar).

  • DP can be used directly within ML training algorithms (cf. Abadi; a toy sketch is given below)

In both cases, it would be crucial to integrate DP applications with the ledger. Applying noise addition or privacy assessment without accounting for it on the ledger would be of little interest.
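As a toy illustration of the second bullet, the sketch below reproduces, in plain numpy, the core of the DP-SGD recipe proposed by Abadi et al.: clip each per-example gradient, average, and add calibrated Gaussian noise. It is a didactic sketch, not a substitute for a maintained library such as TensorFlow Privacy; the gradients and hyperparameters are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

def dp_sgd_step(weights, per_example_grads, clip_norm, noise_multiplier, lr):
    """One DP-SGD update: clip each per-example gradient, average, add noise."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    # Gaussian noise calibrated to the clipping norm and the batch size.
    noise = rng.normal(0.0,
                       noise_multiplier * clip_norm / len(per_example_grads),
                       size=weights.shape)
    return weights - lr * (mean_grad + noise)

# Toy usage with made-up gradients for a 3-parameter model.
weights = np.zeros(3)
grads = [rng.normal(size=3) for _ in range(8)]
weights = dp_sgd_step(weights, grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.1)
print(weights)
```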


Remote execution / distributed learning

Brief description

The Substra Framework is currently designed for distributed learning, i.e. training a model on distributed datasets without centralizing them. Data scientists can access data only through Substra, which implies that no action can be performed without being traced by the ledger. More features could be developed on Substra in order to define more detailed permission regimes, and to control a priori which computations can be run on datasets. The idea of controlling the computations run by remote data scientists is usually referred to as “remote execution”. In France, the CASD has developed a highly secured tool for remote execution on centralized data, which has been promoted by Y.-A. de Montjoye, an influential privacy researcher. The main difficulty with this PET is that statistical analyses through remote execution are either too restricted for practical use (e.g. CASD) or too loose for efficient systematic protection. Moreover, the only systematic formal theory available to automate the control of remote execution is Differential Privacy, which lacks maturity.
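To illustrate what a-priori control could look like, here is a purely hypothetical sketch (not an existing Substra or CASD interface) of a privacy-budget gate: each remote analyst is granted a total epsilon budget, and queries are authorized only while the budget lasts.

```python
class PrivacyBudgetGate:
    """Hypothetical remote-execution gate tracking each analyst's epsilon budget."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = {}

    def authorize(self, analyst, query_epsilon):
        """Return True and record the spend if the query fits in the budget."""
        already_spent = self.spent.get(analyst, 0.0)
        if already_spent + query_epsilon > self.total_epsilon:
            return False
        self.spent[analyst] = already_spent + query_epsilon
        return True

# Each analyst may spend at most epsilon = 1.0 in total on this dataset.
gate = PrivacyBudgetGate(total_epsilon=1.0)
print(gate.authorize("alice", 0.4))  # True
print(gate.authorize("alice", 0.4))  # True
print(gate.authorize("alice", 0.4))  # False: budget exhausted
```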

Complementarity with Substra

The field of remote execution is very close to Substra's current technology. The technical and theoretical difficulties faced by remote execution projects are mostly the same as those faced by Substra. Given the limitations of automatic control of remote execution, Substra may:

  • Develop tools to facilitate manual audit of code

  • Facilitate ledger audit

Trusted Execution Environment

Brief description

Trusted Execution Environments (TEE) are technologies that combine software and hardware components. The idea is to better isolate computations on a server, so that even the server administrator cannot access the details of the computation. Oasis Labs integrates TEE in its blockchain-based platform.

Complementarity with Substra

Trusted Execution Environments can be relevant for Substra in the case of a cloud deployment such as MELLODDY. In that case, they would prevent an untrusted cloud provider from accessing sensitive information.

Conclusion

Different PETs are currently being developed throughout the world. None of them is a silver-bullet solution, and their combination will probably be necessary to build trusted IT environments for collaboration on sensitive data. As underlined by Y.-A. de Montjoye, the important move is from a release-and-forget paradigm (anonymize and release microdata) to a penetrate-and-patch one (build complex IT infrastructures and improve them continuously). The Substra Framework could become a useful secure orchestration layer in such an approach, and it will be of great interest to imagine new collaborations, integrations, and developments of PETs on Substra.