Securing AI: should data be centralized or decentralized?
TL;DR
The massive collection of personal data represents a new risk to privacy, and citizen-consumers are asking their representatives and their companies for higher security standards. Whereas personal data has historically been protected through anonymization, this technique often proves ineffective when artificial intelligence models are trained on personal data. New security frameworks must be developed, relying either on centralizing data in a single vault or on decentralizing it across multiple data warehouses.
Securing data sharing
We are providing ever more personal information to private companies in exchange for free access to social networks, email, and other services. In just a few years, big web companies such as the “Big Five” have accumulated a gigantic amount of information about us, which is frequently exploited by AI algorithms to improve their products, but also to monitor and influence us. Recent scandals, such as the Cambridge Analytica affair and the Edward Snowden disclosures, have helped public opinion grasp the dangers of personal data accumulating in organizations that are only weakly controlled by their users. The era when Facebook’s CEO could publicly dismiss his users’ privacy is drawing to a close, and privacy protection is becoming a central issue in marketing and political campaigning. Meeting its citizens’ expectations, the European Union has paved the way for tighter regulation of data sharing through the GDPR, and Brazil, Japan, and California have likewise recently strengthened their data protection rules. We are at a tipping point: citizen-consumers are demanding better regulation from their politicians and security guarantees from their companies. But what does secure data sharing actually mean? This article presents some of the difficulties of secure data sharing for AI, focusing in particular on whether data storage should be centralized or decentralized.
Data sharing before big data and AI
Let's consider a concrete example: research on medical data before the big data era. Patient privacy has always been central to medical ethics, and physicians swear an oath to protect it. Medical research is therefore overseen by committees that ensure privacy protection, and data are often anonymized or pseudonymized to prevent patient re-identification. When data is perfectly anonymized, physicians can share it with researchers without any risk of privacy violation (the “release-and-forget” approach). In that case, a combination of organizational (review committees) and technical (anonymization) measures enables privacy-preserving data sharing.
Why former protections no longer work
Large-scale data collection and its analysis by AI algorithms can unfortunately no longer rely on these former solutions. AI training often requires access to datasets of varying kinds, collected and held by different parties (companies, administrations, hospitals, patients, etc.). In the case of medical research, it may be relevant to jointly analyze data coming from various producers such as the social security administration, hospitals, and patients’ medical devices. From an organizational point of view, however, creating review committees able to supervise all these stakeholders is challenging. Moreover, AI models are often effective precisely because they detect subtle patterns in large collections of detailed records. Anonymizing detailed records is hard, and when it is done well it typically causes a dramatic drop in AI performance because it blurs those very patterns. Anonymization and AI are therefore often at odds: securing AI trained on massive data can no longer rely on anonymization, and new security techniques are needed.
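To make this tension concrete, here is a minimal sketch of k-anonymity-style generalization on a hypothetical toy dataset (the records and generalization rules are illustrative assumptions, not from this article): coarsening quasi-identifiers such as age and ZIP code protects identities, but it also erases exactly the fine-grained signal a model might learn from.

```python
# Hypothetical toy records: age and ZIP code are quasi-identifiers.
records = [
    {"age": 34, "zip": "75011", "diagnosis": "diabetes"},
    {"age": 36, "zip": "75012", "diagnosis": "diabetes"},
    {"age": 61, "zip": "69003", "diagnosis": "hypertension"},
    {"age": 64, "zip": "69007", "diagnosis": "hypertension"},
]

def generalize(record):
    """Coarsen quasi-identifiers: bucket ages into decades, truncate ZIP codes."""
    decade = (record["age"] // 10) * 10
    return {
        "age": f"{decade}-{decade + 9}",   # 34 -> "30-39"
        "zip": record["zip"][:2] + "***",  # "75011" -> "75***"
        "diagnosis": record["diagnosis"],  # outcome kept for analysis
    }

for r in records:
    print(generalize(r))
# The exact ages and locations a model could have correlated with outcomes
# are gone: generalization protects identities by blurring fine-grained signal.
```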
Towards a new security infrastructure: centralized or decentralized?
Which solutions can we adopt to secure AI development without relying on data anonymization? An intuitive one is to grant a single trusted party the privilege of accessing and analyzing the identifying data. Centralizing the data with this trusted party makes it easier to protect inside a single, highly secure vault. Although a centralized solution has many advantages, it also comes with important limitations:
Centralizing data on a third-party server creates a single point of failure. In the context of medical research, if all the data collected by hospitals and connected medical devices were held by a single trusted party, a breach of that party would be catastrophic. For instance, the Anthem hack, disclosed in 2015, exposed the records of nearly 79 million Americans.
The centralizing party is granted unlimited access to the records and must therefore be trusted by all data holders. Building such trust is not easy and may require complex control, audit, and governance procedures. For instance, in 1974 a French project to create a centralized administrative database called SAFARI met with widely hostile public opinion, leading to the project's withdrawal and, in 1978, to the creation of an independent supervisory agency, the CNIL.
AI development requires computer science, mathematical, and domain skills (for instance, medical expertise), and it may be hard to gather all of them within a single party. A monopoly on AI development may also be detrimental to innovation, which often relies on collaborations driven by start-ups or academic research teams.
Conversely, it is also possible to secure data sharing and analysis through decentralization. For instance, Tim Berners-Lee, the inventor of the web, has developed Solid, a platform that lets each web user keep control of their personal data instead of having it centralized by a few big companies. But decentralized systems also have limitations:
Each actor in a decentralized data sharing collaboration must maintain a highly secure IT system and dedicate substantial resources to that end.
Efficient collaboration in a decentralized setting requires authorizing significant information flows between collaborators, and these flows themselves form a new attack surface that must be contained.
In a decentralized system it is difficult to merge datasets, yet some analyses require exactly that; one decentralized workaround is sketched below.
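As a concrete illustration of this last point, here is a minimal sketch of federated averaging, one commonly discussed technique (not named in this article) for analyzing decentralized data without pooling it: each data holder fits a local model, and only the fitted parameters, never the raw records, are shared and averaged. The two-hospital dataset below is a hypothetical example.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_fit(X, y):
    """Ordinary least-squares fit on one holder's local data."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Hypothetical example: two hospitals hold disjoint patient datasets drawn
# from the same underlying relationship y = X @ [2, -1] + noise.
true_w = np.array([2.0, -1.0])
hospital_data = []
for _ in range(2):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    hospital_data.append((X, y))

# Only fitted coefficients travel between collaborators, never raw records.
local_models = [local_fit(X, y) for X, y in hospital_data]
global_model = np.mean(local_models, axis=0)
print(global_model)  # close to [2.0, -1.0] without ever merging the datasets
```

Note that even here, the exchanged parameters constitute one of the information flows mentioned above and can leak information about the underlying records, which is why such schemes are often combined with the cryptographic and noise-addition techniques discussed in the conclusion.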
Conclusion
Centralized and decentralized settings both have their pros and cons. Whichever solution is adopted, inherent risks will remain and must be managed. A range of new privacy-enhancing technologies is being developed to address these issues: cryptographic protocols (homomorphic encryption, secure multi-party computation), noise addition to blur sensitive information (differential privacy), and distributed ledger technologies (often called blockchains) to ensure traceability in decentralized collaborations. These techniques should be combined with one another and with more classical approaches (encryption, access control, etc.) to provide a secure framework for AI development.
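As an illustration of one of these techniques, here is a minimal sketch of the Laplace mechanism, the textbook noise-addition scheme behind differential privacy; the epsilon value and the query below are illustrative assumptions, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(42)

def private_count(values, predicate, epsilon=1.0):
    """Release a count with Laplace noise; a counting query has sensitivity 1."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [34, 36, 61, 64, 58, 47]  # hypothetical toy data
print(private_count(ages, lambda a: a >= 50, epsilon=1.0))
# Each released query leaks only a bounded amount of information about any
# single record; a smaller epsilon means more noise and stronger privacy.
```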
The exploration of these new organizational and technical solutions has only just begun, and a few more years will be needed for such security standards to emerge. They will be built collaboratively by an enthusiastic and dynamic community of innovators, researchers, administrators, and associations.