How does blockchain enable healthy collaboration between hospitals and private organizations on health data?
This article was originally published in issue 28 of DSIH and is inspired by the HealthChain project.
Hospitals have a huge amount of data
Every year, millions of patients are treated in French hospitals. The data of these patients are naturally stored in each of the hospitals' information systems and constitute an essential material not only for patient care but also for clinical research.
Hospitals are full of data (patient data and associated diagnoses) in many departments: mammograms and their diagnosis for breast cancer, genomic data and associated diseases, etc.
How can this data be made available to the outside world?
To advance research and innovate, it is important to collaborate between entities that have different skill sets. The hospital, and in particular university-affiliated hospitals, forge many links with external actors: research centres, pharmaceutical laboratories, start-ups, etc.
Let's now imagine that a start-up wants to develop an artificial intelligence (AI) tool for a given pathology. The easiest way for the start-up would be to retrieve all the images from the relevant departments of French hospitals and train its AI algorithm on them. This approach is not feasible in the current data environment for two main reasons.
On the one hand, medical data must be transferred in accordance with the regulations that apply to their type (personal data, pseudonymised data, etc.). This obligation creates many barriers to the processing and transfer of this data.
On the other hand, collecting the data required a very significant human and financial effort from the hospital over many years: ensuring the quality of the data and diagnoses, following the same protocol over several years and keeping the data in the hospital's information system. Thus, taxpayers and the institutions concerned are often legitimately reluctant to let a private company benefit from this public investment.
Behind the scenes there is a problem of trust between the hospital with a public service mission and the private company. How can a hospital be sure that the data recovered by the private actor will not be misused? In a world where the value of this medical data is still unknown and probably underestimated, sharing data with third parties presents too great a risk.
These problems are partially solved by providing external access (e.g. VPN access) to hospital data and signing a contract. However, this solution does not solve everything: it does not guarantee that a malicious actor cannot use the data without informing the hospital. How can a hospital really ensure that the data will not be misused?
AI on stationary data
The solution we present here is particularly applicable to AI, where algorithms need more and more data to be effective. The trend is therefore towards massive centralization of data, which implies a loss of control and traceability for hospitals.
Substra is an open source software tool that mitigates the need for data centralization. It allows patient data to remain in the hospital while algorithms move between the hospital and third parties without a VPN connection. Only the algorithms access the data, never the private actors themselves.
Role of blockchain in Substra: ensuring data control of the hospital
It is crucial that the hospital can explicitly or implicitly authorize the external actor to use its data for a specific purpose. The traditional use of a VPN assumes a relationship of trust between the parties, which does not always exist. Without a blockchain, this can be solved by introducing a trusted third party to administer the platform, but this requires an additional actor.
Substra uses a blockchain that makes it possible to do this without a trusted third party. The management of algorithms' data access permissions and the traceability of operations are not entrusted to a legal or natural person but to an incorruptible, distributed, programmable controller: the blockchain.
Unlike Bitcoin, this blockchain is private: only authorized institutions can connect to it. The type of blockchain used here is called Distributed Ledger Technology (DLT). This ledger is a database with special properties. It is decentralized: each computer on the network has its own copy of the ledger, but all copies are identical and always synchronized. Each addition of information to the ledger is simultaneously recorded, validated and synchronized across the computer network.
There is no ledger with priority over others, no notion of master/slave ledger. This system therefore does not require a central administrator or a centralized database.
If the hospital wants to collaborate with a start-up, it can record in the distributed ledger that it gives the start-up permission to run an algorithm on its data. When the start-up wants to launch an algorithm training on the hospital data, it must register this task in the ledger, where it is validated or rejected by all stakeholders in accordance with the permissions recorded in the ledger. An illegitimate algorithm will therefore be rejected immediately.
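The permission check described above can be illustrated with a minimal Python sketch. It is not Substra's actual API: the `Ledger` class, the `submit_training_task` function and the actor names are hypothetical, and real consensus between nodes is reduced here to "every copy of the ledger must agree".

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Toy stand-in for one node's copy of the distributed ledger (hypothetical)."""
    permissions: set = field(default_factory=set)  # (data_owner, grantee) pairs
    log: list = field(default_factory=list)        # append-only record of task requests

    def grant(self, data_owner: str, grantee: str) -> None:
        """Record that data_owner lets grantee run algorithms on its data."""
        self.permissions.add((data_owner, grantee))

    def validate(self, data_owner: str, requester: str) -> bool:
        """Check a task request against the permissions stored in this copy."""
        return (data_owner, requester) in self.permissions


def submit_training_task(nodes, data_owner, requester):
    """Accept the task only if every node's ledger copy validates it."""
    accepted = all(node.validate(data_owner, requester) for node in nodes)
    status = "accepted" if accepted else "rejected"
    for node in nodes:
        # Rejected requests are logged too, so attempts remain traceable.
        node.log.append((requester, data_owner, status))
    return accepted


# Three participants each hold an identical copy of the ledger.
nodes = [Ledger() for _ in range(3)]
for node in nodes:
    node.grant("hospital_A", "startup")

ok = submit_training_task(nodes, "hospital_A", "startup")
rejected = submit_training_task(nodes, "hospital_A", "unknown_company")
```

In this sketch the authorized start-up's task is accepted, while the request from an actor without a recorded permission is rejected and still leaves a trace in every copy of the log.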
Collaboration between an external actor and several hospitals: federated learning
Now imagine that two other hospitals (B and C) have data similar to the first hospital A and want to participate in the research project with the start-up.
The algorithm must therefore learn from the data of hospitals A, B and C, without that data being pooled in one place. This is called federated learning.
The two new hospitals must first join the private network and then give permission to the start-up (via the distributed ledger) to train an algorithm on their data, see figure below:
Without a blockchain, one of the network's actors, probably the private actor, would have had to manage a master database collecting permissions on hospital data. That actor would thus have held special rights that are not necessary.
The distributed ledger provides complete traceability of actions on the network
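To make the idea of federated learning concrete, here is a minimal Python sketch of one common scheme, federated averaging, on a one-parameter model y = w * x. The article does not say which aggregation strategy Substra uses, so this choice is an assumption, and the function names and data are made up for illustration.

```python
def local_update(w, data, lr=0.01, epochs=20):
    """Run gradient descent on one hospital's data for the model y = w * x."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w, hospital_datasets):
    """One round: each hospital trains locally, then only weights are averaged."""
    updates = [(local_update(w, data), len(data)) for data in hospital_datasets.values()]
    total = sum(n for _, n in updates)
    # Average the local weights, weighted by each hospital's number of samples.
    return sum(w_i * n for w_i, n in updates) / total


# Made-up data following y = 3x; each list stands for data that stays in its hospital.
hospital_datasets = {
    "A": [(1.0, 3.0), (2.0, 6.0)],
    "B": [(3.0, 9.0)],
    "C": [(0.5, 1.5), (4.0, 12.0), (1.5, 4.5)],
}

w = 0.0
for _ in range(30):
    w = federated_round(w, hospital_datasets)
```

The only values that cross hospital boundaries in this sketch are the model weight sent out and the updated weight sent back; the (x, y) records are only ever read inside `local_update`.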
The distributed ledger is also used to track all actions undertaken by the different actors in the network. This traceability brings transparency: we know which actor has worked on which data. It also provides trust in the models used to make predictions: we know precisely on which database(s) the models have been trained and on which database(s) they have been evaluated.
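As an illustration of this traceability, the sketch below keeps an append-only log in which each entry's hash covers the previous entry, a standard blockchain construction. The field names, the `append_entry`, `is_intact` and `datasets_for` helpers and the dataset labels are hypothetical and do not reflect Substra's actual data model.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash preceding the first entry


def _digest(body):
    """Hash an entry's content deterministically."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(ledger, actor, action, model, dataset):
    """Append an entry whose hash covers its content and the previous hash."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    body = {"actor": actor, "action": action, "model": model,
            "dataset": dataset, "prev_hash": prev_hash}
    ledger.append({**body, "hash": _digest(body)})


def is_intact(ledger):
    """Recompute every hash; any edit to a past entry breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


def datasets_for(ledger, model, action):
    """List the datasets a model was trained or evaluated on."""
    return [e["dataset"] for e in ledger if e["model"] == model and e["action"] == action]


ledger = []
append_entry(ledger, "startup", "train", "model_v1", "hospital_A/mammograms")
append_entry(ledger, "startup", "train", "model_v1", "hospital_B/mammograms")
append_entry(ledger, "startup", "evaluate", "model_v1", "hospital_C/mammograms")

trained_on = datasets_for(ledger, "model_v1", "train")
intact_before = is_intact(ledger)

ledger[0]["dataset"] = "hospital_X/other"  # simulate tampering with history
intact_after = is_intact(ledger)
```

Querying the log answers "which data was this model trained on?", and the chained hashes mean that rewriting an earlier entry is detected by any participant who re-verifies the chain.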
If the start-up monetizes the model it has created (for example by selling it or offering it as SaaS), the hospitals whose data were used for training receive a percentage of the revenue generated by the start-up. This percentage can be negotiated in advance or be proportional to each hospital's data contribution. Thanks to Substra's traceability, it should even be possible to quantify how useful a dataset has been in improving the performance of a given algorithm. Thus the true added value of the data for this algorithm can be quantified.
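A proportional split of this kind is simple arithmetic. The sketch below assumes, purely for illustration, that 20% of the revenue goes to the hospitals and that contribution is measured by the number of records used for training; both the figures and the `revenue_shares` function are hypothetical.

```python
def revenue_shares(revenue, hospital_percent, contributions):
    """Split the hospitals' share of revenue in proportion to their contributions."""
    pool = revenue * hospital_percent / 100
    total = sum(contributions.values())
    return {hospital: pool * amount / total for hospital, amount in contributions.items()}


# Illustrative figures only: 100,000 in revenue, 20% reserved for hospitals,
# contributions counted as number of records used for training.
shares = revenue_shares(100_000, 20, {"A": 5000, "B": 3000, "C": 2000})
```

Record count is the crudest possible measure; the same function would work unchanged with a usefulness score per dataset in place of the record counts.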
Towards a controlled valuation of medical data
The distributed ledger allows health data to be used for research and innovation, ensuring data security and traceability. These characteristics, essential for healthy and transparent collaboration, make it a tool of choice in the field of health data exploitation. Once completed, this technology will allow the emergence of a network of hospitals accessible to external actors while remaining under the close control of CIOs and may lead to the emergence of a new, more fluid way of valuing medical data.