[Guest post] Story of the 1st Federated Learning Model at Owkin
This guest post was originally published on the Owkin website. Owkin, a fast-growing health data AI startup, is a core partner of Substra Foundation and dedicates a full tech team to developing the first version of the Substra Framework. Several founding members of Substra Foundation work at Owkin.
Owkin and Substra Foundation are both members of the Healthchain consortium.
Story of the 1st Federated Learning Model at Owkin
At Owkin, our mission is to fuse AI and clinical research to unlock medical discoveries. To do this we connect medical researchers, biopharma companies, and data scientists in a collaborative, privacy-preserving ecosystem.
Access to large-scale, meaningful medical data is a major challenge in healthcare. Owkin has dedicated three years of R&D to developing Owkin Connect (‘Connect’), our proprietary federated learning framework that ‘connects’ multiple data sources to enable distributed machine learning training without aggregating or collecting private data. The Connect proof-of-concept was achieved in January 2020, when our first-ever federated deep learning model (‘Model’) was trained on histology images that remained distributed and stored behind hospital firewalls. Two datasets and more than 40 talented and dedicated people from Owkin, Centre Léon Bérard (‘CLB’), and Institut Curie contributed to this success. The path to this achievement was not straightforward; it was full of discoveries and valuable lessons. This proof-of-concept for federated learning via Owkin Connect is a technical triumph, an example of outstanding collaboration, and an empowering conclusion to our journey. In this article, we outline how we got here and what we learned along the way.
Step 1: Building the consortium
We knew that in order to implement federated learning, we needed to identify two or more medical centers with curated datasets of similar types of data. For that purpose, Owkin turned to its internationally renowned partners CLB and Institut Curie, which are both part of UNICANCER, a federation of French medical centers that are pioneering cancer research and improving patient care.
Together, Owkin, CLB, and Institut Curie decided to investigate breast cancer because of its incidence, and specifically triple-negative breast cancer, for which prognosis is poor and typical treatments are of limited efficacy. The project was driven by visionary leaders: Thierry Durand, Director of IT at CLB, and Alain Livartowski, MD, Deputy Head of the Data Department at Institut Curie. The process of gathering data from hundreds of breast cancer patients was orchestrated by the two co-principal investigators: Dr. Pierre-Etienne Heudel, oncologist at CLB, and Dr. Guillaume Bataillon, pathologist at Institut Curie. All four leaders brought expertise and commitment that were essential to the project.
This initial partnership formed the basis of Healthchain, a public-private consortium with a €10M budget funded by Banque Publique d’Investissement to develop the federated learning framework (Owkin Connect) and to train predictive models in oncology (breast cancer and melanoma) and fertility. The consortium gathers seven public partners (CLB, Institut Curie, CHU Nantes, AP-HP, the French National Center for Scientific Research, Université Paris Descartes, and École Polytechnique) and three private partners (Owkin, Substra Foundation, and Apricity).
Step 2: Project management
Challenges arose in managing the federated learning breast cancer project because of the multiple stakeholders involved and the novelty of the machine learning framework. Close collaboration and strong coordination were essential. At Owkin alone, the project required the expertise of more than 25 people from almost 10 different teams, mostly from Tech, IT Operations, Data Science, Research, Product, Partnerships, Legal, and Operations. At CLB and Institut Curie we interacted with the Medical, Valorisation & Innovation, Legal, and IT teams, among others, and we could rely on the Healthchain data engineers, Clément Joly at CLB and Armand Léopold at Institut Curie, for internal coordination. Project management requirements included: an interdisciplinary and precise understanding of collaborator workflows and needs; clearly defined objectives; stringent road-mapping; anticipation of risks and bottlenecks; agile decision-making; and efficient communication.
Step 3: Security
Startups are familiar with the concept of a Minimum Viable Product (MVP), which means building a quick and dirty, yet functional, early version of the product in order to collect user feedback with limited effort before scaling up production. In this case, however, we could not afford to give our partners an MVP, as any mistake in the code had the potential for severe consequences. As with any software deployed within a hospital’s infrastructure, a security breach could endanger the hospital’s whole IT system and affect all activities, even patient care. Additionally, since Connect trains algorithms on patient data and shares those algorithms between centers, a data leak could have consequences for privacy, even though the patient data is pseudonymized. Hence, we had to strengthen every security aspect of Connect before deploying it at CLB and Institut Curie. In October 2019 we ran a thorough risk analysis and a successful security audit that did not identify any serious security weakness. In November 2019, we were proud to present our security protocols to the Healthchain partners and to receive approval from the Directors of IT and Data Protection Officers (Thierry Durand and Franck Mestre at CLB, and Astrid Lang at Institut Curie) to deploy Connect behind their firewalls and run federated learning training on their data.
Step 4: Contractual support
When delivering a new cutting-edge technology, an organization must be innovative in every step of the process and reinvent everything, even the contracts. Given that federated learning is a new, emerging technology, there were limited examples of contracts to reference and Owkin had to design specific tripartite contracts between Owkin, CLB, and Institut Curie to allow training models in a federated learning context. These contracts state the expertise each party brings to the project and how the intellectual property and potential revenues will be shared.
Step 5: Connect software development
Considerable effort and interdisciplinary skills were needed to design, prototype, and develop Connect. This effort mobilized a full team of 10 people, including designers and software engineers specialised in frontend or backend development, over almost two years. The Connect framework ensures that algorithms are trained on distributed sensitive datasets that remain within the infrastructure of the medical centers that generated the data. Only the models and non-sensitive metadata are shared between Owkin and its partners, over a secure network. Permission settings and a distributed ledger (which orchestrates the computations and ensures that they are traced and authentic) guarantee compliance with data governance requirements and privacy-preserving operations. A major recognition of these cutting-edge privacy-preserving and security standards came with the open-sourcing of Connect’s core code in October 2019, under the name Substra. This open-sourcing was in line with Owkin’s commitment to full transparency with its partners and to the quality and security of Connect’s backbone code. The core code repository, Substra, is hosted by Substra Foundation, which actively promotes responsible and trustworthy data science across different sectors.
Step 6: Connect deployment
The first deployment of Connect on CLB and Institut Curie’s infrastructure turned out to be quite complex. We initially thought we could have a ‘one-size-fits-all’ deployment, but it soon became apparent that we had to adapt our setup to match the infrastructure of each partner.
Step 7: Data science
As expected, data science presented several challenges, including those related to machine learning on medical data and those inherent to this innovative federated learning technology. Medical researchers and data engineers at CLB and Institut Curie played a key role in overcoming these difficulties, working hand in hand with Owkin data scientists. The input data used to train the Model was whole slide images (WSIs) of triple-negative breast cancer biopsies, from approximately 100 patients at CLB and 240 patients at Institut Curie. The Model’s objective was to predict the treatment response to neoadjuvant chemotherapy. To prepare data for machine learning, the physically archived slides had to be enumerated one by one and the annotation criteria had to be made uniform between the two institutions. Even within the same institution, slides can be very heterogeneous, since the source biopsies were collected over several years and methodology can change between practitioners. Additional attention was given to the curation of the test dataset used to evaluate the model performance at each institution. It was essential that this test dataset (i) was completely separate from the dataset used to train the model, (ii) was representative of the target patient population, and (iii) accounted for the different data collection tools and methodologies.
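As a rough illustration of criteria (i) and (ii), the sketch below holds out a test set at the patient level, stratified by outcome. It is not the procedure used in the project: the function name, the 20% test fraction, the labels, and the synthetic cohort are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_patient_split(labels, test_fraction=0.2, seed=0):
    """Split patient IDs into disjoint train/test sets, stratified by label.

    `labels` maps a (hypothetical) patient ID to an outcome label.
    Splitting by patient, not by slide, keeps all slides of one patient
    on the same side of the split.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for patient_id, label in sorted(labels.items()):
        by_label[label].append(patient_id)

    train_ids, test_ids = set(), set()
    for ids in by_label.values():
        rng.shuffle(ids)
        # Keep roughly the same label proportions in the test set.
        n_test = max(1, round(len(ids) * test_fraction))
        test_ids.update(ids[:n_test])
        train_ids.update(ids[n_test:])
    return train_ids, test_ids


# Synthetic cohort (assumption): 100 patients, 30 of them responders.
labels = {
    f"patient_{i:03d}": ("responder" if i % 10 < 3 else "non_responder")
    for i in range(100)
}
train_ids, test_ids = stratified_patient_split(labels)
print(len(train_ids), len(test_ids))
```

In a federated setting, a split like this would be run independently inside each center, so that each institution keeps its own held-out test set.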
Step 8: Federated model
The federated learning approach was developed by scientists in the Federated Learning Research team at Owkin. This team, together with Owkin’s talented engineering teams, has built a strong internal library of state-of-the-art federated and privacy-preserving techniques. The challenge in building such a library was to ensure the interoperability of the latest research findings in the field with the complex machine learning approaches developed by Owkin data scientists for real-world medical problems across many therapeutic areas. With these tools in hand, the model was first trained separately on each center’s dataset for reference. Connect was then successfully deployed at each center to train the model collaboratively on both centers’ data in a federated manner.
It became clear that the already complex medical question of predicting treatment response was particularly difficult in a federated learning setting. Further work was needed to improve the performance of the federated learning and data science approaches. Besides enriching the available data by working together with CLB and Institut Curie to increase the size of their local imaging datasets and pair them with clinical data, we are also evaluating more elaborate federated learning strategies this summer, enabled by the flexibility of Owkin’s software and tools. We will be excited to share the results with you later in the year in an official announcement. You will hear from us soon.
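To make the idea of collaborative training concrete, here is a minimal sketch of federated averaging, a common baseline strategy. The post does not say which strategy was used, so this is an assumption, and none of it is the Connect or Substra API. The model is a toy linear regression on synthetic data. The center sizes (100 and 240) echo the patient counts above, but all function names, values, and hyperparameters are hypothetical.

```python
import random


def local_update(weights, data, lr=0.1, epochs=5):
    """Run a few gradient-descent epochs on one center's private data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in data) / n
        grad_b = sum((w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(updates, sizes):
    """Average model parameters, weighted by each center's dataset size."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)


def make_center(n, seed):
    """Synthetic stand-in for a center's local dataset (y = 2x + 1 + noise)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0, 0.05)))
    return data


def train_federated(centers, rounds=50):
    """Only model parameters leave a center; raw data never does."""
    global_model = (0.0, 0.0)
    sizes = [len(data) for data in centers.values()]
    for _ in range(rounds):
        updates = [local_update(global_model, data) for data in centers.values()]
        global_model = federated_average(updates, sizes)
    return global_model


centers = {
    "center_a": make_center(100, seed=0),
    "center_b": make_center(240, seed=1),
}
model = train_federated(centers)
print(model)
```

Each round sends the current model to every center, lets it train locally, and combines the returned parameters. A real deployment would add secure transport, permissions, and traceability on top of this loop.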
Acknowledgments
The Healthchain project, which resulted from the “Digital Investments Program for the major challenges of the future” RFP, is supported by Bpifrance. As part of the Healthchain project, a consortium coordinated by Owkin (a private company) has been established, including the Substra association, Apricity (a private company), the Assistance Publique des Hôpitaux de Paris, the University Hospital Center of Nantes, the Léon Bérard Center, the French National Center for Scientific Research, the École Polytechnique, the Institut Curie, and the University of Paris Descartes.
We are grateful to Centre Léon Bérard and Institut Curie for their enthusiasm and dedication, in particular: Thierry Durand, Pierre-Etienne Heudel, Franck Mestre, Clément Joly, Charles Bongiorno, Julie Struyf, Jean-Yves Blay at CLB and Alain Livartowski, Guillaume Bataillon, Astrid Lang, Xosé Fernandez, Julien Guérin, Armand Léopold, Guillaume Arras, Timothé Cynober, Johan Archinard, You-Heng Ea at Institut Curie.
We are thankful to all Healthchain partners and we are enthusiastic about the upcoming federated learning projects between CHU Nantes and AP-HP in onco-dermatology. We are excited to bring federated learning to Owkin Loop, our growing network of top-tier medical research institutions.
Special thanks and congratulations to Owkin employees who made this project a great success: Mathieu Galtier, Camille Marini, Anne-Laure Moisson, Samuel Lesuffleur, Inal Djafar, Clément Gautier, Claire Philippe, Guillaume Cisco, Kelvin Moutet, Jérémy Morel, Thibault Robert, Aurélien Gasser, Maël Debon, Gilles Wainrib, Eric Tramel, Meriem Sefta, Etienne Bendjebbar, Amandine Lagorce, David Vallas, Adrian Gonzalez, Jocelyn Dachary, Julien Masson, Charles Maussion, Jean du Terrail.
Authors
Anne-Laure Moisson - Project coordinator
Mathieu Galtier, PhD - Chief Product Officer
Antonia Trower - Marketing Manager
Anna Huyghues-Despointes - Head of Strategy and Marketing
Eric Tramel, PhD - Group Lead, Federated Learning