What's up, doc?
The general documentation for the Substra framework has been online for a while now, and we never took the time to look back at it, so let's take a retrospective look at it together!
First things first
As you probably know, open source software isn't all about publicly accessible code. It is also and above all a project, a place where discussions, tests and developments come together. It is therefore essential to be able to settle down comfortably!
Substra, a framework for orchestrating decentralized machine learning tasks, is powerful software built on contemporary software bricks, but it is not yet simple to install on the server side; it was therefore essential to make getting to grips with it as easy as possible.
Battle map
Our adventure therefore began quite naturally with an inventory of the existing situation and the prioritization of the project's needs with the team. We then had to look at the solutions that would best meet them and, finally, tackle the structure of the different sections of the documentation.
Starting point
The first observation to be made was quite straightforward: it was necessary to document how to install Substra, at least locally, in order to be able to start using the software and to finally test the turnkey prediction example based on the Titanic data!
But first of all, one must understand that Substra is based on:
several open source software blocks,
spread across several publicly accessible code repositories,
that use several specific versions, hence the precious compatibility table that I'm going to end up getting tattooed on my leg
and that are deployed with Kubernetes, itself based on several pieces of software which also require specific versions
Let's sum things up, calmly. Substra is made up of:
Hlf-k8s: The implementation of Hyperledger Fabric, a blockchain framework used for its distributed ledger mechanism. It contains only non-sensitive metadata related to the members of a Substra network and to the training tasks. These operations are handled by the Substra-chaincode part.
Substra-backend: Behind the scenes of a Substra network, server operations are handled by Django, more specifically with Django Rest Framework. Each organisation of a Substra network needs to have a backend server so that they can "talk" to each other.
Substra-frontend: The Web interface is built with React/Redux.
The whole thing is deployed and orchestrated by Kubernetes, with the help of:
Kubectl, version 1.16.7 and Minikube, version 1.9.2 for the local setup part (that will be different for production configuration)
Helm, version 2.16.1 or Skaffold, version 1.9.2, as Kubernetes "package manager" using Chart.yaml files.
Note: You can now deploy Substra with Helm v3!
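To make the version juggling a little more concrete, here is a minimal Python sketch that compares the tools installed on a machine with the versions quoted above. The function names and the shape of the `installed` dictionary are assumptions made for illustration; this is not part of Substra, and the compatibility table remains the reference for any other release.

```python
import re

# Versions quoted in this article for a local setup. For any other Substra
# release, the compatibility table is the source of truth.
EXPECTED_VERSIONS = {
    "kubectl": "1.16.7",
    "minikube": "1.9.2",
    "helm": "2.16.1",
    "skaffold": "1.9.2",
}

_VERSION_RE = re.compile(r"v?(\d+)\.(\d+)\.(\d+)")


def parse_version(text):
    """Return the first x.y.z version found in text as a tuple of ints, or None."""
    match = _VERSION_RE.search(text)
    return tuple(int(part) for part in match.groups()) if match else None


def find_mismatches(installed, expected=EXPECTED_VERSIONS):
    """Map each tool whose installed version differs from the expected one
    to an (installed, expected) pair. Tools that are missing are reported
    with None as the installed value."""
    mismatches = {}
    for tool, wanted in expected.items():
        found = installed.get(tool)
        if found is None or parse_version(found) != parse_version(wanted):
            mismatches[tool] = (found, wanted)
    return mismatches
```

The `installed` dictionary would typically be filled by hand or from the output of each tool's own version command; for instance, `find_mismatches({"kubectl": "v1.18.0"})` reports kubectl as mismatched and the three other tools as missing.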
For example, if you wish to use the version 0.6.0 of the command line interface (Substra CLI), you will have to deploy on the server side what is indicated in the compatibility table:
It is also possible to directly install these different programs with Helm using the different "charts" published here, always referring to the compatibility table.
Tooling
Nice to have features
Initial objective:
"Not to blush in front of the Kubeflow documentation website"!
After discussing with the team, we selected and prioritized a number of features, following a principle of continuous improvement, among which:
Clear structure with a nice table of contents, which remains "sticky" on the left side of your browser so that you don't get lost in it.
Large sections such as "Getting started", deployment, tutorials, troubleshooting, definitions, frequently asked questions, etc.
Content! For example, the description of options and so on.
Explicitly announce that this may not be easy for Windows users
Complete and precise help sections for less experienced users (see Python Env)
A mechanism that generates a URL for each paragraph, to make it easier to reference and share a specific section
Having a breadcrumb line
Offer the possibility to report a problem on the documentation
Rendering with syntax highlighting for code parts
Using images and gifs to show things
Open a "help" channel (#help)
Have badges (builds, links, etc.)
Offer different approaches depending on whether you are a developer, data scientist or administrator (see Find your way in this documentation).
Being able to change the theme and apply our colors
And a million other features that have become less of a priority as our adventure has progressed.
Stack
After evaluating different solutions to best meet these needs, we turned to the open source software Python Sphinx which stood out for its large community, myriad extensions and ease of use.
It was then easy to add extensions for compatibility with the Markdown format, which is more flexible than reStructuredText (.rst), and to have a contemporary theme (Read the Docs) adjusted to our colors.
For the hosting of the website, we opted to use GitHub Pages directly, since it comes with the code repository dedicated to the documentation. We then registered a sub-domain of substra.ai with our super domain name provider (Gandi ♥), namely doc.substra.ai. The whole thing was then assembled with Travis, the build tool already used across the Substra Foundation projects.
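For readers who have never seen a Sphinx project, the setup described in this section could look roughly like the `conf.py` below. It is a sketch under assumptions, not the actual configuration of the Substra documentation: the project title, the choice of `recommonmark` as the Markdown extension and the colour value are all placeholders.

```python
# conf.py -- illustrative sketch, not the project's actual configuration.

project = "Substra documentation"  # assumed title

# Markdown support comes from an extension; recommonmark is assumed here,
# other extensions exist and the article does not say which one was used.
extensions = [
    "recommonmark",
]

# Let Sphinx pick up both reStructuredText and Markdown sources.
source_suffix = {
    ".rst": "restructuredtext",
    ".md": "markdown",
}

# Read the Docs theme, with the table of contents kept "sticky" on the left
# and a custom header colour (placeholder value, not the Substra palette).
html_theme = "sphinx_rtd_theme"
html_theme_options = {
    "sticky_navigation": True,
    "style_nav_header_background": "#2980b9",
}

# Public address of the site once served from the sub-domain.
html_baseurl = "https://doc.substra.ai/"
```

Both `recommonmark` and `sphinx_rtd_theme` need to be installed alongside Sphinx for this file to be usable in a real build.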
Writing & editing
Once this framework had been established, we were able to start reviewing the existing documentation. The first task was to synthesize and restructure content that was scattered across several code repositories, and sometimes across several sub-folders.
The first objective was therefore to propose a clear and efficient information structure with:
Overview: General presentation of the framework and navigational markers in the documentation's paths.
Getting Started: Everything you need to get started with Substra, debug its resources and even install the platform locally!
Platform Description: This section includes a presentation of the concepts handled by Substra (e.g. objective, traintuple & testtuple) and a second one that presents the software architecture of the Substra framework.
Specific entry points: Under this ugly name, you will find the Frequently Asked Questions, which gather the recurring questions asked by the users of the software (dev, admin, data scientist, curious, open source aficionados), and a glossary, an ordered list of the notions that run through the project.
Contribute: This section gathers the very rich contribution guide to the Substra project as well as the procedure to build a new version of the documentation.
As an initial assessment
Let's be straightforward: it is possible to install Substra by following the doc!
As for the goal of not blushing in front of the Kubeflow documentation website: objectively speaking, we're very happy, but we'll let you give us your feedback on GitHub or Slack!
As for the functionalities that are not yet fully integrated (see the roadmap), we are working on them and of course remain open to discussing them with you!
Challenges
After this bit of self-congratulation, here are a few points on which improvement is possible:
On the software side:
Substra remains complicated software to deploy, in particular because of its composition (several code repositories, several versions to "align", several YAML configuration files that are not very easy to use without some knowledge of Kubernetes). It will thus be necessary to clarify, again and again, whatever is not yet clear enough!
Hardware requirements (it is not easy to run Substra on a computer with 8 GB of RAM) and the many possible sources of error do not make the stack easy to handle at first, but it should be noted that the new local debug mode (available from version 0.7.* of the Substra package) greatly improves first contact with the Python SDK!
Upgrades are often a moment of anxiety...
It would be great to publish some roadmap elements (Please note that the documentation repository gathers a number of questions, suggestions or feature requests)!
On the documentation side:
Changelogs, the modifications made in each version, still need to be collected and published (in the meantime, you can find them on the release page)
The roadmap is not progressing fast enough!
I still dream of a fully automatic generation of the documentation but the process is not easy to control reliably...
I often find some broken links in the documentation. I'm currently trying Dr Link Check, but do you have any suggestions?
[Bonus] Demo
If you don't want to bother doing a whole local deployment, we now offer you a demo!
It is a small Substra instance, hosted at OVH, which will allow you to make your first attempts with a remote Substra network composed of two organisations (org-1 & org-2)! You will then be able to concentrate on your code, configure the access permissions to your resources and connect to the two Substra nodes directly via the CLI, the SDK or even via the web interface in order to launch and follow your training and testing tasks!
All resources for this demo can be found here:
Guide to connect to this instance
Titanic example implemented in Substra with the fake_data of the opener.py or the debug mode
The community examples are also a great source of inspiration
The documentation of the CLI
The documentation of the Python SDK
substra-tools will serve as a basis for the definition of your resources
More generally, the documentation website will provide you with the elements you need
Important: this demo instance is not intended to host any sensitive data. It is, moreover, frequently reset to zero!
Open source
Without listing all the advantages of contributing to an open source project, we can still say that it allows you to:
meet wonderful people!
learn, among other things, bits and pieces of Linux, Kubernetes, Helm, Ingress, Docker, Python, React, RabbitMQ, PostgreSQL, Hyperledger Fabric, Git, etc.
see how to make a build with Sphinx! I promise, it's very simple, just edit the files that are in the src folder and type make livehtml in your favorite terminal! Everything is described here!
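For the curious, the build step mentioned above can be sketched in Python as well. This is an assumption-laden illustration: it supposes that `make livehtml` wraps the `sphinx-autobuild` tool, that the sources live in `src` and that the output goes to `_build/html`; the helper name `sphinx_command` is hypothetical, and the repository's own Makefile remains the reference.

```python
import sys


def sphinx_command(source_dir="src", output_dir="_build/html", live=False):
    """Return the argument list for a Sphinx HTML build.

    With live=True the list targets sphinx-autobuild, which rebuilds and
    reloads the pages on every change; otherwise it is a one-off build run
    through the current Python interpreter.
    """
    if live:
        return ["sphinx-autobuild", source_dir, output_dir]
    return [sys.executable, "-m", "sphinx", "-b", "html", source_dir, output_dir]
```

To actually launch a build, the returned list can be handed to `subprocess.run(sphinx_command(live=True), check=True)` after an `import subprocess`, provided Sphinx and sphinx-autobuild are installed.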
If you feel like lending a hand, but don't really know which way to start, you can:
help us to find the typos that are lying around in the documentation by reporting them here - I know there are some left...
suggest improvements to translations, ask questions, ask for clarification, suggest examples - it helps us and it helps everyone!
read, comment, create issues
or even take on a task/issue - we try to flag them in a coherent way, especially the [good-first-issue] - and open a Pull Request!
Acknowledgement
First of all, we would like to express our thanks and admiration for Owkin and its technical teams, who are openly developing such a framework for orchestrating decentralized Machine Learning tasks! Thanks also to the Substra team and community for their responsiveness and kindness in all the operations that allow the publication of these resources! And a special thank you to Chris from Apricity, companion in the meanders of server configuration!
Stay tuned & explore!
If you want more, you can join us on Slack, subscribe to the newsletter, participate in a Substra Open Source or Responsible and Trusted Data Science workshop, or explore -and star ʕᵔᴥᵔʔ- our projects on GitHub:
Photo credit: Coffee Talk by Jean Zar, Creative Commons BY-NC 2.0.