[Guest post] - The Paradox of Data Sharing

This blog post is the first guest post on the Substra Foundation blog. It is written by Noggin, a digital, data & analytics company based in Singapore. Noggin is interested in the Substra framework and in our themes of responsible and trustworthy collaborative data science. We are delighted to publish this article presenting their vision and a use case of federated learning (disclaimer: a guest post doesn't imply contractual links between organizations).

“More sharing gets more data; more data creates more value, but also more worries.”

Using 5G, IoT and smart technologies, organizations, cities and nations are transforming themselves digitally to improve productivity, increase public outreach or reduce poverty. For digitalization to succeed, an organization's technological capability in Digital, Data & Analytics is integral. The key point here is that "Data is at the Centre of it all". Besides office automation and enterprise application systems, digital technologies such as edge devices, mobile apps, websites, smart stores, smart things, robots and cars are generating exponentially growing volumes of data.

According to global consultancy firm Accenture, "Personal data now makes up about 75 percent of all digital data created. In the United States, for example, a typical white-collar worker creates as many as 5,000 megabytes of personal data each day—roughly the size of two high-definition movies. While individual data points may have relatively little value in themselves, the potential for them to be linked - and the insights that companies can gain from making those connections - can be particularly lucrative."

The increasing application of data analytics, machine learning and AI by public and private organizations for actionable insights, decision support and smart or autonomous automation is in turn driving the consumption of more and more data. The global COVID-19 pandemic is accelerating this pace of digitalization, which is no longer just an option but a necessity for business survival.

In many situations today, organizations share digital data with external parties for data analytics, machine learning or AI processing. Broadly speaking, there are three main types of data-sharing situation for an organization: i) with AIaaS operators, ii) with business ecosystem partners (i.e. between collaborating organizations in an ecosystem), and iii) with industry alliance or association members (i.e. between multiple similar organizations in an industry).

Organizations and their end-users (the individuals) understand that data must be shared for a product, service or incentive to be created, consumed and benefited from. However, grave concerns about data breaches, privacy and confidentiality are slowing mass adoption and leaving organizations stuck in a paradox. If organizations do not share, they go 'off-grid' or 'stay in analog', which works against how businesses operate effectively today; yet they want to share only if they have assurance from stakeholders of 'digital trust and safety' that meets global and local regulatory compliance requirements.

To address organizations’ requirement for ‘digital trust and safety’ when sharing data with external AIaaS operators, business ecosystem partners or industry alliance members, we implement Federated Learning(1) (FL), a privacy-preserving, distributed computing approach.

—

(1) Wikipedia defines federated learning (aka collaborative learning) as “a machine learning technique that trains an algorithm across multiple decentralized edge devices or servers holding local data samples, without exchanging their data samples. It enables multiple actors to build a common, robust machine learning model without sharing data, thus addressing critical issues such as data privacy, data security, data access rights and access to heterogeneous data.”

[Figure: Federated Learning approach diagram, Noggin article, June 2020]

In our example of a digital banking ecosystem process, a micro/SME customer of a new digital bank can share her transaction invoices with the bank through a financing app to secure loans with dynamic overdraft limits, or to factor her receivable assets to meet immediate cash-flow needs. The bank feeds the invoice data into an AI model for credit scoring and risk pricing. By comparing across other relevant customers, the AI model can better assess the customer's propensity to default. However, this process exposes the new digital bank to some high-risk problems: unintended data leakage, exposure of sensitive customer information, and fraudulent customer transactions.

With an FL implementation, the customer does not need to transfer her invoice data to the bank for the AI model to be trained centrally. Instead, the AI model is trained locally on the private data inside the customer's server; no raw customer data is shared or exchanged in the process. Only the computed parameters of the local models (from the customers) are privately merged into a global model at the bank, and the parameter deltas of the aggregated global model are then securely sent back and applied to the local models.
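To make the training round above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not Noggin's or Substra's implementation: the model is a simple linear regressor, the aggregation is a dataset-size-weighted average in the style of federated averaging, and the names local_update, aggregate and run_rounds are hypothetical. For brevity, customers send weight deltas up and receive the full global weights back, which is a simplification of the exchange described above.

```python
from typing import List, Tuple

Vector = List[float]
Dataset = List[Tuple[Vector, float]]  # (features, target) pairs held by ONE customer


def local_update(global_w: Vector, data: Dataset,
                 lr: float = 0.05, epochs: int = 5) -> Vector:
    """Customer side: train on private data, return only the weight delta."""
    w = list(global_w)
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            # One stochastic gradient step on the squared error.
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    # The raw (x, y) pairs never leave this function.
    return [wi - gi for wi, gi in zip(w, global_w)]


def aggregate(global_w: Vector, deltas: List[Vector], sizes: List[int]) -> Vector:
    """Bank side: average the deltas, weighted by each customer's sample count."""
    total = sum(sizes)
    avg = [sum(d[i] * n for d, n in zip(deltas, sizes)) / total
           for i in range(len(global_w))]
    return [g + a for g, a in zip(global_w, avg)]


def run_rounds(customers: List[Dataset], dim: int, rounds: int = 50) -> Vector:
    """Repeat: broadcast global weights, collect deltas, aggregate."""
    w = [0.0] * dim
    for _ in range(rounds):
        deltas = [local_update(w, data) for data in customers]
        w = aggregate(w, deltas, [len(d) for d in customers])
    return w
```

In this sketch the bank only ever sees the deltas returned by local_update. Those deltas can still leak information about the underlying data, which is why additional protection of the updates is needed on top of this basic loop.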

Using a secure aggregation protocol, the global model is updated without the bank's internal staff ever seeing the local model data or being able to make reverse inferences from it. The models’ data at rest and in motion are encrypted to protect against data breaches, and any PII in the data is tokenized to de-identify the customers. Auditable, traceable recording of data provenance enhances trust (and thereby credibility) between the bank and its customers against malice and fraud. Online monitoring of the veracity of transaction origination protects against fraudulent customer transactions.
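One common way to build secure aggregation is pairwise additive masking, sketched below. This is an assumption about the general technique, not a description of the protocol Noggin uses. Each pair of customers shares a seed; one adds the derived mask and the other subtracts it, so the masks cancel in the sum and the bank learns only the total. The names encode, decode, mask_update and secure_sum, the fixed-point scale and the modulus are all hypothetical, and Python's random module stands in for the key agreement and cryptographic generator a real protocol would require.

```python
import random
from typing import Dict, List, Tuple

MODULUS = 2 ** 32   # all masked arithmetic is done modulo this value (assumed)
SCALE = 10 ** 4     # fixed-point scale for encoding float updates (assumed)


def encode(update: List[float]) -> List[int]:
    """Convert float updates to fixed-point integers modulo MODULUS."""
    return [round(v * SCALE) % MODULUS for v in update]


def decode(total: List[int]) -> List[float]:
    """Map summed integers back to signed floats."""
    return [(v - MODULUS if v >= MODULUS // 2 else v) / SCALE for v in total]


def mask_update(client_id: int, update: List[float],
                pair_seeds: Dict[Tuple[int, int], int]) -> List[int]:
    """Customer side: hide the update behind masks shared with each peer.

    In a real deployment a customer would hold only the seeds for its own
    pairs; the full dict is passed here purely to keep the sketch short.
    """
    masked = encode(update)
    for (a, b), seed in pair_seeds.items():
        if client_id not in (a, b):
            continue
        rng = random.Random(seed)            # NOT cryptographically secure
        sign = 1 if client_id == a else -1   # one peer adds, the other subtracts
        masked = [(m + sign * rng.randrange(MODULUS)) % MODULUS for m in masked]
    return masked


def secure_sum(masked_updates: List[List[int]]) -> List[float]:
    """Bank side: masks cancel in the sum, revealing only the aggregate."""
    dim = len(masked_updates[0])
    total = [sum(u[i] for u in masked_updates) % MODULUS for i in range(dim)]
    return decode(total)
```

A short usage example: with three customers holding updates [0.5, -1.25], [1.0, 2.0] and [-0.25, 0.75] and one seed per pair, secure_sum over the three masked vectors recovers their element-wise sum, while each individual masked vector looks like random integers to the bank. The sketch does not handle customers dropping out mid-round, which production protocols must address.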

Therefore, the SME customer is assured of her data's safety, privacy and confidentiality with the bank, and the new digital bank's regulatory compliance risk (hence cost) is also reduced because it does not have to handle its customers' transaction data centrally.

About the author…

Noggin is a digital, data & analytics company. We are your privacy-first, digitalization engineer and data miner using our prescriptive Noggin.ai technology.

In Digital Transformation, Noggin.ai enables businesses of all sizes to activate their data by providing a collaborative data mining platform where remote pools of private data plus big data are meaningfully mined for context and actionable insights, while safeguarding the data confidentiality of organizations and data privacy of their end users.

Our mission focus is to democratize digitalization so no organizations are competitively disadvantaged, and no individuals are socially excluded.

This article was written by Chua Lai Chwang, Noggin Co-Founder & CTO, June 2020.

For more information or a demonstration, please do not hesitate to contact me at laichwang@nogginasia.com.

Labelia Labs, 12 June 2020