Re-thinking applications in the edge computing era

posted Mar 17, 2019, 4:51 AM by Enrico Fagnoni   [ updated Mar 18, 2019, 2:18 AM ]

The EU GDPR regulation was a milestone for the information society. More or less, it states that the ownership of data is an inalienable right of the data producer; before GDPR, data ownership was something marketable. Now, to use someone else's data, you always need to obtain permission, which can be revoked at any time. Besides this, IoT requires more and more local data processing, which drives the edge computing paradigm.

Recent specifications like Solid and IPFS promise radical but practical solutions for moving toward a truly distributed data paradigm, trying to restore the original objective of the web: knowledge sharing.

This view, where each person or machine has full control of their own data, contrasts with the centralized data architecture used by the majority of applications.
Many signs tell us that this new vision is gaining consensus, both politically and socially; but today, even when applications claim to be distributed (e.g. Wikipedia), they still adopt a centralized data management architecture.

According to Sir Tim Berners-Lee, "The future is still so much bigger than the past". To be ready, we need to rethink data architectures, allowing applications to use information produced and managed by someone else, people or machines, outside of our control.

Eric Brewer's theorem (also known as the CAP theorem) states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:
  • Consistency: Every read receives the most recent write or an error
  • Availability: Every request receives a (non-error) response – without the guarantee that it contains the most recent write
  • Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
CAP is frequently misunderstood as if one has to choose to abandon one of the three guarantees at all times. In fact, the choice is really between consistency and availability only when a network partition or failure happens; at all other times, no trade-off has to be made. 

But in a truly distributed data model, where datasets are not under your control, a network failure is always a possibility, so you always have to choose.
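The choice described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ReplicatedValue`, `FlakyRemote`, `Unavailable`, and the `prefer` parameter are hypothetical names invented for this example, and a `ConnectionError` stands in for a network partition.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class Unavailable(Exception):
    """Raised when no acceptable answer can be given."""


class FlakyRemote:
    """Hypothetical remote dataset, owned by someone else, that can become unreachable."""

    def __init__(self, value: str) -> None:
        self.value = value
        self.reachable = True

    def __call__(self) -> str:
        if not self.reachable:
            raise ConnectionError("network partition")
        return self.value


@dataclass
class ReplicatedValue:
    fetch_remote: Callable[[], str]   # call to the remote owner of the data
    local_copy: Optional[str] = None  # last value successfully copied locally

    def read(self, prefer: str) -> str:
        """Read the value, preferring 'consistency' or 'availability' during a partition."""
        try:
            value = self.fetch_remote()
            self.local_copy = value  # refresh the local copy while the network works
            return value
        except ConnectionError:
            if prefer == "consistency":
                # Consistency first: refuse to answer rather than risk a stale value.
                raise Unavailable("remote dataset unreachable")
            if self.local_copy is None:
                raise Unavailable("no local copy to fall back on")
            # Availability first: answer with the last known value, which may be stale.
            return self.local_copy
```

With the network up, both preferences return the same value. Once `reachable` is set to `False`, a read that prefers availability returns the last local copy, while a read that prefers consistency raises `Unavailable`.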

Dynamic caching is probably the only practical way to address the dataset distribution problem, but as soon as you replicate data, a trade-off between consistency and latency arises.
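A minimal sketch of this trade-off is a cache with a time-to-live (TTL). It is one simple caching policy among many, not a recommendation; `TTLCache`, `fetch`, `ttl_seconds`, and `clock` are hypothetical names chosen for the example.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Cache remote values for a fixed time: fewer remote calls, possibly stale answers."""

    def __init__(
        self,
        fetch: Callable[[str], str],                  # slow call to the remote dataset
        ttl_seconds: float,                           # how long a cached value is trusted
        clock: Callable[[], float] = time.monotonic,  # injectable for testing
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}
        self.remote_calls = 0  # counts how often the latency of a remote call was paid

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            # Fast path: low latency, but the value may have changed at the source.
            return entry[0]
        # Slow path: pay the remote latency to get a fresh value.
        value = self._fetch(key)
        self.remote_calls += 1
        self._entries[key] = (value, now)
        return value
```

A longer TTL lowers latency and the load on the remote dataset but widens the window in which stale data is served; a TTL of zero always goes to the source and gives up the benefit of caching.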

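In 2010, Daniel J. Abadi from Yale University observed that even when the system is running normally, in the absence of network partitions (the "else" case, E), one has to choose between latency (L) and consistency (C). This is known as the PACELC theorem: if there is a partition (P), choose between availability (A) and consistency (C); else (E), choose between latency (L) and consistency (C).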
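The latency/consistency choice in the normal case can be illustrated with a deliberately simplified model. The assumptions are loud ones: every replica carries a version number, replicas are queried in parallel, and waiting for all of them is used here as a stand-in for a consistent read. `Replica`, `read_fast`, and `read_consistent` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Replica:
    latency_ms: int  # time needed to get an answer from this replica
    version: int     # higher means more recent
    value: str


def read_fast(replicas: List[Replica]) -> Tuple[str, int]:
    """Ask only the nearest replica: lowest latency, possibly a stale value."""
    nearest = min(replicas, key=lambda r: r.latency_ms)
    return nearest.value, nearest.latency_ms


def read_consistent(replicas: List[Replica]) -> Tuple[str, int]:
    """Wait for every replica and keep the newest version: as slow as the slowest replica."""
    newest = max(replicas, key=lambda r: r.version)
    total_latency = max(r.latency_ms for r in replicas)  # requests run in parallel
    return newest.value, total_latency
```

With a nearby replica that is one version behind and a distant replica that is up to date, `read_fast` answers quickly with the old value, while `read_consistent` returns the new value at the cost of the distant replica's latency. No network failure is involved in either case.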

What does all this mean? You must start rethinking applications, letting go of the deterministic illusion that functions return the same outputs when you provide the same inputs.
In fact, the determinism on which much of today's information technology is based should be questioned. We have to start thinking about everything in terms of probability.

That's already happening with search engines (you do not always get the same results for the same query) and with social networks (you do not always see the same list of messages). It is not a feature; it is due to technical constraints. But Facebook, Google, and many other companies have cleverly turned this problem into an opportunity, for instance by prioritizing ads.

If the edge computing paradigm gains momentum, all applications, including corporate ones, will have to address similar issues. For instance, the customer/supplier registry could (or should) be distributed.

Technologies such as IPFS, Linked Data, and RDF graph databases provide practical solutions for caching and querying distributed datasets, helping to resolve inconsistencies and performance issues. But they cannot be considered a drop-in replacement for older technologies: they are tools for designing a new generation of applications that are able to survive in a network of distributed datasets.