GDD - Distributed Data Management
Distributed infrastructures are at the heart of the digital revolution. These infrastructures are evolving and becoming more complex with new developments around edge computing, fog computing, federations and the Internet of Things.
Each evolution brings new challenges in writing distributed programs, managing data, and ensuring security and privacy.
GDD aims to propose models and principles for writing federated programs on federated infrastructures. A federation interconnects a large number of autonomous, heterogeneous and socially organized participants. These participants agree on a minimum set of services giving access to their resources, e.g. cloud federations with a common programming interface, web data federations with SPARQL servers, or distributed social networks with publish-subscribe middleware.
Federations offer an execution environment in which a federated program accesses federated resources, e.g. CPU, data, disk. A federated program can be seen as a federated query over linked data, a Datalog program distributed over a data federation, or a program using a federated programming interface in a cloud federation. Federations are defined and studied by several scientific communities, with different motivations and different approaches. In databases, they are studied as federated databases or collaborative data sharing systems. In cloud computing, they are studied as cloud federations or virtual organizations. In the semantic web, they appear as the open federation of linked data. In distributed systems and distributed collaborative systems, they are represented by distributed social networks; in networks, by distributed CDNs or identity federations; in ubiquitous computing, by federation managers.
The motivations for federated systems differ across communities:
- In the cloud community, federations can be economically efficient, reduce energy consumption and reduce latency by using geo-replication of data.
- Distributed social networks promote federations as an infrastructure that is better able to preserve privacy.
- From a database and web data perspective, federations allow large-scale data integration.
Many approaches, architectures and algorithms have been proposed by the different scientific communities to build federations: language approaches such as Bud or WebdamLog, approaches that standardize programming interfaces for the cloud, and approaches based on federated queries, on epidemic propagation systems or on distributed data structures.
Whatever the approach considered, federations face recurring scientific obstacles: ease of programming versus the impossibility results of the CAP theorem, ease of access to data versus the semantic heterogeneity of data, and autonomy of participants versus access to secondary data.
The scientific project takes federations as its object of study. We have identified three scientific challenges that must be addressed in order to write and deploy federated programs on federated infrastructures (see Figure):
- Challenge 1: Data structures and consistency for federations. Each infrastructure requires its own distributed data structures and consistency criteria, e.g. P2P networks brought DHTs, and the cloud has put forward eventual consistency. The aim is to develop data structures and consistency models adapted to federations.
- Challenge 2: Collaborative data sharing systems in federations. Federated programs must be able to update and query data distributed across the federation in an efficient manner. This challenge will build on the results achieved by Challenge 1.
- Challenge 3: Confidentiality and security in federations. The challenge is to be able to detect attacks on a federation of autonomous participants and to allow participants to monitor the actual use of the data made available.
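To make the notion of eventual consistency mentioned under Challenge 1 more concrete, here is a minimal sketch of a replicated grow-only counter whose replicas converge once they have exchanged state. It is only an illustration under our own assumptions, not a data structure proposed by GDD: the class `GCounter` and the participant names `site_a` and `site_b` are hypothetical, and the sketch ignores networking, failures and security.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter: one non-decreasing slot per participant (illustrative)."""
    counts: dict[str, int] = field(default_factory=dict)

    def increment(self, participant: str, amount: int = 1) -> None:
        # Each participant only ever increases its own slot.
        self.counts[participant] = self.counts.get(participant, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Element-wise maximum is commutative, associative and idempotent,
        # so replicas converge whatever the order or duplication of exchanges.
        for participant, value in other.counts.items():
            self.counts[participant] = max(self.counts.get(participant, 0), value)

    def value(self) -> int:
        # The counter value is the sum of all participants' slots.
        return sum(self.counts.values())


# Two hypothetical participants update their local replicas independently.
site_a, site_b = GCounter(), GCounter()
site_a.increment("site_a", 2)
site_b.increment("site_b", 3)

# After exchanging state in both directions, the replicas agree.
site_a.merge(site_b)
site_b.merge(site_a)
assert site_a.value() == site_b.value() == 5
```

The design choice illustrated here is that each participant keeps full autonomy over its own updates, and agreement is reached only through a merge function that tolerates reordered or repeated messages.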