Course unit: Data management in large-scale distributed systems

Degrees including this course unit:

Description

Target skills: Data management and knowledge extraction have become core activities of most organizations. The increasing speed at which systems and users generate data has led to many interesting challenges, both in industry and in the research community.

The data management infrastructure is growing fast, leading to the creation of large data centers and federations of data centers. These can no longer be handled exclusively with classic DBMSs. Managing them requires a variety of flexible data models (relational, NoSQL…), consistency semantics and algorithms developed by the database and distributed systems communities. In addition, large-scale systems are more prone to failures and should implement appropriate fault tolerance mechanisms.

The spread of a growing number of sensors and devices in our environment contributes heavily to “Big Data” and to the development of ubiquitous information systems. Data is processed in continuous streams that provide information about users' context, such as their movement patterns and their surroundings. This data can be used to improve the context awareness of mobile applications and to directly target the needs of users without requiring an explicit query.

Combining large amounts of data from different sources offers many opportunities in the domains of data mining and knowledge discovery. Heterogeneous data, once reconciled, can be used to produce new information that adapts to the behavior of users and their context, thus generating a richer and more diverse experience. As more data becomes available, innovative data analysis algorithms are designed to provide new services, focusing on two key aspects: accuracy and scalability.

Program summary: In this course, we will study the fundamentals and research trends of distributed data management, including distributed query evaluation, consistency models and data integration. We will give an overview of large-scale data management systems, peer-to-peer approaches, MapReduce frameworks and NoSQL systems. Ubiquitous data management and crowdsourcing will also be discussed.

Evaluation:

A 2-hour written exam (E) and a report on practical work (P). The final mark in session 1 is computed as 0.7E + 0.3P. The final mark in session 2 is based on a written exam only.

Prerequisites

Fundamentals of DBMS, parallel programming (threads)

Target skills

At the end of the course, students will know how to use Big Data software tools to efficiently store and process large amounts of data, including tools that can operate in real time.

Bibliography

Dean, Jeffrey, and Sanjay Ghemawat. “MapReduce: simplified data processing on large clusters.” Communications of the ACM 51.1 (2008): 107-113.

Zaharia, Matei, et al. “Apache Spark: a unified engine for big data processing.” Communications of the ACM 59.11 (2016): 56-65.

Murray, Derek G., et al. “Naiad: a timely dataflow system.” Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM, 2013.

Lakshman, Avinash, and Prashant Malik. “Cassandra: a decentralized structured storage system.” ACM SIGOPS Operating Systems Review 44.2 (2010): 35-40.

Additional information

Teaching method: In person
Location(s): Grenoble - Domaine universitaire
Language(s): English