BigData
Big Data computation models
Description: The goal of this course is to teach students how to develop high-performance data analysis applications in the Spark environment on distributed platforms (clusters and clouds). Distributed file system mechanisms such as HDFS will be studied, as well as Spark’s extended map-reduce programming model and algorithm design on top of Spark “RDDs”, followed by higher-level programming models on top of Spark “Data Frames”, and finally programming models on Clouds. Scaling criteria and metrics will also be studied. Throughout the course, implementations will take place on clusters and in a Cloud, and the developed solutions will be evaluated by the performance obtained on test cases and by their ability to scale.
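As an illustration of this extended map-reduce model, here is a minimal PySpark sketch (a hypothetical example, not taken from the course material; the HDFS path is an assumption for illustration) that counts word occurrences with RDD transformations:

    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")

    # Hypothetical HDFS input file (illustration only)
    lines = sc.textFile("hdfs:///user/student/input.txt")

    counts = (lines
              .flatMap(lambda line: line.split())   # map: split each line into words
              .map(lambda word: (word, 1))          # map: emit (word, 1) pairs
              .reduceByKey(lambda a, b: a + b))     # reduce: sum the counts per word

    print(counts.take(10))   # bring a few (word, count) pairs back to the driver
    sc.stop()

The flatMap/map/reduceByKey chain is richer than the strict two-phase map-reduce of Hadoop, which is the sense in which Spark’s model can be said to extend classic map-reduce.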
Content:
- Emergence of Big Data technologies: motivations, industrial needs, main players
- Hadoop software stack; architecture and operation of its distributed file system (HDFS)
- Spark distributed computing architecture and deployment mechanism
- Spark “RDD” programming model and the algorithmics of Spark’s extended map-reduce
- Spark “Data Frames” programming model applied to graph analysis (GraphX module)
- Architecture and environment for data analysis on the Cloud
- Experiments and performance measurements
- Performance criteria and metrics
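As a reminder of the standard metrics behind the last two content items (a textbook formulation, which may differ in detail from the definitions used in the course): for an application taking time T(1) on one node and T(p) on p nodes, speedup and efficiency are usually defined as

\[ S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}, \]

with ideal (linear) scaling corresponding to S(p) = p, i.e. E(p) = 1.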
Learning outcomes: After this course, students will be able:
Learning Outcome AA1: to design and implement extended map-reduce algorithms that are efficient and scalable on distributed platforms,
Learning Outcome AA2: to analyse the scaling capabilities of an application,
Learning Outcome AA3: to use a cluster or a cloud to achieve large-scale data analysis,
Learning Outcome AA4: to give a concise, synthetic presentation of a data analysis solution designed on top of a "map-reduce" model.
Teaching methods: This course combines three themes relating to “Big Data” computing models: development on PC clusters, computing in the Cloud, and the assessment of how the developed solutions “scale up”.
Course plan in 4 parts:
Part 1: Software architecture and development with Spark RDD on top of HDFS and PC clusters.
Part 2: Criteria and metrics for performance and scaling.
Part 3: Large-scale computation and data analysis on the Cloud.
Part 4: Development with Spark Data Frames on top of HDFS and PC clusters.
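As an illustration of the “Data Frames” model of Part 4, here is a minimal PySpark sketch (a hypothetical example; the input file and its (city, amount) columns are assumptions, not course material):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()

    # Hypothetical CSV file on HDFS with columns (city, amount)
    df = spark.read.csv("hdfs:///user/student/sales.csv",
                        header=True, inferSchema=True)

    # Declarative aggregation: Spark's optimizer plans the distributed stages
    totals = df.groupBy("city").agg(F.sum("amount").alias("total"))
    totals.orderBy(F.desc("total")).show(10)

    spark.stop()

Unlike the RDD version, the computation is expressed declaratively and Spark’s Catalyst optimizer chooses the distributed execution plan, which is the main appeal of the higher-level model.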
Means:
- Teaching team: Stéphane Vialle and Gianluca Quercini (CentraleSupelec), Wilfried Kirschemann (ANEO)
- Development and execution platform: computing clusters of the Data Center for Education (DCE) of the CentraleSupelec Metz campus, and access to a professional cloud
- Development environment: Spark+HDFS on DCE machines, and another environment on Cloud resources
Evaluation methods: Evaluation based on the Labs:
The lab reports will be evaluated (the content and the number of pages of the reports will be constrained, in order to encourage an effort of synthesis and clarity).
In case of unjustified absence from a lab session, a mark of 0 will be applied; in case of justified absence, that lab will not be included in the final mark.
The remedial exam will be a 1-hour written exam, which will constitute 100% of the remedial mark.
Evaluated skills:
- Be operational, responsible, and innovative in the digital world
- Know how to convince
Course supervisor: Stéphane Vialle
Geode ID: 3MD4130