This course teaches students to develop high-performance data analysis applications using the Spark environment on distributed platforms (clusters and clouds). It covers distributed file systems such as HDFS, the programming model and algorithmics of Spark's extended map-reduce, and the criteria and metrics used to assess scalability. Many experiments are conducted during labs on clusters and clouds, and the solutions designed and implemented are evaluated according to the performance achieved on use cases and their ability to scale.
- Emergence of Big Data technologies: motivations, industrial needs, main players
- Hadoop software stack, architecture and operation of its distributed file system (HDFS)
- Spark distributed computing architecture and deployment mechanism
- Spark programming model, Spark’s extended map-reduce algorithmics
- Optimization of algorithms and codes on distributed architectures
- Data analysis architecture and environment on the Cloud
- Experiments and performance measures
- Performance criteria and metrics
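
As an illustration of the kind of scaling metrics evaluated in the labs, the sketch below computes two commonly used ones: speedup (run time on one node divided by run time on n nodes) and parallel efficiency (speedup divided by n). The course description does not specify which metrics are used, so this choice is an assumption, and the timings are hypothetical values invented for the example, not measurements.

```python
def speedup(t_ref, t_n):
    """Speedup: reference (1-node) run time divided by run time on n nodes."""
    return t_ref / t_n


def efficiency(t_ref, t_n, n_nodes):
    """Parallel efficiency: speedup divided by the number of nodes."""
    return speedup(t_ref, t_n) / n_nodes


def scaling_report(timings):
    """Map each node count to (speedup, efficiency), using the 1-node run as reference."""
    t_ref = timings[1]
    return {
        n: (speedup(t_ref, t), efficiency(t_ref, t, n))
        for n, t in sorted(timings.items())
    }


# Hypothetical run times in seconds for the same job on 1, 2, 4 and 8 nodes
# (illustrative values only, not measured on any real cluster).
timings = {1: 120.0, 2: 64.0, 4: 36.0, 8: 24.0}

report = scaling_report(timings)
for n, (s, e) in report.items():
    print(f"{n} nodes: speedup {s:.2f}, efficiency {e:.2f}")
```

An efficiency close to 1 means the application scales almost ideally; a value that drops as nodes are added suggests that communication or sequential parts are starting to dominate.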