BIG DATA: Volume architecture
Duration: 38h
Difficulty: 4/5
Price: 1495€

Prerequisites:
Mastery of Python and advanced programming.
Database management.
Skills acquired at the end of the course:
Load and process data in HDFS.
Transform this data with Hadoop Streaming or PySpark (a minimal sketch follows this list).
Optimize queries on structured data in Apache Hive.
Train Machine Learning algorithms on a cluster of machines with PySpark.
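As a taste of the first two skills, here is a minimal PySpark sketch that reads a file from HDFS and runs a word count. The HDFS path and application name are illustrative assumptions, not course material.

from pyspark.sql import SparkSession

# Illustrative only: the HDFS path below is a hypothetical example.
spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

# Read a text file stored in HDFS as an RDD of lines.
lines = spark.sparkContext.textFile("hdfs:///data/logs.txt")

# Transform: split lines into words and count occurrences in parallel.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.take(10))
spark.stop()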
The curriculum:
Introduction to Apache Hadoop (15h)
- Theory of distributed architectures
- Introduction to the MapReduce paradigm
- File management with HDFS
- Distributed computing with Hadoop MapReduce
- Distributed computing with Hadoop Streaming (see the mapper/reducer sketch below)
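To give a flavour of the Hadoop Streaming model, here is a minimal word-count pair of scripts: Hadoop Streaming pipes each input split to the mapper on stdin and feeds the sorted mapper output to the reducer. The file names (mapper.py, reducer.py) are illustrative, not from the course.

#!/usr/bin/env python3
# mapper.py -- emits one "word<TAB>1" pair per word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- sums the counts per word; Hadoop sorts mapper output by key,
# so all lines for a given word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")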
Introduction to PySpark (20h)
- How Apache Spark works internally
- Manipulating unstructured data with Spark
- Manipulating structured data with SparkSQL
- Machine Learning with SparkML (see the sketch below)
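As a small illustration of the SparkML topic, here is a minimal sketch that trains a logistic regression on a toy DataFrame; the column names and data are invented for the example.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("sparkml-demo").getOrCreate()

# Toy data: the feature and label values are illustrative assumptions.
df = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.1, 0.9, 1.0), (0.1, 1.2, 0.0)],
    ["f1", "f2", "label"],
)

# Spark ML estimators expect all features packed into one vector column.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
model = LogisticRegression(maxIter=10).fit(assembler.transform(df))
print(model.coefficients)
spark.stop()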
Introduction to Apache Hive (10h)
- How Apache Hive works internally and how it fits into the Hadoop ecosystem
- Reading, ingesting, modifying, and deleting data with HQL
- Optimizing data storage with partitioning (see the sketch below)
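To illustrate the partitioning topic, here is a minimal sketch that creates a partitioned Hive table through PySpark's Hive support; the table and column names are hypothetical, and a Hive-enabled Spark installation is assumed.

from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hive-demo")
         .enableHiveSupport()
         .getOrCreate())

# Partitioning stores each sale_date in its own directory, so queries that
# filter on the partition column scan only the matching partitions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (amount DOUBLE)
    PARTITIONED BY (sale_date STRING)
""")
spark.sql("SELECT SUM(amount) FROM sales WHERE sale_date = '2023-01-01'").show()
spark.stop()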
Upcoming dates:
Bootcamp format
October 6
November 9
December 9
Continuous format
October 22
November 30
Would you like to build a tailor-made course adapted to your needs?
A member of our team can help you!