The deadline for the project submission is November 30, 2019.
If you would like feedback in advance, submit your draft at least two weeks before the deadline.
Materials can be found here
Dr. Radu Tudoran (Huawei Munich)
- Lecture, 2 SWS, 09.09.2019 - 14.09.2019; 08:30 - 18:00
- Exercises, 2 SWS (integrated in lecture) + project after the course
Big Data is one of today’s main buzzwords and a primary focus of both academic research and industry. It has emerged as a revolution driven by the continuously increasing volumes of data collected at increasing velocities from various sources: social networks, IoT, scientific simulations, finance, weather forecasting, etc. Tackling these challenges, commonly referred to as the V’s of Big Data, has led to the development of a plethora of technologies and concepts. Batch and stream processing are the two main classes of data processing, handling data offline and in real time, respectively. Building on these two categories, different programming models such as MapReduce and reactive programming have recently been proposed. In addition, multiple technologies have been, and are still being, developed to facilitate processing and data management in Big Data scenarios: HDFS, MapReduce, Spark, Storm, Flink, Kafka, HBase, Hive, etc. Together, these form today’s Hadoop ecosystem.
This course gives an introduction to the technologies and concepts that make up the Hadoop ecosystem, through both lectures and practical sessions. The lectures focus on the theoretical background of the concepts and mechanisms that enable Big Data processing. They present the different programming models and strategies for dealing with large data sets or with data processed on the fly (e.g., MapReduce and MapReduce pipelines, stream topologies, windows, SQL and Hive queries, and interactive queries). The practical sessions aim to familiarize students with the main Big Data processing tools used in industry today, such as MapReduce, HDFS, Spark, Flink, HBase and Kafka.
At the end of the course, students will have a good understanding of feasible approaches to various Big Data scenarios, as well as hands-on experience with some of the most commonly used Hadoop tools.
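To make the MapReduce programming model mentioned above more concrete, here is a minimal single-machine sketch in plain Java 8. It does not use Hadoop or any other framework; the class name `WordCountSketch` and the methods `map`, `reduce` and `run` are hypothetical names chosen for illustration only. It shows the three conceptual steps (map, shuffle/group by key, reduce) on the classic word-count task.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: a local imitation of the MapReduce phases, not Hadoop code.
class WordCountSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: combine all counts that were emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Shuffle: group the emitted values by key (TreeMap keeps keys sorted).
        Map<String, List<Integer>> grouped = lines.stream()
            .flatMap(line -> map(line).stream())
            .collect(Collectors.groupingBy(
                Map.Entry::getKey,
                TreeMap::new,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data is big", "data streams");
        System.out.println(run(lines)); // prints {big=2, data=2, is=1, streams=1}
    }
}
```

In a real cluster the map and reduce calls would run in parallel on different machines and the grouping step would move data over the network; this sketch only mirrors the data flow.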
- Overview of Big Data: what it is, why it has emerged and future trends
- Data models and large-scale infrastructures (cluster, grid, cloud, HPC)
- Batch processing
- Distributed storage system concepts: GFS, HDFS and public cloud storage (Azure Blobs and AWS S3)
- NoSQL storage and distributed message queues
- Google MapReduce programming model and Hadoop MapReduce
- High-level processing tools for offline data: Spark, Hive, Pig, Flink
- Stream processing:
- Stream overview: what it is and how it differs from batch processing
- Stream concepts for data processing: operators, windows, sinks, ETLs
- Project topics
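As a small illustration of the window concept listed under stream processing above, the following plain Java 8 sketch counts events in fixed-size (tumbling) windows. It is not Flink code; `TumblingWindowSketch`, `windowStart` and `countPerWindow` are hypothetical names, and the sketch assumes non-negative timestamps in milliseconds and a finite input array, whereas a real stream processor would emit each result as its window closes.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: tumbling-window counting over an in-memory array of timestamps.
class TumblingWindowSketch {

    // Start of the fixed-size window that contains the given timestamp
    // (assumes timestampMs >= 0 and windowSizeMs > 0).
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    // Counts events per window; keys are window start times in milliseconds.
    static Map<Long, Integer> countPerWindow(long[] timestampsMs, long windowSizeMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMs) {
            counts.merge(windowStart(ts, windowSizeMs), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] events = {1_000, 4_500, 5_000, 9_999, 10_000};
        // Windows of 5 seconds: [0, 5000), [5000, 10000), [10000, 15000)
        System.out.println(countPerWindow(events, 5_000)); // prints {0=2, 5000=2, 10000=1}
    }
}
```

Each event belongs to exactly one window here; sliding or session windows would assign events differently.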
Project (after the course):
- A topic can be chosen from several available ones (sentiment analysis, Twitter trend analysis, internet/social media search...).
- The solution will be built using multiple advanced technologies covered in the lecture.
- The results will be presented together with a demo to show the specific use case.
After completing this course, students will:
- Have an overview of the principles of Big Data analytics
- Have an understanding of the data analytics ecosystem
- Have knowledge about the Big Data technologies most used in industry and research
- Have practical experience with Big Data tools from the Hadoop ecosystem, which gives a competitive advantage when applying for jobs in the field
- Have a reference project in the area of Big Data that they can showcase in the future to demonstrate their practical experience to industry
Literature will be announced during the lecture.
- Compulsory module in the area of practical and technical computer science
- As focus module
- Individual complementary module
- Application module for the complementary area in the Master studies mathematics
Bachelor students must have passed the following modules:
- „Algorithmen und Datenstrukturen”
- „Theoretische Informatik”
Students should bring a laptop with them. Mandatory tools, listed below, should be installed on your laptop before the course starts:
- Java JDK 8 (JRE is not enough)
- Apache Maven 3.6.1: https://maven.apache.org/install.html
- IDE, e.g. IntelliJ or Eclipse (latest stable version)
- Apache Flink 1.9 (stable); useful hints:
More tools will be installed and used during the practical sessions.
Requirements for credit points
- Successful participation in the hands-on exercises
- Submission of a final software project; the description is in the introduction (lecture slides)
Frequency of offering