Big Data Techniques, Technologies and Trends

News
The deadline for project submission is November 30, 2019.
If you want feedback in advance, submit your draft at least two weeks before the deadline.

Materials can be found here

Lecturer
Dr. Radu Tudoran (Huawei Munich)

Degree program
Master's program in Computer Science

Credit points
5 credit points (LP)

Course
- Lecture, 2 SWS (weekly semester hours), 09.09.2019 - 14.09.2019, 08:30 - 18:00
- Exercises, 2 SWS (integrated in lecture) + project after the course

Description
Big Data is one of today's main buzzwords and a primary focus of both academic research and industry. It has emerged as a revolution driven by the continuously increasing volumes of data collected at increasing velocities from various sources: social networks, IoT, scientific simulations, finance, weather forecasting, etc. Tackling these challenges, commonly referred to as the V’s of Big Data, has led to the development of a plethora of technologies and concepts. Batch and stream processing are the two main classes of data processing, handling data offline and in real time, respectively. Starting from these two categories, different programming models such as MapReduce or reactive programming have been proposed. Additionally, multiple technologies have been, and are still being, developed to facilitate the processing and management of data in Big Data scenarios: HDFS, MapReduce, Spark, Storm, Flink, Kafka, HBase, Hive, etc. Together, these form today's Hadoop ecosystem.

This course aims to give an introduction to the technologies and concepts that make up the Hadoop ecosystem, through both lectures and practical sessions. The lectures focus on the theoretical background of the concepts and mechanisms that enable Big Data processing. They present the different programming models and strategies for dealing with large data sets or with data sets on the fly (e.g., MapReduce and MapReduce pipelines, stream topologies, windows, SQL and Hive queries, and interactive queries). The practical sessions aim to familiarize students with the main Big Data processing tools used in industry today, such as MapReduce, HDFS, Spark, Flink, HBase, and Kafka.

By the end of the course, students will have a good understanding of feasible approaches to various Big Data scenarios, as well as hands-on experience with some of the most commonly used Hadoop tools.
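To make the MapReduce programming model mentioned above more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and is not taken from the course materials; the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions. It only shows the three conceptual steps: map each record to key/value pairs, group the pairs by key, and reduce each group to one result.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in a real system)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of text records."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data streams"]))
```

In a real cluster the map and reduce calls run in parallel on different machines and the shuffle moves data over the network; the sketch keeps everything in one process so the data flow stays visible.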

Lecture content

Overview of Big Data: what it is, why it has emerged and future trends
Data models and large-scale infrastructures (cluster, grid, cloud, HPC)
Batch processing
- Distributed storage system concepts: GFS, HDFS and public cloud storage (Azure Blobs and AWS S3)
- NoSQL storage and distributed message queues
- Google MapReduce programming model and Hadoop MapReduce
- High-level processing tools for offline data: Spark, Hive, Pig, Flink
Stream processing
- Stream overview: what it is and how it differs from batch processing
- Stream concepts for data processing: operators, windows, sinks, ETLs
Project topics
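
As a small illustration of the window concept listed under stream processing above, the following plain Python sketch counts events per key in tumbling (fixed-size, non-overlapping) windows. It is an assumption-laden toy example, not Flink or Storm code: the function name tumbling_window_counts and the (timestamp, key) event format are hypothetical, and the input is a finite list rather than an unbounded stream.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per key in fixed, non-overlapping time windows.

    events: iterable of (timestamp_seconds, key) tuples (hypothetical format)
    window_size: window length in seconds
    Returns a dict mapping (window_start, key) to the number of events.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Each event belongs to exactly one window, identified by its start time.
        window_start = (timestamp // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [(1, "a"), (4, "a"), (7, "b"), (11, "a")]
    print(tumbling_window_counts(sample, window_size=5))
```

A real stream engine additionally has to decide when a window is complete and how to treat late or out-of-order events; the sketch ignores both and only shows how events are assigned to windows.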

Project (after the course):

A topic can be chosen from several available ones (sentiment analysis, Twitter trends analysis, internet/social media search...).
The solution will be built using several of the advanced technologies covered in the lecture.
The results will be presented together with a demo that shows the specific use case.

Study results/competences

After completing this course, students will:

Have an overview of the principles of Big Data analytics
Have an understanding of the data analytics ecosystem
Have knowledge about the Big Data technologies most used in industry and research
Have practical experience with Big Data tools from the Hadoop ecosystem, which gives a competitive advantage when applying for jobs in the field
Have a reference project in the area of Big Data that they can showcase in the future to demonstrate their practical experience to industry

Literature
Literature will be given during the lecture

Module usage

Compulsory module in the area of practical and technical computer science
Focus module
Individual complementary module
Application module for the complementary area of the Master's program in Mathematics

Prerequisites

Bachelor students must have passed the following modules:

„Programmierung“ (Programming)
„Rechnerarchitektur“ (Computer Architecture)
„Algorithmen und Datenstrukturen“ (Algorithms and Data Structures)
„Theoretische Informatik“ (Theoretical Computer Science)

Students should bring a laptop. The mandatory tools listed below must be installed on it before the course starts:

Java JDK 8 (JRE is not enough)
Apache Maven 3.6.1: https://maven.apache.org/install.html
IDE, e.g. IntelliJ or Eclipse (latest stable version)
Apache Flink 1.9 (stable); useful hints:
https://ci.apache.org/projects/flink/flink-docs-stable/quickstart/setup_quickstart.html
https://ci.apache.org/projects/flink/flink-docs-stable/quickstart/java_api_quickstart.html
https://ci.apache.org/projects/flink/flink-docs-release-1.6/internals/ide_setup.html

More tools will be installed and used during the practical sessions.

Requirements for credit points

Successful participation in the hands-on exercises
Submission of a final software project (described in the introduction in the lecture slides)

Frequency of offering
Irregular

Responsible: