Deadline for project submission: Dec. 4, 2017.
All materials can be found here.
This was a block course: 25.09.2017 - 30.09.2017
Dr. Radu Tudoran (Huawei Munich)
- Lecture „Big Data Techniques, Technologies and Trends“, 2 SWS
- Hands-on exercises, 2 SWS
Course description: Big Data is one of the main buzzwords today, a primary focus both for academic research and for industry. Big Data has emerged as a revolution driven by the continuously increasing volumes of data that are being collected at increasing velocities from various sources: social networks, IoT, scientific simulations, finance, weather forecasting, etc. Tackling such challenges, commonly referred to as the V's of Big Data, has led to the development of a plethora of technologies and concepts. Batch and stream processing are the two main classes of data processing, handling data either offline or in real time. Starting from these two categories, different programming models such as MapReduce or reactive programming have recently been proposed. Additionally, multiple technologies have been, and continue to be, developed to facilitate the processing and data management of Big Data scenarios: HDFS, MapReduce, Spark, Storm, Flink, Kafka, HBase, Hive, etc. Together, these form today's Hadoop ecosystem. This course aims to give an introduction to the technologies and concepts that make up the Hadoop ecosystem, through both lectures and practical sessions. The lectures focus on the theoretical background of the concepts and mechanisms that enable Big Data processing. The course will present the different programming models and strategies for dealing with large data sets, at rest or on the fly (e.g., MapReduce and MapReduce pipelines, stream topologies, windows, SQL and Hive queries, and interactive queries). The practical sessions aim to familiarize students with the main Big Data processing tools used in industry today, such as MapReduce, HDFS, Spark, Flink, HBase, and Kafka.
At the end of the course the students will have a good understanding of feasible approaches to address various Big Data scenarios as well as hands-on experience with some of the most commonly used Hadoop tools.
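To give a flavor of the MapReduce programming model mentioned above, here is a minimal sketch of the classic word-count example in plain Java, without a Hadoop cluster (the class name and input data are illustrative only): the map phase emits a record per word, the shuffle groups records by key, and the reduce phase sums the counts per word.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the MapReduce word-count pattern, simulated locally with
// Java streams instead of a distributed Hadoop job.
public class WordCount {
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                // "map" phase: split each input line into individual words
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(w -> !w.isEmpty())
                // "shuffle" + "reduce" phase: group by word, sum occurrences
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("big data big ideas", "data pipelines");
        System.out.println(wordCount(input));
    }
}
```

In a real Hadoop job the map and reduce functions run on different cluster nodes and the shuffle moves data over the network; the local simulation only illustrates the data flow.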
Course Topics to be addressed:
- Overview of Big Data: what it is, why it has emerged and future trends
- Data models and large scale infrastructures (cluster, grid, cloud, HPC)
- Batch processing
- Distributed storage system concepts: GFS, HDFS and public cloud storage (Azure Blobs and AWS S3)
- NoSQL storage and distributed message queues
- Google MapReduce programming model and Hadoop MapReduce
- High level semantics processing tools for offline data: Spark, Hive, Pig, Flink
- Stream processing:
- Stream overview: what it is and what the main differences are with respect to batch processing
- Stream concepts for data processing: operators, windows, sinks, ETLs
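As an illustration of the windowing concept listed above, the sketch below assigns timestamped events to tumbling (fixed-size, non-overlapping) windows and counts events per window. Engines such as Flink provide this as a built-in operator; here the assignment is done by hand, and the class name, window size and sample timestamps are assumptions for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of a tumbling count window: each event timestamp (in ms) is
// assigned to the start of the fixed-size window it falls into.
public class TumblingWindow {
    public static SortedMap<Long, Integer> countPerWindow(List<Long> timestamps, long windowMs) {
        SortedMap<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestamps) {
            long windowStart = (ts / windowMs) * windowMs; // window this event belongs to
            counts.merge(windowStart, 1, Integer::sum);    // increment that window's count
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Long> events = Arrays.asList(1_000L, 4_500L, 12_000L, 19_999L, 20_000L);
        // 10 s windows: [0,10000) has 2 events, [10000,20000) has 2, [20000,30000) has 1
        System.out.println(countPerWindow(events, 10_000L));
    }
}
```

In a real stream processor the events arrive continuously and each window's result is emitted as soon as the window closes, rather than after reading a finite list.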
- Project topics
- Project: A topic will be chosen from multiple available ones (sentiment analysis, Twitter trends analysis, internet/social media search...)
- Solution: A software solution will be designed, built and delivered as the outcome of the project.
- Technology: The solution will be built using multiple advanced technologies covered in the course.
- Evaluation: The solution design will be presented together with a demo showing the specific use case.
After completing this course, students will:
- Have an overview of the principles of Big Data analytics
- Have an understanding of the data analytics ecosystem
- Have knowledge about the Big Data technologies most used in industry and research
- Have practical experience with Big Data tools from the Hadoop ecosystem, which gives a competitive advantage when applying for jobs in the domain
- Have a reference project in the area of Big Data that they can showcase in the future to demonstrate their practical experience to industry
Literature will be given during the lecture.
- Compulsory module in the area of practical and technical computer science
- As focus module
- Individual complementary module
- Application module for the complementary area in the Master studies mathematics
Bachelor students must have passed the following modules:
- „Algorithmen und Datenstrukturen”
- „Theoretische Informatik”
Students should bring a laptop with them. The mandatory tools listed below must be installed on the laptop before the course starts:
- Java JDK 7 (or higher; JRE is not enough)
- Apache Maven 3.x: https://maven.apache.org/install.html
- Eclipse Scala IDE 4.0.0: scala-ide.org/download/sdk.html
More tools will be installed and used during the practical sessions.
Requirements for credit points
- Successful participation in the hands-on exercises
- Submission of a final software project -> description
- Presentation of one's own project and answering questions about it
Frequency of offering