Data-intensive Systems and Algorithms (DAT535)
The course provides a basis in administrative, programming, and design aspects of data-intensive systems.
Course description for study year 2023-2024. Please note that changes may occur.
Course code
DAT535
Version
1
Credits (ECTS)
5
Semester tuition start
Autumn
Number of semesters
1
Exam semester
Autumn
Language of instruction
English
Content
The emergence of Big Data and data-intensive systems as specialized fields in computing has motivated the development of new techniques and technologies for extracting knowledge from large datasets. Popular interest in data-intensive systems has grown since Hadoop was conceived in 2005. Over time, this has resulted in a collection of technologies, methodologies, and practices that cover the complete data lifecycle.
This course is a first step towards a variety of roles related to data-intensive systems. The core topics and tasks of these roles that we will address are: roles in a data team; system administration; low-level algorithm design and implementation (direct implementation of MapReduce jobs); high-level algorithm design and implementation (using one of the data processing frameworks, e.g. SparkSQL or MLlib); data modelling and algorithm optimisation; advocating technology application in both technical and non-technical settings; and providing introductory training to coworkers.
Learning outcome
Knowledge
- Characterize Hadoop architecture incl. job tracker, task tracker, scheduling issues, communications, and resource management, etc.
- Characterize Spark architecture incl. context, cluster manager, worker node, executor, etc.
- Describe elements of Hadoop/Spark ecosystem and identify their applicability
- Describe and compare RDBMS, NoSQL databases, data warehouses, unstructured big data, and keyed files, and show how to apply them to typical data processing problems
Skills
- Assume various roles in a data team
- Set up a local test environment (based on containers or VMs)
- Use and reconfigure a remote data processing setup (based on Hadoop/Spark, OpenStack, or other Cloud setup)
- Analyze real-life problems and propose suitable solutions
- Construct and optimize programs based directly on the MapReduce paradigm for typical problems
- Construct and optimize programs based on high-level tools (for the MapReduce paradigm) for typical problems
General qualifications
- Evaluate, communicate and defend a data-intensive solution w.r.t. relevant criteria
Required prerequisite knowledge
Recommended prerequisites
Exam
Form of assessment | Weight | Duration | Marks | Aid |
---|---|---|---|---|
Project with oral presentation | 1/1 | | Letter grades | |
Project with oral presentation. The project is completed in groups. Both parts must be completed before a final grade is given. All group members must participate in the oral hearing. A student who fails the project must retake it the next time the course is given.
Coursework requirements
Three assignments
Students start with three mandatory assignments that cover programming and system administration. Assignments are to be completed individually. All mandatory assignments must be passed by the deadline for the student to be allowed to start the project. Passed mandatory assignments give access to the project only in the current semester.
Mandatory lab assignments must be completed at the times and in the groups that are assigned and published. Absence due to illness or other reasons must be reported to the laboratory personnel as soon as possible. Students cannot expect arrangements for completing lab assignments at other times unless this has been agreed with the laboratory personnel in advance.
Course teacher(s)
Course coordinator:
Tomasz Wiktorski
Laboratory Engineer:
Jayachander Surbiryala
Head of Department:
Tom Ryen
Method of work
Overlapping courses
Course | Reduction (SP) |
---|---|
Data-intensive Systems (DAT500_1) | 5 |