
Data-intensive Systems DAT500

The course provides a strong basis in the administrative, programming, and algorithm design aspects of data-intensive systems.


Course description for study year 2022-2023. Please note that changes may occur.

Facts
Course code

DAT500

Version

1

Credits (ECTS)

10

Semester tuition start

Spring

Number of semesters

1

Exam semester

Spring

Language of instruction

English

Content

The emergence of Big Data and Data-intensive Systems as specialized fields in computing has motivated the development of new techniques and technologies for extracting knowledge from large datasets. Popular interest in data-intensive systems has grown since Hadoop was conceived in 2005. Over time, this has resulted in a collection of technologies, methodologies, and practices that cover the complete data lifecycle.

This course is a first step towards a variety of roles related to data-intensive systems. The core tasks in these roles that we will address are: working in the different roles of a data team; system administration (setting up, containerizing, testing, and benchmarking a cluster); low-level algorithm design and implementation (direct implementation of MapReduce jobs); high-level algorithm design and implementation (using a data processing framework, e.g. Spark SQL or MLlib); data modelling and algorithm optimisation; reliable infrastructure design for data collection and processing; advocating technology application in both technical and non-technical settings; and providing introductory training to coworkers.

Learning outcome

Knowledge:

  • Characterize Hadoop job tracker, task tracker, scheduling issues, communications, and resource management
  • Describe elements of Hadoop ecosystem and identify their applicability
  • Describe and compare RDBMS, NoSQL databases, data warehouses, unstructured big data, and keyed files, and show how to apply them to typical data processing problems
  • Understand areas of application for a variety of Cloud Computing architectures and technologies

 

Skills:

  • Assume various roles in a data team
  • Design, construct, test, and benchmark a small data processing setup (based on Hadoop, Spark, OpenStack)
  • Analyze real-life problems and propose suitable solutions
  • Construct and optimize programs based directly on the MapReduce paradigm for typical problems
  • Construct and optimize programs based on high-level tools (for the MapReduce paradigm) for typical problems

 

General qualifications:

  • Evaluate, communicate and defend a data-intensive solution w.r.t. relevant criteria.
Required prerequisite knowledge
None
Recommended prerequisites
DAT220 Database Systems, DAT320 Operating Systems and Systems Programming
Exam
Form of assessment: Project work with oral presentation
Weight: 1/1
Duration: -
Marks: Letter grades
Aid: All

Project with oral presentation. The project is completed in groups. Both parts must be completed before a final grade is given. Each group member can receive a different grade based on their performance in the group during the oral presentation. A student who fails the project must retake this part the next time the course is taught.

Coursework requirements
Four assignments

Students start with four mandatory assignments that cover programming and system administration. The first three assignments are to be completed individually. All mandatory assignments must be passed by their deadlines for a student to be allowed to start the project. Passed assignments give access to the project only in the current semester.

Mandatory lab assignments must be completed at the times and in the groups that are assigned and published. Absence due to illness or other reasons must be communicated to the laboratory personnel as soon as possible. Completion of lab assignments at other times cannot be expected unless arranged in advance with the laboratory personnel.

Course teacher(s)
Course coordinator: Tomasz Wiktorski
Coordinator laboratory exercises: Jayachander Surbiryala
Head of Department: Tom Ryen
Method of work
The work will consist of 6 hours per week of lectures, scheduled laboratory sessions, and supervised group work. Students are expected to spend an additional 6-8 hours a week on self-study, group discussions, and development work (open laboratory).
Open for
Computer Science, Master of Science Degree Programme
Industrial Automation and Signal Processing - Master's Degree Programme - 5 year
Exchange programme at Faculty of Science and Technology
Course assessment
Form and/or discussion.
Literature
Search for literature in Leganto