Course

Data Engineering (DAT535)

Facts

Course code DAT535

Credits (ECTS) 5

Semester tuition start Autumn

Language of instruction English

Number of semesters 1

Exam semester Autumn

Introduction

The course provides a basis in data management, performance optimization, integration, and the ethical aspects of developing data-driven solutions.

Content

The emergence of Big Data and data-intensive systems as specialized fields in computing has motivated the development of new techniques and technologies for extracting knowledge from large datasets. Popular interest in data-intensive systems has grown since Hadoop was conceived in 2005. Over time, this has resulted in a collection of technologies, methodologies, and practices that cover the complete data lifecycle.

This course is a first step toward a variety of roles related to data-intensive systems. The core topics and tasks we will address are: roles in a data team; data acquisition and integration (using files, APIs, etc.); data cleaning and augmentation (often by directly implementing MapReduce jobs); data analytics and ML (often using a data processing framework, e.g. Spark SQL or MLlib); advocating the application of technology in both technical and non-technical settings; and providing introductory training to coworkers.
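As an illustration of what directly implementing a MapReduce job involves, the following is a minimal sketch in plain Python. It is not course material: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and the shuffle step that a framework such as Hadoop performs across machines is simulated here in memory.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big systems"])
```

The map and reduce functions are pure and operate on one record or one key at a time, which is what allows a real framework to distribute them across many workers.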

Learning outcome

Knowledge

  • Understanding of Medallion Architecture: Students will gain a comprehensive understanding of the Medallion Architecture, including its layers (bronze, silver, and gold) and how it supports data processing and analytics.
  • Apache Spark Fundamentals: Students will learn the core concepts of Apache Spark, including its architecture, components, and how it handles big data processing.
  • Data Management and Governance: Knowledge of data management principles, data governance, and best practices for ensuring data quality and integrity.
  • Big Data Ecosystem: Familiarity with the broader big data ecosystem, including tools and technologies that complement Apache Spark, such as Hadoop, Kafka, Delta Lake, and NoSQL databases.
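
The bronze, silver, and gold layers of the Medallion Architecture mentioned above can be sketched in a few lines of plain Python. This is an illustrative assumption rather than course material: the record fields (`city`, `amount`) and the helper names (`to_silver`, `to_gold`) are hypothetical, and in practice each layer would typically be stored as a table (for example in Delta Lake) and processed with Spark.

```python
def to_silver(bronze_rows):
    """Silver layer: keep only valid records and normalise their fields."""
    silver = []
    for row in bronze_rows:
        try:
            amount = float(row["amount"])
            city = row["city"].strip().title()
        except (KeyError, TypeError, ValueError, AttributeError):
            # Skip records that are incomplete or cannot be parsed.
            continue
        silver.append({"city": city, "amount": amount})
    return silver


def to_gold(silver_rows):
    """Gold layer: aggregate cleaned records into an analytics-ready summary."""
    totals = {}
    for row in silver_rows:
        totals[row["city"]] = totals.get(row["city"], 0.0) + row["amount"]
    return totals


# Bronze layer: raw records kept exactly as they were ingested.
bronze = [
    {"city": " stavanger ", "amount": "10.5"},
    {"city": "Oslo", "amount": "n/a"},
    {"city": "STAVANGER", "amount": "4.5"},
    {"city": "oslo", "amount": "3"},
]

silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping the raw bronze data unchanged means the silver and gold layers can be recomputed whenever the cleaning or aggregation rules change.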

Skills

  • Data Processing and Transformation: Proficiency in using Apache Spark for data processing tasks, including batch and stream processing, data cleaning, and transformation.
  • Performance Tuning: Skills in optimizing Apache Spark jobs for performance, including resource management, partitioning, and tuning Spark configurations.
  • Data Integration: Competence in integrating data from various sources and formats into a unified data platform using Medallion Architecture principles.
  • Problem-Solving: Ability to troubleshoot and resolve issues related to data pipelines, data quality, and performance bottlenecks.

General qualifications

  • Collaboration and Communication: Effective communication and collaboration skills to work with cross-functional teams implementing data-intensive solutions.
  • Ethical Considerations: Awareness of ethical considerations in data engineering, including data privacy, security, and responsible data usage.

Required prerequisite knowledge

Python programming

Recommended prerequisites

Database Systems (DAT220), Operating Systems and Systems Programming (DAT320), Cloud Computing Technologies (DAT515)

Bash programming

Administration of Cloud and container-based environments

Databases, SQL

Exam

Project

Weight 1/1

Duration 6 Weeks

Marks Letter grades

Aid All

The project is completed in groups. It lasts for 8 weeks, in addition to the obligatory labs that provide the basis for the project.

When artificial intelligence is used in assessments, the student must document this by completing and submitting the self-declaration form. If you submit text, calculations, etc. that are directly copied from an AI writing tool, this will be regarded as presenting the work of others as your own and therefore constitutes cheating.

No re-sit opportunities are offered for project assignments. Students who do not pass the project can retake it the next time the course is held.

Coursework requirements

Mandatory Assignments, Oral presentation

Three assignments

Students start with 3 mandatory assignments that involve programming and system administration. Assignments are to be completed individually. All mandatory assignments must be passed by the deadline for the student to be eligible to start the project. Passed assignments give access to the project only in the current semester.

Completion of mandatory lab assignments is to be made at the times and in the groups that are assigned and published. Absence due to illness or for other reasons must be communicated as soon as possible to the laboratory personnel. One cannot expect that provisions for completion of the lab assignments at other times are made unless prior arrangements with the laboratory personnel have been agreed upon.

All group members must participate in the project presentation.

Method of work

The work will consist of 4 hours per week of lectures, scheduled laboratory sessions, and supervised group work, starting in September. Students are expected to spend an additional 6-8 hours a week on self-study, group discussions, and development work (open laboratory).

Overlapping courses

Course Reduction (SP)
Data-intensive Systems (DAT500_1), Data Engineering (DAT535_1) 5

Open for

Admission to Single Courses at Master Level at the Faculty of Science and Technology
Data Science
Computer Science
Computer Science - Master of Science Degree Programme, Part-Time
Exchange programme at The Faculty of Science and Technology

Admission requirements

Must meet the admission requirements of one of the study programmes the course is open for.

Course assessment

The faculty decides whether early dialogue will be held in all courses or in selected groups of courses. The aim is to collect student feedback for improvements during the semester. In addition, a digital course evaluation must be conducted at least every three years to gather students’ experiences.
The course description is retrieved from FS (Felles studentsystem). Version 1