Module specification

CS7079 - Data Warehousing and Big Data (2017/18)

Module approved to run in 2017/18
Module title Data Warehousing and Big Data
Module level Masters (07)
Credit rating for module 20
School School of Computing and Digital Media
Total study hours 200
 
152 hours Guided independent study
48 hours Scheduled learning & teaching activities
Assessment components
Type                 Weighting   Qualifying mark   Description
Coursework           60%                           Coursework (2,500 words + artefact)
Unseen Examination   40%                           Unseen exam (2 hours)
Running in 2017/18

(Please note that module timeslots are subject to change)
Period Campus Day Time Module Leader
Autumn semester North Thursday Morning

Module summary

The module initially aims to provide students with a sound understanding of database concepts and database management systems (DBMSs) in the context of modern enterprise-level database development. Students will learn about data models, internal data representation and system architecture, as well as data processing using a DBMS: the data dictionary, data definition and data manipulation using SQL. Subsequently, students will gain an in-depth understanding of data warehousing, covering its concepts and analytical foundations as well as data warehouse development, which includes system architecture, data transformation and data analysis. Students will be able to grasp the different issues faced in real-world data warehouse application development. Most importantly, the module presents the theory of Big Data management based on the Apache Hadoop platform (HDFS). This will involve hands-on sessions designed for data analysts, business intelligence specialists, developers, administrators or anyone who wishes to learn how to process and manage massive, complex data sets and infer hidden knowledge from them. It applies to a wide range of areas such as engineering, transportation, finance, health sciences, security, marketing and customer insight.
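As a flavour of the data definition and manipulation work covered in the early weeks, the following minimal sketch runs standard SQL through Python's built-in sqlite3 module; the `sales` table and its columns are invented purely for illustration and are not part of the module materials:

```python
import sqlite3

# In-memory database so the example is fully self-contained
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Data definition (DDL): create a hypothetical sales table
cur.execute("""
    CREATE TABLE sales (
        id INTEGER PRIMARY KEY,
        region TEXT NOT NULL,
        amount REAL NOT NULL
    )
""")

# Data manipulation (DML): insert rows, then aggregate them
cur.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("North", 120.0), ("North", 80.0), ("South", 200.0)],
)
cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region")
print(cur.fetchall())  # [('North', 200.0), ('South', 200.0)]
```

The same DDL and DML statements would run largely unchanged against the enterprise-level DBMSs used in the module's laboratory sessions.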

Prior Knowledge: Relational databases

Module aims

The module will primarily provide students with a good understanding of database concepts and database management systems in the context of modern enterprise-level database development using real-life applications. The module addresses numerous issues faced in real-world data warehouse application development and aims to familiarise students with related technological trends and developments in the field. Most importantly, this module will provide students with a broad introduction to Big Data technologies, including Hadoop-based architectures, data ingestion, data transformation, data management, analytics and predictive analytics for manipulating data and discovering insight. Topics include Hadoop, HDFS, MapReduce, Spark, Sqoop, Hive, Pig and MLlib. The module will have significant hands-on sessions and draw on numerous case studies and applications.

Syllabus

• Introduction to database models and system architecture
• Data processing using a DBMS; data definition and manipulation using SQL
• Introduction to data warehouse concepts and analytical foundations
• Data warehouse development: system architecture and data transformation; investigation of techniques for distributing and mining data
• An introduction to the Big Data technology stack, emerging trends, and use cases where Big Data outperforms the traditional data warehouse
• An overview of the functional components of the Big Data technology stack, including open-source tools such as Hadoop, HDFS, MapReduce, YARN, Spark, Storm and Hive for massively parallel on-disk data processing
• An overview of batch and real-time data ingestion patterns using Apache Flume and Kafka, data transformation techniques, and generation of summary statistics using Apache Spark
• Data analytics on the Hadoop platform using Apache Spark for data analysis on HDFS
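The MapReduce model that underpins several of the syllabus items above can be sketched as a toy, single-process word count in plain Python; this is only an analogue of what Hadoop distributes across a cluster (the function names and sample documents are invented for the example):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Two "input splits" standing in for files on HDFS
documents = ["big data big insight", "data warehouse"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
print(reduce_phase(mapped))  # {'big': 2, 'data': 2, 'insight': 1, 'warehouse': 1}
```

In a real Hadoop job the map calls run in parallel on the nodes holding each split, and the framework performs the shuffle before the reducers run; Spark expresses the same computation with `flatMap` and `reduceByKey` over an RDD.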

Learning and teaching

The delivery and teaching of the materials will be through a mixture of lectures, workshops, laboratory sessions and tutorials, under the following strategy: the first hour of each lecture introduces the concepts and principles of the module’s topics; the second hour runs in the form of a workshop to further explain approaches through real-life examples. Each lecture will be followed by either a laboratory or a seminar. Seminar time may be used to facilitate group meetings that cultivate research-oriented skills, or to introduce students to the state of the art beyond the specific syllabus. For self-study exercises and assessment, students are expected to spend time on unsupervised work in the computer laboratories, on searching primary sources of information in the library and on private study. It is also expected that students will dedicate hours to the coursework and case-study implementation and to the summative assessment (final exam). The teaching and learning methods will encourage open and self-directed learning, deepen students’ understanding and stress analytical skills.

Class contact hours: 2 hours lecture and 2 hours workshop/seminar/tutorials.

Blended learning: the university’s VLE and online tools are used to provide and deliver content, assessment and feedback, to encourage active learning and to enhance students’ engagement and learning experience.

Learning outcomes

Students will be able to:

LO1: Demonstrate a deep understanding of and familiarity with the operation of DBMSs, and appreciate the complexity of developing real-life applications.
LO2: Develop, configure, utilise and manage data warehouse applications in a variety of contexts.
LO3: Display a sound understanding of the principles of organising, validating, transforming and analysing large volumes of data on specialised (Big Data) platforms, drawing on various data sources: files, databases, server logs, etc.
LO4: Use the Big Data platform ecosystem for processing Big Data.
LO5: Comprehend the advantages and limitations of Big Data technologies, including predictive analytics, and build the confidence to interpret data as insights that drive organisational success.

Assessment strategy

The assessment will consist of one coursework and an unseen examination. The examination will test students’ retention of, understanding of and insight into the entire course (LO1, LO2, LO3, LO4 and LO5). The coursework will consist of a single assignment that assesses the practical aspects of the module: students will be given a case study that is a scaled-down version of a real-life Big Data application. In the coursework students will also be required to demonstrate their awareness of recent research developments and current Big Data technology trends, and to write an essay in which they contrast new approaches with conventional ones (LO3, LO4 and LO5). Some aspects of the coursework will also prepare students for their curricular projects.

The module will be passed on the aggregate mark of all assessment items.

Bibliography

Connolly, T.M. and Begg, C.E. (2009) Database Systems: A Practical Approach to Design, Implementation and Management, 5th edition. Addison Wesley. ISBN-10: 0321523067.
Ponniah, P. (2001) Data Warehousing Fundamentals. John Wiley & Sons. ISBN: 0-471-41254-6. [CORE]
Rainardi, V. (2007) Building a Data Warehouse with Examples in SQL Server. Apress. ISBN-13: 978-1-59059-931-0.
White, T. (2015) Hadoop: The Definitive Guide. Sebastopol: O’Reilly & Associates. ISBN-10: 1491901632. [CORE]
Ryza, S., Laserson, U., Owen, S. and Wills, J. (2015) Advanced Analytics with Spark. Sebastopol: O’Reilly Media. ISBN-10: 1491912766.