BIG DATA ANALYTICS AND TECHNOLOGIES
2018/2019, Semester 2
Arts & Social Sciences (Economics)
Modular Credits: 4
analytics, big data, database, hadoop, machine learning, python, spark
This module covers the concepts of big data, analytics and technologies. Its main goal is managing and analysing big data sets. Big data differs from traditional data in that it is massive, unstructured, granular, and heterogeneous. It is produced by a variety of digital sources and domains, including smartphones with multiple sensors, digital media shared on social media platforms, and billions of online financial transactions. Topics include big data scalability and processing, infrastructure, and analytics using Hadoop, MapReduce, Spark, Python, in-database analytics, mining of data streams, etc.
: 14 Jan 2019
Introduction to Big Data Analytics and Technologies
Setting up virtual environment
: 21 Jan 2019
Introduction to Python 3
Python 3: Comparing things and controlling flow
: 28 Jan 2019
More Python 3: Introduction to functions, user-defined functions, Object-Oriented Programming (classes, inheritance, etc.)
: 04 Feb 2019
No class on 04 Feb 2019 (Eve of Chinese New Year)
Replacement class on Saturday, 16 Feb 2019, 9am to 12 noon
: 11 Feb 2019
Introduction to Hadoop
Individual Assignment Deadline: 12th Feb 2019, 11:59pm
: 16 Feb 2019 (Saturday), 9am to 12 noon
Replacement class for 04 Feb 2019
: 18 Feb 2019
Quiz 1: Written Assessment (20%), 90 minutes
Introduction to Apache Spark
Apache Spark: Resilient Distributed Datasets
Data Processing using Spark DataFrames
Recess: 23 Feb 2019
: 04 Mar 2019
Introduction to Apache Spark SQL
User-defined functions (UDFs), Pandas UDF (Vectorized UDF)
Introduction to Machine Learning
: 11 Mar 2019
Introduction to MLlib: Apache Spark’s machine learning library
Extracting, transforming and selecting features
Classification and regression
: 18 Mar 2019
Quiz 2: Written Assessment (20%), 90 minutes
: 25 Mar 2019
Natural Language Processing
Stop words removal
Bag of words
TF-IDF, term frequency-inverse document frequency
: 01 Apr 2019
: 08 Apr 2019
Processing of live data streams using Spark
: 15 Apr 2019
Analytics Project Presentation & Submission (15%)
Reading Week: 20 Apr 2019
Final Examination (30%)
9th May 2019, Thursday, 5 to 7pm
Venue: To be announced
Strictly 4 to 5 students in each group.
Group members will decide the analytics goals and which open dataset(s) to use.
Chosen dataset must contain at least 8000 data points.
Must use Spark MLlib for analysis.
Must submit code written in Python 3.
Must submit a 2-page written report.
Peer review for the group project will be solicited toward the end of the semester.
Students who contribute little may have their scores penalised.
Workload Components: A-B-C-D-E
A: no. of lecture hours per week
B: no. of tutorial hours per week
C: no. of lab hours per week
D: no. of hours for projects, assignments, fieldwork, etc. per week
E: no. of hours for preparatory work by a student per week