CMSC5724 Data Mining and Knowledge Discovery

Fall 2020

Professor: Yufei Tao
TA: Shangqi Lu

Brief Description

This course covers the conceptual and algorithmic aspects of fundamental problems in data mining and knowledge discovery, including (time permitting) classification, clustering, and association rule analysis. On completion, students are expected to be able to perform an array of mining tasks that are essential to numerous practical applications.

Announcements

News 24 (8 Dec): The final exam will take place at 7 pm, 17 Dec. Here is the Zoom link: https://cuhk.zoom.us/j/93633967025.

The exam will have 4 sessions, each 30 minutes long. Each session will be organized as follows:
  • At the beginning, the questions will be released at a URL, which will be announced in Zoom's chatbox.
  • In the meantime, a link will be created in Blackboard to collect your submissions.
  • You must upload your work to Blackboard or email it to the instructor at taoyf@cse.cuhk.edu.hk within 30 minutes after the session has started.
  • There will be a 5-minute break between consecutive sessions.
The whole exam is expected to last for 2.5 hours.

For every session, you must hand-write your solutions on paper, take a picture of your solutions together with your student ID card or another photo ID (a national ID card or the photo page of your passport), and submit the picture.

The scope covers Lect Notes 1-11.

Please log into Zoom at least 15 minutes before the exam for identity verification.

News 23 (4 Dec): Project 5 released. Exercise list 11 released.
News 22 (29 Nov): Quiz 3 solution released.
News 21 (27 Nov): Exercise list 10 released.

News 20 (24 Nov): Project 4 released.
News 19 (21 Nov): Exercise list 9 released.

News 18 (20 Nov): Quiz 3 will be held in the lecture on 26 Nov. The scope covers everything in Lect Notes 6-8. The questions will be released at a URL at the beginning of the quiz. You will need to submit a picture containing your hand-written work and student ID card within 15 minutes after the URL is released. We will collect your submissions in the Blackboard system. You can submit multiple times, but only the last submission before the deadline will be graded.

News 17 (14 Nov): The final exam will be held from 6:30 pm to 9:30 pm on 17 Dec.

News 16 (12 Nov): Exercise list 8 released.

News 15 (10 Nov): Project released. Please scroll to the bottom of the page to see it.

News 14 (9 Nov): Quiz 2 solution released.
News 13 (7 Nov): Exercise list 7 released.

News 12 (31 Oct): Quiz 2 will be held in the lecture on 5 Nov. The scope covers everything in Lect Notes 3-5. The questions will be released at a URL at the beginning of the quiz. You will need to submit a picture containing your hand-written work and student ID card within 15 minutes after the URL is released. We will collect your submissions in the Blackboard system. You can submit multiple times, but only the last submission before the deadline will be graded.

News 11 (31 Oct): Exercise list 6 released.
News 10 (24 Oct): Exercise list 5 released.
News 9 (20 Oct): Quiz 1 solution released.
News 8 (16 Oct): Exercise list 4 released.
News 7 (10 Oct): Exercise list 3 released.

News 6 (9 Oct): Quiz 1 will be held in the lecture on 15 Oct. The scope covers everything in Lect Notes 1-2. The questions will be released at a URL at the beginning of the quiz. You will need to submit a picture containing your hand-written work and student ID card within 15 minutes after the URL is released. We will collect your submissions in the Blackboard system. You can submit multiple times, but only the last submission before the deadline will be graded.

News 5 (8 Oct): Arrangements for quizzes and the final exam:
  • Quizzes 1, 2, and 3 will take place in the 5th, 8th, and 11th lectures, respectively.
  • The time of the final exam will be released later.
  • Each quiz will be 15 mins long. The questions will be released at the beginning of those 15 minutes. You need to submit your answers in the Blackboard system.
  • The final exam will contain multiple sessions, each of which will be 20 mins long. In each session, the questions will be released at the beginning, and you need to submit your answers in the Blackboard system.
  • All quizzes and the exam will be open book.
News 4 (18 Sep): Exercise list 2 released.
News 3 (10 Sep): Exercise list 1 released (see the end of the page).
News 2 (10 Sep): The recording of the 09/10 lecture and the whiteboard image can now be found in the Blackboard system. I will upload both after every lecture, provided that the video capturing mechanism does not fail.
News 1 (5 Sep): Hello all.

Time and Zoom Link

Lecture: Thu 6:30pm - 9:30pm, Zoom link

Grading Scheme

Project: 30%
Short Tests or Assignments: 30%
Final: 40%

Textbook and Lecture Notes

No single textbook covers all the material of this course. The following reference books may be useful for extra reading:

[Book 1] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining.
[Book 2] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques.
[Book 3] Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms.
[Book 4] Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundations of Data Science.

Ownership of the above books is not mandatory. The instructor will make lecture notes available before each class. The notes cover all the content required in this course, some of which is not covered by the above books.

As usual, lecture attendance is vital for thorough understanding.

Lecture Notes and Extra Reading

1. [Classification] Decision Trees and a Generalization Theorem
   (Updated at 10:15am, 24 Sep. Changes: the statement of the generalization theorem.)
   Old version
   Extra reading: Sections 4.1-4.4 of [Book 1], Section 8.2 of [Book 2], or Chapter 19 of [Book 3]

2. [Classification] The Bayesian Method
   (Updated at 4:45pm, 24 Sep. Changes: the conditional independence assumption on Slide 22.)
   Old version
   Extra reading: Sections 5.3.1-5.3.3 of [Book 1], Section 9.1.2 of [Book 2], or Sections 18.1-18.2 of [Book 3]

3. [Classification] Perceptron
   Extra reading: Section 5.4.1 of [Book 1]

4. [Classification] Generalization Theorems Using VC-Dims and Margins
   Extra reading: --

5. [Classification] SVM and Margin Perceptron
   Extra reading: Sections 5.5.1-5.5.2 of [Book 1], Section 9.3 of [Book 2], Sections 21.1-21.2 of [Book 3]

6. [Classification] The Kernel Method
   Extra reading: Pages 273-276 of [Book 1], Section 9.3.2 of [Book 2], Section 21.4 of [Book 3]

7. [Classification] Multiclass Perceptron and Its Mistake Bound
   Extra reading: --

8. [Clustering] Centroid-Based Methods
   Extra reading: Section 8.2 of [Book 1], Section 10.2.1 of [Book 2], Section 13.1 of [Book 3]

9. [Clustering] Connectivity-Based Methods
   Extra reading: Sections 8.3-8.4 of [Book 1], Section 10.3 of [Book 2], Chapter 14 and Section 15.1 of [Book 3]

10. [Dimensionality Reduction] PCA
   Extra reading: Appendix B.1 of [Book 1], Section 7.2 of [Book 3]

11. [Association Rules] Apriori
   Extra reading: Sections 6.1, 6.2.1-6.2.4, and 6.3 of [Book 1]; Sections 6.1 and 6.2.1-6.2.3 of [Book 2]; Sections 8.1, 8.2.1, and 8.3 of [Book 3]

Exercises and Quizzes

Exercise List 1 (Solutions)
Exercise List 2 (Solutions)
Exercise List 3 (Solutions) (Updated on 24 Oct)
Exercise List 4 (Solutions)
Exercise List 5 (Solutions)
Exercise List 6 (Solutions)
Exercise List 7 (Solutions)
Exercise List 8 (Solutions)
Exercise List 9 (Solutions)
Exercise List 10 (Solutions)
Exercise List 11 (Solutions)

Quiz 1 solutions
Quiz 2 solutions
Quiz 3 solutions

Project

The project page is here.

Deadlines:
  • For the first 2 projects: 11:59pm, 30 Nov
  • For Project 3: 11:59pm, 7 Dec
  • For Project 4: 11:59pm, 15 Dec
  • For Project 5: 11:59pm, 25 Dec
To submit your project, email the TA (i) all the required deliverables (as detailed on the project page) and (ii) your contribution declaration (again, see the project page). Please use the subject "CMSC5724 project submission" for your email. Remember to list the IDs and names of all the members of your team.