CMSC5724 Data Mining and Knowledge Discovery

Fall 2022

Professor: Yufei Tao
TA: Shiyuan Deng (sydeng@cse)

Brief Description

This course covers the conceptual and algorithmic aspects of fundamental problems in data mining and knowledge discovery, including (time permitting) classification, clustering, and association rule analysis, among others. On completion, students are expected to be able to perform an array of mining tasks that are essential to numerous applications in practice.

Announcements

News 14 (30 Nov): The final exam covers all lecture notes. During the exam, you may consult only a single-sided, A4-sized note sheet. As mentioned before, the exam will start at 6:30pm on 13 Dec and will last 2 hours.

News 13 (28 Nov): Your Quiz 3 scores have been released on Blackboard. You can now collect your papers from the TA, Mr. Shiyuan Deng.

News 12 (17 Nov): Venue of the final exam: our lecture classroom.

News 11 (16 Nov): Scope of Quiz 3: Lecture Notes 6-9.

News 10 (27 Oct): Your Quiz 2 scores have been released on Blackboard. You can now collect your papers from the TA, Mr. Shiyuan Deng.

News 9 (19 Oct): Scope of Quiz 2: Lecture Notes 3-5.

News 8 (11 Oct): The final exam will be held from 6:30pm to 8:30pm on 13 Dec. You will be allowed to bring a single-sided, A4-sized note sheet, on which you can print or write anything you deem useful. No other material can be consulted during the exam. The venue will be announced once it has been decided.

News 7 (6 Oct): Your Quiz 1 scores have been released on Blackboard. You can now collect your Quiz 1 papers from the TA, Mr. Shiyuan Deng. Please make an appointment with him by email at sydeng@cse.cuhk.edu.hk.

News 6 (26 Sep): All quizzes will be open-book.

News 5 (24 Sep): We will adopt the following policy for quizzes and the final exam.
  1. All students must take the tests (i.e., quizzes and the final exam) in the classroom, except in the cases described in points 2 and 3.

  2. If you are Covid positive, you need to report your case to the HK government at this site. You will receive a confirmation after completing the case reporting. Send the confirmation to Prof. Tao, and he will arrange a special test session for you.

  3. If you plan to be absent from a test for health reasons, you must email Prof. Tao at least one hour before the test and follow up with a doctor's letter. You will be allowed to take a make-up test that may be harder than the plenary test (how much harder will depend on the number of extra days you have to prepare for the test).

News 4 (21 Sep): Scope of Quiz 1: Lecture Notes 1 and 2. All students are expected to take the test in the classroom.

News 3 (10 Sep): Please note the test schedule for this course:
Quiz 1 (20 minutes): To be held in the lecture of Sep 27 (Tue, Week 4)
Quiz 2 (20 minutes): To be held in the lecture of Oct 25 (Tue, Week 8)
Quiz 3 (20 minutes): To be held in the lecture of Nov 22 (Tue, Week 12)

News 2 (10 Sep): Exercise List 1 is released.

News 1 (5 Sep): Hello all.

Time, Venues, and Zoom Link

Lecture: 6:30pm - 9:30pm Tue, William M W Mong Eng Bldg LT
Zoom Link: https://cuhk.zoom.us/j/94660066343

Click here for a map of the campus.

Grading Scheme

Project: 30%
Short Tests or Assignments: 30%
Final: 40%

Textbook and Lecture Notes

No single textbook covers all the material in this course. Some reference books may be useful for extra reading:

[Book 1] Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms.
[Book 2] Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundations of Data Science.

Ownership of the above books is not mandatory. The instructor will make lecture notes available before each class. His notes cover all the content required in this course, some of which is outside the above books.

As usual, lecture attendance is vital for thorough understanding.

Lecture Notes and Extra Reading

  1. [Classification] Decision Trees and a Generalization Theorem
     (video 1)
     (video 2)
     Extra reading: Chapter 19 of [Book 1]; Sections 5.5-5.6 of [Book 2]

  2. [Classification] The Bayesian Method
     (video, start-00:49)
     Extra reading: Sections 18.1-18.2 of [Book 1]

  3. [Classification] Perceptron
     (video 1, 00:49-end)
     (video 2, start-01:41)
     Extra reading: Section 5.8.3 of [Book 2]

  4. [Classification] Generalization Theorems Using VC-Dims and Margins
     (video 1, 01:41-end)
     (video 2, start-01:40; mistake at 00:30 - I said rectangular classifiers could not shatter 3 colinear points, which is wrong; the class of rectangular classifiers can shatter any set of 3 colinear points)
     Extra reading: --

  5. [Classification] SVM and Margin Perceptron
     (video 1, 01:40-end; mistake at 02:10:40 - we want to maximize the margin, not minimize it)
     (video 2, white board)
     Extra reading: Sections 21.1-21.2 of [Book 1]

  6. [Classification] The Kernel Method
     (video)
     Extra reading: Section 21.4 of [Book 1]

  7. [Classification] Multiclass Perceptron and Its Mistake Bound
     (video)
     Extra reading: --

  8. [Clustering] Centroid Methods
     (video)
     Extra reading: Section 13.1 of [Book 1]

  9. [Clustering] Connectivity Methods
     (video, start-01:04:30)
     Extra reading: Chapter 14 and Section 15.1 of [Book 1]

  10. [Dimensionality Reduction] PCA
      See here for a nice example of using PCA for image compression.
      (video 1, 01:04:30-end)
      (video 2, start-01:15:40)
      Extra reading: Section 7.2 of [Book 1]

  11. [Association Rules] Apriori
      (video, 01:15:40-end)
      Extra reading: Sections 8.1, 8.2.1, and 8.3 of [Book 1]

  12. [Graph Mining] Page Ranks and Random Walks
      (video)
      Extra reading: --

Exercises and Quizzes

Exercise List 1 (Solutions)
Exercise List 2 (Solutions) Note: Problem 5 is outside the scope of quizzes and exams
Exercise List 3 (Solutions)
Exercise List 4 (Solutions)
Exercise List 5 (Solutions)
Exercise List 6 (Solutions)
Exercise List 7 (Solutions)
Exercise List 8 (Solutions)
Exercise List 9 (Solutions)
Exercise List 10 (Solutions)
Exercise List 11 (Solutions)
Exercise List 12 (Solutions)

Quiz 1 solutions. Average = 74, Std. Dev. = 33
Quiz 2 solutions. Average = 68, Std. Dev. = 21
Quiz 3 solutions. Average = 95, Std. Dev. = 13
Final exam: Average = 86, Std. Dev. = 11

Project

The project page is here.

Deadlines:
  • Project 1: 11:59pm, 25 Oct 2022
  • Project 2: 11:59pm, 22 Nov 2022
  • Projects 3-5: 11:59pm, 16 Dec 2022

To submit your project, email all the deliverables (as detailed on the project page) to the TA. Please use the subject "CMSC5724 project submission" for your email. Remember to list the IDs and names of all the members of your team.