CMSC5724 Data Mining and Knowledge Discovery

Fall 2021

Professor: Yufei Tao
TA: Shangqi Lu

Brief Description

This course covers the conceptual and algorithmic aspects of fundamental problems in data mining and knowledge discovery, including (time permitting) classification, clustering, and association rule analysis. On completion, students are expected to be able to perform a range of mining tasks that are essential to numerous practical applications.

Announcements

News 8 (26 Nov): Quiz 3 will take place in the lecture of 2 Dec. The scope covers Lecture Notes 6-9. The quiz will be 20 minutes long and will start at 7:30 pm.

News 7 (22 Oct): The final exam will start at 6:30 pm on Dec 16. The venue is CYT202.

News 6 (22 Oct): Quiz 2 will take place in the lecture of Oct 28. The scope covers Lecture Notes 3-5. The quiz will be 20 minutes long and will start at 7:30 pm. You will need to take the quiz in the classroom, unless you have obtained prior approval from the instructor to take the test online. In general, approval will be given only to students with genuine travel restrictions that prevent them from coming to the campus. The same policy will apply to Quiz 3 and the final exam.

News 5 (22 Oct): No lecture on Nov 4. There will be a make-up lecture at 6:30 pm on Dec 9. The venue is CYT202.

News 4 (29 Sep): All the tests (i.e., quizzes and the final exam) are open book.

News 3 (24 Sep): Test schedule of the course: Quiz 1: 30 Sep, Quiz 2: 28 Oct, Quiz 3: 2 Dec.

Sick Leave Policy: If you need to miss a quiz or exam for health reasons, you must email your application to the instructor at least one hour before the test and follow up with a doctor's letter.

News 2 (24 Sep): Quiz 1 will take place in the next lecture (30 Sep). The scope covers Lecture Notes 1-2. You will need to take the quiz in the classroom, unless you have obtained prior approval from the instructor to take the test online. In general, approval will be given only to students with genuine travel restrictions that prevent them from coming to the campus. The quiz will be 20 minutes long and will start at 7:30 pm.

News 1 (1 Sep): Hello all.

Time, Venues, and Zoom Link

Lecture: 6:30 pm - 9:30 pm Thu, Wu Ho Man Yuen Bldg 506
Zoom

Click here for a map of the campus.

Grading Scheme

Project: 30%
Short Tests or Assignments: 30%
Final: 40%

Textbook and Lecture Notes

No single textbook covers all the material of this course. The following reference books may be useful for extra reading:

[Book 1] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining.
[Book 2] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques.
[Book 3] Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms.
[Book 4] Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundations of Data Science.

Ownership of the above books is not mandatory. The instructor will make lecture notes available before each class. His notes cover all the content required in this course, some of which is not covered by the above books.

As usual, lecture attendance is vital for thorough understanding.

Lecture Notes and Extra Reading

1. [Classification] Decision Trees and a Generalization Theorem
   Extra reading: Sections 4.1-4.4 of [Book 1]; Section 8.2 of [Book 2]; or Chapter 19 of [Book 3]

2. [Classification] The Bayesian Method
   Extra reading: Sections 5.3.1-5.3.3 of [Book 1]; Section 9.1.2 of [Book 2]; or Sections 18.1-18.2 of [Book 3]

3. [Classification] Perceptron (video)
   Extra reading: Section 5.4.1 of [Book 1]

4. [Classification] SVM and Margin Perceptron (video)
   Extra reading: Sections 5.5.1-5.5.2 of [Book 1]; Section 9.3 of [Book 2]; Sections 21.1-21.2 of [Book 3]

5. [Classification] Generalization Theorems Using VC-Dims and Margins (video)
   Extra reading: --

6. [Classification] The Kernel Method (video)
   Extra reading: Pages 273-276 of [Book 1]; Section 9.3.2 of [Book 2]; Section 21.4 of [Book 3]

7. [Classification] Multiclass Perceptron and Its Mistake Bound (video)
   Extra reading: --

8. [Clustering] Centroid-Based Methods (video)
   Extra reading: Section 8.2 of [Book 1]; Section 10.2.1 of [Book 2]; Section 13.1 of [Book 3]

9. [Clustering] Connectivity-Based Methods (video)
   Extra reading: Sections 8.3-8.4 of [Book 1]; Section 10.3 of [Book 2]; Chapter 14 and Section 15.1 of [Book 3]

10. [Dimensionality Reduction] PCA (video)
   See here for a nice example of using PCA for image compression.
   Extra reading: Appendix B.1 of [Book 1]; Section 7.2 of [Book 3]

11. [Association Rules] Apriori (video)
   Extra reading: Sections 6.1, 6.2.1-6.2.4, and 6.3 of [Book 1]; Sections 6.1 and 6.2.1-6.2.3 of [Book 2]; Sections 8.1, 8.2.1, and 8.3 of [Book 3]

12. [Graph Mining] Page Ranks and Random Walks (video)
   Extra reading: --

Exercises and Quizzes

Exercise List 1 (Solutions)
Exercise List 2 (Solutions). Note: Problem 5 is outside the scope of quizzes and exams.
Exercise List 3 (Solutions)
Exercise List 4 (Solutions)
Exercise List 5 (Solutions)
Exercise List 6 (Solutions)
Exercise List 7 (Solutions)
Exercise List 8 (Solutions)
Exercise List 9 (Solutions)
Exercise List 10 (Solutions)
Exercise List 11 (Solutions)
Exercise List 12 (Solutions)

Quiz 1 solutions, mean = 65, standard deviation = 28
Quiz 2 solutions, mean = 62, standard deviation = 21
Quiz 3 solutions, mean = 66, standard deviation = 21

Project

The project page is here.

Deadlines:
  • For the first 2 projects: 11:59pm, 13 Dec
  • For Project 3: 11:59pm, 18 Dec
  • For Project 4: 11:59pm, 25 Dec
  • For Project 5: 11:59pm, 25 Dec
To submit your project, email the TA (i) all the required deliverables (as detailed on the project page) and (ii) your contribution declaration (again, see the project page). Please use the subject "CMSC5724 project submission" for your email. Remember to list the IDs and names of all the members of your team.