CMSC 5724 Project Page


Team Coding

Each team can have up to 7 members and must implement a designated data mining algorithm in C++, Java, or Python. The implementation must be from scratch, i.e., it can use only functions from a standard library. Use of any other function is not permitted unless prior approval has been obtained from the instructor. All source code is subject to plagiarism checks (using some of the software tools listed here). Any form of dishonesty will incur a severe disciplinary penalty.

Using a programming language other than the above requires approval from the instructor.

Contribution Declaration

Every team member is required to declare the percentage of work that s/he has done. If necessary, the tutor will carry out a one-to-one interview with each team member to assess whether the percentage is reasonable. If the program receives a score of s (from the previous two bullets) and the team has a size of t, then a team member who contributed p percent of the work receives a final score of s * min{p * t, 1}, with p taken as a fraction (e.g., 0.25 for 25%).
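As a worked illustration of the scoring formula, here is a minimal sketch. It assumes p is written as a fraction (0.25 for 25%), since otherwise p * t would always exceed 1. The function name final_score and the sample numbers are hypothetical.

```python
def final_score(s: float, p: float, t: int) -> float:
    """Return s * min(p * t, 1), with p given as a fraction of the total work."""
    return s * min(p * t, 1.0)


# Hypothetical team of t = 4 whose program received s = 80.
equal_share = final_score(80, 0.25, 4)   # p * t = 1, so the full score is kept
below_share = final_score(80, 0.20, 4)   # p * t = 0.8, so the score is scaled down
above_share = final_score(80, 0.40, 4)   # p * t = 1.6 is capped at 1
```

Under this reading, a member who did at least an equal share (p >= 1/t) keeps the full program score, and a smaller share reduces it proportionally.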


Project List

Each team can choose to work on any of the following projects. The list will grow: further projects will be released after their topics have been covered in the lectures.


Project #1: Decision Tree

Goal

Implement the algorithm discussed in Lecture 1.

Dataset

We will use the Adult dataset whose description is available here. The training set (adult.data) and evaluation set (adult.test) can be downloaded here.

Preprocessing

Remove all the records containing '?' (i.e., missing values). Also, remove the attribute "native-country".
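The preprocessing step could be sketched as follows, using only the standard library. This is an illustration under stated assumptions, not a reference solution: it assumes each record is one comma-separated line with 15 fields and that native-country is the 14th field (index 13), as in the commonly distributed version of the Adult files. The name preprocess and the sample rows are hypothetical.

```python
import csv

# Assumption: native-country is the 14th of 15 comma-separated fields.
NATIVE_COUNTRY_INDEX = 13


def preprocess(lines):
    """Drop records containing '?', then drop the native-country column."""
    cleaned = []
    for row in csv.reader(lines, skipinitialspace=True):
        row = [field.strip() for field in row]
        if len(row) <= NATIVE_COUNTRY_INDEX:
            continue  # blank or malformed line
        if "?" in row:
            continue  # missing value anywhere, including native-country
        cleaned.append(row[:NATIVE_COUNTRY_INDEX] + row[NATIVE_COUNTRY_INDEX + 1:])
    return cleaned


# Made-up rows in the assumed layout (not copied from the course files).
sample = [
    "30, Private, 100000, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K",
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 60, United-States, >50K",
    "",
]
demo = preprocess(sample)  # keeps only the first row, now with 14 fields
```

The '?' check runs before the column is dropped, following the order in which the two steps are stated above. An open file object can be passed directly as lines.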

Deliverables


Project #2: Margin Perceptron

Goal

Implement the margin perceptron algorithm in Lecture 5.

Dataset

Your implementation should work on any dataset stored in a text file of the following format: An example dataset of 4 two-dimensional points is:
4 2
1 2 0
2 1 0
3 2 1
2 4 1
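A reader for this format might look like the sketch below. It rests on an assumption inferred from the example only: the first line holds the number of points n and the dimensionality d, and each of the following n lines holds d coordinates followed by one label. The name parse_dataset is hypothetical.

```python
def parse_dataset(text):
    """Parse the text format into (points, labels) under the assumed layout."""
    tokens = text.split()
    n, d = int(tokens[0]), int(tokens[1])
    values = tokens[2:]
    if len(values) != n * (d + 1):
        raise ValueError("expected %d values, found %d" % (n * (d + 1), len(values)))
    points, labels = [], []
    for i in range(n):
        row = values[i * (d + 1):(i + 1) * (d + 1)]
        points.append([float(v) for v in row[:d]])  # first d tokens: coordinates
        labels.append(int(row[d]))                  # last token: label
    return points, labels


example = """4 2
1 2 0
2 1 0
3 2 1
2 4 1"""
points, labels = parse_dataset(example)
```

Reading whitespace-separated tokens rather than fixed lines keeps the parser tolerant of trailing blanks and empty lines at the end of the file.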

Deliverables


Project #3: Bayes Classifier, K-Center, K-Means

This project has two parts.

=============
=== PART I ===
=============

Goal

Implement the Bayes Classifier in Lecture 2.

Dataset, Preprocessing

Same as Project #1.

Deliverables

=============
=== PART II ===
=============

Goal

Implement the k-means algorithm (Lecture 8), using the k-center algorithm (also in Lecture 8) for center initialization.

Dataset

Download here (obtained from the data collection here). Each line has the following format:

x y

where x and y are the coordinates of a point.

Task

Partition the dataset into 8 clusters.
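The overall pipeline could be sketched as below. This is an illustration, not the lecture's exact procedure: it assumes the k-center step is the classic greedy farthest-first heuristic (starting from an arbitrary point, here the first one) and that k-means is the usual alternation between assignment and centroid update. All function names are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def k_center_init(points, k):
    """Greedy farthest-first: repeatedly add the point farthest from the chosen centers."""
    centers = [points[0]]  # arbitrary starting point (an assumption)
    nearest = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        nearest = [min(nearest[i], dist(points[i], points[idx]))
                   for i in range(len(points))]
    return centers


def k_means(points, k, max_iter=100):
    """Return (centers, assignment), where assignment[i] is the cluster index of points[i]."""
    centers = k_center_init(points, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda c: dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:
            break  # no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return centers, assignment
```

For the task above, the call would be k_means(points, 8), with points loaded from the dataset file as a list of (x, y) pairs.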

Deliverables


Project #4: DBSCAN

Goal

Implement the DBSCAN algorithm discussed in Lecture 9.

Dataset

Download here (obtained from the data collection here). Each line has the following format:

x y

where x and y are the coordinates of a point.

Task

Partition the dataset into 3 clusters.
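A brute-force sketch of DBSCAN is given below for orientation. It makes several assumptions that may differ from the lecture: the parameter names eps and min_pts are hypothetical (their values are not specified here and would need to be chosen so that 3 clusters result), a point counts as its own neighbor, and a border point is assigned to the first cluster that reaches it. It uses math.dist, available from Python 3.8.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id 0, 1, ... or NOISE."""
    n = len(points)
    # O(n^2) neighborhood computation; fine for a sketch, slow for large inputs.
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster    # i is a core point: grow a new cluster from it
        stack = list(neighbors[i])
        while stack:
            j = stack.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                stack.extend(neighbors[j])  # j is also core: keep expanding
        cluster += 1
    return labels
```

Points left with the label NOISE belong to no cluster, so the number of clusters is max(labels) + 1 whenever at least one cluster was found.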

Deliverables


Project #5: Association Rule Mining

Goal

Implement the association rule mining algorithm discussed in Lecture 11.

Dataset

Download here (by courtesy of Alexander Dekhtyar). Each line has the following format:

tid, a, b, c, ...

where tid is the transaction id, and a, b, c, ... are the items of the transaction (each item is represented by an integer).

Task

Find all the association rules with support at least 0.1n and confidence at least 0.9, where n is the number of transactions.
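The two thresholds can be checked for a single candidate rule as in the sketch below. It assumes the usual definitions: the support of a rule X -> Y is the number of transactions containing every item of X and Y, and its confidence is that count divided by the number of transactions containing X. The function names are hypothetical, and the search over candidate rules, which is the actual algorithm, is not shown.

```python
from fractions import Fraction


def parse_transaction(line):
    """Parse 'tid, a, b, c, ...' into (tid, frozenset of items)."""
    fields = [int(tok) for tok in line.split(",") if tok.strip()]
    return fields[0], frozenset(fields[1:])


def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for items in transactions if itemset <= items)


def rule_qualifies(lhs, rhs, transactions,
                   min_support=Fraction(1, 10), min_conf=Fraction(9, 10)):
    """Check support >= min_support * n and confidence >= min_conf for lhs -> rhs."""
    n = len(transactions)
    both = support_count(lhs | rhs, transactions)
    lhs_count = support_count(lhs, transactions)
    if lhs_count == 0:
        return False
    # Exact rational arithmetic avoids float surprises such as 0.1 * 30 > 3.
    return both >= min_support * n and Fraction(both, lhs_count) >= min_conf
```

The item sets passed as lhs and rhs should be frozenset (or set) objects, matching the second component returned by parse_transaction.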

Deliverables