CMSC 5724 Project Page

Team Coding

Each team may have up to 6 members and must implement a designated data mining algorithm in C++, Java, or Python. The implementation must be from scratch, i.e., it can use only functions from a standard library:

Use of any function outside the above libraries is not permitted unless prior approval has been obtained from the instructor. All source code is subject to plagiarism checks. Any form of dishonesty will be reported to the university for disciplinary action.

Using a programming language other than the above requires approval from the instructor.

Contribution Declaration

Every team member is required to declare the percentage of the work that they have done. If the program receives a score of s and the team has size t, then a team member who contributed p percent receives a final score of s * min{(p/100) * t, 1}. For example, if a team has size 5 and all members contribute equally, then p = 20 and every member receives a score of s * min{(20/100) * 5, 1} = s.
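The scoring rule can be checked with a few lines of Python. This is only an illustration of the formula above; the function name member_score and the sample values are made up for the example.

```python
def member_score(s: float, t: int, p: float) -> float:
    """Final score of a member who declared p percent of the work,
    for a program score s and a team of size t."""
    # An equal share (p = 100 / t) gives a factor of 1; a smaller share
    # scales the score down, and a larger share is capped at 1.
    return s * min((p / 100) * t, 1)


if __name__ == "__main__":
    # Hypothetical team of 5 with a program score of 90.
    for p in (20, 10, 30):
        print(p, member_score(90, 5, p))
```

A member who declares more than an equal share never receives more than s, while a member who declares less receives proportionally less.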

Project List

Each team can choose to work on any of the following projects.

Decision Tree
Margin Perceptron
Bayes Classifier, k-Means, k-Center
DBSCAN
Association Rule Mining

The list is growing. The other projects will be released after their topics have been covered in the lectures.

Project #1: Decision Tree

Goal

Implement Hunt's algorithm for decision tree classification.

Dataset

We will use the Adult dataset whose description is available here. The training set (adult.data) and evaluation set (adult.test) can be downloaded here.

Preprocessing

Remove all the records containing '?' (i.e., missing values). Also, remove the attribute "native-country".
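A minimal sketch of this preprocessing step is given below. It assumes that each record is one comma-separated line with 15 fields (14 attributes plus the class label) and that "native-country" is the 14th field; please verify both against the dataset description before relying on them. The function name preprocess and the sample rows are illustrative only.

```python
NATIVE_COUNTRY_INDEX = 13  # assumption: 14th field, counting from 0
EXPECTED_FIELDS = 15       # assumption: 14 attributes + class label


def preprocess(lines):
    """Return cleaned records as lists of strings."""
    records = []
    for line in lines:
        fields = [f.strip() for f in line.strip().split(",")]
        if len(fields) != EXPECTED_FIELDS:
            continue  # skip blank lines and anything that is not a record
        if "?" in fields:
            continue  # record has a missing value
        del fields[NATIVE_COUNTRY_INDEX]  # drop "native-country"
        records.append(fields)
    return records


if __name__ == "__main__":
    # Made-up rows in the assumed format, not actual dataset records.
    sample = [
        "25, Private, 1, HS-grad, 9, Never-married, Sales, Own-child, "
        "White, Male, 0, 0, 40, United-States, <=50K",
        "30, ?, 2, HS-grad, 9, Never-married, ?, Own-child, "
        "White, Male, 0, 0, 40, United-States, <=50K",
        "",
    ]
    print(preprocess(sample))
```

The same function can be applied to both the training set and the evaluation set so that the two are cleaned identically.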

Deliverables

An executable program, which should output a decision tree to the disk when given an input training set.
A readme file detailing how to use the program.
Source code.
A document describing (i) the decision tree built from the Adult training set, and (ii) a report on using the tree to classify the records of the evaluation set. The report should list every record in the evaluation set, giving for each record its attributes and whether it has been classified correctly.

Project #2: Margin Perceptron

Goal

Implement the margin perceptron algorithm.

Dataset

Your implementation should work on any dataset stored in a text file of the following format:

The first line contains two numbers n and d, where n is the number of points, and d is the number of describing attributes.
The i-th line (where i goes from 2 to n + 1) gives the (i - 1)-th point in the dataset as:
x₁ x₂ ... x_d c
where the first d values are the coordinates of the point, and c = 1 (if the point is blue) or 0 (red).

An example dataset of 4 two-dimensional points is:
4 2
1 2 0
2 1 0
3 2 1
2 4 1
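A reader for this format might look like the sketch below. The function name parse_dataset is hypothetical, and the sketch assumes whitespace-separated values with blank lines ignored.

```python
def parse_dataset(text):
    """Parse the text format described above.

    Returns (points, labels): points is a list of n lists of d floats,
    and labels is a list of n integers (1 = blue, 0 = red).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n, d = (int(v) for v in lines[0].split())
    if len(lines) - 1 != n:
        raise ValueError("expected %d points, found %d" % (n, len(lines) - 1))
    points, labels = [], []
    for ln in lines[1:]:
        values = ln.split()
        if len(values) != d + 1:
            raise ValueError("expected %d values per line" % (d + 1))
        points.append([float(v) for v in values[:d]])
        labels.append(int(values[d]))
    return points, labels


if __name__ == "__main__":
    example = "4 2\n1 2 0\n2 1 0\n3 2 1\n2 4 1\n"
    pts, lbs = parse_dataset(example)
    print(pts)
    print(lbs)
```

Because d is read from the first line, the same reader works for any dimensionality.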

Deliverables

An executable program which should work for any dimensionality d.
A readme file detailing how to use the program.
Source code.
A test dataset with at least 10000 points.

Project #3: Bayes Classifier, k-Center, k-Means

This project has two parts.

=============
=== PART I ===
=============

Goal

Implement the Bayes Classifier.

Dataset, Preprocessing

Same as Project #1.

Deliverables

An executable program.
A readme file detailing how to use the program.
A report on using the program to classify the records of the evaluation set. The report should list every record in the evaluation set, giving for each record its attributes and whether it has been classified correctly.

=============
=== PART II ===
=============

Goal

Implement the k-means algorithm using the k-center algorithm for center initialization.
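The overall structure can be sketched as follows, assuming that "the k-center algorithm" refers to the greedy farthest-first traversal and that k-means refers to the usual alternating assign/recompute iteration; follow the lecture notes where they differ. The function names, the choice of the first point as the first center, and the max_iter cap are all assumptions made for this illustration. The sketch needs Python 3.8 or later for math.dist.

```python
import math


def k_center_seeds(points, k):
    """Greedy farthest-first traversal: pick k of the input points."""
    centers = [points[0]]  # assumption: the first center is arbitrary
    # dist[j] = distance from point j to its nearest chosen center
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        far = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[far])
        dist = [min(dist[j], math.dist(points[j], points[far]))
                for j in range(len(points))]
    return centers


def k_means(points, k, max_iter=100):
    """Return (centers, clusters) after iterating from k-center seeds."""
    centers = k_center_seeds(points, k)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers, clusters


if __name__ == "__main__":
    # Made-up toy data: two well-separated groups of three points.
    toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(k_means(toy, 2))
```

Points are expected as tuples of numbers. The seeding step costs O(nk) distance computations, which is usually negligible next to the k-means iterations.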

Dataset

Download here (obtained from the data collection here). Each line has the following format:

x y

where x and y are the x- and y-coordinates of a point.

Task

Partition the dataset into 8 clusters.

Deliverables

An executable program.
A readme file detailing how to use the program.
A report explaining the clusters found (e.g., giving a visualization of each cluster).
Source code.

Project #4: DBSCAN

Goal

Implement the DBSCAN algorithm.

Dataset

Download here (obtained from the data collection here). Each line has the following format:

x y

where x and y are the x- and y-coordinates of a point.

Task

Partition the dataset into 3 clusters.
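DBSCAN does not take the number of clusters as input, so reaching 3 clusters means choosing its two parameters suitably. The sketch below is one common formulation of the algorithm with a brute-force neighbor search; the parameter names eps and min_pts, the labeling convention (-1 for noise), and the toy data are assumptions for illustration, and the lecture notes take precedence where they differ. It needs Python 3.8 or later for math.dist.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, 2, ... for clusters, -1 for noise."""
    n = len(points)

    def neighbors(i):
        # Brute force, O(n) per query; the point itself is included.
        return [j for j in range(n)
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n  # None means "not visited yet"
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        labels[i] = cluster  # i is a core point: start a new cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
        cluster += 1
    return labels


if __name__ == "__main__":
    # Made-up toy data: two dense squares and one isolated point.
    toy = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11),
           (50, 50)]
    print(dbscan(toy, eps=1.5, min_pts=3))
```

In practice one would run this with several (eps, min_pts) pairs and keep a setting that yields the required number of clusters.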

Deliverables

An executable program.
A readme file detailing how to use the program.
A report explaining the clusters found (e.g., giving a visualization of each cluster).
Source code.

Project #5: Association Rule Mining

Goal

Implement the Apriori algorithm for association rule mining.

Dataset

Download here (by courtesy of Alexander Dekhtyar). Each line has the following format:

tid, a, b, c, ...

where tid is the transaction id, and a, b, c ... are the items of the transaction (each item is represented by an integer).

Task

Find all the association rules with support at least 0.1n and confidence at least 0.9, where n is the number of transactions.
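The two thresholds can be checked for a single candidate rule as sketched below. The sketch assumes the usual definitions: the support of a rule X -> Y is the number of transactions containing every item of X and Y, and its confidence is that count divided by the number of transactions containing X. The function names and toy transactions are hypothetical, and this is only the final filtering step, not the Apriori candidate generation itself.

```python
def parse_transaction(line):
    """Parse 'tid, a, b, c, ...' into (tid, frozenset of items)."""
    values = [int(v) for v in line.split(",")]
    return values[0], frozenset(values[1:])


def support_count(transactions, itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rule_qualifies(transactions, lhs, rhs, min_sup_frac=0.1, min_conf=0.9):
    """True if lhs -> rhs has support >= min_sup_frac * n
    and confidence >= min_conf."""
    n = len(transactions)
    sup_rule = support_count(transactions, lhs | rhs)
    sup_lhs = support_count(transactions, lhs)
    if sup_lhs == 0:
        return False  # confidence is undefined
    return sup_rule >= min_sup_frac * n and sup_rule / sup_lhs >= min_conf


if __name__ == "__main__":
    # Made-up transactions, already parsed into item sets.
    toy = [frozenset(s) for s in
           ({1, 2}, {1, 2}, {1, 2, 3}, {2, 3}, {1, 2},
            {3}, {1, 2}, {1, 2}, {1, 2}, {1, 2})]
    print(rule_qualifies(toy, frozenset({1}), frozenset({2})))
    print(rule_qualifies(toy, frozenset({2}), frozenset({1})))
```

In the toy data, item 1 always appears together with item 2, but item 2 appears once without item 1, so only the first of the two rules reaches a confidence of 0.9.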

Deliverables

An executable program.
A readme file detailing how to use the program.
A report detailing all the association rules found and their respective support and confidence values.
Source code.