Table of Contents

CSC5250 Information Retrieval and Search Engine

Fall 2006

Lecture I Lecture II Tutorial
Time M2, 9:30 am - 10:30 am T7-8, 2:30 pm - 4:15 pm T9, 4:30 pm - 5:15 pm
Venue SHB 503 LSB C3 ERB 804

The Golden Rule of CSC5250: No member of the CSC5250 community shall take unfair advantage of any other member of the CSC5250 community.

:teaching:csc5250:s8.jpg This is the class photos for CSC5250 in 2006. More photos are available <here>. :-)

Course Description

  1. This course surveys the current research in information retrieval for the Internet and related topics.
  2. The course will focus on theoretical development of information retrieval system for multimedia contents as well as practical design and implementation issues associated with Internet search engines.
  3. Topics include probabilistic retrieval, relevance feedback, indexing of multimedia data, and applications in e-commerce.

Personnel

Lecturer Tutor Tutor
Name Irwin King Xiang Peng Jacky Zhu
Email king AT cse.cuhk.edu.hk xpeng AT cse.cuhk.edu.hk jkzhu AT cse.cuhk.edu.hk
Office Rm 908 Rm 1013 Rm 110
Telephone 2609 8398 2609 8431 TBD
Office Hour(s) TBD Thursday 3:00 pm - 5:00 pm TBD

Note: This class will be taught in English. Homework assignments and examinations will be conducted in English.

Syllabus

The pdf files are created in Acrobat 6.0. Please obtain the correct version of the Acrobat Reader from Adobe.

Week Date Topics Tutorials Homework & Events Resources
1 4/9 1. Course Information [ color-b&w]
2. Introduction [ color - b&w ]
No Tutorial Readings:
1. Google
2. Information Retrieval on the World Wide Web
1. Information Retrieval (2nd Ed.) by C.J. van Rijsbergen
2. Searching on the Web
3. Usefulness of Search Engines
4. Web Search Engines
5. Comparing Search Engines
6. The Anatomy of a Large-Scale Hypertextual Web Search Engine
2-3 11-18/9 1. Search Engine [ color-b&w ]
2. Modeling [ color-b&w ]
Introduction to Python I Readings:
1. The Google Cluster Architecture
Programming Exercises:
1. No.17 in v.099b–Simple Page Crawler in Python: Due on Friday, 29 Sept on or before 11:59 PM
1. Major Search Engines
2. Extended Boolean model
4 TBD 1. Probabilistic Modeling [ color-b&w ] Introduction to Python II Readings:
1. Information Retrieval on the Web
2. Information Retrieval Algorithms
1. Vector space model and document ranking
2. Probabilistic model
5 25/9 Performance Evaluationcolor-b&w ] Introduction to Python III TBD 1. Retrieval System Performance
2. Precision of Web Search Services
3. Measuring Search Engine Quality
4. MS Site Server's Capacity and Performance Analysis Search
5. MS Site Server Indexing
6 2/10 Text Algorithm
1. Text Algorithm [ Color - B&W ]
2. Text Analysis [ Color - B&W ]
Scipy I Written Exercises:
1. No. 2.1, 2.3, 2.4, 2.5, 3.2, 5.1, 5.2, 5.3, 5.4, 5.5(1, 3) inv.099c: Due on Friday, 27 Oct on or before 11:59 PM
1. Lexical Analysis of Messages
2. Porter's Stemming Algorithm
3. Different versions and analysis of the Porter's Stemming Algorithm
4. Animation of String Matching Algorithm I
5. Animation of String Matching Algorithm II
7 9/10 1. Indexing and Searching[ Color - B&W ]
2. Chinese Language Processing [ Color - B&W ]
Scipy II TBD 1. Index Structure
2. On tries and suffix trees
3. Approximate String Matching
4. Signature File Method
5. Lexical Analysis
6. Inverted File Compression
7. Chinese Lexicon and Brown Corpus
8. WordNet
8 16/10 1. Ranking and Metasearching[ ppt ]
2. Query Language [ Color - B&W ]

3. Query Operations[ Color - B&W ]
Information Retireval Software TBD 1. The shape of the web
2. The structure of the web
3. Evolutionary Dynamics of the World Wide Web
4. Graph Theory in Practice: Part II
5. Internet Ecologies
6. How Google Finds Your Needle in the Web's Haystack
7. Google Keeps Tweaking Its Search Engine
9 23/10 1. Web Structures[ Color - B&W ]

2. Clustering Algorithms[ Color - B&W ]
TBD TBD 1. Suffix Tree notes
2. PAT Tree algorithm and data structure notes
3. PAT Trees and PAT Arrays
10 30/10 Multimedia Information Retrieval TBD TBD 1. How much information is on the Internet?
2. Computation on the Web
11 6/11 Teoma and P2P TBD TBD 1. Original lecture notes from Berkeley
2. Teoma Search Engine
3. Gnutella
12 13/11 Wikis TBD TBD 1. How much information is on the Internet?
2. Computation on the Web
13 20/11 Web Business Models TBD TBD TBD
14 27/11 Wrap up and Summary TBD TBD TBD

Notes:

  1. For detailed tutorial information, please go to Tutorial Page.
  2. Please submit your homework assignment to csc5250@cse.cuhk.edu.hk.

Class Project

Class Project Assessment Scheme

Presentation (50%)

  1. Key points of the project, e.g., problem definition, proposed solution, validation, etc.
  2. Demonstration

Report (50%)

  1. It is due at 11:59 on Dec 24, 2006.
  2. No more than 15 pages, single-column, single spacing.
  3. Submit a single compressed directory file that includes: (1) The written report, preferrably in pdf, but it is fine to include both pdf and the original file. (2) Presentation files. (3) Program files. (4) References. (5) Other relevant files, e.g., webpages, lexicons, databases, other supporting files, etc. (6) Name your top level directory using your student ID(s), e.g., one student-99123456, two students-99123456-00123456.
  4. It should contain a cover page with abstract, introduction/background, problem definition, proposed solution, validation procedure, etc. It should use graphs, figures, tables, etc. to illustrate the outcome.
  5. You will be assessed by: (1) (5%) understanding of the problem. (2) (10%) your proposed solution. (3) (10%) the extend of the solution. (4) (10%) the result (any update from the presentation and demonstration). (5) (5%) the validation procedure. (6) (10%) the clarity and the technical quality of the report.
  6. If your files are more than 5M, pls prepare a CD instead of just an email copy.

CSC5250 Class Project Presentation Schedule

The class presentation will be on either December 18 or December 19. Please sign-up by sending an email to the xpeng@cse.cuhk.edu and king@cse.cuhk.edu.hk.

Notice: - There are some adjustments for the project presentation. The time slot of 10:30 - 12:00 on December 18 are not available due to the invited seminar in CSE Dept. So we could do the presentation in the morning without the slot of 10:30 - 12:00 and the whole afternoon from 2:00 to 6:30. And every group has 30 minutes to present the work. Thanks for your attention.

  1. The venue is Room 1022, Ho Sin Hang Engineering Building.
December 18
Presentation Time
Team Name
9:00 AM TBD
9:30 AM Chan Hoi Tung (04730173) and Chan Kam Tong (04730584)
10:00 AM Hongbo Deng (06239340) and Yingyi Bu (06238270)
10:30 AM - 12:00 AM Not Available
Break Break
14:00 PM Wong Tsz Keung (03558911)
14:30 PM Wu Di (06236600) and Zhou Tu (06238680)
15:00 PM Wong Tik Shun (06235200)
15:30 PM LIU Renting (06239670) and SHAN Qi (06238500)
16:00 PM Chu Ka Cheong (06836804)
16:30 PM CHIO Ka In (04527223) and PUN Iek Hoi (04532334)
17:00 PM TBD
17:30 PM TBD
18:00 PM TBD

Examination Schedule

Time Venue Notes
Midterm Examination 14:30 pm - 16:15 pm Tue, Nov 7 2006 LSB C3 This is a close-book and close-note examination. It will cover all topics discussed in the class. Remarks: One A4, double-side “cheat sheet”, Approved calculators are allowed.
Final Examination 9:30 am - 11:30 am Fri, Dec 12 2006 Thomas H C Cheung Gymnasium, UC This is a close-book and close-note examination. It will cover all topics discussed in the class. Remarks: One A4, double-side “cheat sheet”, Approved calculators are allowed.

Grade Assessment Scheme

  1. Homework Assignments and Quizzes, 20%
  2. Midterm Examination, 15%
  3. Project, 25%
    1. Proposal and Work, 10%
    2. Presentation and Demonstration, 10%
    3. Report, 5%
  4. Final Examination, 40%
  5. Optional Extra Credits

Note: The minimum passing grade is to achieve at least 40 out of 100 in the final examination.

Required Background

  1. Pre-requisites
    1. - None.

Programming Requirement

Familiarity with the following topics is highly recommended:

  1. Data Structure: data types and structures, lists, queues, stacks, trees, sets, etc.
  2. Algorithm: analysis, design, sorting methods, numerical methods, algorithms on graphs, etc.
  3. Operating System & Programming Environment: Unix systems, C, SQL, and Matlab.

Reference Books

  1. Modern Information Retrieval, Ricardo Baeza-Yates and Berthier Ribeiro-Neto, ACM Press, 1999.
  2. Information Retrieval: Data Structures and Algorithms, William B. Frakes and Ricardo Baeza-Yates, Prentice Hall, 1992.
  3. Information Retrieval: Algorithms and Heuristics, David A. Grossman and Ophir Frieder, Springer, 2004.
  4. Understanding Search Engines: Mathematical Modeling and Text Retrieval, Michael Berry, Murray Browne, and Jack Dongarra, SIAM Society for Industrial and Applied Mathematics, 2005.
  5. Managing Gigabytes: Compressing and Indexing Documents and Images, Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Morgan Kaufmann, 1999.
  6. The Geometry of Information Retrieval, C. J. van Rijsbergen, Cambridge University Press, 2004.

Book Sources

  1. Academic & Professional Book Centre, 1H Cheong Ming Bldg., 80-86 Argyle St., Kowloon, 2398-2191, 2391-7430 (fax)
  2. Caves Books (H. K.), 4B Ferry St., G/F., Yaumatei, Kowloon, 2780-0987, 2771-2298
  3. Man Yuen Book Company, 45 Parkes street, Jordan Road, Kowloon, Hong Kong, 2366-0594. Not very large, Asian edition books, fair price, wide range, some 10% discount.
  4. Swindon Book Co. Ltd, 13-15 Lock Road, Tsim Sha Tsiu, Kowloon, 2366-8001. One of the largest book stores in Hong Kong, exchange rate is not favorable.
  5. Hongkong Book Centre, 522-7064. A branch of the Swindon book shop.

FAQ

A: Currently, we are building an extensive archive on this topic at the Knowledge Bank. You are definitely welcome to contribute your expertise and knowledge on this site.


A: Yes, absolutely you should learn PERL. PERL is a very nice text manipulation language for the web. You will find that using it will cut down the development time since it is an easy to learn shell-type language.


A: You can get the source code from ftp://sunsite.dcc.uchile.cl/pub/users/rbaeza/irbook/.


A: For 2000:


A: The CSE department has a very strict guideline on this issue. The guideline is as follows:

If a student is found plagiarizing, his/her case will be reported to the Department Discipline Committee. If the case is proven after deliberation, the student will automatically fail the course in which he/she committed plagiarism. The definition of plagiarism includes copying of the whole or parts of written assignments, programming exercises, reports, quiz papers, mid-term examinations. The penalty will apply to both the one who copies the work and the one whose work is being copied, unless the latter can prove his/her work has been copied unwittingly. Furthermore, inclusion of others' works or results without citation in assignments and reports is also regarded as plagiarism with similar penalty to the offender.

A student caught plagiarizing during tests or examinations will be reported to the Faculty Office and appropriate disciplinary authorities for further action, in addition to failing the course.

Resources

  1. Please refer to Multimedia Information Processing Lab's (MIP Lab) knowledge bank for more resources on this and other related issues.
  2. CS460 by Michael Berry CS460.
  3. Relevant Conferences
    1. - WWW 1999, 2000, 2001, 2002, 2003, 2004
    2. - INET 2000, 2001, 2002, 2003, 2004

Python

Python Discussion Site