====== Query Categorization ======
===== Pre-processing =====
  * Stemming
    * [[http://tartarus.org/~martin/PorterStemmer/index-old.html|The Porter Stemming Algorithm]]
  * Abbreviation expansion
    * Use the [[http://www.indiana.edu/~letrs/help-services/QuickGuides/oed-abbr.html|Abbreviation list]]
  * Stop-word filtering
    * Use the [[http://snowball.tartarus.org/algorithms/english/stop.txt|Stop-word list]]
  * Misspelled words
    * [[http://aspell.net/|GNU Aspell]]
  * Location-based queries
    * NER for location detection
  * Part-of-speech (POS) tagging
    * [[http://nlp.stanford.edu/software/tagger.shtml|Stanford POS tagger]]
  * Named entity recognition (NER)
    * [[http://nlp.stanford.edu/software/CRF-NER.shtml|Stanford NER tagger]]
    * Person (e.g., Bill Gates)
    * Location (e.g., Hong Kong)
    * Thing (e.g., Table)
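
Two of the steps above, abbreviation expansion and stop-word filtering, can be sketched in a few lines of Python. This is a minimal illustration only: the function name ''preprocess'' and the tiny ''STOPWORDS'' and ''ABBREVIATIONS'' tables are hypothetical stand-ins for the linked lists (the ''hk'' entry in particular is an assumption), and stemming, spelling correction, POS tagging, and NER are not shown.

```python
import re

# Hypothetical miniature resources for illustration only; a real run would
# load the linked stop-word list and abbreviation list instead.
STOPWORDS = {"the", "of"}
ABBREVIATIONS = {"ad": "advertisement", "hk": "hong kong"}


def preprocess(query):
    """Lowercase and tokenize a query, expand abbreviations, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    expanded = []
    for token in tokens:
        # An abbreviation may expand to several words, so split the expansion.
        expanded.extend(ABBREVIATIONS.get(token, token).split())
    return [token for token in expanded if token not in STOPWORDS]


if __name__ == "__main__":
    print(preprocess("the chinese university of hk"))
    print(preprocess("new york pizza"))
```

Expanding abbreviations before filtering keeps the two steps independent, so an expansion that itself contains a stop word is still filtered.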
===== Knowledge Base =====
* Lexicon (e.g., [[http://dbpedia.org/About|DBpedia]] person, location, organization, and product lists)
* [[http://snowball.tartarus.org/algorithms/english/stop.txt|Stop-word list]] (e.g., of, the)
* [[http://www.indiana.edu/~letrs/help-services/QuickGuides/oed-abbr.html|Abbreviation list]] (e.g., ad for advertisement)
===== Useful tools =====
* [[http://tartarus.org/~martin/PorterStemmer/index-old.html|The Porter Stemming Algorithm]]
* [[http://aspell.net/|GNU Aspell]]
* [[http://htmlparser.sourceforge.net/|Web page structure analysis]]
* [[http://www.nzdl.org/Kea/|KEA for keyword extraction]]
* [[http://nlp.stanford.edu/software/tagger.shtml|Stanford POS tagger]]
* [[http://nlp.stanford.edu/software/CRF-NER.shtml|Stanford NER tagger]]
* [[http://wordnet.princeton.edu/|WordNet]]
* [[http://search.cpan.org/dist/WordNet-Similarity/|WordNet::Similarity]]
* [[http://www.nltk.org/|NLTK Toolkit]]
* [[http://docs.python.org/library/bsddb.html|bsddb — Interface to Berkeley DB library]]
===== Input Examples =====
* the chinese university of hk
* new york pizza
* How do I play mp3 using the java programming language
===== Crowdsourcing =====
* Take the top 1000 queries and label them into 32 categories
===== Centroid Method =====
- Function Query2Term(string query) \\ **Input**: a query, **Output**: the terms of this query \\ \\ Example 1: the chinese university of hk -> [the chinese university of hk]1 (where []i marks the i-th term of the query) \\ Example 2: new york pizza -> [new york]1 [pizza]2 \\ Example 3: How do I play mp3 using the java programming language -> [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6 \\ \\
- Function Term2Centroid(string terms) \\ **Input**: the terms of a query, **Output**: the centroid of this query \\ \\ Example 1: [the chinese university of hk]1 -> the chinese university of hk \\ Example 2: [new york]1 [pizza]2 -> pizza \\ Example 3: [play]1 [mp3]2 [use]3 [java]4 [program]5 [language]6 -> mp3 \\ \\
- Function synonym(string keyword) \\ **Input**: a word, **Output**: the set of synonyms of this word in WordNet \\ \\ Example: synonym(car) \\ auto, automobile, machine, motorcar
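
One possible reading of Query2Term is a greedy longest match against an entity lexicon, followed by stop-word removal. The sketch below assumes that reading; the page does not state how terms are actually segmented. The names ''query2term'', ''LEXICON'', and ''STOPWORDS'' are hypothetical, the two tables are miniature stand-ins for the Knowledge Base resources, and the stemming step that turns "using" into "use" in Example 3 is left out.

```python
# Hypothetical miniature lexicon and stop-word list, for illustration only.
LEXICON = {"the chinese university of hk", "new york"}
STOPWORDS = {"how", "do", "i", "the", "of"}


def query2term(query, lexicon=LEXICON, stopwords=STOPWORDS):
    """Split a query into terms, keeping known multi-word entities whole."""
    tokens = query.lower().split()
    # Longest entity in the lexicon, measured in words.
    max_len = max((len(entry.split()) for entry in lexicon), default=1)
    terms = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first (greedy longest match).
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                terms.append(phrase)
                i += n
                break
        else:
            # No entity starts here: keep the token unless it is a stop word.
            if tokens[i] not in stopwords:
                terms.append(tokens[i])
            i += 1
    return terms


if __name__ == "__main__":
    print(query2term("the chinese university of hk"))
    print(query2term("new york pizza"))
```

Matching entities before removing stop words is what keeps "the" and "of" inside the university name in Example 1.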
===== Similarity-based Method =====
- Function catURL(string category, string engine, int n) \\ **Input**: a category, a search engine, and a count n, **Output**: the top n URLs returned by that search engine (e.g., Google) for the category \\ \\ Example: catURL(cuhk, Google, 3) \\ www.cuhk.edu.hk/ \\ www.cuhk.edu.hk/chinese/ \\ www.cuhk.edu.hk/gss/ \\ \\
- Function keywordsURL(string URL) \\ **Input**: a URL, **Output**: the keywords of the Web page at this URL \\ \\ Example: keywordsURL(http://www.cuhk.edu.hk/english/) \\ research, education, shatin, campus, college, etc. \\ \\
- Function synonym(string keyword) \\ **Input**: a word, **Output**: the set of synonyms of this word in WordNet \\ \\ Example: synonym(car) \\ auto, automobile, machine, motorcar
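
The keyword step of keywordsURL can be approximated with the Python standard library alone. The sketch below is an assumption-laden stand-in, not the KEA-based extraction listed under Useful tools: it ranks words by plain term frequency, works on an HTML string that has already been downloaded, and uses a hypothetical function name ''keywords_from_html'' and a hypothetical miniature stop-word set.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical miniature stop-word set, for illustration only.
STOPWORDS = {"the", "and", "for", "with", "that", "this"}


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def keywords_from_html(html, k=5, stopwords=STOPWORDS):
    """Return up to k of the most frequent non-stop-words in the page text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in stopwords)
    return [word for word, _ in counts.most_common(k)]


if __name__ == "__main__":
    page = "<html><body><p>Research and education. Research campus.</p></body></html>"
    print(keywords_from_html(page, k=3))
```

Raw term frequency favours words that are common everywhere, so a real pipeline would likely weight terms against a background corpus or use a trained extractor.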
===== Overall Workflow =====
{{:projs:qcat:figure-updated1.pdf|Workflow for query categorization}}