IK0601 Chinese Text Processing
Introduction
Chinese text differs markedly from English text: it does not use spaces or any other delimiters between words. Word boundaries can vary from one context to another, and ambiguity can arise even within a single context. Chinese words often comprise several characters, typically two or three; words of four or more characters are rare. Many characters can stand alone as words, while in other cases the same character takes on a different meaning when it combines with other characters. Word segmentation is therefore the first, and a very important, step in processing Chinese text.
The next problem is part-of-speech tagging, which involves more than listing words and their parts of speech. Some words can represent more than one part of speech at different times. Tagging any word in a given text therefore involves determining its relationship with adjacent and related words in the phrase, sentence or paragraph.
Another part of the project is to develop a Chinese Grammar Checker. The goal is to detect what are assumed to be the errors of “standard” users.
Important Topics
- Chinese Text Segmentation
The process of identifying word boundaries. Different methods can be used:
- Statistical methods,
based on statistical properties and frequencies of characters and character strings in a corpus
- Dictionary-based methods,
often complemented with grammar rules. This approach uses a dictionary of words to identify word boundaries. Grammar rules are often used to resolve conflicts (choose between alternative segmentations) and to improve the segmentation.
- Syntax-based methods,
which integrate the word segmentation process with syntactic parsing or part-of-speech tagging
- Conceptual methods,
that make use of some kind of semantic processing to extract information and store it in a knowledge representation scheme. Domain knowledge is used for disambiguation.
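The dictionary-based method described above can be sketched with greedy maximum matching. This is only a minimal illustration, not the project's own algorithm: the function names, the `max_len` parameter and the toy dictionary are all assumptions made for the example. Running the matcher in both directions shows how two alternative segmentations can arise, which is the kind of conflict that grammar rules would then have to resolve.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in dictionary:
                words.insert(0, candidate)
                end -= size
                break
    return words


# Hypothetical toy dictionary, chosen so that the two passes disagree.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"

forward = forward_max_match(sentence, toy_dictionary)
backward = backward_max_match(sentence, toy_dictionary)

print(forward)   # ['研究生', '命', '起源']
print(backward)  # ['研究', '生命', '起源']
print("conflict" if forward != backward else "agreement")
```

Here the forward pass greedily takes 研究生 and leaves 命 stranded, while the backward pass yields 研究 / 生命 / 起源. A disagreement between the two passes is one simple signal that a segmentation needs further disambiguation.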
Apart from the above methods, the following issues are included in the study:
- Contextual information
E.g. to predict whether the bigram BC in the character string A B C D constitutes a word, we need to investigate whether the frequencies of AB, CD, A and D should be included in the formula.
- Different types of frequency:
- Relative frequency of individual characters and bigrams (character pairs) in the corpus, i.e. the number of times the character or bigram occurs in the corpus divided by the total number of characters in the corpus.
- Document frequency of characters and bigrams, i.e. the number of documents in the corpus containing the character or bigram divided by the total number of documents in the corpus.
- Weighted document frequency of characters and bigrams.
- Local frequency in the form of within-document frequency of characters and bigrams, i.e. the number of times the character or bigram occurs in the document being segmented.
- Frequency information of characters adjacent to a bigram is used to help determine whether the bigram is a word.
- Positional information.
Whether the position of a character string (at the beginning, middle or end of a sentence) gives some indication of whether the character string is a word.
- Investigate whether the position of the bigram (at the beginning of the sentence, before a punctuation mark, or after a punctuation mark) has a significant effect.
- Segmentation algorithm to segment sentences and resolve conflicts.
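The frequency types listed above can be computed as in the following sketch. It follows the definitions given in the list (relative frequency, document frequency and within-document frequency); the function names and the three-document toy corpus are assumptions for illustration. Weighted document frequency is left out because the weighting scheme is not specified here.

```python
def count_occurrences(unit, text):
    """Count occurrences of a character or bigram, allowing overlaps."""
    n = len(unit)
    return sum(1 for i in range(len(text) - n + 1) if text[i:i + n] == unit)


def relative_frequency(unit, documents):
    """Occurrences in the corpus divided by the total number of characters."""
    total_chars = sum(len(doc) for doc in documents)
    occurrences = sum(count_occurrences(unit, doc) for doc in documents)
    return occurrences / total_chars


def document_frequency(unit, documents):
    """Documents containing the unit divided by the total number of documents."""
    containing = sum(1 for doc in documents if unit in doc)
    return containing / len(documents)


def local_frequency(unit, document):
    """Within-document frequency: occurrences in the document being segmented."""
    return count_occurrences(unit, document)


# Hypothetical toy corpus of three very short "documents".
documents = ["中国人民", "人民银行", "中国银行"]

print(relative_frequency("人民", documents))  # 2 occurrences / 12 characters
print(document_frequency("人民", documents))  # 2 of 3 documents
print(local_frequency("人民", documents[0]))  # 1
```

For the contextual question about a bigram BC in the string A B C D, the same functions can be applied to AB, CD, A and D, so that their frequencies are available as candidate inputs to whatever scoring formula is under investigation.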
- Chinese POS Tagging
- Chinese Grammar Checker (CGC)
As much grammatical information as possible about the text being checked is necessary, yet performing such grammatical analysis is not easy, since grammatical features (“errors”) essential for the analysis might be missing. Methods for CGC are as follows:
- Grammar used (e.g. Constraint-Based Grammar or Lexical-Functional Grammar, LFG)
- A morphological analyser which provides each word form with all of its lexically possible readings (grammatical tags).
- A morphological CG disambiguator, which eliminates incorrect tags according to the grammatical context.
- An error detector that identifies different kinds of grammatical errors.
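The three components listed above (analyser, disambiguator, error detector) can be pictured as a small pipeline. The sketch below is a toy under stated assumptions: the lexicon, the tag names (NUM, CL, N, V, UNK), the single disambiguation rule and the single error rule are all hypothetical and far simpler than a real grammar checker would need.

```python
# Hypothetical toy lexicon: each word form maps to all of its possible tags.
TOY_LEXICON = {
    "一": {"NUM"}, "三": {"NUM"},
    "个": {"CL"}, "本": {"CL"},
    "书": {"N"},
    "计划": {"N", "V"},  # ambiguous between noun ("plan") and verb ("to plan")
}


def analyse(words):
    """Morphological analyser: attach every lexically possible reading."""
    return [(w, set(TOY_LEXICON.get(w, {"UNK"}))) for w in words]


def disambiguate(readings):
    """Disambiguator: drop readings that do not fit the grammatical context.

    Toy rule: directly after an unambiguous classifier, remove the verb
    reading, provided at least one other reading remains.
    """
    result = []
    for i, (word, tags) in enumerate(readings):
        tags = set(tags)
        if i > 0 and result[i - 1][1] == {"CL"} and "V" in tags and len(tags) > 1:
            tags.discard("V")
        result.append((word, tags))
    return result


def detect_errors(readings):
    """Error detector: flag a numeral directly followed by a noun (toy rule)."""
    errors = []
    for i in range(len(readings) - 1):
        if readings[i][1] == {"NUM"} and readings[i + 1][1] == {"N"}:
            errors.append((i, "numeral directly followed by noun: classifier may be missing"))
    return errors


def check(words):
    """Run the three stages in order and return readings and detected errors."""
    readings = disambiguate(analyse(words))
    return readings, detect_errors(readings)


ok_readings, ok_errors = check(["一", "个", "计划"])
bad_readings, bad_errors = check(["三", "书"])
print(ok_readings, ok_errors)
print(bad_readings, bad_errors)
```

In the first input the ambiguous word 计划 is reduced to its noun reading and no error is reported; in the second the toy rule flags the numeral at position 0. The input is assumed to be already segmented, which is why segmentation comes first in the overall process.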
Important Papers
Important Links
Important Conferences
- SIGIR [ 2003 | 2004 | 2005 | 2006 ]
- AIRS [ 2003 | 2004 | 2005 | 2006 ]