Table of Contents
Text Processing and Analysis
Input
- Text
- File
- URL
Output
- Statistics
- Scores
Filtering Procedure
- Misspelled words
- Stop words
- Stemming
- Filtering
- Minimum characters per word
- Special word or expression
- Number of words to be analyzed
- Analyze numbers
- Log the query (only for websites)
- Apply stoplist
- Apply internal stoplist
- Extract links
- Polyword (multi-word) phrases
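
A minimal sketch of how several of the filtering options above might fit together, using only the Python standard library. The function names, the parameter names, and the tiny stoplist are hypothetical placeholders, not part of any existing package; spell checking and stemming are left out because they would normally come from a dedicated library.

```python
import re

# Hypothetical mini stoplist for illustration; a real one would be much longer.
STOPLIST = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}


def filter_words(text, min_chars=3, stoplist=STOPLIST, max_words=None,
                 keep_numbers=False):
    """Tokenize text and apply a few of the filtering options from the outline."""
    # Lowercase, then take runs of letters/digits (an apostrophe part is allowed).
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())
    if max_words is not None:
        tokens = tokens[:max_words]  # "number of words to be analyzed"
    kept = []
    for tok in tokens:
        if tok.isdigit() and not keep_numbers:
            continue  # drop pure numbers unless asked to analyze them
        if len(tok) < min_chars:
            continue  # "minimum characters per word"
        if stoplist and tok in stoplist:
            continue  # "apply stoplist"
        kept.append(tok)
    return kept


def extract_links(text):
    """Return http(s) URLs found in text, with trailing punctuation trimmed."""
    raw = re.findall(r"https?://[^\s<>\"']+", text)
    return [url.rstrip(".,;:!?)") for url in raw]


if __name__ == "__main__":
    print(filter_words("The cat sat on the mat in 2024."))
    print(extract_links("See https://example.com/a and http://x.org."))
```

Passing `stoplist=None` skips the stoplist step, and `keep_numbers=True` keeps purely numeric tokens.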
Basic Text Analysis
- Character, word, paragraph, sentence, and syllable counts
- Average syllables per word
- Average sentence length (words)
- Max sentence length (words)
- Min sentence length (words)
- Line feed, carriage return, tab, and other special-character counts
- Words repeated within a short range
- Frequent words
- Word length
- Bigrams, trigrams, n-grams
- Unique words
- Lexical density
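
The counts listed above could be sketched roughly as follows with the standard library. Everything here is an assumption made for illustration: the function names are hypothetical, sentence splitting is a naive split on end punctuation, and lexical density is taken as unique words divided by total words, times 100 (one convention used by online text analyzers; the linguistic definition counts content words instead).

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive split after ., ! or ? followed by whitespace; abbreviations will confuse it.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def words(text):
    # Lowercased alphabetic tokens, allowing an apostrophe part (e.g. "don't").
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngrams(tokens, n):
    # Sliding window of n consecutive tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def basic_stats(text, top=3):
    toks = words(text)
    sent_lengths = [len(words(s)) for s in split_sentences(text)]
    sent_lengths = [n for n in sent_lengths if n > 0]
    unique = set(toks)
    return {
        "characters": len(text),
        "words": len(toks),
        "sentences": len(sent_lengths),
        "avg_sentence_length": sum(sent_lengths) / len(sent_lengths) if sent_lengths else 0.0,
        "max_sentence_length": max(sent_lengths, default=0),
        "min_sentence_length": min(sent_lengths, default=0),
        "unique_words": len(unique),
        # Assumed definition: unique words / total words * 100.
        "lexical_density": 100.0 * len(unique) / len(toks) if toks else 0.0,
        "frequent_words": Counter(toks).most_common(top),
        "bigrams": Counter(ngrams(toks, 2)).most_common(top),
    }


if __name__ == "__main__":
    sample = "The cat sat. The cat ran away! Did the dog bark?"
    for key, value in basic_stats(sample).items():
        print(key, value)
```

Trigrams and longer n-grams come from the same helper, for example `ngrams(tokens, 3)`.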
Readability Scores
- Lau-King Chinese Readability
Grammar Analysis
- POS tagger
- Parser
- NER
- Location, person, thing, etc.
- Event, year, telephone, address, etc.
- Summarization/annotation
- Sentiment/opinion analysis
- Concept clustering
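
Full named entity recognition would normally rely on a trained tagger from a package such as NLTK. For the more regular entity types in the list above (year, telephone), a pattern-based first pass might look like the sketch below. The regular expressions are assumptions chosen for illustration: they cover years from 1000 to 2099 and 10-digit telephone numbers written in a North American style only.

```python
import re

# Assumed patterns, for illustration only.
YEAR_RE = re.compile(r"\b(?:1[0-9]{3}|20[0-9]{2})\b")
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b")


def extract_entities(text):
    """Return telephone numbers and years found by simple pattern matching."""
    phones = PHONE_RE.findall(text)
    # Blank out the phone numbers first so their digits are not mistaken for years.
    remainder = PHONE_RE.sub(" ", text)
    years = YEAR_RE.findall(remainder)
    return {"telephone": phones, "year": years}


if __name__ == "__main__":
    sample = "Founded in 1998, reopened in 2015. Call (555) 123-1999 or 555-867-5309."
    print(extract_entities(sample))
```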
Visualization
- Histogram
- Similar words, concept graph
- Word cloud
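
Before choosing a plotting package, the histogram item above can be prototyped as plain text. The sketch below is a hypothetical helper that draws a word-length histogram with one bar character per word; the function name and output layout are assumptions.

```python
import re
from collections import Counter


def word_length_histogram(text, bar_char="#"):
    """Return a text histogram: one row per word length, one bar_char per word."""
    lengths = Counter(len(w) for w in re.findall(r"[A-Za-z]+", text))
    lines = []
    for length in sorted(lengths):
        count = lengths[length]
        lines.append(f"{length:>2} | {bar_char * count} {count}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(word_length_histogram("The quick brown fox jumps over the lazy dog"))
```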
Document Segmentation
- Title
- Name
- Abstract
- Conclusion
- References
Resources
Things To Do
- Check the web for similar software packages
- Learn Python and its packages
- Check Python packages for text analysis (NLTK, etc.)