# NLP
A set of techniques for automated generation, manipulation and analysis of human languages.
## Applications
Processing large amounts of text;
Indexing and searching large texts;
Speech understanding;
Information retrieval;
Automatic summarization;
Human-Computer Interaction;
## Common Tasks
### 1 Stemming (word reduction)
Stemming is the process of reducing inflected words to their word stem, base, or root form, for example mapping the different tenses of a verb back to its original form.
Widely used to normalize text, for example in search engines.
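
To make the idea concrete, here is a minimal sketch of suffix stripping in plain Python. It is not the Porter algorithm or any library's stemmer; the `SUFFIXES` list and the `naive_stem` function are hypothetical names chosen for illustration, and the rules are far too crude for real text.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# SUFFIXES is a hypothetical rule list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


words = ["walking", "walked", "walks", "walk"]
stems = [naive_stem(w) for w in words]
print(stems)  # every form maps to "walk"
```

A search engine can apply the same function to both the indexed text and the query, so that a query for "walking" also matches documents containing "walked".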
### 2 Part-of-Speech Tagging (POS Tagging, grammatical analysis)
Given a sentence, determine the part of speech of each word;
Grammatical categories: nouns, verbs, adjectives, adverbs;
Difficult because many words are ambiguous: the same word can take a different part of speech depending on context.
Example: The ball is red.

| Word | Part of speech |
| --- | --- |
| The | article |
| ball | noun |
| is | verb |
| red | adjective |
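
As a rough sketch of what a tagger produces, the snippet below tags the example sentence with a hand-written lookup table. The `LEXICON` dictionary and the `tag_sentence` function are assumptions made up for this illustration; real taggers are trained on annotated corpora and use context.

```python
# Toy lexicon mapping lowercase words to a part of speech (hypothetical, illustration only).
LEXICON = {
    "the": "article",
    "ball": "noun",
    "is": "verb",
    "red": "adjective",
}


def tag_sentence(sentence):
    """Return (word, tag) pairs; words missing from the lexicon are tagged 'unknown'."""
    words = sentence.rstrip(".").split()
    return [(word, LEXICON.get(word.lower(), "unknown")) for word in words]


tagged = tag_sentence("The ball is red.")
print(tagged)
```

A plain lookup assigns one tag per word regardless of context, so it cannot handle the ambiguity mentioned above; that limitation is exactly what makes the task hard.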
### 3 Parsing (building phrases and dividing sentence structure)
Determine the parse tree (grammatical structure) of a given sentence;
The sentence can be written in either a natural language or a programming language.
Example: The boy went home.

| Phrase | Words | Parts of speech |
| --- | --- | --- |
| NP (noun phrase) | The boy | article, noun |
| VP (verb phrase) | went home | verb, noun |
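
The following is a minimal sketch of a hand-written parser for the example sentence, assuming a tiny grammar: S -> NP VP, NP -> article noun | noun, VP -> verb [noun]. The grammar, the `LEXICON` table (which tags "home" as a noun), and all function names are hypothetical choices for illustration, not a general-purpose parser.

```python
# Hypothetical toy lexicon; "home" is tagged as a noun for this example.
LEXICON = {"the": "article", "boy": "noun", "went": "verb", "home": "noun"}


def parse_np(tags, i):
    """NP -> article noun | noun. Returns (subtree, next_index) or None."""
    if i < len(tags) and tags[i][1] == "article":
        if i + 1 < len(tags) and tags[i + 1][1] == "noun":
            return ("NP", tags[i], tags[i + 1]), i + 2
        return None
    if i < len(tags) and tags[i][1] == "noun":
        return ("NP", tags[i]), i + 1
    return None


def parse_vp(tags, i):
    """VP -> verb [noun]. Returns (subtree, next_index) or None."""
    if i < len(tags) and tags[i][1] == "verb":
        children = [tags[i]]
        i += 1
        if i < len(tags) and tags[i][1] == "noun":
            children.append(tags[i])
            i += 1
        return ("VP", *children), i
    return None


def parse_sentence(sentence):
    """S -> NP VP. Returns a nested tuple tree, or None if the sentence does not fit."""
    words = sentence.rstrip(".").split()
    tags = [(word, LEXICON.get(word.lower(), "unknown")) for word in words]
    np_result = parse_np(tags, 0)
    if np_result is None:
        return None
    np_tree, i = np_result
    vp_result = parse_vp(tags, i)
    if vp_result is None:
        return None
    vp_tree, i = vp_result
    if i != len(tags):  # leftover words mean the grammar did not cover the sentence
        return None
    return ("S", np_tree, vp_tree)


tree = parse_sentence("The boy went home.")
print(tree)
```

The result is a nested tuple with an `S` root whose two children are the `NP` subtree for "The boy" and the `VP` subtree for "went home"; a sentence the grammar does not cover returns `None`.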
### 4 Semantics
Machine Translation;
Natural Language Generation;
Natural Language Understanding;
Sentiment Analysis;
Topic Modeling;
## TF-IDF
TF-IDF is short for Term Frequency-Inverse Document Frequency.
It measures how important a word is to a document in a collection or corpus.
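For example, suppose a document contains 100 words and the word cat appears 3 times, and the corpus contains 10,000,000 documents, of which 1,000 contain cat.
TF for cat is 3/100 = 0.03;
IDF is log10(10,000,000/1,000) = 4;
TF-IDF is 0.03 × 4 = 0.12.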
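
The arithmetic can be checked with a short sketch. It assumes a base-10 logarithm, a 100-word document in which the term appears 3 times, and a corpus of 10,000,000 documents of which 1,000 contain the term (corpus sizes chosen so that the IDF comes out to 4). The function names are hypothetical, and real libraries often use smoothed or natural-log variants of IDF.

```python
import math


def term_frequency(term_count, total_terms):
    """Share of the document's words that are the term."""
    return term_count / total_terms


def inverse_document_frequency(total_docs, docs_with_term):
    """Base-10 log of (corpus size / number of documents containing the term)."""
    return math.log10(total_docs / docs_with_term)


def tf_idf(term_count, total_terms, total_docs, docs_with_term):
    """Product of term frequency and inverse document frequency."""
    return term_frequency(term_count, total_terms) * inverse_document_frequency(
        total_docs, docs_with_term
    )


# Assumed example values: 3 occurrences in 100 words; 1,000 of 10,000,000 documents.
score = tf_idf(3, 100, 10_000_000, 1_000)
print(round(score, 2))  # 0.12
```

A term that appears in almost every document gets an IDF close to 0, so its TF-IDF score stays small even when it is frequent within a single document.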