ML-NLP

## NLP

NLP (natural language processing) is a set of techniques for the automated generation, manipulation, and analysis of human languages.

### Applications

Processing large amounts of text;
Indexing and searching large texts;
Speech understanding;
Information retrieval;
Automatic summarization;
Human-computer interaction.

## Common Tasks

### 1 Stemming (reducing words to their base form)

Stemming is the process of reducing inflected words to their word stem, base, or root form, for example mapping the different tense forms of a verb back to a single form.

{gives, gave, given, giving} --> {give}

Stemming is widely used to normalize text, for example in search engines.
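
The idea of mapping inflected forms to a shared stem can be sketched with a toy suffix-stripping function. This is a minimal illustration only: the suffix list and the `naive_stem` name are assumptions made here, not a standard algorithm such as the Porter stemmer.

```python
# Toy suffix-stripping stemmer (illustrative only, not the Porter algorithm).
# The suffix list below is a hypothetical choice; it is checked in order.
SUFFIXES = ("ing", "es", "ed", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["jumps", "jumped", "jumping", "jump"]
stems = {naive_stem(w) for w in words}
print(stems)  # all four forms collapse to the single stem "jump"
```

Applied to the example above, this sketch reduces "gives" and "giving" to "giv" but leaves the irregular forms "gave" and "given" unchanged. Mapping those to "give" needs a dictionary lookup (lemmatization) rather than suffix stripping alone.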

### 2 Part-of-Speech Tagging (POS Tagging, grammatical analysis)
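Given a sentence, determine the part of speech of each word.
The tags are grammatical categories such as nouns, verbs, adjectives, and adverbs.
The task is difficult because many words are ambiguous: the same word can take different parts of speech depending on context (for example, "book" can be a noun or a verb).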

| The | ball | is | red |
|---|---|---|---|
| article | noun | verb | adjective |
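
A minimal sketch of the tagging task, assuming a tiny hand-written lexicon. The `LEXICON` table and the `tag` function are hypothetical names introduced for illustration. Real taggers use the surrounding context to resolve ambiguity, whereas this sketch simply takes the first listed tag.

```python
# Hypothetical toy lexicon: each word maps to the tags it can take.
LEXICON = {
    "the": ["article"],
    "ball": ["noun"],
    "is": ["verb"],
    "red": ["adjective"],
    "book": ["noun", "verb"],  # ambiguous: "a book" vs. "book a flight"
}


def tag(sentence: str) -> list[tuple[str, str]]:
    """Tag each word with the first tag listed in the lexicon."""
    tokens = sentence.lower().rstrip(".").split()
    return [(tok, LEXICON.get(tok, ["unknown"])[0]) for tok in tokens]


tagged = tag("The ball is red.")
print(tagged)
```

Because the lookup ignores context, `tag("Book the ball.")` labels "book" as a noun even though it acts as a verb there, which is the kind of ambiguity that makes the task hard.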

### 3 Parsing (building phrases and dividing a sentence into its structure)
Determine the parse tree (the grammatical structure) of a given sentence.
The sentence can be written in either a natural language or a programming language.

| | The | boy | went | home |
|---|---|---|---|---|
| Phrase | NP (noun phrase) | NP (noun phrase) | VP (verb phrase) | VP (verb phrase) |
| Part of speech | article | noun | verb | noun |
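
A parse tree for this sentence can be built with a small hand-written parser. The sketch below assumes a three-rule toy grammar and a fixed word-to-tag table (`POS`). Both are hypothetical simplifications chosen to fit this one example, and "home" is treated as a noun.

```python
# Hypothetical toy grammar (an assumption for illustration):
#   S  -> NP VP
#   NP -> article noun | noun
#   VP -> verb noun | verb
POS = {"the": "article", "boy": "noun", "went": "verb", "home": "noun"}


def expect(tokens, i, tag):
    """Return a (tag, word) leaf if tokens[i] has the given tag, else None."""
    if i < len(tokens) and POS.get(tokens[i]) == tag:
        return (tag, tokens[i])
    return None


def parse_np(tokens, i):
    """Parse a noun phrase at index i; return (subtree, next index)."""
    children = []
    article = expect(tokens, i, "article")
    if article:
        children.append(article)
        i += 1
    noun = expect(tokens, i, "noun")
    if noun is None:
        raise ValueError(f"expected a noun at position {i}")
    children.append(noun)
    return ("NP", children), i + 1


def parse_vp(tokens, i):
    """Parse a verb phrase at index i; return (subtree, next index)."""
    verb = expect(tokens, i, "verb")
    if verb is None:
        raise ValueError(f"expected a verb at position {i}")
    children = [verb]
    i += 1
    noun = expect(tokens, i, "noun")
    if noun:
        children.append(noun)
        i += 1
    return ("VP", children), i


def parse_sentence(sentence):
    """Parse a whole sentence into a nested (label, children) tree."""
    tokens = sentence.lower().rstrip(".").split()
    np, i = parse_np(tokens, 0)
    vp, i = parse_vp(tokens, i)
    if i != len(tokens):
        raise ValueError(f"unexpected word at position {i}")
    return ("S", [np, vp])


tree = parse_sentence("The boy went home.")
print(tree)
```

The result is a nested tuple with `S` at the root, an `NP` subtree for "the boy", and a `VP` subtree for "went home". Sentences that do not fit the toy grammar raise a `ValueError`.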

### 4 Semantics
Machine Translation;
Natural Language Generation;
Natural Language Understanding;
Sentiment Analysis;
Topic Modeling (topic classification).

## TF-IDF
TF-IDF is short for Term Frequency-Inverse Document Frequency.
It measures how important a word is to a document in a collection or corpus.

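For example, suppose a document contains 100 words and the word "cat" appears 3 times, and that "cat" appears in 1,000 of the 10,000,000 documents in the corpus:

TF for "cat" is 3/100 = 0.03;
IDF is log10(10,000,000/1,000) = 4;
TF-IDF is 0.03 * 4 = 0.12.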
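
The same calculation can be sketched in a few lines of Python. This is a minimal version that assumes a base-10 logarithm and no smoothing. The function names and the ten-document toy corpus are hypothetical, and other IDF variants (natural logarithm, smoothed counts) are also in common use.

```python
import math


def tf(term, doc_tokens):
    """Term frequency: occurrences of the term divided by document length."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """Inverse document frequency with a base-10 log (an assumption here)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        return 0.0  # term never seen; avoid division by zero
    return math.log10(len(corpus) / n_containing)


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF score of a term for one document in the corpus."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus: 10 tokenized documents, "cat" occurs in only one.
doc0 = "the cat sat down".split()
other = "the dog ran off".split()
corpus = [doc0] + [other] * 9

cat_score = tf_idf("cat", doc0, corpus)  # 1/4 * log10(10/1)
the_score = tf_idf("the", doc0, corpus)  # 1/4 * log10(10/10)
print(cat_score, the_score)
```

In this toy corpus "cat" scores higher than "the": both occur once in the document, but "the" appears in every document, so its IDF is zero.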