Xinxin Tang

A technical blog


  • Home

  • Archives

Tech stack

Posted on 2018-05-30

Data structures & Algorithms

1 Book:
“Data Structures and Algorithms in JAVA”(Entry)
“Intro to Algorithms”(Promition)

Programming Languages

Java Conceptions

1 Books:
Thinking in Java;
Effective Java;
Programmer interview preparing

C++

大量练习!!!
1 Books:
C++ Primer;
Essential C++;
Effective C++;
C++语言的设计与演化;
深度探索C++对象模型;
STL源码剖析;
代码大全;
编译原理;
自动机理论;
语言哲学.

Big data

Bid data interview blog

Threads & Locks

  1. Difference between Threads & Process
  2. Multithreads, lock, semaphore
  3. Resource management
  4. Deadlock and how to prevent
  5. Blocking Queue implement
  6. Producer-Consumer implement

Careerup interview question 1
Careerup interview question 2

System design

View engineering blogs as many as possible

OOD

implement Singleton, Factory and MVC pattern. Design a class: LRU, Trie, Iterator, BST, Blocking Queue.

Soft skills

Active to learn;
Quick to learn;
Excellent communication ability.

ML-Basic

Posted on 2018-05-30

ML Basic

Applications of ML

Text Recognition;
Spam Filtering;
Recommendation System;
Medical Diagnosis;
Defeat Human;
Self-driving Car.

Types

Supervised learning: learning by examples
Labeled data;
Direct feedback;
Predict outcome/future’

Unsupervised learning: discovering patterns itself
No labels;
No feedback;
Find hidden structure.

Reinforcement learning: feedback right/wrong
Decision process;
Reward system;
Learn series of actions.

Types of tasks of ML

Classification:
Map an element to a category
Regression:
Predict numerical values
Similarity:
Find more like this
Ranking:
Return an ordered list(based on relevance)
Sequence Prediction:
Predict the next one in the series

Regression vs. Classification

1 A regression model predicts continuous values.
e.g.: What is the value of a house in California?
What is the probability that a user will click on this ads?

2 A classification model predicts discrete values.
e.g.: Is a given email message spam or not spam?
Is this an image of a dog, a cat, or a hamster?

ML-NLP

Posted on 2018-05-30

NLP

A set of techniques for automated generation, manipulation and analysis of human languages.

Applications

Processing large amount of texts;
Index and search large texts;
Speech understanding;
Information retrieval;
Automatic summarization;
Human-Computer Interaction;

Common Tasks

1 Stemming(单词还原)

Stemming is the process of reducing inflected words to their word stem, base or root form.(将单词各种时态还原为原始形式)

gives, gave, given, giving} --> {give}
1
2
3
4
5
6
7
```
Used widely in normalize text: search engine etc.

### 2 Part-of-Speech Tagging(POS Tagging)(语法分析)
Given a sentence, determine the part of speech for each word;
Grammatical analysis:nouns, verbs, adjectives, adverbs;
Very difficult because words are ambiguous.

The ball is red.
article noun verb adjective

1
2
3
4

### 3 Parsing(短语构建,划分句子结构)
Determine the parse tree(Grammatical analysis) of a given sentence;
The sentence can be either natural language or program language.

The boy went home.
NP(noun phrase) VP(verb phrase)
article noun verb none

1
2
3
4
5
6
7
8
9
10
11

### 4 Semantics(语义学)
Machine Translation;机器翻译
Natural Language Generation;自然语义生成
Natural Language Understanding;自然语言理解
Sentiment Analysis;情感分析
Topic Modeling;主题分类

## TF-IDF
TF-IDF is short for Term Frequency-Inverse Document Frequency.
It measures how important a word is to a document in a collection or corpus.

TF for cat is (3/100) = 0.03;
IDF is log(10,000/1,000) = 4;
TF-IDF: 0.03*4 = 0.12
`

Big data-web scrapers

Posted on 2018-05-30

Web Scraping is data scraping used for extracting data from websites.

Applications

Data source:
When service is unavailable
Crawler & Indexer:
Offers web data search and data integration
Test:
Simulating client behavior to do the testing

Basic Flow

1 Retrieval:get web page
2 Analysis: Raw HTML file –> extract useable data

Goal

parse the HTML into a structured data then extract the data we want.

Methods
1 XPath: XML Path Language is a query language for selecting nodes from an XML document

1
2
3
4
5
/bookstore/book[price>35.00]/title: select all title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00

//title[@lang='en']:selects all title elements that have a "lang" attribute with a value of "en"

/bookstore/book/[position()<3]:selects the first two book elements that are children of the bookstore element

Must know the position or the structure to use XPath.
Package in Python:

lxml import html```
1
2
3

2 Regular Expression:
**Package in Python**: ```import re

3 Beautiful Soup:
Package in Python:from bs4 import BeautifulSoup

Simple Python Scraper

Basic flow:
1 Request Web server to retrieve HTML content:requests
2 Parse the HTML into structured data:lxml
3 Use XPath and Regex to extract useful information:lxml, re
4 Store the information:pymongo

Integration with RabbitMQ

Why integrating with queue?
1 Store scraping tasks temporarily
2 Make scraper running continuously
3 Let scraper feed itself
4 Coordinate multiple scrapers working together

Avoid blocking

Scraping is a gray area:
Data is priceless;
Great pressure on target website;
Scraping is usually unauthorized.
Methods:
1 Limit Scraping Rate:
2 Follow Website’s robots.txt
3 User Agent: A web browser telling a web site information about operating system.
4 Proxy
5 TOR(The Onion Router):Packages: socks, socketTor directs Internet traffic through a free, worldwide, volunteer network consisting of more than seven thousand relays.
6 Captcha
7 One-click Captcha:
Criteria:
1)Behavior prior to clicking
2)Cursor movement(path/acceleration)-PC
3)Test against your browser
4)Cookie
5)History

Front end-Security

Posted on 2018-05-29 | Edited on 2018-05-30

Cookie

Cookie is a small piece of data sent from a website and stored on the user’s computer by the user’s web browser while the user is browsing.

Why Cookie?

To remember stateful information:
is logged in?
shopping car
state on filling a long form
browse history(ads)

Principle

Server uses “Set-Cookie” to pass a cookie to Client;
Client uses “Cookie” to pass a cookie back to Server.

Browser sends a HTTP GET request;
Server sends a HTTP GET response with Set-Cookie;
Browser sends the Cookie in following requests.

Session

we sue Cookie to implement Session.

Password

We never want to store Plain password.

Hash function: SHA1(Still not safe); MD5(MD5 is broken ,never use it);

Issues:
Rainbow table attack;
Password Collision.

SHA1 with SALT

1
f(password, salt) = hash(password + salt)

generating a random salt makes the password harder to be broken.

1
hash([provided password] + [stored salt]) == [stored hash]

the user is authenticated.

SHA1 with SALT is not 100%

it just adds extra effort to crack a password;
binary and data should be isolated;
focus more on web related security;
social engineering & phishing can make all protections.(主要攻破途径)

big data-Service Oriented Architecture

Posted on 2018-05-27 | Edited on 2018-05-31

Service-Oriented Architecture:

Benefits
Isolation:
language, technology, tools
decoupling, independency
deployment and maintenance
Ownership: minimal gray area and gap
Scalability: easy to scale up and modify
Cons
Complexity: sometimes unnecessary
Latency: Network communication eats time
Test effort: all services require E2E tests
DevOp, On-call

APIs

An Application Programming Interface(API) is best though of as a contract provided by one piece of computer software to another.

Classification of APIs

Web Service APIs: SOAP, XML-RPC, JSON-RPC, REST
Library-based APIs: JavaScript(Google Map APIs)
Class-based APIs: Java API, Android API
OS functions and Routines: File system, user interface
Hardware APIs: Cameras

RPC v.s. REST

RPC
Remote Procedure Call is a protocol that one program can use to request a service from a program located in another computer on a network on a network without having to understand the network’s details

RPC call is more like a local function call
E.g.

1
http://<host>:<port>/service?request=GetOrders?orderId=123

Representational State Transfer(RESTful)
is an architecture style for designing networked applications. The idea is that, rather than using complex mechanisms such as CORBA, RPC or SOAP to connect between machines, simple HTTP is used to make calls between machines.

REST is like a HTTP request.

E.g.

1
http://<host>:<port>/service/orders?orderId=123

Comparison

RPC
Pros:
Action Oriented;
RPC is transparent: it is like a local procedure call;
RPC is strictly defined: client must know method name, arguments and types to make a call;
RPC is heavily used in big tech companies
适合SOA架构,API定义清晰

Cons:
RPC is difficult to modify: both server and client need to change.

REST

Pros:
Resource oriented;
Easy to modify: client doesn’t need to dependent on method name, arguments or types. It depends on message format;
Easy to cache.
适合前端架构

Cons:
Format can be changed;
Hard to do authentication.

JSON RPC vs XML-RPC

JSON: less space; easy to read

Package: pyjsonrpc

API design

why API design important?

APIs can be one of greatest assets of companies
API can also be one of greatest liabilities of companies
Public APIs are forever

Why API design important to YOU?

We are natural API designers;
Useful modules tend to get reused;
Good API means great code quality.

Characteristics of Good APIs

Easy to learn and use with good documentation;
Hard to misuse;
Easy to read and maintain the code that uses it;
Sufficient powerful to satisfy requirements;
Easy to extend.

API design principles

Functionality should be easy to explain;
If it’s hard to name, that’s generally a bad sign;
Good names drive development.

Big Data-Databases No-SQL

Posted on 2018-05-27 | Edited on 2018-06-01

NoSQL databases

Two styles of distributing data:

1 Sharding: it distributes different data across multiple servers, so each server acts as the single source for a subset of data.
2 Replication: it copies data across multiple servers, so each bit of data can be found in multiple places.
Two forms:
Master-Slave replication: it makes one node the authoritative copy that handles writes while slaves synchronize with the master and may handle reads.
Peer-to-Peer replication: it allows writes to any node; the nodes coordinate to synchronize their copies of the data.
Master-slave replication reduces the chance of update conflicts but peer-to-peer replication avoids loading all writes onto a single server creating a single point of failure.

CAP theorem

Consistency
Availability
Partition-toleration

Distributed database is doing a trade-off between Consistency and Availability.
Most distributed database is “eventually consistency” in order to get high availability.

Why choose NoSQL Databases

Broad reasons:
1 To improve programmer productivity by using a database that better matches an application’s needs
2 To improve data access performance via some combination of handling larger data volumes, reducing latency, and improving throughput

General guidelines:
1 K/V databases are generally useful for storing session information, user profiles, preferences, shopping cart data. We would avoid using Key-value databases when we need to query by data, have relationships between the data being stored or we need to operate on multiple keys at the same time.
2 Document databases are generally useful for content management systems, blogging platforms, web analytics, real-time analytics, E-commerce applications. We would avoid using document databases for systems that need complex transactions spanning multiple operations or queries against varying aggregate structures.
3 Column family databases are generally useful for content management systems, blogging platforms, maintaining counters, expiring usage, heavy write volume such as log aggregation. We would avoid using column family databases for systems that are in early development, changing query patterns.
4 Graph databases are very well suited to problem spaces where we have connected data, such as social networks, spatial data, routing information for goods and money, recommendation engines

DB-1 MongoDB

1 Document oriented

The document is the unit of storing data in a MongoDB database
Documents map nicely programming language data types:直接将PHP array, JAVA Bean, Python Dictionary, JavaScript Object数据导入Document中。
Embedded documents reduce the need for joins
Dynamic schema makes change easier
JSON-style, stored as BSON

RDBMS MongoDB
Table Collection
Column Key
Value Value
Records/Rows Document/Object

2 JSON

JSON-JavaScript Object Notation
a syntax for storing and exchanging data
an easier-to-use alternative to XML
more human friendly
less data size
easier to read

BSON
Binary JSON used in MongoDB
Compared to JSON, BSON is designed to be efficient both in storage space and scan-speed.

存储时转为BSON,读取数据转化为JSON

3 High Performance

Written in C++;
Use of memory mapped files;
Serialization in BSON for fast parsing;
Indexes can include keys from embedded documents

High Scalable

Vertical scaling v.s. Horizontal scaling
MongoDB supports horizontal scaling through sharding

Structures

A MongoDB instance may have zero or more databases
A database may have zero or more collections
can be though of as the relation(table) in SQL database
A collection may have zero or more Documents
Docs in the same collection don’t even need to have the same fields
Docs are the records in RDBMS
A document may have one or more fields
MongoDB indexes is much like their RDBMS counterparts

Pros

Simple queries;
Much faster than SQL database
Easier and faster integration of data

Cons

not well suited for heavy and complex transactions systems

PyMongo

Super easy to use;
Thread safe;
Built-in connection pool when using MongoClient;
No need to explicitly close to a connection

Steps
1 Create a client connecting to MongoDB instance on 27017

1
2
3
4
from pymongo import MongoClient

client = MongoClient('localhost:27017')
db = client.real_estate_smart_view

read_estate_smart_view: database name

2 Insert a new document

1
db.users.insert({"Last name": "Wang", "First name": "Wei"})

3 Query a document

1
db.users.find({"Last name":"Mao"})

4 Update an existing document

1
2
3
4
5
6
7
8
9
10
db.users.update(
{"Last name": "Mao",
{
"$set":{
"First name": "Qian"
}
}

}
)

$set is a update operator which is to set the field of using the provided value

Big data-Message Queue

Posted on 2018-05-27 | Edited on 2018-05-30

Message queue is a queue of messages sent between applications.

Message queues provide an asynchronous communications protocol, meaning that the sender and receiver of the message do not need to interact with the message queue at the same time.

What is Message Queues?

An application framework for sending and receiving messages;
A way to communicate between applications/systems;
A way to decouple components;
A way to offload work(handle to another worker)

Common use cases

1 Allow web servers to respond to request quickly instead of being forced to perform resource-heavy procedures;
2 Able to distribute a message to multiple recipients for consumption or for balancing loads between workers.

Message Queue in real life

Image resizing
Video processing
Sending out Emails (Ad Campaigns)
Log Analysis

Message Queue Protocol

AMQP: Advanced Message Queue Protocol (RabbitMQ)
STOMP: Streaming Text Oriented Messaging Protocol (ActiveMQ)
XMPP: Extensible Messaging and Presence Protocol

AMQP

Network wire-level Protocol
Defines how clients and brokers talk
Data serialization, heartbeat
AMQP Model
Defines routing and storing of messages
Defines rules how these are wired together
Exported API
Exchange

Queues

consumer gets a message from a queues;
Store message;(short time, has expire time)
FIFO;
Queues can be dynamic;
One message from a queue goes only to one consumer

Exchanges

Route messages
Publisher sends a message to exchange
Can switch between different routing algorithm

Exchange type

Direct(w/o binding)
point to point communication

Direct (w/binding)
binding provides a selector for routing messages from exchange to Queues
Each queue has a binding queues
Each messages has a binding key used fo routing

Fanout
broadcast messages
routing key is ignored
many queues / many consumers
Duplicate and send same message to all queues

Topic
routing key must be in from of xxx.xx.xxx
queues can use wild cast to match topics
exchange copies and sends messages to all matching queues

RabbitMQ

Implements AMQP
Easy to use
Runs on all major operating systems
Supports a huge number of developer platforms
Open source and commercially supported
Cloud-based solution: reduce dependency from localhost
Lower throughput than Apache Kafka

import pika in python

Big Data-Databases SQL

Posted on 2018-05-27 | Edited on 2018-05-31

Relational databases:

Aggregate Data Models: A collection of data that we interact with as a unit. It makes distribution of data easier, since the distribution mechanism has to move the aggregate and not have to worry about related data, as all the related data is contained in the aggregate.

MySQL, PostgreSQL, SQLite

List all users following bob:

1
2
3
4
5
6
7
8
SELECT
User.username
FROM
User
WHERE User.username IN
SELECT from_user FROM User LEFT JOIN Following
ON User.user_id = Following.to_user
WHERE User.username = ‘bob’

Google File System

Posted on 2018-05-27 | Edited on 2018-05-28

HDFS is similar to GFS
To solve how to save data

Metadata:
1 File Info: file name, file size, create time
2 Index: block 1,2,3…n

Index finds the position of data.

How to save big file?

save as a chunk instead block
1 Chunk = 64 MB; 1 Block = 1024 Bytes
Pros
less metadata;
less network traffic
Cons
waste space if file is not big enough

How to save huge file?

Master server + Many chunk servers
Master server: save metadata and Index
Chunk server: save metadata and chunks

Master server manages all chunk servers

How to detect data error?

Verify checksum when reading data

How to handle data error?

Replicas: 3

How to restore chunk?

ask master for help to restore

How to detect Chunk Server down?

Heartbeat

How to restore chunk after CS down?

if CS down, delete the index and add into repair procedure to restore data based on replicas.
Repair priority is based on the number of replicas.

If all replicas loss, the data loss forever.

How to avoid hot spot? 热点数据

Replicate a chunk into more replicas;
Fill the Chunk Server with more space and bandwidth.

How to read data?

read

How to write data?

write

123…7

Xinxin Tang

63 posts
44 tags
© 2018 Xinxin Tang
Powered by Hexo v3.7.1
|
Theme — NexT.Pisces v6.2.0