Big Data Course Note

PPT 1

Big data: the amount of data just beyond technology’s capability to store, manage and process efficiently.

Types of Data:
structured:
usually has defined length and format;
Ex: numbers, dates, strings

unstructured:
doesn’t follow a specific format;
Ex: documents, images, video, audio, sets of data

semistructured:
falls between the other two;
may not conform to a fixed schema, but may be self-defining;
Ex: RDF, XML, Linked data

Big data can also be characterized as the situation where the most difficult problem is not how to store the data, but how to process it in meaningful ways.

Top 5 advantages of successfully managing Big Data:

Improving overall agency efficiency
Improving speed/accuracy of decision making
Ability to forecast events
Ease of identifying opportunities for savings
Greater understanding of citizens’ needs

PPT 2

Graph theory:

Maximum flow
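
A minimal sketch of one standard way to compute a maximum flow, the Edmonds-Karp method (augmenting along shortest paths found by breadth-first search). The notes do not say which algorithm the course used, so the choice of algorithm, the function name max_flow and the small example graph are assumptions made for illustration.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Return the maximum flow value from source to sink (Edmonds-Karp)."""
    # Build residual capacities, adding reverse edges with capacity 0.
    residual = {u: dict(edges) for u, edges in capacity.items()}
    residual.setdefault(source, {})
    residual.setdefault(sink, {})
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0)

    total = 0
    while True:
        # Breadth-first search for a path that still has spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical example network: edge capacities as nested dicts.
graph = {
    "s": {"a": 3, "b": 2},
    "a": {"b": 1, "t": 2},
    "b": {"t": 3},
    "t": {},
}
flow_value = max_flow(graph, "s", "t")
```

In this example the total capacity leaving "s" is 5 and all of it can be routed to "t", so the flow is limited by the source edges.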

PPT 3

Analytics: a suite of tools for processing data to achieve actionable intelligence in support of decision-making processes.

Big Data issues affecting analytics:
–Volume
–Velocity
–Variety
–Veracity
–Value

PPT 4

Statistical Machine Learning

Unsupervised:
K-means Clustering
Principal Component Analysis (PCA): Dimensionality reduction

Normalization: rescale values to the range 0 to 1
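
One common way to do this is min-max scaling, x' = (x - min) / (max - min). The sketch below assumes that is the normalization meant here; the function name and sample values are illustrative only.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: the range is zero, so map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_normalize([10, 20, 25, 40])
```

Scaling like this matters for distance-based methods such as K-means, where a feature with a large numeric range would otherwise dominate the distance.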

PPT 5

Classification:
K-Nearest Neighbor
Linear Regression Model
Logistic Regression Model
Generalized Linear Model:
3 essential parts:
–Error distribution
–Link function
–Variance function
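
As a worked instance of the three parts, logistic regression (listed above) can be written as a generalized linear model. The symbols (response Y_i, mean \mu_i, feature vector x_i, coefficients \beta) are notation assumed here, not taken from the slides.

```latex
\begin{aligned}
\text{Error distribution:}\quad & Y_i \sim \mathrm{Bernoulli}(\mu_i) \\
\text{Link function:}\quad & g(\mu_i) = \log\frac{\mu_i}{1-\mu_i} = x_i^{\top}\beta \\
\text{Variance function:}\quad & V(\mu_i) = \mu_i\,(1-\mu_i)
\end{aligned}
```

Ordinary linear regression fits the same template with a normal error distribution, the identity link g(\mu_i) = \mu_i, and a constant variance function.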

Support Vector Machines (SVM)

PPT 6

SQL: Select, Insert, Update, Delete statements
Data Definition:
–Schema defined at the start
–Triggers to respond to Insert, Update and Delete
–Alter and Drop
–Security and Access Control

Transactions: Properties
–Atomic
–Consistent
–Isolated
–Durable
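
A small sketch of atomicity and consistency using Python's built-in sqlite3 module. The accounts table, the CHECK constraint and the transfer function are made up for illustration: a transfer either applies both updates or neither.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # consistency rule: no overdrafts
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one transaction; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit to dst was rolled back too.
        return False


ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The second transfer would overdraw the source account, so the whole transaction is undone and the balances stay as they were after the first transfer.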

NoSQL:
CAP not ACID:
–Consistency: all nodes see the same data at the same time.
–Availability: a guarantee that every request receives a response about whether it succeeded or failed. All clients can always read and write data.
–Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system.
–Multiple entry points

K/V stores

Column stores:
–Data is stored by column rather than by row as in an RDBMS
–Read faster than write

Ex: Cassandra, HBase (based on Google’s Bigtable)
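
A simplified sketch of the difference between a row layout and a column layout, using plain Python structures. The sample records are invented, and real column stores add column families, compression and on-disk formats that are not modelled here.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "Ann", "age": 31},
    {"id": 2, "name": "Bob", "age": 45},
    {"id": 3, "name": "Cy", "age": 27},
]

# Column-oriented layout: each field's values are stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An aggregate over one field only needs to touch that one column.
average_age = sum(columns["age"]) / len(columns["age"])
```

Reading a single attribute across many records is cheap in the column layout, while rebuilding one complete record means visiting every column.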

Document stores:
–Key-document stores: the document can be seen as the value part, so this model can be considered a superset of Key-Value
–Document can be represented in many ways, but XML, JSON and BSON are typical representations.

Ex: MongoDB, CouchDB

Graph DBs
Graph databases are built with nodes, relationships between nodes and the properties of nodes.
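
A toy illustration of that model: nodes with properties, plus typed relationships between nodes. All names and data are hypothetical, and a real graph database would index relationships rather than scan a list.

```python
# Nodes carry properties.
nodes = {
    "alice": {"label": "Person", "age": 30},
    "bob": {"label": "Person", "age": 25},
    "acme": {"label": "Company"},
}

# Relationships are (source, type, target) triples.
relationships = [
    ("alice", "KNOWS", "bob"),
    ("alice", "WORKS_AT", "acme"),
    ("bob", "WORKS_AT", "acme"),
]


def neighbors(node, rel_type):
    """Nodes reached from node by following outgoing relationships of one type."""
    return [dst for src, rel, dst in relationships if src == node and rel == rel_type]


def colleagues(node):
    """Other nodes that have a WORKS_AT relationship to the same target."""
    companies = set(neighbors(node, "WORKS_AT"))
    return sorted(
        src
        for src, rel, dst in relationships
        if rel == "WORKS_AT" and dst in companies and src != node
    )


known = neighbors("alice", "KNOWS")
coworkers = colleagues("alice")
```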

Case: Cassandra
-Both Key-Value and Column store;
-High availability;
-Eventual consistency;
-Incremental scalability

Architecture
-Implements a peer-to-peer distribution system across nodes
-Each node is independent but interconnected with all other nodes
-Each node can accept read and write requests, regardless of where the data is located in the cluster
-When a node goes down, read/write requests can be served from other nodes
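
A much-simplified sketch of the idea behind the points above: keys are placed on a hash ring of equal peers and copied to more than one node, so a read can still be answered when one owner is down. The Ring class, the node names and the replica count are assumptions for illustration; this is not Cassandra's actual implementation and it leaves out coordinators, consistency levels and repair.

```python
import hashlib
from bisect import bisect_right


class Ring:
    """Toy peer-to-peer ring with replication (illustrative only)."""

    def __init__(self, node_names, replicas=2):
        self.replicas = replicas  # assumed to be <= number of nodes
        self.up = set(node_names)  # nodes currently reachable
        self.tokens = sorted((self._hash(n), n) for n in node_names)
        self.data = {n: {} for n in node_names}

    @staticmethod
    def _hash(text):
        return int(hashlib.sha256(text.encode()).hexdigest(), 16)

    def owners(self, key):
        """The replicas for key: the next nodes clockwise from the key's hash."""
        start = bisect_right(self.tokens, (self._hash(key), ""))
        return [
            self.tokens[(start + i) % len(self.tokens)][1]
            for i in range(self.replicas)
        ]

    def write(self, key, value):
        for node in self.owners(key):
            if node in self.up:
                self.data[node][key] = value

    def read(self, key):
        for node in self.owners(key):
            if node in self.up and key in self.data[node]:
                return self.data[node][key]
        return None


ring = Ring(["n1", "n2", "n3"])
ring.write("user:42", "Ann")
first_owner = ring.owners("user:42")[0]
ring.up.discard(first_owner)  # simulate one replica going down
value_after_failure = ring.read("user:42")
```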

Case: MongoDB
Goals:
-Scale horizontally over commodity hardware
-Resolve problems that don’t distribute well
-Use in-memory processing wherever possible
-Auto-sharding built-in
-Dynamically add/remove capacity with no downtime
-Query contents as well as structure

JSON style documents with dynamic schemas;
No predefined schema;
Stores documents in BSON (a binary format)

All indexes in MongoDB are B-Tree indexes.

Queries:

CRUD Operations:
Create:

db.collection.insert(<document>)
db.collection.save(<document>)
db.collection.update(<query>, <update>, { upsert: true })

Read:

db.collection.find(<query>, <projection>)
db.collection.findOne(<query>, <projection>)

Update:

db.collection.update(<query>, <update>, <options>)

Delete:

db.collection.remove(<query>, <justOne>)
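
To make the four operations concrete without a running server, here is a hypothetical in-memory stand-in written in Python. TinyCollection is not a MongoDB API: it supports only equality queries, takes the projection as a list of field names, and its update behaves like a field-level set rather than a full MongoDB update document.

```python
class TinyCollection:
    """Illustrative in-memory collection; not a real MongoDB driver."""

    def __init__(self):
        self.docs = []
        self._next_id = 1

    @staticmethod
    def _matches(doc, query):
        # Equality matching only: every queried field must have the given value.
        return all(doc.get(k) == v for k, v in query.items())

    def insert(self, document):  # Create
        doc = dict(document)
        doc.setdefault("_id", self._next_id)
        self._next_id += 1
        self.docs.append(doc)
        return doc["_id"]

    def find(self, query=None, projection=None):  # Read
        hits = [d for d in self.docs if self._matches(d, query or {})]
        if projection:
            hits = [{k: d[k] for k in projection if k in d} for d in hits]
        return hits

    def update(self, query, changes, multi=False):  # Update
        count = 0
        for d in self.docs:
            if self._matches(d, query):
                d.update(changes)
                count += 1
                if not multi:
                    break
        return count

    def remove(self, query, just_one=False):  # Delete
        kept, removed = [], 0
        for d in self.docs:
            if self._matches(d, query) and not (just_one and removed):
                removed += 1
            else:
                kept.append(d)
        self.docs = kept
        return removed


posts = TinyCollection()
posts.insert({"author": "ann", "title": "Intro", "views": 1})
posts.insert({"author": "ann", "title": "Graphs", "views": 5})
posts.insert({"author": "bob", "title": "SQL", "views": 2})

titles = posts.find({"author": "ann"}, ["title"])
updated = posts.update({"title": "Intro"}, {"views": 10})
views = posts.find({"title": "Intro"})[0]["views"]
removed = posts.remove({"author": "ann"}, just_one=True)
```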

Blog in Relational DB

Blog in Document DB

Query:

PPT 7

Hadoop
Written in Java
3 core components: HDFS; YARN (scheduler); MPP (Massively Parallel Processing) / MR (MapReduce) (processing modules)
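
The MapReduce processing model can be sketched on a single machine with the classic word-count example. This only mirrors the map, shuffle and reduce steps in plain Python; it does not use Hadoop, and the function names and input lines are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big ideas", "data beats ideas"])
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves the grouped pairs across the network to the reducers.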

HDFS:

Communications:
HDFS communication protocols are layered on top of TCP/IP.

Pig

Hive

PPT 8

Spark
Computations in-memory

PPT 9

Graph Databases
Allow us to store entities and relationships between these entities.

It is a DBMS that supports CRUD operations:
-Normally optimized for OLTP