PPT 1
Big data: the amount of data just beyond technology’s capability to store, manage and process efficiently.
Types of Data:
structured:
usually has defined length and format;
Ex: numbers, dates, strings
unstructured:
doesn’t follow a specific format;
Ex: documents, images, video, audio, sets of data
semistructured:
falls between the other two;
may not conform to a fixed schema, but may be self-defining;
Ex: RDF, XML, Linked data
Alternative view: big data is the situation where the most difficult problem is not how to store the data, but how to process it in meaningful ways.
Top 5 advantages of successfully managing Big Data:
- Improving overall agency efficiency
- Improving speed/accuracy of decision making
- Ability to forecast events
- Ease of identifying opportunities for savings
- Greater understanding of citizens' needs
PPT 2
Graph theory:
Maximum flow
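Sketch of maximum flow, assuming the Edmonds-Karp method (BFS augmenting paths); the notes do not say which algorithm the slides use. The function name `max_flow` and the small example graph are made up for illustration:

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp: push flow along shortest augmenting paths until none remain."""
    # Residual graph: copy the capacities and add reverse edges with capacity 0.
    residual = {u: dict(edges) for u, edges in capacity.items()}
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0)

    flow = 0
    while True:
        # Breadth-first search for a path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, spare in residual[u].items():
                if spare > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left

        # Bottleneck = smallest spare capacity along the path found.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck along the path and update reverse edges.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


# Example graph (hypothetical): edge capacities from s to t.
example_capacity = {
    "s": {"a": 3, "b": 2},
    "a": {"b": 1, "t": 2},
    "b": {"t": 3},
}
example_flow = max_flow(example_capacity, "s", "t")
```

Here `example_flow` equals the total capacity leaving `s`, since both routes into `t` can absorb it.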
PPT 3
Analytics: a suite of tools for processing data to achieve actionable intelligence that supports decision-making.
Big Data issues affecting analytics:
–Volume
–Velocity
–Variety
–Veracity
–Value
PPT 4
Statistical Machine Learning
Unsupervised:
K-means Clustering
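Sketch of K-means as the usual assign/update loop. Assumptions not in the notes: squared Euclidean distance, the first k points as initial centroids (real implementations normally randomise this), and the hypothetical names `kmeans` and `squared_distance`:

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100):
    centroids = list(points[:k])  # naive initialisation (assumption)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels


sample_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
sample_centroids, sample_labels = kmeans(sample_points, k=2)
```

On this toy data the two tight groups end up in separate clusters even though both initial centroids start in the same group.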
Principal Component Analysis (PCA): dimension reduction
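Sketch of PCA reducing data to one dimension. Assumptions: pure Python, power iteration to find the leading eigenvector of the covariance matrix (libraries typically use SVD instead), data with non-zero variance, and the hypothetical name `first_principal_component`:

```python
def first_principal_component(rows, iterations=100):
    """Return the top principal direction and the 1-D projection of each row."""
    n, d = len(rows), len(rows[0])
    # Centre each column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeated multiplication converges to the leading eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Project the centred data onto that direction: d dimensions become 1.
    projected = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projected


line_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # points on y = 2x
direction, reduced = first_principal_component(line_data)
```

Because the sample points lie on a line, the recovered direction is parallel to that line and a single coordinate per point keeps all of the variance.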
Normalization: rescale values to the range 0 to 1
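Sketch of normalization to the 0 to 1 range, assuming min-max scaling is what is meant here; the name `min_max_normalize` and the handling of a constant column (all zeros) are illustrative choices:

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input has no spread; returning zeros is an arbitrary choice.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_normalize([10, 20, 30])
```

Scaling like this matters before distance-based methods such as K-means, so that a feature with a large numeric range does not dominate the distance.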
PPT 5
Classification and Regression:
K-Nearest Neighbor
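Sketch of K-Nearest Neighbor classification by majority vote. Assumptions: Euclidean distance, k = 3, and a made-up training set; `knn_predict` is a hypothetical name:

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train is a list of (features, label); return the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


training_set = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((9, 8), "b"), ((8, 9), "b"),
]
label_near_a = knn_predict(training_set, (1.5, 1.5))
label_near_b = knn_predict(training_set, (8.5, 8.5))
```

There is no training step: the whole training set is kept and searched at prediction time.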
Linear Regression Model
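Sketch of a linear regression fit with one input variable, using the closed-form least-squares solution. Unlike the classifiers in this list, the output is a continuous value. `fit_line` and the sample data are illustrative:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lies on y = 2x + 1
predicted = slope * 4 + intercept
```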
Logistic Regression Model
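Sketch of logistic regression for one input variable, trained with plain gradient descent on the log-loss. The learning rate, epoch count, data and the names `fit_logistic` and `predict_probability` are assumptions for illustration:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, w, b):
    """Estimated probability that x belongs to class 1."""
    return sigmoid(w * x + b)


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log-loss with respect to w and b.
        errors = [predict_probability(x, w, b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


weight, bias = fit_logistic([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
p_low = predict_probability(1, weight, bias)
p_high = predict_probability(6, weight, bias)
```

Classification then amounts to thresholding the probability, commonly at 0.5.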
Generalized Linear Model:
3 essential parts:
–Error distribution
–Link function
–Variance function
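As a worked instance of the three parts, logistic regression can be written as a GLM. The symbols ($Y_i$ response, $\mu_i$ its mean, $x_i$ inputs, $\beta$ coefficients) are notation chosen here, not taken from the slides:

```latex
\begin{align*}
Y_i &\sim \mathrm{Bernoulli}(\mu_i)
  && \text{(error distribution)} \\
g(\mu_i) &= \log\frac{\mu_i}{1-\mu_i} = x_i^{\top}\beta
  && \text{(link function: logit)} \\
\mathrm{Var}(Y_i) &= V(\mu_i) = \mu_i\,(1-\mu_i)
  && \text{(variance function)}
\end{align*}
```

Ordinary linear regression fits the same template with a normal error distribution, the identity link $g(\mu_i)=\mu_i$, and a constant variance function.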
Support Vector Machines (SVM)
PPT 6
SQL:Select, Insert, Update, Delete statements
Data Definition:
–Schema defined at the start
–Triggers to respond to Insert, Update and Delete
–Alter and Drop
–Security and Access Control
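Sketch of the SQL statements and data-definition features above, using Python's built-in `sqlite3` module as a stand-in RDBMS. The table, column and trigger names are invented. SQLite has no GRANT/REVOKE, so security and access control are not shown:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data definition: the schema is declared up front, plus a trigger on UPDATE.
conn.executescript("""
CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL, salary REAL);
CREATE TABLE audit_log (employee_id INTEGER, action TEXT);
CREATE TRIGGER log_update AFTER UPDATE ON employee
BEGIN
    INSERT INTO audit_log VALUES (NEW.id, 'update');
END;
""")

# Insert, Update, Delete.
conn.execute("INSERT INTO employee (name, salary) VALUES (?, ?)", ("Ada", 100.0))
conn.execute("INSERT INTO employee (name, salary) VALUES (?, ?)", ("Bob", 80.0))
conn.execute("UPDATE employee SET salary = ? WHERE name = ?", (120.0, "Ada"))
conn.execute("DELETE FROM employee WHERE name = ?", ("Bob",))

# Alter: the schema changes after creation only through explicit DDL.
conn.execute("ALTER TABLE employee ADD COLUMN department TEXT")

# Select.
employees = conn.execute("SELECT name, salary FROM employee").fetchall()
audit_rows = conn.execute("SELECT employee_id, action FROM audit_log").fetchall()
```

The trigger fires once for the single UPDATE, so `audit_rows` holds one entry for the updated employee.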
Transactions: Properties
–Atomic
–Consistent
–Isolated
–Durable
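Sketch of atomicity and consistency, again with `sqlite3` as a stand-in. The `account` table, the CHECK constraint and the `transfer` helper are hypothetical. A transfer that would overdraw the source account is rolled back as a whole, including the credit that had already been applied:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction; return False if it had to be rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # CHECK constraint violated: neither update is kept


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))


failed_ok = transfer(conn, "alice", "bob", 500)   # would overdraw alice
after_failure = balances(conn)
success_ok = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
```

Isolation and durability are not exercised by a single in-memory connection; they need concurrent connections and on-disk storage respectively.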
NoSQL:
CAP not ACID:
–Consistency: all nodes see the same data at the same time.
–Availability: a guarantee that every request receives a response about whether it succeeded or failed. All clients can always read and write data.
–Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system.
–Multiple entry points
K/V stores
Column stores:
–Data is stored by column rather than by row as in an RDBMS
–Read faster than write
Ex: Cassandra, HBase (based on Google’s Bigtable)
Document stores:
–Key-document stores: the document can be seen as the value part, so this model can be considered a superset of Key-Value
–Documents can be represented in many ways, but XML, JSON and BSON are typical representations.
Ex: MongoDB, CouchDB
Graph DBs
Graph databases are built with nodes, relationships between nodes and the properties of nodes.
Case: Cassandra
-Both Key-Value and Column store;
-High availability;
-Eventual consistency;
-Incremental scalability
Architecture
-Implements a peer-to-peer distribution system across nodes
-Each node is independent but interconnected with all other nodes
-Each node can accept read and write requests, regardless of where the data is located in the cluster
-When a node goes down, read/write requests can be served from other nodes
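Toy simulation of the architecture points above: every node can coordinate a request, and reads survive a node failure because each key lives on more than one node. This is not Cassandra's actual placement scheme. The `ToyCluster` class, the CRC-based replica choice and the replication factor of 2 are all simplifying assumptions:

```python
import zlib


class ToyCluster:
    """Peer-to-peer toy: no master node, each key stored on `replication_factor` nodes."""

    def __init__(self, node_names, replication_factor=2):
        self.order = sorted(node_names)
        self.storage = {name: {} for name in self.order}  # per-node local data
        self.up = {name: True for name in self.order}
        self.replication_factor = replication_factor

    def replicas(self, key):
        # Pick a starting node from a hash of the key, then take its successors.
        start = zlib.crc32(key.encode()) % len(self.order)
        return [
            self.order[(start + i) % len(self.order)]
            for i in range(self.replication_factor)
        ]

    def write(self, coordinator, key, value):
        if not self.up[coordinator]:
            raise ConnectionError(f"{coordinator} is down")
        written = 0
        for name in self.replicas(key):  # coordinator forwards to the replicas
            if self.up[name]:
                self.storage[name][key] = value
                written += 1
        return written

    def read(self, coordinator, key):
        if not self.up[coordinator]:
            raise ConnectionError(f"{coordinator} is down")
        for name in self.replicas(key):  # first live replica answers
            if self.up[name] and key in self.storage[name]:
                return self.storage[name][key]
        return None


cluster = ToyCluster(["n1", "n2", "n3"])
copies_written = cluster.write("n1", "user:1", "ann")
failed_node = cluster.replicas("user:1")[0]
cluster.up[failed_node] = False                       # one replica goes down
live_node = next(n for n in cluster.order if cluster.up[n])
value_after_failure = cluster.read(live_node, "user:1")
```

A replica that was down during a write would hold stale data when it returns; reconciling such copies later is what eventual consistency refers to, and it is not modelled here.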
Case: MongoDB
Goals:
-Scale horizontally over commodity hardware
-Resolve problems that don’t distribute well
-Use in-memory processing wherever possible
-Auto-sharding built-in
-Dynamically add/remove capacity with no downtime
-Query contents as well as structure
JSON style documents with dynamic schemas;
No predefined schema;
Stores documents in BSON (binary JSON format)
All indexes in MongoDB are B-Tree indexes.
Queries:
CRUD Operations:
Create:
db.collection.insert(<document>)
db.collection.save(<document>)
db.collection.update(<query>, <update>, { upsert: true })
Read:
db.collection.find(<query>, <projection>)
db.collection.findOne(<query>, <projection>)
Update:
db.collection.update(<query>, <update>, <options>)
Delete:
db.collection.remove(<query>, <justOne>)
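To make the shape of these operations concrete without a running server, here is a toy in-memory collection in Python. It is not MongoDB or pymongo: `MiniCollection` is hypothetical, queries match on field equality only, and `update` merges fields (roughly what `$set` does) instead of replacing the document:

```python
class MiniCollection:
    """Hypothetical in-memory stand-in for a schemaless document collection."""

    def __init__(self):
        self.docs = []
        self._next_id = 1

    @staticmethod
    def _matches(doc, query):
        return all(doc.get(field) == value for field, value in query.items())

    def insert(self, document):
        doc = dict(document)
        doc.setdefault("_id", self._next_id)  # generated key if none was given
        self._next_id += 1
        self.docs.append(doc)
        return doc["_id"]

    def find(self, query=None, projection=None):
        found = [d for d in self.docs if self._matches(d, query or {})]
        if projection:
            # Keep only the requested fields of each matching document.
            found = [{f: d[f] for f in projection if f in d} for d in found]
        return found

    def update(self, query, changes):
        targets = [d for d in self.docs if self._matches(d, query)]
        for doc in targets:
            doc.update(changes)
        return len(targets)

    def remove(self, query, just_one=False):
        targets = [d for d in self.docs if self._matches(d, query)]
        if just_one:
            targets = targets[:1]
        for doc in targets:
            self.docs.remove(doc)
        return len(targets)


posts = MiniCollection()
posts.insert({"title": "Hello", "author": "ann", "tags": ["intro"]})
posts.insert({"title": "CAP", "author": "bob"})           # different fields: no fixed schema
updated = posts.update({"author": "bob"}, {"title": "CAP theorem"})
titles_by_bob = posts.find({"author": "bob"}, ["title"])
removed = posts.remove({"author": "ann"})
```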
Blog in Relational DB
Blog in Document DB
Query:
PPT 7
Hadoop
Written in Java
3 core components: HDFS (storage); YARN (scheduler); MR (MapReduce) / MPP (Massively Parallel Processing) as processing modules
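Sketch of the MapReduce processing model as a single-process word count. Real Hadoop jobs are normally written against the Java API and run distributed; the function names below are illustrative only:

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data", "Big ideas"])
```

Each map call touches one record and each reduce call touches one key, which is what lets the framework spread both phases across many machines.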
HDFS:
Communications:
HDFS communication protocols are layered on top of TCP/IP.
Pig
Hive
PPT 8
Spark
Computations in-memory
PPT 9
Graph Databases
Allow us to store entities and relationships between these entities.
It is a DBMS that supports CRUD operations:
-Normally optimized for OLTP
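Sketch of the data model: entities as nodes with properties, relationships as typed edges, and CRUD operations on both. `PropertyGraph` is a hypothetical in-memory class, not the API of any particular graph database:

```python
class PropertyGraph:
    """Toy property graph: nodes and relationships both carry properties."""

    def __init__(self):
        self.nodes = {}          # node_id -> properties
        self.relationships = []  # (source, type, target, properties)

    # Create
    def create_node(self, node_id, **properties):
        self.nodes[node_id] = dict(properties)

    def create_relationship(self, source, rel_type, target, **properties):
        self.relationships.append((source, rel_type, target, dict(properties)))

    # Read
    def neighbors(self, node_id, rel_type):
        return [t for s, r, t, _ in self.relationships if s == node_id and r == rel_type]

    # Update
    def update_node(self, node_id, **properties):
        self.nodes[node_id].update(properties)

    # Delete: removing a node also removes the relationships attached to it.
    def delete_node(self, node_id):
        del self.nodes[node_id]
        self.relationships = [
            rel for rel in self.relationships if node_id not in (rel[0], rel[2])
        ]


graph = PropertyGraph()
graph.create_node("alice", age=30)
graph.create_node("bob", age=25)
graph.create_node("carol", age=41)
graph.create_relationship("alice", "FOLLOWS", "bob", since=2020)
graph.create_relationship("bob", "FOLLOWS", "carol", since=2021)
followed_before = graph.neighbors("alice", "FOLLOWS")
graph.update_node("alice", age=31)
graph.delete_node("bob")
followed_after = graph.neighbors("alice", "FOLLOWS")
```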