Xinxin Tang

A technical blog



Big-data

Posted on 2018-05-24

Load Balancing

Load-balancing algorithms:

  1. Round-robin: requests are assigned to all servers in turn.
  2. Minimum connections: pick the server with the fewest active connections, i.e., the one under the least pressure. If requests take a long time, use this approach.
  3. Hash
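The round-robin strategy above can be sketched in a few lines of Java; the `RoundRobinBalancer` class and its server list are illustrative, not taken from any real load-balancer product:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal round-robin selector: servers are picked in order, wrapping around.
class RoundRobinBalancer {
    private final List<String> servers;
    private final AtomicInteger next = new AtomicInteger(0);

    RoundRobinBalancer(List<String> servers) {
        this.servers = servers;
    }

    String pick() {
        // floorMod keeps the index non-negative even if the counter overflows
        int i = Math.floorMod(next.getAndIncrement(), servers.size());
        return servers.get(i);
    }
}
```

An `AtomicInteger` counter makes the selector safe to call from multiple threads without locking.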

Purpose

1. Eliminates single points of failure: multiple servers keep the service running.
2. Improves slow responses: a load balancer plus additional servers spreads the traffic and speeds up responses.
3. A cluster of load balancers prevents the load balancer itself from becoming a single point of failure.

References

Load Balancing

Interviews

Posted on 2018-05-23 | Edited on 2018-05-24

Key points for junior/senior Java interview answers

Guiding principle: every question's answer has scoring points; covering them is enough. Excessive detail is unnecessary.

Architecture

Junior:

  1. Know the SSM stack: use a business case to explain how Spring MVC comes into play.

  2. Know the details of Spring MVC, e.g., how @Autowired is used, how URLs are mapped to Controllers, and how ModelAndView objects are returned.

  3. Explain how you used AOP, with examples from your projects.

Database

  1. How do you create indexes, and how are they used? For example, with an index in place, will a WHERE clause like name LIKE '123%' use it? When should you not create an index, and which statements will not use one?

  2. Beyond indexes, how else can SQL be optimized, e.g., sharding (splitting databases and tables), or finding optimization points via the execution plan? Tie this to your projects.

  3. Being able to optimize is a big advantage in interviews. (Senior developers especially need to understand optimization.)

Java Core

Mainly collections, multithreading, exception handling, and the JVM.

Collections

  1. Have you ever overridden hashCode? In what scenarios? Relate it to hash tables and how HashMap is implemented.
    Senior developers should ideally be able to explain the concurrency internals through the ConcurrentHashMap source code.

  2. Differences between ArrayList and LinkedList? How does ArrayList grow its capacity?
    Senior developers should ideally study the underlying code.

  3. How does a Set prevent duplicates, e.g., TreeSet and HashSet?

  4. The Collections utility methods, e.g., comparison helpers and wrapping a collection to make it thread-safe?

  5. How would you implement a queue or a stack with an ArrayList?
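The hashCode question above can be illustrated with a minimal value class; `Point` is a made-up example, and `Objects.hash` is one common (not the only) way to combine fields:

```java
import java.util.Objects;

// equals() and hashCode() must be overridden together, otherwise
// HashSet/HashMap treat logically equal objects as distinct keys.
class Point {
    final int x, y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y);  // equal objects must produce equal hash codes
    }
}
```

With both methods overridden, a HashSet rejects a second `Point(1, 2)` as a duplicate; with only equals overridden, it would not.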

Multithreading is rarely used in projects, but common questions include:

  1. The difference between synchronized and reentrant locks; semaphores as a concurrency-control mechanism may also come up.

  2. How does a thread return a value? This is really the difference between Callable and Runnable.

  3. Use ThreadLocal and the volatile keyword to explain the Java memory model.

  4. Thread pool usage and common parameters.
    The most frequently asked threading topic is the concurrency mechanism.
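The Callable-vs-Runnable question above can be sketched with a standard ExecutorService; the `CallableDemo` class is illustrative, not from any framework:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Runnable.run() returns void; Callable.call() returns a value (and may throw).
// Submitting a Callable to a pool yields a Future from which the result is read.
class CallableDemo {
    static int square(int n) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Integer> task = () -> n * n;   // returns a value, unlike Runnable
            Future<Integer> result = pool.submit(task);
            return result.get();                    // blocks until the task finishes
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

`Future.get()` is where the returned value crosses back to the calling thread; a plain Runnable has no such channel.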

JVM

  1. Structure diagram and how it works.

  2. Garbage collection for the heap: draw a diagram describing the young generation, old generation, and so on.

  3. The collection process, and how to optimize memory usage in code.

  4. If an OOM exception occurs, how do you troubleshoot it and read the dump file?

  5. GC concepts: strong, soft, and weak references, the finalize method, and so on.
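The strong/soft/weak reference question can be illustrated with java.lang.ref. This is only a sketch of construction and access; actual collection timing is up to the JVM, so nothing here depends on a GC run:

```java
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;

// Strong, soft, and weak references differ in when the GC may reclaim the referent:
// strong: never while reachable; soft: only under memory pressure; weak: at the next GC.
class ReferenceDemo {
    static String describe() {
        String strong = new String("payload");              // strong reference
        SoftReference<String> soft = new SoftReference<>(strong);
        WeakReference<String> weak = new WeakReference<>(strong);
        // While a strong reference exists, both get() calls still see the object.
        return soft.get() + "/" + weak.get();
    }
}
```

Once the last strong reference is dropped, `weak.get()` may return null after the next collection, while a soft referent usually survives until memory runs low.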

Algorithms and design patterns

Answering the questions correctly

Architecture

  1. Prove you can do the work. Explain the Spring MVC, AOP, and IoC architecture and flow with reference to the underlying code, or advanced usage such as interceptors and controllers.

  2. Prove you have Spring Boot and Spring Cloud experience. Explain how the Spring Cloud components are used.

  3. Prove you have distributed development experience. Explain how distributed services run, how to deploy them, and how to achieve load balancing with nginx or similar.

Database

How to tune SQL, e.g., indexes, execution plans, or other optimization points.

Java Core

Using the ConcurrentHashMap source code, explain the uses of final, volatile, and transient, and how a lock prevents concurrent writes.
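As a standalone illustration (not the actual ConcurrentHashMap source), the roles of final, volatile, and transient, with a ReentrantLock serializing writes, might be sketched like this:

```java
import java.io.Serializable;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative class: shows how the three modifiers are typically combined
// and how a ReentrantLock guards the write path.
class Counter implements Serializable {
    private final String name;            // final: safely published after construction
    private volatile int value;           // volatile: readers always see the latest write
    private transient ReentrantLock lock = new ReentrantLock(); // transient: not serialized

    Counter(String name) { this.name = name; }

    void increment() {
        lock.lock();                      // prevents concurrent read-modify-write races
        try {
            value++;                      // ++ alone is not atomic, hence the lock
        } finally {
            lock.unlock();
        }
    }

    int get() { return value; }           // volatile read, no lock needed
    String name() { return name; }
}
```

The same pattern appears in ConcurrentHashMap's internals: volatile fields for lock-free reads, locking only on the write path.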

  1. Discuss design patterns in the context of your projects.

  2. Multithreading: cover advanced topics such as locks and volatile.

  3. The GC process, and how to troubleshoot OOM exceptions with logs and dump files; going further, how to optimize memory usage in code.

Tip: answer senior-level questions with reference to source code and real projects.

Redis Interview

  1. Advantages of Redis
    (1) Blazing fast, because data is stored in memory.
    (2) Rich data types: String, List, Set, Sorted Set, Hash.
    (3) Supports transactions and atomic operations.
    (4) Usable as a cache and for message transport, with per-key expiration times.

  2. Advantages of Redis over Memcached
    (1) Memcached supports only simple strings.
    (2) Redis is faster than Memcached.
    (3) Redis can persist data.

  3. Common performance problems and solutions
    (1) It is better not to enable persistence (RDB snapshots, AOF log files) on the Master node.
    (2) If the data is important, turn on AOF on a Slave node and sync once per second.
    (3) For speed and connection stability, keep the master and slave nodes in the same local area network.
    (4) For Master-Slave replication, a chain structure is better than a graph structure, e.g., Master <- Slave1 <- Slave2 <- Slave3…

References

Java junior/senior interview
Redis interview

Resources Library for Programmers

Posted on 2018-05-23 | Edited on 2018-06-01

Data Structures & Algorithms Design

Data structures in Python
Design Pattern
Microsoft Onsite Summary
Google Onsite Summary
Facebook interview coding

Languages

Java summary
Comprehensive Java interview guide
Java interview key points
The Java engineer's road to mastery
Java Spring framework components

Python summary

C++ summary

SQL summary

Big Data

Big data technologies

Multithreading in depth
32 core big-data algorithms

Message Queue

Kafka summary
Kafka interview

RabbitMQ summary
RabbitMQ practice

Computing framework

Spark summary

Storage framework

NoSQL overview
NoSQL overview (Chinese)

HDFS summary
Google File System
Google File System-R

MongoDB summary
MongoDB commands
MongoDB overview

Cassandra summary
Dynamo

Redis summary
Redis interview
Redis Official PPT

MySQL summary
MySQL development issues and optimization
MySQL locks, transactions, and concurrency

Search Engine

ELK summary

Cloud Computing

AWS EC2 summary

AWS S3 summary

Docker summary
Docker Started
Docker vs. VM

Optimization

Load balancing
Nginx

Tech Stack

Posted on 2018-05-23 | Edited on 2018-05-30

Coding practices

300+ & 3+

GeeksforGeeks

search solutions

Data structures & Algorithms

  1. Books:
    “Data Structures and Algorithms in Java” (introductory)
    “Introduction to Algorithms” (advanced)

Java Concepts

Books:
Thinking in Java;
Effective Java;
Programmerinterview

Big data

Big data interview blog

Threads & Locks

  1. Difference between threads and processes
  2. Multithreading, locks, semaphores
  3. Resource management
  4. Deadlock and how to prevent it
  5. Blocking queue implementation
  6. Producer-consumer implementation

CareerCup interview question 1
CareerCup interview question 2
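The blocking-queue and producer-consumer items above can be sketched with java.util.concurrent's ArrayBlockingQueue; this is a minimal illustration, not a production design:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Producer-consumer on a bounded BlockingQueue: put() blocks when the queue is
// full and take() blocks when it is empty, so no explicit wait/notify is needed.
class ProducerConsumer {
    static int sumOf(int n) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(2); // small buffer
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) queue.put(i); // blocks if buffer is full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        for (int i = 0; i < n; i++) sum += queue.take();   // blocks until an item arrives
        producer.join();
        return sum;
    }
}
```

Implementing the same behavior by hand with wait/notify (or a Lock with two Conditions) is the classic follow-up question.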

OOD

Implement the Singleton, Factory, and MVC patterns. Design a class: LRU cache, Trie, Iterator, BST, blocking queue.
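One common way to design the LRU class mentioned above is to reuse LinkedHashMap's access-order mode; this is a sketch of that approach, not the only acceptable interview answer:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LinkedHashMap in access-order mode moves each accessed entry to the tail,
// and removeEldestEntry evicts the head (least recently used) at capacity.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true);  // true = access order, required for LRU behavior
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;  // evict once we exceed capacity
    }
}
```

Interviewers often then ask for the from-scratch version: a HashMap plus a hand-rolled doubly linked list, which is exactly what LinkedHashMap does internally.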

System design

Read as many engineering blogs as possible.

Resume

Be extremely familiar with your own projects.

Soft skills

Eager to learn;
quick to learn;
excellent communication skills.

HDFS

Posted on 2018-05-23 | Edited on 2018-06-04

HDFS is short for Hadoop Distributed File System, a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.

HDFS relaxes a few POSIX requirements to enable streaming access to file system data.

HDFS is part of the Apache Hadoop Core project.

Assumptions and goals

  1. Hardware failure
    Hardware failure is the norm rather than the exception. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

  2. Streaming data access
    Applications that run on HDFS need streaming access to their datasets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.

  3. Large data sets

  4. Simple coherency model
    HDFS applications need a write-once-read-many access model for files. A file, once created, written, and closed, need not be changed except for appends and truncates. Appending content to the end of a file is supported, but a file cannot be updated at an arbitrary point.

  5. Moving computation is cheaper than moving data

  6. Portability across heterogeneous hardware and software platforms

NameNodes and DataNodes

The File System Namespace

Data Replication

The persistence of File System Metadata

The Communication Protocols

Robustness

Data Organization

Accessibility

Space Reclamation

References

HDFS

Cassandra

Posted on 2018-05-22 | Edited on 2018-05-23

Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.

Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Main features

  1. Proven
    Cassandra is in use at Constant Contact, CERN, Comcast, eBay and over 1500 more companies
  2. Fault tolerance
    Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.
  3. Performance
    Cassandra consistently outperforms popular NoSQL alternatives in benchmarks and real applications, primarily because of fundamental architectural choices.
  4. Decentralized
    There are no single points of failure. There are no network bottlenecks. Every node in the cluster is identical.
  5. Scalable
    Easy to scale out
  6. Durable
    Cassandra is suitable for applications that can’t afford to lose data, even when an entire data center goes down.
  7. You’re in control
    Choose between synchronous or asynchronous replication for each update. Highly available asynchronous operations are optimized with features like Hinted Handoff and Read Repair.
  8. Elastic
    Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications
  9. Professionally supported

References

Cassandra

Redis

Posted on 2018-05-22

Redis is an open-source, in-memory data structure store, used as a database, cache and message broker.

It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs and geospatial indexes with radius queries.

Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence, and provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster.

We can run atomic operations on these types, like appending to a string; incrementing the value in a hash; pushing an element to a list; computing set intersection, union and difference; or getting the member with highest ranking in a sorted set.

In order to achieve its outstanding performance, Redis works with an in-memory dataset. Depending on your use case, you can persist it either by dumping the dataset to disk every once in a while, or by appending each command to a log.

Redis also supports trivial-to-set-up master-slave asynchronous replication, with very fast non-blocking first synchronization and auto-reconnection with partial resynchronization after a net split.

Other features

  1. Transactions
  2. Pub/Sub
  3. Lua Scripting
  4. Keys with a limited time-to-live
  5. LRU eviction of keys
  6. Automatic failover

References

Redis

MongoDB

Posted on 2018-05-22 | Edited on 2018-05-23

An open-source distributed document database that provides high performance, high availability, and automatic scaling.

Document Database

A record in MongoDB is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents.

Advantages of using documents are:

  1. Documents correspond to native data types in many programming languages.
  2. Embedded documents and arrays reduce need for expensive joins.
  3. Dynamic schema supports fluent polymorphism.

Key Features of MongoDB

High Performance

MongoDB provides high-performance data persistence. In particular:

  1. Support for embedded data models reduces I/O activity on the database system.
  2. Indexes support faster queries and can include keys from embedded documents and arrays.

Rich Query Language

A rich query language supports CRUD as well as data aggregation, text search, and geospatial queries.

High Availability

It provides:

  1. automatic failover
  2. data redundancy

Horizontal Scalability

  1. sharding distributes data across a cluster of machines
  2. Starting in 3.4, MongoDB supports creating zones of data based on the shard key. In a balanced cluster, MongoDB directs reads and writes covered by a zone only to those shards inside the zone. See the Zones manual page for more information.

Support for multiple storage engines

  1. WiredTiger Storage Engine
  2. In-Memory Storage Engine
  3. MMAPv1 Storage Engine

Expressive query language & Secondary indexes

References

MongoDB

AWS-EC2

Posted on 2018-05-22

Amazon EC2, short for Amazon Elastic Compute Cloud, is a web service that provides resizable computing capacity that we use to build and host software systems. Using Amazon EC2 eliminates your need to invest in hardware up front, so you can develop and deploy applications faster. You can use Amazon EC2 to launch as many or as few virtual servers as you need to build a cluster.

Features of EC2

  1. Virtual computing environments, known as instances
  2. Preconfigured templates for your instances, known as Amazon Machine Images
  3. Various configurations of CPU, memory, storage, and networking capacity for your instances, known as instance types
  4. Secure login information for your instances using key pairs
  5. Storage volumes for temporary data, known as instance store volumes
  6. Persistent storage volumes for your data using Amazon Elastic Block Store(Amazon EBS), known as Amazon EBS volumes
  7. Virtual networks you can create that are logically isolated from the rest of the AWS cloud, and that you can optionally connect to your own network, known as Virtual Private Clouds (VPCs)

References

AWS EC2

AWS-S3

Posted on 2018-05-22

Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web.

We can send requests to create buckets, store and retrieve objects, and manage permissions on our resources. The guide also describes access control and the authentication process. Access control defines who can access objects and buckets within S3, and the type of access.

Advantages to Amazon S3

  1. Create buckets: create and name a bucket that stores data. Buckets are the fundamental containers in S3 for data storage.
  2. Store data in buckets: store a virtually unlimited amount of data in a bucket. Upload as many objects as you like into an S3 bucket. Each object can contain up to 5 TB of data. Each object is stored and retrieved using a unique developer-assigned key.
  3. Download data
  4. Permissions
  5. Standard interfaces: Use standards-based REST and SOAP interfaces designed to work with any Internet-development toolkit.

Concepts of S3

Buckets

A bucket is a container for objects stored in S3. Every object is contained in a bucket. For example, if an object named photos/puppy.jpg is stored in the johnsmith bucket, then it is addressable using the URL: http://johnsmith.s3.amazonaws.com/photos/puppy.jpg

Buckets serve several purposes:

  1. They organize the S3 namespace at the highest level
  2. They identify the account responsible for storage and data transfer charges
  3. They play a role in access control
  4. They serve as the unit of aggregation for usage reporting

Objects

Objects are the fundamental entities stored in S3. Objects consist of object data and metadata. Metadata is a set of name-value pairs that describe the object. These include some default metadata, such as the date last modified, and standard HTTP metadata, such as Content-Type.

Keys

A key is the unique identifier for an object within a bucket. Each object in a bucket has exactly one key. The combination of a bucket, key, and version ID uniquely identifies each object, so S3 can be thought of as a basic data map between “bucket + key + version” and the object itself. Every object in Amazon S3 can be uniquely addressed through the combination of the web service endpoint, bucket name, key, and, optionally, a version. For example, in the URL http://doc.s3.amazonaws.com/2006-03-01/AmazonS3.wsdl, “doc” is the name of the bucket and “2006-03-01/AmazonS3.wsdl” is the key.

Regions

We can choose the geographical region where S3 will store the buckets we create.

Data consistency model of S3

S3 provides read-after-write consistency for PUTs of new objects, and eventual consistency for overwrite PUTs and DELETEs.

References

AWS S3

