Terminology explanation

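L1 distance is the Manhattan distance: the sum of the absolute coordinate differences.
L2 distance is the Euclidean distance: the square root of the sum of the squared coordinate differences.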

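L1 norm (the penalty used in Lasso regression) is the sum of the absolute values of a vector's components.
L2 norm (its square is the penalty used in Ridge regression) is the square root of the sum of the squared components.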
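
To make the link between norms and distances concrete, here is a minimal sketch using NumPy. The function names are made up for this note; the only point is that a distance between two points is the norm of their difference.

```python
import numpy as np

def l1_norm(v):
    # Sum of absolute values of the components.
    return float(np.sum(np.abs(np.asarray(v, dtype=float))))

def l2_norm(v):
    # Square root of the sum of squared components.
    return float(np.sqrt(np.sum(np.square(np.asarray(v, dtype=float)))))

def l1_distance(a, b):
    # Manhattan distance = L1 norm of the difference vector.
    return l1_norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))

def l2_distance(a, b):
    # Euclidean distance = L2 norm of the difference vector.
    return l2_norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))

a = [0.0, 0.0]
b = [3.0, 4.0]
print(l1_distance(a, b))  # 7.0
print(l2_distance(a, b))  # 5.0
```

NumPy's built-in np.linalg.norm(v, ord=1) and np.linalg.norm(v, ord=2) compute the same norms for a 1-D vector.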

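Activation function: squashes a number into a range (e.g. sigmoid into (0, 1), tanh into (-1, 1)). It should be smooth (differentiable), so that the gradient can point to the best direction for the update.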
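
A small sketch of three common activation functions, assuming NumPy. The choice of sigmoid, tanh and ReLU is an illustration added here and not something the note prescribes. ReLU is included as a contrast: it is unbounded above and not smooth at 0, yet it is widely used in practice.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into (0, 1).
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes any real number into (-1, 1); zero-centered.
    return np.tanh(np.asarray(x, dtype=float))

def relu(x):
    # Keeps positive values, sets negative values to 0.
    return np.maximum(0.0, np.asarray(x, dtype=float))

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))
print(tanh(x))
print(relu(x))
```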

Batch normalization: usually done before the activation function. It normalizes each feature over the mini-batch so that it is zero-centered with unit variance.

It helps:
to prevent vanishing gradients;
to allow a higher learning rate;
to reduce dependence on initialization.
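
A minimal sketch of the batch normalization forward pass in training mode, assuming NumPy and a 2-D input of shape (batch, features). The names batch_norm, gamma, beta and eps are chosen for this example; gamma and beta stand for the learnable scale and shift, and the running statistics used at inference time are left out.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Per-feature statistics computed over the batch dimension.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    # Normalize to roughly zero mean and unit variance.
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learnable scale and shift.
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
# Inputs far from zero-centered: mean 10, standard deviation 3.
x = rng.normal(loc=10.0, scale=3.0, size=(64, 4))

out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))  # close to 0 for every feature
print(out.std(axis=0))   # close to 1 for every feature

# The activation function is applied after the normalization.
activated = np.maximum(0.0, out)
```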

Regularization:

  1. Add a penalty term to the loss
    L1 Regularization;
    L2 Regularization;
    Elastic net (L1 + L2)

  2. Dropout
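
Both kinds of regularization listed above can be sketched in a few lines of NumPy. Everything named here (lam, l1_ratio, the function names) is an assumption made for illustration. In particular, elastic net is parametrized in several different ways in practice, and the dropout shown is the "inverted" variant that rescales the kept units during training.

```python
import numpy as np

def l1_penalty(w):
    # Sum of absolute weights (Lasso-style penalty).
    return float(np.sum(np.abs(w)))

def l2_penalty(w):
    # Sum of squared weights (Ridge-style penalty).
    return float(np.sum(np.square(w)))

def elastic_net_penalty(w, l1_ratio=0.5):
    # Weighted mix of the L1 and L2 penalties (one possible parametrization).
    return l1_ratio * l1_penalty(w) + (1.0 - l1_ratio) * l2_penalty(w)

def regularized_loss(data_loss, w, lam=0.01, l1_ratio=0.5):
    # Total loss = data loss + lam * penalty on the weights.
    return data_loss + lam * elastic_net_penalty(w, l1_ratio)

def dropout(x, p, rng, training=True):
    # Inverted dropout: zero each unit with probability p during training
    # and rescale the survivors by 1 / (1 - p); identity at test time.
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

w = np.array([1.0, -2.0, 3.0])
print(l1_penalty(w))  # 6.0
print(l2_penalty(w))  # 14.0
print(regularized_loss(1.0, w, lam=0.1, l1_ratio=0.5))

rng = np.random.default_rng(0)
h = np.ones((2, 5))
print(dropout(h, p=0.5, rng=rng))  # entries are either 0.0 or 2.0
```

Setting l1_ratio to 1.0 gives pure L1 regularization and 0.0 gives pure L2 regularization.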