CS231N-Optimization

Score function -> Loss function -> Optimization.

Recall that the SVM loss was formulated as:

L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \left[ \max\left(0,\; f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta\right) \right] + \lambda R(W)

A setting of W whose predictions are consistent with the ground-truth labels achieves a very low loss L. Optimization is the process of finding the set of parameters W that minimizes the loss function.
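As a minimal NumPy sketch of evaluating this loss for a linear score function f(x; W) = Wx (the name svm_loss, the arrays X and y, the margin of 1, and the L2 regularizer are all assumptions for illustration, not from the notes):

```python
import numpy as np

def svm_loss(W, X, y, reg=1e-3):
    """Multiclass SVM loss. W: (C, D) weights, X: (N, D) data, y: (N,) integer labels."""
    N = X.shape[0]
    scores = X.dot(W.T)                          # class scores f(x_i; W), shape (N, C)
    correct = scores[np.arange(N), y][:, None]   # score of the correct class for each example
    margins = np.maximum(0, scores - correct + 1.0)
    margins[np.arange(N), y] = 0                 # the j == y_i term is not counted
    data_loss = margins.sum() / N
    reg_loss = reg * np.sum(W * W)               # lambda * R(W) with R(W) = sum of squared weights
    return data_loss + reg_loss
```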

Visualizing the loss function

For high-dimensional spaces, the weight matrix of a linear classifier is difficult to visualize. However, we can gain some intuition by slicing through the high-dimensional space along rays (1 dimension) or along planes (2 dimensions).
[Figure: loss landscape slices]
Left: one-dimensional loss obtained by varying a coefficient a along a random direction. Middle and Right: two-dimensional loss slices; blue = low loss, red = high loss. Notice the piecewise-linear structure of the loss.
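The one-dimensional slice can be reproduced with a sketch like the following, which evaluates the loss along a random direction W1 starting from a random W (it reuses the illustrative svm_loss above; the shapes and the range of a are arbitrary):

```python
import numpy as np

# assumes X, y and the illustrative svm_loss from above are available
W = np.random.randn(10, 3073) * 0.001     # a random starting point in weight space
W1 = np.random.randn(*W.shape)            # a random direction
a_values = np.linspace(-1.0, 1.0, 50)
losses = [svm_loss(W + a * W1, X, y) for a in a_values]
# plotting losses against a_values gives a one-dimensional slice like the left panel
```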

For a single example, we have:

L_i = \sum_{j \neq y_i} \left[ \max\left(0,\; w_j^T x_i - w_{y_i}^T x_i + \Delta\right) \right]

and the full data loss is the average of these per-example losses:

L = \frac{1}{N} \sum_{i=1}^{N} L_i
We can visualize this as follows:
[Figure: one-dimensional loss for a single weight]
The x-axis is a single weight and the y-axis is the loss.

The picture above shows an example of a convex function.

Optimization

The goal of optimization is to find W that minimizes the loss function.

Strategy 1: Random search (very bad)

Try out many different random weights and keep track of what works best.
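A sketch of this random search, assuming the illustrative svm_loss and training arrays X_train, y_train:

```python
import numpy as np

best_loss = float('inf')
best_W = None
for i in range(1000):
    W = np.random.randn(10, 3073) * 0.0001          # a random guess for the parameters
    loss = svm_loss(W, X_train, y_train)
    if loss < best_loss:                             # keep track of the best W seen so far
        best_loss, best_W = loss, W
```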

Strategy 2: Random Local Search

Start out with a random W, generate a random perturbation δW to it, and update to W + δW only if the loss there is lower.
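A sketch of this local search, under the same assumptions (svm_loss, X_train, y_train) and with an arbitrary step size:

```python
import numpy as np

W = np.random.randn(10, 3073) * 0.001                     # start from a random W
best_loss = svm_loss(W, X_train, y_train)
step_size = 0.0001
for i in range(1000):
    W_try = W + np.random.randn(*W.shape) * step_size      # random perturbation of W
    loss = svm_loss(W_try, X_train, y_train)
    if loss < best_loss:                                    # accept the step only if the loss improves
        W, best_loss = W_try, loss
```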

This strategy is better, but still wasteful and computationally expensive.

Strategy 3: Following the Gradient

Find a direction in the weight space that will improve our weight vector. In other words, we can compute the best direction along which to change our weight vector, a direction that is mathematically guaranteed to be the direction of steepest descent: the negative gradient of the loss function.

For one-dimensional functions, the mathematical expression for the derivative is:

\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}
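A numerical gradient can be built directly on this definition by using a small finite h and perturbing one weight at a time. The sketch below uses the forward difference from the formula; f is any scalar function of a weight array:

```python
import numpy as np

def eval_numerical_gradient(f, x, h=1e-5):
    """Approximate the gradient of f at x with (f(x + h) - f(x)) / h, one coordinate at a time."""
    fx = f(x)                                   # evaluate the function at the original point
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + h                   # nudge this coordinate by h
        fxh = f(x)
        x[ix] = old_value                       # restore the original value
        grad[ix] = (fxh - fx) / h               # slope along this coordinate
        it.iternext()
    return grad
```

In practice the centered difference (f(x + h) - f(x - h)) / 2h is usually preferred because it is more accurate.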

Gradient check: the analytic gradient is exact and fast to evaluate, but unlike the numerical gradient it is more error prone to implement. In practice it is therefore very common to compute the analytic gradient and compare it to the numerical gradient to check the correctness of your implementation.
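A minimal gradient check might compare the two gradients through their relative error (the threshold mentioned below is a common rule of thumb, not from the notes):

```python
import numpy as np

def max_relative_error(grad_numerical, grad_analytic):
    """Largest element-wise relative error between two gradient arrays."""
    numerator = np.abs(grad_numerical - grad_analytic)
    denominator = np.maximum(np.abs(grad_numerical) + np.abs(grad_analytic), 1e-8)
    return np.max(numerator / denominator)

# a relative error well below 1e-4 usually means the analytic gradient is implemented correctly
```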

SVM loss function for a single datapoint:
L_i = \sum_{j \neq y_i} \left[ \max\left(0,\; w_j^T x_i - w_{y_i}^T x_i + \Delta\right) \right]

We can differentiate the function with respect to the weights. For the row of W corresponding to the correct class y_i:

\nabla_{w_{y_i}} L_i = -\left( \sum_{j \neq y_i} \mathbb{1}\left(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0\right) \right) x_i

and for the other rows j \neq y_i:

\nabla_{w_j} L_i = \mathbb{1}\left(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0\right) x_i

where \mathbb{1}(\cdot) is the indicator function.
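A sketch of these two expressions for one datapoint, assuming W has one row per class and Delta = 1; the function name and shapes are illustrative:

```python
import numpy as np

def svm_loss_and_grad_single(W, x, y, delta=1.0):
    """Loss and gradient for a single example. W: (C, D), x: (D,), y: correct class index."""
    scores = W.dot(x)                            # (C,) class scores
    margins = scores - scores[y] + delta
    dW = np.zeros_like(W)
    loss = 0.0
    for j in range(W.shape[0]):
        if j == y:
            continue
        if margins[j] > 0:                       # indicator(w_j^T x - w_{y_i}^T x + delta > 0)
            loss += margins[j]
            dW[j] += x                           # gradient with respect to row w_j
            dW[y] -= x                           # gradient with respect to row w_{y_i}
    return loss, dW
```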

Gradient descent

Gradient descent repeatedly evaluates the gradient of the loss and then performs a parameter update in the negative gradient direction.

Mini-batch gradient descent: compute the gradient and perform the update over small batches of the training data instead of the entire training set.
Stochastic gradient descent (or on-line gradient descent): the extreme case where the mini-batch contains only a single example.
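A mini-batch gradient descent loop might look like the sketch below, assuming a hypothetical loss_and_grad(W, X_batch, y_batch) helper that returns the loss and its gradient; the batch size and learning rate are illustrative:

```python
import numpy as np

batch_size, learning_rate = 256, 1e-3
W = np.random.randn(10, 3073) * 0.001
for step in range(10000):
    # sample a random mini-batch (batch_size = 1 would recover stochastic gradient descent)
    idx = np.random.choice(X_train.shape[0], batch_size, replace=False)
    loss, grad = loss_and_grad(W, X_train[idx], y_train[idx])
    W -= learning_rate * grad                   # parameter update along the negative gradient
```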

Reference

http://cs231n.github.io/optimization-1/