Optimization
Small data -> gradient descent (GD)
P171
- Steps of GD
- Remark: the objective function should be convex (otherwise GD may only reach a local minimum)
- “Standard” GD is greedy. Modifications of GD can make it non-greedy (e.g. adding a momentum term; see the sketch below)
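A minimal sketch of the momentum idea (my own illustration, not from the slides; the helper `grad`, step size `eta`, and momentum coefficient `beta` are assumptions):

```python
import numpy as np

# Illustrative sketch of GD with a momentum (velocity) term.
def gd_momentum(grad, w0, eta=0.1, beta=0.9, n_iters=100):
    w, v = w0.copy(), np.zeros_like(w0)
    for _ in range(n_iters):
        v = beta * v + grad(w)  # velocity remembers past gradients
        w = w - eta * v         # step along the smoothed direction
    return w
```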
P173
A comparison between GD and SGD: GD is slow per iteration but needs few iterations; SGD is fast per iteration but needs many (“slow but fast” vs. “fast but slow”).
SGD:
- Choose an initial vector of parameters $w$ and learning rate $\eta$.
- Repeat until an approximate minimum is obtained:
- Randomly shuffle samples in the training set.
- For $i = 1,2,\ldots,n$, do:
- $w = w - \eta \nabla f_i(w)$
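A minimal NumPy sketch of this loop, assuming a least-squares loss $f_i(w) = \tfrac{1}{2}(x_i^\top w - y_i)^2$ (the loss choice and all names are illustrative assumptions):

```python
import numpy as np

def sgd(X, y, eta=0.01, n_epochs=10):
    m, n = X.shape
    w = np.zeros(n)                            # initial parameter vector
    for _ in range(n_epochs):                  # until an approx. minimum
        for i in np.random.permutation(m):     # randomly shuffle samples
            grad_i = (X[i] @ w - y[i]) * X[i]  # gradient of f_i at w
            w -= eta * grad_i                  # w = w - eta * grad(f_i)
    return w
```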
Big data
P175
Distributed system: PRAM + Communication
Dependencies are represented by a DAG: we don’t want the DAG to be deep
!P176!
Depth: the time until the last CPU completes its job, i.e. the longest chain of dependent tasks (depth of the DAG)
Work: the total time to complete all tasks on a single CPU (number of nodes in the DAG)
As $p \to \infty$, $T_p$ does not go to 0; it converges to $T_\infty$, the depth of the algorithm
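A classic example (mine, not from the slides): summing $m$ numbers with a balanced reduction tree has work $T_1 = m - 1$ additions but depth $T_\infty = \lceil \log_2 m \rceil$, so with enough CPUs the sum finishes in logarithmic time even though the total work is linear.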
P177
Brent’s Theorem: $T_1 / p \leq T_p \leq T_1 / p + T_\infty$
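A quick worked example with made-up numbers: if $T_1 = 10^4$ and $T_\infty = 10$, then with $p = 100$ CPUs the theorem gives $100 \leq T_p \leq 110$, i.e. within 10% of the ideal speedup $T_1 / p$.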
Embarrassingly parallel: “embarrassingly” here refers to problems that are “embarrassingly easy to distribute”: there are no dependencies between tasks, so depth = 0.
!P180! + h6
Pay attention to how parallel algorithms are written and to their complexity analysis.
P181
- $m$ -> number of rows (size of dataset), $n$ -> number of feature columns
The work per GD iteration is $mn$, which is very large.
Though GD converges in few iterations, each iteration involves a large amount of work (fast in iteration count, slow per iteration)
- error rate and complexity tradeoff
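To make $mn$ concrete (numbers are my own): with $m = 10^6$ rows and $n = 10^3$ features, a single batch gradient already costs on the order of $10^9$ operations, and that price is paid at every iteration.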
P182
SGD tradeoff:
fewer points per iteration -> less work per iteration
fewer points per iteration -> more iterations needed to converge
GD with all points: slow per iteration (very high work), few iterations
SGD with one point: fast per iteration, more iterations
Using a mini-batch is a middle ground (a few data points used per iteration; see the sketch below)
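A hedged sketch of the mini-batch variant, reusing the least-squares setup from the SGD sketch above (batch size `b` and all names are illustrative):

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.01, b=32, n_epochs=10):
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_epochs):
        idx = np.random.permutation(m)
        for start in range(0, m, b):
            batch = idx[start:start + b]       # a few points per iteration
            resid = X[batch] @ w - y[batch]
            grad = resid @ X[batch] / len(batch)
            w -= eta * grad                    # averaged mini-batch step
    return w
```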
P185
Locks are not needed for parallelized SGD.
No need to worry about collisions: the chance of a collision is low and its impact is small.
P186
- Hogwild! requires the cost function to be sparse (each update touches only a few coordinates of $w$)
- no locks needed -> nearly linear speedup (see the sketch below)
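A toy lock-free sketch in the spirit of Hogwild! (my illustration, not the paper’s code; note that CPython’s GIL serializes these threads, so this only conveys the idea of unsynchronized updates to a shared $w$):

```python
import numpy as np
from threading import Thread

def hogwild_sgd(X, y, eta=0.01, n_epochs=5, n_threads=4):
    m, n = X.shape
    w = np.zeros(n)                      # shared parameters, no lock at all

    def worker(indices):
        for i in indices:
            xi = X[i]
            nz = np.flatnonzero(xi)      # sparsity: few coordinates touched
            g = (xi[nz] @ w[nz] - y[i]) * xi[nz]
            w[nz] -= eta * g             # unsynchronized in-place update

    for _ in range(n_epochs):
        chunks = np.array_split(np.random.permutation(m), n_threads)
        threads = [Thread(target=worker, args=(c,)) for c in chunks]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return w
```

With sparse rows, two threads rarely write the same coordinate of $w$ at the same time, which is why dropping the locks costs little accuracy.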
P187
Batch GD: easy to parallelize, but slow
SGD: hard to parallelize (updates are sequential), but cheap per iteration (low work)
Hogwild! renders SGD almost embarrassingly parallel
Applications
P190
High-level idea for a Spark implementation of batch GD
P191
Batch GD in Spark:
$w$ is sent to the workers many times (once per iteration).
The memory needed for the model is small: on the order of the size of $w$ (see the sketch below).
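A minimal PySpark sketch of the idea (synthetic data and all names are my assumptions; the per-iteration `map`/`reduce` is where $w$ gets shipped to every task again):

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-gd-sketch").getOrCreate()
sc = spark.sparkContext

n = 10
points = sc.parallelize(
    [(np.random.rand(n), np.random.rand()) for _ in range(1000)]
).cache()                                # the data stays on the cluster
m = points.count()

w, eta = np.zeros(n), 0.1
for _ in range(50):
    # The closure captures the current w, so w is re-sent each iteration.
    grad = points.map(lambda p: (p[0] @ w - p[1]) * p[0]) \
                 .reduce(lambda a, b: a + b)
    w = w - eta * grad / m
```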
P192
SGD in Spark:
How can Hogwild! be performed on Spark? (one approximation sketched below)
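Spark offers no shared mutable state across executors, so true Hogwild! does not map onto it directly. One common approximation (my sketch, continuing the setup above, and explicitly a different technique: local SGD per partition followed by model averaging):

```python
def sgd_on_partition(samples, w0, eta):
    w = w0.copy()
    for x, y in samples:
        w -= eta * (x @ w - y) * x       # plain sequential SGD on one shard
    yield w

# Each partition trains independently; the driver averages the models.
models = points.mapPartitions(lambda it: sgd_on_partition(it, w, eta))
w = models.reduce(lambda a, b: a + b) / points.getNumPartitions()
```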