Extensions
First-order extensions
- backpack.extensions.BatchGrad()
  The individual gradients for each sample in a minibatch. Only meaningful if the individual functions are independent (no batchnorm).
  Stores the output in grad_batch as a [N x ...] tensor, where N is the size of the minibatch and ... is the shape of the gradient. A usage sketch follows this entry.
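A minimal usage sketch, assuming the usual BackPACK workflow of extending the model and loss function and activating the extension inside the backpack context manager; the two-layer network and data shapes are illustrative placeholders:

    import torch
    from backpack import backpack, extend
    from backpack.extensions import BatchGrad

    N = 32  # minibatch size (illustrative)
    model = extend(torch.nn.Sequential(
        torch.nn.Linear(10, 5), torch.nn.ReLU(), torch.nn.Linear(5, 2)))
    lossfunc = extend(torch.nn.CrossEntropyLoss())

    X, y = torch.randn(N, 10), torch.randint(0, 2, (N,))
    loss = lossfunc(model(X), y)

    # Activate the extension for this backward pass.
    with backpack(BatchGrad()):
        loss.backward()

    for name, param in model.named_parameters():
        # grad_batch has shape [N x ...], one gradient per sample;
        # grad is the usual minibatch gradient.
        print(name, param.grad.shape, param.grad_batch.shape)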
- backpack.extensions.BatchL2Grad()
  The L2 norm of the individual gradients in the minibatch. Only meaningful if the individual functions are independent (no batchnorm).
  Stores the output in batch_l2 as a vector of the same size as the minibatch.
- backpack.extensions.SumGradSquared()
  The sum of squared individual gradients, or second moment of the gradient. Only meaningful if the individual functions are independent (no batchnorm).
  Stores the output in sum_grad_squared, with the same dimensions as the gradient.
- backpack.extensions.Variance()
  Estimates the variance of the gradient using the samples in the minibatch. Only meaningful if the individual functions are independent (no batchnorm).
  Stores the output in variance, with the same dimensions as the gradient. A sketch combining the first-order extensions follows this entry.
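A rough sketch activating several first-order extensions in a single backward pass and printing their shapes next to quantities computed by hand from grad_batch; the toy model and the hand-computed comparisons are illustrative (the exact scaling of individual gradients follows the loss reduction), not an API guarantee:

    import torch
    from backpack import backpack, extend
    from backpack.extensions import BatchGrad, BatchL2Grad, SumGradSquared, Variance

    N = 32  # minibatch size (illustrative)
    model = extend(torch.nn.Linear(10, 3))
    lossfunc = extend(torch.nn.CrossEntropyLoss())
    X, y = torch.randn(N, 10), torch.randint(0, 3, (N,))

    # Several extensions can share one backward pass.
    with backpack(BatchGrad(), BatchL2Grad(), SumGradSquared(), Variance()):
        lossfunc(model(X), y).backward()

    for param in model.parameters():
        g = param.grad_batch                         # [N x ...], one gradient per sample
        print(param.batch_l2.shape)                  # [N], one entry per sample
        print(param.sum_grad_squared.shape)          # same shape as the parameter
        print(param.variance.shape)                  # same shape as the parameter
        # Hand-computed counterparts from grad_batch, for comparison:
        print((g ** 2).sum(0).shape)                 # relates to sum_grad_squared
        print(g.flatten(start_dim=1).norm(dim=1).shape)  # relates to batch_l2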
Second-order extensions
- backpack.extensions.KFAC()
  Approximate Kronecker factorization of the Generalized Gauss-Newton/Fisher using Monte-Carlo sampling.
  Stores the output in kfac as a list of Kronecker factors. If there is only one element, the item represents the GGN/Fisher approximation itself. If there are multiple elements, they are arranged such that their Kronecker product represents the Generalized Gauss-Newton/Fisher approximation. The dimension of the factors depends on the layer, but the product of all row dimensions (or column dimensions) yields the dimension of the layer parameter. A sketch of inspecting the factors follows this entry.
  Note
  The literature uses column-stacking as vectorization convention. This is in contrast to the default row-major storage scheme of tensors in torch. Therefore, the order of factors differs from the presentation in the literature.
  Implements the procedures described by
  Optimizing Neural Networks with Kronecker-factored Approximate Curvature by James Martens and Roger Grosse, 2015.
  A Kronecker-factored approximate Fisher matrix for convolution layers by Roger Grosse and James Martens, 2016.
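A sketch of inspecting the Kronecker factors stored in kfac; the small network is a placeholder, and the torch.kron dimension check simply mirrors the statement above that the product of the factors' row dimensions equals the number of parameters of the layer:

    import torch
    from backpack import backpack, extend
    from backpack.extensions import KFAC

    N = 32  # minibatch size (illustrative)
    model = extend(torch.nn.Sequential(
        torch.nn.Linear(10, 5), torch.nn.Sigmoid(), torch.nn.Linear(5, 3)))
    lossfunc = extend(torch.nn.CrossEntropyLoss())
    X, y = torch.randn(N, 10), torch.randint(0, 3, (N,))

    with backpack(KFAC()):
        lossfunc(model(X), y).backward()

    for name, param in model.named_parameters():
        factors = param.kfac  # list of Kronecker factors
        print(name, [f.shape for f in factors])
        if len(factors) == 2:
            # The Kronecker product of the factors approximates the layer's
            # GGN/Fisher block, so its side length equals the parameter count.
            approx = torch.kron(factors[0], factors[1])
            assert approx.shape[0] == param.numel()

KFLR and KFRA below are used the same way; only the extension class and the attribute name (kflr, kfra) change.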
- backpack.extensions.KFLR()
  Approximate Kronecker factorization of the Generalized Gauss-Newton/Fisher using the full Hessian of the loss function w.r.t. the model output.
  Stores the output in kflr as a list of Kronecker factors. If there is only one element, the item represents the GGN/Fisher approximation itself. If there are multiple elements, they are arranged such that their Kronecker product represents the Generalized Gauss-Newton/Fisher approximation. The dimension of the factors depends on the layer, but the product of all row dimensions (or column dimensions) yields the dimension of the layer parameter.
  Note
  The literature uses column-stacking as vectorization convention. This is in contrast to the default row-major storage scheme of tensors in torch. Therefore, the order of factors differs from the presentation in the literature.
  Implements the procedures described by
  Practical Gauss-Newton Optimisation for Deep Learning by Aleksandar Botev, Hippolyt Ritter and David Barber, 2017.
  Extended for convolutions following
  A Kronecker-factored approximate Fisher matrix for convolution layers by Roger Grosse and James Martens, 2016.
- backpack.extensions.KFRA()
  Approximate Kronecker factorization of the Generalized Gauss-Newton/Fisher using the full Hessian of the loss function w.r.t. the model output and averaging after every backpropagation step.
  Stores the output in kfra as a list of Kronecker factors. If there is only one element, the item represents the GGN/Fisher approximation itself. If there are multiple elements, they are arranged such that their Kronecker product represents the Generalized Gauss-Newton/Fisher approximation. The dimension of the factors depends on the layer, but the product of all row dimensions (or column dimensions) yields the dimension of the layer parameter.
  Note
  The literature uses column-stacking as vectorization convention. This is in contrast to the default row-major storage scheme of tensors in torch. Therefore, the order of factors differs from the presentation in the literature.
  Implements the procedures described by
  Practical Gauss-Newton Optimisation for Deep Learning by Aleksandar Botev, Hippolyt Ritter and David Barber, 2017.
  Extended for convolutions following
  A Kronecker-factored approximate Fisher matrix for convolution layers by Roger Grosse and James Martens, 2016.
- backpack.extensions.DiagGGNMC()
  Diagonal of the Generalized Gauss-Newton/Fisher. Uses a Monte-Carlo approximation of the Hessian of the loss w.r.t. the model output.
  Stores the output in diag_ggn_mc, with the same dimensions as the gradient.
  For a more precise but slower alternative, see backpack.extensions.DiagGGNExact().
- backpack.extensions.DiagGGNExact()
  Diagonal of the Generalized Gauss-Newton/Fisher. Uses the exact Hessian of the loss w.r.t. the model output.
  Stores the output in diag_ggn_exact, with the same dimensions as the gradient.
  For a faster but less precise alternative, see backpack.extensions.DiagGGNMC(). A sketch comparing the two follows this entry.
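A sketch comparing the sampled and exact GGN diagonals on the same minibatch; the toy network is a placeholder, and since the Monte-Carlo estimate is random the two will generally not match exactly:

    import torch
    from backpack import backpack, extend
    from backpack.extensions import DiagGGNExact, DiagGGNMC

    N = 32  # minibatch size (illustrative)
    model = extend(torch.nn.Sequential(
        torch.nn.Linear(10, 5), torch.nn.ReLU(), torch.nn.Linear(5, 3)))
    lossfunc = extend(torch.nn.CrossEntropyLoss())
    X, y = torch.randn(N, 10), torch.randint(0, 3, (N,))

    # Both diagonals can be requested in the same backward pass.
    with backpack(DiagGGNMC(), DiagGGNExact()):
        lossfunc(model(X), y).backward()

    for name, param in model.named_parameters():
        mc, exact = param.diag_ggn_mc, param.diag_ggn_exact
        print(name, mc.shape, exact.shape)  # both match the parameter shape
        # Relative deviation of the sampled estimate from the exact diagonal:
        print((mc - exact).norm() / exact.norm())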
- backpack.extensions.DiagHessian()
  Diagonal of the Hessian.
  Stores the output in diag_h, with the same dimensions as the gradient.
  Warning
  Very expensive on networks with non-piecewise-linear activations.
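A sketch of computing the Hessian diagonal; in line with the warning, the example keeps the model tiny since the Tanh activations make this extension expensive. Model, loss, and data shapes are illustrative placeholders:

    import torch
    from backpack import backpack, extend
    from backpack.extensions import DiagHessian

    model = extend(torch.nn.Sequential(
        torch.nn.Linear(4, 3), torch.nn.Tanh(), torch.nn.Linear(3, 2)))
    lossfunc = extend(torch.nn.MSELoss())
    X, y = torch.randn(8, 4), torch.randn(8, 2)

    with backpack(DiagHessian()):
        lossfunc(model(X), y).backward()

    for name, param in model.named_parameters():
        # diag_h matches the shape of the parameter and its gradient.
        print(name, param.diag_h.shape)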