Extensions
First-order extensions
- backpack.extensions.BatchGrad()
  The individual gradients for each sample in a minibatch. Only meaningful if the individual functions are independent (no batchnorm).
  Stores the output in grad_batch as a [N x ...] tensor, where N is the size of the minibatch and ... is the shape of the gradient. A usage sketch follows this entry.
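A minimal usage sketch, assuming the usual BackPACK workflow of extending the model and loss function and activating the extension inside the backpack context manager; the two-layer network and data shapes are illustrative placeholders:

    import torch
    from backpack import backpack, extend
    from backpack.extensions import BatchGrad

    N = 32  # minibatch size (illustrative)
    model = extend(torch.nn.Sequential(
        torch.nn.Linear(10, 5), torch.nn.ReLU(), torch.nn.Linear(5, 2)))
    lossfunc = extend(torch.nn.CrossEntropyLoss())

    X, y = torch.randn(N, 10), torch.randint(0, 2, (N,))
    loss = lossfunc(model(X), y)

    # Activate the extension for this backward pass.
    with backpack(BatchGrad()):
        loss.backward()

    for name, param in model.named_parameters():
        # grad_batch has shape [N x ...], one gradient per sample;
        # grad is the usual minibatch gradient.
        print(name, param.grad.shape, param.grad_batch.shape)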
- backpack.extensions.BatchL2Grad()
  The L2 norm of the individual gradients in the minibatch. Only meaningful if the individual functions are independent (no batchnorm).
  Stores the output in batch_l2 as a vector of the same size as the minibatch.
- backpack.extensions.SumGradSquared()
  The sum of squared individual gradients, or second moment of the gradient. Only meaningful if the individual functions are independent (no batchnorm).
  Stores the output in sum_grad_squared, with the same dimensions as the gradient.
- backpack.extensions.Variance()
  Estimates the variance of the gradient using the samples in the minibatch. Only meaningful if the individual functions are independent (no batchnorm).
  Stores the output in variance, with the same dimensions as the gradient. A sketch combining the first-order extensions follows this entry.
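A rough sketch activating several first-order extensions in a single backward pass and printing their shapes next to quantities computed by hand from grad_batch; the toy model and the hand-computed comparisons are illustrative (the exact scaling of individual gradients follows the loss reduction), not an API guarantee:

    import torch
    from backpack import backpack, extend
    from backpack.extensions import BatchGrad, BatchL2Grad, SumGradSquared, Variance

    N = 32  # minibatch size (illustrative)
    model = extend(torch.nn.Linear(10, 3))
    lossfunc = extend(torch.nn.CrossEntropyLoss())
    X, y = torch.randn(N, 10), torch.randint(0, 3, (N,))

    # Several extensions can share one backward pass.
    with backpack(BatchGrad(), BatchL2Grad(), SumGradSquared(), Variance()):
        lossfunc(model(X), y).backward()

    for param in model.parameters():
        g = param.grad_batch                         # [N x ...], one gradient per sample
        print(param.batch_l2.shape)                  # [N], one entry per sample
        print(param.sum_grad_squared.shape)          # same shape as the parameter
        print(param.variance.shape)                  # same shape as the parameter
        # Hand-computed counterparts from grad_batch, for comparison:
        print((g ** 2).sum(0).shape)                 # relates to sum_grad_squared
        print(g.flatten(start_dim=1).norm(dim=1).shape)  # relates to batch_l2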
Second-order extensions
- backpack.extensions.KFAC()
  Approximate Kronecker factorization of the Generalized Gauss-Newton/Fisher using Monte-Carlo sampling.
  Stores the output in kfac as a list of Kronecker factors. If there is only one element, the item represents the GGN/Fisher approximation itself. If there are multiple elements, they are arranged such that their Kronecker product represents the Generalized Gauss-Newton/Fisher approximation. The dimension of the factors depends on the layer, but the product of all row dimensions (or column dimensions) yields the dimension of the layer parameter. A sketch of inspecting the factors follows this entry.
  Note
  The literature uses column-stacking as vectorization convention. This is in contrast to the default row-major storage scheme of tensors in torch. Therefore, the order of factors differs from the presentation in the literature.
  Implements the procedures described by
  Optimizing Neural Networks with Kronecker-factored Approximate Curvature by James Martens and Roger Grosse, 2015.
  A Kronecker-factored approximate Fisher matrix for convolution layers by Roger Grosse and James Martens, 2016.
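A sketch of inspecting the Kronecker factors stored in kfac; the small network is a placeholder, and the torch.kron dimension check simply mirrors the statement above that the product of the factors' row dimensions equals the number of parameters of the layer:

    import torch
    from backpack import backpack, extend
    from backpack.extensions import KFAC

    N = 32  # minibatch size (illustrative)
    model = extend(torch.nn.Sequential(
        torch.nn.Linear(10, 5), torch.nn.Sigmoid(), torch.nn.Linear(5, 3)))
    lossfunc = extend(torch.nn.CrossEntropyLoss())
    X, y = torch.randn(N, 10), torch.randint(0, 3, (N,))

    with backpack(KFAC()):
        lossfunc(model(X), y).backward()

    for name, param in model.named_parameters():
        factors = param.kfac  # list of Kronecker factors
        print(name, [f.shape for f in factors])
        if len(factors) == 2:
            # The Kronecker product of the factors approximates the layer's
            # GGN/Fisher block, so its side length equals the parameter count.
            approx = torch.kron(factors[0], factors[1])
            assert approx.shape[0] == param.numel()

KFLR and KFRA below are used the same way; only the extension class and the attribute name (kflr, kfra) change.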
- backpack.extensions.KFLR()
  Approximate Kronecker factorization of the Generalized Gauss-Newton/Fisher using the full Hessian of the loss function w.r.t. the model output.
  Stores the output in kflr as a list of Kronecker factors. If there is only one element, the item represents the GGN/Fisher approximation itself. If there are multiple elements, they are arranged such that their Kronecker product represents the Generalized Gauss-Newton/Fisher approximation. The dimension of the factors depends on the layer, but the product of all row dimensions (or column dimensions) yields the dimension of the layer parameter.
  Note
  The literature uses column-stacking as vectorization convention. This is in contrast to the default row-major storage scheme of tensors in torch. Therefore, the order of factors differs from the presentation in the literature.
  Implements the procedures described by
  Practical Gauss-Newton Optimisation for Deep Learning by Aleksandar Botev, Hippolyt Ritter and David Barber, 2017.
  Extended for convolutions following
  A Kronecker-factored approximate Fisher matrix for convolution layers by Roger Grosse and James Martens, 2016.
- backpack.extensions.KFRA()
  Approximate Kronecker factorization of the Generalized Gauss-Newton/Fisher using the full Hessian of the loss function w.r.t. the model output and averaging after every backpropagation step.
  Stores the output in kfra as a list of Kronecker factors. If there is only one element, the item represents the GGN/Fisher approximation itself. If there are multiple elements, they are arranged such that their Kronecker product represents the Generalized Gauss-Newton/Fisher approximation. The dimension of the factors depends on the layer, but the product of all row dimensions (or column dimensions) yields the dimension of the layer parameter.
  Note
  The literature uses column-stacking as vectorization convention. This is in contrast to the default row-major storage scheme of tensors in torch. Therefore, the order of factors differs from the presentation in the literature.
  Implements the procedures described by
  Practical Gauss-Newton Optimisation for Deep Learning by Aleksandar Botev, Hippolyt Ritter and David Barber, 2017.
  Extended for convolutions following
  A Kronecker-factored approximate Fisher matrix for convolution layers by Roger Grosse and James Martens, 2016.
- backpack.extensions.DiagGGNMC()
  Diagonal of the Generalized Gauss-Newton/Fisher. Uses a Monte-Carlo approximation of the Hessian of the loss w.r.t. the model output.
  Stores the output in diag_ggn_mc, with the same dimensions as the gradient.
  For a more precise but slower alternative, see backpack.extensions.DiagGGNExact().
- backpack.extensions.DiagGGNExact()
  Diagonal of the Generalized Gauss-Newton/Fisher. Uses the exact Hessian of the loss w.r.t. the model output.
  Stores the output in diag_ggn_exact, with the same dimensions as the gradient.
  For a faster but less precise alternative, see backpack.extensions.DiagGGNMC(). A sketch comparing the two follows this entry.
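A sketch comparing the sampled and exact GGN diagonals on the same minibatch; the toy network is a placeholder, and since the Monte-Carlo estimate is random the two will generally not match exactly:

    import torch
    from backpack import backpack, extend
    from backpack.extensions import DiagGGNExact, DiagGGNMC

    N = 32  # minibatch size (illustrative)
    model = extend(torch.nn.Sequential(
        torch.nn.Linear(10, 5), torch.nn.ReLU(), torch.nn.Linear(5, 3)))
    lossfunc = extend(torch.nn.CrossEntropyLoss())
    X, y = torch.randn(N, 10), torch.randint(0, 3, (N,))

    # Both diagonals can be requested in the same backward pass.
    with backpack(DiagGGNMC(), DiagGGNExact()):
        lossfunc(model(X), y).backward()

    for name, param in model.named_parameters():
        mc, exact = param.diag_ggn_mc, param.diag_ggn_exact
        print(name, mc.shape, exact.shape)  # both match the parameter shape
        # Relative deviation of the sampled estimate from the exact diagonal:
        print((mc - exact).norm() / exact.norm())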
- backpack.extensions.DiagHessian()
  Diagonal of the Hessian.
  Stores the output in diag_h, with the same dimensions as the gradient.
  Warning
  Very expensive on networks with non-piecewise-linear activations.
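A sketch of computing the Hessian diagonal; in line with the warning, the example keeps the model tiny since the Tanh activations make this extension expensive. Model, loss, and data shapes are illustrative placeholders:

    import torch
    from backpack import backpack, extend
    from backpack.extensions import DiagHessian

    model = extend(torch.nn.Sequential(
        torch.nn.Linear(4, 3), torch.nn.Tanh(), torch.nn.Linear(3, 2)))
    lossfunc = extend(torch.nn.MSELoss())
    X, y = torch.randn(8, 4), torch.randn(8, 2)

    with backpack(DiagHessian()):
        lossfunc(model(X), y).backward()

    for name, param in model.named_parameters():
        # diag_h matches the shape of the parameter and its gradient.
        print(name, param.diag_h.shape)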