Definition
Layer Normalization standardizes the inputs to a layer for each training example. It helps stabilize the learning process and can significantly speed up the training of deep learning models. In essence, for a given neuron in a layer, Layer Normalization computes the mean and variance of all the inputs to that neuron for a single data sample in the batch, and then uses this mean and variance to normalize the input to that neuron.
In practice, Layer Normalization is applied to a layer's outputs, so all the hidden units in a layer share the same normalization terms $\mu$ and $\sigma$, but different training cases have different normalization terms. It also comes with two learnable parameters, $\gamma$ and $\beta$, which respectively rescale and shift the standardized distribution of inputs to give the network more flexibility.
This is in contrast to Batch Normalization, which calculates the mean and variance across all the data samples in a batch for a given layer.
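To make the contrast concrete, here is a minimal NumPy sketch (the array shape and values are purely illustrative, not from any particular network) showing which axis each method computes its statistics over:

```python
import numpy as np

# Illustrative activations: a batch of 2 samples with 4 hidden units each
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])

# Layer Normalization: statistics per sample, across the hidden units (last axis)
ln_mean = x.mean(axis=-1, keepdims=True)  # shape (2, 1): one mean per sample
ln_var  = x.var(axis=-1, keepdims=True)   # shape (2, 1): one variance per sample

# Batch Normalization: statistics per hidden unit, across the batch (first axis)
bn_mean = x.mean(axis=0, keepdims=True)   # shape (1, 4): one mean per unit
bn_var  = x.var(axis=0, keepdims=True)    # shape (1, 4): one variance per unit

print(ln_mean.ravel())  # [2.5 5. ]        -> depends only on each individual sample
print(bn_mean.ravel())  # [1.5 3.  4.5 6.] -> depends on the whole batch
```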
Formula
Let's consider a layer output $x$ of shape (batch_size, `input_sample_shape`), where `input_sample_shape` can have an arbitrary number of dimensions (a 3-dimensional image, for example). Then the Layer Normalization of the outputs of this layer is:

$$\text{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
Let's break this formula down step by step:
- $\mu$ is the mean of $x$ over the `input_sample_shape` dimensions
- $\sigma$ is the standard deviation of $x$ over the `input_sample_shape` dimensions
- $\epsilon$ is a small constant added to avoid division by zero
- $\gamma$ and $\beta$ are respectively a learnable scale factor and a learnable bias of dimension `input_sample_shape`. These learnable parameters are optional, and their usefulness is still debated.
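As a sanity check of the formula above, here is a small NumPy sketch (the function name, default $\epsilon$ value, and the dummy inputs are assumptions chosen for illustration, not a reference implementation):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer-normalize each sample of x over its feature dimensions.

    x     : array of shape (batch_size, *input_sample_shape)
    gamma : learnable scale of shape input_sample_shape
    beta  : learnable bias of shape input_sample_shape
    """
    feature_axes = tuple(range(1, x.ndim))         # all axes except the batch axis
    mu = x.mean(axis=feature_axes, keepdims=True)  # per-sample mean
    var = x.var(axis=feature_axes, keepdims=True)  # per-sample variance
    x_hat = (x - mu) / np.sqrt(var + eps)          # standardize
    return gamma * x_hat + beta                    # rescale and shift

# Usage on a dummy batch of 2 samples with 4 features each
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=1))  # ~0 for each sample
print(out.std(axis=1))   # ~1 for each sample
```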
Example
Let's consider two connected hidden layers inside a neural network. The nature of the two layers can be diverse (Feed-Forward, Activation function, Convolutional, …). For a given sample, the first layer outputs a vector of 4 values $x = (x_1, x_2, x_3, x_4)$. If we introduce a Layer Normalization between the two layers, here is how it would work:
1. Calculate the mean
$$\mu = \frac{1}{4} \sum_{i=1}^{4} x_i$$
2. Calculate the standard deviation
$$\sigma = \sqrt{\frac{1}{4} \sum_{i=1}^{4} (x_i - \mu)^2}$$
3. Standardize the outputs
Now, we normalize each value in the output vector using the formula:
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
Let's take $\epsilon = 0$ for the sake of this example, because we know that $\sigma > 0$. Then:
$$\hat{x}_i = \frac{x_i - \mu}{\sigma}$$
4. Scale and shift the outputs (optional but common)
Given the learned parameters $\gamma$ and $\beta$, the final output of the Layer Normalization is calculated as:
$$y_i = \gamma_i \hat{x}_i + \beta_i$$
The final output of the Layer Normalization is the vector $y = (y_1, y_2, y_3, y_4)$ (a concrete numeric walk-through is sketched below). This vector is then passed to each neuron of the next layer.
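To make these four steps concrete, here is a short NumPy walk-through; the output vector and the $\gamma$, $\beta$ values below are made up for illustration, not the numbers from the original example:

```python
import numpy as np

# Illustrative output of the first layer for one sample (made-up values)
x = np.array([2.0, 4.0, 6.0, 8.0])

# 1. Mean over the 4 values
mu = x.mean()             # 5.0

# 2. Standard deviation over the 4 values
sigma = x.std()           # ~2.236

# 3. Standardize (epsilon taken as 0 here, as in the example above)
x_hat = (x - mu) / sigma  # [-1.342, -0.447,  0.447,  1.342]

# 4. Scale and shift with illustrative values of gamma and beta
gamma, beta = 1.5, 0.5
y = gamma * x_hat + beta  # [-1.512, -0.171,  1.171,  2.512]

print(y)  # this is the vector passed to the next layer
```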
Usage
- When using both Layer Normalization and Dropout in a neural network, the generally recommended order is to apply Layer Normalization first, followed by Dropout.
- The standard architectural block you'll see in many modern neural networks follows this order: `Linear Layer -> Layer Normalization -> Activation Layer`. The main goal of this order is to control the statistics of the signals passed between layers. By normalizing before the activation function, you ensure the inputs land in a "healthy" range for the activation (think about saturation in `tanh` or `sigmoid`, which leads to vanishing gradients). A minimal sketch of such a block is given after this list.
- It's worth noting that the original Transformer paper ("Attention Is All You Need") used a "post-normalization" architecture, where Layer Normalization is applied after the residual connection, which includes the activation function: `LayerNormalization(x + Sublayer(x))`.
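As a loose illustration of these two conventions, here is a minimal PyTorch sketch; the layer sizes, dropout rate, and the `post_norm_residual` helper are arbitrary choices for this example, not taken from any specific architecture:

```python
import torch
import torch.nn as nn

# Standard block order: Linear -> LayerNorm -> Activation -> Dropout.
# Sizes (128 -> 64) and the dropout probability are illustration values.
block = nn.Sequential(
    nn.Linear(128, 64),
    nn.LayerNorm(64),   # normalize before the activation so it sees well-scaled inputs
    nn.ReLU(),
    nn.Dropout(p=0.1),  # dropout applied after normalization, as recommended above
)

x = torch.randn(32, 128)  # dummy batch of 32 samples
print(block(x).shape)     # torch.Size([32, 64])

# Post-normalization as in the original Transformer: LayerNorm(x + Sublayer(x))
def post_norm_residual(x, sublayer, norm):
    return norm(x + sublayer(x))

norm = nn.LayerNorm(128)
sublayer = nn.Linear(128, 128)  # stand-in for an attention or feed-forward sublayer
y = post_norm_residual(x, sublayer, norm)
print(y.shape)            # torch.Size([32, 128])
```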