Definition
Layer Normalization standardizes the inputs to a layer for each training example. It helps stabilize the learning process and can significantly speed up the training of deep learning models. In essence, for a given neuron in a layer, Layer Normalization computes the mean and variance of all the inputs to that neuron for a single data sample in the batch, and then uses this mean and variance to normalize the input to that neuron.
In practice, Layer Normalization is applied to a layer's outputs, so all the hidden units in a layer share the same normalization terms $\mu$ and $\sigma$, but different training cases have different normalization terms. It also comes with two learnable parameters, $\gamma$ and $\beta$, which respectively rescale and shift the standardized distribution of inputs to give the network more flexibility.
This is in contrast to Batch Normalization, which calculates the mean and variance across all the data samples in a batch for a given layer.
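To make the contrast concrete, here is a minimal NumPy sketch (the array shape and values are purely illustrative, not from any particular network) showing which axis each method computes its statistics over:

```python
import numpy as np

# Illustrative activations: a batch of 2 samples with 4 hidden units each
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])

# Layer Normalization: statistics per sample, across the hidden units (last axis)
ln_mean = x.mean(axis=-1, keepdims=True)  # shape (2, 1): one mean per sample
ln_var  = x.var(axis=-1, keepdims=True)   # shape (2, 1): one variance per sample

# Batch Normalization: statistics per hidden unit, across the batch (first axis)
bn_mean = x.mean(axis=0, keepdims=True)   # shape (1, 4): one mean per unit
bn_var  = x.var(axis=0, keepdims=True)    # shape (1, 4): one variance per unit

print(ln_mean.ravel())  # [2.5 5. ]        -> depends only on each individual sample
print(bn_mean.ravel())  # [1.5 3.  4.5 6.] -> depends on the whole batch
```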
Formula
Let's consider a layer output $x$ of shape (batch_size, `input_sample_shape`), where `input_sample_shape` can have an arbitrary number of dimensions (a 3-dimensional image, for example). Then the Layer Normalization of the outputs of this layer is:

$$\text{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
Let's break this formula down step by step:
- $\mu$ is the mean of $x$ over the `input_sample_shape` dimensions
- $\sigma$ is the standard deviation of $x$ over the `input_sample_shape` dimensions
- $\epsilon$ is a small constant added to avoid division by zero
- $\gamma$ and $\beta$ are respectively a learnable scale factor and a learnable bias of dimension `input_sample_shape`. These learnable parameters are optional, and their usefulness is still debated.
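As a sanity check of the formula above, here is a small NumPy sketch (the function name, default $\epsilon$ value, and the dummy inputs are assumptions chosen for illustration, not a reference implementation):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer-normalize each sample of x over its feature dimensions.

    x     : array of shape (batch_size, *input_sample_shape)
    gamma : learnable scale of shape input_sample_shape
    beta  : learnable bias of shape input_sample_shape
    """
    feature_axes = tuple(range(1, x.ndim))         # all axes except the batch axis
    mu = x.mean(axis=feature_axes, keepdims=True)  # per-sample mean
    var = x.var(axis=feature_axes, keepdims=True)  # per-sample variance
    x_hat = (x - mu) / np.sqrt(var + eps)          # standardize
    return gamma * x_hat + beta                    # rescale and shift

# Usage on a dummy batch of 2 samples with 4 features each
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=1))  # ~0 for each sample
print(out.std(axis=1))   # ~1 for each sample
```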
Example
Let's consider two connected hidden layers inside a neural network. The nature of the two layers can be diverse (Feed-Forward, Activation function, Convolutional, …). For a given sample, the first layer outputs a vector of 4 values $x = (x_1, x_2, x_3, x_4)$. If we introduce a Layer Normalization between the two layers, here is how it would work:
1. Calculate the mean
$$\mu = \frac{1}{4} \sum_{i=1}^{4} x_i$$
2. Calculate the standard deviation
$$\sigma = \sqrt{\frac{1}{4} \sum_{i=1}^{4} (x_i - \mu)^2}$$
3. Standardize the outputs
Now, we normalize each value in the output vector using the formula:
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
Let's take $\epsilon = 0$ for the sake of this example, because we know that $\sigma > 0$. Then:
$$\hat{x}_i = \frac{x_i - \mu}{\sigma}$$
4. Scale and shift the outputs (optional but common)
Given the learned parameters $\gamma$ and $\beta$, the final output of the Layer Normalization is calculated as:
$$y_i = \gamma_i \hat{x}_i + \beta_i$$
The final output of the Layer Normalization is the vector $y = (y_1, y_2, y_3, y_4)$ (a concrete numeric walk-through is sketched below). This vector is then passed to each neuron of the next layer.
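To make these four steps concrete, here is a short NumPy walk-through; the output vector and the $\gamma$, $\beta$ values below are made up for illustration, not the numbers from the original example:

```python
import numpy as np

# Illustrative output of the first layer for one sample (made-up values)
x = np.array([2.0, 4.0, 6.0, 8.0])

# 1. Mean over the 4 values
mu = x.mean()             # 5.0

# 2. Standard deviation over the 4 values
sigma = x.std()           # ~2.236

# 3. Standardize (epsilon taken as 0 here, as in the example above)
x_hat = (x - mu) / sigma  # [-1.342, -0.447,  0.447,  1.342]

# 4. Scale and shift with illustrative values of gamma and beta
gamma, beta = 1.5, 0.5
y = gamma * x_hat + beta  # [-1.512, -0.171,  1.171,  2.512]

print(y)  # this is the vector passed to the next layer
```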
Usage
- When using both Layer Normalization and Dropout in a neural network, the generally recommended order is to apply Layer Normalization first, followed by Dropout.
- The standard architectural block you'll see in many modern neural networks follows this order: `Linear Layer -> Layer Normalization -> Activation Layer`. The main goal of this order is to control the statistics of the signals passed between layers. By normalizing before the activation function, you ensure the inputs land in a "healthy" range for the activation (think about saturation in `tanh` or `sigmoid`, which leads to vanishing gradients). A minimal sketch of such a block is given after this list.
- It's worth noting that the original Transformer paper ("Attention Is All You Need") used a "post-normalization" architecture, where Layer Normalization is applied after the residual connection, which includes the activation function: `LayerNormalization(x + Sublayer(x))`.
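As a loose illustration of these two conventions, here is a minimal PyTorch sketch; the layer sizes, dropout rate, and the `post_norm_residual` helper are arbitrary choices for this example, not taken from any specific architecture:

```python
import torch
import torch.nn as nn

# Standard block order: Linear -> LayerNorm -> Activation -> Dropout.
# Sizes (128 -> 64) and the dropout probability are illustration values.
block = nn.Sequential(
    nn.Linear(128, 64),
    nn.LayerNorm(64),   # normalize before the activation so it sees well-scaled inputs
    nn.ReLU(),
    nn.Dropout(p=0.1),  # dropout applied after normalization, as recommended above
)

x = torch.randn(32, 128)  # dummy batch of 32 samples
print(block(x).shape)     # torch.Size([32, 64])

# Post-normalization as in the original Transformer: LayerNorm(x + Sublayer(x))
def post_norm_residual(x, sublayer, norm):
    return norm(x + sublayer(x))

norm = nn.LayerNorm(128)
sublayer = nn.Linear(128, 128)  # stand-in for an attention or feed-forward sublayer
y = post_norm_residual(x, sublayer, norm)
print(y.shape)            # torch.Size([32, 128])
```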