# Difference between revisions of "Functional regularisation for continual learning with gaussian processes"

(→Methods) |
(→Methods) |
||

Line 54: | Line 54: | ||

== Methods == | == Methods == | ||

− | Consider a deep neural network in which the final hidden layer provides the feature vector <math>\phi(x;\theta)\in \mathbb{R}^K</math>, where <math>x</math> is the input data and <math>\theta</math> are the task-shared model parameters. Importantly, let's assume the task boundary is known. That is, we know when the input data is switched to a new task. Taking the NLP model as an example, this is equivalent to assuming we know whether each batch of data belongs to English, French, or German dataset. | + | Consider a deep neural network in which the final hidden layer provides the feature vector <math>\phi(x;\theta)\in \mathbb{R}^K</math>, where <math>x</math> is the input data and <math>\theta</math> are the task-shared model parameters. Importantly, let's assume the task boundary is known. That is, we know when the input data is switched to a new task. Taking the NLP model as an example, this is equivalent to assuming we know whether each batch of data belongs to English, French, or German dataset. This assumption is important because it allows us to know when to update the task-shared parameter <math>\theta</math>. The authors also discussed how to detect task boundaries when they are not given, which will be presented later in this summary. |

For each specific task <math>i</math>, an output layer is constructed as <math>f_i(x;w_i) = w_i^T\phi(x;\theta)</math>, where <math>w_i</math> is the task-specific weight. By assuming that the weight <math>w_i</math> follows a normal distribution <math>w_i\sim \mathcal{N}(0, \sigma_w^2I)</math>, we obtain a distribution over functions: | For each specific task <math>i</math>, an output layer is constructed as <math>f_i(x;w_i) = w_i^T\phi(x;\theta)</math>, where <math>w_i</math> is the task-specific weight. By assuming that the weight <math>w_i</math> follows a normal distribution <math>w_i\sim \mathcal{N}(0, \sigma_w^2I)</math>, we obtain a distribution over functions: |

## Revision as of 19:17, 22 November 2020

## Contents

## Presented by

Meixi Chen

## Introduction

Continual Learning (CL) refers to the problem where different tasks are fed to model sequentially, such as training a natural language processing model on different languages over time. A major challenge in CL is model forgets how to solve earlier tasks. This paper proposed a new framework to regularize Continual Learning (CL) so that it doesn't forget previously learned tasks. This method, referred to as functional regularization for Continual Learning, leverages the Gaussian process to construct an approximate posterior belief over the underlying task-specific function. Then the posterior belief is utilized in optimization as a regularizer to prevent the model from completely deviating from the earlier tasks. The estimation of posterior functions is carried out under the framework of approximate Bayesian inference.

## Previous Work

There are two types of methods that have been widely used in Continual Learning.

### Replay/Rehearsal Methods

This type of methods stores the data or its compressed form from earlier tasks. The stored data is replayed when learning a new task to mitigate forgetting. It can be used for constraining the optimization of new tasks or joint training of both previous and current tasks. However, it has two disadvantages: 1. Deciding which data to store often remains heuristic; 2. Requires a large quantity of stored data to achieve good performance.

### Regularization-based Methods

These methods leverage sequential Bayesian inference by putting a prior distribution over the model parameters in a hope to regularize the learning of new tasks. Two important methods are Elastic Weight Consolidation (EWC) and Variational Continual Learning (VCL), both of which make model parameters adaptive to new tasks while regularizing weights by prior knowledge from the earlier tasks. Nonetheless, with long sequences of tasks this might still result in an increased forgetting of earlier tasks.

## Comparison between the Proposed Method and Previous Methods

### Comparison to regularization-based methods

**Similarity**: It is also based on approximate Bayesian inference by using a prior distribution that regularizes the model updates.

**Difference**: It constrains the neural network on the space of functions rather than weights by making use of *Gaussian processes* (GP).

### Comparison to replay/rehearsal methods

**Similarity**: It also stores data from earlier tasks.

**Difference**: Instead of storing a subset of data, it stores a set of *inducing points*, which can be optimized using criteria from GP literature [2] [3] [4].

## Recap of the Gaussian Process

**Definition**: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution [1].

The Gaussian process is a non-parametric approach as it can be viewed as an infinte-dimensional generalization of multivariate normal distributions. In a very informal sense, it can be thought of as a distribution of continuous functions - this is why we make use of GP to perform optimization in the function space. A Gaussian process over a prediction function [math]f(\boldsymbol{x})[/math] can be completely specified by its mean function and covariance function (or kernel function), \[\text{Gaussian process: } f(\boldsymbol{x}) \sim \mathcal{GP}(m(\boldsymbol{x}),K(\boldsymbol{x},\boldsymbol{x}'))\] Note that in practice the mean function is typically taken to be 0 because we can always write [math]f(\boldsymbol{x})=m(\boldsymbol{x}) + g(\boldsymbol{x})[/math] where [math]g(\boldsymbol{x})[/math] follows a GP with 0 mean. Hence, the GP is characterized by its kernel function.

In fact, we can connect a GP to a multivariate normal (MVN) distribution with 0 mean, which is given by \[\text{Multivariate normal distribution: } \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}).\] When we only observe finitely many [math]\boldsymbol{x}[/math], the function's value at these input points is a multivariate normal distribution with covariance matrix parametrized by the kernel function.

Note: Throughout this summary, [math]\mathcal{GP}[/math] refers the the distribution of functions, and [math]\mathcal{N}[/math] refers to the distribution of finite random variables.

** A One-dimensional Example of the Gaussian Process **

In the figure below, the red dashed line represents the underlying true function [math]f(x)[/math] and the red dots are the observation taken from this function. The blue solid line indicates the predicted function [math]\hat{f}(x)[/math] given the observations, and the blue shaded area corresponds to the uncertainty of the prediction.

## Methods

Consider a deep neural network in which the final hidden layer provides the feature vector [math]\phi(x;\theta)\in \mathbb{R}^K[/math], where [math]x[/math] is the input data and [math]\theta[/math] are the task-shared model parameters. Importantly, let's assume the task boundary is known. That is, we know when the input data is switched to a new task. Taking the NLP model as an example, this is equivalent to assuming we know whether each batch of data belongs to English, French, or German dataset. This assumption is important because it allows us to know when to update the task-shared parameter [math]\theta[/math]. The authors also discussed how to detect task boundaries when they are not given, which will be presented later in this summary.

For each specific task [math]i[/math], an output layer is constructed as [math]f_i(x;w_i) = w_i^T\phi(x;\theta)[/math], where [math]w_i[/math] is the task-specific weight. By assuming that the weight [math]w_i[/math] follows a normal distribution [math]w_i\sim \mathcal{N}(0, \sigma_w^2I)[/math], we obtain a distribution over functions: \[f_i(x) \sim \mathcal{GP}(0, k(x,x')), \] where [math]k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)[/math]. We can express our posterior belief over [math]f_i(x)[/math] instead of [math]w_i[/math]. Namely, we are interested in estimating \[\boldsymbol{f}_i|\text{Data} \sim p(\boldsymbol{f}_i|\boldsymbol{y}_i, X_i),\] which can be approximated by \[\boldsymbol{f}_i|\text{Data} \overset{approx.}{\sim} \mathcal{N}(\boldsymbol{f}_i|\mu_i, \Sigma_i),\] Instead of having fixed model weights [math]w_i[/math], we now have a distribution for it, which depends on the input data. Then we can summarize information acquired from a task by the estimated distribution of the weights, or equivalently, the distribution of the output function that depends on the weights. However, we are facing the computational challenge of storing [math]\mathcal{O}(N_i^2)[/math] parameters and keeping in memory the full set of input vector [math]X_i[/math]. To see this, note that the [math]\Sigma_i[/math] is a [math]N_i \times N_i[/math] matrix. Hence, the authors tackle this problem by using the Sparse Gaussian process with inducing points, which is introduced as follows.

**Inducing Points**: [math]Z_i = \{z_{i,j}\}_{j=1}^{M_i}[/math], which can be a subset of [math]X_i[/math] (the [math]i[/math]-th training inputs) or points lying between the training inputs.

**Auxiliary function**: [math]\boldsymbol{u}_i[/math], where [math]u_{i,j} = f(z_{i,j})[/math].

The idea behind the inducing point method is to replace [math]\boldsymbol{f}_i[/math] by the auxiliary function [math]\boldsymbol{u}_i[/math] evaluated at the inducing inputs [math]Z_i[/math]. Intuitively, we are also assuming the inducing inputs [math]Z_i[/math] contain enough information to make inference about the "true" [math]\boldsymbol{f}_i[/math], so we can replace [math]X_i[/math] by [math]Z_i[/math].

Now we can introduce how to learn the first task when no previous knowledge has been acquired.

### Learning the First Task

In learning the first task, the goal is to generate the first posterior belief given this task: [math]p(\boldsymbol{u}_1|\text{Data})[/math]. We learn this distribution by approximating it by a variational distribution: [math]q(\boldsymbol{u}_1)[/math]. That is, [math]p(\boldsymbol{u}_1|\text{Data}) \approx q(\boldsymbol{u}_1)[/math]. We can parametrize [math]q(\boldsymbol{u}_1)[/math] as [math]\mathcal{N}(\boldsymbol{u}_1 | \mu_{u_1}, L_{u_1}L_{u_1}^T)[/math], where [math]L_{u_1}[/math] is the lower triangular Cholesky factor. I.e., [math]\Sigma_{u_1}=L_{u_1}L_{u_1}^T[/math]. Next, we introduce how to estimate [math]q(\boldsymbol{u}_1)[/math], or equivalently, [math]\mu_{u_1}[/math] and [math]L_{u_1}[/math] using variational inference.

Given the first task with data [math](X_1, \boldsymbol{y}_1)[/math], we can use a variational distribution [math]q(\boldsymbol{f}_1, \boldsymbol{u}_1)[/math] that approximates the exact posterior distribution [math]p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1)[/math], where \[q(\boldsymbol{f}_1, \boldsymbol{u}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1)\] \[p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_\theta(\boldsymbol{u}_1|\boldsymbol{y}_1).\] Note that we use [math]p_\theta(\cdot)[/math] to denote the Gaussian distribution form with kernel parametrized by a common [math]\theta[/math].

Hence, we can jointly learn [math]q(\boldsymbol{u}_1)[/math] and [math]\theta[/math] by minimizing the KL divergence \[\text{KL}(p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_{\theta}(\boldsymbol{u}_1|\boldsymbol{y}_1)),\] which is equivalent to maximizing the Evidence Lower Bound (ELBO) \[\mathcal{F}({\theta}, q(\boldsymbol{u}_1)) = \sum_{j=1}^{N_1} \mathbb{E}_{q(f_1, j)}[\log p(y_{1,j}|f_{1,j})]-\text{KL}(q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{u}_1)).\]

### Learning the Subsequent Tasks

After learning the first task, we only keep the inducing points [math]Z_1[/math] and the parameters of [math]q(\boldsymbol{u}_1)[/math], both of which act as a task summary of the first task. Note that [math]\theta[/math] also has been updated based on the first task. In learning the [math]k[/math]-th task, we can use the posterior belief [math]q(\boldsymbol{u}_1), q(\boldsymbol{u}_2), \ldots, q(\boldsymbol{u}_{k-1})[/math] obtained from the preceding tasks to regularize the learning, and produce a new task summary [math](Z_k, q(\boldsymbol{u}_k))[/math]. Similar to the first task, now the objective function to be maximized is \[\mathcal{F}(\theta, q(\boldsymbol{u}_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_k, j)}[\log p(y_{k,j}|f_{k,j})]- \text{KL}(q(\boldsymbol{u}_k) \ || \ p_{\theta}(\boldsymbol{u}_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i)))}_{\text{regularization from previous tasks}}\]

As a result, we regularize the learning of a new task by the sum of KL divergence terms [math]\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))[/math], where each [math]q(\boldsymbol{u}_i)[/math] encodes the knowledge about an earlier task [math]i \lt k[/math]. To make the optimization computationally efficient, we can sub-sample the KL terms in the sum and perform stochastic approximation over the regularization term.

### Alternative Inference for the Current Task

Given this framework of sparse GP inference, the author proposed a further improvement to obtain more accurate estimates of the posterior belief [math]q(\boldsymbol{u}_k)[/math]. That is, performing inference over the current task in the weight space. Due to the trade-off between accuracy and scalability imposed by the number of inducing points, we can use a full Gaussian viariational approximation \[q(w_k) = \mathcal{N}(w_k|\mu_{w_k}, \Sigma_{w_k})\] by letting [math]f_k(x; w_k) = w_k^T \phi(x; \theta)[/math], [math]w_k \sim \mathcal{N}(0, \sigma_w^2 I)[/math]. The objective becomes \[\mathcal{F}(\theta, q(w_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_k, j)}[\log p(y_{k,j}|w_k^T \phi(x_{k,j}; \theta))]- \text{KL}(q(w_k) \ || \ p(w_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i)))}_{\text{regularization from previous tasks}}\]

The figure below is a depiction of the proposed method.

### Selection of the Inducing Points

In practice, a simple but effective way to select inducing points is to select a random set [math]Z_k[/math] of the training inputs [math]X_k[/math]. In this paper, the authors proposed a structured way to select them. The proposed method is an unsupervised criterion that only depends on the training inputs, which quantifies how well the full kernel matrix [math]K_{X_k}[/math] can be constructed from the inducing inputs. This is done by minimizing the trace of the covariance matrix of the prior GP conditional [math]p(\boldsymbol{f}_k|\boldsymbol{u}_k)[/math]: \[\mathcal{T}(Z_k)=\text{tr}(K_{X_k} - K_{X_kZ_K}K_{Z_k}^{-1}K_{Z_kX_k}),\] where [math]K_{X_k} = K(X_k, X_k), K_{X_kZ_K} = K(X_k, Z_k), K_{Z_k} = K(Z_k, Z_k)[/math], and [math]K(\cdot, \cdot)[/math] is the kernel function parametrized by [math]\theta[/math]. This method promotes finding inducing points [math]Z_k[/math] that are spread evenly in the input space. As an example, see the following figure where the final selected inducing points are spread out in different clusters of data. The round dots represents the data points and each color corresponds to a different task.

### Prediction

Given a test data point [math]x_{i,*}[/math], we can obtain the predictive density function of its output [math]y_{i,*}[/math] given by \begin{align*} p(y_{i,*}) &= \int p(y_{i,*}|f_{i,*}) p_\theta(f_{i,*}|\boldsymbol{u}_i)q(\boldsymbol{u}_i) d\boldsymbol{u}_i df_{i,*}\\ &= \int p(y_{i,*}|f_{i,*}) q_\theta(f_{i,*}) df_{i,*}, \end{align*} where [math]q_\theta(f_{i,*})=\mathcal{N}(f_{i,*}| \mu_{i,*}, \sigma_{i,*}^2)[/math] with known mean and variance \begin{align*} \mu_{i,*} &= \mu_{u_i}^TK_{Z_i}^{-1} k_{Z_kx_i,*}\\ \sigma_{i,*}^2 &= k(x_{i,*}, x_{i,*}) + k_{Z_ix_i,*}^T K_{Z_i}^{-1}[L_{u_i}L_{u_i}^T - K_{Z_i}] K_{Z_i}^{-1} k_{Z_ix_i,*} \end{align*} Note that all the terms in [math]\mu_{i,*}[/math] and [math]\sigma_{i,*}^2[/math] are either already estimated or depend on some estimated parameters.

## Detecting Task Boundaries

## Experiments

## Conclusions

## References

[1] Rasmussen, Carl Edward and Williams, Christopher K. I., 2006, Gaussian Processes for Machine Learning, The MIT Press.

[2] Quinonero-Candela, Joaquin and Rasmussen, Carl Edward, 2005, A Unifying View of Sparse Approximate Gaussian Process Regression, Journal of Machine Learning Research, Volume 6, P1939-1959.

[3] Snelson, Edward and Ghahramani, Zoubin, 2006, Sparse Gaussian Processes using Pseudo-inputs, Advances in Neural Information Processing Systems 18, P1257-1264.

[4] Michalis K. Titsias, Variational Learning of Inducing Variables in Sparse Gaussian Processes, 2009, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Volume 5, P567-574.

[5] Michalis K. Titsias, Jonathan Schwarz, Alexander G. de G. Matthews, Razvan Pascanu, Yee Whye Teh, 2020, Functional Regularisation for Continual Learning with Gaussian Processes, ArXiv abs/1901.11356.