**Introduction**

*Restricted Boltzmann Machines* (RBMs) are unsupervised machine learning models that contributed significantly to the early rise of deep learning in 2006, with the paper *A Fast Learning Algorithm for Deep Belief Networks*. Since then, they have become an important member of the deep learning family. Interestingly, RBMs manifest the ideas of both probabilistic graphical models and neural networks. This post aims to provide an introduction to RBMs. Most of the formulas here are from **[1]**.

**Roadmap**

*Boltzmann Machines* are energy-based models **[2]** in which the joint probability distribution is characterized by a scalar energy assigned to each configuration of the variables. In energy-based models, *inference* consists of clamping the values of a set of variables and finding configurations of the remaining variables that minimize the energy function; *learning* consists of finding an energy function that minimizes the energy of the observed variables. Boltzmann machines are also probabilistic graphical models, using a graph-based representation to factorize the probability distribution.

*Restricted Boltzmann Machines* are a type of Boltzmann machine with a special constraint – only a certain type of connection is allowed.

This post starts by introducing energy-based models, including the graphical representation of the model and learning by gradient ascent on the log-likelihood. It then derives Boltzmann machines by plugging a specific energy function into the energy-based framework. Finally, Restricted Boltzmann Machines are discussed by introducing restrictions on the connections of a Boltzmann machine.

**Notations**

There are three basic notations in this post: \(x\), \( h\) and \( y\). \(x\) represents an input variable that takes the form of a vector \( x=\left \{ x_{1}, x_{2},\dots ,x_{N} \right \}\), where \( x_{i}\) denotes the i-th feature of \(x\). \( h\) represents a hidden variable that also takes the form of a vector \( h=\left \{ h_{1}, h_{2},\dots ,h_{M} \right \}\), where \( h_{i}\) denotes the i-th hidden variable. In more statistical language, a hidden variable is sometimes called a latent variable. \( y\) represents the label of a given input to be predicted.

As an example, for the problem of image recognition, \( x\) are the images of interest, where \( x_{i}\) are the individual pixels of an image \( x\). \( h\) are the hidden features/descriptors that serve as high-level representations of the image. Finally, \( y\) are the labels of the images.

**Energy-Based Models**

Different from predictive/supervised models, such as multilayer perceptrons (feedforward neural networks), *energy-based models* capture the joint probability distribution \( P(\mathbf{x}) \) over the configurations of the input variables \( \left \{ x_{1}, x_{2},\dots ,x_{N} \right \}\), rather than the conditional probability \( P(y|\mathbf{x}) \) used in supervised learning.

Energy-based models associate a scalar energy (a compatibility value) with each configuration of the random variables, and the lower a configuration's energy, the higher its probability under the model. Because energies are uncalibrated (unnormalized and measured in arbitrary units), only relative energy values provide meaningful information about the probability distribution. Thus, the probability distribution of an energy-based model needs to be normalized using the Gibbs measure **[3]**, which takes the form

\( P(x)=\frac{e^{-Energy(x)}}{Z}\)

The normalizing factor \( Z\) is called the *partition function* or *normalization function* defined by

\( Z=\sum_{x}^{ }e^{-Energy(x)}\)

Because the energy is negated in the exponent, this probability distribution favors low energy: configurations with low energy have high probability. *Learning* in energy-based models corresponds to modifying the energy function so as to reshape the energy surface, so that the desired patterns in the data lie at its lowest points and therefore receive the highest probability.

**A Simple Example of Energy-Based Models**

Suppose we have a simple energy-based model with only two variables \( a \) and \( b \) (where \( x=\left \{ a, b \right \}\)), as well as a connection weight \( w\). Here we define the energy function \( Energy(a,b)=-wab\). If \( a \) and \( b \) can only take binary values and \( w\) is 1, the resulting energy function takes the form \( Energy(a,b)=-ab\). The table below shows all possible configurations of \( a \) and \( b \), together with \( Energy(a,b)\) and the probability of each configuration after normalization (here \( Z=3+e\approx 5.718\)). In other words, this energy-based model factorizes the joint probability distribution shown below.

| \( a \) | \( b \) | \( Energy(a,b)\) | \( P(a,b)\) |
|---|---|---|---|
| 0 | 0 | 0 | \( 1/Z\approx 0.175\) |
| 0 | 1 | 0 | \( 1/Z\approx 0.175\) |
| 1 | 0 | 0 | \( 1/Z\approx 0.175\) |
| 1 | 1 | \( -1\) | \( e/Z\approx 0.475\) |
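As a concrete check, the probabilities above can be reproduced in a few lines of Python (a minimal sketch; the variable names are ours, not from the post):

```python
import math

# Enumerate all configurations of the two binary variables a and b
# for the energy function Energy(a, b) = -w*a*b with w = 1.
w = 1.0
configs = [(a, b) for a in (0, 1) for b in (0, 1)]
energies = {(a, b): -w * a * b for a, b in configs}

# The partition function Z normalizes e^{-Energy} into a probability.
Z = sum(math.exp(-e) for e in energies.values())
probs = {c: math.exp(-energies[c]) / Z for c in configs}

for (a, b), p in probs.items():
    print(f"a={a} b={b} Energy={energies[(a, b)]:+.0f} P={p:.3f}")
```

Only the configuration \( (1,1)\) has negative energy, so it receives roughly half the probability mass; the other three share the rest equally.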

We can visualize the probability distribution using the graph below where the size of each circle represents the probability of each configuration.

**Introducing Hidden Variables**

When modelling a set of input features, we can use the input attributes as variables and learn the relationships between them. However, directly modelling such variables gives limited expressive power; hence, we introduce *hidden variables* (unobserved variables) to better model the joint probability distribution. The hidden variables can be used to capture high-level feature representations of the observed variables. The probability distribution and partition function change accordingly:

\( P(x,h)=\frac{e^{-Energy(x,h)}}{Z}\)

\( Z=\sum_{x,h}^{ }e^{-Energy(x,h)}\)

Since we are only interested in the distribution of the visible variables, we model the *marginal distribution* obtained by summing over the hidden variables, which takes the form

\( P(x)=\sum_{h}^{ }\frac{e^{-Energy(x,h)}}{Z}\)

To simplify this formula, we define the *free energy* as the marginal energy of \( x\), obtained by summing over \( h\):

\( FreeEnergy(x)=-\log \sum_{h}^{ }e^{-Energy(x,h)}\)

Therefore we can rewrite the marginal probability distribution as:

\( P(x)=\frac{e^{-FreeEnergy(x)}}{Z},\quad Z=\sum_{x}^{ }e^{-FreeEnergy(x)}\)
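To see that the free-energy form really equals the direct marginal, here is a small numerical check on a toy model with one visible and one hidden binary variable and \( Energy(x,h)=-xh\) (our own toy choice, not from the post):

```python
import math
from itertools import product

# Toy model: one visible and one hidden binary variable with
# Energy(x, h) = -x*h, to check that P(x) = e^{-FreeEnergy(x)} / Z
# matches the marginal computed directly from the joint.

def energy(x, h):
    return -x * h

def free_energy(x):
    # FreeEnergy(x) = -log sum_h e^{-Energy(x, h)}
    return -math.log(sum(math.exp(-energy(x, h)) for h in (0, 1)))

# Marginal via the free energy
Z = sum(math.exp(-free_energy(x)) for x in (0, 1))
p_free = {x: math.exp(-free_energy(x)) / Z for x in (0, 1)}

# Marginal computed directly from the joint distribution
Zj = sum(math.exp(-energy(x, h)) for x, h in product((0, 1), repeat=2))
p_direct = {x: sum(math.exp(-energy(x, h)) for h in (0, 1)) / Zj
            for x in (0, 1)}
```

Both routes give the same \( P(x)\), and the two partition functions coincide.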

**Gradient Learning of Energy-based Models**

We use the log-likelihood gradient to train the model, with \( \theta\) representing the parameters of the model. For a single example \( x\), the gradient is

\( \frac{\partial \log P(x)}{\partial \theta}=-\frac{\partial FreeEnergy(x)}{\partial \theta}+\sum_{\tilde{x}}^{ }P(\tilde{x})\frac{\partial FreeEnergy(\tilde{x})}{\partial \theta}\)

Hence, the average log-likelihood gradient over a training set is:

\( E_{\hat{P}}\left [ \frac{\partial \log P(x)}{\partial \theta} \right ]=-E_{\hat{P}}\left [ \frac{\partial FreeEnergy(x)}{\partial \theta} \right ]+E_{P}\left [ \frac{\partial FreeEnergy(x)}{\partial \theta} \right ]\)

where \( \hat{P}\) is the distribution over the training set, and \( P\) is the distribution under the model's current parameters. The resulting log-likelihood gradient tells us that, as long as we can compute the derivative of the free energy for the training examples and for samples from the model's own distribution, we can train the model tractably.
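This two-term gradient can be verified numerically on the earlier two-variable example, where there are no hidden variables and so the free energy equals the energy (a minimal check; the weight value 0.7 is an arbitrary choice of ours):

```python
import math

# Check the gradient identity on the model Energy(a, b) = -w*a*b.
# With no hidden variables, FreeEnergy = Energy, so
# d log P(x)/dw = -dE(x)/dw + E_P[dE/dw], and here dE/dw = -a*b.

def energy(a, b, w):
    return -w * a * b

def log_p(a, b, w):
    Z = sum(math.exp(-energy(s, t, w)) for s in (0, 1) for t in (0, 1))
    return -energy(a, b, w) - math.log(Z)

w = 0.7
a, b = 1, 1

# Analytic gradient: data term plus model expectation of dE/dw.
Z = sum(math.exp(-energy(s, t, w)) for s in (0, 1) for t in (0, 1))
expect = sum(math.exp(-energy(s, t, w)) / Z * (-s * t)
             for s in (0, 1) for t in (0, 1))
grad_analytic = (a * b) + expect

# Finite-difference gradient of log P(a, b) for comparison.
eps = 1e-6
grad_numeric = (log_p(a, b, w + eps) - log_p(a, b, w - eps)) / (2 * eps)
```

The analytic and finite-difference gradients agree to numerical precision.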

**Boltzmann Machines**

A Boltzmann machine is a type of energy-based model; we obtain one by plugging the energy function below into the framework of the previous section:

\( Energy(x,h)=-b^{\prime}x-c^{\prime}h-h^{\prime}Wx-x^{\prime}Ux-h^{\prime}Vh\)

where \( b\) is the vector of biases for the visible variables, \( c\) is the vector of biases for the hidden variables, \( W\) holds the connection weights between hidden and visible variables, \( U\) holds the connection weights among the visible variables, and \( V\) holds the connection weights among the hidden variables. This energy function corresponds to a fully connected graph with a weight on each connection.

**Gradient Learning of Boltzmann Machines**

The gradient of the log-likelihood can be written as:

\( \frac{\partial \log P(x)}{\partial \theta}=-\sum_{h}^{ }P(h\vert x)\frac{\partial Energy(x,h)}{\partial \theta}+\sum_{\tilde{x},h}^{ }P(\tilde{x},h)\frac{\partial Energy(\tilde{x},h)}{\partial \theta}\)

Since \( \frac{\partial Energy(x,h)}{\partial \theta}\) is easy to compute (for a connection weight it is simply the product of the connected variables), this gradient tells us that, as long as we can sample the hidden variables given the visible variables, and sample the joint distribution of hidden and visible variables under the model's current parameters, we can learn the model. The question then becomes how to compute these conditional and joint probabilities – that is, how to sample from \( P(h\vert x)\) and \( P(x,h)\).

**Gibbs Sampling for Conditional Probability**

Gibbs sampling of the joint distribution of \( N\) variables \( X_{1}\dots X_{N}\) is done through a sequence of \( N\) sampling sub-steps of the form:

\( X_{i}\sim P(X_{i}\vert X_{-i}=x_{-i})\)

where \( X_{-i}\) denotes all the variables other than \( X_{i}\),

which means the sample of one variable comes from its conditional probability given all the other variables. In a Boltzmann machine, we can compute the conditional probability of one node given all the other nodes easily:

\( P(X_{i}=1\vert x_{-i})=\mathrm{sigm}\left ( b_{i}+\sum_{j\neq i}^{ }w_{ij}x_{j} \right )\)

This conditional probability has the same form as the activation function of a neural network. In Gibbs sampling, one sampling step samples all \( N\) variables once. As the number of sampling steps goes to \( \infty\), the sample distribution converges to \( P(X)\).
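The sub-step above can be sketched as follows (a minimal illustration with random symmetric weights of our own choosing; `gibbs_step` is a hypothetical helper, not from the post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(x, W, b, rng):
    """One Gibbs sampling step: resample each variable in turn
    from its conditional given all the others."""
    x = x.copy()
    for i in range(len(x)):
        # With a zero diagonal, W[i] @ x already excludes x_i, so this is
        # P(X_i = 1 | x_{-i}) = sigm(b_i + sum_{j != i} w_ij x_j)
        p = sigmoid(b[i] + W[i] @ x)
        x[i] = float(rng.random() < p)
    return x

rng = np.random.default_rng(0)
N = 4
W = rng.normal(size=(N, N))
W = (W + W.T) / 2            # symmetric connection weights
np.fill_diagonal(W, 0.0)     # no self-connections
b = np.zeros(N)

x = rng.integers(0, 2, size=N).astype(float)
for _ in range(100):         # run the chain for a number of steps
    x = gibbs_step(x, W, b, rng)
```

Each full step visits all \( N\) variables once; running many such steps moves the sample toward the model's distribution.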

**Gibbs Sampling for Boltzmann Machines**

To sample from \( P(h\vert x)\) and \( P(x,h)\) in a Boltzmann machine, we need to run two Gibbs chains. For \( P(h\vert x)\), we *clamp* the visible nodes \( x\) and sample \( h\), giving the conditional probability. For \( P(x,h)\), we let all the variables run free, to sample from the distribution of the model itself. This method of training a Boltzmann machine is computationally expensive, because we need to run two Gibbs chains, and each of them needs a large number of steps to produce a good estimate of the probability. Next, we introduce the Restricted Boltzmann Machine and how it speeds up learning.

**Restricted Boltzmann Machines**

A Restricted Boltzmann machine is a type of Boltzmann machine without interconnections among the hidden nodes or among the visible nodes. The energy function is defined as

\( Energy(x,h)=-b^{\prime}x-c^{\prime}h-h^{\prime}Wx\)

The conditional probability of the hidden variables given the visible variables is:

\( P(h\vert x)=\prod_{i}^{ }P(h_{i}\vert x),\quad P(h_{i}=1\vert x)=\mathrm{sigm}\left ( c_{i}+W_{i}x \right )\)

This conditional probability indicates that the hidden variables are independent of each other given the visible variables, so we can obtain the joint conditional directly as a product. It also means that each hidden node can be seen as an expert, and we are using a product of experts to model the joint distribution. The same form applies to \( P(x\vert h)\).
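This factorized conditional is a single matrix-vector product followed by an element-wise sigmoid (a minimal sketch; the weight and bias values are arbitrary numbers of our own choosing):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical small RBM: 3 visible units, 2 hidden units.
W = np.array([[0.5, -0.2, 0.1],
              [0.3, 0.8, -0.5]])   # shape (hidden, visible)
c = np.array([0.0, -0.1])          # hidden biases

x = np.array([1.0, 0.0, 1.0])

# Because the hidden units are conditionally independent given x,
# P(h_i = 1 | x) factorizes into one sigmoid per hidden unit.
p_h_given_x = sigmoid(c + W @ x)
```

No iteration over nodes is needed, unlike in a general Boltzmann machine; the same one-liner with \( W^{\prime}\) and \( b\) gives \( P(x\vert h)\).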

**Gibbs Sampling for Restricted Boltzmann Machines**

Gibbs sampling in a general Boltzmann machine is very slow because completing even one step of the Gibbs chain requires many sub-steps. In a Restricted Boltzmann machine, since all the hidden variables are independent given the visible variables, and all the visible variables are independent given the hidden variables, one Gibbs step takes just two block updates, so the conditional probabilities are easy to sample. For the joint probability, we run an MCMC (alternating Gibbs) chain that switches between sampling \( h\) given \( x\) and \( x\) given \( h\). For \( k\) Gibbs steps:

\( x_{1}\sim \hat{P}(x),\ h_{1}\sim P(h\vert x_{1}),\ x_{2}\sim P(x\vert h_{1}),\ h_{2}\sim P(h\vert x_{2}),\ \dots,\ x_{k+1}\sim P(x\vert h_{k})\)

**Contrastive Divergence**

Compared with Gibbs sampling in a general Boltzmann machine, the above sampling method for the Restricted Boltzmann machine is much more efficient. However, running the MCMC chain to convergence is still quite expensive. The idea of k-step Contrastive Divergence is to stop the MCMC chain after \( k\) steps, which saves a great deal of computation. One way to interpret Contrastive Divergence is that, after a few MCMC steps, we already know in which direction the model's samples are drifting away from the data, so instead of waiting for the chain to converge, we can simply stop it and update the network.
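A single CD-k parameter update can be sketched as follows (a minimal sketch under our own naming; `cd_k_update` and the initialization values are illustrative, not from the post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_update(x0, W, b, c, rng, k=1, lr=0.1):
    """One k-step Contrastive Divergence update for an RBM.
    W has shape (hidden, visible); b and c are the visible and
    hidden biases."""
    ph0 = sigmoid(c + W @ x0)      # positive phase: P(h|x0) from the data
    x, ph = x0, ph0
    for _ in range(k):             # truncated Gibbs chain (k steps)
        h = (rng.random(ph.shape) < ph).astype(float)
        px = sigmoid(b + W.T @ h)  # P(x|h): visible given hidden
        x = (rng.random(px.shape) < px).astype(float)
        ph = sigmoid(c + W @ x)    # P(h|x) for the negative phase
    # Gradient estimate: data statistics minus (approximate) model statistics.
    W_new = W + lr * (np.outer(ph0, x0) - np.outer(ph, x))
    b_new = b + lr * (x0 - x)
    c_new = c + lr * (ph0 - ph)
    return W_new, b_new, c_new

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = 0.01 * rng.normal(size=(n_hidden, n_visible))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)
x0 = rng.integers(0, 2, size=n_visible).astype(float)

W, b, c = cd_k_update(x0, W, b, c, rng, k=1)
```

With \( k=1\) (the common choice) the chain is stopped after a single reconstruction, yet the update still pushes the energy of the data down and the energy of the model's own samples up.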

**References**

[1] Learning Deep Architectures for AI, Yoshua Bengio, 2009.

[2] A Tutorial on Energy-Based Learning, Yann LeCun et al., 2006.

[3] Gibbs Measure, Wikipedia.