Before fine-tuning, the RBMs are pre-trained layer by layer: the outputs of one layer (RBM) are treated as inputs to the next layer (RBM), and the procedure is repeated until all the RBMs are trained. This layer-by-layer unsupervised learning is important in DBN training because, in practice, it helps avoid local optima and alleviates the overfitting problem that arises when thousands of parameters are used.
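The stacking procedure can be illustrated with a minimal NumPy sketch. It is not the implementation behind the text: `train_rbm` is a hypothetical stand-in trainer (plain one-step contrastive divergence on the full batch), and the names `pretrain_stack`, `hidden_sizes`, and the toy data are assumptions made for illustration. The point is only the data flow, in which each trained layer's hidden activations become the next layer's training input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, rng, epochs=20, lr=0.1):
    """Hypothetical stand-in trainer: full-batch one-step contrastive divergence."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    hid_bias = np.zeros(n_hidden)
    vis_bias = np.zeros(n_visible)
    for _ in range(epochs):
        p_h0 = sigmoid(data @ W + hid_bias)                  # hidden probabilities given data
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)   # sampled hidden states
        p_v1 = sigmoid(h0 @ W.T + vis_bias)                  # reconstruction probabilities
        p_h1 = sigmoid(p_v1 @ W + hid_bias)                  # hidden probabilities given reconstruction
        W += lr * (data.T @ p_h0 - p_v1.T @ p_h1) / len(data)
        hid_bias += lr * (p_h0 - p_h1).mean(axis=0)
        vis_bias += lr * (data - p_v1).mean(axis=0)
    return W, hid_bias

def pretrain_stack(data, hidden_sizes, rng):
    """Greedy layer-wise pre-training: train one RBM, then feed its output upward."""
    layers = []
    layer_input = data
    for n_hidden in hidden_sizes:
        W, hid_bias = train_rbm(layer_input, n_hidden, rng)
        layers.append((W, hid_bias))
        # Assumption: hidden probabilities (not samples) are passed to the next RBM.
        layer_input = sigmoid(layer_input @ W + hid_bias)
    return layers

rng = np.random.default_rng(0)
toy_data = (rng.random((50, 8)) < 0.5).astype(float)  # random binary toy data
stack = pretrain_stack(toy_data, hidden_sizes=[6, 4, 2], rng=rng)
print([W.shape for W, _ in stack])
```

Each RBM here is trained only once and never revisited, which is what makes the procedure greedy; no labels are used at any point.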

Furthermore, the algorithm is attractive in terms of its time complexity, which is linear in the size of the RBMs [36]. Features at different layers contain different information about data structures, with high-level features constructed from low-level features.

For a simple RBM with a Bernoulli distribution for both the visible and hidden layers, the sampling probabilities are as follows [36]:

$$P(h_j = 1 \mid \mathbf{v}; W) = \sigma\Big(\sum_{i} w_{ij} v_i + a_j\Big) \tag{1}$$

and

$$P(v_i = 1 \mid \mathbf{h}; W) = \sigma\Big(\sum_{j} w_{ij} h_j + b_i\Big) \tag{2}$$

where $\mathbf{v}$ and $\mathbf{h}$ represent an $I \times 1$ visible unit vector and a $J \times 1$ hidden unit vector, respectively; $W$ is the matrix of weights ($w_{ij}$) connecting the visible and hidden layers; $a_j$ and $b_i$ are bias terms; and $\sigma(\cdot)$ is a sigmoid function. For real-valued visible units, the conditional probability distributions are quite different: typically, a Gaussian-Bernoulli distribution is assumed and $P(v_i \mid \mathbf{h}; W)$ is Gaussian.

The weights $w_{ij}$ are updated using an approximate method called contrastive divergence (CD). For example, the update for $w_{ij}$ at step $t+1$ is:

$$\Delta w_{ij}(t+1) = c\,\Delta w_{ij}(t) + \alpha\big(\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}\big) \tag{3}$$

where $\alpha$ is the learning rate and $c$ is the momentum factor; $\langle \cdot \rangle_{\text{data}}$ and $\langle \cdot \rangle_{\text{model}}$ are the expectations under the distributions defined by the data and the model, respectively. While the expectations may be calculated by running Gibbs sampling indefinitely, in practice one-step CD is often used because it performs well [37]. Other model parameters (e.g., the biases) can be updated similarly.

As a generative model, RBM training includes a Gibbs sampler that samples the hidden units based on the visible units and vice versa (Eqs. (1) and (2)). The weights between these two layers are then updated using the CD rule (Eq. (3)). This procedure is repeated until convergence. An RBM models the data distribution using hidden units without employing label information.
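One way to make Eqs. (1) to (3) concrete is the following NumPy sketch of a single one-step CD update for a Bernoulli-Bernoulli RBM. It follows the text's convention that `a` holds the hidden biases and `b` the visible biases. The function name `cd1_step`, the mini-batch averaging, the use of hidden probabilities rather than samples in the statistics, and the momentum-free bias updates are all assumptions of this sketch, not details given in the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, dW_prev, rng, lr=0.1, momentum=0.5):
    """One CD-1 update.

    v0: (batch, I) binary visible data; W: (I, J) weights;
    a: (J,) hidden biases; b: (I,) visible biases;
    dW_prev: previous weight increment, used for the momentum term.
    """
    # Eq. (1): sample hidden units given the data.
    p_h0 = sigmoid(v0 @ W + a)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

    # Eq. (2): sample visible units given the hidden sample (one Gibbs step).
    p_v1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + a)

    # Eq. (3): data statistics minus one-step "model" statistics, with momentum.
    n = v0.shape[0]
    data_stats = v0.T @ p_h0 / n
    model_stats = v1.T @ p_h1 / n
    dW = momentum * dW_prev + lr * (data_stats - model_stats)

    # Biases are updated analogously (assumption: without momentum).
    a_new = a + lr * (p_h0 - p_h1).mean(axis=0)
    b_new = b + lr * (v0 - v1).mean(axis=0)
    return W + dW, a_new, b_new, dW

rng = np.random.default_rng(0)
v_data = (rng.random((20, 6)) < 0.5).astype(float)  # random binary toy batch
W = 0.01 * rng.standard_normal((6, 4))
a = np.zeros(4)
b = np.zeros(6)
dW = np.zeros_like(W)
for _ in range(50):
    W, a, b, dW = cd1_step(v_data, W, a, b, dW, rng)
```

The loop runs a fixed number of steps for brevity; a real training run would monitor a convergence criterion such as reconstruction error instead.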

After pre-training, information about the input data is stored in the weights between adjacent layers. The DBN then adds a final layer representing the desired outputs, and the overall network is fine-tuned using labeled data and back-propagation for better discrimination.

There are other variations for pre-training: instead of RBMs, for example, stacked denoising auto-encoders and stacked predictive sparse coding have also been proposed for unsupervised feature learning. Furthermore, recent results show that when a large amount of training data is available, fully supervised training using random initial weights instead of pre-trained weights (i.e., without using RBMs or auto-encoders) works well in practice. For example, a discriminative model starts with a network with a single hidden layer (i.e., a shallow neural network), which is trained by back-propagation. Upon convergence, a new hidden layer is inserted into this shallow NN (between the first hidden layer and the desired output layer) and the full network is discriminatively trained again. This procedure is continued until all the hidden layers are trained.
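The layer-growing discriminative procedure can be sketched as follows. This is a minimal NumPy illustration under several assumptions not stated in the text: sigmoid units throughout, a cross-entropy loss, full-batch gradient descent with a fixed number of epochs in place of a convergence test, and a freshly initialized output layer each time a hidden layer is inserted. The names `train_backprop` and `grow_network` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(weights, biases, x):
    """Return the activations of every layer, input included."""
    acts = [x]
    for W, b in zip(weights, biases):
        acts.append(sigmoid(acts[-1] @ W + b))
    return acts

def train_backprop(weights, biases, x, y, lr=0.5, epochs=200):
    """Full-batch back-propagation over the whole network (updates in place)."""
    for _ in range(epochs):
        acts = forward(weights, biases, x)
        # Sigmoid output with cross-entropy loss: output delta is (prediction - target).
        delta = (acts[-1] - y) / len(x)
        for l in reversed(range(len(weights))):
            grad_W = acts[l].T @ delta
            grad_b = delta.sum(axis=0)
            if l > 0:
                # Propagate the error with the pre-update weights.
                delta = (delta @ weights[l].T) * acts[l] * (1.0 - acts[l])
            weights[l] -= lr * grad_W
            biases[l] -= lr * grad_b

def grow_network(x, y, hidden_sizes, rng):
    """Start shallow, then insert one hidden layer at a time and retrain everything."""
    def init(n_in, n_out):
        return 0.1 * rng.standard_normal((n_in, n_out))

    n_in, n_out = x.shape[1], y.shape[1]
    weights = [init(n_in, hidden_sizes[0]), init(hidden_sizes[0], n_out)]
    biases = [np.zeros(hidden_sizes[0]), np.zeros(n_out)]
    train_backprop(weights, biases, x, y)
    for n_hidden in hidden_sizes[1:]:
        prev_width = weights[-1].shape[0]
        # Assumption: the old output layer is replaced by a new hidden layer
        # plus a freshly initialized output layer.
        weights[-1:] = [init(prev_width, n_hidden), init(n_hidden, n_out)]
        biases[-1:] = [np.zeros(n_hidden), np.zeros(n_out)]
        train_backprop(weights, biases, x, y)
    return weights, biases

rng = np.random.default_rng(0)
x = rng.random((40, 5))                                   # toy inputs
y = (x.sum(axis=1, keepdims=True) > 2.5).astype(float)    # toy binary labels
weights, biases = grow_network(x, y, hidden_sizes=[8, 6, 4], rng=rng)
predictions = forward(weights, biases, x)[-1]
```

Unlike RBM pre-training, every stage here uses the labels, and previously trained hidden layers keep being updated whenever a new layer is added.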

In summary, DBNs use a greedy and efficient layer-by-layer approach to learn the latent variables (weights) in each hidden layer and a back-propagation method for fine-tuning. This hybrid training strategy improves both the generative performance and the discriminative power of the network.