Abstract: Facial expression recognition (FER) has become an active research area with many applications, such as human-computer interfaces, human emotion analysis, psychological analysis and medical diagnosis. Popular methods used for this purpose are based on geometry and appearance. Deep convolutional neural networks (CNN) have been shown to outperform traditional methods in various visual recognition tasks, including facial expression recognition. Even though efforts have been made to improve the accuracy of FER systems using CNN, existing methods might not be sufficient for practical applications. This study presents a generic review of FER systems using CNN and their strengths and limitations, which helps us understand and improve FER systems.
Index terms: CNN, FER, feature map, ReLU, MLP, BN.
I. Introduction
People have been trying to build artificial intelligence (AI) systems that are equivalent to humans for a long time. The increased availability of computational power and training data has helped in developing machine learning techniques and fast learning machines. Among these, deep learning is widely considered a promising technique for building intelligent machines. Facial expression recognition is the process of identifying human emotion based on facial expressions. While humans are naturally capable of understanding emotions, this remains a challenging task for machines. The facial expression recognition process consists of feature extraction and classification, as shown in Fig. 1.
Fig. 1 Facial expression recognition system
Recognition is a classification task. The classifier takes as input a set of features retrieved from the input image; these features describe the facial expression. Choosing a good feature set, an efficient learning technique and a diverse database for training are the important factors in classification. The feature set should contain information that discriminates between expressions.
CNN is a combination of deep learning technology and artificial neural networks. Massive developments in deep learning and the application of CNNs to classification problems have attained great success [1, 2, 3]. This success comes from the fact that feature extraction and classification can be performed simultaneously in such networks. Critical features are extracted by deep learning methods by updating the weights using backpropagation and error optimization.
II. Convolutional Neural Networks
CNNs are biologically-inspired variants of multi-layer perceptron (MLP) networks. They use an architecture that is particularly well suited to classifying images. The connections between the layers, the weights shared among them and some form of subsampling result in features that are invariant to translation, which is useful for classifying images. Their architecture also makes convolutional networks fast to train.
A. Architecture of CNN
A CNN contains an input layer, multiple hidden layers and an output layer. The convolutional layers, sub-sampling (pooling) layers, normalization layers and fully connected layers are treated as hidden layers. Fig. 2 shows the architecture of a 5-layered CNN.
Fig. 2 Architecture of a 5-layered CNN [4]
In the convolutional layer, the input image is convolved with one or more kernels to produce feature maps (also called activation maps).
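As a concrete illustration, the convolution step can be sketched in plain Python (a minimal stride-1, no-padding version; real CNN libraries add padding, strides and many kernels, and the example image and kernel below are made up for illustration):

```python
# Minimal sketch of the convolutional layer operation: sliding a small
# kernel over a grayscale image (stride 1, no padding) to produce a
# feature map.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(image[i + u][j + v] * kernel[u][v]
                    for u in range(kh) for v in range(kw))
            row.append(s)
        feature_map.append(row)
    return feature_map

# A vertical-edge kernel applied to an image containing a step edge.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [[1, -1],
          [1, -1]]
print(conv2d(image, kernel))  # → [[0, -2, 0], [0, -2, 0], [0, -2, 0]]
```

The kernel responds strongly (value -2) exactly where the intensity edge lies, which is the sense in which a feature map highlights a pattern in the input.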
Pooling reduces the dimensionality of each feature map while retaining the most important information. It partitions the feature map into non-overlapping regions, and each region is subsampled (down-sampled) by a non-linear function such as the maximum, minimum or average. Max pooling, which outputs the maximum value of each region, is the most common down-sampling function.
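A minimal sketch of 2 × 2 max pooling over non-overlapping regions, as described above (the feature map values are made up for illustration):

```python
# Sketch of 2x2 max pooling: partition the feature map into
# non-overlapping 2x2 regions and keep the maximum of each,
# halving each spatial dimension.

def max_pool2x2(fmap):
    pooled = []
    for i in range(0, len(fmap) - 1, 2):
        row = []
        for j in range(0, len(fmap[0]) - 1, 2):
            region = (fmap[i][j], fmap[i][j + 1],
                      fmap[i + 1][j], fmap[i + 1][j + 1])
            row.append(max(region))
        pooled.append(row)
    return pooled

fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 8, 6],
    [2, 7, 3, 4],
]
print(max_pool2x2(fmap))  # → [[4, 5], [7, 8]]
```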
In the Rectified Linear Unit (ReLU) layer, the activation function f(x) = max(0, x) is applied element-wise. ReLU introduces non-linearity into the network. Other functions used to introduce non-linearity are the hyperbolic tangent, the sigmoid, etc.
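These activation functions can be written directly from their definitions, applied element-wise to a list of activations:

```python
import math

# Element-wise non-linearities: ReLU, sigmoid and hyperbolic tangent.

def relu(xs):
    return [max(0.0, x) for x in xs]

def sigmoid(xs):
    return [1.0 / (1.0 + math.exp(-x)) for x in xs]

def tanh(xs):
    return [math.tanh(x) for x in xs]

xs = [-2.0, 0.0, 3.0]
print(relu(xs))  # → [0.0, 0.0, 3.0]  (negatives are clipped to zero)
```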
The fully connected layer of a CNN is located after several convolutional and pooling layers and is a traditional multi-layer perceptron (MLP). All neurons in this layer are fully connected to all activations in the previous layer.
In the loss layer, different loss functions suitable for different tasks are used. A softmax loss function is used for classifying an image into multiple classes.
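The softmax loss can be sketched as follows: raw class scores (logits) are converted to probabilities, and the loss is the negative log-probability of the true class (the seven logit values below are hypothetical):

```python
import math

# Sketch of the softmax loss for multi-class classification.

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(logits, true_class):
    probs = softmax(logits)
    return -math.log(probs[true_class])

# Seven logits, one per expression class (hypothetical values).
logits = [2.0, 1.0, 0.1, -1.0, 0.5, 0.0, -0.5]
probs = softmax(logits)
print(round(sum(probs), 6))  # → 1.0 (probabilities sum to one)
# The loss is smaller when the true class has the highest score.
print(softmax_loss(logits, 0) < softmax_loss(logits, 3))  # → True
```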
III. CNN for Facial Expression Recognition
Over the years, various CNN architectures and models have been created and used for facial expression recognition. This section presents a summary review of research works related to facial expression recognition using CNN.
A. Deep CNN for FER
The authors of [5] proposed a CNN model with 10 layers for FER, as shown in Fig. 3. In their architecture, a convolutional layer with kernel size 5 × 5, stride 1 and pad 2 is applied first, followed by a max pooling layer with kernel size 3 × 3, stride 2 and pad 1. This convolution-pooling pattern is repeated three times with different strides and pads, as shown in Fig. 3. It is followed by a convolutional layer with kernel size 2 × 2, stride 1 and pad 0, and then by another convolutional layer with kernel size 1 × 1, stride 1 and pad 0. Finally, a fully connected layer is added to the network.
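The kernel, stride and pad values above can be checked with the standard output-size formula out = floor((in + 2·pad - kernel) / stride) + 1; tracing the first two layers for a 16 × 16 input (the image size used in the evaluation below):

```python
# Spatial size after a convolution or pooling layer:
# out = floor((in + 2*pad - kernel) / stride) + 1

def out_size(size, kernel, stride, pad):
    return (size + 2 * pad - kernel) // stride + 1

size = 16
size = out_size(size, kernel=5, stride=1, pad=2)  # conv 5x5, s=1, p=2
print(size)  # → 16 (padding 2 preserves the size)
size = out_size(size, kernel=3, stride=2, pad=1)  # max pool 3x3, s=2, p=1
print(size)  # → 8 (stride 2 roughly halves the size)
```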
The proposed model was evaluated on images from the JAFFE database [6] and the CK database [7]. In the first evaluation, seven facial expressions from various images of the JAFFE database were used for training. In the second evaluation, the Cohn-Kanade database, which contains images of all races, was employed for six expressions. They resized the images to 16 × 16 and trained the proposed model for 30 epochs with a batch size of 40. They used a learning rate of 0.001 for the first eleven epochs, 0.002 for epochs 13-29 and 0.00001 for the last epoch. Their results show that the proposed method outperforms traditional geometry- and appearance-based methods with high accuracy.
Fig. 3 Pipeline of the proposed CNN [5]
B. Baseline CNN structure for FER
Minchul Shin et al. [8] have looked into four network models that are known to perform well in facial expression recognition, in order to identify the most efficient network structure. The effect of input image pre-processing techniques on performance was also studied.
The first network investigated was Tang's CNN structure [9]. It contains a layer for input transformation, three convolutional and pooling layers, and a fully connected two-layer perceptron at the end.
The second network is Yu's structure [2], which contains five convolutional layers, three stochastic pooling layers and three fully connected layers. The network has two convolutional layers prior to pooling, except for the first layer.
The third network investigated is Kahou's structure [3], which contains three convolutional and pooling layers followed by a two-layer MLP.
The last one is the Caffe-ImageNet structure [10]. It was designed for the classification of images from the ImageNet dataset into 1000 classes, but the output nodes were reduced to seven in the baseline CNN approach. Every convolutional and fully connected layer of all four networks is followed by a ReLU layer and a dropout layer.
Five test sets (FER-2013, SFEW 2.0, CK+, KDEF and JAFFE) were chosen to test the four network structures. For pre-processing of the input image, they found that histogram equalization shows the most reliable performance for all four networks. They observed that Tang's network could achieve reasonably high accuracy on histogram-equalized images compared to the other network models. Based on this observation, they suggested Tang's simple network, together with histogram equalization, as the baseline model for carrying out further research.
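Histogram equalization, the pre-processing step found most reliable, remaps grey levels so that the cumulative distribution of intensities becomes approximately uniform. A minimal sketch (the 2 × 2 low-contrast patch is made up for illustration, and the degenerate case of a constant image is not handled):

```python
# Sketch of histogram equalization for an 8-bit grayscale image.

def hist_equalize(image, levels=256):
    flat = [p for row in image for p in row]
    n = len(flat)
    # Histogram and cumulative distribution function (CDF).
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    cdf = []
    running = 0
    for h in hist:
        running += h
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    # Stretch the CDF to cover the full intensity range.
    def remap(p):
        return round((cdf[p] - cdf_min) / (n - cdf_min) * (levels - 1))
    return [[remap(p) for p in row] for row in image]

# A low-contrast patch: values clustered in a narrow band get
# stretched across the full 0-255 range.
img = [[100, 101], [102, 103]]
print(hist_equalize(img))  # → [[0, 85], [170, 255]]
```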
C. FER with CNN ensemble
Kaung Liu et al. [11] have proposed a model consisting of many differently structured subnets. Each of these subnets is trained separately on a training set, and the subnets are then combined: their output layers are removed and the layers before the last layer are concatenated together. Finally, this combined network is trained to output the final expression label.
They evaluated their model using the Facial Expression Recognition 2013 (FER-2013) dataset, which contains grayscale images of faces of size 48 × 48 pixels. They divided the dataset into an 80% training set and a 20% validation set and trained the subnets separately. Each of the subnets achieved a different accuracy on the dataset. By combining and averaging the outputs of CNNs with different structures, their network reports an improvement in performance compared to a single CNN structure.
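The averaging idea can be sketched as follows: average the per-class probability outputs of several subnets and predict the class with the highest mean probability (the three subnet outputs below are made-up numbers for illustration):

```python
# Sketch of ensemble combination by output averaging.

def ensemble_predict(subnet_outputs):
    n_classes = len(subnet_outputs[0])
    avg = [sum(out[c] for out in subnet_outputs) / len(subnet_outputs)
           for c in range(n_classes)]
    # Predicted label is the class with the highest averaged probability.
    return max(range(n_classes), key=lambda c: avg[c]), avg

subnet_outputs = [
    [0.6, 0.3, 0.1],   # subnet 1: confident in class 0
    [0.2, 0.5, 0.3],   # subnet 2: prefers class 1
    [0.5, 0.4, 0.1],   # subnet 3: prefers class 0
]
label, avg = ensemble_predict(subnet_outputs)
print(label)  # → 0 (the majority view wins after averaging)
```

Averaging smooths out the idiosyncratic errors of individual subnets, which is why the combined network outperforms any single structure.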
D. Stacked Deep Convolutional Auto-Encoders for Emotion Recognition from Facial Expressions
Ruiz-Garcia et al. [12] have studied the effect of reducing the number of convolutional layers and pre-training the deep CNN as a Stacked Convolutional Auto-Encoder (SCAE) in a greedy layer-wise unsupervised fashion for emotion recognition from facial expressions. They incorporated Batch Normalization (BN) for both the convolutional and the fully connected layers in their model to accelerate training and improve classification performance.
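Batch normalization normalizes each activation over the mini-batch to zero mean and unit variance, then scales and shifts the result with learned parameters gamma and beta. A minimal sketch for a batch of scalar activations:

```python
import math

# Sketch of batch normalization over a mini-batch of scalar activations.

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    # Normalize, then apply the learned scale (gamma) and shift (beta).
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta
            for x in batch]

batch = [1.0, 2.0, 3.0, 4.0]
normed = batch_norm(batch)
print(round(sum(normed) / len(normed), 6))  # mean ≈ 0 after normalization
```

Keeping layer inputs in this standardized range is what accelerates training: gradients stay well scaled regardless of how earlier layers shift their outputs.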
In their SCAE emotion recognition model, each convolutional layer and its subsequent layers (BN, ReLU and max pooling) are treated as a single block, and an auto-encoder is created for each of these blocks. The first auto-encoder learns to reconstruct raw pixel data, the second auto-encoder learns to reconstruct the output of the first encoder, and so on. Finally, the fully connected layer is trained to associate the output of the last convolutional encoder with its corresponding label.
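The greedy layer-wise scheme can be sketched with tiny scalar linear auto-encoders standing in for the real convolutional blocks; this is a toy illustration of the training loop structure, not the authors' model, and the data values are made up:

```python
# Toy sketch of greedy layer-wise auto-encoder pretraining.
# Each "layer" is a scalar linear map x -> w*x with decoder h -> v*h.

def train_autoencoder(data, epochs=200, lr=0.05):
    """Train one auto-encoder by gradient descent on the mean
    squared reconstruction error 0.5 * (v*w*x - x)^2."""
    w, v = 0.5, 0.5  # small fixed init so the run is deterministic
    for _ in range(epochs):
        for x in data:
            h = w * x          # encode
            r = v * h          # decode (reconstruct)
            err = r - x
            grad_v = err * h
            grad_w = err * v * x
            v -= lr * grad_v
            w -= lr * grad_w
    return w, v

def greedy_pretrain(data, n_layers=2):
    """Train the stack greedily: each auto-encoder learns to
    reconstruct the codes produced by the previous encoder."""
    encoders = []
    current = list(data)
    for _ in range(n_layers):
        w, v = train_autoencoder(current)
        encoders.append(w)
        current = [w * x for x in current]  # feed codes to next layer
    return encoders

data = [0.2, -0.5, 1.0, 0.8]
w, v = train_autoencoder(data)
print(abs(w * v - 1.0) < 0.05)  # → True: decoder inverts the encoder
```

After pretraining, the encoder weights initialize the CNN, which is then fine-tuned with labels; the point of the sketch is that each layer is trained only against the previous layer's output, with no labels involved.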
The CNN with BN and the SCAE emotion recognizers were trained and tested using the KDEF dataset [13]. Applying the pre-training technique to initialize the CNN weights using auto-encoders increased their model's accuracy to 92.53% and dramatically reduced the training time.
IV. Discussion
From the above study, it is clear that even though there are numerous approaches to facial expression recognition, new models are still being developed continuously. The main reason for this is accuracy. Researchers are continuously trying to improve the accuracy of FER by proposing various architectures and models, and they have adopted additional techniques in their architectures, as discussed above, to address the problem of accuracy. Efforts have also been made to reduce the training time for better performance, and ensemble CNNs have been used to improve the accuracy of facial expression recognition.
V. Conclusion
This paper includes a study of some of the facial expression recognition systems based on CNNs. Different architectures, approaches, requirements, databases for training/testing images and their performance have been studied here. Each method has its own strengths and limitations. This study helps in understanding different kinds of models for facial expression recognition and in developing new CNN architectures for better performance and accuracy.