Abstract: Facial expression recognition (FER) has become an active research area with applications in human-computer interfaces, human emotion analysis, psychological analysis, medical diagnosis, etc. Popular methods used for this purpose are based on geometry and appearance. Deep convolutional neural networks (CNNs) have been shown to outperform traditional methods in various visual recognition tasks, including facial expression recognition. Even though efforts have been made to improve the accuracy of FER systems using CNNs, existing methods may not be sufficient for practical applications. This study presents a generic review of FER systems using CNNs, together with their strengths and limitations, which helps us understand and further improve FER systems.

Index terms: CNN, FER, feature map, ReLU, MLP, BN.
I. Introduction

People have been trying to build artificial intelligence (AI) systems that are equivalent to humans for a long time. The increased availability of computational power and training data has helped the development of machine learning techniques and fast learning machines. Among these, deep learning is widely considered a promising technique for building intelligent machines. Facial expression recognition is the process of identifying human emotion based on facial expressions. While humans are naturally capable of understanding emotions, it remains a challenging task for machines. The facial expression recognition process consists of feature extraction and classification, as shown in Fig. 1.
Fig. 1: Facial Expression Recognition System (input image → feature extraction → classification → output label)

Facial expression recognition is a classification task. The classifier takes as input a set of features retrieved from the input image; these features describe the facial expression. Choosing a good feature set, an efficient learning technique and a diversified database for training are the important factors in classification. CNN is a combination of deep learning technology and artificial neural networks.
The massive development in deep learning and the application of CNNs to classification problems have attained great success [1, 2, 3]. This success comes from the fact that feature extraction and classification can be performed simultaneously. Critical features are extracted by deep learning methods by updating the weights through back-propagation and error optimization.

II. Convolutional Neural Networks

CNNs are biologically inspired variants of multi-layer perceptron (MLP) networks. They use an architecture that is particularly well suited to classifying images. The connections between the layers, the weights associated with them and some form of subsampling result in features that are invariant to translation, which is useful for classifying images. Their architecture also makes convolutional networks fast to train.

A. Architecture of CNN

A CNN contains an input layer, multiple hidden layers and an output layer.
Multiple convolutional layers, sub-sampling (pooling) layers, normalization layers and fully connected layers are treated as hidden layers. Fig. 2 shows the architecture of a 5-layered CNN [4].

Fig. 2: Architecture of a 5-layered CNN [4]

· In the convolutional layer, the input image is convolved with kernel(s) to produce feature map(s), or activation map(s).
· Pooling reduces the dimensionality of each feature map but retains the most important information. It partitions the image into non-overlapping regions, and each region is subsampled (down-sampled) by a non-linear function such as maximum, minimum or average. Max pooling, which outputs the maximum value of each region, is the most common down-sampling function.
· In the Rectified Linear Units (ReLU) layer, an activation function f(x) = max(0, x) is applied element-wise. ReLU introduces non-linearity into the network. Other functions used to introduce non-linearity are the hyperbolic tangent, sigmoid, etc.
· The fully connected layer of a CNN is located after several convolutional and pooling layers, and it is a traditional multi-layer perceptron (MLP). All neurons in this layer are fully connected to all activations in the previous layer.
· In the loss layer, different loss functions suitable for different tasks are used. A softmax loss function is used for classifying an image into multiple classes.
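The basic layer operations listed above can be sketched in NumPy. This is a minimal illustration of ReLU, non-overlapping max pooling and softmax, not any particular paper's implementation:

```python
import numpy as np

def relu(x):
    # Element-wise f(x) = max(0, x)
    return np.maximum(0, x)

def max_pool(fmap, size=2):
    # Partition a feature map into non-overlapping size x size regions
    # and keep the maximum of each region.
    h, w = fmap.shape
    out = fmap[:h - h % size, :w - w % size]
    out = out.reshape(h // size, size, w // size, size)
    return out.max(axis=(1, 3))

def softmax(z):
    # Convert raw class scores into probabilities for the loss layer.
    e = np.exp(z - z.max())
    return e / e.sum()

fmap = np.array([[1., -2., 3., 0.],
                 [4., 5., -1., 2.],
                 [0., 1., 2., -3.],
                 [-1., 0., 1., 6.]])
pooled = max_pool(relu(fmap))     # 2x2 map: [[5, 3], [1, 6]]
probs = softmax(pooled.ravel())   # probabilities summing to 1
```

Here the 4 × 4 feature map is reduced to 2 × 2 by pooling, and softmax turns the flattened scores into a probability distribution over classes.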
III. CNN for Facial Expression Recognition

In recent years, various CNN architectures and models have been created and used for facial expression recognition. This research includes a summary review of studies related to facial expression recognition using CNNs.

A. Deep CNN for FER

Ayşegül Uçar [5] proposed a CNN model with 10 layers for FER, as shown in Fig. 3. In their architecture, a convolutional layer with kernel size 5 × 5, stride 1 and pad 2 was applied. Next, a max pooling layer with kernel size 3 × 3, stride 2 and pad 1 was applied.
This process is repeated 3 times with different strides and pads, as shown in Fig. 3. This is followed by a convolutional layer with kernel size 2 × 2, stride 1 and pad 0, which is followed by another convolutional layer of kernel size 1 × 1, stride 1 and pad 0. Finally, a fully connected layer is added to the network. Their proposed model was evaluated on images from the JAFFE database [6] and the CK database [7]. In the first evaluation, seven facial expressions of various images from the JAFFE database were used for training. In the second evaluation, the Cohn-Kanade database, which contains images of all races, was employed for six expressions.
They resized the images to 16 × 16, and the proposed model was trained for 30 epochs with a batch size of 40. They used a learning rate of 0.001 for the first eleven epochs, 0.002 for epochs 13-29, and 0.00001 for the last epoch.
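The spatial dimensions produced by such layers follow the standard formula ⌊(W − K + 2P)/S⌋ + 1. A small sketch tracing the first two layers described above (the 16 × 16 input and the kernel/stride/pad values come from the description; the helper name is ours):

```python
def out_size(w, k, s, p):
    # Output width of a conv/pool layer: floor((W - K + 2P) / S) + 1
    return (w - k + 2 * p) // s + 1

w = 16                     # input image resized to 16 x 16
w = out_size(w, 5, 1, 2)   # 5x5 conv, stride 1, pad 2: size preserved (16)
w = out_size(w, 3, 2, 1)   # 3x3 max pool, stride 2, pad 1: halved (8)
```

Pad 2 on the 5 × 5 convolution keeps the spatial size unchanged, while the strided pooling halves it — consistent with the repeated conv/pool pattern in the pipeline.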
Their results show that their proposed method outperforms traditional geometry- and appearance-based methods with high accuracy.

Fig. 3: Pipeline of the proposed CNN [5]

B. Baseline CNN Structure for FER

Minchul Shin et al. [8] have looked into four network models that are known to perform well in facial expression recognition, in order to analyse the most efficient network structure. The effect of input-image pre-processing techniques on performance was also investigated. The first network that was looked into was Tang's CNN structure [9]. It contains a layer for input transformation, three convolutional and pooling layers, and a fully connected two-layer perceptron at the end.
The second network is Yu's structure [2], which contains five convolutional layers, three stochastic pooling layers, and three fully connected layers. The network has two convolutional layers prior to each pooling layer, except for the first layer. The third network that is investigated is Kahou's structure [3], which contains three convolutional and pooling layers followed by an MLP of two layers. The last one is the Caffe-ImageNet structure [10]. It was designed for the classification of images taken from the ImageNet dataset into 1000 classes.
However, the output nodes were reduced to seven in the baseline CNN approach. Every convolutional and fully connected layer of all four networks is followed by a ReLU layer and a dropout layer. Five test sets (FER-2013, SFEW 2.0, CK+, KDEF and JAFFE) were chosen to perform tests with the four network structures.
For the pre-processing of the input image, they found that the histogram equalization method shows the most reliable performance for all four networks. They observed that Tang's network could achieve reasonably high accuracy for histogram-equalized images compared to the other network models. Based on this observation, they suggested Tang's simple network along with histogram equalization as the baseline model for carrying out further research.

C. FER with CNN Ensemble

Kuang Liu et al.
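Histogram equalization, the pre-processing step recommended above, can be sketched in NumPy. This is a generic illustration of the standard technique, not the authors' exact pipeline; the low-contrast test image is made up:

```python
import numpy as np

def hist_equalize(img):
    # Map gray levels through the normalized cumulative histogram so the
    # output intensities spread over the full 0-255 range.
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]           # CDF at the darkest occupied bin
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255)
    return np.clip(lut, 0, 255).astype(np.uint8)[img]

# A low-contrast 8-bit "face": all intensities squeezed into 100..121
img = (np.arange(64, dtype=np.uint8).reshape(8, 8) // 3 + 100)
eq = hist_equalize(img)   # intensities now stretched to cover 0..255
```

Stretching the intensity range in this way reduces the effect of lighting variation across face images, which is presumably why it proved the most reliable pre-processing step in the study.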
[11] have proposed a model consisting of many subnets that are structured differently. Each of these subnets is separately trained on a training set, and they are then combined: the output layers are removed, and the layers before the last layer are concatenated together.
Finally, this connected network is trained to output the final expression labels. They evaluated their model using the Facial Expression Recognition 2013 (FER-2013) dataset. It contains grayscale images of faces of size 48 × 48 pixels. They divided the dataset into an 80% training set and a 20% validation set. They trained the subnets separately, and each of the subnets achieved a different accuracy on the dataset. By combining and averaging the outputs of CNNs of different structures, their network reports an improvement in performance when compared to a single CNN structure.
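The output-averaging step described above can be sketched in NumPy. This illustrates the general ensemble technique only; the three subnets and their probabilities are hypothetical, not taken from the paper:

```python
import numpy as np

# Softmax outputs of three hypothetical subnets for one face,
# over the seven FER-2013 expression classes.
subnet_probs = np.array([
    [0.10, 0.05, 0.05, 0.60, 0.10, 0.05, 0.05],   # subnet 1
    [0.20, 0.10, 0.05, 0.40, 0.10, 0.10, 0.05],   # subnet 2
    [0.05, 0.05, 0.10, 0.55, 0.10, 0.10, 0.05],   # subnet 3
])

# Ensemble prediction: average the class probabilities, then take argmax.
ensemble = subnet_probs.mean(axis=0)
label = int(ensemble.argmax())   # class 3, which all subnets favour
```

Averaging smooths out the mistakes of individual subnets: a class only wins if it scores well across differently structured networks, which is the intuition behind the reported improvement over a single CNN.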
D. Stacked Deep Convolutional Auto-Encoders for Emotion Recognition from Facial Expressions

Ariel Ruiz-Garcia et al. [12] have studied the effect of reducing the number of convolutional layers and pre-training the deep CNN as a Stacked Convolutional Auto-Encoder (SCAE) in a greedy layer-wise unsupervised fashion for emotion recognition from facial expressions. They incorporated Batch Normalization (BN) for both the convolutional and fully connected layers in their model to accelerate training and improve classification performance. In the SCAE emotion recognition model, each convolutional layer and its subsequent layers (BN, ReLU and max pooling) are treated as a single block, and an auto-encoder is created for each of these blocks. The first auto-encoder learns to reconstruct raw pixel data.
The second auto-encoder learns to reconstruct the output of the first encoder, and so on. Finally, the fully connected layer is trained to associate the output of the last convolutional encoder with its corresponding label. Their CNN with BN and the SCAE emotion recognizers are trained and tested using the KDEF [13] dataset. Applying the pre-training technique to initialize the weights of the CNN using auto-encoders increased their model's performance to 92.53% and dramatically reduced the training time.

IV. Observations

From the above study, it is clear that even though there are numerous approaches for facial expression recognition, models are still being developed continuously. The reason for this is accuracy.
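The greedy layer-wise pre-training idea can be sketched with tiny linear auto-encoders in NumPy. This is a deliberate simplification — the paper's blocks are convolutional with BN, ReLU and max pooling, whereas these are tied-weight linear layers on synthetic data — but the training order is the same: each auto-encoder reconstructs the output of the previous encoder, and the stacked encoders then provide the initial weights for supervised training:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(x, hidden, lr=0.1, epochs=200):
    # Tied-weight linear auto-encoder: minimize ||x W W^T - x||^2
    # by plain gradient descent on W.
    w = rng.normal(scale=0.1, size=(x.shape[1], hidden))
    for _ in range(epochs):
        err = x @ w @ w.T - x                  # reconstruction error
        grad = x.T @ err @ w + err.T @ x @ w   # gradient (up to a factor of 2)
        w -= lr * grad / len(x)
    return w

x = rng.normal(size=(64, 8))   # synthetic "pixel" data: 64 samples, 8 dims

# Greedy layer-wise pre-training: the second auto-encoder learns to
# reconstruct the output of the first encoder, and so on.
w1 = train_autoencoder(x, hidden=6)
h1 = x @ w1
w2 = train_autoencoder(h1, hidden=4)

# Stacked encoder output: the features that would initialize the
# supervised network before fine-tuning with labels.
features = h1 @ w2
```

After pre-training, the first encoder reconstructs the data far better than its random initialization could, which is what gives the supervised phase its head start.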
Researchers are continuously trying to improve the accuracy of FER by proposing various architectures and models. They have also adopted other techniques in their architectures to improve accuracy, as discussed in Section III. Efforts have been made to reduce training time for better performance, and ensemble CNNs are used to improve the accuracy of facial expression recognition.

V. Conclusions

This paper includes a study of some of the facial expression recognition systems based on CNNs. Different architectures, approaches, requirements, databases for training/testing images and their performance have been studied here.
Each method has its own strengths and limitations. This study helps in understanding different kinds of models for facial expression recognition and in developing new CNN architectures for better performance and accuracy.