TOWARDS SOLVING ABSTRACT TASKS USING CONVOLUTIONAL NEURAL NETWORKS
A Dissertation Presented to the Faculty of the Graduate School
of Cornell University in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by Kuan-Chuan Peng
August 2016

© 2016 Kuan-Chuan Peng ALL RIGHTS RESERVED

TOWARDS SOLVING ABSTRACT TASKS USING CONVOLUTIONAL NEURAL NETWORKS
Kuan-Chuan Peng, Ph.D.
Cornell University 2016
Abstract tasks are tasks relating to or involving general ideas or qualities rather than specific people, objects, or actions. Recently, abstract tasks such as artistic style classification and memorability prediction have received increasing attention in the computer vision community. Previous work on abstract tasks mainly relies on standard handcrafted features rather than learning features directly from the training data. In this thesis, we explore the efficacy of convolutional neural networks (CNN), which learn features tailored to each abstract task. Predicting emotion distributions and predicting emotion stimuli maps are the first two abstract tasks we work on. In both tasks, we build the associated databases and show that CNN-based approaches predict more accurate emotion distributions and emotion stimuli maps than the methods used in previous work. Given the encouraging results in the emotion-related tasks, we apply CNN to eight different abstract tasks recently proposed in computer vision, showing that CNN-based approaches can outperform the state-of-the-art performance reported in previous work.
In addition to the traditional CNN framework, we propose using multi-task, multi-depth, and multi-scale CNN features to further improve the performance on abstract tasks. Multi-task features incorporate the features learned from the training data of other tasks. Multi-depth features consist of the features learned by different neural network architectures, while multi-scale features are formed from the features learned from the augmented training data at different scales. The experimental results show that all three proposed CNN features outperform the traditional CNN framework. Furthermore, we train additional fully connected networks to fuse our proposed CNN features. The fused features achieve better performance than each of our proposed features used alone.

THESIS COMMITTEE
Prof. Tsuhan Chen, School of Electrical and Computer Engineering, Cornell University
Prof. Kavita Bala, Department of Computer Science, Cornell University
Prof. Serge Belongie, Department of Computer Science, Cornell University and Cornell Tech

BIOGRAPHICAL SKETCH
Kuan-Chuan Peng receives his Ph.D. in the field of Electrical and Computer Engineering from Cornell University. He joined the Advanced Multimedia Processing Lab at Cornell University in 2012, advised by Prof. Tsuhan Chen. His thesis committee includes Prof. Tsuhan Chen, Prof. Kavita Bala, and Prof. Serge Belongie. His thesis focuses on using convolutional neural networks to improve the performance on abstract tasks in computer vision such as emotion classification and memorability prediction.
Before accepting his Ph.D. degree from Cornell University, Kuan-Chuan Peng received his bachelor's degree in Electrical Engineering from National Taiwan University in 2009. He became interested in computer vision through the special projects advised by Prof. Liang-Gee Chen. He therefore shifted his major to Computer Science and received his master's degree from National Taiwan University in 2012. During his two years of master's studies at National Taiwan University, he worked with Prof. Chiou-Shann Fuh and JMicron Technology Corp., developing a pedestrian detection algorithm for automobile devices.
As a researcher in computer vision, Kuan-Chuan Peng not only serves as a reviewer for top-tier academic conferences in the field but also publishes his work at those venues, including the International Conference on Computer Vision (ICCV) and the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). After graduating from Cornell University, Kuan-Chuan Peng plans to continue his research on neural networks and abstract tasks.

This thesis is dedicated to my parents: Jui-Pin Peng and Kuei-Yuan Liu. I appreciate your unconditional love and continuous support.

ACKNOWLEDGEMENTS
I am grateful for the constant support and guidance from my advisor, Prof. Tsuhan Chen, during my Ph.D. life. You are the main reason why I chose Cornell for my Ph.D. study, and this is one of the best decisions I have ever made in my life. You encourage me to think creatively and provide me with lots of freedom in various topics in computer vision. Your academic insight helped me overcome the challenges of pursuing my Ph.D. I am always inspired by your advice on my research and your invaluable life experiences. You are my role model in both career and research, and I am proud to be your student.
I appreciate the suggestions given by my thesis committee: Prof. Kavita Bala and Prof. Serge Belongie. Your broad knowledge and clairvoyant advice encourage me to keep improving my research. I also want to thank Prof. Noah Snavely and Prof. Thorsten Joachims. Their courses in computer vision and machine learning built a solid foundation for me to carry out my Ph.D. research.
I thank Heather Yu and Dongqing Zhang, my manager and mentor when I was an intern researcher at Huawei in 2013. Your industrial insight and continuous support even after my internship are an indispensable part of my research. Our study of emotion prediction motivated my subsequent research in abstract tasks. It is a rewarding experience to work with you.
I am grateful for the support from my collaborators: Andrew Gallagher, Amir Sadovnik, and Kolbeinn Karlsson. I enjoy the brainstorming in our discussions and appreciate your constructive advice during my Ph.D. study. It is my pleasure to work with you and learn from you. I thank all the members of the Advanced Multimedia Processing (AMP) Lab: Adarsh Kowdle, Zhaoyin Jia, Ruogu Fang, Amandianeze Nwana, Hang Chu, Henry Shu, and Dong Ki Kim. I also thank the scholars who visited the AMP Lab during my time at Cornell, including James Guo-Zhen Wang, Toshihiko Yamasaki, Satoshi Ueno, Tim TyngLuh Liu, and Yuuka Kihara. Their feedback on my research and the useful discussions in our weekly group meetings are important for my research. I am lucky to be able to learn from each of you. I am thankful for the Cornell students who have worked with me in the past few years. Their creativity always gives me new ideas in my research.
I thank all my friends at Cornell and the Cornell Taiwanese Student Association. They always give me the warmest hugs, especially in the winter in Ithaca. Their support makes my life at Cornell colorful and memorable. I appreciate the unconditional love from my family. You are always there rooting for me, which makes me unafraid of any obstacles in my life.

TABLE OF CONTENTS

Biographical Sketch
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction
    1.1 First Published Appearances of Described Contributions

2 Modeling and Predicting Evoked Emotion Distributions
    2.1 Introduction
    2.2 Prior Work
    2.3 The Emotion6 Database
        2.3.1 Emotion Model and Category Selection
        2.3.2 Image Collection and User Study
    2.4 Predicting Emotion Distributions
        2.4.1 Support Vector Regression (SVR)
        2.4.2 Convolutional Neural Networks (CNN) and CNN for Regression (CNNR)
    2.5 Predicting Valence–Arousal (VA) Scores
    2.6 Conclusion

3 Predicting Emotion Stimuli Maps
    3.1 Introduction
    3.2 The EmotionROI Database and User Study
    3.3 Proposed Methods
        3.3.1 Labeling from Emotion Similarity (LES)
        3.3.2 FCN with Euclidean Loss (FCNEL)
    3.4 Experimental Setting
        3.4.1 Evaluation Metrics
        3.4.2 Baselines: Saliency and Objectness Detection
    3.5 Experimental Results
    3.6 Conclusion

4 Multi-task CNN Features
    4.1 Introduction
        4.1.1 Summary of Findings
    4.2 Experimental Setup
        4.2.1 Databases and Tasks
        4.2.2 Network Architecture
        4.2.3 Training Approaches
    4.3 CNN Performance in Abstract Tasks
    4.4 Correlating Abstract Tasks
        4.4.1 Experimental Setting
        4.4.2 Experimental Results
    4.5 Conclusion

5 Multi-depth CNN Features
    5.1 Introduction
    5.2 Generating Multi-depth CNN Features
    5.3 Experimental Setup
        5.3.1 Databases and Tasks
        5.3.2 Training Approach
    5.4 Experimental Results
    5.5 Conclusion

6 Multi-scale CNN Features
    6.1 Introduction
    6.2 Label-Inheritable Property
    6.3 Experimental Setup
        6.3.1 Databases and Tasks
        6.3.2 MSCNN Architecture
        6.3.3 Training Approach
    6.4 Experimental Results
    6.5 Conclusion

7 Fusing Multi-depth, Multi-scale, and Multi-Task Features
    7.1 Introduction
    7.2 Feature Fusion Using Fully Connected Networks
        7.2.1 Network Architecture
        7.2.2 Fused Features
    7.3 Experimental Results
    7.4 Feature Fusion for Predicting Emotion Distributions
        7.4.1 Motivation
        7.4.2 Proposed Combined Networks
        7.4.3 Experimental Results

8 Conclusions and Future Work
    8.1 Future Work: CNN Visualization for Abstract Tasks

A Related Publications

Bibliography


LIST OF TABLES
1.1 Several standard benchmark databases and the recent CNN-related works using them.
2.1 The issues of previous emotion image databases and how our proposed database, Emotion6, solves these issues.
2.2 The feature set we use for support vector regression (SVR) in predicting emotion distributions.
2.3 The performance of different methods for predicting emotion distributions compared using $P_M$ and $\overline{M}$ ($M \in \{\text{KLD}, \text{BC}, \text{CD}, \text{EMD}\}$). The upper table shows $P_M$, the probability that Method 1 outperforms Method 2 with distance metric $M$. Each row in the upper table shows that Method 1 outperforms Method 2 in all $M$. The lower table lists $\overline{M}$, the mean of $M$, of each method, showing that CNNR achieves better $\overline{M}$ than the other methods listed here. CNNR performs the best out of all the listed methods in terms of all $P_M$s with better $\overline{M}$.
2.4 The performance of different algorithms for predicting valence and arousal scores. The numbers are the average of absolute difference (AAD) compared with the ground truth in SAM 9-point scale [6]. CNNR outperforms the two baselines, and has comparable performance to SVR.
3.1 The performance of predicting emotion stimuli maps of the 594 testing images of the EmotionROI database in MAE, precision, recall, and 4 F-measures. The performance is shown by $P_M$ and $\overline{M}$ ($M \in \{\text{MAE}, \text{precision}, \text{recall}, F_{0.5}, F_{\sqrt{0.3}}, F_1, F_2\}$). The top table lists $\overline{M}$, the mean of metric $M$, of each method, showing that LES and FCNEL achieve better $\overline{M}$ than the two baselines. For each metric $M$, the best $\overline{M}$ is marked in bold. The bottom two tables show $P_M$, the probability that Method 1 outperforms Method 2 with metric $M$. Each row in the bottom two tables shows that Method 1 outperforms Method 2 in most $M$. For saliency detection [9], both the results before / after "block fitting" are reported. FCNEL performs the best out of all the listed methods in terms of most $P_M$ with better $\overline{M}$.
4.1 The databases and associated abstract tasks used in this thesis along with their properties. In the rest of the thesis, we refer to each task by the corresponding task ID listed under each task. The experimental setting for each task is provided at the bottom of the table, where ρ is Spearman rank correlation between the prediction and the ground truth.

4.2 The five different CNN training approaches used in this chapter. $M_{\text{ImageNet}}$ is the Caffe [32] reference model trained for ImageNet [11] classification, and $M_{\text{AVA}}$ is our trained reference model for AVA [54] classification. We refer to each training method by its training approach ID.
4.3 The summary of the experimental results of the 8 abstract tasks listed in Table 4.1 using the five training approaches in Table 4.2. In this table, ρ is the Spearman rank correlation between the prediction and the ground truth. The bold numbers represent the best performance given the specified evaluation metric, and the underlined numbers indicate the performance better than that of "train from scratch."
4.4 The summary of the experimental results using the framework in Figure 4.1. The task CAL and the 8 abstract tasks listed in Table 4.1 are included in this experiment. In this table, ρ is the Spearman rank correlation between the prediction and the ground truth. The underlined numbers indicate the performance better than that of using self feature. The best concatenating setting in each task is shown in Table 4.5. This table shows that concatenating the CNN features learned from different tasks can improve the performance. However, in our experiment, concatenating all the learned CNN features never performs the best in each task. In fact, in 4 out of 9 tasks, concatenating all the learned CNN features performs even worse than using self feature.
4.5 The best concatenating setting out of 264 different settings in each task. The corresponding performance is listed in Table 4.4. This table shows that the best performance of each task is not achieved by concatenating all the learned CNN features, but by concatenating a subset of them.
4.6 The performance ranking in each task by using the CNN features learned from a single task. The numbers are presented in the form "rank (performance)." The rank 1 (9) represents the best (worst) performance in each task. The table shows that self feature usually performs the best if we use the CNN features learned from one task. The best and the worst performances out of the 9 different features in each task are listed at the bottom of the table, which shows that even the worst performance is nontrivial (much better than random guessing).

4.7 The performance ranking in each task by using self feature and the CNN features learned from another task. The format of the numbers is the same as that of Table 4.6 except that the underlined numbers represent the performance better than that of using only self feature. The best and the worst performances in each task out of the 8 different combinations of features are listed at the bottom of the table, where the performance of using only self feature is also provided. The table shows that most of the listed feature combinations outperform using only self feature in each task.
5.1 The summary of the multi-depth CNN features used in this chapter. Serving as a baseline, $F_0$ represents the features extracted from the topmost layer in the traditional CNN framework. $F_1$ to $F_4$ are our proposed multi-depth CNN features which are formed by concatenating $f_i$s ($i \in \{0, 1, \cdots, 4\}$) defined in Figure 5.1. We follow the specification of the AlexNet [46] and use $k = 4096$.
5.2 The summary of our experimental results. The bold numbers represent the best performance for each task. The results show that for all the 8 tasks, our proposed multi-depth CNN features ($F_1$ to $F_4$) outperform not only the best known results from prior works but also the features commonly used in the traditional CNN framework ($F_0$).
6.1 The summary of our experimental results. The bold numbers represent the best performance for each task. The results show that for most of the 8 tasks, our proposed multi-scale CNN features (MSCNN-1 to MSCNN-3) outperform not only the best known results from prior works but also the traditional CNN framework (MSCNN-0).
7.1 The performance comparison between each of our proposed features and the fused features. The numbers are reported in the format "accuracy (%) [confidence level (%)]." The confidence represents the confidence of the feature outperforming the CNN baseline according to the binomial test. The italic numbers are the performance exceeding the 95% confidence level (statistically significant results).
7.2 Performance comparison in predicting emotion distributions. Our combined networks predict more accurate emotion distributions with ≥97.7% confidence compared with Peng's method [66].

LIST OF FIGURES

1.1 Example abstract tasks proposed in recent computer vision literature. For each abstract task, we show an example image and its ground truth label from the corresponding database proposed in the listed reference.

2.1 Example images of Emotion6 with the corresponding ground truth. The emotion keyword used to search each image is displayed on the top. The graph below each image shows the probability distribution of evoked emotions of that image. The bottom two numbers are valence–arousal (VA) scores in SAM 9-point scale [6].
2.2 Two screenshots of the interface of our user study on the Amazon Mechanical Turk. Before the subject answers the questions (the right image), we provide the subject with the instructions and an example (the left image) explaining how to answer the questions.
2.3 Classification performance of CNN and Wang's method [73] with the Artphoto database [52]. In 6 out of 8 emotion categories, CNN outperforms Wang's method [73].
2.4 The mapping between the color codes and emotion keywords used in Section 2.5.
2.5 The distribution of VA scores of the Emotion6 database. All the images in Emotion6 are placed in VA plane according to their VA scores. The boundary of each image is colored to reflect the dominant evoked emotion according to the color codes in Figure 2.4.

3.1 An example showing that different regions in an image contribute to the viewer's evoked emotion differently. (c), (d), and (e) are cut from the yellow, green, and red rectangles of (a) respectively. (b) shows the regions in (a) which affect the evoked emotion the most, as marked in the user study. (e) evokes emotions more similar to those evoked by (a) than (c) and (d) do because (e) contains not only the person jumping but other emotion-related areas, which is consistent with (b), the ground truth of the emotion stimuli map of (a).
3.2 An example showing the difference between saliency, objectness detection and the emotion stimuli map. (b) is the ground truth emotion stimuli map using (a) as the input image. (c) and (d) correspond to saliency [9] and objectness [2] detection, respectively. Neither (c) nor (d) perfectly captures (b), where two thirds of the subjects indicate that the area affecting evoked emotions includes not only the flower but also other emotion-related areas.


3.3 A screenshot of the interface of our user study on the Amazon Mechanical Turk. We ask the subject to draw a rectangle enclosing the part of the image that most influences the evoked emotion.
3.4 Some example images from the EmotionROI database with the corresponding ground truth emotion stimuli maps. The emotion keyword used to search each image (provided by the Emotion6 [66] database) is displayed under the image.
3.5 The performance of predicting emotion stimuli maps of the 594 testing images of the EmotionROI database in PR curve, average precision and recall, and 4 F-measures. These figures show that our two proposed methods (LES and FCNEL) outperform the two baselines (saliency [9] and objectness [2] detection) in PR curve as well as 6 statistics computed from the average precision and recall.
3.6 The qualitative and quantitative results of predicting emotion stimuli maps with some testing images of the EmotionROI database as input. The representation of each column is as follows: (a) input image, (b) the ground truth emotion stimuli map, (c) the pre-processed result of saliency detection [9], (d) the post-processed result of saliency detection [9], (e) the result of objectness detection [2], (f) the result of LES, and (g) the result of FCNEL. The emotion keyword used to search each input image is shown under each image in column (a) according to the information provided in the Emotion6 [66] database. For column (b) to (g), the corresponding MAE is shown under each image. Our two proposed methods (column (f) and (g)) predict more accurate emotion stimuli maps than other baselines (column (c) to (e)) do for these examples.
4.1 The framework of concatenating the CNN features learned from different tasks. We experiment on 9 tasks ($n = 9$), including the 8 abstract tasks in Table 4.1 and Caltech-101 [48] object classification task. The switch $S_i$ associated with each task $T_i$ ($i \in \{1, 2, \cdots, 9\}$) controls whether the CNN features learned from task $T_i$ are concatenated in the final feature vector.

5.1 The five CNN structures adopted in this chapter. $\text{CNN}_0$ represents the AlexNet [46], and $\text{CNN}_1$ to $\text{CNN}_4$ are the same as $\text{CNN}_0$ except that some convolutional layers are removed. We use each $\text{CNN}_i$ ($i \in \{0, 1, \cdots, 4\}$) as a feature extractor which takes an image as input and outputs a feature vector $f_i$ from the topmost fully connected layer. These $f_i$s are concatenated to form our proposed multi-depth features according to the definition in Table 5.1.
6.1 The illustration of label-inheritable (LI) property. Given an image database $D$ associated with a task $T$, if any cropped version of any image $I$ from $D$ can take the same label as that of $I$, we say that $T$ satisfies LI property. In this figure, we only show the case when the cropped image is the upper right portion of the original image. In fact, the cropped image can be any portion of the original image.
6.2 Example images from the three databases (Painting-91 [41], arcDataset [77], and Caltech-101 [48]) associated with three different tasks which satisfy LI property in different degrees. The corresponding database, task, label, and the extent that LI property is satisfied are shown under each image.
6.3 The illustration of our proposed multi-scale convolutional neural networks (MSCNN) which consists of $m + 1$ AlexNet [46] (one for each of the $m + 1$ different scales). The details of the MSCNN architecture and the training approach are explained in Sec. 6.3.2 and Sec. 6.3.3 respectively.
7.1 The fully connected networks we use to fuse our proposed multi-task, multi-depth, and multi-scale CNN features.
7.2 The illustration of our proposed combined networks for predicting emotion distributions. We only show convolution, deconvolution, and fully connected layers for clarity.
8.1 The visualization of the output node representing the class "high aesthetic value" in the task AVA in Table 4.1. The 4 images in (a)/(b) are the results generated by Yosinski's [81]/Nguyen's [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where all the 9 training images are correctly labeled by the trained CNN.

8.2 The visualization of the multi-task features. The visualized output node represents the class "Rembrandt van Rijn" in the task AST in Table 4.1. The 4 images in (a)/(b) are the results generated by Yosinski's [81]/Nguyen's [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where green/red color code means that the image is correctly/incorrectly labeled by the trained CNN.
8.3 The visualization of the multi-depth features. The visualized output node represents the class "Gothic architecture" in the task ARC in Table 4.1. The 4 images in (a)/(b) are the results generated by Yosinski's [81]/Nguyen's [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where green/red color code means that the image is correctly/incorrectly labeled by the trained CNN.
8.4 The visualization of the multi-scale features. The visualized output node represents the class "Russian Revival architecture" in the task ARC in Table 4.1. The 4 images in (a)/(b) are the results generated by Yosinski's [81]/Nguyen's [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where green/red color code means that the image is correctly/incorrectly labeled by the trained CNN.

CHAPTER 1
INTRODUCTION
During the past few years, computer vision scientists have been excited about the breakthrough progress in object classification. According to the article [69] published by the team hosting the ImageNet [11] visual recognition challenge, the error rate of object classification decreased from 16.42% in 2012 to 6.66% in 2014. Machines can now be taught to classify objects accurately. However, humans care about more than just objects. For instance, according to the user study conducted by Murray et al. [54], humans can distinguish images with high aesthetic values from images with low aesthetic values even when all these images contain the same type of objects. Another example showing that humans are interested in more than objects is image tagging, where humans use not only objects but also non-object attributes to tag or describe images, which can be easily verified on common image hosting websites such as Flickr [20]. Both examples, image aesthetic values and image tagging, show that abstract attributes matter. Adopting the definition from the Merriam-Webster dictionary [14], we define abstract attributes as the attributes relating to or involving general ideas or qualities rather than specific people, objects, or actions. To bridge the gap between humans and machines, one approach is to teach machines to solve the tasks involving abstract attributes. In this thesis, we use the term abstract tasks to refer to the tasks involving abstract attributes.
Various kinds of abstract tasks have been proposed in recent computer vision literature, including classification tasks (emotions [52, 73], artist and artistic styles [41], aesthetic qualities [50, 54], fashion styles [43], and architectural styles [77]) and regression tasks (memorability [7, 30, 31, 42, 44] and interestingness [22, 28]). Figure 1.1 shows some example images and their ground truth labels for these abstract tasks. We observe that the tasks involving abstract attributes are relatively subjective compared with those involving objects and that it is tricky to describe abstract attributes as objects. For these two reasons, most abstract tasks are weakly supervised without location information, and the ground truth of abstract tasks has lower consensus. In addition, publicly available databases for abstract tasks contain only a limited amount of data (typically a few hundred or a few thousand images), which is not at all comparable to the millions of images already available for standard object or scene classification tasks. We also find that most abstract tasks have higher intra-class variation compared with object classification tasks. The above properties make abstract tasks challenging, which motivates us to propose better methods to solve abstract tasks.

Figure 1.1: Example abstract tasks proposed in recent computer vision literature. For each abstract task, we show an example image and its ground truth label from the corresponding database proposed in the listed reference.

database(s)                    CNN-related works
ImageNet [11]                  [15, 26, 29, 46, 82]
PASCAL [19]                    [1, 29]
SUN [76]                       [1, 15, 26]
Caltech-101 [48]               [15, 29, 82]
Caltech-UCSD Birds 200 [74]    [15, 83]

Table 1.1: Several standard benchmark databases and the recent CNN-related works using them.

During our search for possible methods to solve abstract tasks, we find that convolutional neural networks (CNN) have achieved better performance than non-CNN-based approaches on several databases in recent studies [15, 26, 82, 83]. The trend of using CNN-based methods to solve computer vision problems started with Krizhevsky et al. [46], who used CNNs to achieve a breakthrough improvement in the ImageNet [11] visual recognition challenge. In addition, Donahue et al. [15] showed that the network proposed by Krizhevsky et al. [46] can serve as a feature extractor, which they use to outperform the state-of-the-art results on several generic tasks. Given these encouraging results [15, 46], more researchers started studying and applying CNN-based approaches to different standard benchmark databases. After summarizing the commonly used standard databases and the recent CNN-related works using them in Table 1.1, we find that the databases used in most CNN-related works are associated with classical object or scene classification tasks instead of abstract tasks.

In this thesis, we explore the possibility of using CNN to solve abstract tasks. We start our experiments with two emotion-related tasks, predicting emotion distributions and predicting emotion stimuli maps, and then extend them to eight different abstract tasks. After showing that CNN-based approaches can outperform the state-of-the-art results on the eight abstract tasks, we propose using multi-task, multi-depth, and multi-scale CNN features to further improve the performance.
The rest of this thesis is organized as follows. Chapter 2 presents our work on modeling and predicting evoked emotion distributions. Chapter 3 describes our work on predicting emotion stimuli maps, which is an example of extending weakly supervised abstract tasks to strongly supervised ones. Chapter 4, Chapter 5, and Chapter 6 elaborate on our proposed multi-task, multi-depth, and multi-scale CNN features, respectively. After showing supporting results of our proposed CNN features on abstract tasks, we describe a method using fully connected networks to fuse these features in Chapter 7. The thesis is concluded in Chapter 8 with a discussion of potential future work.
1.1 First Published Appearances of Described Contributions
Most contributions or their initial versions described in this thesis have ﬁrst appeared as various publications:
1. Chapter 2: Peng, Sadovnik, Gallagher, Chen [66]
2. Chapter 3: Peng, Sadovnik, Gallagher, Chen [67]
3. Chapter 4: Peng, Chen [62]
4. Chapter 5: Peng, Chen [60]
5. Chapter 6: Peng, Chen [61]
6. Chapter 7: Peng, Chen [60, 61, 62]
7. Chapter 8: Peng, Chen [60, 61, 62]
The following contributions have appeared as various publications: Peng, Karlsson, Chen, Zhang, Yu [63]; Peng, Yu, Zhang, Chen [68]; Peng, Chen [59]. However, they are beyond the scope of this dissertation, and therefore are not discussed here.
CHAPTER 2 MODELING AND PREDICTING EVOKED EMOTION DISTRIBUTIONS
Summary
This chapter explores a new aspect of photos and human emotions. We show through psychovisual studies that different people have different emotional reactions to the same image, which is a novel departure from previous work that only records and predicts a single dominant emotion for each image. Our studies also show that the same person may have multiple emotional reactions to one image. Predicting emotions in “distributions” instead of a single dominant emotion is important for many applications. In addition, we present a new database, Emotion6, containing distributions of emotions.
2.1 Introduction
Images are emotionally powerful. An image can evoke a strong emotion in the viewer. Further, the viewer’s emotion may sometimes be affected in a way that the photographer did not expect. For example, an image of a hot air balloon may evoke feelings of joy in some observers (who crave adventure), but fear in others (who have a fear of heights). We address the fact that people have different evoked emotions by collecting and predicting the distributions of emotional responses when an image is viewed by a large population. We also address the fact that the same person may have multiple emotions evoked by one image by allowing the subjects to record multiple emotional responses to one image. Our goal is to predict the evoked emotion that a population of observers has when viewing a particular image.
We make the following contributions:
1. We show that different people have different emotional reactions to an image and that the same person may have multiple emotional reactions to an image. Our proposed database, Emotion6, addresses both ﬁndings by modeling emotion distributions.
2. We use a convolutional neural network (CNN) to predict emotion distributions, rather than simply predicting a single dominant emotion, evoked by an image. Our predictor of emotion distributions for Emotion6 can serve future researchers as a better baseline than support vector regression (SVR) with the features from previous works [52, 72, 73]. We also predict emotions in the traditional setting of affective image classification, showing that CNN outperforms Wang’s method [73] on the Artphoto database [52].
2.2 Prior Work
In computer vision, image classification based on abstract concepts has recently received a great deal of attention. Aesthetic quality estimation [33] and affective image classification [52, 72, 73] are two typical examples. However, these two abstract concepts are fundamentally different because the evoked emotion of an image is not equivalent to aesthetic quality. For example, one may feel joyful after viewing either amateur or expert photos, and aesthetically ideal images may elicit either happy or sad emotions. Moreover, aesthetic quality is a one-dimensional attribute, whereas emotions are not [21].
In predicting the emotional responses evoked by an image, researchers conduct experiments on various types of images. Wang et al. [73] perform affective image classification on artistic photos or abstract paintings, whereas Solli and Lenz [72] use Internet images. Machajdik and Hanbury [52] use both artistic and realistic images in their experiment. Previous works ignore the facts that groups of people seldom agree on the evoked emotion [33] and that even a single person may have multiple emotions evoked by one image. According to the statistics of Emotion6, the image database we collect in Sec. 2.3, more than half of the subjects have emotion responses different from the dominant emotion. Our statistics also show that ∼22% of all the subjects’ responses select at least two emotion keywords to describe the subject’s evoked emotions. Both of these observations support our assertion that emotion should be represented as a distribution rather than as a single dominant emotion. Further, predicting emotions as a distribution rather than as a single dominant emotion is important for practical applications. For example, suppose a company has two candidate ads that arouse different emotion distributions (ad1: 60% joy and 40% surprise; ad2: 70% joy and 30% fear). Though ad2 elicits joy with higher probability than ad1 does, the company may choose ad1 instead of ad2 because ad2 arouses a negative emotion in part of the population.
In psychology, researchers have been interested in emotions for decades, leading to three major approaches: the “basic emotion”, “appraisal”, and “psychological constructionist” traditions [23]. With debate on these approaches, psychologists designed different kinds of models for explaining fundamental emotions based on various criteria. Ortony and Turner [56] summarized some past theories of basic emotions, and even now, there is no complete consensus among psychologists. One of the most popular frameworks in the emotion field, the valence–arousal (VA) model proposed by Russell [70], characterizes emotions in VA dimensions, where valence describes the emotion on a scale from positive to negative, while arousal indicates the degree of stimulation or excitement. We adopt the VA model as part of emotion prediction. In terms of emotion categories, we adopt Ekman’s six basic emotions [18], whose details are explained in Sec. 2.3.

| Issues of previous databases | Explanation and how Emotion6 solves the issues |
| --- | --- |
| Ad-hoc categories | Previous databases select emotion categories without psychological background, but Emotion6 uses Ekman’s 6 basic emotions [18] as categories. |
| Unbalanced categories | Previous databases have an unbalanced proportion of images from each category, but Emotion6 has balanced categories with 330 images per category. |
| Single category per image | Assigning each image to only one category (dominant emotion), previous databases ignore that the evoked emotion can vary between observers [33]. Emotion6 expresses the emotion associated with each image as a probability distribution. |

Table 2.1: The issues of previous emotion image databases and how our proposed database, Emotion6, solves these issues.

To recognize and classify different emotions, scientists build connections between emotions and various types of input data including text [21], speech [16, 45], facial expressions [16, 13], music [79], and gestures [16]. Among the research related to emotions, we are interested in emotions evoked by consumer photographs (not just artworks or abstract images as in [73]). Unfortunately, the number of related databases is relatively small compared to the other areas mentioned previously. These databases, such as IAPS [47], GAPED [10], and emodb [72], have a few clear limitations. We propose a new emotion database, Emotion6, on which this chapter is mainly based. Table 2.1 summarizes how Emotion6 solves the limitations of previous databases. Sec. 2.3 describes the details of Emotion6.

Figure 2.1: Example images of Emotion6 with the corresponding ground truth. The emotion keyword used to search each image is displayed on the top. The graph below each image shows the probability distribution of evoked emotions of that image. The bottom two numbers are valence–arousal (VA) scores on the SAM 9-point scale [6].

2.3 The Emotion6 Database
For each image in Emotion6, the following information is collected by a user study:
1. The ground truth of VA scores for evoked emotion.
2. The ground truth of emotion distribution for evoked emotion.
Consisting of 7 bins (Ekman’s 6 basic emotions [18] and neutral), each emotion distribution represents the probability that an image will be classified into each bin by a subject. For both VA scores, we adopt the self-assessment manikin (SAM) 9-point scale [6], which is also used in [47]. For the valence scores, 1, 5, and 9 mean very negative, neutral, and very positive emotions respectively. For the arousal scores, 1 (9) means the emotion has low (high) stimulating effect. Figure 2.1 shows some images from Emotion6 with the corresponding ground truth. The details about the selection of emotion model/categories/images and the user study are described in the following sections. The database Emotion6 is available online [64].
2.3.1 Emotion Model and Category Selection
Based on the list of different theories of basic emotions [56], we use Ekman’s six basic emotions [18] (anger, disgust, joy, fear, sadness, and surprise) as the categories of Emotion6. Each of these six emotions is adopted by at least three psychological theorists in [56], which provides a consensus for the importance of each of these six emotions. In addition to using emotion keywords as categories, we adopt the valence–arousal (VA) model because we want to be consistent with the previous databases where ground truth VA scores are provided.
2.3.2 Image Collection and User Study
We collect the images of Emotion6 from Flickr [20] by using the 6 category keywords and synonyms as search terms. High-level semantic content of an image, including strong facial expressions, posed humans, and text, influences the evoked emotion of an image. Instead of focusing on the text or facial expressions, we are more interested in how the low-level features of images such as colors and texture affect the evoked emotion. Therefore, we remove the images containing apparent human facial expressions or text directly related to the evoked emotion because these two types of content have been shown to have a strong relationship to the emotion [16, 21]. In contrast to the database emodb [72], which has no human moderation, we examine each image in Emotion6 to remove erroneous images. A total of 1980 images are collected, 330 for each category, comparable to previous databases. Each image is scaled to approximately VGA resolution while keeping the original aspect ratio.

Figure 2.2: Two screenshots of the interface of our user study on Amazon Mechanical Turk. Before the subject answers the questions (the right image), we provide the subject with the instructions and an example (the left image) explaining how to answer the questions.

We use Amazon Mechanical Turk (AMT) to collect emotional responses from subjects. For each image, each subject rates the evoked emotion in terms of VA scores, and chooses the keyword(s) best describing the evoked emotion. We provide 7 emotion keywords (Ekman’s 6 basic emotions [18] and neutral), and the subject can select multiple keywords for each image. Instead of directly asking the subject to give VA scores, we rephrase the questions to be similar to those of GAPED [10]. Figure 2.2 shows two screenshots of the interface. To compare with previous databases, we randomly extract a subset S_G containing 220 images from GAPED [10] such that the proportion of each category in S_G is the same as that of GAPED. We reject the responses from a few subjects who failed to demonstrate consistency or provided a constant score for all images.
Each human intelligence task (HIT) on AMT contains 10 images, and we offer 10 cents to reward the subject’s completion of each HIT. In the instructions, we inform the subject that the answers will be examined by an algorithm that detects lazy or fraudulent workers and that only workers who pass will be paid. In each HIT, the last image is from S_G, and the other 9 images are from Emotion6. We create 220 different HITs for AMT such that the following constraints are satisfied:
1. Each HIT contains at least one image from each of the 6 categories (by keyword).
2. Images are ordered in such a way that the frequency of an image from category i appearing after category j is equal for all i, j.
3. Each image or HIT cannot be rated more than once by the same subject, and each subject cannot rate more than 55 different HITs.
4. Each image is scored by 15 subjects.
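The counts in this HIT design can be cross-checked with a little arithmetic. The sketch below uses only numbers stated in this section, and it assumes (this is not stated explicitly in the text) that every image appears in exactly one HIT and that every HIT is completed by exactly 15 subjects. Under those assumptions, the resulting average number of images per subject agrees with the figure reported in the paragraph that follows.

```python
# Bookkeeping for the AMT study, using counts stated in the text.
# Assumption: each image appears in exactly one HIT, and each HIT is
# completed by exactly 15 subjects (before any responses are rejected).
IMAGES_PER_HIT = 10
EMOTION6_PER_HIT = 9
NUM_HITS = 220
SUBJECTS_PER_IMAGE = 15
NUM_SUBJECTS = 432

# Number of Emotion6 images covered by all HITs.
emotion6_images = NUM_HITS * EMOTION6_PER_HIT
# Number of GAPED (S_G) images covered by all HITs.
gaped_images = NUM_HITS * (IMAGES_PER_HIT - EMOTION6_PER_HIT)
# Total number of individual image ratings collected.
total_ratings = NUM_HITS * IMAGES_PER_HIT * SUBJECTS_PER_IMAGE
# Average number of images rated by one subject.
avg_images_per_subject = total_ratings / NUM_SUBJECTS

print(emotion6_images, gaped_images, total_ratings)
print(round(avg_images_per_subject, 1))
```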
The mean and standard deviation of the time spent on each HIT are 450 and 390 seconds, respectively. The minimum time spent on one HIT is 127 seconds, which is still reasonable. A total of 432 unique subjects took part in the experiment, rating 76.4 images on average. After collecting the answers from the subjects, we sort the VA scores and average the middle 9 scores (to remove outliers) to serve as ground truth. For the emotion category distribution, the ground truth of each category is the average vote of that category across subjects. To provide grounding for Emotion6, we compute the VA scores of the images from S_G using the above method and compare them with the ground truth provided by GAPED [10], where the original scale 0∼100 is converted linearly to 1∼9 to be consistent with our scale. The average absolute difference of the V (A) scores for these images is 1.006 (1.362) on the SAM 9-point scale [6], which is comparable in this highly subjective domain.
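As a minimal sketch of the two numeric steps described above (averaging the middle 9 of the sorted scores, and linearly mapping the GAPED 0∼100 scale onto 1∼9), the following Python functions may help. The function names and the example scores are hypothetical and are not taken from the original study.

```python
import numpy as np

def middle_mean(scores, keep=9):
    """Sort the scores and average the middle `keep` of them.

    With 15 subjects and keep=9, the 3 lowest and 3 highest scores
    are discarded as outliers before averaging.
    """
    s = np.sort(np.asarray(scores, dtype=float))
    drop = (len(s) - keep) // 2
    return float(s[drop:drop + keep].mean())

def gaped_to_sam(score):
    """Linearly convert a score on the 0-100 scale to the 1-9 scale."""
    return 1.0 + 8.0 * score / 100.0

# Hypothetical valence ratings of one image from 15 subjects.
ratings = [1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 9, 9, 9]
print(middle_mean(ratings))
print(gaped_to_sam(50))
```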
2.4 Predicting Emotion Distributions
Randomly splitting the Emotion6 database into training and testing sets with a 7:3 ratio, we propose three methods (SVR, CNN, and CNNR) and compare their performance with that of three baselines. The details of the three proposed methods are explained below.
2.4.1 Support Vector Regression (SVR)
Inspired by previous works on affective image classiﬁcation [52, 72, 73], we adopt features related to color, edge, texture, saliency, and shape to create a normalized 759-dimensional feature set shown in Table 2.2. To verify the affective classiﬁcation ability of this feature set, we perform the exact experiment from [52], using their database. The average true positive per class is ∼60% for each category, comparable to the results presented in [52].
We train one model for each emotion category using the ground truth of the category in Emotion6 with Support Vector Regression (SVR) provided in LIBSVM [8], with the parameters of SVR learned by performing 5-fold cross validation on the training set. In the predicting phase, the probabilities of all emotion categories are normalized such that they sum to 1. To assess the performance of SVR in emotion classification, we compare the emotion with the highest predicted probability against the dominant emotion of the ground truth. The accuracy of our model in this multi-class classification setting is 38.9%, which is about 2.7 times that of random guessing (14.3%).

| Feature Type | Dimension | Description |
| --- | --- | --- |
| Texture | 24 | Features from Gray-Level Co-occurrence Matrix (GLCM) including the mean, variance, energy, entropy, contrast, and inverse difference moment [52]. |
| Texture | 3 | Tamura features (coarseness, contrast and directionality) [52]. |
| Composition | 2 | Rule of thirds (distance between salient regions and power points/lines) [73]. |
| Composition | 1 | Diagonal dominance (distance between prominent lines and two diagonals) [73]. |
| Composition | 2 | Symmetry (sum of intensity differences between pixels symmetric with respect to the vertical/horizontal central line) [73]. |
| Composition | 3 | Visual balance (distances of the center of the most salient region from the center of the image, the vertical and horizontal central lines) [73]. |
| Saliency | 1 | Difference of areas of the most/least salient regions. |
| Saliency | 1 | Color difference of the most/least salient regions. |
| Saliency | 2 | Difference of the sum of edge magnitude of the most/least salient regions. |
| Color | 80 | Cascaded CIECAM02 color histograms (lightness, chroma, hue, brightness, and saturation) in the most/least salient regions. |
| Edge | 512 | Cascaded edge histograms (8 (8)-bin edge direction (magnitude) in RGB and gray channels) in the most/least salient regions. |
| Shape | 128 | Fit an ellipse for every segment from color segmentation and compute the histogram of fit ellipses in terms of angle (4 bins), the ratio of major and minor axes (4 bins), and area (4 bins) in the most/least salient regions. |

Table 2.2: The feature set we use for support vector regression (SVR) in predicting emotion distributions.

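The post-processing of the per-category regressors can be sketched as follows. This is an illustrative sketch rather than the original implementation: the text does not say how negative regressor outputs are handled, so clipping them to zero (and falling back to a uniform distribution when every output is zero) is an assumption, and the function names are hypothetical.

```python
import numpy as np

def normalize_predictions(raw, eps=1e-12):
    """Turn per-category regressor outputs into distributions summing to 1.

    raw: array of shape (num_images, num_categories).
    Assumption: negative outputs are clipped to 0, and an all-zero row
    falls back to the uniform distribution.
    """
    raw = np.clip(np.asarray(raw, dtype=float), 0.0, None)
    total = raw.sum(axis=1, keepdims=True)
    uniform = np.full_like(raw, 1.0 / raw.shape[1])
    return np.where(total > eps, raw / np.maximum(total, eps), uniform)

def dominant_accuracy(pred, truth):
    """Fraction of images whose top predicted category matches the
    dominant category of the ground truth distribution."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return float(np.mean(pred.argmax(axis=1) == truth.argmax(axis=1)))

raw = [[2.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
print(normalize_predictions(raw))
```

With 7 categories, random guessing of the dominant emotion succeeds with probability 1/7, which is the 14.3% chance level quoted above.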
2.4.2 Convolutional Neural Networks (CNN) and CNN for Regression (CNNR)
In CNN, we use the exact convolutional neural network in [46] except that the number of output nodes is changed to 7 to represent the probability of the input image being classiﬁed as each emotion category in Emotion6. In CNNR, we train a regressor for each emotion category in Emotion6 with the exact convolutional neural network in [46] except that the number of output nodes is changed to 1 to predict a real value and that the softmax loss layer is replaced with the Euclidean loss layer. In the predicting phase, the probabilities of all emotion categories are normalized to sum to 1. Using the Caffe implementation [32] and its default parameters for training the ImageNet [11] model, we pre-train with the Caffe reference model [32] and ﬁne-tune the convolutional neural network with our training set in both CNN and CNNR.
To show the efficacy of classification with the convolutional neural network, we use CNN to perform binary emotion classification with the Artphoto database [52] under the same experimental setting as Wang’s method [73]. In this experiment, we change the number of output nodes to 2 and train one binary classifier for each emotion under the 1-vs-all setting. We repeat the positive examples in the training set such that the number of positive examples is the same as that of the negative ones. Figure 2.3 shows that CNN outperforms Wang’s method [73] in 6 out of 8 emotion categories. In terms of the average true positive per class, averaged over all 8 emotion categories, CNN (64.724%) also outperforms Wang’s method [73] (63.163%).

Figure 2.3: Classification performance of CNN and Wang’s method [73] with the Artphoto database [52]. In 6 out of 8 emotion categories, CNN outperforms Wang’s method [73].

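The class balancing used for the 1-vs-all classifiers (repeating positive examples until they match the number of negatives) might look like the sketch below. The exact repetition scheme is not specified in the text, so cycling through the positives in order is an assumption, and the function name is hypothetical.

```python
import numpy as np

def balance_by_repeating_positives(labels):
    """Return training-set indices in which positive examples are
    repeated (cycled in order) until they match the negative count.

    labels: 1 for the target emotion, 0 for all other emotions.
    """
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    if len(pos) == 0 or len(pos) >= len(neg):
        return np.arange(len(labels))  # nothing to repeat
    repeated_pos = np.resize(pos, len(neg))  # cyclic repetition
    return np.concatenate([neg, repeated_pos])

# Hypothetical 1-vs-all labels for one emotion.
labels = np.array([1, 0, 0, 0, 0, 1, 0])
idx = balance_by_repeating_positives(labels)
print(idx, labels[idx].sum(), len(idx))
```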
The preceding experiment shows that CNN achieves state-of-the-art performance for emotion classification of images. However, what we are really interested in is the prediction of emotion distributions, which better capture the range of human responses to an image. For this task, we use CNNR as previously described, and show that its performance is state-of-the-art for emotion distribution prediction.
We compare the predictions of our proposed three methods with the following three baselines:
1. A uniform distribution across all emotion categories.
2. A random probability distribution.
3. Optimally dominant (OD) distribution, a winner-take-all strategy where the emotion category with highest probability in ground truth is set to 1, and other emotion categories have zero probability.
The ﬁrst two baselines represent chance guesses while the third represents a best case scenario for any (prior art) multi-class emotion classiﬁer that outputs a single emotion.
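The three baselines can be sketched in a few lines. How the random probability distribution is sampled is not specified in the text; drawing each bin uniformly from [0, 1) and normalizing is an assumption made here for illustration, and the function names are hypothetical.

```python
import numpy as np

def uniform_baseline(n_bins=7):
    """Equal probability for every emotion category."""
    return np.full(n_bins, 1.0 / n_bins)

def random_baseline(rng, n_bins=7):
    """Random distribution (assumption: i.i.d. uniform draws, normalized)."""
    p = rng.random(n_bins)
    return p / p.sum()

def optimally_dominant(truth):
    """Winner-take-all: probability 1 on the ground truth's top category."""
    truth = np.asarray(truth, dtype=float)
    od = np.zeros_like(truth)
    od[np.argmax(truth)] = 1.0
    return od

rng = np.random.default_rng(0)
truth = [0.1, 0.0, 0.2, 0.5, 0.0, 0.1, 0.1]
print(uniform_baseline(), random_baseline(rng), optimally_dominant(truth))
```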
We use four different distance metrics to evaluate the similarity between two emotion distributions: KL-divergence (KLD), Bhattacharyya coefficient (BC), Chebyshev distance (CD), and earth mover’s distance (EMD) [53, 71]. Since KLD is not well defined when a bin has value 0, we use a small value $\epsilon = 10^{-10}$ to approximate the values in such bins. In computing EMD, we assume that the ground distance between any two of the 7 dimensions (Ekman’s 6 basic emotions [18] and neutral) is the same. For KLD, CD and EMD, lower is better. For BC, higher is better.
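A minimal sketch of the four metrics follows. Several details are assumptions rather than facts from the text: the KLD is taken from the ground truth to the prediction with natural logarithms, zero bins are replaced by the small value without renormalizing, and the common ground distance for EMD is exposed as a hypothetical parameter `ground_distance`. When all pairs of distinct bins are equally far apart, EMD between two probability distributions reduces to the ground distance times half the L1 difference, which is what the last function computes. Because of these choices, the sketch should not be expected to reproduce the reported numbers exactly.

```python
import numpy as np

EPS = 1e-10  # small value substituted for zero-valued bins in KLD

def kld(truth, pred, eps=EPS):
    """KL-divergence KL(truth || pred); assumes natural logarithm."""
    p = np.asarray(truth, dtype=float)
    q = np.asarray(pred, dtype=float)
    p = np.where(p == 0, eps, p)
    q = np.where(q == 0, eps, q)
    return float(np.sum(p * np.log(p / q)))

def bc(p, q):
    """Bhattacharyya coefficient (higher means more similar)."""
    return float(np.sum(np.sqrt(np.asarray(p, float) * np.asarray(q, float))))

def cd(p, q):
    """Chebyshev distance: largest per-bin absolute difference."""
    return float(np.max(np.abs(np.asarray(p, float) - np.asarray(q, float))))

def emd_equidistant(p, q, ground_distance=1.0):
    """EMD when every pair of distinct bins has the same ground distance."""
    l1 = np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float)))
    return float(ground_distance * 0.5 * l1)

one_hot = np.array([1.0, 0, 0, 0, 0, 0, 0])
uniform = np.full(7, 1.0 / 7)
print(kld(uniform, one_hot), bc(one_hot, uniform),
      cd(one_hot, uniform), emd_equidistant(one_hot, uniform))
```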
For each distance metric $M$, we use $\overline{M}$ and $P_M$ to evaluate the ranking between two algorithms, where $\overline{M}$ is the mean of $M$, and $P_M$ in Table 2.3 (upper table) is the proportion of images where Method 1 matches the ground truth distribution more accurately than Method 2 according to distance metric $M$. Method 1 is superior to Method 2 when $P_M$ exceeds 0.5. For the random distribution baseline, we repeat 100000 times and report the average $P_M$. The results are in Table 2.3. CNNR outperforms SVR, CNN, and the three baselines in both $P_M$ and $\overline{M}$, and should be considered as a standard baseline for future emotion distribution research. Table 2.3 also shows that OD performs even worse than the uniform baseline. This shows that predicting only a single emotion category, as in [52, 72, 73], does not adequately model the fact that people have different emotional responses to the same image and that the same person may have multiple emotional responses to one image.

| Method 1 | Method 2 | $P_{KLD}$ | $P_{BC}$ | $P_{CD}$ | $P_{EMD}$ |
| --- | --- | --- | --- | --- | --- |
| CNNR | Uniform | 0.742 | 0.783 | 0.692 | 0.756 |
| CNNR | Random | 0.815 | 0.819 | 0.747 | 0.802 |
| CNNR | OD | 0.997 | 0.840 | 0.857 | 0.759 |
| CNNR | SVR | 0.625 | 0.660 | 0.571 | 0.620 |
| CNNR | CNN | 0.934 | 0.810 | 0.842 | 0.805 |
| Uniform | OD | 0.997 | 0.667 | 0.736 | 0.593 |

| Method | $\overline{KLD}$ | $\overline{BC}$ | $\overline{CD}$ | $\overline{EMD}$ |
| --- | --- | --- | --- | --- |
| Uniform | 0.697 | 0.762 | 0.348 | 0.667 |
| Random | 0.978 | 0.721 | 0.367 | 0.727 |
| OD | 10.500 | 0.692 | 0.510 | 0.722 |
| SVR | 0.577 | 0.820 | 0.294 | 0.560 |
| CNN | 2.338 | 0.692 | 0.497 | 0.773 |
| CNNR | 0.480 | 0.847 | 0.265 | 0.503 |

Table 2.3: The performance of different methods for predicting emotion distributions compared using $P_M$ and $\overline{M}$ ($M \in \{KLD, BC, CD, EMD\}$). The upper table shows $P_M$, the probability that Method 1 outperforms Method 2 with distance metric $M$. Each row in the upper table shows that Method 1 outperforms Method 2 in all $M$. The lower table lists $\overline{M}$, the mean of $M$, of each method, showing that CNNR achieves better $\overline{M}$ than the other methods listed here. CNNR performs the best out of all the listed methods in terms of all $P_M$s with better $\overline{M}$.
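The pairwise proportion used in the upper part of Table 2.3 can be sketched as follows. The inputs are per-image metric values of the two methods against the ground truth; counting ties as losses for Method 1 is an assumption, and the function name and example values are hypothetical.

```python
import numpy as np

def proportion_better(metric_method1, metric_method2, higher_is_better=False):
    """Proportion of images on which Method 1 is closer to the ground
    truth than Method 2 under one metric (ties count against Method 1).

    Use higher_is_better=True for BC and False for KLD, CD, and EMD.
    """
    m1 = np.asarray(metric_method1, dtype=float)
    m2 = np.asarray(metric_method2, dtype=float)
    wins = m1 > m2 if higher_is_better else m1 < m2
    return float(np.mean(wins))

# Hypothetical per-image distances of two methods on three test images.
print(proportion_better([0.1, 0.5, 0.3], [0.2, 0.4, 0.6]))
```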

Figure 2.4: The mapping between the color codes and emotion keywords used in Section 2.5.

| Method 1 | Method 2 | $P_V$ | $P_A$ |
| --- | --- | --- | --- |
| CNNR | Popularity | 0.631 | 0.577 |
| CNNR | Random | 0.729 | 0.818 |
| CNNR | SVR | 0.556 | 0.502 |

| Method | AAD of Valence | AAD of Arousal |
| --- | --- | --- |
| Popularity | 1.590 | 0.829 |
| Random | 2.423 | 2.113 |
| SVR | 1.347 | 0.734 |
| CNNR | 1.219 | 0.741 |

Table 2.4: The performance of different algorithms for predicting valence and arousal scores. The numbers are the average of absolute difference (AAD) compared with the ground truth in SAM 9-point scale [6]. CNNR outperforms the two baselines, and has comparable performance to SVR.

2.5 Predicting Valence–Arousal (VA) Scores

In this section, we present the statistics related to VA scores on the SAM 9-point scale [6] and predict VA scores. For the valence scores, 1, 5, and 9 mean very negative, neutral, and very positive emotions respectively. For the arousal scores, 1 (9) means the emotion has low (high) stimulating effect. Figure 2.5 places all the images of Emotion6 in the VA plane according to the ground truth of evoked VA scores. The boundary of each image is colored according to its dominant evoked emotion using the color codes in Figure 2.4.

Figure 2.5: The distribution of VA scores of the Emotion6 database. All the images in Emotion6 are placed in the VA plane according to their VA scores. The boundary of each image is colored to reflect the dominant evoked emotion according to the color codes in Figure 2.4.

We create predictors for VA scores separately using the same set of features and similar methods as those of predicting the emotion distributions in Sec. 2.4. We compare the results of SVR and CNNR with the following two baselines:
1. Guessing V (A) score as the mode of all V (A) scores.
2. Guessing VA scores uniformly.
We evaluate the results with the average of absolute difference (AAD) compared with the ground truth in SAM 9-point scale [6]. We also report PV (PA), the proportion of the images in the test set where Method 1 predicts more accurate V (A) than Method 2. For the baseline using uniform random guessing, we repeat 100000 times and report the average. The results are listed in Table 2.4. CNNR outperforms the two baselines, and has comparable performance with respect to SVR.
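The evaluation just described can be sketched as below. Two points are assumptions and not taken from the text: the ground truth scores are real-valued, so they are rounded before taking the mode, and the uniform guess is drawn from the continuous interval [1, 9]. All function names are hypothetical.

```python
import numpy as np

def aad(pred, truth):
    """Average of absolute difference between predictions and ground truth."""
    return float(np.mean(np.abs(np.asarray(pred, float) - np.asarray(truth, float))))

def mode_baseline(scores, n_test, decimals=0):
    """Predict the most frequent score (after rounding) for every test image."""
    vals, counts = np.unique(np.round(scores, decimals), return_counts=True)
    return np.full(n_test, vals[np.argmax(counts)])

def uniform_guess_baseline(rng, n_test, low=1.0, high=9.0):
    """Predict a score drawn uniformly at random from the 9-point range."""
    return rng.uniform(low, high, size=n_test)

def proportion_more_accurate(pred1, pred2, truth):
    """Proportion of test images where pred1 is closer to truth than pred2."""
    t = np.asarray(truth, float)
    e1 = np.abs(np.asarray(pred1, float) - t)
    e2 = np.abs(np.asarray(pred2, float) - t)
    return float(np.mean(e1 < e2))

rng = np.random.default_rng(0)
truth = np.array([5.0, 6.0, 3.0])
guess = uniform_guess_baseline(rng, len(truth))
print(aad(guess, truth), aad(mode_baseline([5, 5, 3, 7], len(truth)), truth))
```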
2.6 Conclusion
This chapter introduces the idea of representing the emotional responses of observers to an image as a distribution of emotions. We describe the methods for estimating the emotion distribution for an image. Further, our proposed emotion predictor, CNNR, outperforms other methods, including SVR with the features from the previous works and the optimally dominant (OD) emotion baseline, which is the upper bound for emotion predictors that predict a single emotion. Finally, we propose a novel image database, Emotion6, and provide ground truth of valence, arousal, and probability distributions of evoked emotions.
CHAPTER 3 PREDICTING EMOTION STIMULI MAPS
Summary
Which parts of an image evoke emotions in an observer? To answer this question, we introduce a novel problem in computer vision: predicting an Emotion Stimuli Map (ESM), which describes the pixel-wise contribution to evoked emotions. Using a new image database, EmotionROI, as a benchmark for predicting the ESM, we find that the regions selected by saliency and objectness detection do not correctly predict the image regions which evoke emotion. Although objects represent important regions for evoking emotion, parts of the background are important as well. Based on this fact, we propose two methods for predicting the ESM: one incorporates emotion similarity, while the other leverages fully convolutional networks. Both qualitative and quantitative experimental results confirm that our methods can predict the regions which evoke emotions better than both saliency and objectness detection.
3.1 Introduction
Images, when viewed, can cause a variety of emotional responses, depending on not only the arrangement of one or more objects in the image but also the emotional state or background of the viewer. For example, an image of bungee jumping can excite outdoors-loving people, but it can evoke fear in those afraid of heights. Even within the same image, different regions contribute to the viewer’s evoked emotion differently. Imagine we crop the yellow, green, and red rectangles (Figure 3.1 (c), (d), and (e)) from Figure 3.1 (a) and present them individually to a viewer without showing the viewer the full image context (a). The emotional response to (e) is more similar to the response to (a) than the response to either (c) or (d) is. We represent the varying degree of influence that regions of an image have on the emotional response of a viewer with an emotion stimuli map (ESM), shown in Figure 3.1 (b), where brighter areas represent higher influence. The ESM (b) is produced by averaging across selections from a user study, and matches the observation that (e) best captures the emotion-inducing regions of (a). In this chapter, we are interested in predicting the ESM.

Figure 3.1: An example showing that different regions in an image contribute to the viewer’s evoked emotion differently. (c), (d), and (e) are cut from the yellow, green, and red rectangles of (a) respectively. (b) shows the regions in (a) which affect the evoked emotion the most, as marked in the user study. (e) evokes emotions more similar to those evoked by (a) than (c) and (d) do because (e) contains not only the person jumping but also other emotion-related areas, which is consistent with (b), the ground truth of the emotion stimuli map of (a).

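To make the idea of averaging across user selections concrete, here is a minimal sketch that builds a map from rectangular selections. It is only an illustration under stated assumptions: each subject is assumed to contribute one axis-aligned rectangle given as (x0, y0, x1, y1) in pixel coordinates, and the function name is hypothetical. The actual procedure used for the EmotionROI database is described in Sec. 3.2.

```python
import numpy as np

def emotion_stimuli_map(height, width, rectangles):
    """Average the binary masks of the subjects' rectangular selections.

    rectangles: iterable of (x0, y0, x1, y1), with x1 and y1 exclusive.
    Returns an array in [0, 1]; a value of 1 means that every subject
    selected that pixel.
    """
    rectangles = list(rectangles)
    esm = np.zeros((height, width), dtype=float)
    for x0, y0, x1, y1 in rectangles:
        esm[y0:y1, x0:x1] += 1.0
    return esm / max(len(rectangles), 1)

# Two hypothetical selections on a 4x4 image.
esm = emotion_stimuli_map(4, 4, [(0, 0, 2, 2), (1, 1, 3, 3)])
print(esm)
```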
Recently, emotion-related topics have gained increasing attention in computer vision, especially affective image classification. Machajdik and Hanbury [52] perform affective image classification on both artistic and realistic images. Solli and Lenz [72] use Internet images in their experiment, whereas Wang et al. [73] focus on affective image classification of artistic photos or abstract paintings. Our prior work [66] also predicts and transfers emotion distributions using Internet images. In addition, there are related works studying emotions from animated GIFs [34] and multilingual perspectives [35]. Even though different forms of multimedia have been explored, none of the previous works analyzes the influence of various regions in an image on emotion, and there is no benchmark for evaluating the ESM. We use the images collected in the Emotion6 database [66] to build a benchmark database, EmotionROI, for predicting the ESM. The ground truth of the ESM provided in the EmotionROI database is generated based on the answers marked by users in a user study. The details of the EmotionROI database are explained in Sec. 3.2.

Figure 3.2: An example showing the difference between saliency, objectness detection and the emotion stimuli map. (b) is the ground truth emotion stimuli map using (a) as the input image. (c) and (d) correspond to saliency [9] and objectness [2] detection, respectively. Neither (c) nor (d) perfectly captures (b), where two thirds of the subjects indicate that the area affecting evoked emotions includes not only the flower but also other emotion-related areas.

Saliency detection [9, 25, 51] and objectness measurement [2] are two popular topics closely related to predicting the ESM. While saliency and objectness detection tend to find salient objects in an image, the ESM captures the regions affecting the evoked emotion, and those regions may contain not only the salient objects but also other emotion-related areas. For example, Figure 3.2 (c) and (d) are the results of saliency [9] and objectness [2] detection, respectively, with Figure 3.2 (a) as the input. Figure 3.2 (c) focuses on the dark salient areas, whereas Figure 3.2 (d) emphasizes the withered flower. Neither Figure 3.2 (c) nor (d) perfectly captures the ground truth ESM in Figure 3.2 (b), where two thirds of the subjects indicate that the area affecting evoked emotions the most includes not only the flower but also other emotion-related areas. In this chapter, we propose two methods for predicting the ESM whose results are closer to the ground truth than those of the state-of-the-art algorithms for saliency [9] and objectness [2] detection.
Previous work related to saliency detection [36] often uses eye-tracking equipment to gather ground truth and perform validation. However, when building the ground truth ESM in the EmotionROI database, we choose not to use eye-tracking equipment for the following two reasons:
1. Saliency detection is different from predicting the ESM in terms of the task deﬁnition, and we also show their difference in Figure 3.2, where (b) and (c) are not even similar.
2. Where humans look in an image may implicitly reveal part of the areas which affect the evoked emotion the most. However, we believe that directly asking the subjects to mark the emotion-related areas is a more straightforward and efficient method which avoids potential errors caused by inferring those areas from the eye-tracking results.
To the best of our knowledge, this is the ﬁrst work in computer vision addressing the problem of predicting the ESM. We make the following contributions:

1. We build a benchmark database, EmotionROI, for predicting the ESM by performing a user study and collecting the ground truth ESMs of the images provided in the Emotion6 database [66]. The EmotionROI database is available online [65].
2. We propose two methods in Sec. 3.3 for predicting the ESM. One leverages emotion similarity, while the other one learns features with fully convolutional networks. Both our methods predict more accurate ESMs than do the state-of-the-art algorithms of saliency [9] and objectness [2] detection.
3.2 The EmotionROI Database and User Study
We use the images in the Emotion6 database [66] to build the EmotionROI database, our proposed benchmark database for predicting the ESM. The EmotionROI database provides the ESMs produced by integrating rectangular areas from the image identiﬁed by subjects as inﬂuential to the evoked emotion. In the following paragraphs, we introduce the Emotion6 [66] database ﬁrst and explain the details of building the EmotionROI database.
The Emotion6 [66] database consists of 6 emotion categories with 330 images per category. For each image, the following information is provided:
1. The ground truth of evoked emotion distribution in terms of emotion keywords.
2. The emotion keyword used to search each image.
The Emotion6 [66] database is assembled from Flickr [20] by entering the 6 category keywords corresponding to Ekman's 6 basic emotions [18] (anger,

Figure 3.3: A screenshot of the interface of our user study on the Amazon Mechanical Turk. We ask the subject to draw a rectangle enclosing the part of the image that most inﬂuences the evoked emotion.
disgust, joy, fear, sadness, and surprise) and their synonyms as the searching keywords, followed by a step of human moderation to remove erroneous images. The Emotion6 [66] database contains 1980 images, 330 per category, comparable to previous databases [10, 47]. Each image is approximately VGA resolution.
Adopting all the 1980 images in the Emotion6 [66] database, we use the Amazon Mechanical Turk (AMT) to collect responses from subjects, building the ground truth ESMs in the EmotionROI database. We ask the subject to draw a rectangle enclosing the part of the image that most influences the evoked emotion. Figure 3.3 is a screenshot of the interface. We collect the responses in a manner similar to that used for the Emotion6 [66] database. We consider the emotion categories provided by the Emotion6 [66] database and create 220 different human intelligence tasks (HITs) (each HIT contains 10 images) for AMT that meet the following constraints:
1. Each HIT contains at least one image from each of the 6 categories.

Figure 3.4: Some example images from the EmotionROI database with the corresponding ground truth emotion stimuli maps. The emotion keyword used to search each image (provided by the Emotion6 [66] database) is displayed under the image.
2. Images are ordered in such a way that the frequency of an image from category i appearing after category j is equal for all i, j.
We also enforce the following regulations to be consistent with the previous database [10]:
1. The same subject can only respond to each image or HIT at most once, and each subject cannot respond to more than 55 different HITs to increase diversity.
2. We collect 15 responses for each image to have statistically signiﬁcant results.
432 unique subjects participate in the experiment, responding to an average of 76.4 images each. For the ESM, we assume the influence of each pixel on evoked emotions is proportional to the number of drawn rectangles covering that pixel. The ground truth ESMs are normalized to the range between 0 and 1. Figure 3.4 shows some example images in the EmotionROI database and the

corresponding ground truth ESMs. Figure 3.4 also shows the emotion keyword used to search each image (provided by the Emotion6 [66] database).
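The construction of the ground truth ESM described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name, the (x0, y0, x1, y1) rectangle convention with exclusive upper bounds, and the choice of dividing by the maximum vote count to reach the range [0, 1] are ours and are not specified in the text.

```python
import numpy as np


def esm_from_rectangles(height, width, rectangles):
    """Accumulate per-pixel rectangle votes and scale the map to [0, 1].

    rectangles: iterable of (x0, y0, x1, y1) pixel bounds, with x1 and y1
    exclusive (an assumed convention, not taken from the source).
    """
    votes = np.zeros((height, width), dtype=np.float64)
    for x0, y0, x1, y1 in rectangles:
        # Each subject's rectangle adds one vote to every pixel it covers.
        votes[y0:y1, x0:x1] += 1.0
    peak = votes.max()
    # Assumed normalization: divide by the largest vote count.
    return votes / peak if peak > 0 else votes


# Two overlapping rectangles on a 3x3 image: the shared pixel gets value 1.
example = esm_from_rectangles(3, 3, [(0, 0, 2, 2), (1, 1, 3, 3)])
```

With 15 responses per image, the same accumulation would simply run over 15 rectangles instead of two.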
3.3 Proposed Methods
We propose two methods for predicting the ESM — labeling from emotion similarity (LES) and fully convolutional networks with Euclidean loss (FCNEL). LES leverages objectness detection [2] and emotion prediction [66] from prior works without directly learning from the EmotionROI database, while FCNEL directly learns from the ground truth of the training data of the EmotionROI database. We explain LES and FCNEL in Sec. 3.3.1 and Sec. 3.3.2 respectively.
3.3.1 Labeling from Emotion Similarity (LES)
Because the regions affecting the evoked emotion may contain not only the main objects but other emotion-related areas, we want to leverage the results of both objectness detection and emotion distribution prediction in our framework of predicting the ESM. For objectness detection, we adopt Alexe’s method [2] which takes an image as input and outputs a set of bounding boxes B representing the most probable locations for an object. For each bounding box bi ∈ B, the objectness score si representing the probability of bi covering an object is also available from Alexe’s method [2]. A pixel-wise objectness map describing the probability of each pixel belonging to some object can be approximated by assigning each pixel the sum of the objectness scores of the bounding boxes covering that pixel and normalizing the map.

For emotion distribution prediction, we use the method and the Emotion6 database introduced in our prior work [66]. We adopt the same feature set as that used in [66], which includes the features related to color, edge, texture, saliency, and shape from the previous works studying affective image classification [52, 72, 73]. We train one emotion predictor for each of the 6 emotion categories in the Emotion6 [66] database using the ground truth of that category provided in the Emotion6 [66] database with support vector regression (SVR) [8]. The parameters of SVR are learned by performing 5-fold cross validation. In the testing phase, the probabilities of all emotion categories are normalized such that they sum to 1. Following the above steps, we build our final emotion predictor, which takes an image as input and outputs a 7-D emotion distribution vector.

Since we observe that most subjects tend to select regions enclosing some object, we design our algorithm to predict the ESM based on the bounding boxes given by objectness detection. We define a weighting parameter $w_i \in [0, 1]$ associated with each bounding box $b_i \in B$ describing the level of influence of $b_i$ on the evoked emotion. Given $w_i$ for each $b_i$, the ESM can be approximated by assigning each pixel the sum of the weighting parameters $w_i$ of the bounding boxes covering that pixel and normalizing the map. We formulate the problem of estimating $w_i$ for each $b_i$ as a multi-labeling problem. Assuming there are $M$ discrete levels for each $w_i \in L = \{l_1, l_2, \cdots, l_M\}$, where $l_j = \frac{j-1}{M-1}$ for $j \in \{1, 2, \cdots, M\}$, our goal is to find a labeling $l \in L^{|B|}$ such that the following energy function is minimized:

$$E(l) = \sum_{b_i \in B} \psi_i(w_i) + \lambda \sum_{b_i \in B} \sum_{\substack{b_j \in B \\ b_j \neq b_i}} \psi_{ij}(w_i, w_j), \qquad (3.1)$$

where $\psi_i$ and $\psi_{ij}$ are the data term and smoothness term respectively, and $\lambda$ is the weighting of $\psi_{ij}$. This energy function captures the idea that the ESM should be a map with high values in emotionally involved regions, including the main object(s) and partial background contributing to the evoked emotion. The details of $\psi_i$ and $\psi_{ij}$ are presented in the following paragraphs.

The data term $\psi_i$ is defined as:

$$\psi_i(w_i) = A(b_i) \cdot s_i \cdot \frac{|w_i - s_i|}{M - 1}, \qquad (3.2)$$

where $A(b_i)$ is the area of $b_i$ normalized by the area of the entire image. We design the data term such that the final estimation of $w_i$ will not be too far away from $s_i$ since the subjects tend to select rectangles enclosing main objects. Those $b_i$ with larger $A(b_i)$ and $s_i$ will be weighted more in $E(l)$ such that both the coverage and objectness scores are respected.

The smoothness term $\psi_{ij}$ is defined as:

$$\psi_{ij}(w_i, w_j) = A(b_i \cap b_j) \cdot BC(e_i, e_j) \cdot \frac{|w_i - w_j|}{M - 1}, \qquad (3.3)$$

where $A(b_i \cap b_j)$ is the normalized area of the intersection of $b_i$ and $b_j$, $e_i$ and $e_j$ are the evoked emotion distributions predicted by our emotion predictor using $b_i$ and $b_j$ as input respectively, and $BC(e_i, e_j)$ is the Bhattacharyya coefficient between $e_i$ and $e_j$. $BC(\cdot, \cdot)$ returns a value between 0 and 1 representing the similarity between two input probability distributions. The higher the Bhattacharyya coefficient is, the more similar the two input probability distributions are. Both the idea of modeling emotions as distributions and the choice of using the evaluation metric BC are inspired by [66]. The design of $\psi_{ij}$ encourages bounding boxes with larger overlapping area or more similar evoked emotion distributions to take similar labels.

Initializing $w_i$ with the label closest to $s_i$, we minimize $E(l)$ by graph cut with the α-expansion algorithm [5]. In our experiment, we set $M = 10$ to reach a balance between the diversity of the label set and computational efficiency, and $\lambda$ is empirically set to 0.001.
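To make the energy in Eq. (3.1) to (3.3) concrete, the following is a minimal sketch that only evaluates E(l) for a given set of weights and picks the initial label closest to each objectness score. It does not perform the graph cut minimization with α-expansion. All function names, the (x0, y0, x1, y1) box convention, and the interpretation of the double sum as running over all ordered pairs with $b_j \neq b_i$ are assumptions made for illustration.

```python
import numpy as np


def bhattacharyya_coefficient(p, q):
    """BC between two discrete probability distributions (value in [0, 1])."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    return float(np.sum(np.sqrt(p * q)))


def box_area(box):
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def intersection_area(a, b):
    return box_area((max(a[0], b[0]), max(a[1], b[1]),
                     min(a[2], b[2]), min(a[3], b[3])))


def closest_level(score, num_levels):
    """Initial label: the level l_j = (j - 1) / (M - 1) closest to the score."""
    levels = np.linspace(0.0, 1.0, num_levels)
    return float(levels[np.argmin(np.abs(levels - score))])


def les_energy(boxes, scores, emotions, weights, image_area, num_levels, lam):
    """Evaluate E(l): data term (Eq. 3.2) plus lam times smoothness (Eq. 3.3)."""
    denom = num_levels - 1
    data = sum(box_area(b) / image_area * s * abs(w - s) / denom
               for b, s, w in zip(boxes, scores, weights))
    smooth = 0.0
    for i in range(len(boxes)):
        for j in range(len(boxes)):
            if i == j:
                continue  # assumed: the inner sum skips b_j equal to b_i
            overlap = intersection_area(boxes[i], boxes[j]) / image_area
            similarity = bhattacharyya_coefficient(emotions[i], emotions[j])
            smooth += overlap * similarity * abs(weights[i] - weights[j]) / denom
    return data + lam * smooth
```

In this sketch, a labeling whose weights equal the objectness scores has zero data cost, so any remaining energy comes from overlapping boxes with similar predicted emotions but different weights.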
3.3.2 FCN with Euclidean Loss (FCNEL)
Fully Convolutional Networks (FCN) have been shown to achieve state-of-the-art performance in semantic segmentation since Long et al. [49] popularized this approach. We leverage FCN as another method for predicting the ESM because FCN provides an end-to-end training framework which generates pixel-wise dense prediction of the same resolution as the input image. Specifically, we adopt the single-stream, 32-pixel-prediction-stride version of the FCN in Long's work [49], based on the AlexNet [46] architecture. We choose this standard and relatively simple architecture over deeper or more complicated networks because our EmotionROI database is relatively small, and we therefore want to keep the number of parameters to be trained manageable.
In Long's work [49], the softmax loss layer is used as the objective function in the FCN for semantic segmentation, where any two different semantic labels are mutually exclusive. However, in predicting the ESM, we want to predict the influence on evoked emotions at each pixel location, not one out of many mutually exclusive class labels. Therefore, we change the topmost fully connected layer of the FCN such that only one output representing the influence on evoked emotions is predicted at each pixel location. We also change the softmax loss layer to a Euclidean loss layer such that the modified FCN can be trained to predict the ESM close to the corresponding ground truth in terms of the L2-norm. To distinguish the FCN using Euclidean loss from the common FCN used in semantic segmentation, we use FCNEL to refer to the former method.
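The change of objective can be illustrated with a small NumPy sketch of a Euclidean loss over single-channel maps. This is not the Caffe layer itself; the function names are ours, and the 1/(2N) scaling over a batch of N maps is an assumption chosen to follow the convention of Caffe's Euclidean loss layer.

```python
import numpy as np


def euclidean_loss(pred, target):
    """Sum of squared pixel differences, scaled by 1 / (2N) for N maps.

    pred, target: arrays of shape (N, H, W) holding one value per pixel.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    n = pred.shape[0]
    return float(np.sum((pred - target) ** 2) / (2.0 * n))


def euclidean_loss_grad(pred, target):
    """Gradient of the loss above with respect to the prediction."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return (pred - target) / pred.shape[0]
```

Unlike a softmax loss, which scores one of several mutually exclusive labels per pixel, this objective penalizes the distance between a single real-valued prediction and the ground truth value at each pixel.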
In our experiment, we train the FCNEL for predicting the ESM by using the Caffe [32] framework. We pre-train our FCNEL with the reference model, FCN-AlexNet, which is trained for the PASCAL VOC segmentation task [19] and provided by Long et al. [49]. After pre-training, we fine-tune all the parameters of FCNEL with the training data of the EmotionROI database. To train FCNEL efficiently while avoiding convergence issues of the learned parameters, we empirically set the base learning rate to $10^{-8}$. The number of training iterations is set such that each training example is visited at least 20 times. For other training details, we adopt the same setting provided by Long et al. [49] unless otherwise specified.
3.4 Experimental Setting
We experiment on our proposed EmotionROI database, and we use the same training/testing split as that used in our prior work [66] unless otherwise speciﬁed. Therefore, there are 1386/594 training/testing images out of all the 1980 images in the EmotionROI database.
3.4.1 Evaluation Metrics
We use 8 evaluation metrics, including mean absolute error (MAE), precision, recall, 4 commonly used F-measures ($F_{0.5}$, $F_{\sqrt{0.3}}$, $F_1$, and $F_2$ scores), and the precision-recall (PR) curve. All the predicted ESMs are normalized to the range [0, 1] before evaluation. MAE corresponds to the mean absolute error between the value of the predicted map and the ground truth at all pixel locations. Precision is defined as the ratio of emotionally involved pixels correctly assigned to all the pixels identified in the predicted map, while recall represents the percentage of detected emotionally involved pixels out of all the pixels marked in the ground truth. Before computing precision and recall, we binarize each predicted map adaptively according to its Otsu threshold [57]. F-measure is defined in terms of precision and recall as follows:

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}, \qquad (3.4)$$

where $\beta$ controls the weighting between precision and recall. In addition to the 3 common F-measures ($F_{0.5}$, $F_1$, and $F_2$), we also include $F_{\sqrt{0.3}}$ because it is a standard metric for saliency detection [78]. For the PR curve, we binarize the predicted map using each threshold in [0, 255]/255, which is similar to the method used in [78].
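The scalar metrics above can be sketched as follows. This is an illustrative sketch only: the function names are ours, the binarization threshold is passed in as a plain parameter rather than computed by Otsu's method [57], the comparison uses "greater than or equal to", and the ground truth is assumed to be available as a binary mask, none of which is specified in the text.

```python
import numpy as np


def mean_absolute_error(pred, gt):
    """MAE between a predicted map and the ground truth over all pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return float(np.mean(np.abs(pred - gt)))


def precision_recall(pred, gt_mask, threshold):
    """Precision and recall of the binarized prediction against a binary mask."""
    pred_mask = np.asarray(pred, dtype=np.float64) >= threshold
    gt_mask = np.asarray(gt_mask, dtype=bool)
    true_pos = np.logical_and(pred_mask, gt_mask).sum()
    n_pred = pred_mask.sum()
    n_gt = gt_mask.sum()
    precision = true_pos / n_pred if n_pred > 0 else 0.0
    recall = true_pos / n_gt if n_gt > 0 else 0.0
    return float(precision), float(recall)


def f_measure(precision, recall, beta):
    """F-measure of Eq. (3.4); returns 0 when the denominator vanishes."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1.0 + b2) * precision * recall / denom if denom > 0 else 0.0


# The four beta values used in this chapter.
BETAS = (0.5, np.sqrt(0.3), 1.0, 2.0)
```

Sweeping `threshold` over k/255 for k from 0 to 255 and collecting the resulting (precision, recall) pairs would give the points of a PR curve.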

3.4.2 Baselines — Saliency and Objectness Detection
Applying our two proposed methods to all the testing images of the EmotionROI database to predict the ESMs, we compare the results with those of the state-of-the-art method of saliency [9] and objectness [2] detection. Since the saliency detection method proposed by Cheng et al. [9] outputs pixelwise saliency maps instead of the maps consisting of bounding boxes, we post-process the results of the saliency map for fair comparison by drawing a bounding box covering the middle p percent of salient pixels in both width and

Figure 3.5: The performance of predicting emotion stimuli maps of the 594 testing images of the EmotionROI database in PR curve, average precision and recall, and 4 F-measures. These ﬁgures show that our two proposed methods (LES and FCNEL) outperform the two baselines (saliency [9] and objectness [2] detection) in PR curve as well as 6 statistics computed from the average precision and recall.
height. The processed saliency map will include one bounding box ﬁlled with 1 and other areas ﬁlled with 0. Testing p = 50 + 5i for i = {0, 1, · · · , 9}, we ﬁnd that the resulting MAEs are similar, so we only report the best result among them which uses p = 80. We also compute the MAE of context-aware saliency detection [25], and the results are similar to those of Cheng’s method [9]. Therefore, we only report the results using Cheng’s method [9] for saliency detection.
3.5 Experimental Results
We evaluate the predicted ESMs of the 594 testing images of the EmotionROI database using the 8 evaluation metrics mentioned in Sec. 3.4.1. The results are summarized in Figure 3.5 and Table 3.1. Figure 3.5 shows the PR curve, average precision and recall, and 4 F-measures. For saliency detection [9], we only show

| Metric | Saliency [9] | Objectness [2] | LES | FCNEL |
|---|---|---|---|---|
| MAE | 0.328 / 0.303 | 0.197 | 0.186 | **0.132** |
| precision | 0.585 / 0.666 | 0.668 | 0.668 | **0.669** |
| recall | 0.340 / 0.536 | 0.510 | 0.600 | **0.718** |
| $F_{0.5}$ | 0.455 / 0.600 | 0.604 | 0.642 | **0.672** |
| $F_{\sqrt{0.3}}$ | 0.444 / 0.594 | 0.597 | 0.639 | **0.673** |
| $F_1$ | 0.383 / 0.556 | 0.551 | 0.617 | **0.683** |
| $F_2$ | 0.349 / 0.538 | 0.522 | 0.603 | **0.701** |

| Method 1 | Method 2 | $P_{\text{MAE}}$ | $P_{\text{precision}}$ | $P_{\text{recall}}$ |
|---|---|---|---|---|
| LES | Saliency [9] | 0.973 / 0.944 | 0.723 / 0.467 | 0.873 / 0.635 |
| LES | Objectness [2] | 0.721 | 0.527 | 0.840 |
| FCNEL | Saliency [9] | 0.983 / 0.990 | 0.733 / 0.512 | 0.949 / 0.777 |
| FCNEL | Objectness [2] | 0.847 | 0.503 | 0.892 |
| FCNEL | LES | 0.786 | 0.492 | 0.850 |

| Method 1 | Method 2 | $P_{F_{0.5}}$ | $P_{F_{\sqrt{0.3}}}$ | $P_{F_1}$ | $P_{F_2}$ |
|---|---|---|---|---|---|
| LES | Saliency [9] | 0.937 / 0.651 | 0.938 / 0.656 | 0.935 / 0.656 | 0.902 / 0.639 |
| LES | Objectness [2] | 0.811 | 0.823 | 0.877 | 0.857 |
| FCNEL | Saliency [9] | 0.981 / 0.803 | 0.988 / 0.824 | 0.986 / 0.860 | 0.969 / 0.807 |
| FCNEL | Objectness [2] | 0.763 | 0.810 | 0.919 | 0.916 |
| FCNEL | LES | 0.677 | 0.707 | 0.862 | 0.870 |

Table 3.1: The performance of predicting emotion stimuli maps of the 594 testing images of the EmotionROI database in MAE, precision, recall, and 4 F-measures. The performance is shown by $P_M$ and $\bar{M}$ ($M \in \{\text{MAE}, \text{precision}, \text{recall}, F_{0.5}, F_{\sqrt{0.3}}, F_1, F_2\}$). The top table lists $\bar{M}$, the mean of metric $M$, of each method, showing that LES and FCNEL achieve better $\bar{M}$ than the two baselines. For each metric $M$, the best $\bar{M}$ is marked in bold. The bottom two tables show $P_M$, the probability that Method 1 outperforms Method 2 with metric $M$. Each row in the bottom two tables shows that Method 1 outperforms Method 2 in most $M$. For saliency detection [9], both the results before / after “block fitting” are reported. FCNEL performs the best out of all the listed methods in terms of most $P_M$ with better $\bar{M}$.

the results before “block fitting” in the PR curve because after “block fitting,” the ESMs will be binary with two extreme values by construction. Therefore, there will be no more than 3 points in the PR curve for saliency detection [9] after “block fitting.” However, we do find that those points are all below the two PR curves representing LES and FCNEL.
We also perform pairwise comparison between our two proposed methods and the two baselines in Table 3.1, where for each metric $M$, $\bar{M}$ and $P_M$ are used to compare different methods. $\bar{M}$ is the mean of $M$, and $P_M$ in Table 3.1 represents the proportion of the testing images of the EmotionROI database where Method 1 predicts more accurate ESMs than Method 2 does according to metric $M$. Method 1 is superior to Method 2 when $P_M$ exceeds 0.5. The best $\bar{M}$ for each $M$ is marked in bold in Table 3.1. For saliency detection [9], we report both the results before / after “block fitting” in Table 3.1, where “block fitting” improves the performance.
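The pairwise statistic can be sketched in a few lines. The function name is ours, and two details are assumptions rather than statements from the text: ties are counted as "Method 1 does not win", and the direction of "more accurate" is passed explicitly because a lower MAE is better while a higher precision, recall, or F-measure is better.

```python
import numpy as np


def pairwise_win_rate(scores_1, scores_2, higher_is_better=True):
    """Proportion of test images on which Method 1 beats Method 2.

    scores_1, scores_2: per-image values of the same metric for each method.
    Ties count as a loss for Method 1 (an assumed convention).
    """
    s1 = np.asarray(scores_1, dtype=np.float64)
    s2 = np.asarray(scores_2, dtype=np.float64)
    wins = s1 > s2 if higher_is_better else s1 < s2
    return float(np.mean(wins))


# Hypothetical per-image MAE values for two methods on four images.
p_mae = pairwise_win_rate([0.1, 0.3, 0.2, 0.4], [0.2, 0.2, 0.3, 0.5],
                          higher_is_better=False)
```

A value above 0.5 then corresponds to Method 1 being superior to Method 2 under the chosen metric.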
Both Figure 3.5 and Table 3.1 show that LES and FCNEL outperform saliency [9] and objectness [2] detection under most evaluation metrics and that FCNEL performs the best among all the listed methods. The only exception is the results of precision, where saliency [9] after “block fitting,” objectness [2], LES, and FCNEL show comparable performance. Since the most salient object usually has a relatively high value on the ESM, it makes sense that both saliency [9] and objectness [2] detection achieve reasonable precision. However, what affects evoked emotions is not only salient objects but also other emotionally involved areas, as shown in the ground truth of the EmotionROI database, Figure 3.1, and Figure 3.2. LES and FCNEL show better ability in identifying those emotionally involved areas compared with both saliency [9] and objectness [2] detection, which is reflected in the metrics involving recall. In addition to Figure 3.5 and Table 3.1, we also analyze the performance with respect to the emotion keywords used to search the testing images of the EmotionROI database (given by the Emotion6 [66] database). Specifically, we evaluate the performance of each method on the testing images searched using each emotion keyword. We find that LES and FCNEL outperform saliency [9] and objectness [2] detection and that FCNEL performs the best out of all the methods under survey, regardless of the emotion category.
Figure 3.6 shows the qualitative and quantitative results of predicting the ESMs with some testing images of the EmotionROI database as input. Columns (a) to (g) are the input image, the ground truth ESM, the pre-processed result of saliency detection [9], the post-processed result of saliency detection [9], the result of objectness detection [2], the result of LES, and the result of FCNEL respectively. For column (a), the emotion keyword under each image is the keyword used to search that image according to the information provided in the Emotion6 [66] database. For columns (b) to (g), the corresponding MAE is shown under each image.
Initializing the ESM with column (e), LES adjusts the weighting of each bounding box according to the emotion similarity in the energy minimization such that the ﬁnal result in column (f) gets closer to the ground truth in column (b). Directly learning from the ground truth of the training images of the EmotionROI database, FCNEL predicts even more accurate ESMs than LES does, as shown in column (g). Compared with saliency [9] and objectness [2] detection, our results show that the introduction of the cues from emotion similarity and the features learned from the training images of the EmotionROI


Figure 3.6: The qualitative and quantitative results of predicting emotion stimuli maps with some testing images of the EmotionROI database as input. The representation of each column is as follows: (a) input image, (b) the ground truth emotion stimuli map, (c) the pre-processed result of saliency detection [9], (d) the post-processed result of saliency detection [9], (e) the result of objectness detection [2], (f) the result of LES, and (g) the result of FCNEL. The emotion keyword used to search each input image is shown under each image in column (a) according to the information provided in the Emotion6 [66] database. For column (b) to (g), the corresponding MAE is shown under each image. Our two proposed methods (column (f) and (g)) predict more accurate emotion stimuli maps than other baselines (column (c) to (e)) do for these examples.


database improve the results of predicting the ESM.
3.6 Conclusion
We identify a novel problem in computer vision, predicting the emotion stimuli map (ESM). Building a new image database, EmotionROI, as a benchmark for predicting the ESM, we highlight the major difference between the ESM and saliency or objectness detection: the regions affecting evoked emotions contain both the main objects and additional contextual background necessary for the viewer to fully experience the emotion of the image.
Based on the above ﬁnding, we propose two methods, LES and FCNEL, for predicting the ESM. LES uses the cues from emotion similarity, while FCNEL uses fully convolutional networks to directly learn from the training set of the EmotionROI database. We present qualitative and quantitative results showing that both our methods predict more accurate ESMs compared with saliency and objectness detection.

CHAPTER 4 MULTI-TASK CNN FEATURES
Summary
Most works using convolutional neural networks (CNN) show the efficacy of their methods in standard object recognition tasks, but not in abstract tasks such as emotion classification and memorability prediction, which are increasingly important (as machines become more autonomous, there is a growing need for semantic understanding). To verify whether CNN-based methods are effective in abstract tasks, we select 8 different abstract tasks in computer vision, evaluating the performance of 5 different CNN-based training approaches in these tasks. We show that CNN-based approaches outperform the state-of-the-art results in all the 8 tasks. Furthermore, we show that concatenating CNN features learned from different tasks can enhance the performance in each task. We also show that concatenating the CNN features learned from all the tasks under experiment does not perform the best, which is different from what is usually shown in previous works. Using CNN as a tool to correlate different tasks, we suggest which CNN features researchers should use in each task.
4.1 Introduction
Utilizing convolutional neural networks has become a popular approach in recent research [15, 26, 82, 83]. However, relatively few CNN-related works concentrate on abstract tasks. Abstract tasks have been receiving increasing attention, including classification tasks (emotions [52, 73], architectural styles [77], aesthetic qualities [50, 54], fashion styles [43], artist and artistic styles [41]) and regression tasks (memorability [7, 30, 31, 42, 44] and interestingness [22, 28]). Most of the mentioned references tackle the abstract tasks without using CNN.
In this chapter, we are interested in the performance of CNN in abstract tasks because it is not yet well studied in the current literature. We are curious whether a CNN can outperform the current state-of-the-art results in abstract tasks. In addition, we are curious about the following questions, which are rarely discussed in the current literature, where different abstract tasks are treated in a relatively independent fashion. We would like to know whether the CNN features learned from an abstract task can improve the performance in another abstract task. Furthermore, for a given abstract task, we want to identify which of the many learned CNN features perform the best when concatenated. These questions are related to the transferability mentioned by Yosinski et al. [80]. However, Yosinski et al. [80] study the transferability only in the ImageNet [11] classification task, whereas this chapter studies the transferability across different tasks and databases. Our experiments include 8 abstract tasks from 6 databases, covering the abstract tasks mentioned previously. The databases and tasks used in this chapter are summarized in Table 4.1, where each task is assigned a task ID. In this chapter, we use the task ID to refer to each abstract task.
Most works using CNN-related features adopt the AlexNet [46] and pre-train it with the supervision for ImageNet [11] classification. As a novel departure from the previous works, this chapter uses not only the CNN features pre-trained with the supervision for ImageNet [11] but also the CNN features pre-trained with the supervision for AVA [54] under the AlexNet [46]. We compare the performance of these two sets of CNN features in all 8 abstract tasks in Table 4.1, identifying which set of features achieves better performance in each abstract task.

| database | Artphoto | Painting-91 | Painting-91 | AVA |
|---|---|---|---|---|
| task | emotion classification | artist classification | artistic style classification | aesthetic classification |
| reference | [52] | [41] | [41] | [54] |
| task ID | EMO | AST | ART | AVA |
| # classes | 8 | 91 | 13 | 2 |
| # images | 806 | 4266 | 2338 | >250k |
| image type | deviantart [12] | painting | painting | dpchallenge [17] |
| class labels | fear, sad, etc. | Rubens, Picasso, etc. | Baroque, Cubism, etc. | high / low aesthetic quality |
| # training images | ∼645 | 2275 | 1250 | ∼233k |
| # testing images | ∼160 | 1991 | 1088 | 19930 |
| data split | random | specified [41] | specified [41] | specified [54] |
| # fold(s) | 5 | 1 | 1 | 1 |
| evaluation metric | 1-vs-all accuracy | accuracy | accuracy | accuracy |
| reference of above setting | [73] | [41] | [41] | [54] |

| database | HipsterWars | arcDataset | Memorability | Memorability |
|---|---|---|---|---|
| task | fashion style classification | architectural style classification | memorability prediction | interestingness prediction |
| reference | [43] | [77] | [31] | dataset: [31]; task: [28] |
| task ID | FAS | ARC | MEM | INT |
| # classes | 5 | 10 / 25 | regression task | regression task |
| # images | 1893 | 2043 / 4786 | 2222 | 2222 |
| image type | outfit | architecture | general | general |
| class labels | Bohemian, Goth, etc. | Georgian, Gothic, etc. | memorability | interestingness |
| # training images | 853 | 300 / 750 | 1111 | 1982 |
| # testing images | 92 | 1743 / 4036 | 1111 | 240 |
| data split | random | random | specified [31] | random |
| # fold(s) | 100 | 10 | 25 | 10 |
| evaluation metric | accuracy | accuracy | ρ | ρ |
| reference of above setting | [43] | [77] | [31] | [28] |

Table 4.1: The databases and associated abstract tasks used in this thesis along with their properties. In the rest of the thesis, we refer to each task by the corresponding task ID listed under each task. The experimental setting for each task is provided at the bottom of the table, where ρ is the Spearman rank correlation between the prediction and the ground truth.
When using CNN-based approaches or features for abstract tasks, most existing works limit themselves to one specific domain instead of leveraging the CNN features learned from different domains of abstract tasks. For instance, Bar et al. [3] use CNN-based features to perform artistic style classification of the images in the WikiArt database [4]. Peng et al. [66] predict emotion distributions with their proposed Emotion6 database. Karayev et al. [39] work on image style classification with their proposed databases, while Lu et al. [50] are interested in classifying both aesthetic qualities and image styles using the AVA database [54]. Though our databases do not completely overlap with theirs, 3 of our 8 selected tasks in Table 4.1 (EMO, ART, and AVA) cover abstract tasks similar to theirs. The novelty of this chapter is in applying CNN-based features to 8 abstract tasks from 8 different domains and leveraging the features learned from multiple tasks simultaneously. The following section summarizes our findings.
4.1.1 Summary of Findings
Superior Performance of CNN-based Approaches in Abstract Tasks
Testing the performance of 5 different training approaches with the AlexNet [46], we find that at least one of the five CNN-based approaches outperforms the current state-of-the-art performance in each of the 8 abstract tasks in Table 4.1.

Concatenating CNN Features Learned from Different Tasks Can Enhance the Performance in Each Task
Unlike previous works [28, 31, 41, 43, 52, 54, 77] that only use standard or handcrafted features without using the features specifically trained for other abstract tasks, we argue that the performance of a given abstract task can benefit from the features learned from other existing abstract tasks. More precisely, our results show that for each of the abstract tasks in Table 4.1, concatenating the CNN features learned from some other task(s) with those learned from the task of interest can outperform using only the CNN features learned from the task of interest. This finding suggests that as the computer vision community keeps identifying and proposing brand new tasks, researchers should leverage the knowledge learned from other related tasks.
Concatenating CNN Features Learned from All the Tasks Does Not Perform the Best
To identify which CNN features perform the best when concatenated in each task, we evaluate different settings of concatenating CNN features. Our results show that concatenating the CNN features learned from all the tasks in our experiment does not perform the best, which is surprising and inconsistent with what is usually shown in previous works [30, 31, 41, 54] where combining all the features achieves the best performance. We also show that in some abstract tasks, using only the CNN features learned from the task of interest outperforms concatenating the CNN features learned from all the tasks. In addition, for each task, we identify the concatenating setting which results in the best performance in our experiment.
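The idea of evaluating different concatenation settings can be sketched as follows. This is a hypothetical illustration, not the procedure used in our experiments: the function names, the dictionary of per-task feature matrices, the feature dimensions, and the exhaustive enumeration of subsets are all assumptions, and the classifier or regressor applied to each concatenated feature is omitted.

```python
from itertools import combinations

import numpy as np


def concatenate_features(feature_bank, task_ids):
    """Concatenate per-image features learned from the given tasks.

    feature_bank: dict mapping a task ID to an array of shape
    (num_images, feature_dim), with one row per image.
    """
    return np.concatenate([feature_bank[t] for t in task_ids], axis=1)


def candidate_settings(target_task, other_tasks, max_extra):
    """Yield task-ID tuples: the target task plus up to max_extra other tasks."""
    for k in range(max_extra + 1):
        for extra in combinations(other_tasks, k):
            yield (target_task,) + extra


# Hypothetical features for 2 images from two tasks (dimensions are made up).
bank = {"EMO": np.zeros((2, 3)), "ART": np.ones((2, 4))}
fused = concatenate_features(bank, ("EMO", "ART"))
settings = list(candidate_settings("EMO", ["ART", "AVA"], max_extra=2))
```

Each setting would then be scored with the evaluation metric of the target task, so that the setting using all tasks can be compared against smaller subsets.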

Suggestions of Choosing CNN Features to Use in Abstract Tasks
We show that CNN can be used as a tool to correlate different abstract tasks. According to the performance of the CNN features learned from different abstract tasks, we are able to interpret a given abstract task using higher-level semantics instead of only using standard or handcrafted features like previous works [28, 41, 43, 52, 54]. For example, according to our results, artistic style features ($F_{\text{ART}}$) outperform fashion style features ($F_{\text{FAS}}$) in the artist classification task (AST), where the notation “$F_T$” denotes CNN features learned from the task T (T is a task ID). For each task, we release the performance ranking of the CNN features learned from different tasks. The ranking is an indicator of which CNN features we should consider in each task. We hope this method of correlating different abstract tasks will encourage researchers to make more use of the knowledge learned from existing tasks when they solve their tasks of interest.
4.2 Experimental Setup
4.2.1 Databases and Tasks
We perform 8 abstract tasks on 6 databases and summarize all the databases and tasks used in this chapter in Table 4.1, along with their properties and related statistics. We refer to each task by its task ID listed in Table 4.1, where the experimental setting associated with each task is also provided. In Table 4.1, “data split” indicates whether the training/testing splits are randomly generated or specified by the work proposing the dataset/task, and “# fold(s)” represents the number of different training/testing splits used for the task. The experimental setting of each task follows that of the corresponding reference listed at the bottom of Table 4.1.
4.2.2 Network Architecture
For the 6 classiﬁcation tasks in Table 4.1, we use the Caffe [32] implementation of the AlexNet [46] except that the number of the nodes in the output layer is set to the number of classes in each task. For the 2 regression tasks (MEM and INT), we also use the Caffe [32] implementation of the AlexNet [46] except that the number of the nodes in the output layer is changed to 1 to predict a real value and that the softmax loss layer is replaced with the Euclidean loss layer. When using the Caffe [32] implementation, we adopt its default training parameters for training the CNN for ImageNet [11] classiﬁcation unless otherwise speciﬁed.
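The loss used for the two regression tasks can be written out explicitly. The sketch below assumes the usual definition of the Euclidean loss layer in Caffe, namely half the mean squared difference over a batch; the function name and the toy numbers are illustrative and do not come from the experiments.

```python
def euclidean_loss(predictions, targets):
    """Half the mean squared error over a batch: 1/(2N) * sum((p - t)^2)."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(predictions)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / (2.0 * n)


# Toy batch: three predicted scores from a single output node and their ground truth.
loss = euclidean_loss([0.5, 0.8, 0.1], [0.5, 0.6, 0.3])
```

With a single output node, each prediction is one real value, which is why the loss reduces to a sum of squared scalar differences.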
4.2.3 Training Approaches
Before training, we resize all the images to 256×256, the same size used to train the CNN for ImageNet [11] classification in the Caffe [32] implementation. We directly adopt the Caffe reference model [32] (denoted as MImageNet) for ImageNet [11] classification. Using the same CNN architecture (except that the number of nodes in the output layer is set to 2), we train an AVA [54] reference model (denoted as MAVA) from scratch with ∼233k training images (that is, we randomly initialize all the CNN parameters and train with the training set). We train a reference model for AVA [54] rather than for the other databases in Table 4.1 because the number of training images in AVA [54] is large enough (>230k) that training from scratch can achieve reasonable performance.

training approach ID: description
pt ImageNet + ft: Pre-train with MImageNet and fine-tune all the CNN parameters using the training set.
pt ImageNet + ft-fc8: The same as “pt ImageNet + ft” except that only the CNN parameters associated with the edges directly connected to the output layer are allowed to be updated using the training set.
pt AVA + ft: Pre-train with MAVA and fine-tune all the CNN parameters using the training set.
pt AVA + ft-fc8: The same as “pt AVA + ft” except that only the CNN parameters associated with the edges directly connected to the output layer are allowed to be updated using the training set.
train from scratch: Randomly initialize all the CNN parameters and train with the training set.

Table 4.2: The five different CNN training approaches used in this chapter. MImageNet is the Caffe [32] reference model trained for ImageNet [11] classification, and MAVA is our trained reference model for AVA [54] classification. We refer to each training method by its training approach ID.
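The five training approaches of Table 4.2 differ only in how the parameters are initialized and which layers may be updated. The following sketch encodes that as a lookup table. The layer names (conv1 to fc8) and the dictionary layout are assumptions made for illustration; setting a learning-rate multiplier to zero is one common way to freeze a layer in Caffe, not necessarily the mechanism used here.

```python
# Assumed AlexNet layer names; fc8 stands for the output layer.
LAYERS = ["conv1", "conv2", "conv3", "conv4", "conv5", "fc6", "fc7", "fc8"]

# Hypothetical encoding of Table 4.2: initialization source and updatable layers.
APPROACHES = {
    "pt ImageNet + ft":     {"init": "M_ImageNet", "trainable": LAYERS},
    "pt ImageNet + ft-fc8": {"init": "M_ImageNet", "trainable": ["fc8"]},
    "pt AVA + ft":          {"init": "M_AVA",      "trainable": LAYERS},
    "pt AVA + ft-fc8":      {"init": "M_AVA",      "trainable": ["fc8"]},
    "train from scratch":   {"init": None,         "trainable": LAYERS},
}


def learning_rate_multipliers(approach_id):
    """Return 1.0 for layers that may be updated and 0.0 for frozen layers."""
    trainable = set(APPROACHES[approach_id]["trainable"])
    return {layer: (1.0 if layer in trainable else 0.0) for layer in LAYERS}


frozen_example = learning_rate_multipliers("pt AVA + ft-fc8")
```

In the “ft-fc8” variants only the output layer keeps a non-zero multiplier, so all pre-trained lower layers act as a fixed feature extractor.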
Given MImageNet and MAVA, we list the 5 training approaches used in this chapter in Table 4.2, following the descriptions and settings of supervised pre-training and fine-tuning used in [1] unless otherwise specified. In Table 4.2, pre-training (pt) with MD means using a data-rich auxiliary dataset D (D ∈ {ImageNet, AVA}) to initialize the CNN parameters. Fine-tuning (ft) means all the CNN parameters can be updated by continued training on the dataset of interest. “ft-fc8” is the same as “ft” except that only the CNN parameters associated with the edges directly connected to the output layer are allowed to be updated. In this chapter, we use the training approach ID in Table 4.2 to refer to each training approach. Applying all the 5 training approaches listed in Table 4.2 to all the 8 tasks in Table 4.1, we report the experimental results and compare them with the corresponding state-of-the-art performance in Sec. 4.3.

task ID                EMO                     AST            ART            AVA
evaluation metric      1-vs-all accuracy (%)   accuracy (%)   accuracy (%)   accuracy (%)
previous work          63.163 [73]             53.100 [41]    62.200 [41]    73.250 [50]
pt ImageNet + ft       60.127                  56.102         68.290         n/a
pt ImageNet + ft-fc8   64.724                  53.541         65.165         n/a
pt AVA + ft            59.836                  25.615         40.625         n/a
pt AVA + ft-fc8        60.644                  4.671          18.015         n/a
train from scratch     61.572                  21.698         38.327         74.436

task ID                FAS            ARC                    MEM          INT
evaluation metric      accuracy (%)   accuracy (%)           ρ            ρ
previous work          70.971 [43]    69.170 / 46.210 [77]   0.500 [42]   0.600 [28]
pt ImageNet + ft       71.294         71.159 / 52.953        0.520        0.643
pt ImageNet + ft-fc8   66.228         67.246 / 51.469        -0.140       0.339
pt AVA + ft            57.337         35.841 / 20.401        0.368        0.511
pt AVA + ft-fc8        27.554         18.233 / 8.290         0.080        -0.113
train from scratch     54.304         21.532 / 12.386        0.372        0.382

Table 4.3: The summary of the experimental results of the 8 abstract tasks listed in Table 4.1 using the five training approaches in Table 4.2. In this table, ρ is the Spearman rank correlation between the prediction and the ground truth. The bold numbers represent the best performance given the specified evaluation metric, and the underlined numbers indicate the performance better than that of “train from scratch.”
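The evaluation metric ρ used for MEM and INT is the Spearman rank correlation, that is, the Pearson correlation computed on ranks. The sketch below is a plain-Python illustration that assigns tied values their average rank, which is the standard convention; the function names are hypothetical, and the code is not the evaluation script used for the reported numbers.

```python
def average_ranks(values):
    """Rank values in ascending order (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0  # average of positions start..end, 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(predictions, ground_truth):
    """Pearson correlation of the two rank vectors (undefined if either is constant)."""
    rx, ry = average_ranks(predictions), average_ranks(ground_truth)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Any monotonically increasing relation gives rho = 1, even if it is non-linear.
rho = spearman_rho([0.1, 0.4, 0.2, 0.9], [1.0, 16.0, 4.0, 81.0])
```

Because ρ depends only on the ordering of the predictions, a regressor can have a large squared error and still score well, and vice versa.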

4.3 CNN Performance in Abstract Tasks
We summarize the experimental results of the selected 8 abstract tasks in Table 4.3, which shows that in all the 8 tasks, at least one of the five training approaches in Table 4.2 outperforms the state-of-the-art methods. Table 4.3 also shows that for most of the 8 tasks, the training approaches involving pre-training and fine-tuning outperform training from scratch. For AST, ART, and ARC, the complete results are shown in Table 4.3, where the results of ARC are displayed in the form: 10-way / 25-way classification accuracy.
Reviewing all the experimental results in Table 4.3, we summarize the following tips for future researchers applying CNN-based approaches to abstract tasks:
1. If there is no prior knowledge about the abstract task of interest, one reasonable way is applying all the 5 training approaches in Table 4.2 and selecting the one with the best performance on the validation set.
2. If we have the prior knowledge of which database (out of all the possible databases at hand which can be used for pre-training) is more relevant to the abstract task of interest, we can directly use that database for pretraining instead of trying all of them.
3. Even if we have no prior knowledge, from our empirical experience, “pt ImageNet + ft” usually performs well for abstract tasks.
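Tip 1 amounts to taking an argmax over validation scores. A minimal sketch follows, assuming a metric where higher is better (true for both accuracy and ρ). The function name is hypothetical, and the numbers are the AST column of Table 4.3, borrowed only as stand-in values because no validation scores are listed in this chapter.

```python
def select_training_approach(validation_scores):
    """Return the approach ID with the highest validation score."""
    if not validation_scores:
        raise ValueError("no training approach was evaluated")
    return max(validation_scores, key=validation_scores.get)


# Stand-in scores (AST column of Table 4.3), used here in place of validation results.
stand_in_scores = {
    "pt ImageNet + ft": 56.102,
    "pt ImageNet + ft-fc8": 53.541,
    "pt AVA + ft": 25.615,
    "pt AVA + ft-fc8": 4.671,
    "train from scratch": 21.698,
}
chosen = select_training_approach(stand_in_scores)
```

Selecting on a held-out validation set rather than on the test set keeps the reported test performance an unbiased estimate.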
4.4 Correlating Abstract Tasks
4.4.1 Experimental Setting
To find out whether the features learned from one task can enhance the performance in another task, we select a total of 9 tasks (the 8 abstract tasks in Table 4.1 and the Caltech-101 [48] object classification task) in our additional experiment. We include the Caltech-101 [48] object classification task (we use CAL as its task ID) because we are also curious about whether the features learned from an object classification task can improve the performance in abstract tasks. For the task CAL, following the same setting as in [82] (30 training images per class), we achieve accuracy comparable to that reported in [82] by using the network architecture in Sec. 4.2.2 and the training approach “pt ImageNet + ft.”
For each of the 9 tasks in the experiment, we train the corresponding CNN with the network architecture in Sec. 4.2.2 and the training approach “pt ImageNet + ft.” We treat each of the 9 trained CNNs as a feature extractor which takes an image as input and outputs a 4096-d feature vector from its topmost fully connected layer. We use “F T” to represent the 4096-d feature vector output from the CNN trained with the task T, and we call F T the “self feature” when it is used in the task T itself. For example, F EMO and F AST are the learned CNN features corresponding to emotion and artist classification respectively. In the task EMO, F EMO is the self feature, but F AST is not.
With the 9 trained CNN feature extractors, we illustrate the framework of our experiment in Figure 4.1. Our goal is to evaluate the performance in each task under different settings of concatenating learned CNN features. We generate the concatenated CNN features as follows. First, given an input image, we extract all the F Ti, where i ∈ {1, 2, · · · , 9} (Ti is one of the abstract tasks in Table 4.1 or CAL; Ti ≠ Tj if i ≠ j). Second, we decide whether to concatenate F Ti by the binary switch Si associated with F Ti. If Si is set (reset), F Ti will (will not) be part of the features concatenated to form the final feature vector. In other words, formed by concatenating all the F Ti with set Si, the final concatenated CNN features are a (4096 × nset)-d vector, where nset is the total number of set Si (i ∈ {1, 2, · · · , 9}).
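The switch mechanism described above can be sketched in a few lines. The task order and the placeholder feature values below are assumptions for illustration; real F T vectors would come from the trained CNNs.

```python
TASK_IDS = ["ART", "AST", "CAL", "ARC", "EMO", "AVA", "FAS", "MEM", "INT"]
DIM = 4096  # dimension of each F_T


def concatenate_features(features, switches):
    """Concatenate F_T for every task T whose switch is set, in TASK_IDS order."""
    final = []
    for task in TASK_IDS:
        if switches.get(task, False):
            final.extend(features[task])
    return final


# Placeholder vectors standing in for real CNN outputs (task i is filled with the value i).
features = {task: [float(i)] * DIM for i, task in enumerate(TASK_IDS)}
switches = {"EMO": True, "ART": True, "AVA": True}  # n_set = 3
vector = concatenate_features(features, switches)
```

With three switches set, the final vector has 4096 × 3 = 12288 dimensions, and the order of the blocks is fixed by the task order rather than by the order in which the switches were specified.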
Figure 4.1: The framework of concatenating the CNN features learned from different tasks. We experiment on 9 tasks (n = 9), including the 8 abstract tasks in Table 4.1 and the Caltech-101 [48] object classification task. The switch Si associated with each task Ti (i ∈ {1, 2, · · · , 9}) controls whether the CNN features learned from task Ti are concatenated in the final feature vector.

For each of the 9 tasks, we evaluate the performance under a total of 264 different settings of the 9 switches. These settings include:
1. 9 different combinations of Si such that nset = 1.
2. 2^8 − 1 = 255 different combinations of Si such that nset > 1 and self feature is concatenated.
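The count of 264 settings per task follows from 9 + (2^8 − 1). The sketch below enumerates them to make the arithmetic concrete; the function name and the representation of a setting as a set of task IDs are illustrative choices.

```python
from itertools import combinations

TASK_IDS = ["ART", "AST", "CAL", "ARC", "EMO", "AVA", "FAS", "MEM", "INT"]


def switch_settings(self_task):
    """Enumerate the switch settings evaluated for one task as sets of enabled task IDs."""
    # Case 1: exactly one switch is set (9 settings, self feature included).
    settings = [frozenset([task]) for task in TASK_IDS]
    # Case 2: self feature plus any non-empty subset of the other 8 features (2^8 - 1).
    others = [task for task in TASK_IDS if task != self_task]
    for size in range(1, len(others) + 1):
        for extra in combinations(others, size):
            settings.append(frozenset((self_task,) + extra))
    return settings


emo_settings = switch_settings("EMO")
```

Each setting then yields one concatenated feature vector per image, and therefore one classifier or regressor per setting.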
For each task, we train a classiﬁer or regressor for each of the 264 settings using the concatenated CNN features of the training database, and we test on the concatenated CNN features of the testing database using the trained classiﬁer or regressor. In this experiment, we use support vector machine (SVM) or support vector regression (SVR) provided in LIBSVM [8], linear kernel, and the LIBSVM [8] default parameters to train all the classiﬁers and regressors.
Considering the efficiency of the experiment, we use only the first 5 folds of the training/testing splits in the tasks FAS, ARC, MEM, and INT, where more than 5 folds are provided. In the task EMO, we perform 8-way classification instead of the 1-vs-all setting. For the AVA database [54] associated with the task AVA, we use the generic training set with 2495 images to shorten the training time. In the task ARC, we do the 25-way classification task specified in Table 4.1. In the task CAL, we follow the setting in [82] (30 training images per class). Other experimental settings are consistent with Table 4.1 unless otherwise specified.

task ID                                  EMO            AST            ART            AVA
evaluation metric                        accuracy (%)   accuracy (%)   accuracy (%)   accuracy (%)
self feature                             36.228         55.148         67.555         69.423
best concatenating setting (Table 4.5)   39.082         57.509         71.048         69.980
concatenate all                          36.971         54.596         69.210         69.458

task ID                                  FAS            ARC            MEM     INT     CAL
evaluation metric                        accuracy (%)   accuracy (%)   ρ       ρ       accuracy (%)
self feature                             76.957         54.440         0.398   0.573   88.217
best concatenating setting (Table 4.5)   77.609         55.382         0.507   0.630   88.394
concatenate all                          74.348         53.489         0.504   0.629   85.969

Table 4.4: The summary of the experimental results using the framework in Figure 4.1. The task CAL and the 8 abstract tasks listed in Table 4.1 are included in this experiment. In this table, ρ is the Spearman rank correlation between the prediction and the ground truth. The underlined numbers indicate the performance better than that of using self feature. The best concatenating setting in each task is shown in Table 4.5. This table shows that concatenating the CNN features learned from different tasks can improve the performance. However, in our experiment, concatenating all the learned CNN features never performs the best in any task. In fact, in 4 out of 9 tasks, concatenating all the learned CNN features performs even worse than using self feature.

task ID ART AST CAL ARC EMO AVA FAS MEM INT

F ART v

v vv v v

F AST v v

vv

F CAL v

v

vv

F ARC

vv

vv

F EMO v v

vv

vv

F AVA v

vv

v vv

F FAS

vv v

vvv

F MEM v

vv

vv

F INT

vv

Table 4.5: The best concatenating setting out of 264 different settings in each task. The corresponding performance is listed in Table 4.4. This table shows that the best performance of each task is not achieved by concatenating all the learned CNN features, but by concatenating a subset of them.

4.4.2 Experimental Results

We summarize the performance of concatenating learned CNN features in Table 4.4, where we compare the performance of concatenating all the 9 F Ti with that of self feature in each task. The best performance out of 264 different settings is also listed in Table 4.4, and the corresponding best concatenating setting is identified in Table 4.5. In Table 4.4, the underlined numbers represent the performance better than that of using self feature. Table 4.4 supports the finding that concatenating the CNN features learned from different tasks can improve the performance in each task, regardless of whether the task is an abstract task. Table 4.4 also shows that in all the 9 tasks, concatenating all the learned CNN features is not the best concatenating setting, which is different from most previous works [30, 31, 41, 54], where combining all the features achieves the best performance. One possible reason is that the amount of training data in each task is not large enough to properly train the (4096 × 9)-d weight vector of SVM/SVR.
Table 4.4 shows that in the 4 tasks (AST, CAL, ARC, and FAS), concatenating all the learned CNN features performs even worse than using only self feature, which suggests that we should concatenate useful features instead of concatenating all the features. For each task, the best concatenating setting out of 264 different settings is identiﬁed in Table 4.5, which can serve as a guide to selecting useful features in each task.
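The comparison above can be checked directly against the numbers in Table 4.4. The sketch below copies the three rows of that table and lists the tasks in which concatenating all features falls below self feature; the variable names are illustrative, and higher is better for both accuracy and ρ.

```python
# (self feature, best concatenating setting, concatenate all), copied from Table 4.4.
TABLE_4_4 = {
    "EMO": (36.228, 39.082, 36.971),
    "AST": (55.148, 57.509, 54.596),
    "ART": (67.555, 71.048, 69.210),
    "AVA": (69.423, 69.980, 69.458),
    "FAS": (76.957, 77.609, 74.348),
    "ARC": (54.440, 55.382, 53.489),
    "MEM": (0.398, 0.507, 0.504),
    "INT": (0.573, 0.630, 0.629),
    "CAL": (88.217, 88.394, 85.969),
}

# Tasks where using every feature is worse than using the self feature alone.
worse_than_self = [t for t, (self_f, _, all_f) in TABLE_4_4.items() if all_f < self_f]

# In every task the best setting strictly beats "concatenate all".
all_never_best = all(best > all_f for _, best, all_f in TABLE_4_4.values())
```

The margins for MEM and INT are small (0.003 and 0.001 in ρ), so the gap between the best setting and concatenating everything is far narrower for the regression tasks than for the classification tasks.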
To be consistent with most previous works [28, 30, 31, 41, 43, 54], where the performance of each single feature is reported, we summarize the performance ranking in each task in Table 4.6, where the numbers are presented in the form “rank (performance).” In Table 4.6, the rank 1 (9) represents the best (worst) performance in each task. We also separately list the best and the worst performances out of the 9 F Ti in each task. We emphasize that even the worst performance in each task is much better than random guessing, which indicates that the non-best performance of concatenating all the features shown in Table 4.4 is not simply caused by combining useless features. In Table 4.6, using self feature performs the best in all the 9 tasks except the two regression tasks MEM and INT. One possible reason is that SVR is not designed to maximize ρ, the evaluation metric specified by the works [28, 31] which propose these tasks.
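The “rank (performance)” entries can be reproduced from the raw scores. The sketch below assumes that tied scores share a rank and that the following rank is skipped, which matches the ART column of Table 4.6; the tie-breaking rule is not stated in the text, so this is an inference, and the function name is hypothetical.

```python
def competition_ranks(scores):
    """Rank 1 is the highest score; tied scores share a rank and later ranks are skipped."""
    return {
        name: 1 + sum(other > value for other in scores.values())
        for name, value in scores.items()
    }


# Accuracy (%) of each single feature in the task ART, copied from Table 4.6.
art_scores = {
    "F_ART": 67.555, "F_AST": 67.555, "F_CAL": 61.121,
    "F_ARC": 60.938, "F_EMO": 60.938, "F_AVA": 56.985,
    "F_FAS": 57.813, "F_MEM": 51.838, "F_INT": 55.790,
}
art_ranks = competition_ranks(art_scores)
```

Under this rule the two features tied at 67.555 both receive rank 1 and the next feature receives rank 3, as listed in the table.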
Given that self feature usually performs the best (out of 9 learned CNN features) as shown in Table 4.6, we are curious about the effect of adding one additional feature to the self feature. In Table 4.7, we analyze the performance ranking of using self feature plus the CNN features learned from another task. The format of the numbers is the same as that of Table 4.6 except that the underlined numbers represent the performance better than that of using only self feature. The best and the worst performances in each task out of the 8 different combinations of features are listed at the bottom of Table 4.7, where the performance of using only self feature is also provided.

task ID             ART            AST            CAL            ARC            EMO
F ART               1 (67.555)     2 (48.569)     4 (81.080)     4 (48.845)     7 (31.138)
F AST               1 (67.555)     1 (55.148)     5 (81.066)     6 (48.320)     6 (31.139)
F CAL               3 (61.121)     5 (45.907)     1 (88.217)     3 (50.411)     5 (31.886)
F ARC               4 (60.938)     4 (46.760)     2 (83.667)     1 (54.440)     2 (33.868)
F EMO               4 (60.938)     3 (47.564)     8 (70.771)     2 (50.629)     1 (36.228)
F AVA               7 (56.985)     7 (41.487)     6 (80.475)     7 (46.824)     4 (33.372)
F FAS               6 (57.813)     6 (44.751)     3 (81.514)     5 (48.697)     3 (33.748)
F MEM               9 (51.838)     9 (37.167)     9 (65.643)     9 (40.986)     9 (27.170)
F INT               8 (55.790)     8 (40.532)     7 (74.316)     8 (43.925)     8 (30.029)
evaluation metric   accuracy (%)   accuracy (%)   accuracy (%)   accuracy (%)   accuracy (%)
best performance    67.555         55.148         88.217         54.440         36.228
worst performance   51.838         37.167         65.643         40.986         27.170

task ID             AVA            FAS            MEM         INT
F ART               5 (63.181)     4 (68.478)     3 (0.454)   5 (0.520)
F AST               7 (62.915)     6 (66.739)     6 (0.442)   6 (0.508)
F CAL               3 (63.297)     7 (66.739)     1 (0.464)   8 (0.493)
F ARC               4 (63.277)     2 (71.739)     7 (0.434)   9 (0.487)
F EMO               2 (63.392)     5 (68.043)     2 (0.459)   7 (0.497)
F AVA               1 (69.423)     3 (69.565)     5 (0.445)   1 (0.575)
F FAS               6 (63.006)     1 (76.957)     4 (0.450)   4 (0.542)
F MEM               9 (61.004)     9 (60.870)     8 (0.398)   3 (0.560)
F INT               8 (61.666)     8 (66.087)     9 (0.346)   2 (0.573)
evaluation metric   accuracy (%)   accuracy (%)   ρ           ρ
best performance    69.423         76.957         0.464       0.575
worst performance   61.004         60.870         0.346       0.487

Table 4.6: The performance ranking in each task by using the CNN features learned from a single task. The numbers are presented in the form “rank (performance).” The rank 1 (9) represents the best (worst) performance in each task. The table shows that self feature usually performs the best if we use the CNN features learned from one task. The best and the worst performances out of the 9 different features in each task are listed at the bottom of the table, which shows that even the worst performance is non-trivial (much better than random guessing).

task ID                ART            AST            CAL            ARC            EMO
self feature + F ART   n/a            3 (56.404)     4 (88.231)     2 (55.124)     1 (37.095)
self feature + F AST   1 (70.129)     n/a            5 (87.823)     6 (54.574)     3 (36.560)
self feature + F CAL   2 (68.658)     2 (56.605)     n/a            5 (54.747)     5 (36.352)
self feature + F ARC   2 (68.658)     7 (55.098)     3 (88.244)     n/a            7 (35.855)
self feature + F EMO   4 (68.382)     1 (57.308)     6 (87.728)     1 (55.149)     n/a
self feature + F AVA   6 (67.739)     6 (55.349)     1 (88.319)     4 (54.990)     2 (36.601)
self feature + F FAS   7 (67.647)     4 (56.103)     2 (88.299)     3 (55.064)     6 (36.105)
self feature + F MEM   5 (67.831)     5 (55.550)     7 (86.975)     7 (53.271)     4 (36.479)
self feature + F INT   8 (67.096)     8 (53.541)     8 (85.834)     8 (52.359)     8 (35.361)
evaluation metric      accuracy (%)   accuracy (%)   accuracy (%)   accuracy (%)   accuracy (%)
self feature only      67.555         55.148         88.217         54.440         36.228
best performance       70.129         57.308         88.319         55.149         37.095
worst performance      67.096         53.541         85.834         52.359         35.361

task ID                AVA            FAS            MEM         INT
self feature + F ART   6 (69.498)     2 (76.957)     7 (0.459)   6 (0.598)
self feature + F AST   3 (69.528)     7 (74.348)     3 (0.466)   4 (0.599)
self feature + F CAL   8 (69.323)     4 (76.522)     2 (0.468)   5 (0.598)
self feature + F ARC   5 (69.503)     5 (75.652)     5 (0.463)   7 (0.598)
self feature + F EMO   7 (69.473)     6 (75.435)     1 (0.470)   3 (0.599)
self feature + F AVA   n/a            1 (77.391)     4 (0.465)   1 (0.610)
self feature + F FAS   1 (69.729)     n/a            6 (0.461)   2 (0.603)
self feature + F MEM   3 (69.528)     2 (76.957)     n/a         8 (0.585)
self feature + F INT   2 (69.594)     7 (74.348)     8 (0.417)   n/a
evaluation metric      accuracy (%)   accuracy (%)   ρ           ρ
self feature only      69.423         76.957         0.398       0.573
best performance       69.729         77.391         0.470       0.610
worst performance      69.323         74.348         0.417       0.585

Table 4.7: The performance ranking in each task by using self feature and the CNN features learned from another task. The format of the numbers is the same as that of Table 4.6 except that the underlined numbers represent the performance better than that of using only self feature. The best and the worst performances in each task out of the 8 different combinations of features are listed at the bottom of the table, where the performance of using only self feature is also provided. The table shows that most of the listed feature combinations outperform using only self feature in each task.
Table 4.7 shows that most combinations of self feature and the CNN features learned from another task outperform using only self feature, which encourages us to use the CNN features learned from another task when solving the task of interest. However, there are some cases where combining features decreases the performance (as shown in Table 4.7). These cases highlight the importance of choosing useful features, and both Table 4.6 and Table 4.7 can serve as indicators for choosing useful features. For instance, in the task ART, we should consider using F AST to enhance the performance. Table 4.6 and Table 4.7 are also examples of using CNN as a tool to correlate different tasks. Leveraging the characteristic of CNN that the features learned from one task can be naturally bundled as a feature set (for example, F ART), we are able to interpret a given task using higher-level semantics. For example, in fashion style classification (FAS), aesthetic features (F AVA) are more useful (in terms of enhancing performance) than the features learned from artist classification (F AST) according to both Table 4.6 and Table 4.7.
4.5 Conclusion
In this chapter, we apply 5 different CNN-based training approaches to the selected 8 abstract tasks receiving great attention in computer vision recently, showing that our results outperform the state-of-the-art results in all the 8 tasks.
Unlike previous researchers who use standard or handcrafted features to solve abstract tasks, we propose a framework to leverage the CNN features learned from different tasks. By evaluating the performance of concatenating features in different settings, we show that using the CNN features learned from one task can enhance the performance in another task. We also show that concatenating all the learned CNN features in this chapter is not the best option. Instead, we should identify the useful features in each task to achieve the best performance.
To identify the useful features in each task, we not only show the best concatenating setting but also use CNN as a tool to correlate different tasks. By presenting the performance ranking of the CNN features learned from different tasks, we provide suggestions of which CNN features to use in each task. We hope that the results presented in this chapter will encourage researchers proposing new tasks or interested in existing tasks to cooperate instead of only focusing on the task of interest without utilizing the knowledge learned from existing tasks.
CHAPTER 5 MULTI-DEPTH CNN FEATURES
Summary
Recent works about convolutional neural networks (CNN) show breakthrough performance on various tasks. However, most of them only use the features extracted from the topmost layer of CNN instead of leveraging the features extracted from CNN with different numbers of layers. As the ﬁrst group which explicitly addresses utilizing the features from CNN with different numbers of layers, we propose multi-depth CNN features which consist of the features extracted from multiple CNNs with different numbers of layers. Our experimental results show that our proposed multi-depth CNN features outperform not only the state-of-the-art results but also the features commonly used in the traditional CNN framework on 8 abstract tasks. As shown by the experimental results, our proposed multi-depth CNN features achieve the best known performance on the 8 abstract tasks in different domains, which makes our proposed multi-depth CNN features promising solutions for generic tasks.
5.1 Introduction
Extraordinary performance by using convolutional neural networks (CNN) has been reported in recent literature [1, 24, 26, 29, 38]. However, there is one major constraint in the traditional CNN framework: the final output of the output layer is solely based on the features extracted from the topmost layer. In other words, given the features extracted from the topmost layer, the final output is independent of all the features extracted from other non-topmost layers. At first glance, this constraint seems to be reasonable because the non-topmost layers are implicitly considered in the way that the output of the non-topmost layers is the input of the topmost layer. However, we believe that the features extracted from the non-topmost layers are not explicitly and properly utilized in the traditional CNN framework where partial features generated by the non-topmost layers are ignored during training. Therefore, we want to relax this constraint of the traditional CNN framework by explicitly leveraging the features extracted from non-topmost layers. Inspired by this idea, we propose multi-depth features based on the AlexNet [46], and we show that our proposed features outperform the AlexNet [46] on 8 abstract tasks listed in Table 4.1. The details of our proposed multi-depth CNN features, the experimental setup and the results are presented in Sec. 5.2, Sec. 5.3, and Sec. 5.4 respectively.
Two recent studies analyzing the performance of multi-layer CNN [1, 82] both extract features from the AlexNet [46] and evaluate the performance on different databases. These works reach the consistent conclusion that the features extracted from the topmost layer have the best discriminative ability in classification tasks compared with the features extracted from other non-topmost layers. However, both works [1, 82] only evaluate the performance of the features extracted from one layer at a time without considering the features learned by CNNs with different numbers of layers at once. Unlike [1, 82], in Sec. 5.4, we show that our proposed multi-depth CNN features outperform the features extracted from the topmost layer of the AlexNet [46] (the features used in [1, 82]).
In previous works [41, 77] studying the abstract tasks involved in this chapter, the state-of-the-art results are achieved by the traditional handcrafted features (for example, SIFT and HOG) without considering CNN-related features. In Sec. 5.4, we show that our proposed multi-depth CNN features outperform the state-of-the-art results on 8 abstract tasks. Another related prior work is the double-column convolutional neural network (DCNN) proposed by Lu et al. [50]. Using DCNN to predict pictorial aesthetics, Lu et al. [50] extract multi-scale features from multiple CNNs with the multi-scale input data generated from their algorithm. In contrast, this chapter focuses on the multi-depth CNN features extracted from CNNs with different numbers of layers without the need to generate multi-scale input.
In this chapter, our main contribution is the concept of utilizing the features extracted from the CNNs with different numbers of layers. Based on this concept, we propose multi-depth CNN features extracted from the CNNs with different numbers of layers and show that our proposed features outperform not only the state-of-the-art performance but also the results of the traditional CNN framework on 8 abstract tasks. To the best of our knowledge, this is the ﬁrst work explicitly utilizing the features extracted from the CNNs with different numbers of layers, which is a novel departure from most CNN-related works which use only the features extracted from the topmost layer.
5.2 Generating Multi-depth CNN Features
Figure 5.1 and Table 5.1 illustrate the CNN structures involved in this chapter and how we form our proposed multi-depth CNN features respectively. There are 5 different CNN structures in Figure 5.1, where CNN0 represents the AlexNet [46], and CNN1 to CNN4 are the “sub-CNNs” of CNN0 (they are the same as CNN0 except that some convolutional layers are removed). We use the same notation of convolutional layers (conv-1 to conv-5) and fully connected layers (fc-6 and fc-7) as that used in [1] to represent the corresponding layers in the AlexNet [46]. In Figure 5.1, in addition to the input and output layers, we only show the convolutional and fully connected layers of the AlexNet [46] for clarity. Instead of using the output from the output layer of each CNNi (i ∈ {0, 1, · · · , 4}), we treat CNNi as a feature extractor which takes an image as input and outputs a k-d feature vector fi from the topmost fully connected layer, which is inspired by [15]. We follow the specification of the AlexNet [46] and use k = 4096 in our experiment.

Figure 5.1: The five CNN structures adopted in this chapter. CNN0 represents the AlexNet [46], and CNN1 to CNN4 are the same as CNN0 except that some convolutional layers are removed. We use each CNNi (i ∈ {0, 1, · · · , 4}) as a feature extractor which takes an image as input and outputs a feature vector fi from the topmost fully connected layer. These fi are concatenated to form our proposed multi-depth features according to the definition in Table 5.1.

feature ID      multi-depth features (Figure 5.1)   dimension
F0 (baseline)   f0                                  k
F1              f0 + f1                             2k
F2              f0 + f1 + f2                        3k
F3              f0 + f1 + f2 + f3                   4k
F4              f0 + f1 + f2 + f3 + f4              5k

Table 5.1: The summary of the multi-depth CNN features used in this chapter. Serving as a baseline, F0 represents the features extracted from the topmost layer in the traditional CNN framework. F1 to F4 are our proposed multi-depth CNN features which are formed by concatenating the fi (i ∈ {0, 1, · · · , 4}) defined in Figure 5.1. We follow the specification of the AlexNet [46] and use k = 4096.
In Figure 5.1, f0 represents the features extracted from the output of the fc-7 layer of the AlexNet [46], and f1 to f4 represent the features derived from different combinations of the convolutional layers. Each fi (i ∈ {0, 1, · · · , 4}) is extracted from the topmost fully connected layer of CNNi, not from an intermediate layer of CNN0, because the features extracted from the topmost layer have the best discriminative ability according to [1, 82]. As the features learned from CNN0 and its sub-CNNs, the fi implicitly reflect the discriminative ability of the corresponding layers of CNN0. Most CNN-related works use only f0 and ignore the intermediate features (f1 to f4), but we explicitly extract them as part of our proposed multi-depth CNN features, which are explained in the following paragraph.
Using the feature vectors (fi) defined in Figure 5.1, we concatenate these fi and form our proposed multi-depth CNN features. We summarize these multi-depth CNN features (F1 to F4) in Table 5.1, where how the features are formed and their dimensions are specified. F0 represents the features extracted from the topmost layer in the traditional CNN framework without concatenating the features from other layers. The feature IDs listed in Table 5.1 are used to refer to the corresponding multi-depth CNN features when we report the experimental results in Sec. 5.4, where we compare the performance of Fi (i ∈ {0, 1, · · · , 4}) on 8 abstract tasks.
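The construction in Table 5.1 is a running concatenation: Fj is f0 followed by f1 up to fj, so its dimension is (j + 1)k. The sketch below illustrates this with placeholder vectors; the function name and the toy feature values are assumptions, and real fi would be the outputs of the trained CNNi.

```python
K = 4096  # dimension of each f_i, following the AlexNet specification


def multi_depth_features(f):
    """Given [f0, ..., f4], return [F0, ..., F4] with Fj = f0 ++ f1 ++ ... ++ fj."""
    features, running = [], []
    for fi in f:
        running = running + list(fi)  # new list each step, so earlier Fj stay unchanged
        features.append(running)
    return features


# Placeholder vectors standing in for the outputs of CNN0 to CNN4.
f = [[float(i)] * K for i in range(5)]
F = multi_depth_features(f)
dims = [len(Fj) for Fj in F]
```

The baseline F0 is simply f0, and each further Fj appends one more k-d block, so F4 is a 20480-d vector when k = 4096.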
5.3 Experimental Setup
5.3.1 Databases and Tasks
We conduct experiments on 8 abstract tasks, and these databases and tasks are summarized in Table 4.1, where their properties and related statistics are shown. When reporting the results in Sec. 5.4, we use the task ID listed in Table 4.1 to refer to each task. To evaluate different methods under fair comparison, we use the same experimental settings for the 3 tasks (AST, ART, and ARC) as those used in the references listed at the bottom of Table 4.1. For the task AVA, we use the generic training set with 2495 images to shorten the training time. For the task ARC, there are two different experimental settings (10-way and 25-way classification) provided by [77], and we do both in our experiment. For the task EMO, we perform 8-way classification instead of the 1-vs-all setting. For the 3 tasks FAS, MEM, and INT, we use only the first 5 folds of the training/testing splits considering the efficiency of the experiment.
5.3.2 Training Approach
In our experiment, we use the Caffe [32] implementation to train the 5 CNNi (i ∈ {0, 1, · · · , 4}) in Figure 5.1 for each of the tasks in Table 4.1. For each classification task, CNNi is adjusted such that the number of nodes in the output layer is set to the number of classes of that task. For the two regression tasks MEM and INT in Table 4.1, we change the number of nodes in the output layer to 1 and replace the softmax loss with the Euclidean loss. When using the Caffe [32] implementation, we adopt its default training parameters for training the AlexNet [46] for ImageNet [11] classification unless otherwise specified. Before training CNNi for each task, all the images in the corresponding database are resized to 256×256 according to the Caffe [32] implementation.
In the training phase, adopting the Caffe reference model provided by [32] (denoted as M_ImageNet) for ImageNet [11] classification, we train CNN_i (i ∈ {0, 1, ..., 4}) in Figure 5.1 for each task in Table 4.1. We follow the descriptions and settings of supervised pre-training and fine-tuning used in Agrawal's work [1], where pre-training with M_D means using a data-rich auxiliary database D to initialize the CNN parameters and fine-tuning means that all the CNN parameters can be updated by continued training on the corresponding training set. For each CNN_i for each task in Table 4.1, we pre-train it with M_ImageNet and fine-tune it with the training set of that task. After training CNN_i, we form the multi-depth CNN features F_i (i ∈ {0, 1, ..., 4}) according to Table 5.1. With these multi-depth CNN feature vectors for training, we use support vector machine (SVM)/support vector regression (SVR) to train a linear classifier/regressor supervised by the ground truth of the training images in the corresponding database for the classification/regression task. Specifically, one linear classifier or regressor is trained for each F_i (i ∈ {0, 1, ..., 4}) for each task (a total of 5 classifiers or regressors per task). In practice, we use LIBSVM [8] to do so with the cost (parameter C in SVM or SVR) set to the default value 1. Having tried different C values, we find that they result in similar accuracy, so we use the default value.
In the testing phase, we use the given testing image as the input of the trained CNN_i (i ∈ {0, 1, ..., 4}) and generate the features f_i. The multi-depth CNN features F_i (i ∈ {0, 1, ..., 4}) are formed by concatenating the generated f_i according to Table 5.1. After that, we feed each feature vector (F_i) of the testing image as the input of the corresponding trained SVM classifier/SVR regressor, and the output of the SVM classifier/SVR regressor is the predicted label/value of the testing image.
5.4 Experimental Results
Using the training approach described in Sec. 5.3.2, we evaluate the performance of F_i (i ∈ {0, 1, ..., 4}) defined in Table 5.1 on the 8 abstract tasks listed in Table 4.1. The experimental results are summarized in Table 5.2, where the bold numbers represent the best performance for that task. We compare the performance of our proposed multi-depth CNN features (F_1 to F_4) with the following two baselines:
1. The best known performance of that task reported in the references listed in Table 5.2.
2. The performance of F_0, which represents the commonly used features in the traditional CNN framework.

task ID  evaluation metric  previous work       F_0 (baseline) [62]  F_1            F_2            F_3            F_4
EMO      accuracy (%)       n/a                 36.23                38.83          38.34          38.34          38.34
AST      accuracy (%)       53.10 [41]          55.15                56.25          56.40          56.35          56.35
ART      accuracy (%)       62.20 [41]          67.37                68.29          68.57          69.21          69.21
AVA      accuracy (%)       n/a                 69.42                69.57          69.57          69.60          69.60
FAS      accuracy (%)       n/a                 76.96                77.39          77.39          76.96          76.96
ARC      accuracy (%)       69.17 / 46.21 [77]  70.64 / 54.84        70.94 / 55.44  70.73 / 55.35  70.68 / 55.32  70.68 / 55.31
MEM      ρ                  n/a                 0.40                 0.43           0.46           0.49           0.49
INT      ρ                  n/a                 0.57                 0.58           0.62           0.63           0.63

Table 5.2: The summary of our experimental results. The bold numbers represent the best performance for each task. The results show that for all the 8 tasks, our proposed multi-depth CNN features (F_1 to F_4) outperform not only the best known results from prior works but also the features commonly used in the traditional CNN framework (F_0).

The results in Table 5.2 show that our proposed multi-depth CNN features (F_1 to F_4) match or outperform the two baselines on the 8 tasks (F_3 and F_4 tie with F_0 on FAS, and all other entries are strictly better), which supports our claim that utilizing the features extracted from multiple CNNs with different numbers of layers is better than using the traditional CNN features which are only extracted from the topmost fully connected layer. In Table 5.2, the best performance for each task is not always achieved by F_4, the multi-depth features extracted from all the CNNs in our experiment. Although the dimension of F_4 is the highest among all the proposed multi-depth features and theoretically it has the most powerful describing ability as the image descriptor, high-dimensional feature vectors are more likely to suffer from overfitting. The results in Table 5.2 reflect the trade-off between the describing ability and the overfitting issue of the feature vectors. We think that this trade-off and the database/task bias are the two major reasons explaining why the best performance for each task is achieved by different multi-depth features (different F_i).
Table 5.2 also shows that CNN-based features (F0 to F4) outperform the classical handcrafted features (for example, SIFT and HOG) used in the prior works [41, 77], which is consistent with the ﬁndings of the recent CNN-related literature [15, 29, 50, 82, 83]. In addition, our proposed multi-depth CNN features are generic features which are applicable to various tasks, not just the features speciﬁcally designed for certain tasks. As shown in Table 5.2, these multi-depth CNN features are effective in various domains from artistic style classiﬁcation to architectural style classiﬁcation, which makes our proposed multi-depth CNN features promising solutions for other tasks which future researchers are interested in.
5.5 Conclusion
In this chapter, we mainly focus on the idea of utilizing the features extracted from multiple CNNs with different numbers of layers. Based on this idea, we propose the multi-depth CNN features, showing their efficacy on 8 abstract tasks. Our proposed multi-depth CNN features outperform not only the state-of-the-art results but also the CNN features commonly used in the traditional CNN framework. Furthermore, we find that our proposed multi-depth CNN features are promising generic features which can be applied to various tasks.
CHAPTER 6 MULTI-SCALE CNN FEATURES
Summary
Most works related to convolutional neural networks (CNN) use the traditional CNN framework which extracts features in only one scale. We propose multi-scale convolutional neural networks (MSCNN) which can not only extract multi-scale features but also solve the issues of the previous methods which use CNN to extract multi-scale features. With the assumption of label-inheritable (LI) property, we also propose a method to generate exponentially more training examples for MSCNN from the given training set. Our experimental results show that MSCNN outperforms both the state-of-the-art methods and the traditional CNN framework on most of the 8 abstract tasks in our experiment, supporting that MSCNN outperforms the traditional CNN framework on the tasks which at least partially satisfy LI property.
6.1 Introduction
As mentioned in Chapter 4 and Chapter 5, convolutional neural networks (CNN) have already achieved breakthrough performance in recent literature [1, 24, 26, 29, 38]. However, there are two constraints in the traditional CNN framework:
1. CNN extracts the features in only one scale without explicitly leveraging the features in different scales.
2. The performance of CNN highly depends on the amount of training data, which is shown in recent works [15, 82]. Using either the Caltech-101 [48] or Caltech-256 [27] databases, these works [15, 82] modify the number of training examples per class and record the corresponding accuracy. Both works ﬁnd that satisfactory performance can be achieved only when enough training examples are used.
In this chapter, we want to solve the above two issues of the traditional CNN framework. We propose multi-scale convolutional neural networks (MSCNN), a framework of extracting multi-scale features using multiple CNNs. The details of MSCNN are explained in Sec. 6.3. We also propose a method to generate exponentially more training examples for MSCNN from the given training set based on the assumption that the task of interest is label-inheritable (LI). The concept of LI is explained in Sec. 6.2.
In recent studies extracting multi-scale features using CNN, Gong et al. [26] propose multi-scale orderless pooling, and He et al. [29] propose spatial pyramid pooling. Both works show experimental results which support the argument that using multi-scale features outperforms using features in only one scale. In both works, the multi-scale features are extracted by using special pooling methods in the same CNN, so the CNN parameters used to extract multi-scale features are the same before the pooling stage. However, we argue that the features in different scales should be extracted with different sets of CNN parameters learned from the training examples in different scales, which is explicitly modeled in our proposed MSCNN.
In the applications using multiple CNNs, Lu et al. [50] predict pictorial aesthetics by using a double-column convolutional neural network (DCNN) to extract features in two different scales, global view and fine-grained view. They also show that using DCNN outperforms using features in only one scale. The concept of using multiple CNNs is similar in both DCNN and our proposed MSCNN. Nevertheless, the input images for the scale of fine-grained view in DCNN are generated by randomly cropping from the images in the scale of global view. Lu's approach [50] has the following two drawbacks:
1. The randomly cropped image may not well represent the entire image in the scale of ﬁne-grained view.
2. By using only the cropped images in the scale of ﬁne-grained view, the information which is not in the cropped area is ignored.
In MSCNN, instead of throwing away information, we extract the features from every portion of the input image in each scale, which solves the above two drawbacks of DCNN.
In this chapter, we make the following contributions:
1. We propose multi-scale convolutional neural networks (MSCNN) which can extract the features in different scales using multiple convolutional neural networks. We show that MSCNN outperforms not only the state-of-the-art performance on most of the 8 abstract tasks listed in Table 4.1 but also the traditional CNN framework which only extracts features in one scale.
2. We also propose a method to generate exponentially more training examples for MSCNN from the given training set. Under the label-inheritable (LI) assumption, our method generates exponentially more training examples without the need to collect new training examples explicitly. Although our method and MSCNN are designed under the LI assumption, our experimental results support that our proposed framework can still outperform the traditional CNN framework on the tasks which only partially satisfy the LI assumption.

Figure 6.1: The illustration of label-inheritable (LI) property. Given an image database D associated with a task T, if any cropped version of any image I from D can take the same label as that of I, we say that T satisfies LI property. In this figure, we only show the case when the cropped image is the upper right portion of the original image. In fact, the cropped image can be any portion of the original image.
6.2 Label-Inheritable Property
Our proposed MSCNN is designed for the tasks satisfying the label-inheritable property, which is illustrated in Figure 6.1. Given an image database D associated with a task T, if any cropped version of any image I from D can take the same label as that of I, we say that T is label-inheritable (LI). In other words, if the concept represented by the label of any image I in the given database D is also represented in each portion of I, the corresponding task T satisfies LI property.

Figure 6.2: Example images from the three databases (Painting-91 [41], arcDataset [77], and Caltech-101 [48]) associated with three different tasks which satisfy LI property in different degrees. The corresponding database, task, label, and the extent that LI property is satisfied are shown under each image.
Figure 6.2 shows three example images from the three databases (Painting-91 [41], arcDataset [77], and Caltech-101 [48]) associated with three different tasks which satisfy LI property in different degrees. We also list the corresponding database, task, and label under each example image. For any image from the Painting-91 [41] database, each portion of that image is painted by the same artist (Picasso for the leftmost example image in Figure 6.2), so the task "artist style classification" is LI. For the images from the arcDataset [77] database, we can recognize the architectural style from different parts of the architecture (for example, the roof and pillars). However, if the cropped image does not contain any portion of the architecture, the cropped image cannot represent the architectural style of the original image, which is the reason why LI property is only partially satisfied for architectural style classification. For the images from the Caltech-101 [48] database, the LI property is only satisfied when the cropped image contains the entire object which the label represents. If the cropped image contains only a portion of the object, the cropped image may not be able to take the same label. For the rightmost example image in Figure 6.2, the portion showing the mouth cannot fully represent the label "faces." Therefore, we think object classification is mostly not LI.
In this chapter, we apply our proposed framework, MSCNN, to 8 tasks (listed in Table 4.1) satisfying LI property in different degrees, showing that MSCNN outperforms the traditional CNN framework on the tasks satisfying LI property. We also show experimental results supporting that MSCNN can still outperform the traditional CNN framework on the tasks which only partially satisfy LI property. The details of the experimental results are shown in Sec. 6.4.
6.3 Experimental Setup
6.3.1 Databases and Tasks
We conduct experiments on the 8 abstract tasks specified in Table 4.1, where their properties and related statistics are also shown. We use the task ID listed in Table 4.1 to refer to each task. We adopt the same experimental settings for the 3 tasks (AST, ART, and ARC) as those used in the references listed at the bottom of Table 4.1. For the task EMO, we perform 8-way classification instead of the 1-vs-all setting. For the 3 tasks FAS, MEM, and INT, we choose to experiment on only the first 5 folds of training/testing splits for the efficiency of the experiment. For the task ARC, we conduct our experiment with both of the two different experimental settings (10-way and 25-way classification) provided by [77]. For the task AVA, we use the generic training set with 2495 images to shorten the training time.

Figure 6.3: The illustration of our proposed multi-scale convolutional neural networks (MSCNN), which consist of m + 1 AlexNets [46] (one for each of the m + 1 different scales). The details of the MSCNN architecture and the training approach are explained in Sec. 6.3.2 and Sec. 6.3.3 respectively.
6.3.2 MSCNN Architecture
The architecture of our proposed multi-scale convolutional neural networks (MSCNN) is illustrated in Figure 6.3. MSCNN consists of m + 1 AlexNets [46] which extract features in m + 1 different scales (from scale 0 to scale m, one AlexNet [46] per scale). We only show m = 1 in Figure 6.3 for clarity. In each scale, we use the AlexNet [46] as the feature extractor which takes an image as input and outputs a 4096-d feature vector. The 4096-d feature vector is extracted from the topmost fully connected layer of the AlexNet [46].
Given an input image, we extract its multi-scale features according to the following steps:
1. In scale i, we place a 2^i × 2^i grid on the input image and generate 4^i cropped images.
2. We resize each of the 4^i cropped images in scale i to a pre-defined image size k × k.
3. Each of the 4^i cropped resized images takes its turn (in the order of left to right, top to bottom) to be the input of the AlexNet [46] in scale i. In other words, the AlexNet [46] in scale i extracts features 4^i times, with one of the 4^i cropped resized images as input each time. After that, we generate the feature vector in scale i by concatenating those 4^i 4096-d feature vectors in order.
4. The final multi-scale feature vector is formed by concatenating the feature vectors of all m + 1 scales in order, and the dimension of the multi-scale feature vector is (4^{m+1} − 1)/3 × 4096.
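The grid cropping and the resulting feature dimension in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names grid_boxes and multiscale_dim are hypothetical, boxes are given as (left, top, right, bottom) pixel coordinates, and the AlexNet feature extraction itself is omitted.

```python
def grid_boxes(width, height, i):
    """Return the 4**i crop boxes of a 2**i x 2**i grid on an image.

    Boxes are (left, top, right, bottom) tuples, ordered left to right,
    top to bottom, matching the concatenation order of step 3.
    """
    n = 2 ** i
    boxes = []
    for row in range(n):
        for col in range(n):
            left = col * width // n
            right = (col + 1) * width // n
            top = row * height // n
            bottom = (row + 1) * height // n
            boxes.append((left, top, right, bottom))
    return boxes


def multiscale_dim(m, per_crop_dim=4096):
    """Dimension of the feature vector concatenated over scales 0..m.

    The sum 4**0 + 4**1 + ... + 4**m equals (4**(m + 1) - 1) / 3.
    """
    return sum(4 ** i for i in range(m + 1)) * per_crop_dim
```

For example, multiscale_dim(1) corresponds to 5 crops (1 in scale 0 and 4 in scale 1), each contributing a 4096-d vector.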
For the AlexNet [46] in each scale, we use the Caffe [32] implementation except that the number of nodes in the output layer is set to the number of classes in each task listed in Table 4.1. For the two regression tasks MEM and INT in Table 4.1, we change the number of nodes in the output layer to 1 and replace the softmax loss with the Euclidean loss. When using the Caffe [32] implementation, we adopt its default training parameters for training the AlexNet [46] for ImageNet [11] classification unless otherwise specified. The pre-defined image size k × k is set to 256×256 according to the Caffe [32] implementation. The details of the training approach and how to make predictions are explained in Sec. 6.3.3.
6.3.3 Training Approach
Before training, we generate a training set S_i for each scale i from the original training set S of the task of interest T with the assumption that T satisfies LI property. We take the following steps:
1. We place a 2^i × 2^i grid on each image of S, crop out 4^i sub-images, and resize each cropped image to k × k.
2. Due to LI property, each cropped resized image is assigned the same label as that of the original image from which it is cropped.
After the above two steps, the generated training set S_i consists of 4^i × |S| labeled images (|S| is the number of images in S), and the size of each image is k × k. We follow the Caffe [32] implementation, using k = 256.
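The generation of the per-scale training set under the LI assumption can be sketched as below. This is an illustrative sketch, not the code used in the experiments: the function name build_scale_training_set and the record layout are hypothetical, images are represented only by their identifier and size, and the actual cropping and resizing of pixel data is left out.

```python
def build_scale_training_set(samples, i, k=256):
    """Generate the labeled crop records of scale i from a training set.

    samples: list of (image_id, width, height, label) tuples.
    Returns one record per crop; each record inherits the label of the
    image it is cropped from (the LI assumption) and notes the k x k
    target size to which the crop would be resized.
    """
    n = 2 ** i
    records = []
    for image_id, width, height, label in samples:
        for row in range(n):
            for col in range(n):
                box = (col * width // n, row * height // n,
                       (col + 1) * width // n, (row + 1) * height // n)
                records.append({
                    "image": image_id,
                    "box": box,           # (left, top, right, bottom)
                    "resize_to": (k, k),
                    "label": label,       # inherited from the source image
                })
    return records
```

With |S| input images, the returned list has 4^i × |S| records, which matches the size of the training set in scale i stated above.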
In the training phase, adopting the Caffe reference model [32] (denoted as M_ImageNet) for the ImageNet [11] classification, we train the AlexNet [46] in scale i using the training set S_i. We follow the descriptions and settings of supervised pre-training and fine-tuning used in Agrawal's work [1], where pre-training with M_D means using a data-rich auxiliary database D to initialize the CNN parameters and fine-tuning means that all the CNN parameters can be updated by continued training on the corresponding training set. For the AlexNet [46] in each scale i, we pre-train it with M_ImageNet and fine-tune it with S_i. After training all the m + 1 AlexNets [46] in m + 1 scales, we extract the (4^{m+1} − 1)/3 × 4096-d multi-scale feature vector of each training image in S using the method described in Sec. 6.3.2. With these multi-scale feature vectors for training, we use support vector machine (SVM)/support vector regression (SVR) to train a linear classifier/regressor supervised by the ground truth of the images in S. In practice, we use LIBSVM [8] to do so with the cost (parameter C in SVM/SVR) set to the default value 1. Having tried different C values, we find that they produce similar accuracy, so we use the default value.
In the testing phase, given a testing image, we generate its (4^{m+1} − 1)/3 × 4096-d multi-scale feature vector using the method described in Sec. 6.3.2. After that, we feed the feature vector of the testing image as the input of the trained SVM classifier/SVR regressor, and the output of the SVM classifier/SVR regressor is the prediction for the testing image.
Compared with the traditional CNN framework (m = 0), MSCNN has m + 1 times as many parameters to train. However, under the LI assumption, we can generate exponentially more training examples (4^i × |S| training examples in scale i) from the original training set S without the need to explicitly collect new training examples.
For any two training examples I_i and I_j from S_i and S_j respectively (i < j), the overlapping content cannot exceed a quarter of the area of I_i. Therefore, under the LI assumption, our training approach not only generates exponentially more training examples but also guarantees certain diversity among the generated training examples.
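The quarter-area bound on overlapping content can be checked numerically for crops of the same source image. The sketch below is illustrative only; it assumes grid cells on a unit square, uses exact rational arithmetic, and the function names are hypothetical.

```python
from fractions import Fraction


def cells(i):
    """Grid cells of scale i on the unit square as (left, top, right, bottom)."""
    n = 2 ** i
    return [(Fraction(c, n), Fraction(r, n), Fraction(c + 1, n), Fraction(r + 1, n))
            for r in range(n) for c in range(n)]


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned boxes (0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else Fraction(0)


def max_overlap_ratio(i, j):
    """Largest fraction of a scale-i crop covered by a single scale-j crop."""
    cell_area = Fraction(1, 4 ** i)
    return max(overlap_area(a, b) for a in cells(i) for b in cells(j)) / cell_area
```

Because a scale-j cell is nested inside exactly one scale-i cell when i < j, the ratio equals 4^(i − j), which is at most 1/4 and is reached when j = i + 1.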
In terms of the training time, since S_i contains 4^i × |S| training examples, the time needed to train the CNN in scale i is 4^i × t_0, where t_0 is the training time of the traditional CNN framework. The training time of the linear SVM classifier/SVR regressor is far less than the training time of the AlexNet [46], so it is negligible in the total training time. If we train the m + 1 AlexNets [46] in m + 1 different scales in parallel, the total training time of MSCNN will be 4^m × t_0.
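The training-time estimate can be written out as a short calculation. This sketch assumes, as the text does, that training time scales linearly with the number of training examples; the function names are hypothetical, and the sequential total is an added derivation (the geometric sum of the per-scale times), not a figure reported in the text.

```python
def scale_training_time(i, t0):
    """Estimated time to train the CNN in scale i, given baseline time t0."""
    return (4 ** i) * t0


def total_training_time(m, t0, parallel=True):
    """Estimated total time to train the m + 1 CNNs of scales 0..m.

    parallel=True:  all scales train at once, so the slowest scale (scale m)
                    dominates and the total is 4**m * t0.
    parallel=False: scales train one after another, so the total is the sum
                    (4**(m + 1) - 1) / 3 * t0 (derived here, not from the text).
    """
    times = [scale_training_time(i, t0) for i in range(m + 1)]
    return max(times) if parallel else sum(times)
```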
When training CNN, the traditional method implemented by Caffe [32] resizes each training image to a pre-defined resolution because of the hardware limitation of GPU. Due to the resizing, part of the information in the original high-resolution training image is lost. In MSCNN, more details in the original high-resolution training image are preserved in scale i (i > 0) because all the training examples in S_i (∀i ∈ {0, 1, ..., m}) are resized to the same resolution, so a crop covering only 1/4^i of the original image is represented with as many pixels as the whole image in scale 0. The extra information preserved in scale i (i > 0) makes MSCNN outperform the traditional CNN framework on the tasks satisfying LI property, which is shown in Sec. 6.4.
Our proposed MSCNN extracts the features in different scales with scale-dependent CNN parameters. In other words, the CNN parameters in scale i are specifically learned from the training images in scale i (the images in S_i). This training approach solves the issue of previous works [26, 29] which simply generate the multi-scale features from the same set of CNN parameters which may not be the most suitable one for each scale. In addition, MSCNN extracts the features from every portion of the image in each scale, which solves the drawback of Lu's work [50] which does not fully utilize the information of the entire image in the scale of fine-grained view. Applying our proposed MSCNN to the 8 tasks listed in Table 4.1, we report the performance in Sec. 6.4.
task ID  evaluation metric  previous work       MSCNN-0 (baseline) [62]  MSCNN-1        MSCNN-2        MSCNN-3
EMO      accuracy (%)       n/a                 36.23                    37.35          37.47          37.72
AST      accuracy (%)       53.10 [41]          55.15                    58.11          57.91          n/a
ART      accuracy (%)       62.20 [41]          67.37                    69.67          70.96          67.74
AVA      accuracy (%)       n/a                 69.42                    71.09          72.30          72.75
FAS      accuracy (%)       n/a                 76.96                    79.78          79.35          n/a
ARC      accuracy (%)       69.17 / 46.21 [77]  70.64 / 54.84            74.82 / 58.89  75.32 / 59.13  n/a
MEM      ρ                  n/a                 0.40                     0.39           n/a            n/a
INT      ρ                  n/a                 0.57                     0.60           0.62           0.63

Table 6.1: The summary of our experimental results. The bold numbers represent the best performance for each task. The results show that for most of the 8 tasks, our proposed multi-scale CNN features (MSCNN-1 to MSCNN-3) outperform not only the best known results from prior works but also the traditional CNN framework (MSCNN-0).

6.4 Experimental Results

Using the experimental setting specified in Sec. 6.3, we compare the performance of MSCNN with that of the traditional CNN framework (m = 0) and that of the state-of-the-art methods. We use the notation "MSCNN-m" to represent the MSCNN framework consisting of m + 1 AlexNets [46] in m + 1 different scales (scale 0 to scale m). MSCNN-0 represents the traditional CNN framework which extracts the features in only one scale. The results are summarized in Table 6.1.
Table 6.1 shows that our proposed methods (MSCNN-1 to MSCNN-3) outperform both the current state-of-the-art method and the traditional CNN framework (MSCNN-0) on most of the 8 tasks. Since the tasks ART and AST satisfy LI property, the result supports that MSCNN outperforms the traditional CNN framework on the tasks which are LI. MSCNN also outperforms the traditional CNN framework on the task ARC, which only partially satisfies LI property.
Table 6.1 also shows that increasing the number of scales in MSCNN does not always improve the performance. This is reasonable because the training images in scale i (the images in S_i) are less likely to contain useful features as i increases. Therefore, the performance improves the most when we increase the number of scales in MSCNN to a certain level. Although the dimension of the multi-scale CNN features increases with the number of scales in MSCNN (theoretically giving the image descriptor more describing ability), high-dimensional feature vectors are more likely to suffer from overfitting. The results in Table 6.1 reflect the trade-off between the describing ability and the overfitting issue of the feature vectors. We think that this trade-off and the database/task bias are the two major reasons explaining why the best performance for each task is achieved by the MSCNN with different numbers of scales. For most of the 8 tasks in our experiment, MSCNN-1, MSCNN-2, and MSCNN-3 (if applicable) outperform MSCNN-0, which supports that using multi-scale features is better than using the features in one scale.
Based on our experimental results, we show that MSCNN outperforms the traditional CNN framework on most of the 8 abstract tasks in Table 4.1. In addition, our results indicate that under the assumption that LI property holds, there is a most suitable number of scales for MSCNN such that the performance can increase the most.
6.5 Conclusion
We addressed the label-inheritable (LI) property and proposed a novel framework, multi-scale convolutional neural networks (MSCNN). We also proposed a training method for MSCNN under the LI assumption such that exponentially more training examples can be generated without the need to collect new training data. MSCNN not only solves the issues of the previous methods which use CNN to extract multi-scale features but also outperforms both the state-of-the-art methods and the traditional CNN framework on most of the 8 abstract tasks in our experiment. Our results support that MSCNN can outperform both the state-of-the-art method and the traditional CNN framework on the tasks satisfying (even partially) LI property.
CHAPTER 7 FUSING MULTI-DEPTH, MULTI-SCALE, AND MULTI-TASK FEATURES
7.1 Introduction
In Chapter 4, Chapter 5, and Chapter 6, we present the efficacy of our proposed multi-task, multi-depth, and multi-scale CNN features respectively on the abstract tasks listed in Table 4.1, showing that each of our proposed features can outperform the state-of-the-art performance on the abstract tasks in our experiment. In this chapter, our goal is to fuse our proposed multi-task, multi-depth, and multi-scale CNN features such that the performance can be further improved. For most of the prior works about abstract tasks, including the references listed in Table 4.1, the common approach to fuse different features seems to be the sum of weighted feature values. In practice, this approach is usually implemented by simple concatenation of different features followed by training a linear support vector machine (SVM), which can be observed in [31, 41]. Although this naive approach is easy to implement, our experimental results about the multi-task features in Chapter 4 support that simple concatenation may not be the best method to fuse different features. Because our proposed multi-task, multi-depth, and multi-scale features are all based on CNN, we look for feature fusion methods based on CNN. Among the previous works involving fusing different CNN-based features [40, 58, 75], one common method is to construct additional fully connected networks with different feature sources as the input. We adopt this approach to fuse our proposed CNN features. The details of the fully connected networks used for feature fusion and the experimental results are presented in Sec. 7.2 and Sec. 7.3 respectively.
Figure 7.1: The fully connected networks we use to fuse our proposed multi-task, multi-depth, and multi-scale CNN features.
7.2 Feature Fusion Using Fully Connected Networks
7.2.1 Network Architecture
Inspired by [40, 58, 75], we use the fully connected networks illustrated in Figure 7.1 to fuse our proposed multi-task, multi-depth, and multi-scale CNN features. The input layer is the concatenation of the three feature sources. Each of the two intermediate fully connected layers has 1024 nodes, which is set empirically to seek the balance between feature describing ability and avoiding overﬁtting. On top of the two intermediate fully connected layers, we add an output layer with the number of nodes set to be the number of classes of the task of interest. We believe that using the fully connected networks to fuse different features is better than using simple concatenation because the former method is able to use higher-level intermediate feature representation to make prediction.
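The forward pass of such a fusion network can be sketched in plain Python as follows. This is a minimal sketch under assumptions that the text does not state: ReLU activations in the two intermediate layers, a softmax over the output layer, and a simple uniform random initialization. All function names are hypothetical, and the trained Caffe model is not reproduced here.

```python
import math
import random


def dense(x, weights, biases):
    """Fully connected layer: one weight row and one bias per output node."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]


def init_layer(n_in, n_out, rng):
    """Random weights (assumed initialization) and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_fusion_net(input_dim, num_classes, hidden=1024, seed=0):
    """Two intermediate layers of `hidden` nodes plus an output layer."""
    rng = random.Random(seed)
    sizes = [input_dim, hidden, hidden, num_classes]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def fuse_and_predict(feature_sources, net):
    """Concatenate the feature sources and return class probabilities."""
    x = [v for source in feature_sources for v in source]
    for weights, biases in net[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = net[-1]
    return softmax(dense(x, weights, biases))
```

A small instance such as build_fusion_net(input_dim=12, num_classes=3, hidden=8) is enough to trace the data flow; the network described above uses hidden=1024 and an input dimension equal to the total length of the concatenated features.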
In this chapter, we experiment on 3 abstract tasks (AST, ART, and ARC) specified in Table 4.1. For the task ARC, we experiment on the setting of 25-way classification. In practice, we use the Caffe [32] implementation to train the fully connected networks in Figure 7.1 from scratch. When using the Caffe [32] implementation, we adopt its default training parameters for training the AlexNet [46] for ImageNet [11] classification unless otherwise specified.
7.2.2 Fused Features
Because the publicly available databases for the 3 abstract tasks in our experiment only contain limited amount of data, we simplify our experimental setting such that the number of parameters which need to be trained is manageable. In practice, we reduce the number of parameters by limiting the source features we intend to fuse. In our experiment, we concatenate the following features as the input of the fully connected layer in Figure 7.1:
1. The default AlexNet [46] features extracted from its topmost fully connected layer (the CNN baseline). This feature is the same as the features in scale 0 of the multi-scale features in Figure 6.3, or the features f0 extracted from CNN0 of the multi-depth features in Figure 5.1.
2. The features f1 extracted from CNN1 of the multi-depth features in Figure 5.1.
3. The features in scale 1 of the multi-scale features in Figure 6.3.
4. The multi-task features learned from the other 2 tasks. For example, if the task of interest is AST, the features extracted from the two AlexNets [46] trained with ART and ARC are the corresponding multi-task features.
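To make the concern about the number of trainable parameters concrete, the parameter count of the fusion network can be estimated from the dimensions of the source features. The sketch below is illustrative; the function name is hypothetical, and the example dimensions in the usage note are assumptions, since the dimension of f_1 (given in Table 5.1) and of the multi-task features are not restated in this section.

```python
def fusion_param_count(source_dims, num_classes, hidden=1024):
    """Weights plus biases of a network with two hidden layers of `hidden` nodes.

    source_dims: dimensions of the concatenated feature sources.
    """
    input_dim = sum(source_dims)
    sizes = [input_dim, hidden, hidden, num_classes]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
```

As a usage example under assumed dimensions, fusion_param_count([4096, 4096, 4 * 4096, 2 * 4096], 25) treats every AlexNet output as 4096-d, the scale-1 features as 4 × 4096-d, and the multi-task features as two 4096-d vectors, for the 25-way ARC setting. The first layer dominates the count, which is why limiting the fused sources keeps the network manageable.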
features \ task ID             AST            ART            ARC
CNN baseline                   55.15          67.37          54.84
multi-depth CNN features [60]  56.40 [87.45]  69.21 [90.82]  55.57 [82.99]
multi-scale CNN features [61]  58.11 [99.64]  70.96 [99.50]  59.13 [∼100]
multi-task CNN features [62]   57.51 [98.39]  71.05 [99.59]  55.38 [75.82]
fused CNN features             59.47 [∼100]   72.24 [99.98]  59.33 [∼100]

Table 7.1: The performance comparison between each of our proposed features and the fused features. The numbers are reported in the format "accuracy (%) [confidence level (%)]." The confidence level represents the confidence of the feature outperforming the CNN baseline according to the binomial test. The italic numbers are the performance exceeding the 95% confidence level (statistically significant results).

We follow the methods introduced in Chapter 4, Chapter 5, and Chapter 6 to generate the multi-task, multi-depth, and multi-scale CNN features and concatenate them for each image. After that, we use the concatenated feature vectors of the training data to train the fully connected networks in Figure 7.1. At testing time, we feed the concatenated feature vector of each testing image as the input of the fully connected networks in Figure 7.1, and the output is the predicted label. Applying the feature fusion method in Figure 7.1 to the 3 tasks, we report the performance in Sec. 7.3.

7.3 Experimental Results
We compare the performance of each of our proposed CNN features with that of the fused features in Table 7.1, where the numbers are reported in the format "accuracy (%) [confidence level (%)]." The confidence level represents the confidence of the feature outperforming the CNN baseline according to the binomial test. The italic numbers in Table 7.1 are the performance passing the 95% confidence level (statistically significant results).
Of the three proposed features, most of the results of the multi-scale and multi-task features are statistically significant, but the results of the multi-depth features do not pass the 95% confidence level. We believe the reason is that the amount of training data is important for the performance, as shown in [15, 82]. The multi-scale features are learned from the augmented training data generated by cropping the original training images. The multi-task features implicitly utilize the training data from other tasks by extracting the features learned from other tasks. However, the multi-depth features are learned from only the original training data without extra training images, which may be why the improvement brought by the multi-depth features is less significant than that brought by the multi-scale and multi-task features. Table 7.1 shows that the fused features outperform not only the CNN baseline but also each of our proposed features, and the performance improvement brought by the fused features is statistically significant.
7.4 Feature Fusion for Predicting Emotion Distributions
7.4.1 Motivation
Encouraged by the promising results of the feature fusion in Sec. 7.3 and the multi-task features in Chapter 4, we want to explore these ideas in predicting emotion distributions and predicting the emotion stimuli map (ESM) introduced in Chapter 2 and Chapter 3 respectively. Specifically, we want to know whether the features learned from predicting the ESM can be used to improve the performance of predicting emotion distributions. To verify this idea, we predict emotion distributions using our proposed combined networks illustrated in Figure 7.2, where only convolution, deconvolution, and fully connected layers are shown for clarity.
Figure 7.2: The illustration of our proposed combined networks for predicting emotion distributions. We only show convolution, deconvolution, and fully connected layers for clarity.
7.4.2 Proposed Combined Networks
Figure 7.2 illustrates our method in two steps. In the first step (Figure 7.2 left), we train two networks, N1 and N2, individually for predicting emotion distributions and the ESM respectively. N2 is trained with the fully convolutional networks (the FCNEL in Sec. 3.5) using the training set of the EmotionROI database [67] introduced in Chapter 3. N1 is trained with the AlexNet [46] architecture using the supervision of S_train, where S_train is the training set of the Emotion6 database [66] introduced in Chapter 2. Similar to our prior work [66], we pre-train N1 with the Caffe reference model [32] trained for ImageNet classification [11] and fine-tune N1 with S_train. However, unlike the method CNN in our original work [66], which uses a softmax loss layer, we use a sigmoid cross entropy loss layer in N1 because, according to our user study [66], different emotions are not mutually exclusive. We also do not use a Euclidean loss layer for each emotion category, as the method CNNR in our original work [66] does, because we prefer training one model which directly predicts the 7-D emotion distribution evoked by the input image to training 7 models (each of which predicts the probability of one emotion category). The base learning rate used to train N1 is empirically set to 10^-4.
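To make the choice of loss concrete, the following is a minimal sketch of a sigmoid cross entropy loss evaluated on one 7-D target. It is an illustration only: the function names and the target values are hypothetical, and the numerically stable formulation used here is a common one rather than a transcription of the Caffe layer. Each emotion category contributes its own independent binary term, so the categories do not compete with each other as they would under a softmax.

```python
import math

def sigmoid_cross_entropy(logits, targets):
    """Sum of independent per-category binary cross entropy terms.

    Targets may be soft values in [0, 1]. The expression is the
    numerically stable form of
    -t*log(sigmoid(z)) - (1 - t)*log(1 - sigmoid(z)).
    """
    loss = 0.0
    for z, t in zip(logits, targets):
        loss += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return loss

def logit(p):
    """Inverse of the sigmoid, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

# Hypothetical 7-D ground truth emotion distribution (sums to 1).
target = [0.10, 0.05, 0.30, 0.05, 0.20, 0.10, 0.20]

# All-zero logits predict 0.5 for every category.
loss_uninformed = sigmoid_cross_entropy([0.0] * 7, target)

# Logits whose sigmoid equals the target give the smallest possible loss.
loss_matched = sigmoid_cross_entropy([logit(t) for t in target], target)
```

With all-zero logits every term equals log 2 regardless of the target, so `loss_uninformed` is 7 log 2, and the loss decreases as each sigmoid output moves toward its own target value.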
In the second step (Figure 7.2 right), we form the architecture of the combined networks, N3, using those of N1 and N2. Specifically, we adopt the architecture from conv-1 to fc-7 in N1 and the architecture from conv-1 to conv-5 in N2, forming the left and right streams of N3, respectively. Because N2 does not have fc-6 and fc-7 layers, we add fc-6 and fc-7 to the right stream of N3 using the same architecture as their left-stream counterparts. This design makes N3 unbiased toward either stream in the architecture. The two streams are fed with the same input, and their outputs are concatenated by the concat layer. The output layer of N3 is fully connected to the concat layer such that a 7-D vector is predicted. We also use sigmoid cross entropy loss to train N3. When training N3, we pre-train the portions enclosed by the two rectangles in Figure 7.2 with the trained N1 and N2 such that the left stream and the right stream capture the features learned from predicting emotion distributions and the ESM respectively. After pre-training, we fix the parameters in the gray area marked in Figure 7.2 and fine-tune the parameters outside the gray area using S_train. This fine-tuning strategy lets the parameters in both streams update jointly, which is inspired by Jung et al. [37]. This strategy is also consistent with the feature fusion method introduced in this chapter, which adds fully connected layers on top of different feature sources. For other training details, we use the same setting for both N1 and N3 unless otherwise specified. In the testing phase, the 7-D vector output by N3 is normalized such that it forms a valid probability distribution representing the emotion distribution evoked by the input image.
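The last stage of N3 at testing time can be sketched as below. This is a simplified illustration, not our implementation: the 4-D vectors standing in for the fc-7 activations of the two streams, the weight values, and the helper names are hypothetical, and applying a sigmoid before the normalization is an assumption made here so that the scores are positive. Training, including which parameters are fixed, is not shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output_layer(left_fc7, right_fc7, weights, biases):
    """Concat layer followed by the fully connected output layer."""
    concat = list(left_fc7) + list(right_fc7)
    return [sum(w * v for w, v in zip(row, concat)) + b
            for row, b in zip(weights, biases)]

def to_distribution(scores):
    """Rescale positive scores so that they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical toy activations standing in for the fc-7 outputs of each stream.
left_fc7 = [0.3, 0.0, 1.2, 0.7]
right_fc7 = [0.9, 0.1, 0.0, 0.4]

# Hypothetical deterministic weights for a 7-D output over the 8-D concat vector.
weights = [[0.05 * (k + 1) * (j - 3) for j in range(8)] for k in range(7)]
biases = [0.0] * 7

logits = output_layer(left_fc7, right_fc7, weights, biases)
distribution = to_distribution([sigmoid(z) for z in logits])
```

The resulting `distribution` has 7 non-negative entries that sum to 1, which is the form required by the evaluation metrics in Sec. 7.4.3.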
7.4.3 Experimental Results
Using the same 1386/594 training/testing images and 4 evaluation metrics used in our prior work [66], we compare the performance of our combined networks with the best result reported in [66]. The 4 evaluation metrics calculating the closeness of the predicted and ground truth emotion distributions are KL-divergence (KLD), Bhattacharyya coefficient (BC), Chebyshev distance (CD), and earth mover’s distance (EMD). We report the mean value of each metric M (M ∈ {KLD, BC, CD, EMD}) over the testing images in Table 7.2. For BC, higher is better. For the other 3 metrics, lower is better. We also conduct pairwise comparison between our combined networks and our prior method [66]. For each testing image, we compare the scores produced by our prior method [66] and our combined networks, and tabulate which method produced the distribution closer to the ground truth in Table 7.2, where our combined networks not only achieve better mean values in all 4 metrics but also are more likely to produce an emotion distribution closer to the ground truth with ≥97.7% confidence according to the binomial test. Our result supports the hypothesis that using the features learned from predicting the ESM can improve the performance of predicting emotion distributions, which is consistent with the ideas of our proposed multi-task features and the method of fusing features.

method                           KLD     BC      CD      EMD
Peng’s work [66]                 0.480   0.847   0.265   0.503
combined networks (CN)           0.474   0.852   0.262   0.496

pairwise comparison metric       KLD     BC      CD      EMD
wins for CN                      333     330     321     323
wins for Peng’s method [66]      261     264     273     271
confidence (CN is better) (%)    99.8    99.7    97.7    98.5

Table 7.2: Performance comparison in predicting emotion distributions. Our combined networks predict more accurate emotion distributions with ≥97.7% confidence compared with Peng’s method [66].
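The evaluation protocol above can be sketched as follows. This is a hedged illustration rather than our evaluation code: the function names, the smoothing constant `EPS`, the direction of the KL-divergence (ground truth relative to prediction), and the exact one-sided form of the binomial test are all assumptions, so the confidence computed here need not reproduce the figures in Table 7.2 to the last digit. EMD is omitted because it requires a ground distance between emotion categories, which is not restated in this section.

```python
import math

EPS = 1e-10  # assumed smoothing constant; avoids log(0) and division by zero

def kld(truth, pred):
    """KL-divergence of the ground truth from the prediction (lower is better)."""
    return sum(p * math.log((p + EPS) / (q + EPS)) for p, q in zip(truth, pred))

def bc(truth, pred):
    """Bhattacharyya coefficient (higher is better, 1 for identical inputs)."""
    return sum(math.sqrt(p * q) for p, q in zip(truth, pred))

def cd(truth, pred):
    """Chebyshev distance: the largest per-category difference (lower is better)."""
    return max(abs(p - q) for p, q in zip(truth, pred))

def pairwise_wins(scores_a, scores_b, higher_is_better=False):
    """Count, per testing image, which method is closer to the ground truth.

    Ties are counted for neither method (an assumption).
    """
    wins_a = wins_b = 0
    for a, b in zip(scores_a, scores_b):
        if a == b:
            continue
        a_is_better = a > b if higher_is_better else a < b
        if a_is_better:
            wins_a += 1
        else:
            wins_b += 1
    return wins_a, wins_b

def binomial_confidence(wins_a, wins_b):
    """1 - P(X >= wins_a) for X ~ Binomial(wins_a + wins_b, 0.5).

    This is one plausible one-sided test of whether method A wins more often
    than chance; the null hypothesis is that both methods are equally likely
    to win on any image.
    """
    n = wins_a + wins_b
    tail = sum(math.comb(n, k) for k in range(wins_a, n + 1))
    return 1.0 - tail / 2 ** n

# Hypothetical 3-D example distributions (the real ones are 7-D).
truth = [0.5, 0.3, 0.2]
pred_a = [0.45, 0.35, 0.2]
pred_b = [0.2, 0.2, 0.6]

# Win counts taken from the KLD column of Table 7.2.
confidence_kld = binomial_confidence(333, 261)
```

For the KLD column, the 333 wins out of 594 comparisons give a confidence above 99% under this variant of the test.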

CHAPTER 8 CONCLUSIONS AND FUTURE WORK
Abstract tasks, the tasks involving general ideas or qualities, are an important subject in computer vision. In this thesis, we study the abstract tasks in 8 different domains which are proposed in recent literature. The first two abstract tasks we study are predicting emotion distributions and predicting emotion stimuli maps. Both tasks are proposed by us to better model the diversity of emotions evoked in humans and the emotionally involved areas of images. For both tasks, we build the associated databases, Emotion6 and EmotionROI, and present the state-of-the-art results based on convolutional neural networks (CNN) and fully convolutional networks.
Inspired by the encouraging results in predicting emotion distributions and predicting emotion stimuli maps, we explore the possibility of using CNN to solve general abstract tasks. Experimenting on the abstract tasks in 8 different domains, we find that for all the abstract tasks, using CNN achieves better performance than the results reported in the previous works, which mostly rely on standard handcrafted features. This finding supports the view that CNN is a promising approach to solving general abstract tasks. Based on this finding, we design the multi-task, multi-depth, and multi-scale CNN features to achieve better benchmarks in abstract tasks. Our three proposed features improve the performance of abstract tasks from different aspects. Multi-task CNN features consist of the features learned from the training data of more than one task, achieving better image representation. Multi-depth features are learned by multiple CNNs with different neural network architectures, utilizing both high-level and low-level image features. Multi-scale features are learned to capture more detailed information from the augmented training data in different scales. The experimental results show that all our proposed CNN features outperform the traditional CNN framework.
To exploit the performance gain brought by all the proposed features simultaneously, we use additional fully connected networks to fuse them. We show that the fused features achieve better performance than using each of our proposed features. In addition, we revisit the two emotion-related tasks we propose and show that the features learned from predicting emotion stimuli maps can be beneficial to predicting emotion distributions, which supports the ideas of multi-task features and feature fusion.
8.1 Future Work: CNN Visualization for Abstract Tasks
Given the promising results of our proposed CNN-based features, I would like to study what is learned by the CNN in abstract tasks in the future. One of the approaches is to visualize the features captured by the trained neural networks. To the best of my knowledge, visualizing the features learned by neural networks is still a relatively immature research area, and there is no consensus on the best method for feature visualization. Most of the previous works [55, 81] studying the visualization of neural networks focus on classical object or scene classification tasks. The feature visualization in abstract tasks is often ignored, which motivates me to research this subject.
(a) natural image prior [81]   (b) center-biased [55]   (c) max activation [54]
Figure 8.1: The visualization of the output node representing the class “high aesthetic value” in the task AVA in Table 4.1. The 4 images in (a)/(b) are the results generated by Yosinski’s [81]/Nguyen’s [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where all the 9 training images are correctly labeled by the trained CNN.
In recent works about the visualization of neural networks, Yosinski et al. [81] use the natural image prior in their method, preventing the pixels with extreme intensities from appearing in the generated visualization results. Yosinski’s method [81] tends to generate repetitive objects all over the visualized image when it is used to visualize the learned CNN output nodes in object classification tasks. To solve this issue, Nguyen et al. [55] introduce the center-biased regularization, encouraging the visualized image to contain only one object in the center. Both of these visualization methods [55, 81] use a random image as input and intend to form the relevant object(s) in the visualized image. In this section, I adopt these two state-of-the-art visualization methods [55, 81] to generate the preliminary results of feature visualization in abstract tasks. Specifically, I revisit the CNN trained with the approach “pt ImageNet + ft” introduced in Table 4.2 in Chapter 4, and visualize one specific node of the output layer at a time.
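The general idea shared by both methods, starting from a random image and modifying it to increase the activation of one output node, can be sketched as below. This toy example is an assumption-laden stand-in and does not implement the regularizers of [81] or [55]: the "network" is a hypothetical linear score over 4 pixels, the weight decay is a generic regularizer, and clamping to [0, 1] merely mimics the goal of keeping pixel intensities from becoming extreme.

```python
import random

def activation(image, weights):
    """Toy output node: a linear score over pixels (stand-in for a CNN output)."""
    return sum(w * x for w, x in zip(weights, image))

def visualize_node(weights, steps=200, lr=0.1, decay=0.01, seed=0):
    """Gradient ascent on the input image for one output node."""
    rng = random.Random(seed)
    image = [rng.random() for _ in weights]  # random starting image in [0, 1)
    start = list(image)
    for _ in range(steps):
        # For a linear score, the gradient w.r.t. each pixel is its weight.
        image = [x + lr * w - lr * decay * x for x, w in zip(image, weights)]
        # Keep every pixel in the valid intensity range.
        image = [min(1.0, max(0.0, x)) for x in image]
    return start, image

weights = [1.0, -1.0, 0.5, -0.5]  # hypothetical node weights
start, result = visualize_node(weights)
```

With a real CNN the gradient would come from backpropagation to the input, and the regularizer is what distinguishes one visualization method from another.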
Figure 8.1 is the visualization of the output node representing the class “high aesthetic value” in the task AVA in Table 4.1. The 4 images in (a)/(b) are the results generated by Yosinski’s [81]/Nguyen’s [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where the surrounding green/red color code means that the image is correctly/incorrectly labeled by the trained CNN. Figure 8.1 (a) and (b) do not look like typical objects even though both methods [55, 81] are designed to form object(s), which indicates that more suitable visualization methods for abstract tasks are needed. We think that Figure 8.1 (a) and (b) seem to capture two common characteristics of images with high aesthetic values: satisfying the rule of thirds and having a shallow depth of focus.
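The "max activation" panel in (c) amounts to a top-k selection over the training set, which can be sketched as follows. The image identifiers, activation values, and function names are hypothetical and serve only to illustrate the selection and the green/red color coding.

```python
import heapq

def top_activating(activations, k=9):
    """Return the k (image_id, activation) pairs with the largest activation."""
    return heapq.nlargest(k, activations.items(), key=lambda item: item[1])

def color_code(predicted_label, true_label):
    """Green border for a correctly labeled image, red otherwise."""
    return "green" if predicted_label == true_label else "red"

# Hypothetical activations of one output node on 12 training images.
activations = {"img%02d" % i: 0.1 * i for i in range(12)}
top9 = top_activating(activations)
```

Here `top9` holds the 9 images with the highest activation in descending order, starting with `img11`.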
We also use the two visualization methods [55, 81] to visualize our proposed multi-task, multi-depth, and multi-scale features. Figures 8.2, 8.3, and 8.4 are displayed in a format similar to that of Figure 8.1. Figure 8.2 is the visualization of the output node representing the class “Rembrandt van Rijn” in the task AST in Table 4.1, showing that using the multi-task features (bottom row (c)) corrects the misclassification made by using self features only (top row (c)). Figure 8.3 visualizes the output node representing the class “Gothic architecture” in the task ARC in Table 4.1, showing that the multi-depth features (bottom row) capture the features learned from the two CNNs with different depths (top two rows). Figure 8.4 is the visualization of the output node representing the class “Russian Revival architecture” in the task ARC in Table 4.1, showing the critical features in two different scales. Although Figures 8.2, 8.3, and 8.4 seem to be consistent with the performance improvement brought by our proposed multi-task, multi-depth, and multi-scale features, detailed analysis with more advanced visualization methods is needed to figure out what is captured by the CNN in abstract tasks, which is my next research direction.

Rows: features learned from artist style classification; features learned from artist style and artistic style classification
Columns: (a) natural image prior [81]; (b) center-biased [55]; (c) max activation [41]
Figure 8.2: The visualization of the multi-task features. The visualized output node represents the class “Rembrandt van Rijn” in the task AST in Table 4.1. The 4 images in (a)/(b) are the results generated by Yosinski’s [81]/Nguyen’s [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where green/red color code means that the image is correctly/incorrectly labeled by the trained CNN.

Rows: features learned by CNN0 in Chapter 5; features learned by CNN1 in Chapter 5; features learned by jointly training CNN0 and CNN1 in Chapter 5
Columns: (a) natural image prior [81]; (b) center-biased [55]; (c) max activation [77]
Figure 8.3: The visualization of the multi-depth features. The visualized output node represents the class “Gothic architecture” in the task ARC in Table 4.1. The 4 images in (a)/(b) are the results generated by Yosinski’s [81]/Nguyen’s [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where green/red color code means that the image is correctly/incorrectly labeled by the trained CNN.

Rows: features in scale 0 in Chapter 6; features in scale 1 in Chapter 6
Columns: (a) natural image prior [81]; (b) center-biased [55]; (c) max activation [77]
Figure 8.4: The visualization of the multi-scale features. The visualized output node represents the class “Russian Revival architecture” in the task ARC in Table 4.1. The 4 images in (a)/(b) are the results generated by Yosinski’s [81]/Nguyen’s [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where green/red color code means that the image is correctly/incorrectly labeled by the trained CNN.

APPENDIX A RELATED PUBLICATIONS
Conference Papers:
• Kuan-Chuan Peng, Amir Sadovnik, Andrew Gallagher, and Tsuhan Chen. “Where Do Emotions Come from? Predicting the Emotion Stimuli Map”, IEEE International Conference on Image Processing (ICIP), 2016.
• Kuan-Chuan Peng and Tsuhan Chen. “Toward Correlating and Solving Abstract Tasks Using Convolutional Neural Networks”, IEEE Winter Conference on Applications of Computer Vision (WACV), 2016.
• Kuan-Chuan Peng and Tsuhan Chen. “Cross-layer Features in Convolutional Neural Networks for Generic Classification Tasks”, IEEE International Conference on Image Processing (ICIP), 2015.
• Kuan-Chuan Peng and Tsuhan Chen. “A Framework of Extracting Multi-scale Features Using Multiple Convolutional Neural Networks”, IEEE International Conference on Multimedia and Expo (ICME), 2015.
• Kuan-Chuan Peng, Amir Sadovnik, Andrew Gallagher, and Tsuhan Chen. “A Mixed Bag of Emotions: Model, Predict, and Transfer Emotion Distributions”, IEEE Computer Vision and Pattern Recognition (CVPR), 2015.
• Kuan-Chuan Peng, Kolbeinn Karlsson, Tsuhan Chen, Dongqing Zhang, and Heather Yu. “A Framework of Changing Image Emotion Using Emotion Prediction”, IEEE International Conference on Image Processing (ICIP), 2014.
• Kuan-Chuan Peng and Tsuhan Chen. “Incorporating Cloud Distribution in Sky Representation”, IEEE International Conference on Computer Vision (ICCV), 2013.

Patent Application:
• Kuan-Chuan Peng, Heather Hong Yu, Dongqing Zhang, and Tsuhan Chen. “Emotion Modification for Image and Video Content”, US Patent Application 20150213331, July 30, 2015.

BIBLIOGRAPHY
[1] P. Agrawal, R. Girshick, and J. Malik. Analyzing the performance of multilayer neural networks for object recognition. In European Conference on Computer Vision, pages 329–344, 2014.
[2] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189–2202, November 2012.
[3] Y. Bar, N. Levy, and L. Wolf. Classiﬁcation of artistic styles using binarized features derived from a deep neural network, 2014.
[4] I. Ben-Shalom, N. Levy, L. Wolf, N. Dershowitz, A. Ben-Shalom, R. Shweka, Y. Choueka, T. Hazan, and Y. Bar. Congruency-based reranking. In Computer Vision and Pattern Recognition, pages 2107–2114, 2014.
[5] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001.
[6] M. M. Bradley and P. J. Lang. Measuring emotion: the self-assessment manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry, 25(1):49–59, 1994.
[7] B. Celikkale, A. Erdem, and E. Erdem. Visual attention-driven spatial pooling for image memorability. In Computer Vision and Pattern Recognition Workshop, pages 976–983, 2013.
[8] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[9] M.-M. Cheng, J. Warrell, W.-Y. Lin, S. Zheng, V. Vineet, and N. Crook. Efﬁcient salient region detection with soft image abstraction. In International Conference on Computer Vision, pages 1529–1536. IEEE, 2013.
[10] E. S. Dan-Glauser and K. R. Scherer. The Geneva affective picture database (GAPED): a new 730-picture database focusing on valence and normative signiﬁcance. Behavior Research Methods, 43(2):468–477, 2011.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, pages 248–255, 2009.
[12] deviantart. http://www.deviantart.com.
[13] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon. Collecting large, richly annotated facial-expression databases from movies. IEEE Multimedia, 19(3):34–41, 2012.
[14] Merriam-Webster Online: Dictionary and Thesaurus. http://www.merriam-webster.com/.
[15] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013.
[16] E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. McRorie, J.-C. Martin, L. Devillers, S. Abrilian, A. Batliner, N. Amir, and K. Karpouzis. The HUMAINE database: addressing the collection and annotation of naturalistic and induced emotional data. Lecture Notes in Computer Science, 4738:488–501, 2007.
[17] dpchallenge. http://www.dpchallenge.com/.
[18] P. Ekman, W. V. Friesen, and P. Ellsworth. What emotion categories or dimensions can observers judge from facial behavior? Emotion in the Human Face, pages 39–55, 1982.
[19] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88:303–338, 2010.
[20] Flickr. https://www.flickr.com/.
[21] J. R. J. Fontaine, K. R. Scherer, E. B. Roesch, and P. C. Ellsworth. The world of emotions is not two-dimensional. Psychological Science, 18(12):1050–1057, 2007.
[22] Y. Fu, T. M. Hospedales, T. Xiang, S. Gong, and Y. Yao. Interestingness prediction by robust learning to rank. In European Conference on Computer Vision, pages 488–503, 2014.
[23] M. Gendron and L. F. Barrett. Reconstructing the past: a century of ideas about emotion in psychology. Emotion Review, 1(4):316–339, 2009.
[24] A. Giusti, D. C. Ciresan, J. Masci, L. M. Gambardella, and J. Schmidhuber. Fast image scanning with deep max-pooling convolutional neural networks. In International Conference on Image Processing, pages 4034–4038. IEEE, 2013.
[25] S. Goferman, L. Zelnik-Manor, and A. Tal. Context-aware saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(10):1915–1926, 2012.
[26] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In European Conference on Computer Vision, pages 392–407, 2014.
[27] G. Grifﬁn, A. Holub, and P. Perona. Caltech-256 object category dataset. 2007.
[28] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. V. Gool. The interestingness of images. In International Conference on Computer Vision, pages 1633–1640. IEEE, 2013.
[29] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361, 2014.
[30] P. Isola, J. Xiao, D. Parikh, A. Torralba, and A. Oliva. What makes a photograph memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1469–1482, 2014.
[31] P. Isola, J. Xiao, A. Torralba, and A. Oliva. What makes an image memorable? In Computer Vision and Pattern Recognition, pages 145–152, 2011.
[32] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[33] D. Joshi, R. Datta, Q.-T. Luong, E. Fedorovskaya, J. Z. Wang, J. Li, and J. Luo. Aesthetics and emotions in images: a computational perspective. IEEE Signal Processing Magazine, 28(5):94–115, 2011.
[34] B. Jou, S. Bhattacharya, and S.-F. Chang. Predicting viewer perceived emotions in animated GIFs. In International Conference on Multimedia, pages 213–216. ACM, 2014.
[35] B. Jou, T. Chen, N. Pappas, M. Redi, M. Topkara, and S.-F. Chang. Visual affect around the world: A large-scale multilingual visual sentiment ontology. In International Conference on Multimedia, pages 159–168. ACM, 2015.
[36] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In International Conference on Computer Vision, pages 2106–2113, 2009.
[37] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim. Joint ﬁne-tuning in deep neural networks for facial expression recognition. In International Conference on Computer Vision, pages 2983–2991. IEEE, 2015.
[38] L. Kang, P. Ye, Y. Li, and D. Doermann. A deep learning approach to document image quality assessment. In International Conference on Image Processing, pages 2570–2574. IEEE, 2014.
[39] S. Karayev, M. Trentacoste, H. Han, A. Agarwala, T. Darrell, A. Hertzmann, and H. Winnemoeller. Recognizing image style. In The British Machine Vision Conference, 2014.
[40] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F.-F. Li. Large-scale video classiﬁcation with convolutional neural networks. In Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[41] F. S. Khan, S. Beigpour, J. V. D. Weijer, and M. Felsberg. Painting-91: a large scale database for computational painting categorization. Machine Vision and Applications, 25:1385–1397, 2014.
[42] A. Khosla, J. Xiao, A. Torralba, and A. Oliva. Memorability of image regions. In The Conference and Workshop on Neural Information Processing Systems, pages 296–304, 2012.
[43] M. H. Kiapour, K. Yamaguchi, A. C. Berg, and T. L. Berg. Hipster wars: Discovering elements of fashion styles. In European Conference on Computer Vision, pages 472–488, 2014.
[44] J. Kim, S. Yoon, and V. Pavlovic. Relative spatial features for image memorability. In International Conference on Multimedia, pages 761–764. ACM, 2013.
[45] S. G. Koolagudi and K. S. Rao. Emotion recognition from speech: a review. International Journal of Speech Technology, 15(2):99–117, 2012.
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classiﬁcation with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. 2012.
[47] P. J. Lang, M. M. Bradley, and B. N. Cuthbert. International affective picture system (IAPS): affective ratings of pictures and instruction manual. Tech. Rep. No. A-8. 2008.
[48] F.-F. Li, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In Computer Vision and Pattern Recognition Workshop, pages 178–186, 2004.
[49] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[50] X. Lu, Z. Lin, H. Jin, J. Yang, and J. Z. Wang. RAPID: rating pictorial aesthetics using deep learning. In International Conference on Multimedia. ACM, 2014.
[51] J. Luo, A. Singhal, S. P. Etz, and R. T. Gray. A computational approach to determination of main subject regions in photographic images. Image and Vision Computing, 22:227–241, 2004.
[52] J. Machajdik and A. Hanbury. Affective image classiﬁcation using features inspired by psychology and art theory. In International Conference on Multimedia, pages 83–92. ACM, 2010.
[53] K. Matsumoto, K. Kita, and F. Ren. Emotion estimation of wakamono kotoba based on distance of word emotional vector. In the 7th International conference on Natural Language Processing and Knowledge Engineering, pages 214–220, 2011.
[54] N. Murray, L. Marchesotti, and F. Perronnin. AVA: A large-scale database for aesthetic visual analysis. In Computer Vision and Pattern Recognition, pages 2408–2415, 2012.
[55] A. Nguyen, J. Yosinski, and J. Clune. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. CoRR, abs/1602.03616, 2016.
[56] A. Ortony and T. J. Turner. What’s basic about basic emotions? Psychological Review, 97(3):315–331, 1990.
[57] N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man and Cybernetics, 9(1):62–66, 1979.
[58] E. Park, X. Han, T. L. Berg, and A. C. Berg. Combining multiple sources of knowledge in deep cnns for action recognition. In IEEE Winter Conference on Applications of Computer Vision, pages 1–8, 2016.
[59] K.-C. Peng and T. Chen. Incorporating cloud distribution in sky representation. In International Conference on Computer Vision, pages 2152–2159. IEEE, 2013.
[60] K.-C. Peng and T. Chen. Cross-layer features in convolutional neural networks for generic classiﬁcation tasks. In IEEE International Conference on Image Processing, pages 3057–3061, 2015.
[61] K.-C. Peng and T. Chen. A framework of extracting multi-scale features using multiple convolutional neural networks. In IEEE International Conference on Multimedia and Expo, pages 1–6, 2015.
[62] K.-C. Peng and T. Chen. Toward correlating and solving abstract tasks using convolutional neural networks. In IEEE Winter Conference on Applications of Computer Vision, pages 1–9, 2016.
[63] K.-C. Peng, K. Karlsson, T. Chen, D.-Q. Zhang, and H. Yu. A framework of changing image emotion using emotion prediction. In IEEE International Conference on Image Processing, pages 4637–4641, 2014.
[64] K.-C. Peng, A. Sadovnik, A. Gallagher, and T. Chen. The Cornell Emotion6 Image Database. http://chenlab.ece.cornell.edu/people/kuanchuan/publications/Emotion6.zip.
[65] K.-C. Peng, A. Sadovnik, A. Gallagher, and T. Chen. The Cornell EmotionROI Image Database. http://chenlab.ece.cornell.edu/people/kuanchuan/publications/EmotionROI.zip.
[66] K.-C. Peng, A. Sadovnik, A. Gallagher, and T. Chen. A mixed bag of emotions: Model, predict, and transfer emotion distributions. In Computer Vision and Pattern Recognition, pages 860–868, 2015.
[67] K.-C. Peng, A. Sadovnik, A. Gallagher, and T. Chen. Where do emotions come from? predicting the emotion stimuli map. In IEEE International Conference on Image Processing, 2016.
[68] K.-C. Peng, H. H. Yu, D. Zhang, and T. Chen. Emotion modiﬁcation for image and video content, 07 2015. US Patent Application 20150213331.
[69] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F.-F. Li. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252, 2015.
[70] J. A. Russell. A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161–1178, 1980.
[71] E. M. Schmidt and Y. E. Kim. Modeling musical emotion dynamics with conditional random ﬁelds. In the 12th International Society for Music Information Retrieval Conference, pages 777–782, 2011.
[72] M. Solli and R. Lenz. Emotion related structures in large image databases. In International Conference on Image and Video Retrieval, pages 398–405. ACM, 2010.
[73] X. Wang, J. Jia, J. Yin, and L. Cai. Interpretable aesthetic features for affective image classiﬁcation. In International Conference on Image Processing, pages 3230–3234. IEEE, 2013.
[74] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[75] D. Wu, L. Pigou, P.-J. Kindermans, N. D.-H. Le, L. Shao, J. Dambre, and J.-M. Odobez. Deep dynamic neural networks for multimodal gesture segmentation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8):1583–1597, 2016.
[76] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In Computer Vision and Pattern Recognition, pages 3485–3492, 2010.
[77] Z. Xu, D. Tao, Y. Zhang, J. Wu, and A. C. Tsoi. Architectural style classiﬁcation using multinomial latent logistic regression. In European Conference on Computer Vision, pages 600–615, 2014.
[78] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang. Saliency detection via graph-based manifold ranking. In Computer Vision and Pattern Recognition, pages 3166–3173, 2013.
[79] Y.-H. Yang and H. H. Chen. Machine recognition of music emotion: a review. ACM Transactions on Intelligent Systems and Technology, 3(3):1–30, 2012.
[80] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In The Conference and Workshop on Neural Information Processing Systems, pages 3320–3328, 2014.
[81] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. In International Conference on Machine Learning Workshop, 2015.
[82] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833, 2014.
[83] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for ﬁne-grained category detection. In European Conference on Computer Vision, pages 834–849, 2014.