TOWARDS SOLVING ABSTRACT TASKS USING CONVOLUTIONAL NEURAL NETWORKS

A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy

by
Kuan-Chuan Peng
August 2016

© 2016 Kuan-Chuan Peng
ALL RIGHTS RESERVED

TOWARDS SOLVING ABSTRACT TASKS USING CONVOLUTIONAL NEURAL NETWORKS
Kuan-Chuan Peng, Ph.D.
Cornell University 2016

Abstract tasks are tasks relating to or involving general ideas or qualities rather than specific people, objects, or actions. Recently, abstract tasks such as artistic style classification and memorability prediction have been receiving increasing attention in the computer vision community. Previous works on abstract tasks mainly rely on standard handcrafted features rather than learning the features directly from the training data. In this thesis, we explore the efficacy of using convolutional neural networks (CNN), which learn features tailored to each abstract task.

Predicting emotion distributions and predicting emotion stimuli maps are the first two abstract tasks we work on. In both tasks, we build associated databases and show that CNN-based approaches predict more accurate emotion distributions and emotion stimuli maps than the methods used in previous works. Given the encouraging results in the emotion-related tasks, we apply CNN to eight different abstract tasks proposed recently in computer vision, showing that CNN-based approaches can outperform the state-of-the-art performance reported in previous works.

In addition to the traditional CNN framework, we propose using multi-task, multi-depth, and multi-scale CNN features to further improve the performance in abstract tasks. Multi-task features incorporate the features learned from the training data of other tasks.
Multi-depth features consist of the features learned by different neural network architectures, while multi-scale features are formed by the features learned from the augmented training data in different scales. The experimental results show that all three proposed CNN features outperform the traditional CNN framework. Furthermore, we train additional fully connected networks to fuse our proposed CNN features. The fused features achieve better performance than each of our proposed features used alone.

THESIS COMMITTEE

Prof. Tsuhan Chen
School of Electrical and Computer Engineering
Cornell University

Prof. Kavita Bala
Department of Computer Science
Cornell University

Prof. Serge Belongie
Department of Computer Science
Cornell University and Cornell Tech

BIOGRAPHICAL SKETCH

Kuan-Chuan Peng receives his Ph.D. in the field of Electrical and Computer Engineering from Cornell University. He joined the Advanced Multimedia Processing Lab at Cornell University in 2012, advised by Prof. Tsuhan Chen. His thesis committee includes Prof. Tsuhan Chen, Prof. Kavita Bala, and Prof. Serge Belongie. His thesis focuses on using convolutional neural networks to improve the performance of abstract tasks in computer vision such as emotion classification and memorability prediction.

Before accepting his Ph.D. degree from Cornell University, Kuan-Chuan Peng received his bachelor’s degree in Electrical Engineering from National Taiwan University in 2009. He became interested in computer vision through the special projects advised by Prof. Liang-Gee Chen. He therefore shifted his major to Computer Science and received his master’s degree from National Taiwan University in 2012. During his two years of master’s studies at National Taiwan University, he worked with Prof. Chiou-Shann Fuh and JMicron Technology Corp., developing a pedestrian detection algorithm for automobile devices.
As a researcher in computer vision, Kuan-Chuan Peng not only serves as a reviewer for the related top-tier academic conferences but also publishes his work at those venues, including the International Conference on Computer Vision (ICCV) and the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). After graduating from Cornell University, Kuan-Chuan Peng plans to continue his research on neural networks and abstract tasks.

This thesis is dedicated to my parents: Jui-Pin Peng and Kuei-Yuan Liu. I appreciate your unconditional love and continuous support.

ACKNOWLEDGEMENTS

I am grateful for the constant support and guidance from my advisor, Prof. Tsuhan Chen, during my Ph.D. life. You are the main reason why I chose Cornell for my Ph.D. study, and this is one of the best decisions I have ever made in my life. You encourage me to think creatively and give me a lot of freedom to explore various topics in computer vision. Your academic insight helps me overcome the challenges I face in pursuing my Ph.D. I am always inspired by your advice on my research and your invaluable life experiences. You are my role model in both career and research, and I am proud to be your student.

I appreciate the suggestions given by my thesis committee: Prof. Kavita Bala and Prof. Serge Belongie. Your broad knowledge and clairvoyant advice encourage me to keep improving my research. I also want to thank Prof. Noah Snavely and Prof. Thorsten Joachims. Their courses in computer vision and machine learning built a solid foundation for my Ph.D. research.

I thank Heather Yu and Dongqing Zhang, my manager and mentor when I was an intern researcher at Huawei in 2013. Your industrial insight and continuous support, even after my internship, are an indispensable part of my research. Our study on emotion prediction motivated my subsequent research on abstract tasks. It is a rewarding experience to work with you.
I am grateful for the support from my collaborators: Andrew Gallagher, Amir Sadovnik, and Kolbeinn Karlsson. I enjoy the brainstorming in our discussions and appreciate your constructive advice during my Ph.D. study. It is my pleasure to work with you and learn from you.

I thank all the members of the Advanced Multimedia Processing (AMP) Lab: Adarsh Kowdle, Zhaoyin Jia, Ruogu Fang, Amandianeze Nwana, Hang Chu, Henry Shu, and Dong Ki Kim. I also thank the scholars who visited the AMP Lab during my time at Cornell, including James Guo-Zhen Wang, Toshihiko Yamasaki, Satoshi Ueno, Tim Tyng-Luh Liu, and Yuuka Kihara. Their feedback on my research and the useful discussions in our weekly group meetings are important for my research. I am lucky to be able to learn from each of you.

I am thankful for the Cornell students who have worked with me in the past few years. Their creativity always gives me new ideas for my research.

I thank all my friends at Cornell and the Cornell Taiwanese Student Association. They always give me the warmest hug, especially in the winter time in Ithaca. Their support makes my life at Cornell colorful and memorable.

I appreciate the unconditional love from my family. You are always there rooting for me, which makes me unafraid of any obstacles in my life.

TABLE OF CONTENTS

Biographical Sketch
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction
  1.1 First Published Appearances of Described Contributions

2 Modeling and Predicting Evoked Emotion Distributions
  2.1 Introduction
  2.2 Prior Work
  2.3 The Emotion6 Database
    2.3.1 Emotion Model and Category Selection
    2.3.2 Image Collection and User Study
  2.4 Predicting Emotion Distributions
    2.4.1 Support Vector Regression (SVR)
    2.4.2 Convolutional Neural Networks (CNN) and CNN for Regression (CNNR)
  2.5 Predicting Valence–Arousal (VA) Scores
  2.6 Conclusion

3 Predicting Emotion Stimuli Maps
  3.1 Introduction
  3.2 The EmotionROI Database and User Study
  3.3 Proposed Methods
    3.3.1 Labeling from Emotion Similarity (LES)
    3.3.2 FCN with Euclidean Loss (FCNEL)
  3.4 Experimental Setting
    3.4.1 Evaluation Metrics
    3.4.2 Baselines: Saliency and Objectness Detection
  3.5 Experimental Results
  3.6 Conclusion

4 Multi-task CNN Features
  4.1 Introduction
    4.1.1 Summary of Findings
  4.2 Experimental Setup
    4.2.1 Databases and Tasks
    4.2.2 Network Architecture
    4.2.3 Training Approaches
  4.3 CNN Performance in Abstract Tasks
  4.4 Correlating Abstract Tasks
    4.4.1 Experimental Setting
    4.4.2 Experimental Results
  4.5 Conclusion

5 Multi-depth CNN Features
  5.1 Introduction
  5.2 Generating Multi-depth CNN Features
  5.3 Experimental Setup
    5.3.1 Databases and Tasks
    5.3.2 Training Approach
  5.4 Experimental Results
  5.5 Conclusion

6 Multi-scale CNN Features
  6.1 Introduction
  6.2 Label-Inheritable Property
  6.3 Experimental Setup
    6.3.1 Databases and Tasks
    6.3.2 MSCNN Architecture
    6.3.3 Training Approach
  6.4 Experimental Results
  6.5 Conclusion

7 Fusing Multi-depth, Multi-scale, and Multi-Task Features
  7.1 Introduction
  7.2 Feature Fusion Using Fully Connected Networks
    7.2.1 Network Architecture
    7.2.2 Fused Features
  7.3 Experimental Results
  7.4 Feature Fusion for Predicting Emotion Distributions
    7.4.1 Motivation
    7.4.2 Proposed Combined Networks
    7.4.3 Experimental Results

8 Conclusions and Future Work
  8.1 Future Work: CNN Visualization for Abstract Tasks

A Related Publications

Bibliography

LIST OF TABLES

1.1 Several standard benchmark databases and the recent CNN-related works using them.

2.1 The issues of previous emotion image databases and how our proposed database, Emotion6, solves these issues.

2.2 The feature set we use for support vector regression (SVR) in predicting emotion distributions.

2.3 The performance of different methods for predicting emotion distributions compared using $P_M$ and $\overline{M}$ ($M \in \{\text{KLD}, \text{BC}, \text{CD}, \text{EMD}\}$). The upper table shows $P_M$, the probability that Method 1 outperforms Method 2 with distance metric $M$. Each row in the upper table shows that Method 1 outperforms Method 2 in all $M$. The lower table lists $\overline{M}$, the mean of $M$, of each method, showing that CNNR achieves better $\overline{M}$ than the other methods listed here. CNNR performs the best out of all the listed methods in terms of all $P_M$s with better $\overline{M}$.

2.4 The performance of different algorithms for predicting valence and arousal scores. The numbers are the average of absolute difference (AAD) compared with the ground truth in SAM 9-point scale [6]. CNNR outperforms the two baselines, and has comparable performance to SVR.
3.1 The performance of predicting emotion stimuli maps of the 594 testing images of the EmotionROI database in MAE, precision, recall, and 4 F-measures. The performance is shown by $P_M$ and $\overline{M}$ ($M \in \{\text{MAE}, \text{precision}, \text{recall}, F_{0.5}, F_{\sqrt{0.3}}, F_1, F_2\}$). The top table lists $\overline{M}$, the mean of metric $M$, of each method, showing that LES and FCNEL achieve better $\overline{M}$ than the two baselines. For each metric $M$, the best $\overline{M}$ is marked in bold. The bottom two tables show $P_M$, the probability that Method 1 outperforms Method 2 with metric $M$. Each row in the bottom two tables shows that Method 1 outperforms Method 2 in most $M$. For saliency detection [9], both the results before / after “block fitting” are reported. FCNEL performs the best out of all the listed methods in terms of most $P_M$ with better $\overline{M}$.

4.1 The databases and associated abstract tasks used in this thesis along with their properties. In the rest of the thesis, we refer to each task by the corresponding task ID listed under each task. The experimental setting for each task is provided at the bottom of the table, where ρ is Spearman rank correlation between the prediction and the ground truth.

4.2 The five different CNN training approaches used in this chapter. $M_{\text{ImageNet}}$ is the Caffe [32] reference model trained for ImageNet [11] classification, and $M_{\text{AVA}}$ is our trained reference model for AVA [54] classification. We refer to each training method by its training approach ID.

4.3 The summary of the experimental results of the 8 abstract tasks listed in Table 4.1 using the five training approaches in Table 4.2. In this table, ρ is the Spearman rank correlation between the prediction and the ground truth. The bold numbers represent the best performance given the specified evaluation metric, and the underlined numbers indicate the performance better than that of “train from scratch.”

4.4 The summary of the experimental results using the framework in Figure 4.1. The task CAL and the 8 abstract tasks listed in Table 4.1 are included in this experiment. In this table, ρ is the Spearman rank correlation between the prediction and the ground truth. The underlined numbers indicate the performance better than that of using self feature. The best concatenating setting in each task is shown in Table 4.5. This table shows that concatenating the CNN features learned from different tasks can improve the performance. However, in our experiment, concatenating all the learned CNN features never performs the best in each task. In fact, in 4 out of 9 tasks, concatenating all the learned CNN features performs even worse than using self feature.

4.5 The best concatenating setting out of 264 different settings in each task. The corresponding performance is listed in Table 4.4. This table shows that the best performance of each task is not achieved by concatenating all the learned CNN features, but by concatenating a subset of them.

4.6 The performance ranking in each task by using the CNN features learned from a single task. The numbers are presented in the form “rank (performance).” The rank 1 (9) represents the best (worst) performance in each task. The table shows that self feature usually performs the best if we use the CNN features learned from one task. The best and the worst performances out of the 9 different features in each task are listed at the bottom of the table, which shows that even the worst performance is nontrivial (much better than random guessing).

4.7 The performance ranking in each task by using self feature and the CNN features learned from another task. The format of the numbers is the same as that of Table 4.6 except that the underlined numbers represent the performance better than that of using only self feature. The best and the worst performances in each task out of the 8 different combinations of features are listed at the bottom of the table, where the performance of using only self feature is also provided. The table shows that most of the listed feature combinations outperform using only self feature in each task.

5.1 The summary of the multi-depth CNN features used in this chapter. Serving as a baseline, $F_0$ represents the features extracted from the topmost layer in the traditional CNN framework. $F_1$ to $F_4$ are our proposed multi-depth CNN features which are formed by concatenating $f_i$s ($i = \{0, 1, \cdots, 4\}$) defined in Figure 5.1. We follow the specification of the AlexNet [46] and use $k = 4096$.

5.2 The summary of our experimental results. The bold numbers represent the best performance for each task. The results show that for all the 8 tasks, our proposed multi-depth CNN features ($F_1$ to $F_4$) outperform not only the best known results from prior works but also the features commonly used in the traditional CNN framework ($F_0$).

6.1 The summary of our experimental results. The bold numbers represent the best performance for each task. The results show that for most of the 8 tasks, our proposed multi-scale CNN features (MSCNN-1 to MSCNN-3) outperform not only the best known results from prior works but also the traditional CNN framework (MSCNN-0).

7.1 The performance comparison between each of our proposed features and the fused features. The numbers are reported in the format “accuracy (%) [confidence level (%)].” The confidence represents the confidence of the feature outperforming the CNN baseline according to the binomial test. The italic numbers are the performance exceeding the 95% confidence level (statistically significant results).

7.2 Performance comparison in predicting emotion distributions. Our combined networks predict more accurate emotion distributions with ≥97.7% confidence compared with Peng’s method [66].

LIST OF FIGURES

1.1 Example abstract tasks proposed in recent computer vision literature. For each abstract task, we show an example image and its ground truth label from the corresponding database proposed in the listed reference.

2.1 Example images of Emotion6 with the corresponding ground truth. The emotion keyword used to search each image is displayed on the top. The graph below each image shows the probability distribution of evoked emotions of that image. The bottom two numbers are valence–arousal (VA) scores in SAM 9-point scale [6].

2.2 Two screenshots of the interface of our user study on the Amazon Mechanical Turk. Before the subject answers the questions (the right image), we provide the subject with the instructions and an example (the left image) explaining how to answer the questions.

2.3 Classification performance of CNN and Wang’s method [73] with the Artphoto database [52]. In 6 out of 8 emotion categories, CNN outperforms Wang’s method [73].

2.4 The mapping between the color codes and emotion keywords used in Section 2.5.

2.5 The distribution of VA scores of the Emotion6 database. All the images in Emotion6 are placed in VA plane according to their VA scores. The boundary of each image is colored to reflect the dominant evoked emotion according to the color codes in Figure 2.4.
3.1 An example showing that different regions in an image contribute to the viewer’s evoked emotion differently. (c), (d), and (e) are cut from the yellow, green, and red rectangles of (a) respectively. (b) shows the regions in (a) which affect the evoked emotion the most, as marked in the user study. (e) evokes emotions more similar to those evoked by (a) than (c) and (d) do because (e) contains not only the person jumping but other emotion-related areas, which is consistent with (b), the ground truth of the emotion stimuli map of (a).

3.2 An example showing the difference between saliency, objectness detection and the emotion stimuli map. (b) is the ground truth emotion stimuli map using (a) as the input image. (c) and (d) correspond to saliency [9] and objectness [2] detection, respectively. Neither (c) nor (d) perfectly captures (b), where two thirds of the subjects indicate that the area affecting evoked emotions includes not only the flower but also other emotion-related areas.

3.3 A screenshot of the interface of our user study on the Amazon Mechanical Turk. We ask the subject to draw a rectangle enclosing the part of the image that most influences the evoked emotion.

3.4 Some example images from the EmotionROI database with the corresponding ground truth emotion stimuli maps. The emotion keyword used to search each image (provided by the Emotion6 [66] database) is displayed under the image.

3.5 The performance of predicting emotion stimuli maps of the 594 testing images of the EmotionROI database in PR curve, average precision and recall, and 4 F-measures. These figures show that our two proposed methods (LES and FCNEL) outperform the two baselines (saliency [9] and objectness [2] detection) in PR curve as well as 6 statistics computed from the average precision and recall.

3.6 The qualitative and quantitative results of predicting emotion stimuli maps with some testing images of the EmotionROI database as input. The representation of each column is as follows: (a) input image, (b) the ground truth emotion stimuli map, (c) the pre-processed result of saliency detection [9], (d) the post-processed result of saliency detection [9], (e) the result of objectness detection [2], (f) the result of LES, and (g) the result of FCNEL. The emotion keyword used to search each input image is shown under each image in column (a) according to the information provided in the Emotion6 [66] database. For column (b) to (g), the corresponding MAE is shown under each image. Our two proposed methods (column (f) and (g)) predict more accurate emotion stimuli maps than other baselines (column (c) to (e)) do for these examples.

4.1 The framework of concatenating the CNN features learned from different tasks. We experiment on 9 tasks ($n = 9$), including the 8 abstract tasks in Table 4.1 and Caltech-101 [48] object classification task. The switch $S_i$ associated with each task $T_i$ ($i \in \{1, 2, \cdots, 9\}$) controls whether the CNN features learned from task $T_i$ are concatenated in the final feature vector.

5.1 The five CNN structures adopted in this chapter. $\text{CNN}_0$ represents the AlexNet [46], and $\text{CNN}_1$ to $\text{CNN}_4$ are the same as $\text{CNN}_0$ except that some convolutional layers are removed. We use each $\text{CNN}_i$ ($i = \{0, 1, \cdots, 4\}$) as a feature extractor which takes an image as input and outputs a feature vector $f_i$ from the topmost fully connected layer. These $f_i$s are concatenated to form our proposed multi-depth features according to the definition in Table 5.1.

6.1 The illustration of label-inheritable (LI) property. Given an image database $D$ associated with a task $T$, if any cropped version of any image $I$ from $D$ can take the same label as that of $I$, we say that $T$ satisfies LI property. In this figure, we only show the case when the cropped image is the upper right portion of the original image. In fact, the cropped image can be any portion of the original image.

6.2 Example images from the three databases (Painting-91 [41], arcDataset [77], and Caltech-101 [48]) associated with three different tasks which satisfy LI property in different degrees. The corresponding database, task, label, and the extent that LI property is satisfied are shown under each image.

6.3 The illustration of our proposed multi-scale convolutional neural networks (MSCNN) which consists of $m + 1$ AlexNet [46] (one for each of the $m + 1$ different scales). The details of the MSCNN architecture and the training approach are explained in Sec. 6.3.2 and Sec. 6.3.3 respectively.

7.1 The fully connected networks we use to fuse our proposed multi-task, multi-depth, and multi-scale CNN features.

7.2 The illustration of our proposed combined networks for predicting emotion distributions. We only show convolution, deconvolution, and fully connected layers for clarity.

8.1 The visualization of the output node representing the class “high aesthetic value” in the task AVA in Table 4.1. The 4 images in (a)/(b) are the results generated by Yosinski’s [81]/Nguyen’s [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where all the 9 training images are correctly labeled by the trained CNN.

8.2 The visualization of the multi-task features. The visualized output node represents the class “Rembrandt van Rijn” in the task AST in Table 4.1. The 4 images in (a)/(b) are the results generated by Yosinski’s [81]/Nguyen’s [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where green/red color code means that the image is correctly/incorrectly labeled by the trained CNN.

8.3 The visualization of the multi-depth features. The visualized output node represents the class “Gothic architecture” in the task ARC in Table 4.1. The 4 images in (a)/(b) are the results generated by Yosinski’s [81]/Nguyen’s [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where green/red color code means that the image is correctly/incorrectly labeled by the trained CNN.

8.4 The visualization of the multi-scale features. The visualized output node represents the class “Russian Revival architecture” in the task ARC in Table 4.1. The 4 images in (a)/(b) are the results generated by Yosinski’s [81]/Nguyen’s [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where green/red color code means that the image is correctly/incorrectly labeled by the trained CNN.

CHAPTER 1
INTRODUCTION

During the past few years, computer vision scientists have been excited about the breakthrough progress in object classification. According to the article [69] published by the team hosting the ImageNet [11] visual recognition challenge, the error rate of object classification decreased from 16.42% in 2012 to 6.66% in 2014. Machines can now be taught to classify objects accurately. However, humans care about more than just objects. For instance, according to the user study conducted by Murray et al.
[54], humans can distinguish images with high aesthetic values from images with low aesthetic values even when all these images contain the same type of objects. Another example showing that humans are interested in more than objects is image tagging, where humans use not only objects but also non-object attributes to tag or describe images, which can be easily verified on common image hosting websites such as Flickr [20].

Both the examples of image aesthetic values and image tagging show that abstract attributes matter. Adopting the definition from the Merriam-Webster dictionary [14], we define abstract attributes as the attributes relating to or involving general ideas or qualities rather than specific people, objects, or actions. To bridge the gap between humans and machines, one approach is to teach machines to solve the tasks involving abstract attributes. In this thesis, we use the term abstract tasks to refer to the tasks involving abstract attributes.

Various kinds of abstract tasks have been proposed in recent computer vision literature, including classification tasks (emotions [52, 73], artist and artistic styles [41], aesthetic qualities [50, 54], fashion styles [43], and architectural styles [77]) and regression tasks (memorability [7, 30, 31, 42, 44] and interestingness [22, 28]). Figure 1.1 shows some example images and their ground truth labels of these abstract tasks. We observe that tasks involving abstract attributes are relatively subjective compared with tasks involving objects and that it is tricky to describe abstract attributes the way we describe objects. For these two reasons, most abstract tasks are weakly supervised, without location information, and the ground truth of abstract tasks has lower consensus. In addition, publicly available databases of abstract tasks contain only a limited amount of data (typically a few hundred or a few thousand images), which is not at all comparable to the millions of images already available for standard object or scene classification tasks. We also find that most abstract tasks have higher intra-class variation compared with object classification tasks. The above properties make abstract tasks challenging, which motivates us to propose better methods to solve abstract tasks.

Figure 1.1: Example abstract tasks proposed in recent computer vision literature. For each abstract task, we show an example image and its ground truth label from the corresponding database proposed in the listed reference.

database(s)                      CNN-related works
ImageNet [11]                    [15, 26, 29, 46, 82]
PASCAL [19]                      [1, 29]
SUN [76]                         [1, 15, 26]
Caltech-101 [48]                 [15, 29, 82]
Caltech-UCSD Birds 200 [74]      [15, 83]

Table 1.1: Several standard benchmark databases and the recent CNN-related works using them.

During our search for possible methods to solve abstract tasks, we find that convolutional neural networks (CNN) have achieved better performance than non-CNN-based approaches on several databases in recent studies [15, 26, 82, 83]. The trend of using CNN-based methods to solve computer vision problems started with Krizhevsky et al. [46] using CNNs to achieve breakthrough improvement in the ImageNet [11] visual recognition challenge. In addition, Donahue et al. [15] showed that the network proposed by Krizhevsky et al. [46] can serve as a feature extractor, which they use to outperform the state-of-the-art results on several generic tasks. Given these encouraging results [15, 46], more researchers started studying and applying CNN-based approaches to different standard benchmark databases. After summarizing the commonly used standard databases and the recent CNN-related works using them in Table 1.1, we find that the databases used in most CNN-related works are associated with standard object or scene classification tasks instead of abstract tasks.

In this thesis, we explore the possibility of using CNN to solve abstract tasks.
We start our experiments with two emotion-related tasks, predicting emotion distributions and predicting emotion stimuli maps, and extend our experiments to eight different abstract tasks. After showing that CNN-based approaches can outperform the state-of-the-art results of the eight abstract tasks, we propose using multi-task, multi-depth, and multi-scale CNN features to further improve the performance.

The rest of this thesis is organized as follows. Chapter 2 presents our work on modeling and predicting evoked emotion distributions. Chapter 3 describes our work on predicting emotion stimuli maps, which is an example of extending the weakly supervised abstract tasks to strongly supervised ones. Chapter 4, Chapter 5, and Chapter 6 elaborate on our proposed multi-task, multi-depth, and multi-scale CNN features respectively. After showing supporting results of our proposed CNN features on abstract tasks, we describe a method using fully connected networks to fuse these features in Chapter 7. The thesis is concluded in Chapter 8 with a discussion of potential future work.

1.1 First Published Appearances of Described Contributions

Most contributions described in this thesis, or their initial versions, first appeared in the following publications:

1. Chapter 2: Peng, Sadovnik, Gallagher, Chen [66]
2. Chapter 3: Peng, Sadovnik, Gallagher, Chen [67]
3. Chapter 4: Peng, Chen [62]
4. Chapter 5: Peng, Chen [60]
5. Chapter 6: Peng, Chen [61]
6. Chapter 7: Peng, Chen [60, 61, 62]
7. Chapter 8: Peng, Chen [60, 61, 62]

The following contributions have also appeared as publications: Peng, Karlsson, Chen, Zhang, Yu [63]; Peng, Yu, Zhang, Chen [68]; Peng, Chen [59]. However, they are beyond the scope of this dissertation and are therefore not discussed here.

CHAPTER 2
MODELING AND PREDICTING EVOKED EMOTION DISTRIBUTIONS

Summary

This chapter explores a new aspect of photos and human emotions.
We show through psychovisual studies that different people have different emotional reactions to the same image, which is a novel departure from previous work that only records and predicts a single dominant emotion for each image. Our studies also show that the same person may have multiple emotional reactions to one image. Predicting emotions in “distributions” instead of a single dominant emotion is important for many applications. In addition, we present a new database, Emotion6, containing distributions of emotions.

2.1 Introduction

Images are emotionally powerful. An image can evoke a strong emotion in the viewer. Further, the viewer’s emotion may sometimes be affected in a way that was unexpected by the photographer. For example, an image of a hot air balloon may evoke feelings of joy in some observers (who crave adventure), but fear in others (who have a fear of heights). We address the fact that people have different evoked emotions by collecting and predicting the distributions of emotional responses when an image is viewed by a large population. We also address the fact that the same person may have multiple emotions evoked by one image by allowing the subjects to record multiple emotional responses to one image. Our goal is to predict the evoked emotion that a population of observers has when viewing a particular image.

We make the following contributions:

1. We show that different people have different emotional reactions to an image and that the same person may have multiple emotional reactions to an image. Our proposed database, Emotion6, addresses both findings by modeling emotion distributions.

2. We use a convolutional neural network (CNN) to predict emotion distributions, rather than simply predicting a single dominant emotion, evoked by an image. Our predictor of emotion distributions for Emotion6 can serve as a better baseline than using support vector regression (SVR) with the features from previous works [52, 72, 73] for future researchers.
We also predict emotions in the traditional setting of affective image classification, showing that CNN outperforms Wang’s method [73] on the Artphoto database [52].

2.2 Prior Work

In computer vision, image classification based on abstract concepts has recently received a great deal of attention. Aesthetic quality estimation [33] and affective image classification [52, 72, 73] are two typical examples. However, these two abstract concepts are fundamentally different because the evoked emotion of an image is not equivalent to aesthetic quality. For example, one may feel joyful after viewing either amateur or expert photos, and aesthetically ideal images may elicit either happy or sad emotions. Moreover, aesthetic quality is a one-dimensional attribute, whereas emotions are not [21].

In predicting the emotional responses evoked by an image, researchers conduct experiments on various types of images. Wang et al. [73] perform affective image classification on artistic photos or abstract paintings, but Solli and Lenz [72] use Internet images. Machajdik and Hanbury [52] use both artistic and realistic images in their experiment. Previous works ignore the fact that groups of people seldom agree on the evoked emotion [33] and that even a single person may have multiple emotions evoked by one image. According to the statistics of Emotion6, the image database we collect in Sec. 2.3, more than half of the subjects have emotional responses different from the dominant emotion. Our statistics also show that in ∼22% of all responses, the subject selects at least two emotion keywords to describe the evoked emotions. Both of these observations support our assertion that emotion should be represented as a distribution rather than as a single dominant emotion.

Further, predicting emotions by distribution rather than as a single dominant emotion is important for practical applications.
For example, suppose a company has two possible ads arousing different emotion distributions (ad1: 60% joy and 40% surprise; ad2: 70% joy and 30% fear). Though ad2 elicits joy with higher probability than ad1 does, the company may choose ad1 instead of ad2 because ad2 arouses negative emotion in some part of the population.

In psychology, researchers have been interested in emotions for decades, leading to three major approaches: the “basic emotion”, “appraisal”, and “psychological constructionist” traditions [23]. With debate on these approaches, psychologists designed different kinds of models for explaining fundamental emotions based on various criteria. Ortony and Turner [56] summarized some past theories of basic emotions, and even now, there is not complete consensus among psychologists. One of the most popular frameworks in the emotion field, the valence–arousal (VA) model proposed by Russell [70], characterizes emotions in VA dimensions, where valence describes the emotion on a scale from positive to negative, while arousal indicates the degree of stimulation or excitement. We adopt the VA model as part of emotion prediction.

| Issue of previous databases | Explanation and how Emotion6 solves the issue |
|---|---|
| Ad-hoc categories | Previous databases select emotion categories without psychological background, but Emotion6 uses Ekman’s 6 basic emotions [18] as categories. |
| Unbalanced categories | Previous databases have unbalanced proportions of images from each category, but Emotion6 has balanced categories with 330 images per category. |
| Single category per image | Assigning each image to only one category (dominant emotion), previous databases ignore that the evoked emotion can vary between observers [33]. Emotion6 expresses the emotion associated with each image as a probability distribution. |

Table 2.1: The issues of previous emotion image databases and how our proposed database, Emotion6, solves these issues.
In terms of emotion categories, we adopt Ekman’s six basic emotions [18], the details of which are explained in Sec. 2.3.

To recognize and classify different emotions, scientists build connections between emotions and various types of input data including text [21], speech [16, 45], facial expressions [13, 16], music [79], and gestures [16]. Among the research related to emotions, we are interested in emotions evoked by consumer photographs (not just artworks or abstract images as in [73]). Unfortunately, related databases are few compared with those in the other areas mentioned previously. These databases, such as IAPS [47], GAPED [10], and emodb [72], have a few clear limitations. We propose a new emotion database, Emotion6, on which this chapter is mainly based. Table 2.1 summarizes how Emotion6 solves the limitations of previous databases. Sec. 2.3 describes the details of Emotion6.

Figure 2.1: Example images of Emotion6 with the corresponding ground truth. The emotion keyword used to search each image is displayed on the top. The graph below each image shows the probability distribution of evoked emotions of that image. The bottom two numbers are valence–arousal (VA) scores in SAM 9-point scale [6].

2.3 The Emotion6 Database

For each image in Emotion6, the following information is collected by a user study:

1. The ground truth of VA scores for evoked emotion.

2. The ground truth of emotion distribution for evoked emotion. Consisting of 7 bins, Ekman’s 6 basic emotions [18] and neutral, each emotion distribution represents the probability that an image will be classified into each bin by a subject.

For both VA scores, we adopt the self-assessment manikin (SAM) 9-point scale [6], which is also used in [47]. For the valence scores, 1, 5, and 9 mean very negative, neutral, and very positive emotions respectively. For the arousal scores, 1 (9) means the emotion has low (high) stimulating effect.
Figure 2.1 shows some images from Emotion6 with the corresponding ground truth. The details about the selection of emotion model/categories/images and the user study are described in the following sections. The database Emotion6 is available online [64].

2.3.1 Emotion Model and Category Selection

According to the list of different theories of basic emotions [56], we use Ekman’s six basic emotions [18] (anger, disgust, joy, fear, sadness, and surprise) as the categories of Emotion6. Each of these six emotions is adopted by at least three psychological theorists in [56], which provides a consensus for the importance of each of these six emotions. In addition to using emotion keywords as categories, we adopt the valence–arousal (VA) model because we want to be consistent with the previous databases where ground truth VA scores are provided.

2.3.2 Image Collection and User Study

We collect the images of Emotion6 from Flickr [20] by using the 6 category keywords and synonyms as search terms. High-level semantic content of an image, including strong facial expressions, posed humans, and text, influences the evoked emotion of an image. Instead of focusing on the text or facial expressions, we are more interested in how the low-level features of images such as colors and texture affect the evoked emotion. Therefore, we remove the images containing apparent human facial expressions or text directly related to the evoked emotion because these two types of content have been shown to have a strong relationship to emotion [16, 21]. In contrast to the database emodb [72], which has no human moderation, we examine each image in Emotion6 to remove erroneous images.

Figure 2.2: Two screenshots of the interface of our user study on the Amazon Mechanical Turk. Before the subject answers the questions (the right image), we provide the subject with the instructions and an example (the left image) explaining how to answer the questions.
A total of 1980 images are collected, 330 for each category, comparable to previous databases. Each image is scaled to approximately VGA resolution while keeping the original aspect ratio.

We use Amazon Mechanical Turk (AMT) to collect emotional responses from subjects. For each image, each subject rates the evoked emotion in terms of VA scores, and chooses the keyword(s) best describing the evoked emotion. We provide 7 emotion keywords (Ekman’s 6 basic emotions [18] and neutral), and the subject can select multiple keywords for each image. Instead of directly asking the subject to give VA scores, we rephrase the questions to be similar to GAPED [10]. Figure 2.2 shows two snapshots of the interface. To compare with previous databases, we randomly extract a subset S_G containing 220 images from GAPED [10] such that the proportion of each category in S_G is the same as that of GAPED. We reject the responses from a few subjects who failed to demonstrate consistency or provided a constant score for all images.

Each human intelligence task (HIT) on AMT contains 10 images, and we offer 10 cents to reward the subject’s completion of each HIT. In the instructions, we inform the subject that the answers will be examined by an algorithm that detects lazy or fraudulent workers and that only workers who pass will be paid. In each HIT, the last image is from S_G, and the other 9 images are from Emotion6. We create 220 different HITs for AMT such that the following constraints are satisfied:

1. Each HIT contains at least one image from each of the 6 categories (by keyword).

2. Images are ordered in such a way that the frequency of an image from category i appearing after category j is equal for all i, j.

3. Each image or HIT cannot be rated more than once by the same subject, and each subject cannot rate more than 55 different HITs.

4. Each image is scored by 15 subjects.

The mean and standard deviation of the time spent on each HIT are 450 and 390 seconds respectively.
The minimum time spent on one HIT is 127 seconds, which is still reasonable. 432 unique subjects took part in the experiment, rating 76.4 images on average. After collecting the answers from the subjects, we sort the VA scores of each image and average the middle 9 scores (to remove outliers) to serve as ground truth. For the emotion category distribution, the ground truth of each category is the average vote of that category across subjects.

To provide grounding for Emotion6, we compute the VA scores of the images from S_G using the above method and compare them with the ground truth provided by GAPED [10], where the original scale 0∼100 is converted linearly to 1∼9 to be consistent with our scale. The average absolute difference of the V (A) scores for these images is 1.006 (1.362) in SAM 9-point scale [6], which is comparable in this highly subjective domain.

2.4 Predicting Emotion Distributions

Randomly splitting the Emotion6 database into training and testing sets with the proportion of 7:3, we propose three methods (SVR, CNN, and CNNR) and compare their performance with that of three baselines. The details of the three proposed methods are explained below.

2.4.1 Support Vector Regression (SVR)

Inspired by previous works on affective image classification [52, 72, 73], we adopt features related to color, edge, texture, saliency, and shape to create a normalized 759-dimensional feature set shown in Table 2.2. To verify the affective classification ability of this feature set, we perform the exact experiment from [52], using their database. The average true positive per class is ∼60% for each category, comparable to the results presented in [52].
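Returning briefly to the ground-truth construction at the end of Sec. 2.3.2, the outlier-trimmed VA score and the linear rescaling of GAPED scores can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names `trimmed_mean` and `rescale_linear` are hypothetical, the vote list is made up, and we assume the trimming is symmetric (with 15 scores and 9 kept, the 3 lowest and 3 highest are dropped).

```python
def trimmed_mean(scores, keep=9):
    """Average the middle `keep` values after sorting (drops both tails).

    Assumption: trimming is symmetric, so 15 scores with keep=9
    discards the 3 lowest and the 3 highest ratings.
    """
    ordered = sorted(scores)
    if keep > len(ordered):
        raise ValueError("keep cannot exceed the number of scores")
    drop = (len(ordered) - keep) // 2
    middle = ordered[drop:drop + keep]
    return sum(middle) / len(middle)


def rescale_linear(score, src=(0.0, 100.0), dst=(1.0, 9.0)):
    """Map a score linearly from the src range to the dst range."""
    src_lo, src_hi = src
    dst_lo, dst_hi = dst
    return dst_lo + (score - src_lo) * (dst_hi - dst_lo) / (src_hi - src_lo)


# Made-up valence ratings from 15 subjects for one image (SAM 9-point scale).
valence_votes = [1, 2, 2, 4, 5, 5, 5, 5, 5, 6, 6, 6, 8, 9, 9]
ground_truth_valence = trimmed_mean(valence_votes)

# A GAPED score on the 0-100 scale mapped onto the 1-9 scale.
gaped_on_sam_scale = rescale_linear(50.0)
```

Under these assumptions, a GAPED score at the midpoint of its range maps to the neutral SAM score of 5.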
| Feature type | Dimension | Description |
|---|---|---|
| Texture | 24 | Features from Gray-Level Co-occurrence Matrix (GLCM) including the mean, variance, energy, entropy, contrast, and inverse difference moment [52]. |
| Texture | 3 | Tamura features (coarseness, contrast and directionality) [52]. |
| Composition | 2 | Rule of thirds (distance between salient regions and power points/lines) [73]. |
| Composition | 1 | Diagonal dominance (distance between prominent lines and two diagonals) [73]. |
| Composition | 2 | Symmetry (sum of intensity differences between pixels symmetric with respect to the vertical/horizontal central line) [73]. |
| Composition | 3 | Visual balance (distances of the center of the most salient region from the center of the image, the vertical and horizontal central lines) [73]. |
| Saliency | 1 | Difference of areas of the most/least salient regions. |
| Saliency | 1 | Color difference of the most/least salient regions. |
| Saliency | 2 | Difference of the sum of edge magnitude of the most/least salient regions. |
| Color | 80 | Cascaded CIECAM02 color histograms (lightness, chroma, hue, brightness, and saturation) in the most/least salient regions. |
| Edge | 512 | Cascaded edge histograms (8 (8)-bin edge direction (magnitude) in RGB and gray channels) in the most/least salient regions. |
| Shape | 128 | Fit an ellipse for every segment from color segmentation and compute the histogram of fit ellipses in terms of angle (4 bins), the ratio of major and minor axes (4 bins), and area (4 bins) in the most/least salient regions. |

Table 2.2: The feature set we use for support vector regression (SVR) in predicting emotion distributions.

We train one model for each emotion category using the ground truth of the category in Emotion6 with Support Vector Regression (SVR) provided in LIBSVM [8], with the parameters of SVR learned by performing 5-fold cross validation on the training set. In the predicting phase, the probabilities of all emotion categories are normalized such that they sum up to 1.
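The normalization step above can be sketched as follows. This is a hedged illustration, not the thesis code: the name `normalize_predictions` and the raw values are hypothetical, and two details that the text does not specify are assumed here, namely that negative regressor outputs are clipped to zero and that an all-zero prediction falls back to the uniform distribution.

```python
EMOTIONS = ["anger", "disgust", "joy", "fear", "sadness", "surprise", "neutral"]


def normalize_predictions(raw):
    """Turn one raw regressor output per category into a distribution.

    Assumptions (not stated in the text): negative outputs are clipped
    to 0, and an all-zero vector becomes the uniform distribution.
    """
    clipped = [max(0.0, value) for value in raw]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(raw)] * len(raw)
    return [value / total for value in clipped]


# Made-up outputs of the 7 per-category regressors for one image.
raw_outputs = [0.10, 0.05, 0.60, -0.02, 0.08, 0.25, 0.12]
distribution = normalize_predictions(raw_outputs)

# The predicted dominant emotion is the bin with the largest probability.
dominant = EMOTIONS[distribution.index(max(distribution))]
```

Because each category has its own regressor, the raw outputs are not guaranteed to be non-negative or to sum to 1, which is why a post-hoc normalization of this kind is needed.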
To assess the performance of SVR in emotion classification, we compare the emotion with the highest predicted probability with the dominant emotion of the ground truth. The accuracy of our model in this multi-class classification setting is 38.9%, which is about 2.7 times that of random guessing (14.3%).

2.4.2 Convolutional Neural Networks (CNN) and CNN for Regression (CNNR)

In CNN, we use the exact convolutional neural network in [46] except that the number of output nodes is changed to 7 to represent the probability of the input image being classified as each emotion category in Emotion6. In CNNR, we train a regressor for each emotion category in Emotion6 with the exact convolutional neural network in [46] except that the number of output nodes is changed to 1 to predict a real value and that the softmax loss layer is replaced with the Euclidean loss layer. In the predicting phase, the probabilities of all emotion categories are normalized to sum to 1. Using the Caffe implementation [32] and its default parameters for training the ImageNet [11] model, we pre-train with the Caffe reference model [32] and fine-tune the convolutional neural network with our training set in both CNN and CNNR.

To show the efficacy of classification with the convolutional neural network, we use CNN to perform binary emotion classification with the Artphoto database [52] under the same experimental setting as Wang’s method [73]. In this experiment, we change the number of output nodes to 2 and train one binary classifier for each emotion under the 1-vs-all setting. We repeat the positive examples in the training set such that the number of positive examples is the same as that of the negative ones. Figure 2.3 shows that CNN outperforms Wang’s method [73] in 6 out of 8 emotion categories.

Figure 2.3: Classification performance of CNN and Wang’s method [73] with the Artphoto database [52]. In 6 out of 8 emotion categories, CNN outperforms Wang’s method [73].
In terms of the average true positive per class, averaged over all 8 emotion categories, CNN (64.724%) also outperforms Wang’s method [73] (63.163%).

The preceding experiment shows that CNN achieves state-of-the-art performance for emotion classification of images. However, what we are really interested in is the prediction of emotion distributions, which better capture the range of human responses to an image. For this task, we use CNNR as previously described, and show that its performance is state-of-the-art for emotion distribution prediction.

We compare the predictions of our three proposed methods with the following three baselines:

1. A uniform distribution across all emotion categories.

2. A random probability distribution.

3. Optimally dominant (OD) distribution, a winner-take-all strategy where the emotion category with the highest probability in the ground truth is set to 1, and the other emotion categories have zero probability.

The first two baselines represent chance guesses, while the third represents a best-case scenario for any (prior art) multi-class emotion classifier that outputs a single emotion.

We use four different metrics to evaluate the similarity between two emotion distributions: KL-divergence (KLD), Bhattacharyya coefficient (BC), Chebyshev distance (CD), and earth mover’s distance (EMD) [53, 71]. Since KLD is not well defined when a bin has value 0, we use a small value $\epsilon = 10^{-10}$ to approximate the values in such bins. In computing EMD, we assume that the ground distance between any two of the 7 dimensions (Ekman’s 6 basic emotions [18] and neutral) is the same. For KLD, CD and EMD, lower is better. For BC, higher is better.
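The four metrics can be sketched in a few lines. The following is a minimal illustration under stated assumptions rather than the evaluation code used in this chapter: the function names are hypothetical, KLD is taken in the direction KL(ground truth || prediction) with natural logarithms (the text does not state either choice), and the common ground distance for EMD is exposed as the parameter `ground_distance` with an arbitrary default of 1. With a single common ground distance $d$ between all pairs of bins, EMD between two unit-mass distributions reduces to $d$ times half of the L1 distance.

```python
import math


def kld(p, q, eps=1e-10):
    """KL-divergence KL(p || q); bins with value 0 are replaced by eps.

    Assumptions: p is the ground truth, q the prediction, natural log.
    """
    p = [max(value, eps) for value in p]
    q = [max(value, eps) for value in q]
    return sum(a * math.log(a / b) for a, b in zip(p, q))


def bc(p, q):
    """Bhattacharyya coefficient (1 for identical distributions)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def cd(p, q):
    """Chebyshev distance: the largest per-bin absolute difference."""
    return max(abs(a - b) for a, b in zip(p, q))


def emd_uniform_ground(p, q, ground_distance=1.0):
    """EMD when every pair of distinct bins is equally far apart.

    In that case the optimal transport cost equals ground_distance
    times the total variation distance (half of the L1 distance).
    """
    return ground_distance * 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Made-up 7-bin example: ground truth split between two emotions,
# compared against a winner-take-all (OD-style) prediction.
truth = [0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
one_hot = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
scores = {
    "KLD": kld(truth, one_hot),
    "BC": bc(truth, one_hot),
    "CD": cd(truth, one_hot),
    "EMD": emd_uniform_ground(truth, one_hot),
}
```

The example also shows why a one-hot prediction is penalized heavily by KLD: every ground-truth bin that the prediction sets to 0 is compared against $\epsilon$, which contributes a large logarithmic term.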
For each distance metric $M$, we use $\bar{M}$ and $P_M$ to evaluate the ranking between two algorithms, where $\bar{M}$ is the mean of $M$, and $P_M$ in Table 2.3 (upper table) is the proportion of images where Method 1 matches the ground truth distribution more accurately than Method 2 according to distance metric $M$. Method 1 is superior to Method 2 when $P_M$ exceeds 0.5. For the random distribution baseline, we repeat 100000 times and report the average $P_M$. The results are in Table 2.3. CNNR outperforms SVR, CNN, and the three baselines in both $P_M$ and $\bar{M}$, and should be considered as a standard baseline for future emotion distribution research. Table 2.3 also shows that OD performs even worse than the uniform baseline. This shows that predicting only one single emotion category like [52, 72, 73] does not model well the fact that people have different emotional responses to the same image and that the same person may have multiple emotional responses to one image.

| Method 1 | Method 2 | $P_{KLD}$ | $P_{BC}$ | $P_{CD}$ | $P_{EMD}$ |
|---|---|---|---|---|---|
| CNNR | Uniform | 0.742 | 0.783 | 0.692 | 0.756 |
| CNNR | Random | 0.815 | 0.819 | 0.747 | 0.802 |
| CNNR | OD | 0.997 | 0.840 | 0.857 | 0.759 |
| CNNR | SVR | 0.625 | 0.660 | 0.571 | 0.620 |
| CNNR | CNN | 0.934 | 0.810 | 0.842 | 0.805 |
| Uniform | OD | 0.997 | 0.667 | 0.736 | 0.593 |

| Method | $\overline{KLD}$ | $\overline{BC}$ | $\overline{CD}$ | $\overline{EMD}$ |
|---|---|---|---|---|
| Uniform | 0.697 | 0.762 | 0.348 | 0.667 |
| Random | 0.978 | 0.721 | 0.367 | 0.727 |
| OD | 10.500 | 0.692 | 0.510 | 0.722 |
| SVR | 0.577 | 0.820 | 0.294 | 0.560 |
| CNN | 2.338 | 0.692 | 0.497 | 0.773 |
| CNNR | 0.480 | 0.847 | 0.265 | 0.503 |

Table 2.3: The performance of different methods for predicting emotion distributions compared using $P_M$ and $\bar{M}$ ($M \in \{KLD, BC, CD, EMD\}$). The upper table shows $P_M$, the probability that Method 1 outperforms Method 2 with distance metric $M$. Each row in the upper table shows that Method 1 outperforms Method 2 in all $M$. The lower table lists $\bar{M}$, the mean of $M$, of each method, showing that CNNR achieves better $\bar{M}$ than the other methods listed here. CNNR performs the best out of all the listed methods in terms of all $P_M$s with better $\bar{M}$.
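The pairwise measure $P_M$ is a simple win rate over test images. A minimal sketch, assuming that the per-image metric values of both methods are already available as two equally long lists and that ties count as non-wins (the text does not say how ties are handled); the function name `win_rate` and the numbers are hypothetical:

```python
def win_rate(scores_1, scores_2, lower_is_better=True):
    """Proportion of images on which Method 1 beats Method 2.

    scores_1[i] and scores_2[i] are the metric values of the two
    methods on image i, each measured against the ground truth.
    Assumption: a tie is not counted as a win for Method 1.
    """
    if len(scores_1) != len(scores_2) or not scores_1:
        raise ValueError("need two non-empty score lists of equal length")
    if lower_is_better:
        wins = sum(a < b for a, b in zip(scores_1, scores_2))
    else:
        wins = sum(a > b for a, b in zip(scores_1, scores_2))
    return wins / len(scores_1)


# Made-up per-image values for three test images.
kld_method_1 = [0.10, 0.50, 0.30]
kld_method_2 = [0.20, 0.40, 0.60]
p_kld = win_rate(kld_method_1, kld_method_2)  # KLD: lower is better

bc_method_1 = [0.90, 0.50]
bc_method_2 = [0.80, 0.70]
p_bc = win_rate(bc_method_1, bc_method_2, lower_is_better=False)  # BC: higher is better
```

Reporting $P_M$ alongside $\bar{M}$ guards against a mean that is dominated by a few extreme images, which matters for KLD in particular because of the $\epsilon$ substitution.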
2.5 Predicting Valence–Arousal (VA) Scores

In this section, we present the statistics related to VA scores in SAM 9-point scale [6] and predict VA scores. For the valence scores, 1, 5, and 9 mean very negative, neutral, and very positive emotions respectively. For the arousal scores, 1 (9) means the emotion has low (high) stimulating effect. Figure 2.5 places all the images of Emotion6 in the VA plane according to the ground truth of evoked VA scores. The boundary of each image is colored according to its dominant evoked emotion using the color codes in Figure 2.4.

Figure 2.4: The mapping between the color codes and emotion keywords used in Section 2.5.

Figure 2.5: The distribution of VA scores of the Emotion6 database. All the images in Emotion6 are placed in the VA plane according to their VA scores. The boundary of each image is colored to reflect the dominant evoked emotion according to the color codes in Figure 2.4.

| Method 1 | Method 2 | $P_V$ | $P_A$ |
|---|---|---|---|
| CNNR | Popularity | 0.631 | 0.577 |
| CNNR | Random | 0.729 | 0.818 |
| CNNR | SVR | 0.556 | 0.502 |

| Method | AAD of Valence | AAD of Arousal |
|---|---|---|
| Popularity | 1.590 | 0.829 |
| Random | 2.423 | 2.113 |
| SVR | 1.347 | 0.734 |
| CNNR | 1.219 | 0.741 |

Table 2.4: The performance of different algorithms for predicting valence and arousal scores. The numbers are the average of absolute difference (AAD) compared with the ground truth in SAM 9-point scale [6]. CNNR outperforms the two baselines, and has comparable performance to SVR.

We create predictors for VA scores separately using the same set of features and similar methods as those used for predicting the emotion distributions in Sec. 2.4. We compare the results of SVR and CNNR with the following two baselines:

1. Guessing the V (A) score as the mode of all V (A) scores.

2. Guessing VA scores uniformly at random.

We evaluate the results with the average of absolute difference (AAD) compared with the ground truth in SAM 9-point scale [6].
We also report $P_V$ ($P_A$), the proportion of the images in the test set where Method 1 predicts more accurate V (A) than Method 2. For the baseline using uniform random guessing, we repeat 100000 times and report the average. The results are listed in Table 2.4. CNNR outperforms the two baselines, and has comparable performance with respect to SVR.

2.6 Conclusion

This chapter introduces the idea of representing the emotional responses of observers to an image as a distribution of emotions. We describe the methods for estimating the emotion distribution for an image. Further, our proposed emotion predictor, CNNR, outperforms other methods including using SVR with the features from the previous works and the optimally dominant emotion baseline, the upper bound of the emotion predictors that predict a single emotion. Finally, we propose a novel image database, Emotion6, and provide ground truth of valence, arousal, and probability distributions in evoked emotions.

CHAPTER 3
PREDICTING EMOTION STIMULI MAPS

Summary

Which parts of an image evoke emotions in an observer? To answer this question, we introduce a novel problem in computer vision: predicting an Emotion Stimuli Map (ESM), which describes pixel-wise contribution to evoked emotions. Using a new image database, EmotionROI, as a benchmark for predicting the ESM, we find that the regions selected by saliency and objectness detection do not correctly predict the image regions which evoke emotion. Although objects represent important regions for evoking emotion, parts of the background are important as well. Based on this fact, we propose two methods for predicting the ESM, one incorporating emotion similarity and the other leveraging fully convolutional networks. Both qualitative and quantitative experimental results confirm that our methods can predict the regions which evoke emotions better than both saliency and objectness detection.
3.1 Introduction

Images, when viewed, can cause a variety of emotional responses, depending on not only the arrangement of one or more objects in the image but also the emotional state or background of the viewer. For example, an image of bungee jumping can make outdoors-loving people excited, but it can evoke fear in those afraid of heights. Even within the same image, different regions contribute to the viewer’s evoked emotion differently. Imagine we crop the yellow, green, and red rectangles (Figure 3.1 (c), (d), and (e)) from Figure 3.1 (a) and present them individually to a viewer without showing the viewer the full image context (a). The emotional response to (e) is more similar to (a) than to either (c) or (d). We represent the varying degree of influence that regions of an image have on the emotional response of a viewer with an emotion stimuli map (ESM), shown in Figure 3.1 (b), where brighter areas represent higher influence. The ESM (b) is produced by averaging across selections from a user study, and matches the observation that (e) best captures the emotion-inducing regions of (a). In this chapter, we are interested in predicting the ESM.

Figure 3.1: An example showing that different regions in an image contribute to the viewer’s evoked emotion differently. (c), (d), and (e) are cut from the yellow, green, and red rectangles of (a) respectively. (b) shows the regions in (a) which affect the evoked emotion the most, as marked in the user study. (e) evokes emotions more similar to those evoked by (a) than (c) and (d) do because (e) contains not only the person jumping but also other emotion-related areas, which is consistent with (b), the ground truth of the emotion stimuli map of (a).

Recently, emotion-related topics have gained increasing attention in computer vision, especially affective image classification. Machajdik and Hanbury [52] perform affective image classification on both artistic and realistic images.
Solli and Lenz [72] use Internet images in their experiment, but Wang et al. [73] focus on affective image classification of artistic photos or abstract paintings. Our prior work [66] also predicts and transfers emotion distributions using Internet images. In addition, there are related works studying emotions from animated GIFs [34] and multilingual perspectives [35]. Even though different forms of multimedia have been explored, none of the previous works analyzes the influence of various regions in an image on emotion.

There is no benchmark for evaluating the ESM. We use the images collected in the Emotion6 database [66] to build a benchmark database, EmotionROI, for predicting the ESM. The ground truth of the ESM provided in the EmotionROI database is generated based on the answers marked by users in a user study. The details of the EmotionROI database are explained in Sec. 3.2.

Saliency detection [9, 25, 51] and objectness measurement [2] are two popular topics closely related to predicting the ESM. While saliency and objectness detection tend to find salient objects in an image, the ESM captures the regions affecting the evoked emotion, and those regions may contain not only the salient objects but also other emotion-related areas. For example, Figure 3.2 (c) and (d) are the results of saliency [9] and objectness [2] detection respectively with Figure 3.2 (a) as the input. Figure 3.2 (c) focuses on the dark salient areas, but Figure 3.2 (d) emphasizes the withered flower.

Figure 3.2: An example showing the difference between saliency, objectness detection and the emotion stimuli map. (b) is the ground truth emotion stimuli map using (a) as the input image. (c) and (d) correspond to saliency [9] and objectness [2] detection, respectively. Neither (c) nor (d) perfectly captures (b), where two thirds of the subjects convey that the area affecting evoked emotions includes not only the flower but also other emotion-related areas.
Neither Figure 3.2 (c) nor (d) perfectly captures the ground truth ESM in Figure 3.2 (b), where two thirds of the subjects convey that the area affecting evoked emotions the most includes not only the flower but also other emotion-related areas. In this chapter, we propose two methods for predicting the ESM whose results are closer to the ground truth than those of the state-of-the-art algorithms for saliency [9] and objectness [2] detection.

Previous work related to saliency detection [36] often considers using eye-tracking equipment to gather ground truth and perform validation. However, when building the ground truth ESM in the EmotionROI database, we choose not to use eye-tracking equipment for the following two reasons:

1. Saliency detection is different from predicting the ESM in terms of the task definition, and we also show their difference in Figure 3.2, where (b) and (c) are not even similar.

2. Where humans look in an image may implicitly reveal part of the areas which affect the evoked emotion the most. However, we believe that directly asking the subjects to mark the emotion-related areas is a more straightforward and efficient method which can avoid potential errors caused by the inference from the eye-tracking results.

To the best of our knowledge, this is the first work in computer vision addressing the problem of predicting the ESM. We make the following contributions:

1. We build a benchmark database, EmotionROI, for predicting the ESM by performing a user study and collecting the ground truth ESMs of the images provided in the Emotion6 database [66]. The EmotionROI database is available online [65].

2. We propose two methods in Sec. 3.3 for predicting the ESM. One leverages emotion similarity, while the other one learns features with fully convolutional networks. Both our methods predict more accurate ESMs than do the state-of-the-art algorithms of saliency [9] and objectness [2] detection.
3.2 The EmotionROI Database and User Study

We use the images in the Emotion6 database [66] to build the EmotionROI database, our proposed benchmark database for predicting the ESM. The EmotionROI database provides the ESMs produced by integrating rectangular areas from the image identified by subjects as influential to the evoked emotion. In the following paragraphs, we first introduce the Emotion6 [66] database and then explain the details of building the EmotionROI database.

The Emotion6 [66] database consists of 6 emotion categories with 330 images per category. For each image, the following information is provided:

1. The ground truth of evoked emotion distribution in terms of emotion keywords.
2. The emotion keyword used to search each image.

The Emotion6 [66] database is assembled from Flickr [20] by entering the 6 category keywords corresponding to Ekman’s 6 basic emotions [18] (anger, disgust, joy, fear, sadness, and surprise) and their synonyms as the search keywords, followed by a step of human moderation to remove erroneous images. The Emotion6 [66] database contains 1980 images, 330 per category, comparable to previous databases [10, 47]. Each image is approximately VGA resolution.

Figure 3.3: A screenshot of the interface of our user study on Amazon Mechanical Turk. We ask the subject to draw a rectangle enclosing the part of the image that most influences the evoked emotion.

Adopting all the 1980 images in the Emotion6 [66] database, we use Amazon Mechanical Turk (AMT) to collect responses from subjects, building the ground truth ESMs in the EmotionROI database. We ask the subject to draw a rectangle enclosing the part of the image that most influences the evoked emotion. Figure 3.3 is a snapshot of the interface. We collect the responses in a way similar to that used for the Emotion6 [66] database.
We consider the emotion categories provided by the Emotion6 [66] database and create 220 different human intelligence tasks (HITs) (each HIT contains 10 images) for AMT that meet the following constraints:

1. Each HIT contains at least one image from each of the 6 categories.
2. Images are ordered in such a way that the frequency of an image from category i appearing after category j is equal for all i, j.

We also enforce the following regulations to be consistent with the previous database [10]:

1. The same subject can only respond to each image or HIT at most once, and each subject cannot respond to more than 55 different HITs to increase diversity.
2. We collect 15 responses for each image to have statistically significant results.

432 unique subjects participate in the experiment, responding to an average of 76.4 images each. For the ESM, we assume the influence of each pixel on evoked emotions is proportional to the number of drawn rectangles covering that pixel. The ground truth ESMs are normalized to the range between 0 and 1. Figure 3.4 shows some example images in the EmotionROI database and the corresponding ground truth ESMs. Figure 3.4 also shows the emotion keyword used to search each image (provided by the Emotion6 [66] database).

Figure 3.4: Some example images from the EmotionROI database with the corresponding ground truth emotion stimuli maps. The emotion keyword used to search each image (provided by the Emotion6 [66] database) is displayed under the image.

3.3 Proposed Methods

We propose two methods for predicting the ESM: labeling from emotion similarity (LES) and fully convolutional networks with Euclidean loss (FCNEL). LES leverages objectness detection [2] and emotion prediction [66] from prior works without directly learning from the EmotionROI database, while FCNEL directly learns from the ground truth of the training data of the EmotionROI database. We explain LES and FCNEL in Sec. 3.3.1 and Sec. 3.3.2 respectively.
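The ground-truth construction described in Sec. 3.2 (count the rectangles covering each pixel, then normalize) can be illustrated with a minimal plain-Python sketch. The function name `build_esm`, the rectangle convention (top, left, bottom, right with exclusive bottom and right), and the choice of dividing by the maximum count are assumptions made for illustration; the text only states that the maps are normalized to the range between 0 and 1.

```python
def build_esm(height, width, rectangles):
    """Accumulate subject-drawn rectangles into a map normalized to [0, 1].

    rectangles: iterable of (top, left, bottom, right) in pixel coordinates,
    with bottom and right exclusive (an assumed convention).
    """
    counts = [[0] * width for _ in range(height)]
    for top, left, bottom, right in rectangles:
        # Clip each rectangle to the image before counting.
        for y in range(max(0, top), min(height, bottom)):
            row = counts[y]
            for x in range(max(0, left), min(width, right)):
                row[x] += 1
    peak = max(max(row) for row in counts)
    if peak == 0:
        # No rectangle touches the image: return an all-zero map.
        return [[0.0] * width for _ in range(height)]
    # Assumed normalization: divide by the largest count.
    return [[c / peak for c in row] for row in counts]


if __name__ == "__main__":
    # Two overlapping rectangles on a 2 x 3 image.
    for row in build_esm(2, 3, [(0, 0, 2, 2), (0, 1, 1, 3)]):
        print(row)
```

With 15 responses per image, each pixel value is then the fraction of the maximum overlap among the 15 rectangles under this assumed normalization.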
3.3.1 Labeling from Emotion Similarity (LES)

Because the regions affecting the evoked emotion may contain not only the main objects but other emotion-related areas, we want to leverage the results of both objectness detection and emotion distribution prediction in our framework of predicting the ESM. For objectness detection, we adopt Alexe’s method [2], which takes an image as input and outputs a set of bounding boxes $B$ representing the most probable locations for an object. For each bounding box $b_i \in B$, the objectness score $s_i$ representing the probability of $b_i$ covering an object is also available from Alexe’s method [2]. A pixel-wise objectness map describing the probability of each pixel belonging to some object can be approximated by assigning each pixel the sum of the objectness scores of the bounding boxes covering that pixel and normalizing the map.

For emotion distribution prediction, we use the method and the Emotion6 database introduced in our prior work [66]. We adopt the same feature set as that used in [66], which includes the features related to color, edge, texture, saliency, and shape from the previous works studying affective image classification [52, 72, 73]. We train one emotion predictor for each of the 6 emotion categories in the Emotion6 [66] database using the ground truth of that category provided in the Emotion6 [66] database with support vector regression (SVR) [8]. The parameters of SVR are learned by performing 5-fold cross validation. In the testing phase, the probabilities of all emotion categories are normalized such that they sum up to 1. Following the above steps, we build our final emotion predictor, which takes an image as input and outputs a 7-D emotion distribution vector.

Since we observe that most subjects tend to select regions enclosing some object, we design our algorithm to predict the ESM based on the bounding boxes given by objectness detection.
We define a weighting parameter $w_i \in [0, 1]$ associated with each bounding box $b_i \in B$ describing the level of influence of $b_i$ on the evoked emotion. Given $w_i$ for each $b_i$, the ESM can be approximated by assigning each pixel the sum of the weighting parameters $w_i$ of the bounding boxes covering that pixel and normalizing the map.

We formulate the problem of estimating $w_i$ for each $b_i$ as a multi-labeling problem. Assuming there are $M$ discrete levels for each $w_i \in L = \{l_1, l_2, \cdots, l_M\}$, where $l_j = \frac{j-1}{M-1}$ for $j \in \{1, 2, \cdots, M\}$, our goal is to find a labeling $l \in L^{|B|}$ such that the following energy function is minimized:

$$E(l) = \sum_{b_i \in B} \psi_i(w_i) + \lambda \sum_{b_i \in B} \sum_{b_j \in B,\; b_j \neq b_i} \psi_{ij}(w_i, w_j), \tag{3.1}$$

where $\psi_i$ and $\psi_{ij}$ are the data term and smoothness term respectively, and $\lambda$ is the weighting of $\psi_{ij}$. This energy function captures the idea that the ESM should be a map with high values in emotionally involved regions, including the main object(s) and partial background contributing to the evoked emotion. The details of $\psi_i$ and $\psi_{ij}$ are presented in the following paragraphs.

The data term $\psi_i$ is defined as:

$$\psi_i(w_i) = A(b_i) \cdot s_i \cdot \frac{|w_i - s_i|}{M - 1}, \tag{3.2}$$

where $A(b_i)$ is the area of $b_i$ normalized by the area of the entire image. We design the data term such that the final estimation of $w_i$ will not be too far away from $s_i$ since the subjects tend to select rectangles enclosing main objects. Those $b_i$ with larger $A(b_i)$ and $s_i$ will be weighted more in $E(l)$ such that both the coverage and objectness scores are respected.

The smoothness term $\psi_{ij}$ is defined as:

$$\psi_{ij}(w_i, w_j) = A(b_i \cap b_j) \cdot BC(e_i, e_j) \cdot \frac{|w_i - w_j|}{M - 1}, \tag{3.3}$$

where $A(b_i \cap b_j)$ is the normalized area of the intersection of $b_i$ and $b_j$, $e_i$ and $e_j$ are the evoked emotion distributions predicted by our emotion predictor using $b_i$ and $b_j$ as input respectively, and $BC(e_i, e_j)$ is the Bhattacharyya coefficient between $e_i$ and $e_j$.
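To make the structure of the energy concrete, the following plain-Python sketch evaluates E(l) for a given labeling. It assumes one reading of Eqs. 3.1 to 3.3: a data term A(b_i) · s_i · |w_i − s_i| / (M − 1), a smoothness term A(b_i ∩ b_j) · BC(e_i, e_j) · |w_i − w_j| / (M − 1), and a double sum over ordered pairs with i ≠ j. All function names and the (x0, y0, x1, y1) box convention are hypothetical. The sketch only evaluates the energy; the minimization itself (graph cut with α-expansion) is not shown.

```python
import math


def box_area(box):
    """Area of an axis-aligned box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def intersection_area(a, b):
    """Area of the overlap between two boxes (0 if they are disjoint)."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return max(0, x1 - x0) * max(0, y1 - y0)


def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two discrete distributions."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def energy(boxes, scores, weights, emotions, image_area, M=10, lam=0.001):
    """Evaluate the assumed form of E(l) for one labeling `weights`."""
    # Data term: keep each w_i close to its objectness score s_i.
    data = sum(
        box_area(b) / image_area * s * abs(w - s) / (M - 1)
        for b, s, w in zip(boxes, scores, weights)
    )
    # Smoothness term: overlapping boxes with similar emotion
    # distributions are penalized for taking different labels.
    smooth = 0.0
    for i in range(len(boxes)):
        for j in range(len(boxes)):
            if i == j:
                continue
            smooth += (
                intersection_area(boxes[i], boxes[j]) / image_area
                * bhattacharyya(emotions[i], emotions[j])
                * abs(weights[i] - weights[j]) / (M - 1)
            )
    return data + lam * smooth


if __name__ == "__main__":
    boxes = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(energy(boxes, [0.5, 1.0], [0.5, 0.0],
                 [[1.0, 0.0], [1.0, 0.0]], image_area=9, M=2, lam=1.0))
```

The default arguments mirror the values reported in the text (M = 10, λ = 0.001); the toy call uses M = 2 and λ = 1 only so that the two terms are easy to check by hand.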
$BC(\cdot, \cdot)$ returns a value between 0 and 1 representing the similarity between two input probability distributions. The higher the Bhattacharyya coefficient is, the more similar the two input probability distributions are. Both the idea of modeling emotions as distributions and the choice of using the evaluation metric BC are inspired by [66]. The design of $\psi_{ij}$ encourages bounding boxes with larger overlapping area or more similar evoked emotion distributions to take similar labels.

Initializing $w_i$ with the label closest to $s_i$, we minimize $E(l)$ by graph cut with the α-expansion algorithm [5]. In our experiment, we set $M = 10$ to reach a balance between the diversity of the label set and computational efficiency, and $\lambda$ is empirically set to 0.001.

3.3.2 FCN with Euclidean Loss (FCNEL)

Fully Convolutional Networks (FCN) have been shown to achieve the state-of-the-art performance in semantic segmentation since Long et al. [49] popularized this approach. We leverage FCN as another method for predicting the ESM because FCN provides an end-to-end training framework which generates pixel-wise dense prediction of the same resolution as the input image. Specifically, we adopt the FCN in Long’s work [49] with the single-stream, 32-pixel-prediction-stride version based on the AlexNet [46] architecture. We choose this standard and relatively simple architecture over other deeper or more complicated networks because the size of our EmotionROI database is relatively small. Therefore, we want to keep the number of parameters which need to be trained manageable.

In Long’s work [49], the softmax loss layer is used as the objective function in the FCN for semantic segmentation where any two different semantic labels are mutually exclusive. However, in predicting the ESM, we want to predict the influence on evoked emotions at each pixel location, not one out of many mutually exclusive class labels.
Therefore, we change the topmost fully connected layer of FCN such that only one output representing the influence on evoked emotions is predicted at each pixel location. We also change the softmax loss layer to a Euclidean loss layer such that the modified FCN can be trained to predict the ESM close to the corresponding ground truth in terms of L2-norm. To distinguish the FCN using Euclidean loss from the common FCN used in semantic segmentation, we use FCNEL to refer to the former method.

In our experiment, we train the FCNEL for predicting the ESM by using the Caffe [32] framework. We pre-train our FCNEL with the reference model, FCN-AlexNet, which is trained for the PASCAL VOC segmentation task [19] and provided by Long et al. [49]. After pre-training, we fine-tune all the parameters of FCNEL with the training data of the EmotionROI database. To train FCNEL efficiently while avoiding convergence issues of the learned parameters, we empirically set the base learning rate to $10^{-8}$. The number of training iterations is set such that each training example is visited at least 20 times. For other training details, we adopt the same setting provided by Long et al. [49] unless otherwise specified.

3.4 Experimental Setting

We experiment on our proposed EmotionROI database, and we use the same training/testing split as that used in our prior work [66] unless otherwise specified. Therefore, there are 1386/594 training/testing images out of all the 1980 images in the EmotionROI database.

3.4.1 Evaluation Metrics

We use 8 evaluation metrics, including mean absolute error (MAE), precision, recall, 4 commonly used F-measures ($F_{0.5}$, $F_{\sqrt{0.3}}$, $F_1$, and $F_2$ scores), and the precision-recall (PR) curve. All the predicted ESMs are normalized to the range from 0 to 1 before evaluation. MAE corresponds to the mean absolute error between the value of the predicted map and the ground truth at all pixel locations.
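As a minimal sketch of the scalar metrics in this section (MAE, precision, recall, and the F-measure), the plain-Python functions below operate on flattened maps. The function names are hypothetical. The binary masks are assumed to be produced beforehand (for the predicted map, for example, with an Otsu threshold), and how the ground-truth map is binarized is left to the caller because the text does not specify it.

```python
def mean_absolute_error(pred, truth):
    """MAE between two flattened maps with values in [0, 1]."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


def precision_recall(pred_mask, truth_mask):
    """Precision and recall for two flattened binary masks."""
    true_pos = sum(1 for p, t in zip(pred_mask, truth_mask) if p and t)
    predicted = sum(1 for p in pred_mask if p)
    actual = sum(1 for t in truth_mask if t)
    # Assumed convention: return 0 when a denominator is empty.
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


def f_measure(precision, recall, beta):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


if __name__ == "__main__":
    p, r = precision_recall([1, 1, 0, 0], [1, 0, 1, 0])
    # beta = sqrt(0.3) gives the F-measure commonly used in saliency work.
    for beta in (0.5, 0.3 ** 0.5, 1.0, 2.0):
        print(beta, f_measure(p, r, beta))
```

Smaller β weights precision more heavily and larger β weights recall more heavily, which is why the four F-measures can rank methods differently.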
Precision is defined as the ratio of emotionally involved pixels correctly assigned to all the pixels identified in the predicted map, while recall represents the percentage of detected emotionally involved pixels out of all the pixels marked in the ground truth. Before computing precision and recall, we binarize each predicted map adaptively according to its Otsu threshold [57]. F-measure is defined in terms of precision and recall as follows:

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}, \tag{3.4}$$

where $\beta$ controls the weighting between precision and recall. In addition to the 3 common F-measures ($F_{0.5}$, $F_1$, and $F_2$), we also include $F_{\sqrt{0.3}}$ because it is a standard metric for saliency detection [78]. For the PR curve, we binarize the predicted map using each threshold in [0, 255]/255, which is similar to the method used in [78].

Figure 3.5: The performance of predicting emotion stimuli maps of the 594 testing images of the EmotionROI database in PR curve, average precision and recall, and 4 F-measures. These figures show that our two proposed methods (LES and FCNEL) outperform the two baselines (saliency [9] and objectness [2] detection) in PR curve as well as 6 statistics computed from the average precision and recall.

3.4.2 Baselines: Saliency and Objectness Detection

Applying our two proposed methods to all the testing images of the EmotionROI database to predict the ESMs, we compare the results with those of the state-of-the-art methods for saliency [9] and objectness [2] detection. Since the saliency detection method proposed by Cheng et al. [9] outputs pixel-wise saliency maps instead of maps consisting of bounding boxes, we post-process the saliency map for fair comparison by drawing a bounding box covering the middle p percent of salient pixels in both width and height. The processed saliency map will include one bounding box filled with 1 and other areas filled with 0.
Testing p = 50 + 5i for i ∈ {0, 1, · · · , 9}, we find that the resulting MAEs are similar, so we only report the best result among them, which uses p = 80. We also compute the MAE of context-aware saliency detection [25], and the results are similar to those of Cheng’s method [9]. Therefore, we only report the results using Cheng’s method [9] for saliency detection.

3.5 Experimental Results

We evaluate the predicted ESMs of the 594 testing images of the EmotionROI database using the 8 evaluation metrics mentioned in Sec. 3.4.1. The results are summarized in Figure 3.5 and Table 3.1. Figure 3.5 shows the PR curve, average precision and recall, and 4 F-measures. For saliency detection [9], we only show the results before “block fitting” in the PR curve because after “block fitting,” the ESMs will be binary with two extreme values by construction. Therefore, there will be no more than 3 points in the PR curve for saliency detection [9] after “block fitting.” However, we do find that those points are all below the two PR curves representing LES and FCNEL.

Metric       Saliency [9]     Objectness [2]   LES      FCNEL
MAE          0.328 / 0.303    0.197            0.186    0.132
precision    0.585 / 0.666    0.668            0.668    0.669
recall       0.340 / 0.536    0.510            0.600    0.718
F_0.5        0.455 / 0.600    0.604            0.642    0.672
F_√0.3       0.444 / 0.594    0.597            0.639    0.673
F_1          0.383 / 0.556    0.551            0.617    0.683
F_2          0.349 / 0.538    0.522            0.603    0.701

Method 1   Method 2         P_MAE            P_precision      P_recall
LES        Saliency [9]     0.973 / 0.944    0.723 / 0.467    0.873 / 0.635
LES        Objectness [2]   0.721            0.527            0.840
FCNEL      Saliency [9]     0.983 / 0.990    0.733 / 0.512    0.949 / 0.777
FCNEL      Objectness [2]   0.847            0.503            0.892
FCNEL      LES              0.786            0.492            0.850

Method 1   Method 2         P_F0.5           P_F√0.3          P_F1             P_F2
LES        Saliency [9]     0.937 / 0.651    0.938 / 0.656    0.935 / 0.656    0.902 / 0.639
LES        Objectness [2]   0.811            0.823            0.877            0.857
FCNEL      Saliency [9]     0.981 / 0.803    0.988 / 0.824    0.986 / 0.860    0.969 / 0.807
FCNEL      Objectness [2]   0.763            0.810            0.919            0.916
FCNEL      LES              0.677            0.707            0.862            0.870

Table 3.1: The performance of predicting emotion stimuli maps of the 594 testing images of the EmotionROI database in MAE, precision, recall, and 4 F-measures. The performance is shown by $P_M$ and $\bar{M}$ ($M \in \{\text{MAE}, \text{precision}, \text{recall}, F_{0.5}, F_{\sqrt{0.3}}, F_1, F_2\}$). The top table lists $\bar{M}$, the mean of metric $M$, of each method, showing that LES and FCNEL achieve better $\bar{M}$ than the two baselines. For each metric $M$, the best $\bar{M}$ is marked in bold. The bottom two tables show $P_M$, the probability that Method 1 outperforms Method 2 with metric $M$. Each row in the bottom two tables shows that Method 1 outperforms Method 2 in most $M$. For saliency detection [9], both the results before / after “block fitting” are reported. FCNEL performs the best out of all the listed methods in terms of most $P_M$ with better $\bar{M}$.

We also perform pairwise comparison between our two proposed methods and the two baselines in Table 3.1, where for each metric $M$, $\bar{M}$ and $P_M$ are used to compare different methods. $\bar{M}$ is the mean of $M$, and $P_M$ in Table 3.1 represents the proportion of the testing images of the EmotionROI database where Method 1 predicts more accurate ESMs than Method 2 does according to metric $M$. Method 1 is superior to Method 2 when $P_M$ exceeds 0.5. The best $\bar{M}$ for each $M$ is marked in bold in Table 3.1. For saliency detection [9], we report both the results before / after “block fitting” in Table 3.1, where “block fitting” improves the performance.

Both Figure 3.5 and Table 3.1 show that LES and FCNEL outperform saliency [9] and objectness [2] detection under most evaluation metrics and that FCNEL performs the best among all the listed methods. The only exception is the results of precision, where saliency [9] after “block fitting,” objectness [2], LES, and FCNEL show comparable performance. Since the most salient object usually has a relatively high value on the ESM, it makes sense that both saliency [9] and objectness [2] achieve reasonable precision.
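The pairwise statistic P_M used in Table 3.1 (the proportion of testing images on which Method 1 beats Method 2 under metric M) can be sketched in a few lines of plain Python. The function name `win_rate` is hypothetical, and counting ties as "not better" is an assumption, since the text does not say how ties are handled.

```python
def win_rate(scores_1, scores_2, lower_is_better=False):
    """Fraction of images on which Method 1 beats Method 2.

    scores_1, scores_2: per-image values of one metric, in the same order.
    lower_is_better: True for error metrics such as MAE.
    Ties are counted as losses for Method 1 (an assumed convention).
    """
    wins = 0
    for a, b in zip(scores_1, scores_2):
        better = a < b if lower_is_better else a > b
        if better:
            wins += 1
    return wins / len(scores_1)


if __name__ == "__main__":
    # Toy per-image MAE values for two methods on four images.
    print(win_rate([0.1, 0.3, 0.2, 0.5], [0.2, 0.2, 0.4, 0.6],
                   lower_is_better=True))
```

A value above 0.5 then corresponds to the case where Method 1 is considered superior under that metric.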
However, what affects evoked emotions is not only salient objects but also other emotionally involved areas, as shown in the ground truth of the EmotionROI database, Figure 3.1, and Figure 3.2. LES and FCNEL show better ability in identifying those emotionally involved areas compared with both saliency [9] and objectness [2] detection, which is reflected in the metrics involving recall.

In addition to Figure 3.5 and Table 3.1, we also analyze the performance with respect to the emotion keywords used to search the testing images of the EmotionROI database (given by the Emotion6 [66] database). Specifically, we evaluate the performance of each method on the testing images searched using each emotion keyword. We find that LES and FCNEL outperform saliency [9] and objectness [2] detection and that FCNEL performs the best out of all the methods under survey, regardless of the emotion category.

Figure 3.6 shows the qualitative and quantitative results of predicting the ESMs with some testing images of the EmotionROI database as input. Columns (a) to (g) are the input image, the ground truth ESM, the pre-processed result of saliency detection [9], the post-processed result of saliency detection [9], the result of objectness detection [2], the result of LES, and the result of FCNEL respectively. For column (a), the emotion keyword under each image is the keyword used to search that image according to the information provided in the Emotion6 [66] database. For columns (b) to (g), the corresponding MAE is shown under each image. Initializing the ESM with column (e), LES adjusts the weighting of each bounding box according to the emotion similarity in the energy minimization such that the final result in column (f) gets closer to the ground truth in column (b). Directly learning from the ground truth of the training images of the EmotionROI database, FCNEL predicts even more accurate ESMs than LES does, as shown in column (g).
Compared with saliency [9] and objectness [2] detection, our results show that the introduction of the cues from emotion similarity and the features learned from the training images of the EmotionROI database improve the results of predicting the ESM.

Figure 3.6: The qualitative and quantitative results of predicting emotion stimuli maps with some testing images of the EmotionROI database as input. The representation of each column is as follows: (a) input image, (b) the ground truth emotion stimuli map, (c) the pre-processed result of saliency detection [9], (d) the post-processed result of saliency detection [9], (e) the result of objectness detection [2], (f) the result of LES, and (g) the result of FCNEL. The emotion keyword used to search each input image is shown under each image in column (a) according to the information provided in the Emotion6 [66] database. For columns (b) to (g), the corresponding MAE is shown under each image. Our two proposed methods (columns (f) and (g)) predict more accurate emotion stimuli maps than other baselines (columns (c) to (e)) do for these examples.

3.6 Conclusion

We identify a novel problem, predicting the emotion stimuli map (ESM), in computer vision. Building a new image database, EmotionROI, as a benchmark for predicting the ESM, we highlight the major difference between the ESM and saliency and objectness detection: the regions affecting evoked emotions contain both the main objects and additional contextual background necessary for the viewer to fully experience the emotion of the image. Based on the above finding, we propose two methods, LES and FCNEL, for predicting the ESM. LES uses the cues from emotion similarity, while FCNEL uses fully convolutional networks to directly learn from the training set of the EmotionROI database.
We present qualitative and quantitative results showing that both our methods predict more accurate ESMs compared with saliency and objectness detection.

CHAPTER 4
MULTI-TASK CNN FEATURES

Summary

Most works using convolutional neural networks (CNN) show the efficacy of their methods in standard object recognition tasks, but not in abstract tasks such as emotion classification and memorability prediction, which are increasingly important (especially as machines become more autonomous, there is a need for semantic understanding). To verify whether CNN-based methods are effective in abstract tasks, we select 8 different abstract tasks in computer vision, evaluating the performance of 5 different CNN-based training approaches in these tasks. We show that CNN-based approaches outperform the state-of-the-art results in all the 8 tasks. Furthermore, we show that concatenating CNN features learned from different tasks can enhance the performance in each task. We also show that concatenating the CNN features learned from all the tasks under experiment does not perform the best, which is different from what is usually shown in previous works. Using CNN as a tool to correlate different tasks, we suggest which CNN features researchers should use in each task.

4.1 Introduction

Utilizing convolutional neural networks has become a popular approach in recent research [15, 26, 82, 83]. However, relatively few CNN-related works concentrate on abstract tasks. Abstract tasks have been receiving increasing attention, including classification tasks (emotions [52, 73], architectural styles [77], aesthetic qualities [50, 54], fashion styles [43], artist and artistic styles [41]) and regression tasks (memorability [7, 30, 31, 42, 44] and interestingness [22, 28]). Most of the mentioned references tackle the abstract tasks without using CNN. In this chapter, we are interested in the performance of CNN in abstract tasks because it is not yet well studied in the current literature.
We are curious whether a CNN can outperform the current state-of-the-art results in abstract tasks. In addition, we are also curious about the following questions, which are rarely discussed in the current literature, where different abstract tasks are treated in a relatively independent fashion. We would like to know whether the CNN features learned from an abstract task can improve the performance in another abstract task. Furthermore, we want to identify, for a given abstract task, which of the many learned CNN features perform the best when concatenated. These questions are related to the transferability mentioned by Yosinski et al. [80]. However, Yosinski et al. [80] study the transferability only in the ImageNet [11] classification task. This chapter is related to the transferability across different tasks and databases.

Our experiments include 8 abstract tasks from 6 databases, covering the abstract tasks mentioned previously. The databases and tasks used in this chapter are summarized in Table 4.1, where each task is assigned a task ID. In this chapter, we use the task ID to refer to each abstract task.

Most works using CNN-related features adopt the AlexNet [46] and pre-train it with the supervision for ImageNet [11] classification. As a novel departure from the previous works, this chapter uses not only the CNN features pre-trained with the supervision for ImageNet [11] but also the CNN features pre-trained with the supervision for AVA [54] under the AlexNet [46]. We compare the performance of these two sets of CNN features in all 8 abstract tasks in Table 4.1, identifying which set of features achieves better performance in each abstract task.

database                      Artphoto                  Painting-91              Painting-91                      AVA
task                          emotion classification    artist classification    artistic style classification    aesthetic classification
reference                     [52]                      [41]                     [41]                             [54]
task ID                       EMO                       AST                      ART                              AVA
# classes                     8                         91                       13                               2
# images                      806                       4266                     2338                             >250k
image type                    deviantart [12]           painting                 painting                         dpchallenge [17]
class labels                  fear, sad, etc.           Rubens, Picasso, etc.    Baroque, Cubism, etc.            high / low aesthetic quality
# training images             ∼645                      2275                     1250                             ∼233k
# testing images              ∼160                      1991                     1088                             19930
data split                    random                    specified [41]           specified [41]                   specified [54]
# fold(s)                     5                         1                        1                                1
evaluation metric             1-vs-all accuracy         accuracy                 accuracy                         accuracy
reference of above setting    [73]                      [41]                     [41]                             [54]

database                      HipsterWars                     arcDataset                            Memorability               Memorability
task                          fashion style classification    architectural style classification    memorability prediction    interestingness prediction
reference                     [43]                            [77]                                  [31]                       dataset: [31]; task: [28]
task ID                       FAS                             ARC                                   MEM                        INT
# classes                     5                               10 / 25                               regression task            regression task
# images                      1893                            2043 / 4786                           2222                       2222
image type                    outfit                          architecture                          general                    general
class labels                  Bohemian, Goth, etc.            Georgian, Gothic, etc.                memorability               interestingness
# training images             853                             300 / 750                             1111                       1982
# testing images              92                              1743 / 4036                           1111                       240
data split                    random                          random                                specified [31]             random
# fold(s)                     100                             10                                    25                         10
evaluation metric             accuracy                        accuracy                              ρ                          ρ
reference of above setting    [43]                            [77]                                  [31]                       [28]

Table 4.1: The databases and associated abstract tasks used in this thesis along with their properties. In the rest of the thesis, we refer to each task by the corresponding task ID listed under each task. The experimental setting for each task is provided at the bottom of the table, where ρ is the Spearman rank correlation between the prediction and the ground truth.

When using CNN-based approaches or features for abstract tasks, most existing works limit themselves to one specific domain instead of leveraging the CNN features learned from different domains of abstract tasks. For instance, Bar et al.
[3] use CNN-based features to perform artistic style classification of the images in the WikiArt database [4]. Peng et al. [66] predict emotion distributions with their proposed Emotion6 database. Karayev et al. [39] work on image style classification with their proposed databases, but Lu et al. [50] are interested in classifying both aesthetic qualities and image styles using the AVA database [54]. Though our databases do not completely overlap with theirs, 3 of our 8 selected tasks in Table 4.1 (EMO, ART, and AVA) cover abstract tasks similar to theirs. The novelty of this chapter is in applying CNN-based features to 8 abstract tasks from 8 different domains and leveraging the features learned from multiple tasks simultaneously. The following section summarizes our findings.

4.1.1 Summary of Findings

Superior Performance of CNN-based Approaches in Abstract Tasks

Testing the performance of 5 different training approaches with the AlexNet [46], we find that at least one of the five CNN-based approaches outperforms the current state-of-the-art performance in each of the 8 abstract tasks in Table 4.1.

Concatenating CNN Features Learned from Different Tasks Can Enhance the Performance in Each Task

Unlike previous works [28, 31, 41, 43, 52, 54, 77] that only use standard or handcrafted features without using the features specifically trained for other abstract tasks, we argue that the performance of a given abstract task can benefit from the features learned from other existing abstract tasks. More precisely, our results show that for each of the abstract tasks in Table 4.1, concatenating the CNN features learned from some other task(s) which are different from the task of interest can outperform using only the CNN features learned from the task of interest. This finding suggests that as the computer vision community keeps identifying and proposing brand new tasks, researchers should leverage the knowledge learned from other related tasks.
Concatenating CNN Features Learned from All the Tasks Does Not Perform the Best

To identify which CNN features perform the best when concatenated in each task, we evaluate different settings of concatenating CNN features. Our results show that concatenating the CNN features learned from all the tasks in our experiment does not perform the best, which is surprising and inconsistent with what is usually shown in previous works [30, 31, 41, 54] where combining all the features achieves the best performance. We also show that in some abstract tasks, using only the CNN features learned from the task of interest outperforms concatenating the CNN features learned from all the tasks. In addition, for each task, we identify the concatenating setting which results in the best performance in our experiment.

Suggestions of Choosing CNN Features to Use in Abstract Tasks

We point out that CNN can be used as a tool to correlate different abstract tasks. According to the performance of the CNN features learned from different abstract tasks, we are able to interpret a given abstract task using higher-level semantics instead of only using standard or handcrafted features like previous works [28, 41, 43, 52, 54]. For example, according to our results, artistic style features ($F_{ART}$) outperform fashion style features ($F_{FAS}$) in the artist classification task (AST), where the notation “$F_T$” denotes CNN features learned from the task $T$ ($T$ is a task ID). For each task, we release the performance ranking of the CNN features learned from different tasks. The ranking is an indicator of which CNN features we should consider in each task. We hope this method of correlating different abstract tasks will encourage researchers to leverage the knowledge learned from existing tasks more when they solve their tasks of interest.
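The idea of concatenating features learned from different tasks can be illustrated with a small plain-Python sketch. Both function names are hypothetical, the feature vectors are toy values rather than real CNN activations, and always keeping the features of the task of interest in every candidate setting is an assumption made here for illustration, not a description of the search actually used in the experiments.

```python
from itertools import combinations


def concatenate_features(features_by_task, task_ids):
    """Concatenate the per-task feature vectors of one image, in order."""
    combined = []
    for task_id in task_ids:
        combined.extend(features_by_task[task_id])
    return combined


def candidate_settings(task_of_interest, all_task_ids):
    """Yield every setting that keeps the task of interest's own features
    and adds any subset of the features learned from the other tasks."""
    others = [t for t in all_task_ids if t != task_of_interest]
    for k in range(len(others) + 1):
        for extra in combinations(others, k):
            yield (task_of_interest,) + extra


if __name__ == "__main__":
    toy = {"AST": [1.0, 2.0], "ART": [3.0], "FAS": [4.0, 5.0]}
    for setting in candidate_settings("AST", list(toy)):
        print(setting, concatenate_features(toy, setting))
```

With the 8 task IDs of Table 4.1, this enumeration produces 2^7 = 128 candidate settings per task, each of which would be evaluated with the classifier or regressor of that task.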
4.2 Experimental Setup

4.2.1 Databases and Tasks

Performing 8 abstract tasks from 6 databases, we summarize all the databases and tasks used in this chapter in Table 4.1 along with their properties and related statistics. We refer to each task by its task ID listed in Table 4.1, where the experimental setting associated with each task is also provided. In Table 4.1, “data split” indicates whether the training/testing splits are randomly generated or specified by the work proposing the dataset/task, and “# fold(s)” represents the number of different training/testing splits used for the task. The experimental setting of each task follows that of the corresponding reference listed at the bottom of Table 4.1.

4.2.2 Network Architecture

For the 6 classification tasks in Table 4.1, we use the Caffe [32] implementation of the AlexNet [46] except that the number of the nodes in the output layer is set to the number of classes in each task. For the 2 regression tasks (MEM and INT), we also use the Caffe [32] implementation of the AlexNet [46] except that the number of the nodes in the output layer is changed to 1 to predict a real value and that the softmax loss layer is replaced with the Euclidean loss layer. When using the Caffe [32] implementation, we adopt its default training parameters for training the CNN for ImageNet [11] classification unless otherwise specified.

4.2.3 Training Approaches

Before training, we resize all the images to 256×256, which is the same size used to train the CNN for ImageNet [11] classification in the Caffe [32] implementation. We directly adopt the Caffe reference model [32] (denoted as $M_{ImageNet}$) for ImageNet [11] classification.
Using the same CNN architecture (except that the number of the nodes in the output layer is set to 2), we train an AVA [54] reference model (denoted as M_AVA) from scratch with ∼233k training images (that is, we randomly initialize all the CNN parameters and train with the training set). We train a reference model for AVA [54] instead of other databases in Table 4.1 because the number of training images in AVA [54] is large enough (>230k) that training from scratch can achieve reasonable performance. Given M_ImageNet and M_AVA, we list the 5 training approaches used in this chapter in Table 4.2, following the descriptions and setting of supervised pre-training and fine-tuning used in [1] unless otherwise specified.

training approach ID    description
pt ImageNet + ft        Pre-train with M_ImageNet and fine-tune all the CNN parameters using the training set.
pt ImageNet + ft-fc8    The same as “pt ImageNet + ft” except that only the CNN parameters associated with the edges directly connected to the output layer are allowed to be updated using the training set.
pt AVA + ft             Pre-train with M_AVA and fine-tune all the CNN parameters using the training set.
pt AVA + ft-fc8         The same as “pt AVA + ft” except that only the CNN parameters associated with the edges directly connected to the output layer are allowed to be updated using the training set.
train from scratch      Randomly initialize all the CNN parameters and train with the training set.

Table 4.2: The five different CNN training approaches used in this chapter. M_ImageNet is the Caffe [32] reference model trained for ImageNet [11] classification, and M_AVA is our trained reference model for AVA [54] classification. We refer to each training method by its training approach ID.

In Table 4.2, pre-training (pt) with M_D means using a data-rich auxiliary dataset D (D ∈ {ImageNet, AVA}) to initialize the CNN parameters.
Fine-tuning (ft) means all the CNN parameters can be updated by continued training on the dataset of interest. “ft-fc8” is the same as “ft” except that only the CNN parameters associated with the edges directly connected to the output layer are allowed to be updated. In this chapter, we use the training approach ID in Table 4.2 to refer to each training approach. Applying all the 5 training approaches listed in Table 4.2 to all the 8 tasks in Table 4.1, we report the experimental results and compare them with the corresponding state-of-the-art performance in Sec. 4.3.

task ID  evaluation metric      previous work         pt ImageNet + ft  pt ImageNet + ft-fc8  pt AVA + ft      pt AVA + ft-fc8  train from scratch
EMO      1-vs-all accuracy (%)  63.163 [73]           60.127            64.724                59.836           60.644           61.572
AST      accuracy (%)           53.100 [41]           56.102            53.541                25.615           4.671            21.698
ART      accuracy (%)           62.200 [41]           68.290            65.165                40.625           18.015           38.327
AVA      accuracy (%)           73.250 [50]           n/a               n/a                   n/a              n/a              74.436
FAS      accuracy (%)           70.971 [43]           71.294            66.228                57.337           27.554           54.304
ARC      accuracy (%)           69.170 / 46.210 [77]  71.159 / 52.953   67.246 / 51.469       35.841 / 20.401  18.233 / 8.290   21.532 / 12.386
MEM      ρ                      0.500 [42]            0.520             -0.140                0.368            0.080            0.372
INT      ρ                      0.600 [28]            0.643             0.339                 0.511            -0.113           0.382

Table 4.3: The summary of the experimental results of the 8 abstract tasks listed in Table 4.1 using the five training approaches in Table 4.2. In this table, ρ is the Spearman rank correlation between the prediction and the ground truth. The bold numbers represent the best performance given the specified evaluation metric, and the underlined numbers indicate the performance better than that of “train from scratch.”
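The regression tasks MEM and INT are scored with the Spearman rank correlation ρ between the prediction and the ground truth. As an illustration only, ρ can be computed as the Pearson correlation of the two rank vectors. The sketch below is a minimal pure-Python version; the function names `average_ranks` and `spearman_rho` are hypothetical, and giving tied values the average of their ranks is an assumption, since the thesis does not state how ties are handled.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks (assumed convention)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the block [start, end] over all entries equal to the current value.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(prediction: Sequence[float], ground_truth: Sequence[float]) -> float:
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    n = len(prediction)
    if n != len(ground_truth) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    rx, ry = average_ranks(prediction), average_ranks(ground_truth)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Toy scores (not from the thesis): one swapped pair out of four items.
    print(spearman_rho([1, 2, 3, 4], [1, 3, 2, 4]))
```

Because ρ depends only on the ordering of the predictions, a regressor trained with a squared-error objective is not directly optimizing it.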
4.3 CNN Performance in Abstract Tasks

We summarize the experimental results of the selected 8 abstract tasks in Table 4.3, which shows that in all the 8 tasks, at least one of the five training approaches in Table 4.2 outperforms the state-of-the-art methods. Table 4.3 also shows that for most of the 8 tasks, the training approaches involving pre-training and fine-tuning outperform training from scratch. For AST, ART, and ARC, the complete results are shown in Table 4.3, where the results of ARC are displayed in the form: 10-way / 25-way classification accuracy. Reviewing all the experimental results in Table 4.3, we summarize the following tips for future researchers applying CNN-based approaches to abstract tasks:

1. If there is no prior knowledge about the abstract task of interest, one reasonable way is applying all the 5 training approaches in Table 4.2 and selecting the one with the best performance on the validation set.
2. If we have the prior knowledge of which database (out of all the possible databases at hand which can be used for pre-training) is more relevant to the abstract task of interest, we can directly use that database for pre-training instead of trying all of them.
3. Even if we have no prior knowledge, from our empirical experience, “pt ImageNet + ft” usually performs well for abstract tasks.

4.4 Correlating Abstract Tasks

4.4.1 Experimental Setting

To find out whether the features learned from one task can enhance the performance in another task, we select a total of 9 tasks (the 8 abstract tasks in Table 4.1 and the Caltech-101 [48] object classification task) in our additional experiment. We include the Caltech-101 [48] object classification task (we use CAL as its task ID) because we are also curious about whether the features learned from an object classification task can improve the performance in abstract tasks.
For the task CAL, following the same setting in [82] (30 training images per class), we achieve accuracy comparable to that reported in [82] by using the network architecture in Sec. 4.2.2 and the training approach “pt ImageNet + ft.” For each of the 9 tasks in the experiment, we train the corresponding CNN with the network architecture in Sec. 4.2.2 and the training approach “pt ImageNet + ft.” We treat each of the 9 trained CNNs as a feature extractor which takes an image as input and outputs a 4096-d feature vector from its topmost fully connected layer. We use “F_T” to represent the 4096-d feature vector output from the CNN trained with the task T, and we call F_T the “self feature” of the task T. For example, F_EMO and F_AST are the learned CNN features corresponding to emotion and artist classification respectively. In the task EMO, F_EMO is the self feature, but F_AST is not.

With the 9 trained CNN feature extractors, we illustrate the framework of our experiment in Figure 4.1. Our goal is to evaluate the performance in each task under different settings of concatenating learned CNN features. We generate the concatenated CNN features as follows. First, given an input image, we extract all the F_Ti where i ∈ {1, 2, ..., 9} (Ti is one of the abstract tasks in Table 4.1 or CAL; Ti ≠ Tj if i ≠ j). Second, we decide whether to concatenate F_Ti by the binary switch S_i associated with F_Ti. If S_i is set (reset), F_Ti will (will not) be part of the features concatenated to form the final feature vector. In other words, formed by concatenating all the F_Ti with set S_i, the final concatenated CNN features are a (4096×n_set)-d vector, where n_set is the total number of set S_i (i ∈ {1, 2, ..., 9}).

Figure 4.1: The framework of concatenating the CNN features learned from different tasks. We experiment on 9 tasks (n = 9), including the 8 abstract tasks in Table 4.1 and the Caltech-101 [48] object classification task. The switch S_i associated with each task Ti (i ∈ {1, 2, ..., 9}) controls whether the CNN features learned from task Ti are concatenated in the final feature vector.

For each of the 9 tasks, we evaluate the performance under a total of 264 different settings of the 9 switches. These settings include:

1. 9 different combinations of S_i such that n_set = 1.
2. 2^8 − 1 = 255 different combinations of S_i such that n_set > 1 and the self feature is concatenated.

For each task, we train a classifier or regressor for each of the 264 settings using the concatenated CNN features of the training database, and we test on the concatenated CNN features of the testing database using the trained classifier or regressor. In this experiment, we use the support vector machine (SVM) or support vector regression (SVR) provided in LIBSVM [8] with a linear kernel and the LIBSVM [8] default parameters to train all the classifiers and regressors. Considering the efficiency of the experiment, we choose to do the first 5 folds of training/testing splits in the tasks FAS, ARC, MEM, and INT, where more than 5 folds are provided. In the task EMO, we perform 8-way classification instead of the 1-vs-all setting. For the AVA database [54] associated with the task AVA, we use the generic training set with 2495 images to shorten the training time. In the task ARC, we do the 25-way classification task specified in Table 4.1. In the task CAL, we follow the setting in [82] (30 training images per class). Other experimental settings are consistent with Table 4.1 unless otherwise specified.

task ID                                 EMO           AST           ART           AVA           FAS           ARC           MEM     INT     CAL
evaluation metric                       accuracy (%)  accuracy (%)  accuracy (%)  accuracy (%)  accuracy (%)  accuracy (%)  ρ       ρ       accuracy (%)
self feature                            36.228        55.148        67.555        69.423        76.957        54.440        0.398   0.573   88.217
best concatenating setting (Table 4.5)  39.082        57.509        71.048        69.980        77.609        55.382        0.507   0.630   88.394
concatenate all                         36.971        54.596        69.210        69.458        74.348        53.489        0.504   0.629   85.969

Table 4.4: The summary of the experimental results using the framework in Figure 4.1. The task CAL and the 8 abstract tasks listed in Table 4.1 are included in this experiment. In this table, ρ is the Spearman rank correlation between the prediction and the ground truth. The underlined numbers indicate the performance better than that of using self feature. The best concatenating setting in each task is shown in Table 4.5. This table shows that concatenating the CNN features learned from different tasks can improve the performance. However, in our experiment, concatenating all the learned CNN features does not perform the best in any task. In fact, in 4 out of 9 tasks, concatenating all the learned CNN features performs even worse than using self feature.

[Table 4.5 is a check-mark matrix with one column per task (ART, AST, CAL, ARC, EMO, AVA, FAS, MEM, INT) and one row per feature (F_ART, F_AST, F_CAL, F_ARC, F_EMO, F_AVA, F_FAS, F_MEM, F_INT); the column positions of the check marks are not recoverable in this text.]

Table 4.5: The best concatenating setting out of 264 different settings in each task. The corresponding performance is listed in Table 4.4. This table shows that the best performance of each task is not achieved by concatenating all the learned CNN features, but by concatenating a subset of them.

4.4.2 Experimental Results

We summarize the performance of concatenating learned CNN features in Table 4.4, where we compare the performance of concatenating all the 9 F_Ti with that of self feature in each task. The best performance out of 264 different settings is also listed in Table 4.4, and the corresponding best concatenating setting is identified in Table 4.5. In Table 4.4, the underlined numbers represent the performance better than that of using self feature.
Table 4.4 supports the finding that concatenating the CNN features learned from different tasks can improve the performance in each task, regardless of whether the task is an abstract task. Table 4.4 also shows that in all the 9 tasks, concatenating all the learned CNN features is not the best concatenating setting, which is different from most previous works [30, 31, 41, 54], where combining all the features achieves the best performance. One possible reason is that the amount of training data in each task is not large enough to train the (4096×9)-d weight vector of the SVM/SVR well. Table 4.4 shows that in 4 tasks (AST, CAL, ARC, and FAS), concatenating all the learned CNN features performs even worse than using only self feature, which suggests that we should concatenate useful features instead of concatenating all the features. For each task, the best concatenating setting out of 264 different settings is identified in Table 4.5, which can serve as a guide to selecting useful features in each task.

To be consistent with most previous works [28, 30, 31, 41, 43, 54], where the performance of each single feature is reported, we summarize the performance ranking in each task in Table 4.6, where the numbers are presented in the form “rank (performance).” In Table 4.6, the rank 1 (9) represents the best (worst) performance in each task. We also separately list the best and the worst performances out of the 9 F_Ti in each task. We emphasize that even the worst performance in each task is much better than random guessing, which indicates that the non-best performance of concatenating all the features shown in Table 4.4 is not simply caused by combining useless features. In Table 4.6, using self feature performs the best in all the 9 tasks except the two regression tasks MEM and INT. One possible reason is that SVR is not designed to maximize ρ, the evaluation metric specified by the works [28, 31] which propose these tasks.
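The count of 264 settings per task in Sec. 4.4.1 follows from 9 single-feature settings plus 2^8 − 1 = 255 settings in which the self feature is combined with a non-empty subset of the other 8 features. The following sketch enumerates these settings and concatenates the selected features. It is an illustration under assumptions rather than the code used in our experiment: the names `switch_settings` and `concatenate` are hypothetical, and the toy feature vectors stand in for the 4096-d CNN features.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

# Task IDs of the 9 tasks (8 abstract tasks plus CAL).
TASKS = ("ART", "AST", "CAL", "ARC", "EMO", "AVA", "FAS", "MEM", "INT")


def switch_settings(self_task: str, tasks: Sequence[str] = TASKS) -> List[Tuple[str, ...]]:
    """List the feature subsets (set switches) evaluated for one task."""
    # (1) n_set = 1: each of the 9 features used on its own.
    settings: List[Tuple[str, ...]] = [(t,) for t in tasks]
    # (2) n_set > 1 with the self feature always set:
    #     any non-empty subset of the other 8 features, i.e. 2^8 - 1 subsets.
    others = [t for t in tasks if t != self_task]
    for mask in product((False, True), repeat=len(others)):
        if any(mask):
            chosen = {t for t, on in zip(others, mask) if on} | {self_task}
            settings.append(tuple(t for t in tasks if t in chosen))
    return settings


def concatenate(features: Dict[str, Sequence[float]], setting: Sequence[str]) -> List[float]:
    """Concatenate the features whose switches are set, in task order."""
    out: List[float] = []
    for task in setting:
        out.extend(features[task])
    return out


if __name__ == "__main__":
    settings = switch_settings("EMO")
    print(len(settings))  # 9 + 255 = 264
    # Toy 4-d features instead of 4096-d ones, for illustration only.
    toy = {t: [float(i)] * 4 for i, t in enumerate(TASKS)}
    print(len(concatenate(toy, settings[-1])))  # all 9 features: 4 * 9 = 36
```

Each resulting vector would then be passed to a linear SVM or SVR, one model per setting.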
Given that self feature usually performs the best (out of 9 learned CNN features) as shown in Table 4.6, we are curious about the effect of adding one additional feature to the self feature. In Table 4.7, we analyze the performance ranking of using self feature plus the CNN features learned from another task. The format of the numbers is the same as that of Table 4.6 except that the underlined numbers represent the performance better than that of using only self feature. The best and the worst performances in each task out of the 8 different combinations of features are listed at the bottom of Table 4.7, where the performance of using only self feature is also provided.

task ID            ART           AST           CAL           ARC           EMO           AVA           FAS           MEM        INT
F_ART              1 (67.555)    2 (48.569)    4 (81.080)    4 (48.845)    7 (31.138)    5 (63.181)    4 (68.478)    3 (0.454)  5 (0.520)
F_AST              1 (67.555)    1 (55.148)    5 (81.066)    6 (48.320)    6 (31.139)    7 (62.915)    6 (66.739)    6 (0.442)  6 (0.508)
F_CAL              3 (61.121)    5 (45.907)    1 (88.217)    3 (50.411)    5 (31.886)    3 (63.297)    7 (66.739)    1 (0.464)  8 (0.493)
F_ARC              4 (60.938)    4 (46.760)    2 (83.667)    1 (54.440)    2 (33.868)    4 (63.277)    2 (71.739)    7 (0.434)  9 (0.487)
F_EMO              4 (60.938)    3 (47.564)    8 (70.771)    2 (50.629)    1 (36.228)    2 (63.392)    5 (68.043)    2 (0.459)  7 (0.497)
F_AVA              7 (56.985)    7 (41.487)    6 (80.475)    7 (46.824)    4 (33.372)    1 (69.423)    3 (69.565)    5 (0.445)  1 (0.575)
F_FAS              6 (57.813)    6 (44.751)    3 (81.514)    5 (48.697)    3 (33.748)    6 (63.006)    1 (76.957)    4 (0.450)  4 (0.542)
F_MEM              9 (51.838)    9 (37.167)    9 (65.643)    9 (40.986)    9 (27.170)    9 (61.004)    9 (60.870)    8 (0.398)  3 (0.560)
F_INT              8 (55.790)    8 (40.532)    7 (74.316)    8 (43.925)    8 (30.029)    8 (61.666)    8 (66.087)    9 (0.346)  2 (0.573)
evaluation metric  accuracy (%)  accuracy (%)  accuracy (%)  accuracy (%)  accuracy (%)  accuracy (%)  accuracy (%)  ρ          ρ
best performance   67.555        55.148        88.217        54.440        36.228        69.423        76.957        0.464      0.575
worst performance  51.838        37.167        65.643        40.986        27.170        61.004        60.870        0.346      0.487

Table 4.6: The performance ranking in each task by using the CNN features learned from a single task. The numbers are presented in the form “rank (performance).” The rank 1 (9) represents the best (worst) performance in each task. The table shows that self feature usually performs the best if we use the CNN features learned from one task. The best and the worst performances out of the 9 different features in each task are listed at the bottom of the table, which shows that even the worst performance is non-trivial (much better than random guessing).

task ID               ART           AST           CAL           ARC           EMO           AVA           FAS           MEM        INT
self feature + F_ART  n/a           3 (56.404)    4 (88.231)    2 (55.124)    1 (37.095)    6 (69.498)    2 (76.957)    7 (0.459)  6 (0.598)
self feature + F_AST  1 (70.129)    n/a           5 (87.823)    6 (54.574)    3 (36.560)    3 (69.528)    7 (74.348)    3 (0.466)  4 (0.599)
self feature + F_CAL  2 (68.658)    2 (56.605)    n/a           5 (54.747)    5 (36.352)    8 (69.323)    4 (76.522)    2 (0.468)  5 (0.598)
self feature + F_ARC  2 (68.658)    7 (55.098)    3 (88.244)    n/a           7 (35.855)    5 (69.503)    5 (75.652)    5 (0.463)  7 (0.598)
self feature + F_EMO  4 (68.382)    1 (57.308)    6 (87.728)    1 (55.149)    n/a           7 (69.473)    6 (75.435)    1 (0.470)  3 (0.599)
self feature + F_AVA  6 (67.739)    6 (55.349)    1 (88.319)    4 (54.990)    2 (36.601)    n/a           1 (77.391)    4 (0.465)  1 (0.610)
self feature + F_FAS  7 (67.647)    4 (56.103)    2 (88.299)    3 (55.064)    6 (36.105)    1 (69.729)    n/a           6 (0.461)  2 (0.603)
self feature + F_MEM  5 (67.831)    5 (55.550)    7 (86.975)    7 (53.271)    4 (36.479)    3 (69.528)    2 (76.957)    n/a        8 (0.585)
self feature + F_INT  8 (67.096)    8 (53.541)    8 (85.834)    8 (52.359)    8 (35.361)    2 (69.594)    7 (74.348)    8 (0.417)  n/a
evaluation metric     accuracy (%)  accuracy (%)  accuracy (%)  accuracy (%)  accuracy (%)  accuracy (%)  accuracy (%)  ρ          ρ
self feature only     67.555        55.148        88.217        54.440        36.228        69.423        76.957        0.398      0.573
best performance      70.129        57.308        88.319        55.149        37.095        69.729        77.391        0.470      0.610
worst performance     67.096        53.541        85.834        52.359        35.361        69.323        74.348        0.417      0.585

Table 4.7: The performance ranking in each task by using self feature and the CNN features learned from another task. The format of the numbers is the same as that of Table 4.6 except that the underlined numbers represent the performance better than that of using only self feature. The best and the worst performances in each task out of the 8 different combinations of features are listed at the bottom of the table, where the performance of using only self feature is also provided. The table shows that most of the listed feature combinations outperform using only self feature in each task.

Table 4.7 shows that most combinations of self feature and the CNN features learned from another task outperform using only self feature, which encourages us to use the CNN features learned from another task when solving the task of interest. However, there are some cases where combining features decreases the performance (as shown in Table 4.7). These cases highlight the importance of choosing useful features, and both Table 4.6 and Table 4.7 can serve as indicators for choosing useful features. For instance, in the task ART, we should consider using F_AST to enhance the performance. Table 4.6 and Table 4.7 are also examples of using CNN as a tool to correlate different tasks. Leveraging the characteristic of CNN that the features learned from one task can be naturally bundled as a feature set (for example, F_ART), we are able to interpret a given task using higher-level semantics.
For example, in fashion style classification (FAS), aesthetic features (F_AVA) are more useful (in terms of enhancing performance) than the features learned from artist classification (F_AST) according to both Table 4.6 and Table 4.7.

4.5 Conclusion

In this chapter, we apply 5 different CNN-based training approaches to the selected 8 abstract tasks receiving great attention in computer vision recently, showing that our results outperform the state-of-the-art results in all the 8 tasks. Unlike previous researchers who use standard or handcrafted features to solve abstract tasks, we propose a framework to leverage the CNN features learned from different tasks. By evaluating the performance of concatenating features in different settings, we show that using the CNN features learned from one task can enhance the performance in another task. We also show that concatenating all the learned CNN features in this chapter is not the best option. Instead, we should identify the useful features in each task to achieve the best performance. To identify the useful features in each task, we not only show the best concatenating setting but also use CNN as a tool to correlate different tasks. By presenting the performance ranking of the CNN features learned from different tasks, we provide suggestions of which CNN features to use in each task. We hope that the results presented in this chapter will encourage researchers proposing new tasks or interested in existing tasks to use the knowledge learned from existing tasks instead of focusing only on the task of interest.

CHAPTER 5
MULTI-DEPTH CNN FEATURES

Summary

Recent works about convolutional neural networks (CNN) show breakthrough performance on various tasks. However, most of them only use the features extracted from the topmost layer of CNN instead of leveraging the features extracted from CNNs with different numbers of layers.
As the first group which explicitly addresses utilizing the features from CNNs with different numbers of layers, we propose multi-depth CNN features which consist of the features extracted from multiple CNNs with different numbers of layers. Our experimental results show that our proposed multi-depth CNN features outperform not only the state-of-the-art results but also the features commonly used in the traditional CNN framework on 8 abstract tasks. As shown by the experimental results, our proposed multi-depth CNN features achieve the best known performance on the 8 abstract tasks in different domains, which makes our proposed multi-depth CNN features promising solutions for generic tasks.

5.1 Introduction

Extraordinary performance by using convolutional neural networks (CNN) has been reported in recent literature [1, 24, 26, 29, 38]. However, there is one major constraint in the traditional CNN framework: the final output of the output layer is solely based on the features extracted from the topmost layer. In other words, given the features extracted from the topmost layer, the final output is independent of all the features extracted from other non-topmost layers. At first glance, this constraint seems to be reasonable because the non-topmost layers are implicitly considered in the way that the output of the non-topmost layers is the input of the topmost layer. However, we believe that the features extracted from the non-topmost layers are not explicitly and properly utilized in the traditional CNN framework, where partial features generated by the non-topmost layers are ignored during training. Therefore, we want to relax this constraint of the traditional CNN framework by explicitly leveraging the features extracted from non-topmost layers. Inspired by this idea, we propose multi-depth features based on the AlexNet [46], and we show that our proposed features outperform the AlexNet [46] on the 8 abstract tasks listed in Table 4.1.
The details of our proposed multi-depth CNN features, the experimental setup, and the results are presented in Sec. 5.2, Sec. 5.3, and Sec. 5.4 respectively.

Two recent studies analyze the performance of multi-layer CNN [1, 82]. Both works extract features from the AlexNet [46] and evaluate the performance on different databases. They reach the same conclusion: the features extracted from the topmost layer have the best discriminative ability in classification tasks compared with the features extracted from other non-topmost layers. However, both works [1, 82] only evaluate the performance of the features extracted from one layer at a time without considering the features learned by CNNs with different numbers of layers at once. Unlike [1, 82], in Sec. 5.4, we show that our proposed multi-depth CNN features outperform the features extracted from the topmost layer of the AlexNet [46] (the features used in [1, 82]).

In previous works [41, 77] studying the abstract tasks involved in this chapter, the state-of-the-art results are achieved by the traditional handcrafted features (for example, SIFT and HOG) without considering CNN-related features. In Sec. 5.4, we show that our proposed multi-depth CNN features outperform the state-of-the-art results on 8 abstract tasks.

Another related prior work is the double-column convolutional neural network (DCNN) proposed by Lu et al. [50]. Using DCNN to predict pictorial aesthetics, Lu et al. [50] extract multi-scale features from multiple CNNs with the multi-scale input data generated from their algorithm. In contrast, this chapter focuses on the multi-depth CNN features extracted from CNNs with different numbers of layers without the need to generate multi-scale input.

In this chapter, our main contribution is the concept of utilizing the features extracted from the CNNs with different numbers of layers.
Based on this concept, we propose multi-depth CNN features extracted from the CNNs with different numbers of layers and show that our proposed features outperform not only the state-of-the-art performance but also the results of the traditional CNN framework on 8 abstract tasks. To the best of our knowledge, this is the first work explicitly utilizing the features extracted from the CNNs with different numbers of layers, which is a novel departure from most CNN-related works, which use only the features extracted from the topmost layer.

5.2 Generating Multi-depth CNN Features

Figure 5.1 illustrates the CNN structures involved in this chapter, and Table 5.1 shows how we form our proposed multi-depth CNN features. There are 5 different CNN structures in Figure 5.1, where CNN0 represents the AlexNet [46], and CNN1 to CNN4 are the “sub-CNNs” of CNN0 (they are the same as CNN0 except that some convolutional layers are removed). We use the same notation of convolutional layers (conv-1 to conv-5) and fully connected layers (fc-6 and fc-7) as that used in [1] to represent the corresponding layers in the AlexNet [46]. In Figure 5.1, in addition to the input and output layers, we only show the convolutional and fully connected layers of the AlexNet [46] for clarity.

Figure 5.1: The five CNN structures adopted in this chapter. CNN0 represents the AlexNet [46], and CNN1 to CNN4 are the same as CNN0 except that some convolutional layers are removed. We use each CNN_i (i ∈ {0, 1, ..., 4}) as a feature extractor which takes an image as input and outputs a feature vector f_i from the topmost fully connected layer. These f_i are concatenated to form our proposed multi-depth features according to the definition in Table 5.1.
Instead of using the output from the output layer of each CNN_i (i ∈ {0, 1, ..., 4}), we treat CNN_i as a feature extractor which takes an image as input and outputs a k-d feature vector f_i from the topmost fully connected layer, which is inspired by [15]. We follow the specification of the AlexNet [46] and use k = 4096 in our experiment. In Figure 5.1, f0 represents the features extracted from the output of the fc-7 layer of the AlexNet [46], and f1 to f4 represent the features derived from different combinations of the convolutional layers. f_i (i ∈ {0, 1, ..., 4}) is extracted from the topmost fully connected layer of CNN_i, not from the intermediate layer of CNN0, because the features extracted from the topmost layer have the best discriminative ability according to [1, 82]. As the features learned from CNN0 and its sub-CNNs, the f_i implicitly reflect the discriminative ability of the corresponding layers of CNN0. Most CNN-related works use only f0 and ignore the intermediate features (f1 to f4), but we explicitly extract them as part of our proposed multi-depth CNN features, which are explained in the following paragraph.

feature ID     multi-depth features (Figure 5.1)  dimension
F0 (baseline)  f0                                 k
F1             f0 + f1                            2k
F2             f0 + f1 + f2                       3k
F3             f0 + f1 + f2 + f3                  4k
F4             f0 + f1 + f2 + f3 + f4             5k

Table 5.1: The summary of the multi-depth CNN features used in this chapter. Serving as a baseline, F0 represents the features extracted from the topmost layer in the traditional CNN framework. F1 to F4 are our proposed multi-depth CNN features which are formed by concatenating the f_i (i ∈ {0, 1, ..., 4}) defined in Figure 5.1. We follow the specification of the AlexNet [46] and use k = 4096.

Using the feature vectors (f_i) defined in Figure 5.1, we concatenate these f_i and form our proposed multi-depth CNN features. We summarize these multi-depth CNN features (F1 to F4) in Table 5.1, which specifies how the features are formed and their dimensions.
F0 represents the features extracted from the topmost layer in the traditional CNN framework without concatenating the features from other layers. The feature IDs listed in Table 5.1 are used to refer to the corresponding multi-depth CNN features when we report the experimental results in Sec. 5.4, where we compare the performance of F_i (i ∈ {0, 1, ..., 4}) on 8 abstract tasks.

5.3 Experimental Setup

5.3.1 Databases and Tasks

We conduct experiments on 8 abstract tasks. The databases and tasks are summarized in Table 4.1, where their properties and related statistics are shown. When reporting the results in Sec. 5.4, we use the task ID listed in Table 4.1 to refer to each task. To evaluate different methods under fair comparison, we use the same experimental settings for the 3 tasks (AST, ART, and ARC) as those used in the references listed at the bottom of Table 4.1. For the task AVA, we use the generic training set with 2495 images to shorten the training time. For the task ARC, there are two different experimental settings (10-way and 25-way classification) provided by [77], and we do both in our experiment. For the task EMO, we perform 8-way classification instead of the 1-vs-all setting. For the 3 tasks FAS, MEM, and INT, we choose to do the first 5 folds of training/testing splits considering the efficiency of the experiment.

5.3.2 Training Approach

In our experiment, we use the Caffe [32] implementation to train the 5 CNN_i (i ∈ {0, 1, ..., 4}) in Figure 5.1 for each of the tasks in Table 4.1. For each classification task, CNN_i is adjusted such that the number of the nodes in the output layer is set to the number of classes of that task. For the two regression tasks MEM and INT in Table 4.1, we modify the number of the nodes in the output layer to 1 and use the Euclidean loss to replace the softmax loss.
When using the Caffe [32] implementation, we adopt its default training parameters for training the AlexNet [46] for ImageNet [11] classification unless otherwise specified. Before training CNN_i for each task, all the images in the corresponding database are resized to 256×256 according to the Caffe [32] implementation.

In the training phase, adopting the Caffe reference model provided by [32] (denoted as M_ImageNet) for ImageNet [11] classification, we train CNN_i (i ∈ {0, 1, ..., 4}) in Figure 5.1 for each task in Table 4.1. We follow the descriptions and setting of supervised pre-training and fine-tuning used in the work of Agrawal et al. [1], where pre-training with M_D means using a data-rich auxiliary database D to initialize the CNN parameters and fine-tuning means that all the CNN parameters can be updated by continued training on the corresponding training set. For each CNN_i for each task in Table 4.1, we pre-train it with M_ImageNet and fine-tune it with the training set of that task. After finishing training CNN_i, we form the multi-depth CNN features F_i (i ∈ {0, 1, ..., 4}) according to Table 5.1. With these multi-depth CNN feature vectors for training, we use support vector machine (SVM)/support vector regression (SVR) to train a linear classifier/regressor supervised by the ground truth of the training images in the corresponding database for the classification/regression task. Specifically, one linear classifier or regressor is trained for each F_i (i ∈ {0, 1, ..., 4}) for each task (a total of 5 classifiers or regressors per task). In practice, we use LIBSVM [8] to do so with the cost (parameter C in SVM or SVR) set to the default value 1. We tried different C values and found that they result in similar accuracy, so we use the default value.

In the testing phase, we use the given testing image as the input of the trained CNN_i (i ∈ {0, 1, ..., 4}) and generate the f_i.
The multi-depth CNN features $F_i$ ($i \in \{0, 1, \dots, 4\}$) are formed by concatenating the generated features $f_i$ according to Table 5.1. After that, we feed each feature vector ($F_i$) of the testing image to the corresponding trained SVM classifier or SVR regressor, whose output is the predicted label or value of the testing image.

5.4 Experimental Results

Using the training approach described in Sec. 5.3.2, we evaluate the performance of $F_i$ ($i \in \{0, 1, \dots, 4\}$) defined in Table 5.1 on the 8 abstract tasks listed in Table 4.1. The experimental results are summarized in Table 5.2, where the bold numbers represent the best performance for that task. We compare the performance of our proposed multi-depth CNN features ($F_1$ to $F_4$) with the following two baselines:

1. The best known performance on that task, as reported in the references listed in Table 5.2.
2. The performance of $F_0$, which represents the commonly used features in the traditional CNN framework.

| task ID | evaluation metric | previous work | $F_0$ (baseline) [62] | $F_1$ | $F_2$ | $F_3$ | $F_4$ |
| EMO | accuracy (%) | n/a | 36.23 | 38.83 | 38.34 | 38.34 | 38.34 |
| AST | accuracy (%) | 53.10 [41] | 55.15 | 56.25 | 56.40 | 56.35 | 56.35 |
| ART | accuracy (%) | 62.20 [41] | 67.37 | 68.29 | 68.57 | 69.21 | 69.21 |
| AVA | accuracy (%) | n/a | 69.42 | 69.57 | 69.57 | 69.60 | 69.60 |
| FAS | accuracy (%) | n/a | 76.96 | 77.39 | 77.39 | 76.96 | 76.96 |
| ARC | accuracy (%) | 69.17 / 46.21 [77] | 70.64 / 54.84 | 70.94 / 55.44 | 70.73 / 55.35 | 70.68 / 55.32 | 70.68 / 55.31 |
| MEM | ρ | n/a | 0.40 | 0.43 | 0.46 | 0.49 | 0.49 |
| INT | ρ | n/a | 0.57 | 0.58 | 0.62 | 0.63 | 0.63 |

Table 5.2: The summary of our experimental results. The bold numbers represent the best performance for each task. The results show that for all the 8 tasks, our proposed multi-depth CNN features ($F_1$ to $F_4$) outperform not only the best known results from prior works but also the features commonly used in the traditional CNN framework ($F_0$).
The results in Table 5.2 show that our proposed multi-depth CNN features ($F_1$ to $F_4$) match or outperform the two baselines on the 8 tasks (the only ties are $F_3$ and $F_4$ with $F_0$ on FAS), which supports our claim that utilizing the features extracted from multiple CNNs with different numbers of layers is better than using the traditional CNN features, which are extracted only from the topmost fully connected layer.

In Table 5.2, the best performance for each task is not always achieved by $F_4$, the multi-depth features extracted from all the CNNs in our experiment. Although $F_4$ has the highest dimension among all the proposed multi-depth features, and therefore in theory the most descriptive power as an image descriptor, a high-dimensional feature vector is also more prone to overfitting. The results in Table 5.2 reflect the trade-off between the descriptive power of the feature vectors and their tendency to overfit. We think that this trade-off and the database/task bias are the two major reasons why the best performance for each task is achieved by different multi-depth features (different $F_i$).

Table 5.2 also shows that CNN-based features ($F_0$ to $F_4$) outperform the classical handcrafted features (for example, SIFT and HOG) used in the prior works [41, 77], which is consistent with the findings of the recent CNN-related literature [15, 29, 50, 82, 83]. In addition, our proposed multi-depth CNN features are generic features applicable to various tasks, not features specifically designed for certain tasks. As shown in Table 5.2, these multi-depth CNN features are effective in various domains from artistic style classification to architectural style classification, which makes them promising candidates for other tasks of interest to future researchers.

5.5 Conclusion

In this chapter, we mainly focus on the idea of utilizing the features extracted from multiple CNNs with different numbers of layers.
Based on this idea, we propose the multi-depth CNN features and show their efficacy on 8 abstract tasks. Our proposed multi-depth CNN features outperform not only the state-of-the-art results but also the CNN features commonly used in the traditional CNN framework. Furthermore, we find that our proposed multi-depth CNN features are promising generic features which can be applied to various tasks.

CHAPTER 6
MULTI-SCALE CNN FEATURES

Summary

Most works related to convolutional neural networks (CNN) use the traditional CNN framework, which extracts features in only one scale. We propose multi-scale convolutional neural networks (MSCNN), which not only extract multi-scale features but also solve the issues of the previous methods that use CNN to extract multi-scale features. Under the assumption of the label-inheritable (LI) property, we also propose a method to generate exponentially more training examples for MSCNN from the given training set. Our experimental results show that MSCNN outperforms both the state-of-the-art methods and the traditional CNN framework on most of the 8 abstract tasks in our experiment, supporting that MSCNN outperforms the traditional CNN framework on the tasks which at least partially satisfy the LI property.

6.1 Introduction

As mentioned in Chapter 4 and Chapter 5, convolutional neural networks (CNN) have already achieved breakthrough performance in recent literature [1, 24, 26, 29, 38]. However, there are two constraints in the traditional CNN framework:

1. CNN extracts the features in only one scale without explicitly leveraging the features in different scales.
2. The performance of CNN highly depends on the amount of training data, which is shown in recent works [15, 82]. Using either the Caltech-101 [48] or Caltech-256 [27] databases, these works [15, 82] vary the number of training examples per class and record the corresponding accuracy.
Both works find that satisfactory performance can be achieved only when enough training examples are used.

In this chapter, we want to solve the above two issues of the traditional CNN framework. We propose multi-scale convolutional neural networks (MSCNN), a framework for extracting multi-scale features using multiple CNNs. The details of MSCNN are explained in Sec. 6.3. We also propose a method to generate exponentially more training examples for MSCNN from the given training set, based on the assumption that the task of interest is label-inheritable (LI). The concept of LI is explained in Sec. 6.2.

Among recent studies extracting multi-scale features using CNN, Gong et al. [26] propose multi-scale orderless pooling, and He et al. [29] propose spatial pyramid pooling. Both works show experimental results which support the argument that using multi-scale features outperforms using features in only one scale. In both works, the multi-scale features are extracted by using special pooling methods in the same CNN, so the CNN parameters used to extract multi-scale features are the same before the pooling stage. However, we argue that the features in different scales should be extracted with different sets of CNN parameters learned from the training examples in different scales, which is explicitly modeled in our proposed MSCNN.

Among the applications using multiple CNNs, Lu et al. [50] predict pictorial aesthetics by using a double-column convolutional neural network (DCNN) to extract features in two different scales, the global view and the fine-grained view. They also show that using DCNN outperforms using features in only one scale. The concept of using multiple CNNs is shared by DCNN and our proposed MSCNN. Nevertheless, the input images for the fine-grained view in DCNN are generated by randomly cropping the images in the global view. Lu's approach [50] has the following two drawbacks:

1. The randomly cropped image may not represent the entire image well in the scale of the fine-grained view.
2. By using only the cropped images in the scale of the fine-grained view, the information outside the cropped area is ignored.

In MSCNN, instead of throwing away information, we extract the features from every portion of the input image in each scale, which solves the above two drawbacks of DCNN.

In this chapter, we make the following contributions:

1. We propose multi-scale convolutional neural networks (MSCNN), which extract the features in different scales using multiple convolutional neural networks. We show that MSCNN outperforms not only the state-of-the-art performance on most of the 8 abstract tasks listed in Table 4.1 but also the traditional CNN framework, which only extracts features in one scale.
2. We also propose a method to generate exponentially more training examples for MSCNN from the given training set. Under the label-inheritable (LI) assumption, our method generates exponentially more training examples without the need to collect new training examples explicitly. Although our method and MSCNN are designed under the LI assumption, our experimental results support that our proposed framework can still outperform the traditional CNN framework on the tasks which only partially satisfy the LI assumption.

Figure 6.1: The illustration of the label-inheritable (LI) property. Given an image database $D$ associated with a task $T$, if any cropped version of any image $I$ from $D$ can take the same label as that of $I$, we say that $T$ satisfies the LI property. In this figure, we only show the case when the cropped image is the upper right portion of the original image. In fact, the cropped image can be any portion of the original image.

6.2 Label-Inheritable Property

Our proposed MSCNN is designed for the tasks satisfying the label-inheritable property, which is illustrated in Figure 6.1.
Given an image database $D$ associated with a task $T$, if any cropped version of any image $I$ from $D$ can take the same label as that of $I$, we say that $T$ is label-inheritable (LI). In other words, if the concept represented by the label of any image $I$ in the given database $D$ is also represented in each portion of $I$, the corresponding task $T$ satisfies the LI property.

Figure 6.2 shows three example images from the three databases (Painting-91 [41], arcDataset [77], and Caltech-101 [48]) associated with three different tasks which satisfy the LI property to different degrees. We also list the corresponding database, task, and label under each example image. For any image from the Painting-91 [41] database, each portion of that image is painted by the same artist (Picasso for the leftmost example image in Figure 6.2), so the task "artist style classification" is LI. For the images from the arcDataset [77] database, we can recognize the architectural style from different parts of the architecture (for example, the roof and pillars). However, if the cropped image does not contain any portion of the architecture, the cropped image cannot represent the architectural style of the original image, which is why the LI property is only partially satisfied for architectural style classification. For the images from the Caltech-101 [48] database, the LI property is only satisfied when the cropped image contains the entire object which the label represents. If the cropped image contains only a portion of the object, the cropped image may not be able to take the same label.

Figure 6.2: Example images from the three databases (Painting-91 [41], arcDataset [77], and Caltech-101 [48]) associated with three different tasks which satisfy the LI property to different degrees. The corresponding database, task, label, and the extent to which the LI property is satisfied are shown under each image.
For the rightmost example image in Figure 6.2, the portion showing the mouth cannot fully represent the label "faces." Therefore, we think object classification is mostly not LI.

In this chapter, we apply our proposed framework, MSCNN, to 8 tasks (listed in Table 4.1) satisfying the LI property to different degrees, showing that MSCNN outperforms the traditional CNN framework on the tasks satisfying the LI property. We also show experimental results supporting that MSCNN can still outperform the traditional CNN framework on the tasks which only partially satisfy the LI property. The details of the experimental results are shown in Sec. 6.4.

6.3 Experimental Setup

6.3.1 Databases and Tasks

We conduct experiments on the 8 abstract tasks specified in Table 4.1, where their properties and related statistics are also shown. We use the task ID listed in Table 4.1 to refer to each task. We adopt the same experimental settings for the 3 tasks (AST, ART, and ARC) as those used in the references listed at the bottom of Table 4.1. For the task EMO, we perform 8-way classification instead of the 1-vs-all setting. For the 3 tasks FAS, MEM, and INT, we experiment on only the first 5 folds of the training/testing splits to keep the experiment efficient. For the task ARC, we conduct our experiment with both of the experimental settings (10-way and 25-way classification) provided by [77]. For the task AVA, we use the generic training set with 2495 images to shorten the training time.

Figure 6.3: The illustration of our proposed multi-scale convolutional neural networks (MSCNN), which consist of $m + 1$ AlexNets [46] (one for each of the $m + 1$ different scales). The details of the MSCNN architecture and the training approach are explained in Sec. 6.3.2 and Sec. 6.3.3 respectively.

6.3.2 MSCNN Architecture

The architecture of our proposed multi-scale convolutional neural networks (MSCNN) is illustrated in Figure 6.3.
MSCNN consists of $m + 1$ AlexNets [46] which extract features in $m + 1$ different scales (from scale 0 to scale $m$, one AlexNet [46] per scale). We only show $m = 1$ in Figure 6.3 for clarity. In each scale, we use the AlexNet [46] as the feature extractor, which takes an image as input and outputs a 4096-d feature vector. The 4096-d feature vector is extracted from the topmost fully connected layer of the AlexNet [46]. Given an input image, we extract its multi-scale features according to the following steps:

1. In scale $i$, we place a $2^i \times 2^i$ grid on the input image and generate $4^i$ cropped images.
2. We resize each of the $4^i$ cropped images in scale $i$ to a pre-defined image size $k \times k$.
3. Each of the $4^i$ cropped and resized images takes its turn (in the order of left to right, top to bottom) as the input of the AlexNet [46] in scale $i$. In other words, the AlexNet [46] in scale $i$ extracts features $4^i$ times, each time with one of the $4^i$ cropped and resized images as input. After that, we generate the feature vector in scale $i$ by concatenating those $4^i$ 4096-d feature vectors in order.
4. The final multi-scale feature vector is formed by concatenating the feature vectors of all $m + 1$ scales in order, and the dimension of the multi-scale feature vector is $\sum_{i=0}^{m} 4^i \times 4096 = \frac{4^{m+1} - 1}{3} \times 4096$.

For the AlexNet [46] in each scale, we use the Caffe [32] implementation except that the number of nodes in the output layer is set to the number of classes in each task listed in Table 4.1. For the two regression tasks MEM and INT in Table 4.1, we change the number of nodes in the output layer to 1 and replace the softmax loss with the Euclidean loss. When using the Caffe [32] implementation, we adopt its default training parameters for training the AlexNet [46] for ImageNet [11] classification unless otherwise specified. The pre-defined image size $k \times k$ is set to 256×256, following the Caffe [32] implementation.
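The four extraction steps above can be sketched in a few lines of Python. This is only an illustrative sketch: `grid_boxes`, `multiscale_dim`, and `extract_multiscale` are hypothetical names, the integer division used to place grid lines is an assumption, and the `extract` callback is a stand-in for resizing a crop to $k \times k$ and running the scale-specific AlexNet.

```python
def grid_boxes(width, height, scale):
    """Return the 4**scale crop boxes (left, top, right, bottom) of a
    2**scale x 2**scale grid, ordered left to right, top to bottom."""
    n = 2 ** scale
    boxes = []
    for row in range(n):
        for col in range(n):
            boxes.append((col * width // n, row * height // n,
                          (col + 1) * width // n, (row + 1) * height // n))
    return boxes


def multiscale_dim(m, per_crop_dim=4096):
    """Dimension of the concatenated feature vector over scales 0..m,
    i.e. (4**(m + 1) - 1) / 3 * per_crop_dim."""
    return sum(4 ** i for i in range(m + 1)) * per_crop_dim


def extract_multiscale(width, height, m, extract):
    """Concatenate per-crop features over scales 0..m in order.

    `extract(scale, box)` is a hypothetical callback standing in for
    "resize the crop to k x k and run the AlexNet of that scale"; it must
    return a list of floats.
    """
    feature = []
    for scale in range(m + 1):
        for box in grid_boxes(width, height, scale):
            feature.extend(extract(scale, box))
    return feature
```

With a dummy extractor that returns one value per crop, `extract_multiscale` yields 1 + 4 + 16 values for $m = 2$, which mirrors the geometric sum behind the dimension formula in step 4.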
The details of the training approach and how to make predictions are explained in Sec. 6.3.3.

6.3.3 Training Approach

Before training, we generate a training set $S_i$ for each scale $i$ from the original training set $S$ of the task of interest $T$, under the assumption that $T$ satisfies the LI property. We take the following steps:

1. We place a $2^i \times 2^i$ grid on each image of $S$, crop out $4^i$ sub-images, and resize each cropped image to $k \times k$.
2. Due to the LI property, each cropped and resized image is assigned the same label as that of the original image from which it is cropped.

After the above two steps, the generated training set $S_i$ consists of $4^i \times |S|$ labeled images ($|S|$ is the number of images in $S$), and the size of each image is $k \times k$. We follow the Caffe [32] implementation and use $k = 256$.

In the training phase, starting from the Caffe reference model [32] for ImageNet [11] classification (denoted as $M_{ImageNet}$), we train the AlexNet [46] in scale $i$ using the training set $S_i$. We follow the descriptions and settings of supervised pre-training and fine-tuning used in Agrawal's work [1], where pre-training with $M_D$ means using a data-rich auxiliary database $D$ to initialize the CNN parameters, and fine-tuning means that all the CNN parameters can be updated by continued training on the corresponding training set. For the AlexNet [46] in each scale $i$, we pre-train it with $M_{ImageNet}$ and fine-tune it with $S_i$. After training all the $m + 1$ AlexNets [46] in the $m + 1$ scales, we extract the $\frac{4^{m+1} - 1}{3} \times 4096$-d multi-scale feature vector of each training image in $S$ using the method described in Sec. 6.3.2. With these multi-scale feature vectors, we use a support vector machine (SVM) or support vector regression (SVR) to train a linear classifier or regressor supervised by the ground truth of the images in $S$. In practice, we use LIBSVM [8] with the cost (parameter C in SVM/SVR) set to the default value 1.
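As a rough illustration of how $S_i$ is built from $S$ under the LI assumption, the sketch below enumerates the crop boxes of every training image and copies the original label to each crop. The function name, the tuple layout of `samples`, and the integer grid arithmetic are assumptions made for this example; the resize to $k \times k$ is left out.

```python
def generate_scale_training_set(samples, scale):
    """Build the scale-`scale` training set from the original one.

    `samples` is a list of (image_id, width, height, label) tuples, a
    hypothetical layout chosen for this sketch. Each returned entry is
    (image_id, crop_box, label), where crop_box is
    (left, top, right, bottom) and the label is inherited unchanged.
    """
    n = 2 ** scale
    scaled = []
    for image_id, width, height, label in samples:
        for row in range(n):          # top to bottom
            for col in range(n):      # left to right
                box = (col * width // n, row * height // n,
                       (col + 1) * width // n, (row + 1) * height // n)
                scaled.append((image_id, box, label))
    return scaled
```

The returned list has $4^i \times |S|$ entries, and scale 0 reduces to the original training set with each box covering the whole image.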
We tried different C values and found that they produce similar accuracy, so we keep the default value.

In the testing phase, given a testing image, we generate its $\frac{4^{m+1} - 1}{3} \times 4096$-d multi-scale feature vector using the method described in Sec. 6.3.2. After that, we feed the feature vector of the testing image to the trained SVM classifier or SVR regressor, whose output is the prediction for the testing image.

Compared with the traditional CNN framework ($m = 0$), MSCNN has $m + 1$ times as many parameters to train. However, under the LI assumption, we can generate exponentially more training examples ($4^i \times |S|$ training examples in scale $i$) from the original training set $S$ without the need to explicitly collect new training examples. For any two training examples $I_i$ and $I_j$ from $S_i$ and $S_j$ respectively ($i < j$), the overlapping content cannot exceed a quarter of the area of $I_i$. Therefore, under the LI assumption, our training approach not only generates exponentially more training examples but also guarantees a certain diversity among the generated training examples.

In terms of the training time, since $S_i$ contains $4^i \times |S|$ training examples, the time needed to train the CNN in scale $i$ is $4^i \times t_0$, where $t_0$ is the training time of the traditional CNN framework. The training time of the linear SVM classifier or SVR regressor is far less than the training time of the AlexNet [46], so it is negligible in the total training time. If we train the $m + 1$ AlexNets [46] in the $m + 1$ different scales in parallel, the total training time of MSCNN is determined by the largest scale and equals $4^m \times t_0$.

When training CNN, the traditional method implemented by Caffe [32] resizes each training image to a pre-defined resolution because of the hardware limitation of the GPU. Due to the resizing, part of the information in the original high-resolution training image is lost.
In MSCNN, more details in the original high-resolution training image are preserved in scale $i$ ($i > 0$) because all the training examples in $S_i$ ($\forall i \in \{0, 1, \dots, m\}$) are resized to the same resolution. The extra information preserved in scale $i$ ($i > 0$) makes MSCNN outperform the traditional CNN framework on the tasks satisfying the LI property, which is shown in Sec. 6.4.

Our proposed MSCNN extracts the features in different scales with scale-dependent CNN parameters. In other words, the CNN parameters in scale $i$ are specifically learned from the training images in scale $i$ (the images in $S_i$). This training approach solves the issue of previous works [26, 29], which simply generate the multi-scale features from the same set of CNN parameters, which may not be the most suitable one for each scale. In addition, MSCNN extracts the features from every portion of the image in each scale, which solves the drawback of Lu's work [50], which does not fully utilize the information of the entire image in the scale of the fine-grained view. Applying our proposed MSCNN to the 8 tasks listed in Table 4.1, we report the performance in Sec. 6.4.

| task ID | evaluation metric | previous work | MSCNN-0 (baseline) [62] | MSCNN-1 | MSCNN-2 | MSCNN-3 |
| EMO | accuracy (%) | n/a | 36.23 | 37.35 | 37.47 | 37.72 |
| AST | accuracy (%) | 53.10 [41] | 55.15 | 58.11 | 57.91 | n/a |
| ART | accuracy (%) | 62.20 [41] | 67.37 | 69.67 | 70.96 | 67.74 |
| AVA | accuracy (%) | n/a | 69.42 | 71.09 | 72.30 | 72.75 |
| FAS | accuracy (%) | n/a | 76.96 | 79.78 | 79.35 | n/a |
| ARC | accuracy (%) | 69.17 / 46.21 [77] | 70.64 / 54.84 | 74.82 / 58.89 | 75.32 / 59.13 | n/a |
| MEM | ρ | n/a | 0.40 | 0.39 | n/a | n/a |
| INT | ρ | n/a | 0.57 | 0.60 | 0.62 | 0.63 |

Table 6.1: The summary of our experimental results. The bold numbers represent the best performance for each task. The results show that for most of the 8 tasks, our proposed multi-scale CNN features (MSCNN-1 to MSCNN-3) outperform not only the best known results from prior works but also the traditional CNN framework (MSCNN-0).

6.4 Experimental Results

Using the experimental setting specified in Sec. 6.3, we compare the performance of MSCNN with that of the traditional CNN framework ($m = 0$) and that of the state-of-the-art methods. We use the notation "MSCNN-m" to represent the MSCNN framework consisting of $m + 1$ AlexNets [46] in $m + 1$ different scales (scale 0 to scale $m$). MSCNN-0 represents the traditional CNN framework, which extracts the features in only one scale. The results are summarized in Table 6.1.

Table 6.1 shows that our proposed methods (MSCNN-1 to MSCNN-3) outperform both the current state-of-the-art method and the traditional CNN framework (MSCNN-0) on most of the 8 tasks. Since the tasks ART and AST satisfy the LI property, the result supports that MSCNN outperforms the traditional CNN framework on the tasks which are LI. MSCNN also outperforms the traditional CNN framework on the task ARC, which only partially satisfies the LI property.

Table 6.1 also shows that increasing the number of scales in MSCNN does not always improve the performance. This is reasonable because the training images in scale $i$ (the images in $S_i$) are less likely to contain useful features as $i$ increases. Therefore, the performance improves the most when we increase the number of scales in MSCNN up to a certain level. Although the dimension of the multi-scale CNN features increases with the number of scales in MSCNN (and hence, in theory, so does their descriptive power as an image descriptor), a high-dimensional feature vector is also more prone to overfitting. The results in Table 6.1 reflect the trade-off between the descriptive power of the feature vectors and their tendency to overfit.
We think that this trade-off and the database/task bias are the two major reasons why the best performance for each task is achieved by MSCNN with different numbers of scales. For most of the 8 tasks in our experiment, MSCNN-1, MSCNN-2, and MSCNN-3 (if applicable) outperform MSCNN-0, which supports that using multi-scale features is better than using the features in one scale.

Based on our experimental results, we show that MSCNN outperforms the traditional CNN framework on most of the 8 abstract tasks in Table 4.1. In addition, our results indicate that under the assumption that the LI property holds, there is a most suitable number of scales for MSCNN at which the performance gain is the largest.

6.5 Conclusion

We introduced the label-inheritable (LI) property and proposed a novel framework, multi-scale convolutional neural networks (MSCNN). We also proposed a training method for MSCNN under the LI assumption such that exponentially more training examples can be generated without the need to collect new training data. MSCNN not only solves the issues of the previous methods which use CNN to extract multi-scale features but also outperforms both the state-of-the-art methods and the traditional CNN framework on most of the 8 abstract tasks in our experiment. Our results support that MSCNN can outperform both the state-of-the-art method and the traditional CNN framework on the tasks satisfying (even partially) the LI property.

CHAPTER 7
FUSING MULTI-DEPTH, MULTI-SCALE, AND MULTI-TASK FEATURES

7.1 Introduction

In Chapter 4, Chapter 5, and Chapter 6, we present the efficacy of our proposed multi-task, multi-depth, and multi-scale CNN features respectively on the abstract tasks listed in Table 4.1, showing that each of our proposed features can outperform the state-of-the-art performance on the abstract tasks in our experiment.
In this chapter, our goal is to fuse our proposed multi-task, multi-depth, and multi-scale CNN features such that the performance can be further improved.

In most of the prior works on abstract tasks, including the references listed in Table 4.1, the common approach to fusing different features is a weighted sum of feature values. In practice, this approach is usually implemented by simple concatenation of different features followed by training a linear support vector machine (SVM), as can be observed in [31, 41]. Although this naive approach is easy to implement, our experimental results on the multi-task features in Chapter 4 suggest that simple concatenation may not be the best method to fuse different features.

Because our proposed multi-task, multi-depth, and multi-scale features are all based on CNN, we look for feature fusion methods that are also based on CNN. Among the previous works involving fusing different CNN-based features [40, 58, 75], one common method is to construct another fully connected network with different feature sources as the input. We adopt this approach to fuse our proposed CNN features. The details of the fully connected networks used for feature fusion and the experimental results are presented in Sec. 7.2 and Sec. 7.3 respectively.

Figure 7.1: The fully connected networks we use to fuse our proposed multi-task, multi-depth, and multi-scale CNN features.

7.2 Feature Fusion Using Fully Connected Networks

7.2.1 Network Architecture

Inspired by [40, 58, 75], we use the fully connected networks illustrated in Figure 7.1 to fuse our proposed multi-task, multi-depth, and multi-scale CNN features. The input layer is the concatenation of the three feature sources. Each of the two intermediate fully connected layers has 1024 nodes, a number set empirically to balance the descriptive power of the features against the risk of overfitting.
On top of the two intermediate fully connected layers, we add an output layer whose number of nodes equals the number of classes of the task of interest. We believe that using the fully connected networks to fuse different features is better than using simple concatenation because the former can use a higher-level intermediate feature representation to make predictions.

In this chapter, we experiment on 3 abstract tasks (AST, ART, and ARC) specified in Table 4.1. For the task ARC, we experiment on the setting of 25-way classification. In practice, we use the Caffe [32] implementation to train the fully connected networks in Figure 7.1 from scratch. When using the Caffe [32] implementation, we adopt its default training parameters for training the AlexNet [46] for ImageNet [11] classification unless otherwise specified.

7.2.2 Fused Features

Because the publicly available databases for the 3 abstract tasks in our experiment contain only a limited amount of data, we simplify our experimental setting such that the number of parameters which need to be trained is manageable. In practice, we reduce the number of parameters by limiting the source features we intend to fuse. In our experiment, we concatenate the following features as the input of the fully connected networks in Figure 7.1:

1. The default AlexNet [46] features extracted from its topmost fully connected layer (the CNN baseline). These features are the same as the features in scale 0 of the multi-scale features in Figure 6.3, or the features $f_0$ extracted from CNN$_0$ of the multi-depth features in Figure 5.1.
2. The features $f_1$ extracted from CNN$_1$ of the multi-depth features in Figure 5.1.
3. The features in scale 1 of the multi-scale features in Figure 6.3.
4. The multi-task features learned from the other 2 tasks. For example, if the task of interest is AST, the features extracted from the two AlexNets [46] trained with ART and ARC are the corresponding multi-task features.
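To make the architecture and the parameter budget concrete, the following minimal sketch builds the layer sizes of such a fusion network (concatenated input, two hidden layers, one output layer), counts its parameters, and runs a forward pass on toy inputs. The ReLU activation, the Gaussian initialization, and all function names are assumptions made for illustration; the thesis trains the real network with Caffe, and the toy dimensions below are far smaller than the actual feature sizes.

```python
import random


def layer_sizes(source_dims, n_classes, hidden=1024):
    """Input = concatenation of all sources, two hidden layers, one output."""
    return [sum(source_dims), hidden, hidden, n_classes]


def count_parameters(sizes):
    """Weights plus biases of consecutive fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def init_layers(sizes, seed=0):
    """Small random weights and zero biases (initialization is an assumption)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.gauss(0.0, 0.01) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(sources, layers):
    """Concatenate the feature sources and return raw class scores.

    ReLU on the hidden layers is an assumption; the text above does not
    name the activation function.
    """
    x = [value for source in sources for value in source]
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x
```

Because the first layer holds `sum(source_dims) * hidden` weights, `count_parameters` grows linearly with the total input dimension, which is one way to see why limiting the number of fused source features keeps the network trainable on small databases.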
We follow the methods introduced in Chapter 4, Chapter 5, and Chapter 6 to generate the multi-task, multi-depth, and multi-scale CNN features and concatenate them for each image. After that, we use the concatenated feature vectors of the training data to train the fully connected networks in Figure 7.1. At testing time, we feed the concatenated feature vector of each testing image to the fully connected networks in Figure 7.1, and the output is the predicted label. Applying the feature fusion method in Figure 7.1 to the 3 tasks, we report the performance in Sec. 7.3.

7.3 Experimental Results

We compare the performance of each of our proposed CNN features with that of the fused features in Table 7.1, where the numbers are reported in the format "accuracy (%) [confidence level (%)]." The confidence level represents the confidence of the feature outperforming the CNN baseline according to the binomial test. The italic numbers in Table 7.1 are the performance passing the 95% confidence level (statistically significant results).

| features \ task ID | AST | ART | ARC |
| CNN baseline | 55.15 | 67.37 | 54.84 |
| multi-depth CNN features [60] | 56.40 [87.45] | 69.21 [90.82] | 55.57 [82.99] |
| multi-scale CNN features [61] | 58.11 [99.64] | 70.96 [99.50] | 59.13 [∼100] |
| multi-task CNN features [62] | 57.51 [98.39] | 71.05 [99.59] | 55.38 [75.82] |
| fused CNN features | 59.47 [∼100] | 72.24 [99.98] | 59.33 [∼100] |

Table 7.1: The performance comparison between each of our proposed features and the fused features. The numbers are reported in the format "accuracy (%) [confidence level (%)]." The confidence level represents the confidence of the feature outperforming the CNN baseline according to the binomial test. The italic numbers are the performance exceeding the 95% confidence level (statistically significant results).
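The exact form of the binomial test is not spelled out here, so the sketch below shows only one plausible reading: treat the baseline accuracy as the success probability of a binomial null model over the test images, and report the probability that this null model yields fewer correct predictions than the evaluated feature does. The function names and this choice of null model are assumptions, and the sketch is not claimed to reproduce the confidence levels in Table 7.1.

```python
import math


def binomial_log_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p), computed in log space so
    that large n does not underflow."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))


def confidence_of_improvement(correct, total, baseline_accuracy):
    """P(X < correct) for X ~ Binomial(total, baseline_accuracy).

    A value above 0.95 would correspond to passing a 95% confidence
    level under this (assumed) one-sided test.
    """
    if not 0.0 < baseline_accuracy < 1.0:
        raise ValueError("baseline_accuracy must lie strictly between 0 and 1")
    return math.fsum(math.exp(binomial_log_pmf(k, total, baseline_accuracy))
                     for k in range(correct))
```

Under this reading, the confidence level depends on the size of the test set as well as on the accuracy gap, so the same gain in accuracy can be significant on one task and not on another.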
Out of the three proposed features, most of the results of the multi-scale and multi-task features are statistically significant, but the results of the multi-depth features do not pass the 95% confidence level. We think that the reason is that the amount of training data is important for the performance, as shown in [15, 82]. The multi-scale features are learned from the augmented training data generated by cropping the original training images. The multi-task features implicitly utilize the training data from other tasks by extracting the features learned from other tasks. However, the multi-depth features are learned from only the original training data without extra training images, which may be why the improvement brought by the multi-depth features is less significant than that brought by the multi-scale and multi-task features. Table 7.1 shows that the fused features outperform not only the CNN baseline but also each of our proposed features, and the performance improvement brought by the fused features is shown to be statistically significant.

7.4 Feature Fusion for Predicting Emotion Distributions

7.4.1 Motivation

Encouraged by the promising results of the feature fusion in Sec. 7.3 and the multi-task features in Chapter 4, we want to explore these ideas in predicting emotion distributions and predicting the emotion stimuli map (ESM), introduced in Chapter 2 and Chapter 3 respectively. Specifically, we want to know whether the features learned from predicting the ESM can be used to improve the performance of predicting emotion distributions. To verify this idea, we predict emotion distributions using our proposed combined networks illustrated in Figure 7.2, where only convolution, deconvolution, and fully connected layers are shown for clarity.

Figure 7.2: The illustration of our proposed combined networks for predicting emotion distributions. We only show convolution, deconvolution, and fully connected layers for clarity.
7.4.2 Proposed Combined Networks

Figure 7.2 illustrates our method in two steps. In the first step (Figure 7.2 left), we train two networks, N1 and N2, individually for predicting emotion distributions and the ESM respectively. N2 is trained with the fully convolutional networks (the FCNEL in Sec. 3.5) using the training set of the EmotionROI database [67] introduced in Chapter 3. N1 is trained with the AlexNet [46] architecture using the supervision of $S_{train}$, where $S_{train}$ is the training set of the Emotion6 database [66] introduced in Chapter 2. Similar to our prior work [66], we pre-train N1 with the Caffe reference model [32] trained for ImageNet classification [11] and fine-tune N1 with $S_{train}$. However, unlike the method CNN in our original work [66], which uses a softmax loss layer, we use a sigmoid cross entropy loss layer in N1 because, according to our user study [66], different emotions are not mutually exclusive. We also do not use a Euclidean loss layer for each emotion category, as the method CNNR in our original work [66] does, because we prefer training one model that directly predicts the 7-D emotion distribution evoked by the input image to training 7 models (each predicting the probability of one emotion category). The base learning rate used to train N1 is empirically set to $10^{-4}$.

In the second step (Figure 7.2 right), we form the architecture of the combined networks, N3, using those of N1 and N2. Specifically, we adopt the architecture from conv-1 to fc-7 in N1 and the architecture from conv-1 to conv-5 in N2, forming the left and right streams of N3, respectively. Because N2 does not have fc-6 and fc-7 layers, we add fc-6 and fc-7 to the right stream of N3 using the same architecture as their left-stream counterparts. This design makes N3 unbiased toward either stream in the architecture. The two streams are fed with the same input, and their outputs are concatenated by the concat layer.
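The loss choice for N1 described above can be illustrated with a minimal sketch that contrasts the two losses on a single example. The code below is plain Python rather than the Caffe layers used in the thesis, the function names are hypothetical, and the logits and targets are made-up values chosen only for illustration. It shows that the sigmoid cross entropy scores each emotion category independently, so several categories can receive high probability at the same time, whereas the softmax forces the categories to compete.

```python
import math


def softmax_cross_entropy(logits, targets):
    """Cross entropy after a softmax: the category probabilities sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, targets))


def sigmoid_cross_entropy(logits, targets):
    """Sum of independent per-category binary cross entropies.

    Uses the numerically stable form max(x, 0) - x * t + log(1 + exp(-|x|)),
    which equals -t * log(s) - (1 - t) * log(1 - s) with s = sigmoid(x).
    Targets may be soft values in [0, 1].
    """
    return sum(
        max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
        for x, t in zip(logits, targets)
    )


# Hypothetical example: two emotions are both strongly evoked by one image.
logits = [5.0, 5.0, -5.0]
targets = [1.0, 1.0, 0.0]
print(sigmoid_cross_entropy(logits, targets))  # small: both categories can be "on"
print(softmax_cross_entropy(logits, [0.5, 0.5, 0.0]))  # bounded below by log(2)
```

With the softmax, the best achievable loss for two equally evoked emotions is log 2 because the probability mass has to be shared, while the sigmoid formulation can drive the loss toward zero.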
The output layer of N3 is fully connected with the concat layer such that a 7-D vector is predicted. We also use the sigmoid cross entropy loss to train N3. When training N3, we pre-train the portions enclosed by the two rectangles in Figure 7.2 with the trained N1 and N2 such that the left stream and the right stream capture the features learned from predicting emotion distributions and the ESM respectively. After pre-training, we fix the parameters in the gray area marked in Figure 7.2 and fine-tune the parameters outside the gray area using $S_{train}$. This fine-tuning strategy, inspired by Jung et al. [37], lets the parameters in both streams update jointly. This strategy is also consistent with the feature fusion method introduced in this chapter, which adds fully connected layers on top of different feature sources. For other training details, we use the same setting for both N1 and N3 unless otherwise specified. In the testing phase, the 7-D vector output by N3 is normalized such that it forms a valid probability distribution representing the emotion distribution evoked by the input image.

7.4.3 Experimental Results

Using the same 1386/594 training/testing images and 4 evaluation metrics used in our prior work [66], we compare the performance of our combined networks with the best result reported in [66]. The 4 evaluation metrics, which measure the closeness of the predicted and ground truth emotion distributions, are KL-divergence (KLD), Bhattacharyya coefficient (BC), Chebyshev distance (CD), and earth mover’s distance (EMD). We report $\bar{M}$, the mean of M (M ∈ {KLD, BC, CD, EMD}), in Table 7.2. For BC, higher is better. For the other 3 metrics, lower is better. We also conduct a pairwise comparison between our combined networks and our prior method [66].
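To make the normalization step and the four evaluation metrics named above concrete, here is a minimal Python sketch. Several details are assumptions rather than facts taken from this chapter: the normalization is taken to be division by the sum, the KL-divergence is computed from the ground truth to the prediction with a small constant `EPS` guarding against zero bins, and the earth mover’s distance assumes a ground distance of 1 between any two distinct emotion categories (in which case it reduces to half the L1 distance). All function names are hypothetical.

```python
import math

EPS = 1e-10  # assumed guard against division by zero in empty bins


def normalize(scores):
    """Rescale non-negative scores so that they sum to 1 (assumed sum normalization)."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]


def kld(truth, pred):
    """KL(truth || pred); the direction is an assumption. Lower is better."""
    return sum(t * math.log(t / max(p, EPS)) for t, p in zip(truth, pred) if t > 0)


def bc(truth, pred):
    """Bhattacharyya coefficient. Higher is better."""
    return sum(math.sqrt(t * p) for t, p in zip(truth, pred))


def cd(truth, pred):
    """Chebyshev distance: the largest per-category difference. Lower is better."""
    return max(abs(t - p) for t, p in zip(truth, pred))


def emd(truth, pred):
    """Earth mover's distance under an assumed uniform ground distance of 1."""
    return 0.5 * sum(abs(t - p) for t, p in zip(truth, pred))


# Hypothetical 7-D network output and ground truth distribution.
pred = normalize([0.9, 0.1, 0.2, 0.6, 0.1, 0.3, 0.8])
truth = [0.3, 0.0, 0.1, 0.2, 0.0, 0.1, 0.3]
print(kld(truth, pred), bc(truth, pred), cd(truth, pred), emd(truth, pred))
```

Averaging each function over all testing images would give the per-metric means of the kind reported in Table 7.2.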
For each testing image, we compare the scores produced by our prior method [66] and our combined networks, and tabulate which method produced the distribution closer to the ground truth in Table 7.2. Our combined networks not only achieve better mean scores $\bar{M}$ but also are more likely to produce an emotion distribution closer to the ground truth, with ≥97.7% confidence according to the binomial test. Our result supports the hypothesis that using the features learned from predicting the ESM can improve the performance of predicting emotion distributions, which is consistent with the ideas of our proposed multi-task features and the method of fusing features.

method (mean score)               KLD     BC      CD      EMD
Peng’s work [66]                  0.480   0.847   0.265   0.503
combined networks (CN)            0.474   0.852   0.262   0.496

pairwise comparison metric        KLD     BC      CD      EMD
wins for CN                       333     330     321     323
wins for Peng’s method [66]       261     264     273     271
confidence (CN is better) (%)     99.8    99.7    97.7    98.5

Table 7.2: Performance comparison in predicting emotion distributions. Our combined networks predict more accurate emotion distributions with ≥97.7% confidence compared with Peng’s method [66].

CHAPTER 8
CONCLUSIONS AND FUTURE WORK

Abstract tasks, the tasks involving general ideas or qualities, are an important subject in computer vision. In this thesis, we study the abstract tasks in 8 different domains which are proposed in recent literature. The first two abstract tasks we study are predicting emotion distributions and predicting emotion stimuli maps. We propose both tasks to better model the diversity of emotions evoked in humans and the emotionally involved areas of images. For both tasks, we build the associated databases, Emotion6 and EmotionROI, and present the state-of-the-art results based on convolutional neural networks (CNN) and fully convolutional networks.
Inspired by the encouraging results in predicting emotion distributions and predicting emotion stimuli maps, we explore the possibility of using CNN to solve general abstract tasks. Experimenting on the abstract tasks in 8 different domains, we find that for all the abstract tasks, using CNN achieves better performance than the results reported in the previous works, which mostly rely on standard handcrafted features. This finding suggests that CNN is a promising approach to solving general abstract tasks.

Based on this finding, we design the multi-task, multi-depth, and multi-scale CNN features to achieve better benchmarks in abstract tasks. Our three proposed features improve the performance of abstract tasks from different aspects. Multi-task CNN features consist of the features learned from the training data of more than one task, achieving better image representation. Multi-depth features are learned by multiple CNNs with different neural network architectures, utilizing both high-level and low-level image features. Multi-scale features are learned to capture more detailed information from the augmented training data in different scales. The experimental results show that all our proposed CNN features outperform the traditional CNN framework.

To exploit the performance gain brought by all the proposed features simultaneously, we use additional fully connected networks to fuse them. We show that the fused features achieve better performance than each of our proposed features used alone. In addition, we revisit the two emotion-related tasks we propose and show that the features learned from predicting emotion stimuli maps can be beneficial to predicting emotion distributions, which supports the ideas of multi-task features and feature fusion.

8.1 Future Work: CNN Visualization for Abstract Tasks

Given the promising results of our proposed CNN-based features, I would like to study in the future what the CNN learns in abstract tasks.
One of the approaches is to visualize the features captured by the trained neural networks. To the best of my knowledge, visualizing the features learned by neural networks is still a relatively immature research area, and there is no consensus on the best method for feature visualization. Most of the previous works [55, 81] studying the visualization of neural networks focus on classical object or scene classification tasks. The feature visualization in abstract tasks is often ignored, which motivates me to research this subject. In recent works about the visualization of neural networks, Yosinski et al. [81] use the natural image prior in their method, preventing the pixels with extreme intensities from appearing in the generated visualization results.

Figure 8.1 (columns: (a) natural image prior [81], (b) center-biased [55], (c) max activation [54]): The visualization of the output node representing the class “high aesthetic value” in the task AVA in Table 4.1. The 4 images in (a)/(b) are the results generated by Yosinski’s [81]/Nguyen’s [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where all the 9 training images are correctly labeled by the trained CNN.

Yosinski’s method [81] tends to generate repetitive objects all over the visualized image when it is used to visualize the learned CNN output nodes in object classification tasks. To solve this issue, Nguyen et al. [55] introduce the center-biased regularization, encouraging the visualized image to contain only one object in the center. Both of these visualization methods [55, 81] use a random image as input and intend to form the relevant object(s) in the visualized image. In this section, I adopt these two state-of-the-art visualization methods [55, 81] to generate the preliminary results of feature visualization in abstract tasks.
Specifically, I revisit the CNN trained with the approach “pt ImageNet + ft” introduced in Table 4.2 in Chapter 4, and visualize one specific node of the output layer at a time. Figure 8.1 is the visualization of the output node representing the class “high aesthetic value” in the task AVA in Table 4.1. The 4 images in (a)/(b) are the results generated by Yosinski’s [81]/Nguyen’s [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where the surrounding green/red color code means that the image is correctly/incorrectly labeled by the trained CNN. Figure 8.1 (a) and (b) do not look like typical objects even though both methods [55, 81] are designed to form object(s), which indicates that more suitable visualization methods for abstract tasks are needed. We think that Figure 8.1 (a) and (b) seem to capture two common characteristics of images with high aesthetic values: satisfying the rule of thirds and having a shallow depth of field.

We also use the two visualization methods [55, 81] to visualize our proposed multi-task, multi-depth, and multi-scale features. Figures 8.2, 8.3, and 8.4 are displayed in a format similar to that of Figure 8.1. Figure 8.2 is the visualization of the output node representing the class “Rembrandt van Rijn” in the task AST in Table 4.1, showing that using the multi-task features (bottom row (c)) corrects the misclassification made by using self features only (top row (c)). Figure 8.3 visualizes the output node representing the class “Gothic architecture” in the task ARC in Table 4.1, showing that the multi-depth features (bottom row) capture the features learned from the two CNNs with different depths (top two rows). Figure 8.4 is the visualization of the output node representing the class “Russian Revival architecture” in the task ARC in Table 4.1, showing the critical features in two different scales.
Although Figures 8.2, 8.3, and 8.4 seem to be consistent with the performance improvement brought by our proposed multi-task, multi-depth, and multi-scale features, a detailed analysis with more advanced visualization methods is needed to determine what the CNN captures in abstract tasks, which is my next research direction.

Figure 8.2 (rows: features learned from artist style classification; features learned from artist style and artistic style classification. Columns: (a) natural image prior [81], (b) center-biased [55], (c) max activation [41]): The visualization of the multi-task features. The visualized output node represents the class “Rembrandt van Rijn” in the task AST in Table 4.1. The 4 images in (a)/(b) are the results generated by Yosinski’s [81]/Nguyen’s [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where the green/red color code means that the image is correctly/incorrectly labeled by the trained CNN.

Figure 8.3 (rows: features learned by CNN0 in Chapter 5; features learned by CNN1 in Chapter 5; features learned by jointly training CNN0 and CNN1 in Chapter 5. Columns: (a) natural image prior [81], (b) center-biased [55], (c) max activation [77]): The visualization of the multi-depth features. The visualized output node represents the class “Gothic architecture” in the task ARC in Table 4.1. The 4 images in (a)/(b) are the results generated by Yosinski’s [81]/Nguyen’s [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where the green/red color code means that the image is correctly/incorrectly labeled by the trained CNN.

Figure 8.4 (rows: features in scale 0 in Chapter 6; features in scale 1 in Chapter 6. Columns: (a) natural image prior [81], (b) center-biased [55], (c) max activation [77]): The visualization of the multi-scale features. The visualized output node represents the class “Russian Revival architecture” in the task ARC in Table 4.1.
The 4 images in (a)/(b) are the results generated by Yosinski’s [81]/Nguyen’s [55] method with 4 different random images as input. The top 9 training images activating this output node the most are shown in (c), where the green/red color code means that the image is correctly/incorrectly labeled by the trained CNN.

APPENDIX A
RELATED PUBLICATIONS

Conference Papers:
• Kuan-Chuan Peng, Amir Sadovnik, Andrew Gallagher, and Tsuhan Chen. “Where Do Emotions Come from? Predicting the Emotion Stimuli Map”, IEEE International Conference on Image Processing (ICIP), 2016.
• Kuan-Chuan Peng and Tsuhan Chen. “Toward Correlating and Solving Abstract Tasks Using Convolutional Neural Networks”, IEEE Winter Conference on Applications of Computer Vision (WACV), 2016.
• Kuan-Chuan Peng and Tsuhan Chen. “Cross-layer Features in Convolutional Neural Networks for Generic Classification Tasks”, IEEE International Conference on Image Processing (ICIP), 2015.
• Kuan-Chuan Peng and Tsuhan Chen. “A Framework of Extracting Multi-scale Features Using Multiple Convolutional Neural Networks”, IEEE International Conference on Multimedia and Expo (ICME), 2015.
• Kuan-Chuan Peng, Amir Sadovnik, Andrew Gallagher, and Tsuhan Chen. “A Mixed Bag of Emotions: Model, Predict, and Transfer Emotion Distributions”, IEEE Computer Vision and Pattern Recognition (CVPR), 2015.
• Kuan-Chuan Peng, Kolbeinn Karlsson, Tsuhan Chen, Dongqing Zhang, and Heather Yu. “A Framework of Changing Image Emotion Using Emotion Prediction”, IEEE International Conference on Image Processing (ICIP), 2014.
• Kuan-Chuan Peng and Tsuhan Chen. “Incorporating Cloud Distribution in Sky Representation”, IEEE International Conference on Computer Vision (ICCV), 2013.

Patent Application:
• Kuan-Chuan Peng, Heather Hong Yu, Dongqing Zhang, and Tsuhan Chen. “Emotion Modification for Image and Video Content”, US Patent Application 20150213331, July 30, 2015.

BIBLIOGRAPHY

[1] P. Agrawal, R. Girshick, and J. Malik.
Analyzing the performance of multilayer neural networks for object recognition. In European Conference on Computer Vision, pages 329–344, 2014.
[2] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189–2202, November 2012.
[3] Y. Bar, N. Levy, and L. Wolf. Classification of artistic styles using binarized features derived from a deep neural network, 2014.
[4] I. Ben-Shalom, N. Levy, L. Wolf, N. Dershowitz, A. Ben-Shalom, R. Shweka, Y. Choueka, T. Hazan, and Y. Bar. Congruency-based reranking. In Computer Vision and Pattern Recognition, pages 2107–2114, 2014.
[5] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001.
[6] M. M. Bradley and P. J. Lang. Measuring emotion: the self-assessment manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry, 25(1):49–59, 1994.
[7] B. Celikkale, A. Erdem, and E. Erdem. Visual attention-driven spatial pooling for image memorability. In Computer Vision and Pattern Recognition Workshop, pages 976–983, 2013.
[8] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[9] M.-M. Cheng, J. Warrell, W.-Y. Lin, S. Zheng, V. Vineet, and N. Crook. Efficient salient region detection with soft image abstraction. In International Conference on Computer Vision, pages 1529–1536. IEEE, 2013.
[10] E. S. Dan-Glauser and K. R. Scherer. The Geneva affective picture database (GAPED): a new 730-picture database focusing on valence and normative significance. Behavior Research Methods, 43(2):468–477, 2011.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li. ImageNet: A large-scale hierarchical image database.
In Computer Vision and Pattern Recognition, pages 248–255, 2009.
[12] deviantart. http://www.deviantart.com.
[13] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon. Collecting large, richly annotated facial-expression databases from movies. IEEE Multimedia, 19(3):34–41, 2012.
[14] Merriam-Webster Online: Dictionary and Thesaurus. http://www.merriam-webster.com/.
[15] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013.
[16] E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. McRorie, J.-C. Martin, L. Devillers, S. Abrilian, A. Batliner, N. Amir, and K. Karpouzis. The HUMAINE database: addressing the collection and annotation of naturalistic and induced emotional data. Lecture Notes in Computer Science, 4738:488–501, 2007.
[17] dpchallenge. http://www.dpchallenge.com/.
[18] P. Ekman, W. V. Friesen, and P. Ellsworth. What emotion categories or dimensions can observers judge from facial behavior? Emotion in the Human Face, pages 39–55, 1982.
[19] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88:303–338, 2010.
[20] Flickr. https://www.flickr.com/.
[21] J. R. J. Fontaine, K. R. Scherer, E. B. Roesch, and P. C. Ellsworth. The world of emotions is not two-dimensional. Psychological Science, 18(2):1050–1057, 2007.
[22] Y. Fu, T. M. Hospedales, T. Xiang, S. Gong, and Y. Yao. Interestingness prediction by robust learning to rank. In European Conference on Computer Vision, pages 488–503, 2014.
[23] M. Gendron and L. F. Barrett. Reconstructing the past: a century of ideas about emotion in psychology. Emotion Review, 1(4):316–339, 2009.
[24] A. Giusti, D. C. Ciresan, J. Masci, L. M. Gambardella, and J. Schmidhuber. Fast image scanning with deep max-pooling convolutional neural networks.
In International Conference on Image Processing, pages 4034–4038. IEEE, 2013.
[25] S. Goferman, L. Zelnik-Manor, and A. Tal. Context-aware saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(10):1915–1926, 2012.
[26] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In European Conference on Computer Vision, pages 392–407, 2014.
[27] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. 2007.
[28] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. V. Gool. The interestingness of images. In International Conference on Computer Vision, pages 1633–1640. IEEE, 2013.
[29] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361, 2014.
[30] P. Isola, J. Xiao, D. Parikh, A. Torralba, and A. Oliva. What makes a photograph memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1469–1482, 2014.
[31] P. Isola, J. Xiao, A. Torralba, and A. Oliva. What makes an image memorable? In Computer Vision and Pattern Recognition, pages 145–152, 2011.
[32] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[33] D. Joshi, R. Datta, Q.-T. Luong, E. Fedorovskaya, J. Z. Wang, J. Li, and J. Luo. Aesthetics and emotions in images: a computational perspective. IEEE Signal Processing Magazine, 28(5):94–115, 2011.
[34] B. Jou, S. Bhattacharya, and S.-F. Chang. Predicting viewer perceived emotions in animated GIFs. In International Conference on Multimedia, pages 213–216. ACM, 2014.
[35] B. Jou, T. Chen, N. Pappas, M. Redi, M. Topkara, and S.-F. Chang. Visual affect around the world: A large-scale multilingual visual sentiment ontology.
In International Conference on Multimedia, pages 159–168. ACM, 2015.
[36] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In International Conference on Computer Vision, pages 2106–2113, 2009.
[37] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim. Joint fine-tuning in deep neural networks for facial expression recognition. In International Conference on Computer Vision, pages 2983–2991. IEEE, 2015.
[38] L. Kang, P. Ye, Y. Li, and D. Doermann. A deep learning approach to document image quality assessment. In International Conference on Image Processing, pages 2570–2574. IEEE, 2014.
[39] S. Karayev, M. Trentacoste, H. Han, A. Agarwala, T. Darrell, A. Hertzmann, and H. Winnemoeller. Recognizing image style. In The British Machine Vision Conference, 2014.
[40] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F.-F. Li. Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[41] F. S. Khan, S. Beigpour, J. V. D. Weijer, and M. Felsberg. Painting-91: a large scale database for computational painting categorization. Machine Vision and Applications, 25:1385–1397, 2014.
[42] A. Khosla, J. Xiao, A. Torralba, and A. Oliva. Memorability of image regions. In The Conference and Workshop on Neural Information Processing Systems, pages 296–304, 2012.
[43] M. H. Kiapour, K. Yamaguchi, A. C. Berg, and T. L. Berg. Hipster wars: Discovering elements of fashion styles. In European Conference on Computer Vision, pages 472–488, 2014.
[44] J. Kim, S. Yoon, and V. Pavlovic. Relative spatial features for image memorability. In International Conference on Multimedia, pages 761–764. ACM, 2013.
[45] S. G. Koolagudi and K. S. Rao. Emotion recognition from speech: a review. International Journal of Speech Technology, 15(2):99–117, 2012.
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks.
In Advances in Neural Information Processing Systems 25, pages 1097–1105. 2012.
[47] P. J. Lang, M. M. Bradley, and B. N. Cuthbert. International affective picture system (IAPS): affective ratings of pictures and instruction manual. Tech. Rep. No. A-8. 2008.
[48] F.-F. Li, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In Computer Vision and Pattern Recognition Workshop, pages 178–186, 2004.
[49] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[50] X. Lu, Z. Lin, H. Jin, J. Yang, and J. Z. Wang. RAPID: rating pictorial aesthetics using deep learning. In International Conference on Multimedia. ACM, 2014.
[51] J. Luo, A. Singhal, S. P. Etz, and R. T. Gray. A computational approach to determination of main subject regions in photographic images. Image and Vision Computing, 22:227–241, 2004.
[52] J. Machajdik and A. Hanbury. Affective image classification using features inspired by psychology and art theory. In International Conference on Multimedia, pages 83–92. ACM, 2010.
[53] K. Matsumoto, K. Kita, and F. Ren. Emotion estimation of wakamono kotoba based on distance of word emotional vector. In the 7th International Conference on Natural Language Processing and Knowledge Engineering, pages 214–220, 2011.
[54] N. Murray, L. Marchesotti, and F. Perronnin. AVA: A large-scale database for aesthetic visual analysis. In Computer Vision and Pattern Recognition, pages 2408–2415, 2012.
[55] A. Nguyen, J. Yosinski, and J. Clune. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. CoRR, abs/1602.03616, 2016.
[56] A. Ortony and T. J. Turner. What’s basic about basic emotions? Psychological Review, 97(3):315–331, 1990.
[57] N. Otsu.
A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man and Cybernetics, 9(1):62–66, 1979.
[58] E. Park, X. Han, T. L. Berg, and A. C. Berg. Combining multiple sources of knowledge in deep CNNs for action recognition. In IEEE Winter Conference on Applications of Computer Vision, pages 1–8, 2016.
[59] K.-C. Peng and T. Chen. Incorporating cloud distribution in sky representation. In International Conference on Computer Vision, pages 2152–2159. IEEE, 2013.
[60] K.-C. Peng and T. Chen. Cross-layer features in convolutional neural networks for generic classification tasks. In IEEE International Conference on Image Processing, pages 3057–3061, 2015.
[61] K.-C. Peng and T. Chen. A framework of extracting multi-scale features using multiple convolutional neural networks. In IEEE International Conference on Multimedia and Expo, pages 1–6, 2015.
[62] K.-C. Peng and T. Chen. Toward correlating and solving abstract tasks using convolutional neural networks. In IEEE Winter Conference on Applications of Computer Vision, pages 1–9, 2016.
[63] K.-C. Peng, K. Karlsson, T. Chen, D.-Q. Zhang, and H. Yu. A framework of changing image emotion using emotion prediction. In IEEE International Conference on Image Processing, pages 4637–4641, 2014.
[64] K.-C. Peng, A. Sadovnik, A. Gallagher, and T. Chen. The Cornell Emotion6 Image Database. http://chenlab.ece.cornell.edu/people/kuanchuan/publications/Emotion6.zip.
[65] K.-C. Peng, A. Sadovnik, A. Gallagher, and T. Chen. The Cornell EmotionROI Image Database. http://chenlab.ece.cornell.edu/people/kuanchuan/publications/EmotionROI.zip.
[66] K.-C. Peng, A. Sadovnik, A. Gallagher, and T. Chen. A mixed bag of emotions: Model, predict, and transfer emotion distributions. In Computer Vision and Pattern Recognition, pages 860–868, 2015.
[67] K.-C. Peng, A. Sadovnik, A. Gallagher, and T. Chen. Where do emotions come from? Predicting the emotion stimuli map.
In IEEE International Conference on Image Processing, 2016.
[68] K.-C. Peng, H. H. Yu, D. Zhang, and T. Chen. Emotion modification for image and video content, 07 2015. US Patent Application 20150213331.
[69] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F.-F. Li. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252, 2015.
[70] J. A. Russell. A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161–1178, 1980.
[71] E. M. Schmidt and Y. E. Kim. Modeling musical emotion dynamics with conditional random fields. In the 12th International Society for Music Information Retrieval Conference, pages 777–782, 2011.
[72] M. Solli and R. Lenz. Emotion related structures in large image databases. In International Conference on Image and Video Retrieval, pages 398–405. ACM, 2010.
[73] X. Wang, J. Jia, J. Yin, and L. Cai. Interpretable aesthetic features for affective image classification. In International Conference on Image Processing, pages 3230–3234. IEEE, 2013.
[74] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[75] D. Wu, L. Pigou, P.-J. Kindermans, N. D.-H. Le, L. Shao, J. Dambre, and J.-M. Odobez. Deep dynamic neural networks for multimodal gesture segmentation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8):1583–1597, 2016.
[76] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In Computer Vision and Pattern Recognition, pages 3485–3492, 2010.
[77] Z. Xu, D. Tao, Y. Zhang, J. Wu, and A. C. Tsoi. Architectural style classification using multinomial latent logistic regression. In European Conference on Computer Vision, pages 600–615, 2014.
[78] C. Yang, L. Zhang, H.
Lu, X. Ruan, and M.-H. Yang. Saliency detection via graph-based manifold ranking. In Computer Vision and Pattern Recognition, pages 3166–3173, 2013.
[79] Y.-H. Yang and H. H. Chen. Machine recognition of music emotion: a review. ACM Transactions on Intelligent Systems and Technology, 3(3):1–30, 2012.
[80] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In The Conference and Workshop on Neural Information Processing Systems, pages 3320–3328, 2014.
[81] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. In International Conference on Machine Learning Workshop, 2015.
[82] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833, 2014.
[83] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In European Conference on Computer Vision, pages 834–849, 2014.