PerceptSent - Exploring Subjectivity in a Novel Dataset for Visual Sentiment Analysis

Visual sentiment analysis is a challenging problem. Many datasets and approaches have been designed to foster breakthroughs in this trending research topic. However, most works scrutinize only subsymbolic models through visual attributes of the evaluated images, paying less attention to the subjectivity of viewers' perceptions as a basis for neuro-symbolic systems. Aiming to fill this gap, we present PerceptSent, a novel dataset for visual sentiment analysis that spans 5,000 images shared by users on social networks. Besides the sentiment opinion (positive, slightly positive, neutral, slightly negative, negative) expressed by every evaluator about each image analyzed, the dataset contains evaluator metadata (age, gender, socioeconomic status, education, and psychological hints) as well as perceptions observed by the evaluator about the image, such as the presence of nature, violence, or lack of maintenance. Deep architectures and different problem formulations are explored using our dataset to combine visual and extra attributes (external knowledge) for automatic sentiment analysis. We show evidence that evaluators' perceptions, when correctly employed, are crucial in visual sentiment analysis, improving the F-score performance from 61% to above 97%. Although, at this point, we do not have automatic approaches to capture these perceptions, our results open up new investigation avenues.


INTRODUCTION
Sentiment analysis, or opinion mining [1], is of paramount importance for diverse tasks, including those related to understanding human behavior and supporting decision-making [2], [3], [4], [5]. Despite the significant advances and well-accepted solutions in the textual context [6], [7], [8], [9], a considerable portion of social media content relies on images. Thus, there is great motivation for extending those techniques to visual sentiment analysis. However, this topic is still a challenging problem [10], mainly due to the stronger inherent subjectivity present in this context, which might result in diverging opinions on the sentiment polarity of the same image.
Aiming to investigate the influence of users' subjectivity, a direction pointed out as promising by a previous study [13], we present, as one of our main contributions, PerceptSent, a novel dataset for visual sentiment analysis that contains 5,000 images shared by users on social networks. A pool of five evaluators labeled each image as positive, slightly positive, neutral, slightly negative, or negative. Besides the sentiment opinion for each image, the dataset contains metadata concerning each evaluator (age, gender, socioeconomic strata, education, and psychological hints), their perceptions, such as the presence of nature, violence, or lack of maintenance, as well as independent scene objects annotated in the images.
In the experiments, we consider different problem setups, using distinct numbers of target classes and voting decisions. This is important to comprehend several implications for the results, in particular for the baseline performance. As another contribution, we also explore deep neural networks to combine different types of information and knowledge, such as visual attributes, manually annotated independent scene objects, evaluators' metadata, and perceptions, all of them provided by our dataset. The obtained results indicate that external knowledge in the form of perceptions is a game changer in visual sentiment analysis, with the F-score increasing from 61% to above 97%. When fed with such external information, the network seems to learn to identify the sentiment that a specific profile of users would have about a particular image, rather than just one general sentiment per image, achieving the best results. This approach, however, depends on external information, like perceptions, whose automatic extraction is not a reality at this point. Despite that, from our perspective, the obtained results could inspire new lines of investigation to tackle the challenging problem of visual sentiment analysis. The proposed dataset and trained models are available on GitHub [17].
The remainder of our study is organized as follows. Section 2 presents the related work. Section 3 introduces the novel dataset. Section 4 presents and discusses the models and experiments performed. Section 5 explores new perspectives beyond subsymbolic models for sentiment analysis. Finally, Section 6 concludes the study.

RELATED WORK
This section describes some of the main datasets in the context of visual sentiment analysis. Further, we discuss some key algorithms for automatically performing visual sentiment analysis.

Datasets for Visual Sentiment Analysis
According to [18], a psychological model structures the methodological approach as it qualifies and quantifies emotions and feelings according to a particular objective. Categorical Emotion States (CES) propose the existence of a series of basic emotions and seek to fit each emotion into a well-defined category. In contrast, the Dimensional Emotion Space (DES) representation is based on quantitative measures using multidimensional scaling. Each model helps convey a specific characteristic of human emotion, and they suggest how the human mind represents and interprets each sentiment, with both models assessing a person's actual emotional state [19]. DES models are generally more advanced than CES in expressing complex emotions. The implementation of CES models, on the other hand, is more straightforward because categorizing sets of contents is a more objective task than collecting continuous data along multiple axes of a predetermined space.
In the last decades, academia and industry have proposed many image datasets useful for several tasks related to affective computing. Each one has its own characteristics and purpose. There are variations in size, sources, affective categories, data-gathering methods, and labeling procedures. Therefore, it is necessary to point out that the differences in the adopted dataset design components make result comparison difficult. Moreover, the research community has no agreement on a universal benchmark [20]. Lastly, it is not our purpose to catalog all the visual sentiment datasets available. Instead, we identified some of the most important datasets in the field, i.e., those commonly employed or mentioned by industry and academia, as summarized in Table 1.
All the datasets listed in Table 1 are based on CES or DES models, i.e., they only consider nominal and interval data types. Nonetheless, it is worth mentioning that an emerging approach suggests that ordinal data would be possible via rank-based annotation protocols [21], [22], [23]. However, since this is still an emerging theory, up to this point there are no data available, and the experiments published so far are restricted to videos, speech, and music [22]. Some of the identified datasets offer an interesting characteristic: ambiguity or uncertainty modeling features. This is the case for Emotion6, IESN, Flickr LDL, and Twitter LDL [11], [18], [24], [25], [26], [27], [28]. These datasets depart from most related efforts, which generally ignore the human observer's subjectivity. This is relevant because uncertainty remains in the target emotion even if it is measured from many different evaluations [27]. Moreover, the widespread practice of employing the same emotion label in the learning process produces noise in the form of an ambiguous learning target. For this reason, many authors propose an ambiguity or uncertainty prediction framework to give a more human-like and comprehensive prediction of emotions [27], [28]. In contrast to standard frameworks, which produce only one prediction, these authors propose an additional target to reflect the uncertainty of human perception based on some internal metric of the dataset (e.g., inter-rater disagreement).

Automatic Visual Sentiment Analysis
Most algorithms for automatic sentiment analysis focus on text analysis [42], [43], [44]. This includes more recent approaches, such as those focusing on the development of sentiment-specific word embeddings [45], [46], helping to encode more affective clues than usual word vectors, approaches using context-aware subsymbolic AI (i.e., statistical natural language processing) such as attention modeling [47], [48] and aspect modeling [46], [49], or approaches exploring ensembles of subsymbolic and symbolic AI (i.e., ontologies and lexica) [50]. Due to the considerable advances in textual sentiment analysis, it is an enticing choice to extract text from other media, such as video transcriptions, as in [51]. However, exploring a single textual source in different scenarios imposes a series of limitations, mainly when there is little content available from that source. For example, we can mention the limitation on platforms like Instagram, where most posts only have images, and there is little space for text-based approaches [10].
Many large-scale systems for understanding human behavior and decision-making rely on visual sentiment analysis. In [2], the possibility of large-scale emotional contagion in a virtual community is considered using Facebook. From the authors' perspective, it could happen naturally, without people being conscious of it, due to the transfer of emotions expressed on Facebook to other people. According to [4], in different cultures, positive and negative emotions expressed on Twitter can be associated with well-established diurnal and seasonal behavior patterns. In [3], from tweets posted by users diagnosed with clinical depression, the authors demonstrated a strong correlation between behavioral attributes (e.g., social engagement, emotion, language, and linguistic styles) and the risk of depression.
In addition to speech transcripts [52] and descriptors inspired by psychology and art theory [53], low-level visual attributes like colors [54], texture [55], and image gradients [56] still dominate algorithms for sentiment analysis in visual contexts, such as images and videos. Nevertheless, high-level abstraction for sentiment analysis has received more and more attention. In [57], the focus is on visual content attributes related, for example, to material (e.g., metal) and surface (e.g., rust), whereas in [58] visual concepts strongly related to sentiment are explored.
More recently, visual sentiment analysis has focused on researching new deep learning architectures, especially convolutional neural networks (CNNs) [59]. Aiming to reduce the impact of noise, [35] proposes a probabilistic sampling scheme for CNN training that removes instances with similar sentiment scores. According to [60], transfer learning from the weights of a CNN trained on ImageNet [61] can be used to mitigate the problem of data that contain only images with strong sentiment polarity. In [62], the happiness of people in natural environments is estimated via discriminative face features extracted by CNN architectures. To detect visual sentiments, [63] combines CNN visual features with sentiment concepts automatically identified from tags available in public images on the web. Aiming to increase the performance of sentiment classification systems, [64] adds visual attention to the image inputs of CNNs.
Video, image, text, and emotion-related symbols such as emoticons are common in many microblogs. Some approaches consider a joint representation of these multimodal inputs for sentiment analysis. [65] considers the correlation and independence between different modalities in its multimodal classification model for sentiment analysis in microblogs. In [66], CNN models extract features from textual and visual modalities. In addition to text and image, [16] incorporates a further modality in its CNN model to classify sentiment. Since emoticons are common in specific systems such as Twitter and Sina Weibo (a system similar to Twitter in China), [67], [68] explore them to solve some of the problems of the previous approaches, in particular the issue of scalability related to human-labeled sentiment data. Following a similar direction, [69] considers text and image (tweets) to deal with polarity in sentiment analysis. For sentiment analysis of textual data in social media applications, [70] uses classifier ensembles to group features extracted manually via classical models and automatically via deep models.
Focusing on objective versus subjective tasks, some studies compare deep and classic models with the results provided by humans. In [14], four deep learning models, trained to perform (objective) visual inference tasks, use an automatic approach to extract text from visual content. The results are compared with (subjective) user-supplied text used to label social images. An image sentiment classifier based on support vector machines estimates sentiment polarity, considering different combinations of subjective text, objective text, and visual resources. The work shows that the noise introduced by the subjective text affects the model's performance. On the other hand, better results are obtained using visual resources and objective text. Although some works have shown the importance of outdoor images for the semantic classification of urban areas, such as the study that extracts labels for external urban images [15], or the use of semantic descriptors pulled straight from the images by deep networks [13], there is still a lack of studies to understand the attributes with the strongest influence on performance in the outdoor context.
Our study innovates over previous ones in the following aspects. First, it offers a new dataset with evaluators' metadata (e.g., socioeconomic information) and evaluators' perception annotations for the sentiment associated with an image. It also provides extra data associated with every image, namely, a list of manually annotated independent scene objects. Second, we show evidence that considering users' perceptions, available in our dataset, is paramount in visual sentiment analysis. These results can potentially open up new lines of investigation regarding visual sentiment analysis.

PERCEPTSENT
Here we describe our first contribution: a new dataset containing not only images and their associated sentiments but also evaluators' metadata and their perceptions.

Image Collection and Sentiment Polarity
The PerceptSent dataset comprises 5,000 images, focusing primarily on outdoor scenes (92.7% of the total image samples), collected from Instagram, Flickr, and NYC311 from May to June of 2020 and filtered to remove duplicated/similar images. The dataset was then split into 500 tasks for human evaluation, with ten images each. The Amazon Mechanical Turk platform was used to classify 61% of the total tasks. The remaining tasks were classified in a similar process, but with students and professors from our university. All volunteers are proficient in English. This experiment was approved by our institution.
Five distinct evaluators analyzed each image, resulting in 25,000 evaluations. The sentiment polarity categories considered for each image are: positive, slightly positive, neutral, slightly negative, and negative. We adopted a CES methodology because we focus on subjectivity and context rather than the complexity of emotions. The slightly positive and slightly negative categories were included to allow some flexibility in the emotional categorization.
Fig. 1 shows the sentiment polarity distribution for all sentiment categories according to the evaluators' classifications.
As can be seen, the distribution per category is nearly balanced.

Evaluator's Metadata
In an attempt to analyze whether socioeconomic variables and approximated personality (through personality hints) could influence the sentiment polarity attributed to an image, we gathered every evaluator's profile, according to the items shown in Table 2.
Fig. 2 shows the distributions of sentiment polarities when considering basic metadata information from evaluators.
Data are normalized within each group to facilitate the identification of trending patterns. The percentage scores indicate the representation of each category in the total population of the PerceptSent dataset. Fig. 2a shows that both genders are well represented among the evaluators, with a slight predominance of men (50.4% of the total). The normalized plot shows a slight tendency of women to evaluate more images as positive and negative than men, who tend to provide more neutral evaluations. In terms of age, as shown in Fig. 2b, we observed that older people tend to evaluate more images as negative than younger ones (age range 18-25). Looking at the socioeconomic strata (Fig. 2c), although only 2.08% consider themselves to belong to the wealthiest social strata (52 people out of 2,500), it is possible to note that they provided more negative evaluations compared to the other groups. This result has to be taken with caution because it could be linked to the type of images they evaluated and to the effect of (mis)perceptions of inequality [71]. There is a positive correlation between education level and the tendency to evaluate more images as slightly negative and negative; see Fig. 2d.
Fig. 3 shows the distributions of sentiment polarities when considering personality hints. Data are normalized within each group to facilitate the identification of trending patterns. The percentage scores indicate the representation of each category in the total population of the PerceptSent dataset. We may assume that the evaluators are mostly optimistic, since the majority see themselves as optimistic people (61.20%), and they disagree that there are only very few good things in life (61.92%). Also, most evaluators claim to always see the bright side of things (63.16%). It is interesting to note that evaluators who totally agree with the statements "I am a person who sees life in an optimistic way" and "I always see a bright side of things" tend to evaluate more images as negative.
All the analyses performed in this section illustrate the difficulty of visual sentiment analysis, which personal characteristics can strongly influence.

Evaluator's Perceptions
The visual scene context is usually complex, since the presence of any relevant element, even in the background, can completely change the sentiment polarity. Therefore, the sentiment polarity alone might not justify why humans classify the same scene with sentiments as diverse as positive and negative. To understand the perceptions associated with each sentiment, we asked the evaluators to select one or more perception categories, as shown in Table 3, or to write the main reason for a sentiment polarity in a text box named "Other." Fig. 4 shows the word clouds with the most frequently selected perceptions for each sentiment polarity.

Agreement on Evaluations
The inherent subjectivity in visual sentiment analysis can make an image be classified with distinct polarities. Let $s_g = (s_1, \ldots, s_n)$ be the grouped sentiments vector for a particular image, where $s_c$ is the number of votes given to the sentiment in category $c$ by all the $E$ evaluators per image; therefore, $\sum_{c=1}^{n} s_c = E$. To build the PerceptSent dataset, we considered five ($n = 5$) categories of sentiments and five ($E = 5$) evaluators for each image. Then $\sum_{c=1}^{5} s_c = 5$, with $s_1$ defined as the total number of votes for a positive sentiment, $s_2$ for a slightly positive sentiment, $s_3$ for a neutral sentiment, $s_4$ for a slightly negative sentiment, and $s_5$ for a negative sentiment. For example, Fig. 5 shows a scene from the proposed dataset that was classified by the evaluator pool as positive ($s_1 = 3$ votes), slightly positive ($s_2 = 1$ vote), and slightly negative ($s_4 = 1$ vote); thus, $s_g = (3, 1, 0, 1, 0)$.
From the vector $s_g$ of grouped sentiments associated with an image $i$, we derive an image agreement metric defined as

$$I_A(i) = \frac{M(i) - m(i)}{E}, \quad (1)$$

where $M(i) = \max_c(s_g)$ and $m(i) = \min_c(s_g)$, $c = 1, \ldots, n$, are the maximum and minimum values among all elements of the vector $s_g$. Observe that $I_A(i) = 1$ means that all evaluators of image $i$ had the same sentiment (unanimity), whereas $I_A(i) = 0$ means that no evaluator had the same sentiment, with its corresponding $s_g = (1, 1, 1, 1, 1)$ expressing absolute disagreement.
A dominant sentiment or category in the grouped sentiments vector $s_g$ can therefore be defined as

$$D(i) = \arg\max_c(s_g), \quad \text{if } M(i) \geq \tau, \quad (2)$$

where $\tau$ is a threshold used for filtering divergent opinions. $I_A(i)$ is interesting because it could be used to assess the impact of convergent/divergent opinions on the automatic classification of sentiments. For instance, during the training of deep learning architectures, we could create a subset from PerceptSent by filtering out all images with $I_A(i) < 0.8$. That is, we would keep only images with a high evaluator consensus (at least four out of five votes for the same sentiment) and check the impact of raising/lowering such a threshold on the model's performance.
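To make the two definitions above concrete, the following is a minimal Python sketch of Equations (1) and (2); the code is ours (not part of the released PerceptSent material), and the function names are illustrative.

```python
# Minimal sketch of the agreement metric I_A (Eq. 1) and the dominant
# sentiment filter (Eq. 2); names and structure are illustrative only.
import numpy as np

E = 5  # evaluators per image in PerceptSent

def image_agreement(s_g: np.ndarray) -> float:
    """I_A(i): 1.0 for unanimity, 0.0 for absolute disagreement (1,1,1,1,1)."""
    return (s_g.max() - s_g.min()) / E

def dominant_sentiment(s_g: np.ndarray, tau: int = 3):
    """Return the dominant category index if it gathers at least tau votes,
    or None when opinions are too divergent (Eq. 2)."""
    return int(s_g.argmax()) if s_g.max() >= tau else None

s_g = np.array([3, 1, 0, 1, 0])        # the example of Fig. 5
print(image_agreement(s_g))            # 0.6
print(dominant_sentiment(s_g, tau=3))  # 0, i.e., positive
```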
In addition to image agreement, in the present work we also consider two well-known statistical metrics, Fleiss's Kappa [72] and Cronbach's Alpha [73], measuring the consensus and consistency of the entire PerceptSent dataset, respectively. Fleiss's Kappa is an inter-rater reliability coefficient, which quantifies the extent of agreement in classification by evaluators over that which would be expected by chance. It expresses the consensus among evaluators numerically. Cronbach's Alpha measures the reliability, or internal consistency, expressing how closely related a set of items is as a group. For instance, even if a group of evaluators casts votes differently (poor consensus), their differences might always be the same (strong consistency).
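Both dataset-level metrics can be computed with standard tools. A sketch follows, assuming votes are stored as a 5,000 x 5 matrix of per-image category counts (for Fleiss's Kappa) and as a 5,000 x 5 matrix of ordinal ratings, one column per evaluator slot (for Cronbach's Alpha); the file names are hypothetical.

```python
# Sketch of the dataset-level reliability metrics; input files are hypothetical.
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

counts = np.load("vote_counts.npy")      # shape (5000, 5); each row sums to 5
kappa = fleiss_kappa(counts, method="fleiss")

def cronbach_alpha(ratings: np.ndarray) -> float:
    """ratings: images x evaluator slots, with ordinal codes 1..5."""
    k = ratings.shape[1]
    item_var = ratings.var(axis=0, ddof=1).sum()   # sum of per-slot variances
    total_var = ratings.sum(axis=1).var(ddof=1)    # variance of summed scores
    return k / (k - 1) * (1 - item_var / total_var)

alpha = cronbach_alpha(np.load("ratings.npy"))     # shape (5000, 5)
```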
Table 4 shows the guidelines for interpreting the strength of consensus, consistency, and disagreement level for each addressed metric. The table also shows the results for the reference metrics considering the entire PerceptSent dataset, i.e., the totality of 5,000 images.
As can be seen in Table 4, the proposed dataset has an average agreement of $I_A = 0.62$, meaning that most images reach a consensus on the dominant sentiment: as we have five votes per image, typically three out of five evaluators selected the same sentiment polarity for a given image. We can also assume that the entire dataset has a fair consensus and good consistency, considering Fleiss's Kappa and Cronbach's Alpha, respectively, indicating that the proposed metric ($I_A$) is a compromise between the other two.

Standard Deep Architecture and Settings
As discussed in Section 2.2, deep learning neural networks, especially CNNs, have promoted major advances in visual sentiment analysis. Previous studies comparing deep networks for visual sentiment analysis in the context of social media, particularly those based on transfer learning [77], point out ResNet [78] as the state-of-the-art architecture [13]. In most successful applications, the idea of residual connections has proven to be powerful in creating deeper and more robust models capable of addressing the vanishing gradient problem and accelerating the training process.
Since our focus is not on comparing deep models, we use ResNet as the standard model in all experiments. In the present work, pre-trained ImageNet weights serve as the starting point for fine-tuning the ResNet model. We use the Adam optimizer with an initial learning rate of 1e-4, 20 epochs, and a dropout rate of 20% for all tests.
To mitigate the problem of class imbalance during the training phase, we adopt a weighted optimization in which the relevance of each sample is inversely proportional to the number of samples in its class. In other words, we assign a high weight to a minority class and a low weight otherwise. Moreover, we use 5-fold cross-validation. Therefore, the dataset is first partitioned into $k = 5$ equally sized folds (subsets), and the model runs on each fold for either training/validating (80%) or testing (20%). The hardware setup is an Intel i7-10700 2.9 GHz processor with 128 GB RAM and an NVIDIA GeForce RTX 3090 GPU.
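For concreteness, the following PyTorch sketch summarizes this configuration under our own assumptions: a ResNet-50 backbone (the text does not fix the ResNet depth), inverse-frequency class weights, a simplified classification head, and hypothetical input files.

```python
# Sketch of the training setup: ImageNet-pretrained ResNet fine-tuned with
# Adam (lr 1e-4), 20% dropout, class-weighted loss, 5-fold cross-validation.
import numpy as np
import torch
import torch.nn as nn
from torchvision import models
from sklearn.model_selection import StratifiedKFold

labels = np.load("targets.npy")                 # hypothetical target labels
num_classes = len(np.unique(labels))
freq = np.bincount(labels)                      # inverse-frequency weights
weights = torch.tensor(freq.sum() / (len(freq) * freq), dtype=torch.float32)

for train_idx, test_idx in StratifiedKFold(5, shuffle=True).split(labels, labels):
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Sequential(                   # simplified head for the sketch
        nn.Dropout(0.2), nn.Linear(model.fc.in_features, num_classes))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss(weight=weights)
    # ... fine-tune for 20 epochs on train_idx, evaluate on test_idx ...
```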

Train and Prediction Using Only Images (Baseline)
As shown in Fig. 6, we consider, as a baseline, a model that receives only images as input and is composed of three dense layers encompassing 2,048, 1,024, and 24 neurons. Fig. 6 shows that training the ResNet model involves extracting the feature vector $f_i = (f_{i1}, \ldots, f_{iF})$ from the input image $i$. Vector $f_i$ is further used in the pipeline as the input to the dense layers. From a supervised learning perspective, training demands the definition of target labels, which in this work are derived from the five original sentiment polarities (see Section 3 for more details on the human evaluation process used to build the dataset). Therefore, before training, we must define which classification problem is expected to be solved.
TABLE 4
Interpretation Guidelines for the Strength of Consensus Using Fleiss's Kappa (From Altman [74], Adapted by Landis and Koch [75]), the Strength of Consistency Using Cronbach's Alpha (From Taber [76]), and the Agreement Index Using the Proposed $I_A$ Metric

In the baseline model experiments, we formulate the sentiment analysis problem considering three variations for the number of output target classes ($C$); see Fig. 6. This way, it is possible to evaluate the challenge posed by the different levels of separability required to classify the sentiment polarity. They range from the most difficult problem ($C = 5$ target classes) to the easiest ones ($C = 2$ target classes). In each problem, we consider a particular scheme to group the dataset samples into different target classes. For $C = 5$, we have the $P_5$ problem formulation encompassing all five sentiment polarities as originally chosen by the human evaluators, namely, positive, slightly positive, neutral, slightly negative, and negative. For $C = 3$, i.e., the $P_3$ problem formulation, we need to group polarities; thus, we group the slightly positive samples with the positive ones, and the negative class absorbs the slightly negative samples. For $C = 2$ (a binary classification problem), we have two different problem formulations: $P_{2+}$, resulting from joining the neutral samples with the positive and slightly positive ones, and $P_{2-}$, where the neutral samples are grouped with the negative and slightly negative ones. Through the different formulations, we aim to assess the impact of each class, and particularly the neutral class, on automatic sentiment analysis. Some authors consider neutral pictures to be the ones displaying only static objects [31], and others as scenes depicting simple objects or commercial products [41]. There is no clear definition of a neutral image in these cases, which poses a challenge in gathering and labeling data.
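The four formulations reduce to simple label maps over the original five polarities, as in the following sketch (the index convention is ours):

```python
# Label maps for the problem formulations; polarity indices are
# 0 = positive, 1 = slightly positive, 2 = neutral,
# 3 = slightly negative, 4 = negative.
P5       = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}  # all five classes kept
P3       = {0: 0, 1: 0, 2: 1, 3: 2, 4: 2}  # pos absorbs s-pos; neg absorbs s-neg
P2_PLUS  = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}  # neutral joins the positive class
P2_MINUS = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}  # neutral joins the negative class

def regroup(polarity: int, mapping: dict) -> int:
    return mapping[polarity]
```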
To train and test the baseline in a classification context, we need to assign a target label (also known as ground truth) to each dataset sample (only images in this case). This is a particularly challenging task in the PerceptSent dataset, since some samples do not have a well-defined category, e.g., an image with one vote for each sentiment polarity. Multi-label techniques could be used in this case, but investigating such techniques is out of the scope of the present paper. Instead, we use a technique to filter out those samples based on the dominant sentiment, as defined in Equation (2).
Table 5 shows three different manners to set the target label based on $\tau$: simple dominance for $\tau = 3$, qualified dominance for $\tau = 4$, and absolute dominance for $\tau = 5$.
The threshold defines the minimum number of votes for a specific target label in the pool of five evaluators addressed in the PerceptSent dataset. Table 5 summarizes the four different problem formulations. Recall that $P_5$ regards the five-class target problem; $P_3$ the three-class target problem; and $P_{2+}$ and $P_{2-}$ the binary target problems, where the symbol '+' means shifting the neutral polarity to the class of positive samples, and '-' means shifting the neutral polarity to the class of negative samples. Table 5 also shows the number of images (subset sizes) resulting from every tuple $\langle \tau, C \rangle$ of parameter combinations used in each problem setup. Fig. 7 shows the performance results of the ResNet deep model using only images (baseline). The model is trained to solve twelve different problem setups $\langle \tau, C \rangle$, i.e., all the combinations of the parameter $\tau$ in the filtering technique and $C$ in the problem formulation, as shown in Table 5. The change in performance (F-score) when the problem becomes easier, i.e., when we move from the $P_5$ problem formulation to the $P_{2-}$ problem formulation, is shown for all three addressed $\tau$ values.
Based on statistical tests (one-way ANOVA and Welch's ANOVA, followed by their post-hoc tests, all with a significance level of 0.05), we performed multiple comparisons in two different ways: (i) comparing the different problems for the same value of $\tau$ and (ii) comparing the different values of $\tau$ for the same problem.
As expected, the results for problem formulation $P_5$ are the worst for all dominance voting types. We notice from the analysis of type (i) that results improve significantly from the $P_5$ to the $P_3$ formulation. Also, the results improve from the $P_3$ to the $P_{2+}$ formulation, except for $\tau = 3$, where the difference is not statistically significant. The baseline performs best on the binary classification problems, since they pose a much easier task. In the case of $\tau = 3$, the baseline performance on $P_{2-}$ is better than that on $P_{2+}$. A possible explanation is the difficulty of separating slightly negative samples from neutral ones due to their mutual characteristics as perceived by the ResNet, particularly when the dominance is less restrictive.
According to the comparisons of type (ii), we notice that in problem $P_5$ there is no statistically significant difference in performance, indicating it is insensitive to the different $\tau$ values. The opposite occurs in $P_{2-}$, where the value of $\tau$ matters. In problems $P_3$ and $P_{2+}$, no improvement can be observed from $\tau = 4$ to $\tau = 5$. Therefore, reducing the dataset size due to the stricter image filtering at $\tau = 5$ cannot be justified in this case.

Influence of External Knowledge Representation
According to [79] "semantic content has a great impact on the emotional influence of a picture," while [80] states that "accurately analyzing what is contained in an image can significantly improve the performance of affective image content analysis."Also, [23] points out that several studies have noticed  poor inter-evaluator sentiment agreement, mainly due to differences in perception, emotional bias, and contextual information.In addition, [13] points to a possible promising direction in addressing users' perception in visual sentiment analysis.Based on that, experiments conducted in this section aim to assess the impact of the presence or absence of EK as an input for the neural model, as well as its representation given by evaluator's metadata and perceptions.
To this end, we analyze the ResNet performance when using grouped EK, single EK, and both. Additionally, we also analyze the performance regarding the dominant sentiment for a given image (with the dominance concept defined in Eq. (2)) and the individual sentiment (when we consider each evaluator separately). Fig. 8 shows four distinct network architectures proposed to explore those scenarios. All experiments consider the $P_3$ problem formulation, i.e., a total of $C = 3$ target categories of sentiments, and $\tau = 3$, i.e., simple dominance. Due to the different ways to consider the sentiment (dominant or individual), part of the models addresses a total of 4,506 samples, and the remainder addresses 25,000 samples, since each of the 5,000 images from the dataset has five individual human evaluations associated with it. Once an image is used for training, none of its target data (five evaluations) is present in testing; that is, during testing, the network analyzes each image for the first time. Notice that all the experiments consider the feature vector $f_i = (f_{i1}, \ldots, f_{iF})$, extracted by the ResNet from the input image $i$, as the standard input to the dense layer. In all models, data fusion occurs at a latent layer before the dense one, which learns the best way to combine the features extracted by the deep architecture with the extra information addressed by each model.
The main difference between the models lies in the type of extra information that can be considered in addition to the image to be analyzed. The first model, depicted in Fig. 8a, considers a near-optimum list of independent scene objects. As shown in Figs. 8b, 8c, and 8d, the remaining models consider distinct EK representations. Particularly, they consider the perceptions about the input image and the evaluator's metadata.
In the first model, $D[f_i o_i]$, depicted in Fig. 8a, there is no EK information, and an object descriptor is considered as input for the dense layer. As a proof of concept, this descriptor is currently formed by a near-optimum list of objects present in the scene and manually annotated. However, any robust object detector could be used for this task (e.g., YOLO [81]).
In the experiment performed using the model $D[f_i p_g m_g]$, depicted in Fig. 8b, EK regarding grouped information (perceptions and metadata) substitutes the subsymbolic information of the previous model.
The neural network performance can degenerate due to poor inter-evaluator sentiment agreement. As shown in Fig. 8c, model $D[f_i p_e m_e]$ considers an individual sentiment for a particular evaluator, so defining the target class is straightforward. In this approach, there is no data interchange between evaluators regarding sentiment categories, metadata, or perceptions. The EK information represented by tensors $p_e$ and $m_e$ is a piece of personalized and very accurate data representing the same information provided by the evaluators when they fill in the forms.
Assuming that the single evaluator's perceptions employed in the previous experiment might not be available, we consider a last experiment. As shown in Fig. 8d, model $D[f_i p_g m_e]$ groups only the perceptions.
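As an illustration of how these models fuse EK with the deep features at a latent layer before the dense layers, below is a minimal PyTorch sketch of the grouped variant $D[f_i p_g m_g]$; the latent size, head structure, and ResNet depth are our assumptions, not the exact hyperparameters of the experiments.

```python
# Sketch of an EK fusion model in the spirit of Fig. 8b: ResNet features f_i
# are concatenated with the grouped perception (p_g) and metadata (m_g)
# tensors in a latent layer placed before the dense classification layers.
import torch
import torch.nn as nn
from torchvision import models

class GroupedEKModel(nn.Module):
    def __init__(self, num_classes=3, p_dim=37, m_dim=34, latent_dim=256):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # f_i, 2048-d
        self.fusion = nn.Linear(2048 + p_dim + m_dim, latent_dim)       # latent fusion
        self.dense = nn.Sequential(nn.ReLU(), nn.Dropout(0.2),
                                   nn.Linear(latent_dim, num_classes))

    def forward(self, image, p_g, m_g):
        f_i = self.features(image).flatten(1)            # deep visual features
        z = self.fusion(torch.cat([f_i, p_g, m_g], 1))   # combine image and EK
        return self.dense(z)
```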
Fig. 9 shows some results for the three-class sentiment polarity problem provided by the approach $D[f_i p_g m_g]$ depicted in Fig. 8b. Notice that relations between the different perceptions can induce distinct sentiments for each evaluator, justifying the necessity of adding EK to improve performance. Fig. 10 shows the performance of all architectures depicted in Fig. 8, which use either independent scene objects or EK information. It also shows the performance of the baseline (Fig. 6), which uses only images as input. According to statistical tests (t-test or Mann-Whitney U test, both with a significance level of 0.05), except for the independent scene objects model ($D[f_i o_i]$), all the differences from the baseline performance are statistically significant. Therefore, we can conclude that, in the performed experiments, all EK-based proposals outperform the baseline approach. Although not shown in the figure, it is worth mentioning an architecture, namely $D[f_i m_g]$, with partial EK, i.e., grouped metadata and no perceptions, which achieved an F-score of only 57% (4 percentage points lower than the baseline). This highlights the importance of the perception information. It is also important to point out that an image ensemble [82] with three distinct architectures for visual feature extraction (ResNet, DenseNet, and InceptionV3) did not achieve a statistically significant F-score improvement over the baseline architecture.
In what follows, we describe how to compute $o$, i.e., the vector of independent scene objects, as well as $p$ and $m$, i.e., the perception and metadata tensors, respectively, for single and grouped analyses. In the case of extra scene object information, the vector $o_i = (o_{i1}, \ldots, o_{i143})$ indicates the presence or absence, in the analyzed image $i$, of a particular scene object, such as those listed in Table 6.
To incorporate extra scene object information, we used the graphical image annotation tool LabelImg and identified a total of 143 unique objects, as exemplified in Fig. 11.
Aiming to incorporate the perceptions as an EK input for the ResNet architecture, model $D[f_i p_e m_e]$, depicted in Fig. 8c, uses a single evaluator's perceptions. In the experiments, they are encoded as a rank-1 tensor given by $p_e = (p_{e1}, \ldots, p_{e37})$, encompassing the 37 most relevant perceptions shown in Table 3. Two models, Figs. 8b and 8d, use grouped evaluators' perceptions $p_g = (p_{g1}, \ldots, p_{g37})$, obtained by summing up the individual information of all the $E$ evaluators in the pool associated with the analyzed image, i.e., $p_{gj} = \sum_{e=1}^{E} p_{ej}$, $j = 1, \ldots, 37$. To incorporate metadata, as shown in Figs. 8b, 8c, and 8d, we encode the evaluator's individual information as a rank-1 tensor given by $m_e = (m_{e1}, \ldots, m_{e34})$. Each element represents the selection (1) or not (0) of the corresponding option in the right column of Table 2. In the experiment considering grouped EK (see Fig. 8b), each element of the grouped metadata is derived from the individual information as $m_{gj} = \sum_{e=1}^{E} m_{ej}$, $j = 1, \ldots, 34$, to form the vector $m_g = (m_{g1}, \ldots, m_{g34})$.
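These encodings amount to binary indicator vectors per evaluator, summed over the pool for the grouped variants; a short sketch (with toy selections) follows.

```python
# Sketch of the EK encodings: rank-1 binary tensors per evaluator (p_e, m_e)
# and their grouped sums over the E evaluators (p_g, m_g).
import numpy as np

E, P_DIM, M_DIM = 5, 37, 34

def encode_single(selected_indices, dim):
    v = np.zeros(dim)
    v[list(selected_indices)] = 1.0   # 1 where the option/perception was chosen
    return v

def encode_grouped(per_evaluator):
    return np.sum(per_evaluator, axis=0)   # p_gj = sum_e p_ej (same for m)

p_e_all = [encode_single({0, 5}, P_DIM) for _ in range(E)]  # toy selections
p_g = encode_grouped(p_e_all)  # entries count evaluators choosing each perception
```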
In this section, we have shown that, starting from the subsymbolic deep neural models (the baseline shown in Fig. 6 and the model depicted in Fig. 8a), we not only improved the performance of sentiment analysis with the models depicted in Figs. 8b, 8c, and 8d, but also indicated a transition toward neuro-symbolic models. Starting from subsymbolic approaches that are still quite far from neuro-symbolic models, we ended up with models closer to neuro-symbolic systems. Although the model depicted in Fig. 8d loses the individual point of view, it is still helpful to evaluate the impact on the results of a hypothetical automatic approach that could capture relevant descriptive perceptions in the image and associate these perceptions with a significance measure. It could also represent perceptions previously sampled from a target population to represent a specific target group (e.g., by age or region). This means predicting the individual's sentiment based on a sample from a representative population this person is associated with, derived, for instance, from metadata. Although the models using perceptions do not contain a completely automatic symbolic module, they provide a simple way to represent the EK available in the PerceptSent dataset, which might support more complex symbolic processing (see Section 5 for possible extensions).

Evaluating Knowledge Domain's Components
This section evaluates the impact of individual EK features on the final accuracy. We apply a feature permutation importance technique, a model-agnostic, global interpretation method providing insights into a machine learning model's behavior. Following the guidelines of Molnar [83], we measure the importance of a feature by calculating the change in the model's prediction error after its removal. Considering the same configuration of Fig. 8c, we remove the single evaluator's metadata and perception features, one at a time, preserving the others, to measure the changes in accuracy.
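A sketch of this removal-based variant of permutation importance follows; `model`, `score_fn`, and the test tensors are placeholders for our trained network and evaluation routine, not released code.

```python
# Sketch of feature-removal importance: zero out one EK feature at a time,
# keep the rest intact, and record the resulting drop in accuracy.
import numpy as np

def ek_feature_importance(model, X_img, X_ek, y, score_fn):
    base = score_fn(model, X_img, X_ek, y)         # reference accuracy
    drops = {}
    for j in range(X_ek.shape[1]):
        X_mod = X_ek.copy()
        X_mod[:, j] = 0.0                          # remove feature j only
        drops[j] = base - score_fn(model, X_img, X_mod, y)
    return drops                                   # larger drop => more important
```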
Knowing which features are likely the most important is relevant to guide new approaches to generating such data. Fig. 12 shows a subset of the most relevant features.
As we can see, the impact differs between perceptions and metadata. For instance, the Fireworks perception does not impact performance much, whereas Everyday Image is critical. Most evaluator metadata do not significantly impact the performance, as the features in the subset with the lowest impact are mostly related to them; however, some, such as Age, are more relevant. This means one could focus on gathering a partial set of features and still perform well. As automatically gathering those types of features could be challenging, these experiments help shed some light on possible new guidelines for this task.

BEYOND SUBSYMBOLIC AI
There is a common understanding in the AI community that information at distinct levels of abstraction differs in structure and processing; neuro-symbolic systems could benefit from those different levels. At a lower level, information can be processed by subsymbolic modules (e.g., objects segmented from images), and otherwise by symbolic modules (e.g., processes involving human reasoning). Our attempt to boost the baseline model's performance with a near-optimum list of the main objects extracted from the scene image as independent entities proved insufficient to improve the analysis of sentiments in PerceptSent images. However, we have shown that leveraging deep features through particular EK, namely, the individualized or grouped perceptions available in PerceptSent, significantly improves performance. Thus, major progress in this research topic might come through methods designed to automatically extract concepts related to such perceptions. Currently, many expert systems are available, but none can automatically extract representations of the high-level knowledge expressed by the volunteers, i.e., the perceptions representing EK. Thus, we need new approaches to represent these human perceptions as faithfully as possible.
In this context, exploring common sense computing [84] should not be neglected. It aims to teach computers the things we all know about the world and to give them the capacity to reason about these things. For instance, Dragoni et al. [85] propose a conceptual model to support the structured analysis of emotions from multimodal resources based on SenticNet, a commonsense knowledge base for sentiment analysis. SenticNet leverages an ensemble of symbolic and subsymbolic AI techniques to gain a deeper understanding of natural language. While SenticNet helps identify emotions in text, such as rage and bliss, it is not specialized in identifying the high level of subjectivity present in perceptions like Everyday Image or Poverty. Despite that, it is an example that enables reasoning over the relevant information detected in the images, for instance, objects and captions, aiming to improve visual sentiment analysis. Thus, common sense computing could be applied to help develop new automatic systems for extracting the perceptions generated by humans in our dataset.
A knowledge graph is a structured representation of facts, with nodes representing entities, real-world objects, or abstract concepts, and edges as connections between entities, where nodes and edges have properties or attributes [86]. This representation is useful for modeling the semantic information between entities and their relational properties. In the context of visual information extraction, Fang et al. [87] propose using knowledge graphs in any object detection algorithm to ensure semantic consistency during the object recognition process. From our point of view, the association of propositional-logic-based modules with knowledge graphs could support mapping scene objects (the subsymbolic information used in our experiments) into more complex entities like perceptions. For this, propositions whose variables indicate the presence or absence of objects in the image, or any other extra information, could be entities related (or not) through the edges of relational graphs to more abstract entities like perceptions.
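To illustrate the idea, the following sketch hard-codes a few hypothetical propositional rules mapping detected objects to perception entities; the rules are invented examples for illustration only, not part of PerceptSent.

```python
# Purely illustrative propositional mapping from scene objects to the more
# abstract perception entities discussed above; all rules are hypothetical.
def infer_perceptions(objects):
    perceptions = set()
    if {"tree", "grass", "beach"} & objects:       # any nature-related object
        perceptions.add("nature")
    if {"gun", "soldier"} & objects:               # any violence-related object
        perceptions.add("violence")
    if {"trash", "graffiti"} & objects:            # degradation-related objects
        perceptions.add("lack of maintenance")
    return perceptions

print(infer_perceptions({"tree", "trash", "car"}))  # nature, lack of maintenance
```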
Automatic image captioning, namely, the generation of natural-language descriptions for an image [88], can also be an interesting source of information for visual sentiment analysis. Fig. 13 shows three sentences automatically produced for PerceptSent images by an attention-based transformer architecture [88], [89], trained over more than 300,000 images, with five reference captions annotated by humans per image, from the MS COCO captioning dataset [90].
Observe that some captions are very informative (Fig. 13a), whereas others (Figs. 13b and 13c) fail to capture the scene context. Considerable efforts have been made in this field to improve such descriptions, such as the proposal of Wang [91], which uses a neuro-symbolic representation of the image in the form of an attributed relational graph to model the object relationships. Furthermore, we could use textual descriptions from distinct machine learning techniques, trained over datasets of different domains of human knowledge, and leverage each individual model's strength through ensemble techniques [92], [93], [94] to improve the prediction process.
The examples of neuro-symbolic systems shown in this section make clear the usefulness of going beyond subsymbolic AI toward systems perhaps closer to human knowledge. As a shared characteristic, they combine sound symbolic reasoning with efficient subsymbolic AI models. Therefore, from our perspective, a promising avenue for developing rich computational cognitive models for visual sentiment analysis emerges from the dataset, PerceptSent, proposed and explored in the present paper.

CONCLUSION
One of the main contributions of this study lies in a novel dataset with socioeconomic information about the evaluators and their perceptions of every assessed image. It also contains subsymbolic information in the form of near-optimum lists of the main objects extracted from the scene images. From our point of view, this dataset will help researchers investigate new aspects of the challenging problem of visual sentiment analysis.
We have also explored in the paper some of the new possibilities introduced by our dataset using ResNet, a state-of-the-art deep architecture in the context of visual sentiment analysis on social media images. Among other results, we have found evidence of the importance of expressing subjectivity to the network. Considering external knowledge modules composed of different types of perception and metadata representations, we observed that the F-score values for experiments with individual evaluators' perceptions as input were surprisingly high, above 97%. One reason why an individual sentiment-based approach outperformed a dominant one is the diversity of opinions when judging a scene. For some people, what is relevant is the presence of particular objects, while for others, these same objects are completely irrelevant. In other words, people may agree on some specific sentiment category, but the reasons for choosing that category may differ. Thus, the network might exploit patterns in such answers together with features from the images to achieve such a good performance. The results indicate that metadata seems less relevant. However, it is important to consider that metadata might play a significant role in a hypothetical model responsible for exclusively detecting perceptions, a very challenging problem due to its high level of abstraction and subjectivity.
Suppose, for instance, that a neuro-symbolic approach could detect and extract perceptions from images according to particular individuals, the information needed in our best experiment. In that case, our results show that the problem of visual sentiment analysis becomes much more straightforward. Although the current state of the art is far from providing solutions for such a type of detection, our results may indicate a path for further investigation. Nevertheless, as stated by [80], recognizing that the dominant sentiment is too generic while predicting a personalized sentiment is too specific, modeling sentiments for groups of users who share similar tastes or interests may be an interesting choice. In this sense, we have discussed future directions for going beyond subsymbolic models, taking as starting points the proposed models that benefited from the information available in PerceptSent.

Fig. 1 .
Fig. 1. The distribution per sentiment polarity of the 25,000 ratings for the 5,000 images that compose the PerceptSent dataset.

Fig. 4 .
Fig. 4. Perceptions word cloud: a compact representation of the main human perceptions for the selected sentiment polarities, with occurrence frequency proportional to the font size. Words belonging to the positive or slightly positive categories are shown in green, neutral in yellow, and negative or slightly negative in red.

Fig. 5 .
Fig. 5. Sentiment polarity vector ($s_g$) and selected perceptions for an image sample of the proposed dataset. Note that one evaluator can select multiple perception categories for each image.

Fig. 6 .
Fig. 6. Deep architecture $D[f_i]$ (baseline) on different formulations of the visual sentiment classification problem, each one with a different number $C$ of target classes (polarities).

Fig. 7 .
Fig. 7. F-score performance and 95% confidence intervals for the different problem setups.

Fig. 8 .
Fig. 8. Architecture extensions for sentiment analysis: (a) prediction with independent scene objects; (b) dominant sentiment prediction with grouped EK information; (c) individual sentiment prediction with single EK information; (d) individual sentiment prediction with single and grouped EK information.

Fig. 9 .
Fig. 9. Samples from the PerceptSent dataset and the network prediction results for the architecture shown in Fig. 8b.

Fig. 10 .
Fig. 10. Performance on problem version $P_3$ with $\tau = 3$ for the baseline and proposed models. For architectures that consider a dominant sentiment, we use 5-fold cross-validation on 4,506 samples. The remaining (individual) ones use 5-fold cross-validation on 25,000 samples. Bar charts highlighted in blue tones use the evaluators' perceptions and metadata as input.

Fig. 11 .
Fig. 11. Scene objects: manual annotations of images from the PerceptSent dataset. For each image, the dataset provides a list of axis-aligned rectangular boxes with the object coordinates $x_{min}$, $y_{min}$, $x_{max}$, and $y_{max}$ (image rectangles), as well as the object category (image labels).

Fig. 12 .
Fig. 12. Visual representation of the feature importance results considering EK components. Each line represents the average percentage decrease in accuracy obtained from a 5-fold cross-validation.

Fig. 13 .
Fig. 13. Automatic image captioning using an attention-based transformer architecture for sample images collected from PerceptSent: (a) relatively accurate scene descriptions; (b) lack of important information, such as broken, trash, gun, or soldier, to capture the semantic meaning of the scene; and (c) completely wrong descriptions.

TABLE 1
A Summary of Some of the Main Datasets Used for Visual Sentiment Analysis

TABLE 2
Evaluator's Metadata Collected

TABLE 3
Dominant Sentiment Polarity and Possible Perceptions

TABLE 5
Different Problem Configurations for the Analysis of the Baseline Approach Regarding the Type of Dominance, Threshold Value, and Problem Version

TABLE 6
List of Independent Scene Objects Manually Annotated Over the 5,000 Images for the PerceptSent Dataset