Negativity drives online news consumption

Ethics information

The research complies with all relevant ethical regulations. Ethics approval (2020-N-151) for the main analysis was obtained from the Institutional Review Board (IRB) at ETH Zurich. For the user validation, ethics approval (IRB-FY2021-5555) was obtained from the IRB at New York University. Participants in the validation study were recruited from the subject pool of the Department of Psychology at New York University in exchange for 0.5 h of research credit for various psychology courses. Participants provided informed consent for the user validation studies. New York University did not require IRB approval for the main analysis, as it is not classified as human subjects research.

Large-scale field experiments

In this research, we build upon data from the Upworthy Research Archive53. The data have been made available through an agreement between Cornell University and Upworthy. We were granted access to this dataset on the condition that we follow the procedure for a Registered Report. In Stage 1, we had access only to a subset of the dataset (that is, the ‘exploratory sample’), on the basis of which we conducted the preliminary analysis for pre-registering hypotheses. In Stage 2 (this paper), we had access to a separate subset of the data (that is, the ‘confirmatory sample’), on the basis of which we tested the pre-registered hypotheses. Here, our analysis was based on data from N = 22,743 experiments (RCTs) collected on Upworthy between 24 January 2013 and 14 April 2015.

Each RCT corresponds to one news story, for which different headline variations were compared. Formally, for each headline variation j in an RCT i (\(i = 1, \ldots, N\)), the following statistics were recorded: (1) the number of impressions, that is, the number of users to whom the headline variation was shown (impressionsij), and (2) the number of clicks the headline variation generated (clicksij). The CTR was then computed as \(\mathrm{CTR}_{ij} = \frac{\mathrm{clicks}_{ij}}{\mathrm{impressions}_{ij}}\). The experiments were conducted separately (that is, only a single experiment was run on the entire website at any given time), so each test can be analysed as independent of all other tests53. Examples of news headlines from the experiments are presented in Table 2. The Upworthy Research Archive contains data aggregated at the headline level and, thus, does not provide individual-level data for users.

The data were subjected to the following filtering. First, all experiments consisting solely of a single headline variation were discarded. Single headline variations exist because Upworthy conducted RCTs on features of their articles other than headlines, predominantly teaser images. In many RCTs in which teaser images were varied, headlines were not varied at all (image data were not made available to researchers by the Upworthy Research Archive, so we were unable to incorporate image RCTs into our analyses, although we validated our findings as part of the robustness checks). Second, some experiments contained multiple treatment arms with identical headlines, which were merged into one representative treatment arm by summing their clicks and impressions. Such duplicates occurred when both images and headlines were varied in RCTs for the same story. They are relatively rare in the dataset, but for robustness checks regarding image RCTs, see Supplementary Table 9.
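
For illustration, a minimal sketch of these preparation steps in R is shown below, assuming the archive has been read into a data frame `raw_archive` with hypothetical columns `experiment_id`, `headline`, `clicks` and `impressions` (the actual Upworthy Research Archive uses its own schema; the two steps are applied here in an order that yields the same result):

```r
library(dplyr)

headlines <- raw_archive %>%
  # merge treatment arms with identical headlines within an experiment
  group_by(experiment_id, headline) %>%
  summarise(clicks = sum(clicks),
            impressions = sum(impressions),
            .groups = "drop") %>%
  # discard experiments left with only a single headline variation
  group_by(experiment_id) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  # click-through rate per headline variation
  mutate(ctr = clicks / impressions)
```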

The analysis in the current Registered Report Stage 2 is based on the confirmatory sample of the dataset53, which was made available to us only after pre-registration was conditionally accepted. In the previous pre-registration stage, we presented the results of a preliminary analysis based on a smaller, exploratory sample (see Registered Report Stage 1). Both were processed using identical methodology. The pilot sample for our preliminary analysis comprised 4,873 experiments, involving 22,666 different headlines before filtering and 11,109 headlines after filtering, which corresponds to 4.27 headlines on average per experiment. On average, there were approximately 16,670 participants in each RCT. Additional summary statistics are given in Supplementary Table 1.

Design

We present a design table summarizing our methods in Table 1.

Sampling plan

Given our opportunity to secure an extremely large sample in which the N was predetermined, we ran a simulation before pre-registration to estimate the statistical power we would achieve for an effect size corresponding to a regression coefficient of 0.01 (that is, an approximately 1% change in the odds of a click per standard deviation increase in negative words). This effect size is slightly more conservative than the effect size estimates from our pilot studies (see Stage 1 of the Registered Report) and is derived from theory76. The size of the confirmatory Upworthy data archive is N = 22,743 RCTs, with between 3 and 12 headlines per RCT. This thus corresponds to a total sample of between 68,229 and 227,430 headlines. Because we were not aware of the exact size during pilot testing, we generated datasets through a bootstrapping procedure that sampled N = 22,743 RCTs with replacement from our pilot sample of tests. We simulated 1,000 such datasets, and for each dataset we generated ‘clicks’ using the estimated parameters from the pilot data. Finally, each dataset was analysed using the model described below. This procedure was repeated for both models (varying intercepts, and a combination of varying intercepts and varying slopes). We found that, under the assumptions about effect size, covariance matrix and data-generating process from our pilot sample, we would have greater than 99% power to detect an effect size of 0.01 in the final sample for both models.
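
A schematic sketch of how such a simulation could be organized is given below. All object and column names (`pilot`, `experiment_id`, `positive_z`, `negative_z`, `alpha_hat`, `beta_pos_hat`, `alpha_i`) are hypothetical placeholders, the control variables are omitted for brevity, and this is not the exact simulation code used for the pre-registration:

```r
library(dplyr)
library(lme4)

set.seed(42)
n_sims   <- 1000     # number of simulated datasets
n_rcts   <- 22743    # size of the confirmatory sample
beta_neg <- 0.01     # assumed effect of negative words (log-odds per s.d.)

significant <- replicate(n_sims, {
  # resample whole RCTs (with replacement) from the pilot sample
  ids  <- sample(unique(pilot$experiment_id), n_rcts, replace = TRUE)
  boot <- bind_rows(lapply(seq_along(ids), function(k) {
    pilot %>%
      filter(experiment_id == ids[k]) %>%
      mutate(experiment_id = k)            # new id for each resampled RCT
  }))

  # simulate clicks under the assumed data-generating process
  # (alpha_hat, beta_pos_hat and the per-RCT intercepts alpha_i stand in
  #  for parameters estimated from the pilot data)
  eta <- alpha_hat + boot$alpha_i +
         beta_pos_hat * boot$positive_z + beta_neg * boot$negative_z
  boot$clicks_sim <- rbinom(nrow(boot), boot$impressions, plogis(eta))

  # refit the model and record whether the negativity coefficient is significant
  fit <- glmer(cbind(clicks_sim, impressions - clicks_sim) ~
                 positive_z + negative_z + (1 | experiment_id),
               family = binomial, data = boot)
  coef(summary(fit))["negative_z", "Pr(>|z|)"] < 0.05
})

mean(significant)   # estimated statistical power
```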

Analysis plan

Text mining framework

Text mining was used to extract emotional words from news headlines. To prepare the data for the text mining procedure, we applied standard preprocessing to the headlines. Specifically, the running text was converted into lower case and tokenized, and special characters (that is, punctuation and hashtags) were removed. We then applied a dictionary-based approach analogous to those of earlier research22,39,40,41.
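
A minimal preprocessing sketch with quanteda, assuming the headlines sit in a placeholder column `headlines$headline`:

```r
library(quanteda)

# lower-case the running text, tokenize, and drop punctuation and symbols
txt  <- char_tolower(headlines$headline)
toks <- tokens(txt, remove_punct = TRUE, remove_symbols = TRUE)
```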

We performed sentiment analysis on the basis of the Linguistic Inquiry and Word Count (LIWC)77. The LIWC contains word lists classifying words according to both a positive (n = 620 words, for example ‘love’ and ‘pretty’) and negative sentiment (n = 744 words, for example ‘wrong’ and ‘bad’). A list of the most frequent positive and negative words in our dataset is given in Supplementary Table 2.

Formally, sentiment analysis was based on single words (that is, unigrams) due to the short length of the headlines (mean length: 14.965 words). We counted the number of positive words (npositive) and the number of negative words (nnegative) in each headline. A word was considered ‘positive’ if it appeared in the dictionary of positive words (and, conversely, ‘negative’ if it appeared in the dictionary of negative words). We then normalized each frequency by the length of the headline, that is, the total number of words in the headline (ntotal). This yielded the two separate scores

$$\mathrm{Positive}_{ij} = \frac{n_{\mathrm{positive}}}{n_{\mathrm{total}}}\;\mathrm{and}\;\mathrm{Negative}_{ij} = \frac{n_{\mathrm{negative}}}{n_{\mathrm{total}}}$$

for headline j in experiment i. As such, the corresponding scores for each headline represent percentages. For example, if a headline has 10 words out of which one is classified as ‘positive’ and none as ‘negative’, the scores are \(\mathrm{Positive}_{ij} = 10\%\) and \(\mathrm{Negative}_{ij} = 0\%\). If a headline has 10 words and contains one ‘positive’ and one ‘negative’ word, the scores are \(\mathrm{Positive}_{ij} = 10\%\) and \(\mathrm{Negative}_{ij} = 10\%\). A headline may contain both positive and negative words, so both variables were later included in the model.
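
The scoring itself amounts to counting dictionary hits and normalizing by headline length. A sketch is given below; because LIWC is proprietary, `positive_words` and `negative_words` are placeholder vectors standing in for the LIWC word lists:

```r
score_headline <- function(tokens, positive_words, negative_words) {
  n_total    <- length(tokens)
  n_positive <- sum(tokens %in% positive_words)
  n_negative <- sum(tokens %in% negative_words)
  c(Positive = n_positive / n_total,
    Negative = n_negative / n_total)
}

# example: 10-word headline with one positive and one negative word
score_headline(c("why", "this", "pretty", "idea", "went", "so",
                 "wrong", "for", "one", "town"),
               positive_words = c("love", "pretty"),
               negative_words = c("wrong", "bad"))
#> Positive Negative
#>      0.1      0.1
```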

Negation words (for example, ‘not,’ ‘no’) can invert the meaning of statements and thus the corresponding sentiment. We performed negation handling as follows. First, the text was scanned for negation terms using a predefined list, and then all positive (or negative) words in the neighbourhood were counted as belonging to the opposite word list, that is, they were counted as negative (or positive) words. In our analysis, the neighbourhood (that is, the so-called negation scope) was set to 3 words after the negation. As a result, a phrase such as ‘not happy’ was coded as negative rather than positive. Here we used the implementation from the sentimentr package (details at https://cran.r-project.org/web/packages/sentimentr/readme/README.html).
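
The analysis relied on the sentimentr implementation for this step; the stand-alone sketch below only illustrates the idea of a three-word negation scope (the negation list and helper function are simplified placeholders, not the sentimentr internals):

```r
negation_terms <- c("not", "no", "never", "without")   # illustrative list only

count_with_negation <- function(tokens, positive_words, negative_words,
                                scope = 3) {
  negated <- rep(FALSE, length(tokens))
  for (k in which(tokens %in% negation_terms)) {
    if (k < length(tokens)) {
      # mark up to `scope` tokens after the negation as negated
      negated[(k + 1):min(k + scope, length(tokens))] <- TRUE
    }
  }
  pos_hit <- tokens %in% positive_words
  neg_hit <- tokens %in% negative_words
  # words inside a negation scope are counted for the opposite list
  c(n_positive = sum(pos_hit & !negated) + sum(neg_hit & negated),
    n_negative = sum(neg_hit & !negated) + sum(pos_hit & negated))
}

count_with_negation(c("she", "is", "not", "happy"),
                    positive_words = "happy", negative_words = "sad")
#> n_positive n_negative
#>          0          1
```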

Using the above dictionary approach, our objective was to quantify the presence of positive and negative words. As such, we did not attempt to infer the internal state of a perceiver on the basis of the language they write, consume or share73. Specifically, readers’ preference for headlines containing negative words does not imply that users ‘felt’ more negatively while reading said headlines. Rather, we quantified how the presence of certain words is linked to concrete behaviour. Accordingly, our pre-registered hypotheses test whether negative words increase consumption rates (Table 1).

We validated the dictionary approach in the context of our corpus on the basis of a pilot study78. Here we used the positive and negative word lists from LIWC77 and performed negation handling as described above. Perceived judgements of positivity and negativity in headlines correlate with the number of negative and/or positive words each headline contains. Specifically, we correlated the mean of the 8 human judges’ scores for a headline with the dictionary-based sentiment rating for that headline. We found a moderate but significant positive correlation (rs = 0.303, P < 0.001). These findings validate that our dictionary approach captures significant variation in readers’ perception of emotions in headlines. More details are available in Supplementary Tables 21 and 22.

Two additional text statistics were computed. First, we determined the length of the news headline as given by the number of words. Second, we calculated a text complexity score using the Gunning Fog index79. This index estimates the years of formal education necessary for a person to understand a text upon reading it for the first time: \(0.4 \times \left( \mathrm{ASL} + 100 \times \frac{n_{\mathrm{wsy} \geq 3}}{n_{\mathrm{w}}} \right)\), where ASL is the average sentence length (in words), \(n_{\mathrm{w}}\) is the total number of words and \(n_{\mathrm{wsy} \geq 3}\) is the number of words with three syllables or more. A higher value thus indicates greater complexity. Both headline length and the complexity score were used as control variables in the statistical models. Results based on alternative text complexity scores are reported as part of the robustness checks.
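
Both statistics could, for example, be computed with quanteda and its companion package quanteda.textstats (the use of quanteda.textstats is an assumption for illustration; the paper only states that the pipeline used quanteda and sentimentr):

```r
library(quanteda)
library(quanteda.textstats)   # assumed; provides textstat_readability()

# headline length in words and Gunning Fog complexity score
headlines$length     <- ntoken(tokens(headlines$headline, remove_punct = TRUE))
headlines$complexity <- textstat_readability(headlines$headline,
                                             measure = "FOG")$FOG
```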

The above text mining pipeline was implemented in R v4.0.2 using the quanteda (v2.0.1) and sentimentr (v2.7.1) packages.

Empirical model

We estimated the effect of emotions on online news consumption using a multilevel binomial regression. Specifically, we expected that negative language in a headline affects the probability of users clicking on a news story to access its content. To test our hypothesis, we specified a series of regression models where the dependent variable is given by the CTR.

We modelled news consumption as follows: \(i = 1, \ldots ,N\) refers to the different experiments in which different headline variations for news stories are compared through an RCT; clicksij denote the number of clicks from headline variation j belonging to news story i. Analogously, impressionsij refer to the corresponding number of impressions. We followed previous approaches80 and modelled the number of clicks to follow a binomial distribution as

$$\mathrm{clicks}_{ij} \sim \mathrm{Binomial}(\mathrm{impressions}_{ij},\theta _{ij})$$

where 0 ≤ θij ≤ 1 is the probability of a user clicking on a headline in a single Bernoulli trial and where θij corresponds to the CTR of headline variation j from news story i.

We estimated the effect of positive and negative words on the CTR θij and captured between-experiment heterogeneity through a multilevel structure. We further controlled for other characteristics across headline variations, namely length, text complexity and the relative age of a headline (based on the age of the platform). The regression model is then given by

$$\mathrm{logit}\left(\theta_{ij}\right) = \alpha + \alpha_i + \beta_1\mathrm{Positive}_{ij} + \beta_2\mathrm{Negative}_{ij} + \gamma_1\mathrm{Length}_{ij} + \gamma_2\mathrm{Complexity}_{ij} + \gamma_3\mathrm{PlatformAge}_{ij}$$

where α is the global intercept and αi is an experiment-specific intercept (that is, a random effect). The experiment-specific intercepts αi are assumed to be independently and identically normally distributed with a mean of zero; they capture heterogeneity at the experiment level, that is, some news stories might be more interesting than others. In addition, we controlled for the length (Lengthij) and complexity (Complexityij) of the text in the news headline, as well as the relative age of the current experiment with regard to the platform (PlatformAgeij). The latter denotes the number of days between the current experiment and the first experiment on Upworthy.com in 2012 and thus allowed us to control for potential learning effects as well as changes in editorial practices over time. The coefficient β2 is our main variable of interest: it quantifies the effect of negative words on the CTR.

In the above analysis, all variables were z-standardized for better comparability. That is, before estimation, we subtracted the sample mean and divided the difference by the standard deviation. Because of this, the regression coefficients β1 and β2 quantify the change in the dependent variable (on the logit scale) per standard deviation increase in the corresponding independent variable. This allowed us to compare the relative effect sizes across positive and negative words (as well as emotional words later). Due to the logit link, \(100 \times (e^{\beta} - 1)\) gives the percentage change in the odds of success resulting from a one standard deviation change in the independent variable. In our case, a successful event is indicated by the user clicking the headline.
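
A sketch of how this random-intercept model could be fitted with lme4 (the estimation indeed used lme4, as noted below, but the data frame and column names here are placeholders):

```r
library(lme4)

# z-standardize the predictors before estimation
z <- function(x) (x - mean(x)) / sd(x)
headlines <- transform(headlines,
                       positive_z     = z(positive),
                       negative_z     = z(negative),
                       length_z       = z(length),
                       complexity_z   = z(complexity),
                       platform_age_z = z(platform_age))

# binomial regression with experiment-specific random intercepts
m1 <- glmer(cbind(clicks, impressions - clicks) ~
              positive_z + negative_z + length_z + complexity_z +
              platform_age_z + (1 | experiment_id),
            family = binomial, data = headlines)

# percentage change in the odds of a click per s.d. increase in negative words
100 * (exp(fixef(m1)["negative_z"]) - 1)
```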

The above regression builds upon a global coefficient for capturing the effect of language on CTR; as such, language reception is assumed to be equal across different RCTs. This is consistent with previous works in which a similar global coefficient (without varying slopes) was used22,34,38,39. However, there is reason to assume that receptivity to language might vary across RCTs and thus across news (for example, negative language might resonate more strongly for political news than for entertainment news, or for certain news topics over others). In that case, the regression coefficients are no longer assumed to be identical across experiments but are allowed to vary. To account for this, we augmented the above random effects model with an additional varying-slopes specification. Here, a multilevel structure was used that accounts for the different levels due to the experiments \(i = 1, \ldots, N\). Specifically, the coefficients β1 and β2, capturing the effects of positive and negative words on CTR, respectively, were allowed to vary across experiments. Of note, this varying-slopes formalization was used only for the main analysis based on positive and negative language, and not for the subsequent extension to emotional words, where it is not practical because the number of treatment arms per experiment is small relative to the number of varying slopes.
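
Continuing the sketch above, the varying-slopes specification additionally lets the coefficients of the two language variables differ by experiment:

```r
# random intercepts plus experiment-specific slopes for positive/negative words
m2 <- glmer(cbind(clicks, impressions - clicks) ~
              positive_z + negative_z + length_z + complexity_z +
              platform_age_z +
              (1 + positive_z + negative_z | experiment_id),
            family = binomial, data = headlines)
```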

Here we conducted the analysis on the basis of both models, that is, (1) the random-effects model and (2) the random-effects model with additional varying slopes. If the estimates from both models are in the same direction, this should underscore the overall robustness of the findings. If the estimated coefficients from the random-effects model and the varying-slopes model contradict each other, both results are reported, but precedence in interpretation is given to the latter due to its more flexible specification.

All models were estimated using the lme4 package (v1.1.23) in R.

Extension to discrete emotional words

To provide further insights into how emotional language relates to news consumption, we extended our text mining framework and performed additional secondary analyses. We were specifically interested in the effect of different emotional words (anger, fear, joy and sadness) on the CTR.

Here, our analyses were based on the NRC emotion lexicon due to its widespread use in academia and the scarcity of other comparable dictionaries with emotional words for content analysis63,64. The NRC lexicon comprises 14,182 English words that are classified according to the 8 basic emotions of Plutchik’s emotion model67. Basic emotions are regarded as universally recognized across cultures, and more complex emotions can be derived from them69,81. The 8 basic emotions covered by the NRC are anger, anticipation, joy, trust, fear, surprise, sadness and disgust.

We calculated scores for basic emotions embedded in news headlines on the basis of the NRC emotion lexicon63. We counted the frequency of words in the text that belong to each basic emotion in the NRC lexicon (that is, an 8-dimensional count vector). A list of the most frequent emotional words in our dataset is given in Supplementary Table 18. Afterwards, we divided the word counts by the total number of dictionary words in the text, so that the vector is normalized to sum to one across the basic emotions. Following this definition, the embedded emotions in a text might be composed of, for instance, 40% ‘anger’ while the remaining 60% are ‘fear’. We omitted headline variations that do not contain any emotional words from the NRC emotion lexicon (since, otherwise, the denominator is not defined). Owing to this extra filtering step, we obtained a final sample of 39,897 headlines. We again accounted for negations using the above approach, except that negated emotional words were not attributed to an opposite emotion but were simply skipped during the computation (as there is no defined ‘opposite’ emotion).
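
One freely available implementation of the NRC lexicon is the syuzhet package; its use here is an assumption for illustration, and the paper’s own pipeline (including its negation handling) may differ in detail:

```r
library(syuzhet)   # assumed; provides get_nrc_sentiment()

nrc <- get_nrc_sentiment(headlines$headline)   # word counts per category
emotions <- c("anger", "anticipation", "disgust", "fear",
              "joy", "sadness", "surprise", "trust")
counts <- nrc[, emotions]

# keep headlines with at least one emotional word and normalize the counts
# to sum to one across the eight basic emotions
total  <- rowSums(counts)
shares <- counts[total > 0, ] / total[total > 0]
```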

As a next step, we validated the NRC emotion lexicon for the context of our study through a user study. Specifically, we correlated the mean of the 8 human judges’ scores for a headline with the NRC emotion rating for that headline. We found that, overall, mean user judgements of emotions and scores from the NRC emotion lexicon are correlated (rs = 0.114, P < 0.001). Furthermore, mean user judgements and NRC scores were significantly correlated for four basic emotions, namely anger (rs = 0.22, P = 0.005), fear (rs = 0.29, P < 0.001), joy (rs = 0.24, P = 0.002) and sadness (rs = 0.30, P < 0.001). The four other basic emotions from the NRC emotion lexicon showed considerably lower correlation coefficients in the validation study, namely anticipation (rs = −0.07, P = 0.341), disgust (rs = 0.01, P = 0.926), surprise (rs = −0.06, P = 0.414) and trust (rs = 0.12, P = 0.122). Because of this, we did not pre-register hypotheses for them.
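
Each of these checks amounts to a Spearman rank correlation between the mean judge ratings and the lexicon scores, for example (with `validation` as a hypothetical per-headline data frame):

```r
# Spearman correlation for one basic emotion; repeated analogously for each
cor.test(validation$judge_fear, validation$nrc_fear, method = "spearman")
```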

The multilevel regression was specified analogously to the model above but with different explanatory variables, that is,

$$\mathrm{logit}\left(\theta_{ij}\right) = \alpha + \alpha_i + \beta_1\mathrm{Anger}_{ij} + \beta_2\mathrm{Fear}_{ij} + \beta_3\mathrm{Joy}_{ij} + \beta_4\mathrm{Sadness}_{ij} + \gamma_1\mathrm{Length}_{ij} + \gamma_2\mathrm{Complexity}_{ij} + \gamma_3\mathrm{PlatformAge}_{ij}$$

where α is again the global intercept and αi is the experiment-specific random effect capturing heterogeneity across experiments i = 1,…, N. As above, we included the control variables, that is, length, text complexity and platform age. The coefficients β1,…, β4 quantify the effects of the emotional words (that is, anger, fear, joy and sadness) on the CTR.

Again, all variables were z-standardized for better comparability (that is, we subtracted the sample mean and divided the difference by the standard deviation). As a result, the regression coefficients quantify the change in the dependent variable (on the logit scale) per standard deviation increase in the corresponding independent variable. This allows us to compare the relative effect sizes across different emotions.
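
The corresponding lme4 call would mirror the one for the main model (a sketch with placeholder names; `emotion_sample` denotes the filtered set of headlines containing at least one NRC emotion word):

```r
m3 <- glmer(cbind(clicks, impressions - clicks) ~
              anger_z + fear_z + joy_z + sadness_z +
              length_z + complexity_z + platform_age_z +
              (1 | experiment_id),
            family = binomial, data = emotion_sample)
```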

Protocol registration

The Stage 1 protocol for this Registered Report was accepted in principle on 11 March 2022. The protocol, as accepted by the journal, can be found at https://springernature.figshare.com/articles/journal_contribution/Negativity_drives_online_news_consumption_Registered_Report_Stage_1_Protocol_/19657452.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.