Predicting Controversy

A. Basile - T. Caselli - M. Nissim

CLiC-it ~ December 12, 2017

the problem

Some headlines are controversial

…but some are not. Why?

Controversy

noun, con·tro·ver·sy, ˈkän-trə-ˌvər-sē

a discussion marked especially by the expression of opposing views (from Merriam-Webster)

Our definition

We call controversies those situations where, even after lengthy interactions, the opinions of the participants tend to remain unchanged and become more and more polarized towards extreme values (see Timmermans et al., 2017).

some examples

In volo sul Piemonte con biplano anni '30 (Repubblica)

[Flying over Piedmont in a 1930s biplane]

Medico anti vaccini radiato (Corsera)

[Anti-vaccine doctor struck off the medical register]

Piacenza, abbattuto cinghiale #agostino (Ansa)

[Piacenza: boar #agostino killed]

What is going on?

  1. In volo sul Piemonte con biplano anni '30
  2. Medico anti vaccini radiato
  3. Piacenza, abbattuto cinghiale #agostino
id  sad  wow  haha  angry  love  like
1     0    0     0      0     0    32
2    22   42    36    220   216  5700
3    78    5    33     34     7   125

let's build a corpus

Facebook pages of newspapers

Sources

Collecting data

manual annotation

is expensive

…but

how to do it, exactly?

distant supervision

take users' reactions as a proxy for controversy annotations

what to do with the counts?

In volo sul Piemonte con biplano anni '30

id  sad  wow  haha  angry  love  like
1     0    0     0      0     0    32

Entropy


H(X) = -\sum_{i} P(i) \log_2 P(i)

interpretation

entropy((A*1,B*1))
1
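
As a worked check of this value, read `(A*1, B*1)` as one reaction of type A and one of type B (an assumption about the slide's notation). Each type then has probability 1/2, and the entropy formula gives exactly one bit:

```latex
H(X) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right)
     = -\log_2\tfrac{1}{2}
     = 1
```

Two equally frequent reactions are the most "undecided" outcome for two types, which is why they reach the maximum of one bit.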

Examples, again

  1. In volo sul Piemonte con biplano anni '30
  2. Medico anti vaccini radiato
  3. Piacenza, abbattuto cinghiale #agostino
id  sad  wow  haha  angry  love  like    H
1     0    0     0      0     0    32  0.0
2    22   42    36    220   216  5700  0.5
3    78    5    33     34     7   125  1.9
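
A minimal sketch of how the H column could be computed from the reaction counts. The function name `reaction_entropy` and the choice to return 0 for a post with no reactions are assumptions made for illustration; the slides do not show the actual implementation.

```python
import math


def reaction_entropy(counts):
    """Shannon entropy (base 2) of a list of reaction counts."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: a post with no reactions gets entropy 0
    entropy = 0.0
    for count in counts:
        if count > 0:  # a zero count contributes nothing (0 * log 0 := 0)
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Counts in the order sad, wow, haha, angry, love, like (from the table above).
examples = {
    1: [0, 0, 0, 0, 0, 32],
    2: [22, 42, 36, 220, 216, 5700],
    3: [78, 5, 33, 34, 7, 125],
}

scores = {post_id: reaction_entropy(c) for post_id, c in examples.items()}
for post_id, h in scores.items():
    print(post_id, round(h, 2))
```

The scores should rank the three headlines in the same order as the H column. Small differences in the last digit can come from how the values were rounded for the slide.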

Bottom line

high entropy = high controversy

Task

Given some text, predict its controversy

Model

  • Support Vector Regressor
  • word- & char-ngrams
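
To make the feature side of the model concrete, here is a small standard-library sketch that extracts word and character n-gram counts from a headline. The n-gram orders (word 1-2, char 2-3), the lowercasing, the whitespace tokenizer, and all function names are assumptions, since the slides do not state them. In practice such features would be vectorized and passed to a regressor such as scikit-learn's `SVR`.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the character n-grams of a lowercased text (spaces included)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_features(text, word_orders=(1, 2), char_orders=(2, 3)):
    """Count word and char n-grams; the prefixes keep the two spaces apart."""
    features = Counter()
    for n in word_orders:
        for gram in word_ngrams(text, n):
            features["w:" + gram] += 1
    for n in char_orders:
        for gram in char_ngrams(text, n):
            features["c:" + gram] += 1
    return features


features = ngram_features("Medico anti vaccini radiato")
print(features.most_common(5))
```

Character n-grams can capture hashtags, spelling variants, and morphology that word n-grams miss, which is a common reason for combining the two.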

Results

source            baseline  std   model  std
ilgiornale        0.21      0.03  0.22   0.04
ilgiornale+ansa   0.23      0.04  0.19   0.03
ilmanifesto       0.15      0.04  0.11   0.04
ilmanifesto+ansa  0.24      0.04  0.14   0.03
repubblica        0.22      0.07  0.18   0.07
repubblica+ansa   0.24      0.04  0.15   0.04
corsera           0.24      0.06  0.16   0.06
corsera+ansa      0.24      0.03  0.14   0.04

Cross validation ~ MSE (the lower, the better)
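
For reference, a minimal sketch of the metric. The slides do not say what the baseline is, so the mean predictor below is only an assumed stand-in, and the entropy values are toy numbers invented for illustration, not corpus data.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between gold and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_baseline(train_targets, n_test):
    """Assumed baseline: always predict the mean of the training targets."""
    mean = sum(train_targets) / len(train_targets)
    return [mean] * n_test


# Toy entropy scores, invented for this example.
train_entropy = [0.0, 0.5, 1.9, 1.2]
test_entropy = [0.4, 1.6]

baseline_pred = mean_baseline(train_entropy, len(test_entropy))
baseline_mse = mean_squared_error(test_entropy, baseline_pred)
print(round(baseline_mse, 2))
```

In a cross-validation setting, this computation would be repeated on each held-out fold, and the mean and standard deviation over folds reported.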

Conclusions

the good

built an annotated corpus

trained a system

beat the baseline in 7 of the 8 settings

the bad

not much better than baseline

sentiment information does not help

the ugly

grouping news by event

anbasile.github.io/predictingcontroversy