I agree with “agreeing to disagree”

David Sasu
Jan 20, 2022


Offensive language detection on social media is one of the prime examples of Natural Language Processing applied in a real-world context. To perform this task, machine learning models are trained on large volumes of data from selected social media platforms, so that they can learn the patterns that distinguish ‘offensive’ posts from ‘non-offensive’ ones.

Even though these machine learning models are built with clever and sophisticated algorithms, the task is very difficult to accomplish perfectly because language is inventive and constantly evolving. Because of this difficulty, it is important that the models developed for this task are given the best training data available. The problem is that most of the available training data does not accurately reflect the often ambiguous nature of language. This tends to produce models that are not very robust: they perform poorly when tested on data that does not clearly carry an ‘offensive’ or ‘non-offensive’ tone. In an EMNLP 2021 paper entitled “Agreeing to Disagree: Annotating Offensive Language Datasets with Annotators’ Disagreement”, the authors (Leonardelli et al., 2021) point out this problem, propose ways to deal with it and provide datasets that are more representative of the ambiguities of natural language. In this blog post, we are going to explore their contributions, suggestions and findings.

The benchmark datasets provided by Leonardelli et al. consist of posts from Twitter, and each post falls under one of 3 main topics: Covid-19, the US Presidential elections and the Black Lives Matter movement. For each post in a dataset, there are 5 crowd-sourced judgements, and each judgement expresses an opinion on whether the post is offensive or non-offensive. If all 5 judgements assign the post the same label A (offensive or non-offensive), the post is marked A++. If 4 out of 5 judgements agree on the label, the post is marked A+. If only 3 out of 5 judgements agree, which is the lowest possible majority for a binary label, the post is marked A0.
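As a minimal sketch of this labelling scheme, the function below maps 5 binary judgements to a majority label and an agreement level. The function name and the label strings are assumptions made for illustration; they are not taken from the paper's code or data format.

```python
from collections import Counter

# Hypothetical label strings; the released datasets may encode labels differently.
LABELS = {"offensive", "non-offensive"}


def agreement_level(judgements):
    """Return (majority_label, level) for exactly five binary judgements."""
    if len(judgements) != 5:
        raise ValueError("expected exactly 5 judgements")
    if not set(judgements) <= LABELS:
        raise ValueError("judgements must be 'offensive' or 'non-offensive'")
    # With 5 binary votes, the majority label always has 3, 4 or 5 votes.
    label, votes = Counter(judgements).most_common(1)[0]
    level = {5: "A++", 4: "A+", 3: "A0"}[votes]
    return label, level


print(agreement_level(["offensive"] * 4 + ["non-offensive"]))  # ('offensive', 'A+')
```

Because there are only two labels and an odd number of judgements, a tie is impossible, so every post receives exactly one of the three levels.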

To ensure that each benchmark dataset was well balanced between ‘offensive’ and ‘non-offensive’ posts, and that each post and its group of judgements were of the highest quality, a strict multi-stage annotation process was used. The first stage involved automatically generating 5 different judgements for each post using 5 different BERT-based classifiers. The second stage involved selecting an equal number of posts from each agreement class of the ensemble (A++, A+, A0) to be manually annotated by native speakers from the United States who were familiar with the topics covered in the posts. These first two stages were necessary to ensure that the data finally annotated by humans would be properly distributed between the ‘offensive’ and ‘non-offensive’ labels.

In the third stage, which is the manual annotation process, a group of posts was first selected from the different domains and annotated by expert linguists. The posts within this group that had perfect agreement on their assigned label were used as a gold standard, and the posts outside this group were given to regular annotators. To ensure that the annotations were of high quality, all annotations from any annotator who failed to assign the expected label to at least 70% of the gold-standard posts were rejected. Posts that did not have 5 accepted annotations were also rejected and were not included in the final benchmark datasets.
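The quality filter in the manual stage can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the data layout (dictionaries keyed by annotator and post id), the function names, and the choice to compute accuracy only over the gold posts an annotator actually judged are all hypothetical.

```python
from collections import Counter


def accepted_annotators(gold, annotations, threshold=0.7):
    """Keep annotators who match the gold label on at least `threshold` of gold posts.

    gold: {post_id: label}
    annotations: {annotator_id: {post_id: label}}
    """
    accepted = set()
    for annotator, labels in annotations.items():
        judged = [post_id for post_id in gold if post_id in labels]
        if not judged:
            continue  # assumption: no gold posts judged means no way to verify quality
        correct = sum(labels[post_id] == gold[post_id] for post_id in judged)
        if correct / len(judged) >= threshold:
            accepted.add(annotator)
    return accepted


def fully_annotated_posts(gold, annotations, accepted, required=5):
    """Return non-gold posts that have at least `required` accepted annotations."""
    counts = Counter()
    for annotator in accepted:
        for post_id in annotations[annotator]:
            if post_id not in gold:
                counts[post_id] += 1
    return {post_id for post_id, n in counts.items() if n >= required}
```

A post survives only if enough of its annotations come from annotators who passed the gold-standard check, which mirrors the two rejection rules described above.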

After constructing the benchmark datasets, Leonardelli et al. performed several experiments to ascertain the effect of label agreement or disagreement on classifier behaviour. Two classifiers were used in these experiments, both BERT-based. The difference between them was that one was fine-tuned directly on domain data, while the other first went through an intermediate fine-tuning step on generic data.

In the first set of experiments, the aim was to discover the impact of label agreement or disagreement in the training data on classifier performance. For these experiments, the classifiers were trained and tested on different combinations of data from the different domains, with varying levels of label agreement. The results indicate two things. Firstly, it is not necessary to use huge amounts of domain data to fine-tune the classifiers, since small amounts of data may produce equally good performance. Secondly, every training combination that included data with the lowest level of label agreement (A0) performed worse than the same combination with that data excluded.

In the second set of experiments, the aim was to discover the impact of label agreement or disagreement in the test data on classifier performance. In these experiments, the classifiers were trained on different combinations of data from different domains, with varying levels of label agreement, but were tested on data from the different domains with a single, fixed level of label agreement. The results demonstrated that classifier performance decreased as the level of label agreement in the test set decreased. In other words, the more ambiguous the test data became, the worse the classifiers performed. Subsequent experiments showed, however, that training the classifiers on both non-ambiguous and mildly ambiguous data improved their performance on ambiguous test data.
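To make this kind of evaluation concrete, here is a small sketch that scores a classifier's predictions separately for each agreement level. It uses plain accuracy to keep the example short; that metric, the function name and the dictionary fields are assumptions for illustration and are not necessarily what the paper reports.

```python
from collections import defaultdict


def accuracy_by_level(examples, predictions):
    """Group test examples by agreement level and compute accuracy for each group.

    examples: list of dicts with hypothetical keys "label" and "level"
    predictions: list of predicted labels, aligned with `examples`
    """
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have the same length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for example, predicted in zip(examples, predictions):
        totals[example["level"]] += 1
        hits[example["level"]] += int(predicted == example["label"])
    return {level: hits[level] / totals[level] for level in totals}
```

Reporting one score per level, rather than a single aggregate, is what makes a drop on ambiguous (A0) test data visible.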

In the third set of experiments, the aim was to test cross-domain classification at different label agreement levels, so as to minimise the impact of possible in-domain overfitting. In these experiments, data from 2 of the 3 domains was used for training and data from the remaining domain was used for testing. For instance, A++ data from the Covid-19 and US Presidential elections domains was used for training, and A++ data from the Black Lives Matter domain was used for testing. The results showed that the classifiers performed well when the training data had a high level of label agreement, even in an out-of-domain scenario. The results also showed that adding data with a high level of label disagreement (A0) to the training data has a negative effect on classifier performance.
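The cross-domain setup described above amounts to holding out one domain and filtering both sides by agreement level. A minimal sketch follows, assuming each post is a dictionary with hypothetical "domain" and "level" fields; the domain strings are placeholders rather than the dataset's actual identifiers.

```python
def cross_domain_split(posts, test_domain, train_levels, test_levels):
    """Train on every domain except `test_domain`, test only on `test_domain`.

    Only posts whose agreement level is in the given level sets are kept.
    """
    train = [
        post for post in posts
        if post["domain"] != test_domain and post["level"] in train_levels
    ]
    test = [
        post for post in posts
        if post["domain"] == test_domain and post["level"] in test_levels
    ]
    return train, test


# Example: train on A++ posts from two domains, test on A++ posts from the third.
posts = [
    {"domain": "covid", "level": "A++", "text": "..."},
    {"domain": "elections", "level": "A0", "text": "..."},
    {"domain": "blm", "level": "A++", "text": "..."},
]
train, test = cross_domain_split(posts, "blm", {"A++"}, {"A++"})
```

Adding "A0" to `train_levels` would reproduce the variant in which high-disagreement data is mixed into the training set.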

In the last set of experiments, the aim was to discover whether data with low label agreement provides useful information for training the classifiers, or whether its effect is no different from that of randomly assigned labels. These experiments were similar to the first set; the only difference was that data with random label annotations was generated and used in place of the data with low levels of label agreement. The results demonstrated that the classifiers performed worse with the randomly labelled data. This indicates that data with low levels of agreement does provide some useful information, even though it does not lead to high classifier performance.
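One way to build such a random-label control set is sketched below. It assumes the control keeps the same low-agreement posts and only replaces their labels with random ones; the post does not say exactly how the random data was generated, so this detail, along with the function name and field names, is an assumption.

```python
import random

# Hypothetical label strings, used only for this illustration.
LABELS = ("offensive", "non-offensive")


def randomise_labels(posts, level="A0", seed=0):
    """Return a copy of `posts` where posts at `level` get a uniformly random label."""
    rng = random.Random(seed)  # fixed seed so the control set is reproducible
    randomised = []
    for post in posts:
        if post["level"] == level:
            # Build a new dict so the original annotations are left untouched.
            post = {**post, "label": rng.choice(LABELS)}
        randomised.append(post)
    return randomised
```

Training once on the original A0 labels and once on the randomised ones, with everything else held fixed, isolates how much signal the low-agreement annotations actually carry.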

From all these experiments, Leonardelli et al. conclude that it is imperative to include ambiguous data, that is, data with mild or low levels of label agreement, in the training and testing datasets used to develop offensive language detection systems. They therefore suggest that the benchmark datasets used to evaluate these systems should include more examples of this kind of data. This makes sense: we are likely to overestimate a classifier's performance if we have never observed how it handles examples that are ambiguous and hard to classify. Hence, in the development of offensive language detection systems, it is better to agree that disagreement is good.

Link to paper: https://arxiv.org/abs/2109.13563