The risk of bias in hate speech detection


Based on Sap et al., 2019

As social media platforms continue to grow, so does the volume of toxic and hateful content online. To manage this at scale, platforms increasingly rely on machine learning models to automatically detect hate speech and abusive language.

But what if these systems are not neutral?

What if the very data used to train them embeds racial bias — and the models simply learn and amplify it?

This post discusses the paper:

Sap et al., 2019 — “The Risk of Racial Bias in Hate Speech Detection”

and explores how racial bias can enter hate speech detection systems — and what we might do about it.


The Core Problem

Hate speech detection typically follows this pipeline:

  1. Collect data (tweets, posts, comments)
  2. Ask human annotators to label the data (hate / offensive / none)
  3. Train a machine learning model
  4. Deploy the model for automated moderation
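Steps 2–4 of this pipeline can be sketched in a few lines. This is a minimal illustration using scikit-learn on invented example texts and labels, not any real dataset or the paper's setup:

```python
# Minimal sketch of steps 2-4: annotated texts in, trained
# moderation model out. All data below is invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "I hate you and everyone like you",
    "you are all worthless trash",
    "what a lovely morning",
    "great game last night",
    "get out of here you idiot",
    "thanks for the help, friend",
]
labels = ["offensive", "offensive", "none", "none", "offensive", "none"]

# Step 3: train a simple bag-of-words classifier on the annotations.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Step 4: the deployed model now scores unseen text, inheriting
# whatever biases the human labels contained.
print(model.predict(["have a lovely game"]))
```

The point of the sketch: nothing in this code questions the labels. Whatever patterns the annotators encoded, the model reproduces.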

The assumption is that this process produces an objective system.

However:

If the annotations are biased, the model trained on them will also be biased.


A Motivating Example: Perspective API

Tools like Google’s Perspective API assign toxicity scores to text. But prior research has shown that phrases in African American English (AAE) are sometimes rated as more toxic than their white-aligned equivalents — even when they are not intended to be offensive.

This raises a critical concern:

If annotators are unfamiliar with a dialect, they may label it as offensive simply because it sounds unfamiliar. Bias can enter before the model is even trained.


The Paper’s Hypotheses

Sap et al. investigate three main questions:

  1. Do existing hate speech datasets contain racial bias?
  2. Does this bias propagate into trained models?
  3. Does providing annotators with dialect/race information change their judgments?

Methodology Overview

1. Using Dialect as a Proxy for Race

Since Twitter does not typically include self-reported race, the authors use a dialect model (Blodgett et al., 2016) to estimate whether a tweet is written in:

  1. African American English (AAE), or
  2. White-aligned English

The model assigns each tweet a probability for each dialect group, so tweets can be grouped by their most likely dialect.

This allows researchers to analyze labeling patterns across dialect groups.
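Turning per-tweet dialect probabilities into group assignments is straightforward. The sketch below is purely hypothetical — the stub hard-codes invented probabilities and does not reflect the actual interface of the Blodgett et al. model:

```python
# Hypothetical sketch: the real Blodgett et al. (2016) model assigns
# per-tweet dialect probabilities; this stub just returns made-up values.
def dialect_probabilities(tweet: str) -> dict:
    fake_scores = {
        "wussup fam": {"aae": 0.9, "white_aligned": 0.1},
        "good evening everyone": {"aae": 0.1, "white_aligned": 0.9},
    }
    # Unknown tweets get an uninformative 50/50 split.
    return fake_scores.get(tweet, {"aae": 0.5, "white_aligned": 0.5})

def dialect_group(tweet: str) -> str:
    """Assign the tweet to its most probable dialect group."""
    probs = dialect_probabilities(tweet)
    return max(probs, key=probs.get)

print(dialect_group("wussup fam"))
```

Once every tweet has a group, label rates and model errors can be compared across groups.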


2. Testing Bias in Existing Datasets

The authors examine two widely used hate speech datasets:

  1. Davidson et al., 2017 — tweets labeled as hate speech, offensive, or neither
  2. Founta et al., 2018 — tweets labeled as hateful, abusive, spam, or none

Key Finding:

Tweets more likely to be written in AAE were significantly more likely to be labeled offensive or abusive.

(Figure: correlation between AAE probability and toxicity labels)

This suggests that bias may already exist in the labeled data.
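The kind of analysis involved here is a correlation between a tweet's AAE probability and its binary toxicity label. A hand-rolled sanity check on invented numbers (not the paper's data) looks like this:

```python
import math

# Illustrative-only numbers: p(AAE) per tweet, and whether annotators
# labeled it offensive (1) or not (0). Not the paper's actual data.
p_aae     = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3]
offensive = [1,   1,   0,   0,   0,   1]

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(p_aae, offensive)
print(round(r, 3))  # positive: higher p(AAE) co-occurs with "offensive"
```

A positive correlation in the real data is exactly the warning sign the authors report: the labels themselves track dialect.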


3. Bias Propagation Through Models

Next, the authors train classifiers on these datasets and evaluate how often tweets from each dialect group are incorrectly flagged as toxic — the false positive rate per dialect.

Result:

Models trained on these datasets showed higher false positive rates for AAE tweets compared to white-aligned tweets.

(Figure: false positive disparities across dialect groups)

In other words:

Even when AAE tweets were not offensive, they were more likely to be flagged as such.

This demonstrates how dataset bias propagates into deployed systems.
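The disparity metric itself is simple to compute. Here is a sketch on made-up gold labels and predictions (1 = flagged offensive, 0 = not), showing a per-group false positive rate:

```python
# Sketch of the disparity metric: false positive rate (FPR) per
# dialect group, on invented gold labels and model predictions.
records = [
    # (dialect, gold, predicted)
    ("aae",   0, 1), ("aae",   0, 1), ("aae",   0, 0), ("aae",   1, 1),
    ("white", 0, 0), ("white", 0, 0), ("white", 0, 1), ("white", 1, 1),
]

def false_positive_rate(group):
    """Fraction of the group's non-offensive tweets that get flagged."""
    negatives = [(g, y, p) for g, y, p in records if g == group and y == 0]
    flagged = sum(1 for _, _, p in negatives if p == 1)
    return flagged / len(negatives)

fpr = {g: false_positive_rate(g) for g in ("aae", "white")}
print(fpr)  # in this toy data, AAE tweets are flagged more often
```

A gap between the two rates is precisely the "false positive disparity" the paper measures.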


4. Does Context Change Annotations?

The authors ran a controlled experiment on Amazon Mechanical Turk.

Workers were asked whether tweets were:

  1. Offensive to them personally
  2. Potentially offensive to anyone

Three conditions were tested:

  1. Control (no dialect information)
  2. Dialect priming
  3. Race priming

When annotators were encouraged to consider dialect and likely racial background, they became significantly less likely to rate AAE tweets as offensive.

(Figure: effect of dialect and race priming on offensiveness judgments)

This suggests that annotator awareness matters.


Interpretation

The paper provides evidence that:

  1. Widely used hate speech datasets contain racial bias against AAE
  2. Models trained on these datasets inherit and amplify that bias
  3. Making annotators aware of dialect reduces biased judgments

This highlights a central issue in machine learning:

Models do not create bias — they learn and scale existing human bias.


Broader Implications

This issue is not limited to hate speech detection.

Automated systems may scan social media histories as part of background checks for new employees. Poorly designed models can misinterpret language and produce damaging reports.

Bad models → Bad conclusions → Real consequences.


Representation Matters

If training data reflects historical inequality, models trained on it will reproduce that inequality.

For example:

If women historically received lower credit limits, the model may learn to associate gender with lower creditworthiness.

Simply removing “gender” or “race” from the dataset is not enough. Proxy features (ZIP code, purchasing behavior, social networks) may still encode the same information.
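A toy example makes the proxy problem concrete. In the sketch below (all values invented), the protected attribute is dropped from the features, yet a correlated feature reconstructs it perfectly:

```python
# Toy illustration of proxy leakage: even with the protected attribute
# removed, a correlated feature (here, ZIP code) can reconstruct it.
rows = [
    {"zip": "10001", "gender": "F"},
    {"zip": "10001", "gender": "F"},
    {"zip": "20002", "gender": "M"},
    {"zip": "20002", "gender": "M"},
]

# "Remove" the protected attribute from the features...
features = [{k: v for k, v in r.items() if k != "gender"} for r in rows]

# ...yet a lookup learned from historical data recovers it exactly.
proxy = {r["zip"]: r["gender"] for r in rows}
recovered = [proxy[f["zip"]] for f in features]
print(recovered)
```

Any model expressive enough to exploit the ZIP code can learn the same association the deleted column carried.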


Mitigating Bias in Text Classification

Possible approaches:

  1. Priming annotators with dialect or demographic context during labeling
  2. Auditing datasets for label disparities across dialect groups
  3. Re-balancing or reweighting training data
  4. Incorporating dialect information into the model itself

But each solution comes with tradeoffs — especially around privacy and user profiling.
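One simple data-side mitigation (a general technique, not the paper's method) is to reweight training examples so that no (dialect, label) combination dominates. A minimal sketch on invented counts:

```python
from collections import Counter

# Sketch of one mitigation (not from the paper): reweight training
# examples so each (dialect, label) cell contributes equal total weight.
examples = [
    ("aae", "offensive"), ("aae", "offensive"), ("aae", "offensive"),
    ("aae", "none"),
    ("white", "offensive"),
    ("white", "none"), ("white", "none"), ("white", "none"),
]

counts = Counter(examples)
total = len(examples)
n_cells = len(counts)

# Weight = target share per cell / observed share: over-represented
# cells are down-weighted, under-represented cells up-weighted.
weights = [total / (n_cells * counts[e]) for e in examples]
print([round(w, 2) for w in weights])
```

Note that this requires knowing (or estimating) each tweet's dialect group — which is itself a form of profiling, illustrating the tradeoff above.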


Final Thoughts

Unlike humans, algorithms often lack clear accountability structures. Machine learning systems inherit the values embedded in their training data.

When we train on biased data, we deploy biased systems, at scale and with the appearance of objectivity.

Hate speech detection is just one domain where this is visible.

The deeper challenge is this:

How do we build systems that understand context, respect linguistic diversity, and avoid penalizing marginalized communities?

There are no easy answers — but acknowledging the problem is the first step.


References

  1. Sap et al., The Risk of Racial Bias in Hate Speech Detection, ACL 2019
  2. Blodgett et al., Demographic Dialectal Variation in Social Media: A Case Study of African-American English, EMNLP 2016
  3. Davidson et al., Automated Hate Speech Detection and the Problem of Offensive Language, ICWSM 2017
  4. Founta et al., Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior, ICWSM 2018
  5. Additional public discussions on ImageNet bias and algorithmic credit risk