Longterm Wiki

Distributed AI Safety (Amodei et al.)

paper

Authors

Emmanuel Klu · Sameer Sethi

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Note: The title 'Distributed AI Safety (Amodei et al.)' appears misattributed; this paper is about ML fairness and bias mitigation using the TIDAL lexicon, not directly about AI safety governance or Anthropic's distributed safety work. Verify the correct URL matches the intended resource.

Paper Details

Citations: 0 (0 influential)
Year: 2023

Metadata

Importance: 35/100 · arXiv preprint · primary source

Abstract

Machine learning models can perpetuate unintended biases from unfair and imbalanced datasets. Evaluating and debiasing these datasets and models is especially hard in text datasets where sensitive attributes such as race, gender, and sexual orientation may not be available. When these models are deployed into society, they can lead to unfair outcomes for historically underrepresented groups. In this paper, we present a dataset coupled with an approach to improve text fairness in classifiers and language models. We create a new, more comprehensive identity lexicon, TIDAL, which includes 15,123 identity terms and associated sense context across three demographic categories. We leverage TIDAL to develop an identity annotation and augmentation tool that can be used to improve the availability of identity context and the effectiveness of ML fairness techniques. We evaluate our approaches using human contributors, and additionally run experiments focused on dataset and model debiasing. Results show our assistive annotation technique improves the reliability and velocity of human-in-the-loop processes. Our dataset and methods uncover more disparities during evaluation, and also produce more fair models during remediation. These approaches provide a practical path forward for scaling classifier and generative model fairness in real-world settings.

Summary

This paper introduces TIDAL, a 15,123-term identity lexicon spanning three demographic categories, paired with annotation and augmentation tools to improve fairness evaluation in ML models where sensitive attributes are unavailable. The approach enables human-in-the-loop debiasing of classifiers and generative language models, uncovering more disparities and producing fairer outputs in real-world settings.
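
The annotation step can be pictured as a lexicon lookup over text: each matched term carries a demographic category and sense context that downstream fairness tooling can use. The Python sketch below is illustrative only and does not use the paper's tide_nlp code; the mini-lexicon, the IdentityAnnotation class, and the annotate function are hypothetical stand-ins, assuming a simple whitespace tokenizer.

```python
from dataclasses import dataclass

# Hypothetical mini-lexicon in the spirit of TIDAL: each term carries a
# demographic category and sense context, so ambiguous matches can be
# flagged for human review instead of being counted blindly.
LEXICON = {
    "woman":  {"category": "gender", "sense": "identity"},
    "latina": {"category": "race/ethnicity", "sense": "identity"},
    "gay":    {"category": "sexual orientation", "sense": "ambiguous"},
}

@dataclass
class IdentityAnnotation:
    term: str
    start: int      # character offset where the match begins
    end: int        # character offset where the match ends
    category: str
    sense: str

def annotate(text: str) -> list:
    """Return lexicon matches in `text`, using naive whitespace tokenization."""
    annotations, cursor = [], 0
    for token in text.split():
        start = text.index(token, cursor)
        cursor = start + len(token)
        entry = LEXICON.get(token.lower().strip(".,!?"))
        if entry:
            annotations.append(IdentityAnnotation(
                term=token, start=start, end=start + len(token),
                category=entry["category"], sense=entry["sense"]))
    return annotations

for ann in annotate("The gay Latina woman spoke first."):
    print(ann)
```

Storing sense context alongside each term is what lets ambiguous matches be routed to human reviewers rather than silently counted, which is where the reported gains in annotation reliability and velocity come from.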

Key Points

  • Introduces TIDAL, a comprehensive identity lexicon with 15,123 terms across race, gender, and sexual orientation categories to address missing sensitive attributes in text datasets.
  • Develops an identity annotation and augmentation tool that improves the reliability and velocity of human-in-the-loop fairness processes (a counterfactual-augmentation sketch follows this list).
  • Demonstrates that the approach uncovers more disparities during evaluation and produces fairer models during remediation compared to prior methods.
  • Provides practical, scalable methods applicable to both discriminative classifiers and generative language models.
  • Addresses a core challenge in AI fairness: evaluating and mitigating bias when demographic metadata is absent from training data.
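
To make the remediation step concrete, the following sketch shows counterfactual substitution: identity terms found in an example are swapped for counterparts from the same category, and a classifier's score gap across the resulting pair flags a potential disparity. The COUNTERPARTS mapping, the swap_identity_terms helper, and the toy toxicity_score classifier are hypothetical illustrations, not the paper's implementation.

```python
# Hypothetical counterfactual pairs within one demographic category; a real
# lexicon such as TIDAL would supply far more terms plus their sense context.
COUNTERPARTS = {
    "woman": "man",
    "man": "woman",
    "latina": "latino",
    "latino": "latina",
}

def swap_identity_terms(text: str) -> str:
    """Replace each known identity term with its paired counterpart."""
    out = []
    for token in text.split():
        core = token.lower().strip(".,!?")
        swap = COUNTERPARTS.get(core)
        if swap:
            # Keep any trailing punctuation from the original token.
            out.append(swap + token[len(core):])
        else:
            out.append(token)
    return " ".join(out)

def toxicity_score(text: str) -> float:
    """Stand-in for whatever classifier is being audited (deliberately biased)."""
    return 0.9 if "woman" in text.lower() else 0.2

original = "The woman asked a question."
counterfactual = swap_identity_terms(original)   # "The man asked a question."
gap = abs(toxicity_score(original) - toxicity_score(counterfactual))
print(counterfactual)
print(f"score gap across the pair: {gap:.2f}")   # a large gap flags a disparity
```

In practice the lexicon supplies the counterpart terms, and one common use of such augmented pairs is to fold them back into evaluation or training data during remediation.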

Cited by 1 page

Page | Type | Quality
AI Governance Coordination Technologies | Approach | 91.0

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 78 KB
# TIDE: Textual Identity Detection for Evaluating and Augmenting Classification and Language Models

Emmanuel Klu, Sameer Sethi
These authors contributed equally to this work.

Google Research

eklu@google.com, sethis@google.com

###### Abstract

Machine learning models can perpetuate unintended biases from unfair and imbalanced datasets. Evaluating and debiasing these datasets and models is especially hard in text datasets where sensitive attributes such as race, gender, and sexual orientation may not be available. When these models are deployed into society, they can lead to unfair outcomes for historically underrepresented groups. In this paper, we present a dataset coupled with an approach to improve text fairness in classifiers and language models. We create a new, more comprehensive identity lexicon, TIDAL, which includes 15,123 identity terms and associated sense context across three demographic categories. We leverage TIDAL to develop an identity annotation and augmentation tool that can be used to improve the availability of identity context and the effectiveness of ML fairness techniques. We evaluate our approaches using human contributors, and additionally run experiments focused on dataset and model debiasing. Results show our assistive annotation technique improves the reliability and velocity of human-in-the-loop processes. Our dataset and methods uncover more disparities during evaluation, and also produce more fair models during remediation. These approaches provide a practical path forward for scaling classifier and generative model fairness in real-world settings. The code and dataset are available at [https://github.com/google-research/google-research/tree/master/tide\_nlp](https://github.com/google-research/google-research/tree/master/tide_nlp "").

## 1 Introduction

The growing adoption of machine learning across a variety of applications has reignited concerns about unfair and unintended bias in models. Bias can be introduced throughout the development workflow, for example during problem framing, data sampling and preparation, and even through training algorithm choices (Shah et al., [2020](https://ar5iv.labs.arxiv.org/html/2309.04027#bib.bib42 ""); Saleiro et al., [2018](https://ar5iv.labs.arxiv.org/html/2309.04027#bib.bib41 "")). When models contain biases, they can play an active role in perpetuating societal inequities and unfair outcomes for underrepresented groups (Sweeney, [2013](https://ar5iv.labs.arxiv.org/html/2309.04027#bib.bib44 ""); Abid et al., [2021](https://ar5iv.labs.arxiv.org/html/2309.04027#bib.bib2 "")).

Algorithmic fairness is a rapidly growing field of research with a wide range of definitions, techniques and toolkits available. Fairness is anchored in understanding and mitigating model performance disparities across sensitive and protected attributes. Popular toolkits such as AI Fairness 360 (Bellamy et al., [2018](https://ar5iv.labs.arxiv.org/html/2309.04027#bib.bib12 "")), Fairlear

... (truncated, 78 KB total)
Resource ID: ad0ef791cdf59bfb | Stable ID: NGVlYzg0Mz