Competition and AI Safety
Stefano Favaro, Matteo Sesia
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
This arXiv preprint addresses statistical estimation of coverage probabilities in compressed data settings, with potential applications to understanding model capabilities and limitations in safety-critical AI systems.
Paper Details
Metadata
Abstract
The estimation of coverage probabilities, and in particular of the missing mass, is a classical statistical problem with applications in numerous scientific fields. In this paper, we study this problem in relation to randomized data compression, or sketching. This is a novel but practically relevant perspective, and it refers to situations in which coverage probabilities must be estimated based on a compressed and imperfect summary, or sketch, of the true data, because neither the full data nor the empirical frequencies of distinct symbols can be observed directly. Our contribution is a Bayesian nonparametric methodology to estimate coverage probabilities from data sketched through random hashing, which also solves the challenging problems of recovering the numbers of distinct counts in the true data and of distinct counts with a specified empirical frequency of interest. The proposed Bayesian estimators are shown to be easily applicable to large-scale analyses in combination with a Dirichlet process prior, although they involve some open computational challenges under the more general Pitman-Yor process prior. The empirical effectiveness of our methodology is demonstrated through numerical experiments and applications to real data sets of Covid DNA sequences, classic English literature, and IP addresses.
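As a rough illustration of the hashing-based compression the abstract describes, the following minimal Python sketch (the bucket count and choice of hash function are illustrative assumptions, not taken from the paper) shows why per-symbol empirical frequencies become unobservable once distinct symbols collide in the same bucket:

```python
import hashlib

def sketch(data, num_buckets):
    """Compress a data stream into bucket counts via a fixed hash.

    Only the bucket totals are retained: once two distinct symbols
    collide, their individual frequencies cannot be read back out.
    """
    counts = [0] * num_buckets
    for symbol in data:
        h = int(hashlib.sha256(symbol.encode()).hexdigest(), 16)
        counts[h % num_buckets] += 1
    return counts

data = ["cat", "dog", "cat", "bird", "cat", "dog"]
buckets = sketch(data, num_buckets=4)
# The sketch preserves the total sample size but not the
# per-symbol frequencies, which must instead be inferred.
assert sum(buckets) == len(data)
```

The paper's contribution is a Bayesian nonparametric method for estimating coverage probabilities and distinct counts from such bucket totals alone.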
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Risk Interaction Network Model | Analysis | 64.0 |
Cached Content Preview
# Bayesian nonparametric estimation of coverage probabilities and distinct counts from sketched data
Stefano Favaro
stefano.favaro@unito.it
Department of Economics and Statistics, University of Torino and Collegio Carlo Alberto, Italy
Matteo Sesia
sesia@marshall.usc.edu
Department of Data Sciences and Operations, University of Southern California, Marshall School of Business, Los Angeles, California, USA
Keywords: Bayesian nonparametrics; coverage probability; Dirichlet process prior; distinct counts; missing mass; Pitman-Yor process prior; random hashing; sketch.
## 1 Introduction
### 1.1 Estimation of coverage probabilities
The estimation of coverage probabilities, and in particular of the missing mass, is a classical statistical problem, dating back to the seminal work of Alan M. Turing and Irving J. Good in the 1940s (Good, [1953](https://ar5iv.labs.arxiv.org/html/2209.02135#bib.bib42 "")). To understand this task, consider a generic population of individuals with values in a (possibly infinite) universe $\mathbb{S}$ of symbols or species labels. In its most common formulation, the problem assumes $n \geq 1$ observable data points modeled as random samples from an unknown distribution $p = \sum_{j \geq 1} p_j \delta_{s_j}$, where $p_j$ is the probability of symbol $s_j \in \mathbb{S}$. Then, denoting by $(N_{j,n})_{j \geq 1}$ the empirical frequencies of distinct symbols, the goal is to estimate the coverage probability of order $r \geq 0$:
$$P_{r,n} = \sum_{j \geq 1} p_j \, \mathbb{1}\{N_{j,n} = r\},$$

that is, the total probability mass of symbols appearing exactly $r$ times in the sample; for $r = 0$ this is the missing mass.
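In the classical uncompressed setting, the coverage probability of order $r$ is commonly estimated with the Good-Turing estimator $(r+1)\,m_{r+1,n}/n$, where $m_{k,n}$ is the number of distinct symbols observed exactly $k$ times. A minimal sketch on a toy sample (the sample itself is illustrative, not from the paper):

```python
from collections import Counter

def good_turing_coverage(sample, r):
    """Good-Turing estimate of the coverage probability of order r:
    (r + 1) * m_{r+1} / n, where m_k counts the distinct symbols
    appearing exactly k times in a sample of size n.
    For r = 0 this estimates the missing mass.
    """
    n = len(sample)
    freqs = Counter(sample)        # empirical frequencies N_{j,n}
    m = Counter(freqs.values())    # m_k = #{j : N_{j,n} = k}
    return (r + 1) * m[r + 1] / n

sample = ["a", "a", "b", "c", "c", "c", "d"]
# "b" and "d" each appear once, so m_1 = 2 and the estimated
# missing mass is 2/7.
estimate = good_turing_coverage(sample, r=0)
```

This estimator requires the full table of empirical frequencies; the paper's point is that under sketching only hashed bucket totals are available, which is what motivates its Bayesian nonparametric approach.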
... (truncated, 98 KB total)