Skip to content
Longterm Wiki
Back

Competition and AI Safety

paper

Authors

Stefano Favaro·Matteo Sesia

Credibility Rating

3/5
Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

This arxiv preprint addresses statistical estimation of coverage probabilities in compressed data settings, with potential applications to understanding model capabilities and limitations in safety-critical AI systems.

Paper Details

Citations
0
0 influential
Year
2025

Metadata

arxiv preprintprimary source

Abstract

The estimation of coverage probabilities, and in particular of the missing mass, is a classical statistical problem with applications in numerous scientific fields. In this paper, we study this problem in relation to randomized data compression, or sketching. This is a novel but practically relevant perspective, and it refers to situations in which coverage probabilities must be estimated based on a compressed and imperfect summary, or sketch, of the true data, because neither the full data nor the empirical frequencies of distinct symbols can be observed directly. Our contribution is a Bayesian nonparametric methodology to estimate coverage probabilities from data sketched through random hashing, which also solves the challenging problems of recovering the numbers of distinct counts in the true data and of distinct counts with a specified empirical frequency of interest. The proposed Bayesian estimators are shown to be easily applicable to large-scale analyses in combination with a Dirichlet process prior, although they involve some open computational challenges under the more general Pitman-Yor process prior. The empirical effectiveness of our methodology is demonstrated through numerical experiments and applications to real data sets of Covid DNA sequences, classic English literature, and IP addresses.

Cited by 1 page

PageTypeQuality
AI Risk Interaction Network ModelAnalysis64.0

Cached Content Preview

HTTP 200Fetched Mar 20, 202698 KB
# Bayesian nonparametric estimation of coverage probabilities and distinct counts from sketched data

Stefano Favaro
stefano.favaro@unito.it
Department of Economics and Statistics, University of Torino and Collegio Carlo Alberto, ItalyMatteo Sesia
sesia@marshall.usc.edu
Department of Data Sciences and Operations, University of Southern California, Marshall School of Business, Los Angeles, California, USA

###### Abstract

The estimation of coverage probabilities, and in particular of the missing mass, is a classical statistical problem with applications in numerous scientific fields. In this paper, we study this problem in relation to randomized data compression, or sketching. This is a novel but practically relevant perspective, and it refers to situations in which coverage probabilities must be estimated based on a compressed and imperfect summary, or sketch, of the true data, because neither the full data nor the empirical frequencies of distinct symbols can be observed directly. Our contribution is a Bayesian nonparametric methodology to estimate coverage probabilities from data sketched through random hashing, which also solves the challenging problems of recovering the numbers of distinct counts in the true data and of distinct counts with a specified empirical frequency of interest. The proposed Bayesian estimators are shown to be easily applicable to large-scale analyses in combination with a Dirichlet process prior, although they involve some open computational challenges under the more general Pitman-Yor process prior. The empirical effectiveness of our methodology is demonstrated through numerical experiments and applications to real data sets of Covid DNA sequences, classic English literature, and IP addresses.

Keywords: Bayesian nonparametrics; coverage probability; Dirichlet process prior; distinct counts; missing mass; Pitman-Yor process prior; random hashing; sketch.

## 1 Introduction

### 1.1 Estimation of coverage probabilities

The estimation of coverage probabilities, and in particular of the missing mass, is a classical statistical problem, dating back to the seminal work of Alan M. Turing and Irving J. Good in 1940s (Good, [1953](https://ar5iv.labs.arxiv.org/html/2209.02135#bib.bib42 "")). To understand this task, consider a generic population of individuals with values in a (possibly infinite) universe 𝕊𝕊\\mathbb{S} of symbols or species labels. In its most common formulation, the problem assumes n≥1𝑛1n\\geq 1 observable data points modeled as random samples from an unknown distribution p=∑j≥1pj​δsj𝑝subscript𝑗1subscript𝑝𝑗subscript𝛿subscript𝑠𝑗p=\\sum\_{j\\geq 1}p\_{j}\\delta\_{s\_{j}}, where pjsubscript𝑝𝑗p\_{j} is the probability of symbol sj∈𝕊subscript𝑠𝑗𝕊s\_{j}\\in\\mathbb{S}. Then, denoting by (Nj,n)j≥1subscriptsubscript𝑁𝑗𝑛𝑗1(N\_{j,n})\_{j\\geq 1} the empirical frequencies of distinct symbols, the goal is to estimate the coverage probability of order r≥0𝑟0r\\geq 0:

|     |     |     |     |
| --- | 

... (truncated, 98 KB total)
Resource ID: dbac492bf9ab7956 | Stable ID: ZGE2ZDIyZT