Competition and AI Safety
Stefano Favaro, Matteo Sesia
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
This arXiv preprint addresses statistical estimation of coverage probabilities in compressed data settings, with potential applications to understanding model capabilities and limitations in safety-critical AI systems.
Paper Details
Metadata
Abstract
The estimation of coverage probabilities, and in particular of the missing mass, is a classical statistical problem with applications in numerous scientific fields. In this paper, we study this problem in relation to randomized data compression, or sketching. This is a novel but practically relevant perspective, and it refers to situations in which coverage probabilities must be estimated based on a compressed and imperfect summary, or sketch, of the true data, because neither the full data nor the empirical frequencies of distinct symbols can be observed directly. Our contribution is a Bayesian nonparametric methodology to estimate coverage probabilities from data sketched through random hashing, which also solves the challenging problems of recovering the numbers of distinct counts in the true data and of distinct counts with a specified empirical frequency of interest. The proposed Bayesian estimators are shown to be easily applicable to large-scale analyses in combination with a Dirichlet process prior, although they involve some open computational challenges under the more general Pitman-Yor process prior. The empirical effectiveness of our methodology is demonstrated through numerical experiments and applications to real data sets of Covid DNA sequences, classic English literature, and IP addresses.
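As a rough illustration of the hashing-based compression the abstract describes, the following minimal Python sketch (the bucket count and choice of hash function are illustrative assumptions, not taken from the paper) shows why per-symbol empirical frequencies become unobservable once distinct symbols collide in the same bucket:

```python
import hashlib

def sketch(data, num_buckets):
    """Compress a data stream into bucket counts via a fixed hash.

    Only the bucket totals are retained: once two distinct symbols
    collide, their individual frequencies cannot be read back out.
    """
    counts = [0] * num_buckets
    for symbol in data:
        h = int(hashlib.sha256(symbol.encode()).hexdigest(), 16)
        counts[h % num_buckets] += 1
    return counts

data = ["cat", "dog", "cat", "bird", "cat", "dog"]
buckets = sketch(data, num_buckets=4)
# The sketch preserves the total sample size but not the
# per-symbol frequencies, which must instead be inferred.
assert sum(buckets) == len(data)
```

The paper's contribution is a Bayesian nonparametric method for estimating coverage probabilities and distinct counts from such bucket totals alone.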
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Risk Interaction Network Model | Analysis | 64.0 |
Cached Content Preview
# Bayesian nonparametric estimation of coverage probabilities and distinct counts from sketched data
Stefano Favaro
stefano.favaro@unito.it
Department of Economics and Statistics, University of Torino and Collegio Carlo Alberto, Italy
Matteo Sesia
sesia@marshall.usc.edu
Department of Data Sciences and Operations, University of Southern California, Marshall School of Business, Los Angeles, California, USA
Keywords: Bayesian nonparametrics; coverage probability; Dirichlet process prior; distinct counts; missing mass; Pitman-Yor process prior; random hashing; sketch.
## 1 Introduction
### 1.1 Estimation of coverage probabilities
The estimation of coverage probabilities, and in particular of the missing mass, is a classical statistical problem, dating back to the seminal work of Alan M. Turing and Irving J. Good in the 1940s (Good, [1953](https://ar5iv.labs.arxiv.org/html/2209.02135#bib.bib42 "")). To understand this task, consider a generic population of individuals with values in a (possibly infinite) universe $\mathbb{S}$ of symbols or species labels. In its most common formulation, the problem assumes $n \geq 1$ observable data points modeled as random samples from an unknown distribution $p = \sum_{j \geq 1} p_j \delta_{s_j}$, where $p_j$ is the probability of symbol $s_j \in \mathbb{S}$. Then, denoting by $(N_{j,n})_{j \geq 1}$ the empirical frequencies of distinct symbols, the goal is to estimate the coverage probability of order $r \geq 0$:
$$P_{r,n} = \sum_{j \geq 1} p_j \, \mathbb{1}\{N_{j,n} = r\},$$

that is, the total probability mass of symbols appearing exactly $r$ times in the sample; for $r = 0$ this is the missing mass.
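In the classical uncompressed setting, the coverage probability of order $r$ is commonly estimated with the Good-Turing estimator $(r+1)\,m_{r+1,n}/n$, where $m_{k,n}$ is the number of distinct symbols observed exactly $k$ times. A minimal sketch on a toy sample (the sample itself is illustrative, not from the paper):

```python
from collections import Counter

def good_turing_coverage(sample, r):
    """Good-Turing estimate of the coverage probability of order r:
    (r + 1) * m_{r+1} / n, where m_k counts the distinct symbols
    appearing exactly k times in a sample of size n.
    For r = 0 this estimates the missing mass.
    """
    n = len(sample)
    freqs = Counter(sample)        # empirical frequencies N_{j,n}
    m = Counter(freqs.values())    # m_k = #{j : N_{j,n} = k}
    return (r + 1) * m[r + 1] / n

sample = ["a", "a", "b", "c", "c", "c", "d"]
# "b" and "d" each appear once, so m_1 = 2 and the estimated
# missing mass is 2/7.
estimate = good_turing_coverage(sample, r=0)
```

This estimator requires the full table of empirical frequencies; the paper's point is that under sketching only hashed bucket totals are available, which is what motivates its Bayesian nonparametric approach.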
... (truncated, 98 KB total)