Gao & Gao (2023)
Sarah Gao, Andrew Kean Gao
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Empirical analysis of Large Language Model trends and prevalence, documenting which LLM architectures, training methods, and families are gaining adoption—relevant for understanding the AI landscape and potential safety implications of widespread LLM deployment.
Paper Details
Metadata
Abstract
Since late 2022, Large Language Models (LLMs) have become very prominent with LLMs like ChatGPT and Bard receiving millions of users. Hundreds of new LLMs are announced each week, many of which are deposited to Hugging Face, a repository of machine learning models and datasets. To date, nearly 16,000 Text Generation models have been uploaded to the site. Given the huge influx of LLMs, it is of interest to know which LLM backbones, settings, training methods, and families are popular or trending. However, there is no comprehensive index of LLMs available. We take advantage of the relatively systematic nomenclature of Hugging Face LLMs to perform hierarchical clustering and identify communities amongst LLMs using n-grams and term frequency-inverse document frequency. Our methods successfully identify families of LLMs and accurately cluster LLMs into meaningful subgroups. We present a public web application to navigate and explore Constellation, our atlas of 15,821 LLMs. Constellation rapidly generates a variety of visualizations, namely dendrograms, graphs, word clouds, and scatter plots. Constellation is available at the following link: https://constellation.sites.stanford.edu/.
Summary
Gao and Gao (2023) address the challenge of understanding the landscape of Large Language Models by creating Constellation, a comprehensive atlas and web application for exploring nearly 16,000 LLMs uploaded to Hugging Face. Using hierarchical clustering over model nomenclature, with character n-grams weighted by TF-IDF, the authors identify LLM families and organize them into meaningful subgroups. The resulting interactive tool provides multiple visualization options, including dendrograms, graphs, word clouds, and scatter plots, enabling researchers and practitioners to navigate and understand trends in the rapidly expanding LLM ecosystem.
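The clustering pipeline described above can be sketched as follows. This is a minimal illustration only: the model names are hypothetical stand-ins for the ~16,000 Hugging Face entries, and the vectorizer settings and linkage method are assumptions, not the authors' exact configuration.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical Hugging Face model names standing in for the real corpus.
names = [
    "llama-7b", "llama-13b", "llama-2-7b-chat",
    "gpt2-medium", "gpt2-large", "distilgpt2",
    "falcon-7b-instruct", "falcon-40b",
]

# Character n-grams capture shared substrings such as "llama" or "gpt2"
# in the systematic nomenclature; TF-IDF downweights ubiquitous fragments.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(names).toarray()

# Hierarchical (agglomerative) clustering over the TF-IDF vectors,
# then cut the dendrogram into three flat clusters.
Z = linkage(X, method="average", metric="cosine")
labels = fcluster(Z, t=3, criterion="maxclust")

for name, label in zip(names, labels):
    print(label, name)
```

Cutting the dendrogram with `maxclust` recovers the family structure here because within-family names share far more character n-grams than cross-family names do; the paper's web application exposes the full dendrogram rather than a single flat cut.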
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Capability-Alignment Race Model | Analysis | 62.0 |
Cached Content Preview
# Computer Science > Digital Libraries
**arXiv:2307.09793** (cs)
\[Submitted on 19 Jul 2023\]
# Title:On the Origin of LLMs: An Evolutionary Tree and Graph for 15,821 Large Language Models
Authors: [Sarah Gao](https://arxiv.org/search/cs?searchtype=author&query=Gao,+S), [Andrew Kean Gao](https://arxiv.org/search/cs?searchtype=author&query=Gao,+A+K)
> Abstract:Since late 2022, Large Language Models (LLMs) have become very prominent with LLMs like ChatGPT and Bard receiving millions of users. Hundreds of new LLMs are announced each week, many of which are deposited to Hugging Face, a repository of machine learning models and datasets. To date, nearly 16,000 Text Generation models have been uploaded to the site. Given the huge influx of LLMs, it is of interest to know which LLM backbones, settings, training methods, and families are popular or trending. However, there is no comprehensive index of LLMs available. We take advantage of the relatively systematic nomenclature of Hugging Face LLMs to perform hierarchical clustering and identify communities amongst LLMs using n-grams and term frequency-inverse document frequency. Our methods successfully identify families of LLMs and accurately cluster LLMs into meaningful subgroups. We present a public web application to navigate and explore Constellation, our atlas of 15,821 LLMs. Constellation rapidly generates a variety of visualizations, namely dendrograms, graphs, word clouds, and scatter plots. Constellation is available at the following link: [this https URL](https://constellation.sites.stanford.edu/).
| | |
| --- | --- |
| Comments: | 14 pages, 6 figures, 1 table |
| Subjects: | Digital Libraries (cs.DL); Computation and Language (cs.CL) |
| ACM classes: | I.2.1; H.5.0 |
| Cite as: | [arXiv:2307.09793](https://arxiv.org/abs/2307.09793) \[cs.DL\] |
| | (or [arXiv:2307.09793v1](https://arxiv.org/abs/2307.09793v1) \[cs.DL\] for this version) |
| | [https://doi.org/10.48550/arXiv.2307.09793](https://doi.org/10.48550/arXiv.2307.09793) (arXiv-issued DOI via DataCite) |
## Submission history
From: Andrew Gao
**\[v1\]**
Wed, 19 Jul 2023 07:17:43 UTC (3,942 KB)
License: [CC BY-SA 4.0](http://creativecommons.org/licenses/by-sa/4.0/)
... (truncated, 7 KB total)