Gao & Gao (2023)
Sarah Gao, Andrew Kean Gao
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Empirical analysis of Large Language Model trends and prevalence, documenting which LLM architectures, training methods, and families are gaining adoption—relevant for understanding the AI landscape and potential safety implications of widespread LLM deployment.
Paper Details
Metadata
Abstract
Since late 2022, Large Language Models (LLMs) have become very prominent with LLMs like ChatGPT and Bard receiving millions of users. Hundreds of new LLMs are announced each week, many of which are deposited to Hugging Face, a repository of machine learning models and datasets. To date, nearly 16,000 Text Generation models have been uploaded to the site. Given the huge influx of LLMs, it is of interest to know which LLM backbones, settings, training methods, and families are popular or trending. However, there is no comprehensive index of LLMs available. We take advantage of the relatively systematic nomenclature of Hugging Face LLMs to perform hierarchical clustering and identify communities amongst LLMs using n-grams and term frequency-inverse document frequency. Our methods successfully identify families of LLMs and accurately cluster LLMs into meaningful subgroups. We present a public web application to navigate and explore Constellation, our atlas of 15,821 LLMs. Constellation rapidly generates a variety of visualizations, namely dendrograms, graphs, word clouds, and scatter plots. Constellation is available at the following link: https://constellation.sites.stanford.edu/.
Summary
Gao and Gao (2023) address the challenge of understanding the landscape of Large Language Models by creating Constellation, a comprehensive atlas and web application for exploring nearly 16,000 LLMs uploaded to Hugging Face. Using hierarchical clustering over model nomenclature, with character n-grams weighted by TF-IDF, the authors identify LLM families and organize them into meaningful subgroups. The resulting interactive tool provides multiple visualization options, including dendrograms, graphs, word clouds, and scatter plots, enabling researchers and practitioners to navigate and understand trends in the rapidly expanding LLM ecosystem.
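The clustering pipeline described above can be sketched as follows. This is a minimal illustration only: the model names are hypothetical stand-ins for the ~16,000 Hugging Face entries, and the vectorizer settings and linkage method are assumptions, not the authors' exact configuration.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical Hugging Face model names standing in for the real corpus.
names = [
    "llama-7b", "llama-13b", "llama-2-7b-chat",
    "gpt2-medium", "gpt2-large", "distilgpt2",
    "falcon-7b-instruct", "falcon-40b",
]

# Character n-grams capture shared substrings such as "llama" or "gpt2"
# in the systematic nomenclature; TF-IDF downweights ubiquitous fragments.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(names).toarray()

# Hierarchical (agglomerative) clustering over the TF-IDF vectors,
# then cut the dendrogram into three flat clusters.
Z = linkage(X, method="average", metric="cosine")
labels = fcluster(Z, t=3, criterion="maxclust")

for name, label in zip(names, labels):
    print(label, name)
```

Cutting the dendrogram with `maxclust` recovers the family structure here because within-family names share far more character n-grams than cross-family names do; the paper's web application exposes the full dendrogram rather than a single flat cut.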
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Capability-Alignment Race Model | Analysis | 62.0 |
Cached Content Preview
# Computer Science > Digital Libraries
**arXiv:2307.09793** (cs)
\[Submitted on 19 Jul 2023\]
# Title:On the Origin of LLMs: An Evolutionary Tree and Graph for 15,821 Large Language Models
Authors: [Sarah Gao](https://arxiv.org/search/cs?searchtype=author&query=Gao,+S), [Andrew Kean Gao](https://arxiv.org/search/cs?searchtype=author&query=Gao,+A+K)
> Abstract:Since late 2022, Large Language Models (LLMs) have become very prominent with LLMs like ChatGPT and Bard receiving millions of users. Hundreds of new LLMs are announced each week, many of which are deposited to Hugging Face, a repository of machine learning models and datasets. To date, nearly 16,000 Text Generation models have been uploaded to the site. Given the huge influx of LLMs, it is of interest to know which LLM backbones, settings, training methods, and families are popular or trending. However, there is no comprehensive index of LLMs available. We take advantage of the relatively systematic nomenclature of Hugging Face LLMs to perform hierarchical clustering and identify communities amongst LLMs using n-grams and term frequency-inverse document frequency. Our methods successfully identify families of LLMs and accurately cluster LLMs into meaningful subgroups. We present a public web application to navigate and explore Constellation, our atlas of 15,821 LLMs. Constellation rapidly generates a variety of visualizations, namely dendrograms, graphs, word clouds, and scatter plots. Constellation is available at the following link: [this https URL](https://constellation.sites.stanford.edu/).
| | |
| --- | --- |
| Comments: | 14 pages, 6 figures, 1 table |
| Subjects: | Digital Libraries (cs.DL); Computation and Language (cs.CL) |
| ACM classes: | I.2.1; H.5.0 |
| Cite as: | [arXiv:2307.09793](https://arxiv.org/abs/2307.09793) \[cs.DL\] |
| | (or [arXiv:2307.09793v1](https://arxiv.org/abs/2307.09793v1) \[cs.DL\] for this version) |
| | [https://doi.org/10.48550/arXiv.2307.09793](https://doi.org/10.48550/arXiv.2307.09793) (arXiv-issued DOI via DataCite) |
## Submission history
From: Andrew Gao
**\[v1\]**
Wed, 19 Jul 2023 07:17:43 UTC (3,942 KB)
License: [CC BY-SA 4.0](http://creativecommons.org/licenses/by-sa/4.0/)
... (truncated, 7 KB total)