AlexNet breakthrough

web

NeurIPS (peer-reviewed) · proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8...

Credibility Rating

5/5

Gold (5)

Gold standard. Rigorous peer review, high editorial standards, and strong institutional reputation.

Rating inherited from publication venue: NeurIPS

This foundational capabilities paper is relevant to AI safety discussions around the rapid scaling of deep learning, the role of compute in AI progress, and understanding the historical trajectory of capability jumps that inform risk assessments.

Metadata

Importance: 82/100 · conference paper · primary source

Summary

The landmark 2012 paper introducing AlexNet, a deep convolutional neural network that dramatically outperformed prior methods on ImageNet classification, achieving a top-5 error rate of 15.3% versus 26.2% for the runner-up. The work demonstrated the power of deep learning at scale, combining large datasets, GPU compute, ReLU activations, and dropout regularization. It is widely credited with igniting the modern deep learning era.

Key Points

• AlexNet achieved a top-5 error of 15.3% on ILSVRC-2012, cutting the error rate by roughly 42% relative to the second-best entry (26.2%) and marking a decisive breakthrough for deep learning.
• The architecture used 60 million parameters across 5 convolutional layers and 3 fully-connected layers, trained on two GPUs in parallel.
• Key innovations included ReLU non-linearities for faster training, dropout regularization to reduce overfitting, and data augmentation techniques.
• The paper demonstrated that scale, meaning large datasets (ImageNet's 1.2M images) combined with GPU compute, was critical to unlocking deep learning's potential.
• AlexNet's success catalyzed massive investment in deep learning capabilities research, which makes the paper directly relevant to understanding AI progress trajectories.
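
The size of the gap between the two ILSVRC-2012 top-5 error rates quoted above can be checked with simple arithmetic. The sketch below is illustrative only; the function name `relative_error_reduction` is an assumption made for this example and does not come from the paper.

```python
def relative_error_reduction(baseline_error, new_error):
    """Fraction of the baseline error that the new result removes."""
    return (baseline_error - new_error) / baseline_error


# Top-5 test error rates (in percent) quoted in the paper's abstract.
runner_up_error = 26.2
alexnet_error = 15.3

# Absolute gap in percentage points and relative reduction as a fraction.
absolute_gap = runner_up_error - alexnet_error
relative = relative_error_reduction(runner_up_error, alexnet_error)

print(f"{absolute_gap:.1f} points, {relative:.1%} relative reduction")
# prints "10.9 points, 41.6% relative reduction"
```

The two ways of stating the gap differ: 10.9 percentage points in absolute terms, or about 42% of the runner-up's errors removed in relative terms.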

Cited by 1 page

Page	Type	Quality
AI-Driven Concentration of Power	Risk	65.0

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 4 KB

# ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky University of Toronto [kriz@cs.utoronto.ca](mailto:kriz@cs.utoronto.ca)

Ilya Sutskever University of Toronto [ilya@cs.utoronto.ca](mailto:ilya@cs.utoronto.ca)

Geoffrey E. Hinton University of Toronto [hinton@cs.utoronto.ca](mailto:hinton@cs.utoronto.ca)

# Abstract

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3% compared to 26.2% achieved by the second-best entry.
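
For readers unfamiliar with the metrics in the abstract: a prediction counts as a top-k error when the true label is not among the k classes the model scores highest. The following is a minimal sketch of that definition in plain Python, not the authors' evaluation code; the function name `top_k_error` and the toy scores and labels are hypothetical and chosen only for illustration.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k highest-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score, truncated to k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Hypothetical toy data: 3 examples, 4 classes (not from the paper).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1: correct at k=1
    [0.5, 0.1, 0.3, 0.1],  # true class 2: wrong at k=1, correct at k=2
    [0.4, 0.3, 0.2, 0.1],  # true class 3: wrong at k=1 and k=2
]
labels = [1, 2, 3]

top1 = top_k_error(scores, labels, 1)  # 2 of 3 examples missed
top2 = top_k_error(scores, labels, 2)  # 1 of 3 examples missed
```

With 1000 classes, as in the contest, the same function with `k=1` and `k=5` would give the top-1 and top-5 error rates.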

# 1 Introduction

Current approaches to object recognition make essential use of machine learning methods. To improve their performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting. Until recently, datasets of labeled images were relatively small, on the order of tens of thousands of images (e.g., NORB \[16\], Caltech-101/256 \[8, 9\], and CIFAR-10/100 \[12\]). Simple recognition tasks can be solved quite well with datasets of this size, especially if they are augmented with label-preserving transformations. For example, the current best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance \[4\]. But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets. And indeed, the shortcomings of small image datasets have been widely recognized (e.g., Pinto et al. \[21\]), but it has only recently become possible to collect labeled datasets with millions of images. The new larger datasets include LabelMe \[23\], which consists of hundreds of thousands of fully-segmented images, and ImageNet \[6\], which consists of over 15 million labeled high-resolution images in over 22,000 categories.

To learn about thousands of objects from millions of images, we need a model with a large learning capacity. However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as large as ImageNet, so our model should also have lots of prior knowledge to compensate for 

... (truncated, 4 KB total)

Resource ID: 5f0d3a6682dddbb6 | Stable ID: sid_1jK7DpSc2d