Longterm Wiki
C399862d3b9d6b76c8436e924a68c45b Paper

web

Credibility Rating

5/5
Gold(5)

Gold standard. Rigorous peer review, high editorial standards, and strong institutional reputation.

Rating inherited from publication venue: NeurIPS

AlexNet's 2012 breakthrough is historically significant for AI safety because it initiated the scaling era, demonstrating that compute and data could produce dramatic capability jumps — a dynamic that underlies many current AI risk concerns.

Metadata

Importance: 72/100 · conference paper · primary source

Summary

This landmark paper by Krizhevsky, Sutskever, and Hinton, published at NeurIPS in 2012, introduced AlexNet, a deep convolutional neural network that won the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC-2012) by a wide margin. It showed that deep networks trained on GPUs could achieve state-of-the-art image classification, sparking the modern deep learning revolution. Its techniques, including ReLU activations, dropout regularization, and GPU training, became foundational to subsequent AI progress.
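The ReLU activation and dropout regularization named in the summary can be illustrated with a minimal NumPy sketch. It assumes the scheme described in the full paper (each hidden unit zeroed with probability 0.5 during training, all outputs multiplied by 0.5 at test time), which is not shown in the cached excerpt on this page. The function names `relu` and `dropout` are illustrative, not taken from the authors' code.

```python
import numpy as np


def relu(x: np.ndarray) -> np.ndarray:
    """Non-saturating activation: max(0, x), applied element-wise."""
    return np.maximum(x, 0.0)


def dropout(x: np.ndarray, rng: np.random.Generator,
            p_drop: float = 0.5, train: bool = True) -> np.ndarray:
    """Dropout without rescaling during training (assumed variant).

    Training: each unit is zeroed independently with probability p_drop.
    Test: every unit is kept and scaled by (1 - p_drop), so the expected
    activation matches what the next layer saw during training.
    """
    if train:
        keep_mask = rng.random(x.shape) >= p_drop
        return x * keep_mask
    return x * (1.0 - p_drop)


rng = np.random.default_rng(0)
hidden = relu(np.array([-2.0, 0.0, 3.0]))        # negative inputs become 0
train_out = dropout(np.ones(10_000), rng)         # roughly half the units are zeroed
test_out = dropout(np.ones(4), rng, train=False)  # every unit scaled by 0.5
```

Most current libraries use the "inverted" form instead, which rescales surviving activations during training and leaves test-time outputs unchanged. The two are equivalent in expectation.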

Key Points

  • AlexNet achieved a top-5 test error rate of 15.3% in the ILSVRC-2012 competition, far ahead of the second-best entry at 26.2%
  • Introduced key techniques: ReLU nonlinearities for faster training, dropout for regularization, and data augmentation to reduce overfitting
  • Used two GPUs in parallel to train a 60-million-parameter network, establishing GPU computing as essential for deep learning
  • Triggered the deep learning revolution in computer vision and AI broadly, laying groundwork for modern foundation models and capable AI systems
  • Demonstrated that scale (more data, more compute, deeper networks) could yield qualitative capability jumps — a lesson central to AI safety discussions
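
The top-5 error rate quoted in the key points counts a prediction as correct when the true label is among the model's five highest-scoring classes. The sketch below shows one way to compute it with NumPy. The function name `top_k_error` and the toy scores are illustrative assumptions, not the challenge's official evaluation code.

```python
import numpy as np


def top_k_error(scores: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Fraction of examples whose true label is NOT among the k highest scores.

    scores: (n_examples, n_classes) array of class scores.
    labels: (n_examples,) array of integer class indices.
    """
    # Column indices of the k largest scores in each row (unordered within the k).
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hit = (top_k == labels[:, None]).any(axis=1)
    return float(1.0 - hit.mean())


# Toy data: 3 examples, 6 classes (made-up numbers for illustration only).
scores = np.array([
    [0.10, 0.20, 0.30, 0.40, 0.50, 0.00],  # true class 3: in the top 5, not top 1
    [0.90, 0.01, 0.02, 0.03, 0.04, 0.00],  # true class 0: top 1
    [0.00, 0.50, 0.40, 0.30, 0.20, 0.10],  # true class 0: lowest score, a miss
])
labels = np.array([3, 0, 0])

top5 = top_k_error(scores, labels, k=5)
top1 = top_k_error(scores, labels, k=1)
```

On this toy data the top-5 error is 1/3 and the top-1 error is 2/3, which shows why top-5 figures are always at least as low as top-1 figures for the same model.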

Cited by 1 page

Page | Type | Quality
Deep Learning Revolution Era | Historical | 44.0

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 4 KB
# ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky University of Toronto [kriz@cs.utoronto.ca](mailto:kriz@cs.utoronto.ca)

Ilya Sutskever University of Toronto [ilya@cs.utoronto.ca](mailto:ilya@cs.utoronto.ca)

Geoffrey E. Hinton University of Toronto [hinton@cs.utoronto.ca](mailto:hinton@cs.utoronto.ca)

# Abstract

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of $37.5\%$ and $17.0\%$ which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of $15.3\%$ compared to $26.2\%$ achieved by the second-best entry.
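
As a rough check on the abstract's figure of 60 million parameters, the sketch below tallies weights and biases for five convolutional and three fully-connected layers. The layer shapes are assumptions: they do not appear in this excerpt and are taken from the architecture section of the full paper, together with the commonly cited 6×6×256 feature map feeding the first fully-connected layer.

```python
# Assumed shapes: (name, kernel_h, kernel_w, input_channels_per_kernel, kernels).
# conv2, conv4 and conv5 see only half of the previous layer's feature maps,
# because in the two-GPU layout they connect only to maps on the same GPU.
CONV_LAYERS = [
    ("conv1", 11, 11, 3, 96),
    ("conv2", 5, 5, 48, 256),
    ("conv3", 3, 3, 256, 384),
    ("conv4", 3, 3, 192, 384),
    ("conv5", 3, 3, 192, 256),
]

# Assumed shapes: (name, inputs, outputs). fc8 feeds the 1000-way softmax.
FC_LAYERS = [
    ("fc6", 6 * 6 * 256, 4096),
    ("fc7", 4096, 4096),
    ("fc8", 4096, 1000),
]


def count_parameters() -> dict[str, int]:
    """Return weights plus biases per layer under the assumed shapes."""
    counts = {}
    for name, kh, kw, c_in, c_out in CONV_LAYERS:
        counts[name] = kh * kw * c_in * c_out + c_out  # one bias per kernel
    for name, n_in, n_out in FC_LAYERS:
        counts[name] = n_in * n_out + n_out            # one bias per output unit
    return counts


counts = count_parameters()
total = sum(counts.values())
fc_share = sum(counts[name] for name, _, _ in FC_LAYERS) / total

print(f"total parameters: {total:,}")
print(f"share in fully-connected layers: {fc_share:.1%}")
```

Under these assumed shapes the total comes to 60,965,224, consistent with the abstract's round figure. Most of the parameters sit in the fully-connected layers, which is where the abstract says dropout was applied.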

# 1 Introduction

Current approaches to object recognition make essential use of machine learning methods. To improve their performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting. Until recently, datasets of labeled images were relatively small, on the order of tens of thousands of images (e.g., NORB \[16\], Caltech-101/256 \[8, 9\], and CIFAR-10/100 \[12\]). Simple recognition tasks can be solved quite well with datasets of this size, especially if they are augmented with label-preserving transformations. For example, the current-best error rate on the MNIST digit-recognition task $(<0.3\%)$ approaches human performance \[4\]. But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets. And indeed, the shortcomings of small image datasets have been widely recognized (e.g., Pinto et al. \[21\]), but it has only recently become possible to collect labeled datasets with millions of images. The new larger datasets include LabelMe \[23\], which consists of hundreds of thousands of fully-segmented images, and ImageNet \[6\], which consists of over 15 million labeled high-resolution images in over 22,000 categories.

To learn about thousands of objects from millions of images, we need a model with a large learning capacity. However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as large as ImageNet, so our model should also have lots of prior knowledge to compensate for 

... (truncated, 4 KB total)
Resource ID: c838af9838929c0a | Stable ID: NWI2YWVjOD