
Reinforcement Learning from Human Feedback

reference

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: Wikipedia

A solid introductory reference for understanding RLHF, the dominant alignment technique used in modern LLMs; useful for readers new to the field or seeking a broad overview before diving into primary research papers.

Metadata

Importance: 62/100 · wiki page · reference

Summary

Wikipedia's overview of Reinforcement Learning from Human Feedback (RLHF), a technique for training AI systems using human preference data as a reward signal. It covers the foundational concepts, history, and applications of RLHF, including its central role in aligning large language models such as ChatGPT with human intentions. The article explains the process of collecting human feedback, training reward models, and fine-tuning AI systems via reinforcement learning.
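
To make the reward-modeling step concrete, here is a minimal sketch of a pairwise (Bradley-Terry style) preference loss of the kind commonly used to fit the reward model, assuming PyTorch. The `RewardModel` class and `preference_loss` function are illustrative stand-ins (a toy scorer over feature vectors rather than a language model) and are not taken from the article.

```python
# Minimal sketch of reward-model training from pairwise human preferences,
# assuming PyTorch. RewardModel is a toy feature-vector scorer standing in
# for a language-model-based reward model; all names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a (prompt, response) representation to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(rm: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the reward of the human-preferred
    # response above the reward of the rejected one.
    return -F.logsigmoid(rm(chosen) - rm(rejected)).mean()

# Toy training step on random "representations" of preferred/rejected pairs.
rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
loss = preference_loss(rm, chosen, rejected)
opt.zero_grad()
loss.backward()
opt.step()
```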

Key Points

  • RLHF uses human preference judgments to train a reward model, which then guides RL-based fine-tuning of AI systems toward desired behaviors.
  • It has become a standard technique for aligning large language models (LLMs) such as ChatGPT and Claude with human values and intentions.
  • The process involves three main steps: supervised fine-tuning, reward model training from human comparisons, and RL optimization (often using PPO); see the KL-shaped reward sketch after this list.
  • RLHF can reduce harmful outputs and improve helpfulness, but is subject to reward hacking, feedback biases, and scalability challenges.
  • Variants and alternatives such as RLAIF, DPO, and Constitutional AI have been developed to address limitations of standard RLHF.
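
As a rough illustration of how the RL step is commonly shaped in practice, the sketch below computes a per-token KL penalty against a frozen reference model and credits the reward-model score on the last token of each response. This is a minimal, illustrative reduction (the function and tensor names are assumptions, not from the article); a real pipeline would feed these shaped rewards into a PPO-style optimizer rather than print them.

```python
# Minimal sketch of the KL-shaped reward used in RLHF's RL fine-tuning step,
# assuming PyTorch and that per-token log-probabilities from the policy being
# tuned and from a frozen reference model are already available. Names here
# (shaped_rewards, beta) are illustrative assumptions, not from the article.
import torch

def shaped_rewards(reward_model_score: torch.Tensor,
                   policy_logprobs: torch.Tensor,
                   ref_logprobs: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    # Per-token KL penalty keeps the tuned policy close to the reference
    # model, which helps mitigate reward hacking and distribution drift.
    rewards = -beta * (policy_logprobs - ref_logprobs)      # (batch, seq_len)
    # The scalar reward-model score for each response is credited on its
    # final token; PPO-style advantage estimation then propagates it back.
    rewards[:, -1] += reward_model_score
    return rewards

# Toy usage: a batch of 4 responses, 10 tokens each, with random log-probs.
score = torch.randn(4)
policy_lp, ref_lp = torch.randn(4, 10), torch.randn(4, 10)
print(shaped_rewards(score, policy_lp, ref_lp).shape)  # torch.Size([4, 10])
```

In such setups the coefficient `beta` trades off maximizing the learned reward against staying close to the supervised-fine-tuned reference model.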

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Why Alignment Might Be Hard | Argument | 69.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB


From Wikipedia, the free encyclopedia


Machine learning technique

In [machine learning](https://en.wikipedia.org/wiki/Machine_learning "Machine learning"), **reinforcement learning from human feedback** ( **RLHF**) is a technique to [align](https://en.wikipedia.org/wiki/AI_alignment "AI alignment") an [intelligent agent](https://en.wikipedia.org/wiki/Intelligent_agent "Intelligent agent") with human [preferences](https://en.wikipedia.org/wiki/Preference "Preference"). It involves training a reward model to represent preferences, which can then be used to train other models through [reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning "Reinforcement learning").[\[1\]](https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback#cite_note-1)

In classical reinforcement learning, an intelligent agent's goal is to learn a function that guides its behavior, called a [policy](https://en.wikipedia.org/wiki/Reinforcemen

... (truncated, 98 KB total)
Resource ID: a665c398b96149c1 | Stable ID: N2EyOGQxYm