
AI History

History of Artificial Intelligence

Five recurring questions drive AI history: representation, learning, search, uncertainty, and tractability. From Turing 1936 through the transformer era, every advance answered at least one of these questions differently.


Why This Matters

Most AI history is told as a sequence of breakthroughs and winters. That framing obscures the actual structure. The same five questions recur in every era: How should knowledge be represented? How should a system learn from data? How should it search through possibilities? How should it handle uncertainty? How do you make any of this tractable?

A useful way to read AI history is that many major advances changed the answer to one or more of these questions. Understanding the history this way prevents you from thinking that the current approach (large neural networks trained on internet data) is the final answer. It also prevents the opposite error: dismissing scale because previous hype cycles ended badly.


Watch Out

The five-questions frame is a pedagogical device, not a historiographical claim

The five recurring questions (representation, learning, search, uncertainty, tractability) are an organizing device for readers, not an established framework in the history-of-science literature. Professional historians of AI (Crevier, Nilsson, Buchanan) organize the field differently, typically around funding cycles, institutional networks, and specific research programs. The five-questions frame is useful for spotting conceptual continuities across eras. It is not a claim that these five questions exhausted what researchers cared about, nor that each era had a single dominant answer to each.

The Five Recurring Questions

  1. Representation: What data structures encode knowledge? Logic formulas, feature vectors, distributed representations, attention patterns.
  2. Learning: How does the system improve from experience? Hand-coded rules, perceptron updates, backpropagation, self-supervised pretraining.
  3. Search and optimization: How does the system find good solutions in large spaces? Tree search, gradient descent, evolutionary methods, beam search.
  4. Uncertainty: How does the system reason about what it does not know? Boolean logic, fuzzy sets, probability, Bayesian networks, calibration.
  5. Tractability: How do you make the above work at useful scale? Approximation, parallelism, sparsity, hardware co-design.

Every era below can be understood as a shift in the dominant answer to one or more of these questions.

Era-by-Question Matrix

This matrix is the page's main analytical tool. Read each era as a different choice about representation, learning, search, uncertainty, and tractability.

| Era | Representation | Learning | Search / Optimization | Uncertainty | Tractability |
|---|---|---|---|---|---|
| Computability and neural logic | Boolean logic, ideal neurons | Local synaptic change | Computation as procedure | Mostly implicit | Universal machine abstraction |
| Founding coalition | Logic, search states, early networks | Perceptrons, hand design | Heuristic search | Minimal formal treatment | Toy domains and small machines |
| Symbolic AI | Rules, symbols, problem states | Hand-coded knowledge | Theorem proving, planning | Logic-first reasoning | Domain restriction |
| Perceptron controversy | Linear thresholds | Perceptron updates | Margin-driven correction | Mostly absent | Bounded-order limitations |
| Expert systems | Production rules | Expert elicitation | Forward/backward chaining | Certainty factors | Narrow deployment domains |
| Statistical reform | Graphical models, kernels, features | Probabilistic estimation | EM, convex optimization, inference | Probability and Bayesian networks | Approximation and factorization |
| Connectionist return | Distributed vectors, CNN/RNN states | Backpropagation, temporal difference learning | Gradients, RL, model-based planning | Softmax and probabilistic models | GPUs, data, and differentiable programs |
| Scale era | Transformers, latent states, world models | Self-supervision, post-training, representation learning | Gradient descent, decoding, planning | Calibration, sampling, uncertainty estimates | Scale, parallelism, sparsity, hardware co-design |

Era 1: Computability and the Idea of Machine Thought (1936-1950)

Turing's 1936 paper on computable numbers established that a single machine can compute anything that is computable, given enough time and tape. This is the mathematical precondition for AI: if intelligence involves computation, then a universal machine can in principle replicate it.

McCulloch and Pitts (1943) showed that networks of idealized binary neurons can compute any Boolean function. This was a representation result: neural activity could be described in the language of propositional logic.

Hebb (1949) proposed that synaptic connections strengthen when pre-synaptic and post-synaptic neurons fire together. His actual formulation: "When an axon of cell A is near enough to excite cell B... some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased" (The Organization of Behavior, 1949, p. 62). This was a learning result: unsupervised and local, requiring no teacher signal and no global error computation. The catchier paraphrase "neurons that fire together wire together" is due to Löwel and Singer (1992, Science 255:209-212), not Hebb.
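A minimal sketch of the Hebbian idea in a modern rate-based form. The learning rate, the toy correlation setup, and the variable names are illustrative, not Hebb's; the point is only that the update uses locally available activity with no teacher signal.

```python
import numpy as np

def hebbian_update(w, pre, post, lr=0.01):
    """One Hebbian step: strengthen w[i, j] when pre-synaptic unit j and
    post-synaptic unit i are active together. No teacher signal, no global
    error computation: only locally available activity is used."""
    return w + lr * np.outer(post, pre)

# Toy example: two perfectly correlated input units driving one output unit.
rng = np.random.default_rng(0)
w = np.zeros((1, 2))
for _ in range(1000):
    pre = (rng.random(2) < 0.5).astype(float)
    pre[1] = pre[0]                       # unit 1 copies unit 0: perfect correlation
    post = np.array([float(pre.sum() > 0)])
    w = hebbian_update(w, pre, post)
print(w)  # both weights grow together, reflecting the correlated activity
```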

Era 2: The Founding Coalition (1950-1960)

The Dartmouth workshop (1956) is conventionally called the founding of AI. The key participants (McCarthy, Minsky, Rochester, Shannon) had diverse views. McCarthy favored logic and search. Minsky was interested in neural networks at this point (his Princeton thesis was on neural net learning). Shannon contributed information theory. Rochester had built a hardware neural network simulator at IBM.

The founding coalition was not purely symbolic. The symbolic-vs-connectionist divide came later.

Rosenblatt's Perceptron (1957) was a hardware learning machine that could classify patterns by adjusting weights. The perceptron convergence theorem states: if the training data is linearly separable, the algorithm finds a separating hyperplane in finite steps. Rosenblatt (1962) gave a proof of this result, but the first tight finite-step bound with explicit constants was provided by Novikoff (1962, "On Convergence Proofs on Perceptrons," Symposium on the Mathematical Theory of Automata, 12:615-622). The canonical bound $T \leq (R/\gamma)^2$ relating the number of updates $T$ to the margin $\gamma$ and the maximum input norm $R$ is Novikoff's.
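A sketch of the perceptron update rule and the quantity the Novikoff bound controls, assuming $\pm 1$ labels and a bias folded into the weight vector; the variable names and toy data are mine, not Rosenblatt's.

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """Classic perceptron rule: on a mistake, add y_i * x_i to the weights.
    If the data are linearly separable with margin gamma and max norm R,
    Novikoff's bound says the number of updates is at most (R / gamma)**2."""
    w = np.zeros(X.shape[1])
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:      # misclassified (or on the boundary)
                w += yi * xi
                updates += 1
                mistakes += 1
        if mistakes == 0:                    # converged: a separating hyperplane was found
            break
    return w, updates

# Linearly separable toy data; the constant 1.0 column acts as the bias input.
X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, updates = train_perceptron(X, y)
print(w, updates)
```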

Era 3: Symbolic Problem Solving (1956-1974)

Newell and Simon's Logic Theorist (1956) and General Problem Solver (1957) solved problems by searching through a space of symbolic states. Their Physical Symbol System Hypothesis, articulated later in their 1976 Turing Award lecture, claimed that symbol manipulation is both necessary and sufficient for general intelligence. This answered the representation question with: symbols and rules.

McCarthy created Lisp (1958), a programming language where code and data share the same structure. Lisp became the lingua franca of AI research for three decades.

Samuel's checkers program (1959) used self-play and temporal difference methods. This was an early instance of reinforcement learning, though the term came later.

Dreyfus (1965, "Alchemy and Artificial Intelligence," RAND Memorandum P-3244; 1972, What Computers Can't Do) provided the most sustained philosophical critique of symbolic AI. Drawing on phenomenology (Heidegger, Merleau-Ponty), he argued that human expertise is embodied and contextual in ways that rule-based symbol manipulation cannot capture. His critique was largely dismissed at the time but anticipated concerns about brittle expert systems. Brooks (1991) echoed related points from an engineering angle; the later embodied-AI work made several of Dreyfus's concerns look much less dismissible than they seemed to many AI researchers at the time.

Era 4: The Perceptron Controversy (1969)

Theorem

Perceptron Limitation (Minsky-Papert 1969)

Statement

(a) XOR case: A single-layer perceptron (linear threshold unit with no hidden layer) cannot compute the XOR function, because XOR is not linearly separable in 2D.

(b) General order-limited result: For a perceptron of order $k$ (each hidden unit receives input from at most $k$ input nodes), the Group Invariance Theorem states that any predicate computable by a group-invariant perceptron of order $k$ must be a function of global statistics computable from $k$-wise interactions. The Parity Theorem shows that parity on $n$ bits requires order $n$. The Connectedness Theorem shows that deciding whether a figure is connected requires order that grows without bound as the retina grows, regardless of the number of hidden units.

Intuition

The XOR limitation is elementary: a hyperplane cannot separate the four XOR truth-table points. The deeper results concern predicates that require examining global structure. A perceptron whose units each see only a limited patch of the input cannot, for example, determine if a figure drawn on a retina is connected, no matter how many such limited units you add.

Proof Sketch

For (a): the four XOR points $(0,0), (1,1), (0,1), (1,0)$ with labels $0, 0, 1, 1$ are not linearly separable.

For (b): the Group Invariance Theorem (Minsky-Papert 1969, Chapters 12-13) uses a symmetry argument. If a predicate $\psi$ is invariant under a group $G$ of permutations of the input, the coefficients of any perceptron computing $\psi$ can be averaged over $G$ without changing the predicate, reducing the analysis to a small number of symmetric quantities. Parity is invariant under all permutations of the input bits, forcing order equal to the input size; connectedness is invariant under large groups of transformations, forcing the order to grow with the size of the retina.

Why It Matters

Minsky and Papert did not prove that neural networks are useless. They proved limitations of order-limited perceptrons. They noted explicitly that multi-layer networks might overcome these limits, but nobody knew how to train them at the time. The book is frequently mischaracterized as "proving neural networks cannot work."

Failure Mode

The general results (Parity, Connectedness) apply to bounded-order perceptrons regardless of the number of hidden units. They do not apply to networks of unbounded order. A two-layer threshold network with two hidden units can compute XOR; a sufficiently large multi-layer network can compute any Boolean function. The missing piece in 1969 was not expressivity in principle, but a practical and widely adopted training method for hidden layers.
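One concrete choice of weights (many others work) for the two-hidden-unit threshold network mentioned above. The construction is the standard OR/AND decomposition; the specific numbers are illustrative.

```python
import numpy as np

def step(z):
    """Heaviside threshold: 1 if the weighted sum exceeds the threshold, else 0."""
    return (z > 0).astype(float)

def xor_net(x1, x2):
    """Two-layer threshold network computing XOR.
    Hidden unit h1 computes OR, h2 computes AND; the output fires
    when OR is true but AND is not, which is exactly XOR."""
    x = np.array([x1, x2], dtype=float)
    h1 = step(x @ np.array([1.0, 1.0]) - 0.5)   # OR(x1, x2)
    h2 = step(x @ np.array([1.0, 1.0]) - 1.5)   # AND(x1, x2)
    return step(1.0 * h1 - 2.0 * h2 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(xor_net(a, b)))   # prints the XOR truth table
```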

The effect of Minsky and Papert's book was a reduction in neural network funding, but the technical contribution was honest: single-layer and order-limited networks have real limitations, and multi-layer training was an open problem.

The Lighthill Report (1973, Sir James Lighthill for the UK Science Research Council) evaluated the state of AI research in Britain and concluded that the field had failed to deliver on its promises, especially in robotics and general-purpose problem-solving. The report triggered a substantial withdrawal of UK government funding, marking the British AI winter. Combined with reductions in US DARPA funding around the same time, this constituted the first major AI winter.

Era 5: Expert Systems (1974-1990)

DENDRAL (Feigenbaum, Lederberg, 1965) inferred molecular structure from mass spectrometry data using hand-coded chemical rules. MYCIN (Shortliffe, 1976) diagnosed bacterial infections using production rules with certainty factors.

Expert systems answered the representation question with: if-then rules elicited from domain experts. They answered uncertainty with: ad hoc certainty factors (not probability).

The CYC project (Lenat, 1984) attempted to encode all common-sense knowledge as logical assertions. After decades of work, CYC became the canonical example of how difficult broad common-sense knowledge engineering is: hand-coded knowledge can work in constrained settings, but open-ended common sense produces too many edge cases.

Watch Out

AI winters were not purely technical

The first AI winter (1974) and the second (1987-1993) were driven by funding disappointment as much as by technical failure. Expert systems worked well in narrow domains. The problem was that advocates promised generality and delivered brittleness. The gap between promise and delivery caused funders to withdraw. The Dreyfus critique (1965, 1972) anticipated some of these failures philosophically; the Lighthill Report (1973) triggered the institutional response in Britain.

Era 6: The Statistical and Probabilistic Reform (1985-2005)

Hidden Markov Models brought probability to speech recognition (Jelinek at IBM, 1970s-1980s). Bayesian networks (Pearl, 1988) provided a representation for probabilistic reasoning that could be computed tractably for sparse dependency structures.

Hopfield (1982, PNAS 79:2554-2558) introduced recurrent networks with symmetric weights that converge to local energy minima. Hopfield networks are content-addressable memories: partial or noisy inputs settle to stored patterns. This was the first major connectionist result with a clear theoretical foundation (the energy function) and served as a precursor to the broader connectionist resurgence of the mid-1980s.
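A minimal sketch of Hopfield storage and recall with $\pm 1$ states, using the outer-product (Hebbian) storage rule and asynchronous updates. The update order and the toy pattern are conventions chosen here for illustration.

```python
import numpy as np

def store(patterns):
    """Outer-product storage rule: W is symmetric with zero diagonal, which
    guarantees the energy E = -0.5 * s^T W s decreases under asynchronous updates."""
    n = patterns.shape[1]
    W = sum(np.outer(p, p) for p in patterns) / n
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, state, steps=200, rng=np.random.default_rng(0)):
    """Asynchronous recall: flip one unit at a time toward its local field,
    so the state settles into a local energy minimum (a stored pattern)."""
    s = state.copy()
    for _ in range(steps):
        i = rng.integers(len(s))
        s[i] = 1 if W[i] @ s >= 0 else -1
    return s

pattern = np.array([1, -1, 1, -1, 1, -1, 1, -1])
W = store(pattern[None, :])
noisy = pattern.copy()
noisy[:2] *= -1                      # corrupt two bits
print(recall(W, noisy))              # settles back to the stored pattern
```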

This era answered the uncertainty question with: probability theory, not certainty factors. It answered representation with: directed graphical models. It answered tractability with: conditional independence and message passing.
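A toy illustration of why conditional independence buys tractability, assuming a chain-structured network over binary variables; the counts below are just the sizes of the probability tables involved.

```python
# A full joint over n binary variables needs 2**n entries. A chain-structured
# Bayesian network factorizes P(x1,...,xn) = P(x1) * prod_i P(x_i | x_{i-1}),
# needing only O(n) small tables, and inference can pass messages along the chain.
def table_sizes(n):
    full_joint = 2 ** n
    chain_factored = 2 + (n - 1) * 4   # P(x1): 2 entries; each P(x_i | x_{i-1}): 4 entries
    return full_joint, chain_factored

print(table_sizes(20))   # (1048576, 78): factorization is what makes inference tractable
```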

Pearl's contribution was not just Bayesian networks but a principled framework for causal reasoning that went beyond statistical association.

Watch Out

Statistics did not replace AI

A common misconception is that the statistical turn was a rejection of AI. It was the opposite: AI researchers adopted statistical methods to handle uncertainty rigorously instead of with ad hoc certainty factors. The researchers were often the same people.

Era 7: The Connectionist Return (1986-2012)

Rumelhart, Hinton, and Williams (1986) popularized backpropagation for training multi-layer networks. Backpropagation had been discovered multiple times before (Werbos 1974, Linnainmaa 1970), but the 1986 paper demonstrated it on problems that mattered.

LeCun (1989) applied convolutional networks and backpropagation to handwritten digit recognition. Hochreiter and Schmidhuber (1997) introduced LSTMs to address vanishing gradients in recurrent networks.

Watkins (1989, Cambridge PhD thesis "Learning from Delayed Rewards") introduced Q-learning, a model-free reinforcement learning algorithm later given finite state-action convergence results under standard assumptions. Sutton and Barto (1998, Reinforcement Learning: An Introduction) codified the field, unifying temporal-difference learning, Monte Carlo methods, and function approximation under a common framework.
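A sketch of the tabular Q-learning update from Watkins's thesis, paired with a toy chain environment. The environment, hyperparameters, and exploration scheme are placeholders chosen for illustration.

```python
import numpy as np

def q_learning(n_states, n_actions, step_fn, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, rng=np.random.default_rng(0)):
    """Tabular Q-learning (Watkins 1989): model-free and off-policy.
    Update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
            s_next, r, done = step_fn(s, a)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

# Toy chain: 5 states, actions move left/right, reward 1 for reaching the last state.
def chain_step(s, a):
    s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == 4), s_next == 4

print(q_learning(5, 2, chain_step).round(2))
```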

Model-Based RL and World Models

The model-free/model-based distinction became central to reinforcement learning. In model-free RL, the agent learns a policy or value function directly from experience. In model-based RL, the agent also learns or uses a model of how the world changes, then uses that model for planning.

Sutton's Dyna architecture made this explicit: the agent learns from real experience, updates an action model of the world, and performs additional learning from hypothetical experience generated by that model. This was not a side branch of RL; it was one of the cleanest early formulations of how learning, planning, and acting could be unified.
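A sketch of the Dyna-Q control flow described above: real experience both updates Q and populates a simple tabular model, and extra planning updates replay transitions from that model. The deterministic-model assumption and the hyperparameters are mine, not Sutton's specification; it can be run against the same `chain_step` toy environment as the Q-learning sketch above.

```python
import numpy as np

def dyna_q(n_states, n_actions, step_fn, episodes=200, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1, rng=np.random.default_rng(0)):
    """Dyna-Q: interleave direct RL updates from real experience with
    planning updates drawn from a learned (deterministic, tabular) model."""
    Q = np.zeros((n_states, n_actions))
    model = {}                                           # (s, a) -> (reward, s_next, done)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
            s_next, r, done = step_fn(s, a)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])        # direct RL from real experience
            model[(s, a)] = (r, s_next, done)            # model learning
            for _ in range(planning_steps):              # planning from simulated experience
                ps, pa = list(model)[rng.integers(len(model))]
                pr, ps_next, pdone = model[(ps, pa)]
                ptarget = pr + (0.0 if pdone else gamma * Q[ps_next].max())
                Q[ps, pa] += alpha * (ptarget - Q[ps, pa])
            s = s_next
    return Q
```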

Neural versions of this idea also appeared early. Nguyen and Widrow (1990) trained an emulator network to learn the dynamics of a nonlinear system, then trained a controller on that emulator before using the controller on the real system. Schmidhuber's 1990 report, Making the World Differentiable, explored self-supervised recurrent neural networks for dynamic reinforcement learning and planning, including a learned model of environmental dynamics and future reinforcement.

The 2018 World Models paper by David Ha and Jürgen Schmidhuber did not invent model-based RL or the general idea of an internal model for planning. Its contribution was different: it gave the modern deep-learning community a clean, visual demonstration of a latent neural world model, where an agent learns a compact representation of an environment and can train a controller inside the environment generated by its own model.

The historical lesson is attribution discipline: distinguish the old concept, the algorithmic formulation, the neural implementation, the modern demonstration, and the branding. Those are related contributions, but they are not the same contribution.

Brooks (1991, "Intelligence Without Representation," Artificial Intelligence 47:139-159) argued that symbolic internal representations are not necessary for intelligent behavior. His subsumption architecture demonstrated that reactive agents with layered behavior rules could accomplish navigation tasks that required elaborate planning in GOFAI systems. This was an engineering challenge to the symbolic-AI representationalist consensus, later influential in robotics and embodied-AI research.

A smaller neural-network community, including Bengio, LeCun, Hinton, Schmidhuber, Hochreiter, and others, persisted through a period (roughly 1995-2010) when neural network research was unfashionable. SVMs and kernel methods dominated machine learning conferences. The later shift came when increased compute and data made deep networks competitive.

Era 8: The Scale Era (2012-Present)

AlexNet (Krizhevsky, Sutskever, Hinton, 2012) won ImageNet by a large margin using a GPU-trained convolutional network. AlexNet was not a conceptual break from convolutional networks; it was a scaled, GPU-trained convnet in the LeNet lineage, with more data, more compute, ReLUs, dropout, and strong engineering.

The Transformer (Vaswani et al., 2017) replaced recurrence with self-attention, enabling parallel training over sequences. This architecture change made it tractable to train on billions of tokens.
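A minimal numpy sketch of the scaled dot-product self-attention at the core of the Transformer: single head, no masking, no multi-head splitting or output projection. Shapes and names are illustrative.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention (Vaswani et al., 2017), single head.
    Every position attends to every position in parallel; there is no recurrence,
    which is what makes training parallelizable over the whole sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (seq, seq) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # (seq, d_v) mixture of values

rng = np.random.default_rng(0)
seq, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq, d_model))
out = self_attention(X, *(rng.normal(size=(d_model, d_k)) for _ in range(3)))
print(out.shape)   # (4, 8)
```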

GPT-2 (2019) and GPT-3 (2020) demonstrated that language models scaled to billions of parameters exhibit emergent capabilities not present at smaller scales. Scaling laws (Kaplan et al., 2020) quantified the relationship between compute, data, parameters, and loss.
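The parameter-count scaling law in Kaplan et al. (2020) has a simple power-law form. The sketch below uses constants approximately equal to the values reported in that paper for their setup; treat them as illustrative, not universal.

```python
def loss_vs_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Power-law fit L(N) = (N_c / N) ** alpha_N from Kaplan et al. (2020).
    The constants are roughly the paper's reported values for its training
    setup and data distribution; they do not transfer universally."""
    return (n_c / n_params) ** alpha_n

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {loss_vs_params(n):.2f}")
```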

AlphaGo (Silver et al., 2016, Nature 529:484-489) defeated the world Go champion by combining deep reinforcement learning with Monte Carlo tree search. This was the milestone demonstrating that RL combined with deep learning could master tasks previously thought to require human-level intuition. AlphaZero (2017) generalized the approach to chess and shogi without human game records.

LeCun's JEPA/world-model program is one modern answer to the representation and tractability problem. Rather than predicting every pixel or token, a Joint Embedding Predictive Architecture predicts latent representations of missing or future inputs. In the autonomous-intelligence proposal, this becomes part of a planning architecture: an actor proposes candidate action sequences, the world model predicts future state representations, and cost modules evaluate predicted outcomes. The point is not merely video prediction; it is planning over abstract representations.
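A schematic sketch of the planning loop described above. All component names (`propose_actions`, `world_model`, `cost`) are placeholders for learned modules; this paraphrases the control flow of the proposal, not LeCun's implementation.

```python
def plan(state_repr, propose_actions, world_model, cost, horizon=5, n_candidates=16):
    """Planning in representation space: score candidate action sequences by rolling
    the world model forward over latent states and summing predicted costs.
    Every callable here stands in for a learned module in the proposed architecture."""
    best_seq, best_cost = None, float("inf")
    for _ in range(n_candidates):
        actions = propose_actions(horizon)      # actor proposes a candidate action sequence
        s, total = state_repr, 0.0
        for a in actions:
            s = world_model(s, a)               # predict the next latent state, not pixels
            total += cost(s)                    # cost modules evaluate the predicted outcome
        if total < best_cost:
            best_seq, best_cost = actions, total
    return best_seq
```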

This does not make LLMs obsolete or unimportant; it identifies a different research program for systems whose core operation is prediction and planning in representation space, not only next-token continuation.

Watch Out

Deep learning did not make structure irrelevant

The Transformer is highly structured: it uses positional encodings, multi-head attention with learned projections, layer normalization, and residual connections. The claim that deep learning replaces feature engineering with end-to-end learning is half true. It replaces input feature engineering but introduces architectural structure that encodes inductive biases about sequence processing.

Watch Out

World models: attribution without mythology

Discussions of world-model history often collapse several contributions into a single question of priority. That loses the useful distinction between the old concept, the control formulation, the neural implementation, the modern demonstration, and the current representation-learning program.

The world-model debate is a useful test case. A careful account should not ask only who first used a label. It should ask which contribution is being attributed: mental models in psychology, optimal-control planning, learned dynamics, neural emulator/controller systems, recurrent self-supervised world models, modern latent demonstrations, or scalable predictive representation learning beyond next-token modeling.

LeCun's 2022 autonomous-intelligence position paper uses configurable predictive world models and hierarchical joint embedding architectures as part of a proposed path toward machine intelligence. But the paper explicitly says that many of its ideas had been formulated earlier by many authors, and that the paper does not claim priority for them.

The fair version is:

  • The cognitive and control antecedents are older: Craik's mental-model idea, optimal control, and model-predictive control.
  • Sutton's Dyna made the model-based RL version especially clear.
  • Nguyen and Widrow showed a neural emulator/controller version in 1990.
  • Schmidhuber's 1990 report is an early recurrent-neural contribution to learned dynamics, planning, and curiosity-style exploration.
  • Ha and Schmidhuber's 2018 paper popularized a modern latent-neural demonstration.
  • LeCun's JEPA/world-model program is a modern proposal for scalable predictive representation learning and planning, not a claim of first invention.

What History Gets Wrong

Myth 1: AI began as purely symbolic. The founding coalition included neural network researchers. Minsky's thesis was about neural nets. The symbolic-vs-connectionist framing became dominant later.

Myth 2: Minsky proved neural networks impossible. Minsky and Papert proved limitations of order-limited perceptrons (Chapters 1-5 for the XOR/order results; Chapters 11-13 for Parity and Connectedness Theorems). They acknowledged that multi-layer networks might work, but training them was an open problem. The deep result is about bounded-order perceptrons on group-invariant predicates, not about neural networks in general.

Myth 3: Deep learning made structure irrelevant. Modern architectures are heavily structured. The inductive biases in Transformers (attention over all positions, residual connections) are design choices, not emergent properties.

Myth 4: Current methods are the endpoint. Grokking (Power et al., 2022) showed that neural networks can suddenly generalize long after memorizing training data. Training dynamics still produce surprises. The history of AI is a history of confident predictions about what works, followed by corrections.

Myth 5: LeCun's world-model work is a priority claim. False. LeCun has promoted world models as a path toward autonomous machine intelligence, especially through predictive latent-space architectures such as JEPA. But his 2022 proposal explicitly does not claim priority for the general idea. The broader concept of using an internal model for prediction and planning is much older, appearing in psychology, optimal control, model-predictive control, and model-based reinforcement learning.

The better historical distinction is:

  • concept: internal models for prediction and planning
  • RL formulation: learned models used for planning
  • neural-control formulation: emulator/controller systems
  • recurrent neural formulation: learned world models for planning and curiosity
  • modern demonstration: latent neural world models trained on high-dimensional environments
  • current research program: scalable predictive world models for autonomous intelligence

Grokking as a Modern Reminder

Grokking is the phenomenon where a network trained on a small dataset first memorizes the training data (zero training loss, random test performance), then after much further training, suddenly achieves perfect generalization. This was observed on modular arithmetic tasks.

Grokking is a reminder that common overfitting intuitions are incomplete. In these algorithmic settings, a model can first interpolate the training set while performing poorly on the test set, then much later discover a representation that generalizes. This does not contradict any theorem of classical learning theory; it exposes how weak our usual intuitions can be for overparameterized networks trained far past interpolation. This connects back to the recurring question of representation: the network discovers a better internal representation given enough training time.
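A sketch of the kind of task on which grokking was observed: modular addition with most of the table held out. The modulus and split fraction below are illustrative choices, not the exact settings of Power et al.

```python
import numpy as np

def modular_addition_split(p=97, train_frac=0.3, rng=np.random.default_rng(0)):
    """Build the full (a, b, (a + b) mod p) table and hold most of it out.
    Grokking experiments train a small transformer on the training split (with
    weight decay) and watch test accuracy jump long after training accuracy saturates."""
    examples = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
    rng.shuffle(examples)
    n_train = int(train_frac * len(examples))
    return examples[:n_train], examples[n_train:]

train, test = modular_addition_split()
print(len(train), len(test))   # e.g. 2822 training examples, 6587 held out
```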

Summary

  • Five questions recur across all eras: representation, learning, search, uncertainty, tractability
  • The symbolic-connectionist divide was not present at the founding
  • AI winters were driven by expectation gaps, not purely by technical failure
  • The Dreyfus critique (1965, 1972) and Lighthill Report (1973) preceded and precipitated the first AI winter
  • Hopfield (1982) and backpropagation (1986) drove the connectionist resurgence
  • Watkins (1989) Q-learning and Sutton-Barto (1998) codified reinforcement learning
  • Model-based RL and neural world models predate the current JEPA/world-model discussion
  • The statistical reform brought principled uncertainty handling
  • AlphaGo (2016) demonstrated RL plus deep learning at human-champion scale
  • Scale (data + compute + architecture) changed what is tractable, but did not eliminate the need for structure
  • Training dynamics still surprise us (grokking, double descent)

Exercises

ExerciseCore

Problem

For each of the five recurring questions (representation, learning, search, uncertainty, tractability), name the dominant answer in the expert systems era (1974-1990) and the scale era (2012-present).

ExerciseAdvanced

Problem

Minsky and Papert showed single-layer perceptrons cannot compute XOR. Construct a two-layer network (with exact weights) that computes XOR on inputs $x_1, x_2 \in \{0, 1\}$.

References

Reference note: this page is a pedagogical history, not an exhaustive professional historiography. The references below prioritize primary technical sources and a few standard historical surveys for disputed claims.

Canonical:

  • Turing, "On Computable Numbers" (1936)
  • McCulloch & Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity" (1943)
  • Hebb, The Organization of Behavior (1949), Chapter 4 (synaptic modification rule, p. 62)
  • Löwel & Singer, "Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity" (1992), Science 255:209-212 (source of the "fire together wire together" paraphrase)
  • Novikoff, "On Convergence Proofs on Perceptrons" (1962), Symposium on the Mathematical Theory of Automata, 12:615-622 (tight finite-step bound)
  • Minsky & Papert, Perceptrons (1969), Chapters 1-5 (XOR, order results); Chapters 11-13 (Group Invariance, Parity, Connectedness Theorems)
  • Rumelhart, Hinton & Williams, "Learning Representations by Back-propagating Errors" (1986)
  • Hopfield, "Neural networks and physical systems with emergent collective computational abilities" (1982), PNAS 79:2554-2558
  • LeCun et al., "Backpropagation Applied to Handwritten Zip Code Recognition" (1989)
  • LeCun, Bottou, Bengio & Haffner, "Gradient-Based Learning Applied to Document Recognition" (1998)
  • Hochreiter & Schmidhuber, "Long Short-Term Memory" (1997)
  • Pearl, Probabilistic Reasoning in Intelligent Systems (1988)
  • Watkins, "Learning from Delayed Rewards" (1989), Cambridge PhD thesis
  • Sutton, "Dyna, an Integrated Architecture for Learning, Planning, and Reacting" (1991), ACM SIGART Bulletin 2(4):160-163
  • Sutton & Barto, Reinforcement Learning: An Introduction (1998, 2nd ed. 2018)
  • Brooks, "Intelligence Without Representation" (1991), Artificial Intelligence 47:139-159
  • Silver et al., "Mastering the game of Go with deep neural networks and tree search" (2016), Nature 529:484-489

Model-based RL and world models:

  • Nguyen & Widrow, "Neural Networks for Self-Learning Control Systems" (1990), IEEE Control Systems Magazine
  • Schmidhuber, "Making the World Differentiable" (1990), Technical Report FKI-126-90, TU Munich
  • Ha & Schmidhuber, "World Models" (2018)
  • LeCun, "A Path Towards Autonomous Machine Intelligence" (2022)

Philosophical critiques:

  • Dreyfus, "Alchemy and Artificial Intelligence" (1965), RAND Memorandum P-3244
  • Dreyfus, What Computers Can't Do (1972)
  • Lighthill, "Artificial Intelligence: A General Survey" (1973), UK Science Research Council

Historical surveys:

  • Nilsson, The Quest for Artificial Intelligence (2010), Chapters 1-20
  • Crevier, AI: The Tumultuous History of the Search for Artificial Intelligence (1993)
  • Russell & Norvig, Artificial Intelligence: A Modern Approach (4th ed., 2021), Chapter 1

Current:

  • Vaswani et al., "Attention Is All You Need" (2017)
  • Kaplan et al., "Scaling Laws for Neural Language Models" (2020)
  • Power et al., "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets" (2022)

Deeper Threads

  • Symbolic-neural hybrids: from rule-based QA systems to current attempts to combine neural representations with explicit structure.
  • Embodied intelligence: from Brooks's subsumption architecture to robotics foundation models and experience-driven agents.
  • Philosophical critiques: Dreyfus, Searle, Dennett, and the technical responses their arguments forced AI researchers to sharpen.
  • World models and model-based RL: Dyna, neural emulators, Schmidhuber 1990, Ha-Schmidhuber 2018, LeCun's JEPA program, and the attribution discipline needed to separate concept, implementation, demonstration, and scaling.

Last reviewed: April 22, 2026
