
AI History

History of Artificial Intelligence

Five recurring questions drive AI history: representation, learning, search, uncertainty, and tractability. From Turing 1936 through the transformer era, every advance answered at least one of these questions differently.


Why This Matters

Most AI history is told as a sequence of breakthroughs and winters. That framing obscures the actual structure. The same five questions recur in every era: How should knowledge be represented? How should a system learn from data? How should it search through possibilities? How should it handle uncertainty? How do you make any of this tractable?

A useful way to read AI history is that many major advances changed the answer to one or more of these questions. Understanding the history this way prevents you from thinking that the current approach (large neural networks trained on internet data) is the final answer. It also prevents the opposite error: dismissing scale because previous hype cycles ended badly.


Watch Out

The five-questions frame is a pedagogical device, not a historiographical claim

The five recurring questions (representation, learning, search, uncertainty, tractability) are an organizing device for readers, not an established framework in the history-of-science literature. Professional historians of AI (Crevier, Nilsson, Buchanan) organize the field differently, typically around funding cycles, institutional networks, and specific research programs. The five-questions frame is useful for spotting conceptual continuities across eras. It is not a claim that these five questions exhausted what researchers cared about, nor that each era had a single dominant answer to each.

The Five Recurring Questions

  1. Representation: What data structures encode knowledge? Logic formulas, feature vectors, distributed representations, attention patterns.
  2. Learning: How does the system improve from experience? Hand-coded rules, perceptron updates, backpropagation, self-supervised pretraining.
  3. Search and optimization: How does the system find good solutions in large spaces? Tree search, gradient descent, evolutionary methods, beam search.
  4. Uncertainty: How does the system reason about what it does not know? Boolean logic, fuzzy sets, probability, Bayesian networks, calibration.
  5. Tractability: How do you make the above work at useful scale? Approximation, parallelism, sparsity, hardware co-design.

Every era below can be understood as a shift in the dominant answer to one or more of these questions.

Era-by-Question Matrix

This matrix is the page's main analytical tool. Read each era as a different choice about representation, learning, search, uncertainty, and tractability.

| Era | Representation | Learning | Search / Optimization | Uncertainty | Tractability |
|---|---|---|---|---|---|
| Computability and neural logic | Boolean logic, ideal neurons | Local synaptic change | Computation as procedure | Mostly implicit | Universal machine abstraction |
| Founding coalition | Logic, search states, early networks | Perceptrons, hand design | Heuristic search | Minimal formal treatment | Toy domains and small machines |
| Symbolic AI | Rules, symbols, problem states | Hand-coded knowledge | Theorem proving, planning | Logic-first reasoning | Domain restriction |
| Perceptron controversy | Linear thresholds | Perceptron updates | Margin-driven correction | Mostly absent | Bounded-order limitations |
| Expert systems | Production rules | Expert elicitation | Forward/backward chaining | Certainty factors | Narrow deployment domains |
| Statistical reform | Graphical models, kernels, features | Probabilistic estimation | EM, convex optimization, inference | Probability and Bayesian networks | Approximation and factorization |
| Connectionist return | Distributed vectors, CNN/RNN states | Backpropagation, temporal difference learning | Gradients, RL, model-based planning | Softmax and probabilistic models | GPUs, data, and differentiable programs |
| Scale era | Transformers, latent states, world models | Self-supervision, post-training, representation learning | Gradient descent, decoding, planning | Calibration, sampling, uncertainty estimates | Scale, parallelism, sparsity, hardware co-design |

Era 1: Computability and the Idea of Machine Thought (1936-1950)

Turing's 1936 paper on computable numbers established that a single machine can compute anything that is computable, given enough time and tape. This is the mathematical precondition for AI: if intelligence involves computation, then a universal machine can in principle replicate it.

McCulloch and Pitts (1943) showed that networks of idealized binary neurons can compute any Boolean function. This was a representation result: neural activity could be described in the language of propositional logic.

Hebb (1949) proposed that synaptic connections strengthen when pre-synaptic and post-synaptic neurons fire together. His actual formulation: "When an axon of cell A is near enough to excite cell B... some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased" (The Organization of Behavior, 1949, p. 62). This was a learning result: unsupervised and local, requiring no teacher signal and no global error computation. The catchier paraphrase "neurons that fire together wire together" is due to Löwel and Singer (1992, Science 255:209-212), not Hebb.
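A minimal sketch of the Hebbian idea in a modern rate-based form. The learning rate, the toy correlation setup, and the variable names are illustrative, not Hebb's; the point is only that the update uses locally available activity with no teacher signal.

```python
import numpy as np

def hebbian_update(w, pre, post, lr=0.01):
    """One Hebbian step: strengthen w[i, j] when pre-synaptic unit j and
    post-synaptic unit i are active together. No teacher signal, no global
    error computation: only locally available activity is used."""
    return w + lr * np.outer(post, pre)

# Toy example: two perfectly correlated input units driving one output unit.
rng = np.random.default_rng(0)
w = np.zeros((1, 2))
for _ in range(1000):
    pre = (rng.random(2) < 0.5).astype(float)
    pre[1] = pre[0]                       # unit 1 copies unit 0: perfect correlation
    post = np.array([float(pre.sum() > 0)])
    w = hebbian_update(w, pre, post)
print(w)  # both weights grow together, reflecting the correlated activity
```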

Era 2: The Founding Coalition (1950-1960)

The Dartmouth workshop (1956) is conventionally called the founding of AI. The key participants (McCarthy, Minsky, Rochester, Shannon) had diverse views. McCarthy favored logic and search. Minsky was interested in neural networks at this point (his Princeton thesis was on neural net learning). Shannon contributed information theory. Rochester had built a hardware neural network simulator at IBM.

The founding coalition was not purely symbolic. The symbolic-vs-connectionist divide came later.

Rosenblatt's Perceptron (1957) was a hardware learning machine that could classify patterns by adjusting weights. The perceptron convergence theorem states: if the training data is linearly separable, the algorithm finds a separating hyperplane in finite steps. Rosenblatt (1962) gave a proof of this result, but the first tight finite-step bound with explicit constants was provided by Novikoff (1962, "On Convergence Proofs on Perceptrons," Symposium on the Mathematical Theory of Automata, 12:615-622). The canonical bound $T \leq (R/\gamma)^2$ relating the number of updates $T$ to the margin $\gamma$ and the maximum input norm $R$ is Novikoff's.
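A sketch of the perceptron update rule and the quantity the Novikoff bound controls, assuming $\pm 1$ labels and a bias folded into the weight vector; the variable names and toy data are mine, not Rosenblatt's.

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """Classic perceptron rule: on a mistake, add y_i * x_i to the weights.
    If the data are linearly separable with margin gamma and max norm R,
    Novikoff's bound says the number of updates is at most (R / gamma)**2."""
    w = np.zeros(X.shape[1])
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:      # misclassified (or on the boundary)
                w += yi * xi
                updates += 1
                mistakes += 1
        if mistakes == 0:                    # converged: a separating hyperplane was found
            break
    return w, updates

# Linearly separable toy data; the constant 1.0 column acts as the bias input.
X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, updates = train_perceptron(X, y)
print(w, updates)
```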

Era 3: Symbolic Problem Solving (1956-1974)

Newell and Simon's Logic Theorist (1956) and General Problem Solver (1957) solved problems by searching through a space of symbolic states. Their Physical Symbol System Hypothesis, articulated later in their 1976 Turing Award lecture, claimed that symbol manipulation is both necessary and sufficient for general intelligence. This answered the representation question with: symbols and rules.

McCarthy created Lisp (1958), a programming language where code and data share the same structure. Lisp became the lingua franca of AI research for three decades.

Samuel's checkers program (1959) used self-play and temporal difference methods. This was an early instance of reinforcement learning, though the term came later.

Dreyfus (1965, "Alchemy and Artificial Intelligence," RAND Memorandum P-3244; 1972, What Computers Can't Do) provided the most sustained philosophical critique of symbolic AI. Drawing on phenomenology (Heidegger, Merleau-Ponty), he argued that human expertise is embodied and contextual in ways that rule-based symbol manipulation cannot capture. His critique was largely dismissed at the time but anticipated concerns about brittle expert systems. Brooks (1991) echoed related points from an engineering angle; the later embodied-AI work made several of Dreyfus's concerns look much less dismissible than they seemed to many AI researchers at the time.

Era 4: The Perceptron Controversy (1969)

Theorem

Perceptron Limitation (Minsky-Papert 1969)

Statement

(a) XOR case: A single-layer perceptron (linear threshold unit with no hidden layer) cannot compute the XOR function, because XOR is not linearly separable in 2D.

(b) General order-limited result: For a perceptron of order $k$ (each hidden unit receives input from at most $k$ input nodes), the Group Invariance Theorem states that any predicate computable by a group-invariant perceptron of order $k$ must be a function of global statistics computable from $k$-wise interactions. The Parity Theorem shows that parity on $n$ bits requires order $n$. The Connectedness Theorem shows that deciding whether a figure is connected requires order that grows without bound as the retina grows, regardless of the number of hidden units.

Intuition

The XOR limitation is elementary: a hyperplane cannot separate the four XOR truth-table points. The deeper results concern predicates that require examining global structure. A perceptron whose units each see only a limited patch of the input cannot, for example, determine if a figure drawn on a retina is connected, no matter how many such limited units you add.

Proof Sketch

For (a): the four XOR points $(0,0), (1,1), (0,1), (1,0)$ with labels $0, 0, 1, 1$ are not linearly separable.

For (b): the Group Invariance Theorem (Minsky-Papert 1969, Chapters 12-13) uses a symmetry argument. If a predicate $\psi$ is invariant under a group $G$ of permutations of the input, the coefficients of any perceptron computing $\psi$ can be averaged over $G$ without changing the predicate, reducing the analysis to a small number of symmetric quantities. Parity is invariant under all permutations of the input bits, forcing order equal to the input size; connectedness is invariant under large groups of transformations, forcing the order to grow with the size of the retina.

Why It Matters

Minsky and Papert did not prove that neural networks are useless. They proved limitations of order-limited perceptrons. They noted explicitly that multi-layer networks might overcome these limits, but nobody knew how to train them at the time. The book is frequently mischaracterized as "proving neural networks cannot work."

Failure Mode

The general results (Parity, Connectedness) apply to bounded-order perceptrons regardless of the number of hidden units. They do not apply to networks of unbounded order. A two-layer threshold network with two hidden units can compute XOR; a sufficiently large multi-layer network can compute any Boolean function. The missing piece in 1969 was not expressivity in principle, but a practical and widely adopted training method for hidden layers.
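One concrete choice of weights (many others work) for the two-hidden-unit threshold network mentioned above. The construction is the standard OR/AND decomposition; the specific numbers are illustrative.

```python
import numpy as np

def step(z):
    """Heaviside threshold: 1 if the weighted sum exceeds the threshold, else 0."""
    return (z > 0).astype(float)

def xor_net(x1, x2):
    """Two-layer threshold network computing XOR.
    Hidden unit h1 computes OR, h2 computes AND; the output fires
    when OR is true but AND is not, which is exactly XOR."""
    x = np.array([x1, x2], dtype=float)
    h1 = step(x @ np.array([1.0, 1.0]) - 0.5)   # OR(x1, x2)
    h2 = step(x @ np.array([1.0, 1.0]) - 1.5)   # AND(x1, x2)
    return step(1.0 * h1 - 2.0 * h2 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(xor_net(a, b)))   # prints the XOR truth table
```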

The effect of Minsky and Papert's book was a reduction in neural network funding, but the technical contribution was honest: single-layer and order-limited networks have real limitations, and multi-layer training was an open problem.

The Lighthill Report (1973, Sir James Lighthill for the UK Science Research Council) evaluated the state of AI research in Britain and concluded that the field had failed to deliver on its promises, especially in robotics and general-purpose problem-solving. The report triggered a substantial withdrawal of UK government funding, marking the British AI winter. Combined with reductions in US DARPA funding around the same time, this constituted the first major AI winter.

Era 5: Expert Systems (1974-1990)

DENDRAL (Feigenbaum, Lederberg, 1965) inferred molecular structure from mass spectrometry data using hand-coded chemical rules. MYCIN (Shortliffe, 1976) diagnosed bacterial infections using production rules with certainty factors.

Expert systems answered the representation question with: if-then rules elicited from domain experts. They answered uncertainty with: ad hoc certainty factors (not probability).

The CYC project (Lenat, 1984) attempted to encode all common-sense knowledge as logical assertions. After decades of work, CYC became the canonical example of how difficult broad common-sense knowledge engineering is: hand-coded knowledge can work in constrained settings, but open-ended common sense produces too many edge cases.

Watch Out

AI winters were not purely technical

The first AI winter (1974) and the second (1987-1993) were driven by funding disappointment as much as by technical failure. Expert systems worked well in narrow domains. The problem was that advocates promised generality and delivered brittleness. The gap between promise and delivery caused funders to withdraw. The Dreyfus critique (1965, 1972) anticipated some of these failures philosophically; the Lighthill Report (1973) triggered the institutional response in Britain.

Era 6: The Statistical and Probabilistic Reform (1985-2005)

Hidden Markov Models brought probability to speech recognition (Jelinek at IBM, 1970s-1980s). Bayesian networks (Pearl, 1988) provided a representation for probabilistic reasoning that could be computed tractably for sparse dependency structures.

Hopfield (1982, PNAS 79:2554-2558) introduced recurrent networks with symmetric weights that converge to local energy minima. Hopfield networks are content-addressable memories: partial or noisy inputs settle to stored patterns. This was the first major connectionist result with a clear theoretical foundation (the energy function) and served as a precursor to the broader connectionist resurgence of the mid-1980s.
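A minimal sketch of Hopfield storage and recall with $\pm 1$ states, using the outer-product (Hebbian) storage rule and asynchronous updates. The update order and the toy pattern are conventions chosen here for illustration.

```python
import numpy as np

def store(patterns):
    """Outer-product storage rule: W is symmetric with zero diagonal, which
    guarantees the energy E = -0.5 * s^T W s decreases under asynchronous updates."""
    n = patterns.shape[1]
    W = sum(np.outer(p, p) for p in patterns) / n
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, state, steps=200, rng=np.random.default_rng(0)):
    """Asynchronous recall: flip one unit at a time toward its local field,
    so the state settles into a local energy minimum (a stored pattern)."""
    s = state.copy()
    for _ in range(steps):
        i = rng.integers(len(s))
        s[i] = 1 if W[i] @ s >= 0 else -1
    return s

pattern = np.array([1, -1, 1, -1, 1, -1, 1, -1])
W = store(pattern[None, :])
noisy = pattern.copy()
noisy[:2] *= -1                      # corrupt two bits
print(recall(W, noisy))              # settles back to the stored pattern
```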

This era answered the uncertainty question with: probability theory, not certainty factors. It answered representation with: directed graphical models. It answered tractability with: conditional independence and message passing.
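A toy illustration of why conditional independence buys tractability, assuming a chain-structured network over binary variables; the counts below are just the sizes of the probability tables involved.

```python
# A full joint over n binary variables needs 2**n entries. A chain-structured
# Bayesian network factorizes P(x1,...,xn) = P(x1) * prod_i P(x_i | x_{i-1}),
# needing only O(n) small tables, and inference can pass messages along the chain.
def table_sizes(n):
    full_joint = 2 ** n
    chain_factored = 2 + (n - 1) * 4   # P(x1): 2 entries; each P(x_i | x_{i-1}): 4 entries
    return full_joint, chain_factored

print(table_sizes(20))   # (1048576, 78): factorization is what makes inference tractable
```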

Pearl's contribution was not just Bayesian networks but a principled framework for causal reasoning that went beyond statistical association.

Watch Out

Statistics did not replace AI

A common misconception is that the statistical turn was a rejection of AI. It was the opposite: AI researchers adopted statistical methods to handle uncertainty rigorously instead of with ad hoc certainty factors. The researchers were often the same people.

Era 7: The Connectionist Return (1986-2012)

Rumelhart, Hinton, and Williams (1986) popularized backpropagation for training multi-layer networks. Backpropagation had been discovered multiple times before (Werbos 1974, Linnainmaa 1970), but the 1986 paper demonstrated it on problems that mattered.

LeCun (1989) applied convolutional networks and backpropagation to handwritten digit recognition. Hochreiter and Schmidhuber (1997) introduced LSTMs to address vanishing gradients in recurrent networks.

Watkins (1989, Cambridge PhD thesis "Learning from Delayed Rewards") introduced Q-learning, a model-free reinforcement learning algorithm later given finite state-action convergence results under standard assumptions. Sutton and Barto (1998, Reinforcement Learning: An Introduction) codified the field, unifying temporal-difference learning, Monte Carlo methods, and function approximation under a common framework.
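A sketch of the tabular Q-learning update from Watkins's thesis, paired with a toy chain environment. The environment, hyperparameters, and exploration scheme are placeholders chosen for illustration.

```python
import numpy as np

def q_learning(n_states, n_actions, step_fn, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, rng=np.random.default_rng(0)):
    """Tabular Q-learning (Watkins 1989): model-free and off-policy.
    Update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
            s_next, r, done = step_fn(s, a)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

# Toy chain: 5 states, actions move left/right, reward 1 for reaching the last state.
def chain_step(s, a):
    s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == 4), s_next == 4

print(q_learning(5, 2, chain_step).round(2))
```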

Model-Based RL and World Models

The model-free/model-based distinction became central to reinforcement learning. In model-free RL, the agent learns a policy or value function directly from experience. In model-based RL, the agent also learns or uses a model of how the world changes, then uses that model for planning.

Sutton's Dyna architecture made this explicit: the agent learns from real experience, updates an action model of the world, and performs additional learning from hypothetical experience generated by that model. This was not a side branch of RL; it was one of the cleanest early formulations of how learning, planning, and acting could be unified.
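A sketch of the Dyna-Q control flow described above: real experience both updates Q and populates a simple tabular model, and extra planning updates replay transitions from that model. The deterministic-model assumption and the hyperparameters are mine, not Sutton's specification; it can be run against the same `chain_step` toy environment as the Q-learning sketch above.

```python
import numpy as np

def dyna_q(n_states, n_actions, step_fn, episodes=200, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1, rng=np.random.default_rng(0)):
    """Dyna-Q: interleave direct RL updates from real experience with
    planning updates drawn from a learned (deterministic, tabular) model."""
    Q = np.zeros((n_states, n_actions))
    model = {}                                           # (s, a) -> (reward, s_next, done)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
            s_next, r, done = step_fn(s, a)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])        # direct RL from real experience
            model[(s, a)] = (r, s_next, done)            # model learning
            for _ in range(planning_steps):              # planning from simulated experience
                ps, pa = list(model)[rng.integers(len(model))]
                pr, ps_next, pdone = model[(ps, pa)]
                ptarget = pr + (0.0 if pdone else gamma * Q[ps_next].max())
                Q[ps, pa] += alpha * (ptarget - Q[ps, pa])
            s = s_next
    return Q
```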

Neural versions of this idea also appeared early. Nguyen and Widrow (1990) trained an emulator network to learn the dynamics of a nonlinear system, then trained a controller on that emulator before using the controller on the real system. Schmidhuber's 1990 report, Making the World Differentiable, explored self-supervised recurrent neural networks for dynamic reinforcement learning and planning, including a learned model of environmental dynamics and future reinforcement.

The 2018 World Models paper by David Ha and Jürgen Schmidhuber did not invent model-based RL or the general idea of an internal model for planning. Its contribution was different: it gave the modern deep-learning community a clean, visual demonstration of a latent neural world model, where an agent learns a compact representation of an environment and can train a controller inside the environment generated by its own model.

The historical lesson is attribution discipline: distinguish the old concept, the algorithmic formulation, the neural implementation, the modern demonstration, and the branding. Those are related contributions, but they are not the same contribution.

Brooks (1991, "Intelligence Without Representation," Artificial Intelligence 47:139-159) argued that symbolic internal representations are not necessary for intelligent behavior. His subsumption architecture demonstrated that reactive agents with layered behavior rules could accomplish navigation tasks that required elaborate planning in GOFAI systems. This was an engineering challenge to the symbolic-AI representationalist consensus, later influential in robotics and embodied-AI research.

A smaller neural-network community, including Bengio, LeCun, Hinton, Schmidhuber, Hochreiter, and others, persisted through a period (roughly 1995-2010) when neural network research was unfashionable. SVMs and kernel methods dominated machine learning conferences. The later shift came when increased compute and data made deep networks competitive.

Era 8: The Scale Era (2012-Present)

AlexNet (Krizhevsky, Sutskever, Hinton, 2012) won ImageNet by a large margin using a GPU-trained convolutional network. AlexNet was not a conceptual break from convolutional networks; it was a scaled, GPU-trained convnet in the LeNet lineage, with more data, more compute, ReLUs, dropout, and strong engineering.

The Transformer (Vaswani et al., 2017) replaced recurrence with self-attention, enabling parallel training over sequences. This architecture change made it tractable to train on billions of tokens.
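A minimal numpy sketch of the scaled dot-product self-attention at the core of the Transformer: single head, no masking, no multi-head splitting or output projection. Shapes and names are illustrative.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention (Vaswani et al., 2017), single head.
    Every position attends to every position in parallel; there is no recurrence,
    which is what makes training parallelizable over the whole sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (seq, seq) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # (seq, d_v) mixture of values

rng = np.random.default_rng(0)
seq, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq, d_model))
out = self_attention(X, *(rng.normal(size=(d_model, d_k)) for _ in range(3)))
print(out.shape)   # (4, 8)
```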

GPT-2 (2019) and GPT-3 (2020) demonstrated that language models scaled to billions of parameters exhibit emergent capabilities not present at smaller scales. Scaling laws (Kaplan et al., 2020) quantified the relationship between compute, data, parameters, and loss.
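The parameter-count scaling law in Kaplan et al. (2020) has a simple power-law form. The sketch below uses constants approximately equal to the values reported in that paper for their setup; treat them as illustrative, not universal.

```python
def loss_vs_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Power-law fit L(N) = (N_c / N) ** alpha_N from Kaplan et al. (2020).
    The constants are roughly the paper's reported values for its training
    setup and data distribution; they do not transfer universally."""
    return (n_c / n_params) ** alpha_n

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {loss_vs_params(n):.2f}")
```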

AlphaGo (Silver et al., 2016, Nature 529:484-489) defeated the world Go champion by combining deep reinforcement learning with Monte Carlo tree search. This was the milestone demonstrating that RL combined with deep learning could master tasks previously thought to require human-level intuition. AlphaZero (2017) generalized the approach to chess and shogi without human game records.

LeCun's JEPA/world-model program is one modern answer to the representation and tractability problem. Rather than predicting every pixel or token, a Joint Embedding Predictive Architecture predicts latent representations of missing or future inputs. In the autonomous-intelligence proposal, this becomes part of a planning architecture: an actor proposes candidate action sequences, the world model predicts future state representations, and cost modules evaluate predicted outcomes. The point is not merely video prediction; it is planning over abstract representations.
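A schematic sketch of the planning loop described above. All component names (`propose_actions`, `world_model`, `cost`) are placeholders for learned modules; this paraphrases the control flow of the proposal, not LeCun's implementation.

```python
def plan(state_repr, propose_actions, world_model, cost, horizon=5, n_candidates=16):
    """Planning in representation space: score candidate action sequences by rolling
    the world model forward over latent states and summing predicted costs.
    Every callable here stands in for a learned module in the proposed architecture."""
    best_seq, best_cost = None, float("inf")
    for _ in range(n_candidates):
        actions = propose_actions(horizon)      # actor proposes a candidate action sequence
        s, total = state_repr, 0.0
        for a in actions:
            s = world_model(s, a)               # predict the next latent state, not pixels
            total += cost(s)                    # cost modules evaluate the predicted outcome
        if total < best_cost:
            best_seq, best_cost = actions, total
    return best_seq
```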

This does not make LLMs obsolete or unimportant; it identifies a different research program for systems whose core operation is prediction and planning in representation space, not only next-token continuation.

Watch Out

Deep learning did not make structure irrelevant

The Transformer is highly structured: it uses positional encodings, multi-head attention with learned projections, layer normalization, and residual connections. The claim that deep learning replaces feature engineering with end-to-end learning is half true. It replaces input feature engineering but introduces architectural structure that encodes inductive biases about sequence processing.

Watch Out

World models: attribution without mythology

Discussions of world-model history often collapse several contributions into a single question of priority. That loses the useful distinction between the old concept, the control formulation, the neural implementation, the modern demonstration, and the current representation-learning program.

The world-model debate is a useful test case. A careful account should not ask only who first used a label. It should ask which contribution is being attributed: mental models in psychology, optimal-control planning, learned dynamics, neural emulator/controller systems, recurrent self-supervised world models, modern latent demonstrations, or scalable predictive representation learning beyond next-token modeling.

LeCun's 2022 autonomous-intelligence position paper uses configurable predictive world models and hierarchical joint embedding architectures as part of a proposed path toward machine intelligence. But the paper explicitly says that many of its ideas had been formulated earlier by many authors, and that the paper does not claim priority for them.

The fair version is:

  • The cognitive and control antecedents are older: Craik's mental-model idea, optimal control, and model-predictive control.
  • Sutton's Dyna made the model-based RL version especially clear.
  • Nguyen and Widrow showed a neural emulator/controller version in 1990.
  • Schmidhuber's 1990 report is an early recurrent-neural contribution to learned dynamics, planning, and curiosity-style exploration.
  • Ha and Schmidhuber's 2018 paper popularized a modern latent-neural demonstration.
  • LeCun's JEPA/world-model program is a modern proposal for scalable predictive representation learning and planning, not a claim of first invention.

What History Gets Wrong

Myth 1: AI began as purely symbolic. The founding coalition included neural network researchers. Minsky's thesis was about neural nets. The symbolic-vs-connectionist framing became dominant later.

Myth 2: Minsky proved neural networks impossible. Minsky and Papert proved limitations of order-limited perceptrons (Chapters 1-5 for the XOR/order results; Chapters 11-13 for Parity and Connectedness Theorems). They acknowledged that multi-layer networks might work, but training them was an open problem. The deep result is about bounded-order perceptrons on group-invariant predicates, not about neural networks in general.

Myth 3: Deep learning made structure irrelevant. Modern architectures are heavily structured. The inductive biases in Transformers (attention over all positions, residual connections) are design choices, not emergent properties.

Myth 4: Current methods are the endpoint. Grokking (Power et al., 2022) showed that neural networks can suddenly generalize long after memorizing training data. Training dynamics still produce surprises. The history of AI is a history of confident predictions about what works, followed by corrections.

Myth 5: LeCun's world-model work is a priority claim. False. LeCun has promoted world models as a path toward autonomous machine intelligence, especially through predictive latent-space architectures such as JEPA. But his 2022 proposal explicitly does not claim priority for the general idea. The broader concept of using an internal model for prediction and planning is much older, appearing in psychology, optimal control, model-predictive control, and model-based reinforcement learning.

The better historical distinction is:

  • concept: internal models for prediction and planning
  • RL formulation: learned models used for planning
  • neural-control formulation: emulator/controller systems
  • recurrent neural formulation: learned world models for planning and curiosity
  • modern demonstration: latent neural world models trained on high-dimensional environments
  • current research program: scalable predictive world models for autonomous intelligence

Grokking as a Modern Reminder

Grokking is the phenomenon where a network trained on a small dataset first memorizes the training data (zero training loss, random test performance), then after much further training, suddenly achieves perfect generalization. This was observed on modular arithmetic tasks.

Grokking is a reminder that common overfitting intuitions are incomplete. In these algorithmic settings, a model can first interpolate the training set while performing poorly on the test set, then much later discover a representation that generalizes. This does not contradict any theorem of classical learning theory; it exposes how weak our usual intuitions can be for overparameterized networks trained far past interpolation. This connects back to the recurring question of representation: the network discovers a better internal representation given enough training time.
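A sketch of the kind of task on which grokking was observed: modular addition with most of the table held out. The modulus and split fraction below are illustrative choices, not the exact settings of Power et al.

```python
import numpy as np

def modular_addition_split(p=97, train_frac=0.3, rng=np.random.default_rng(0)):
    """Build the full (a, b, (a + b) mod p) table and hold most of it out.
    Grokking experiments train a small transformer on the training split (with
    weight decay) and watch test accuracy jump long after training accuracy saturates."""
    examples = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
    rng.shuffle(examples)
    n_train = int(train_frac * len(examples))
    return examples[:n_train], examples[n_train:]

train, test = modular_addition_split()
print(len(train), len(test))   # e.g. 2822 training examples, 6587 held out
```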

Summary

  • Five questions recur across all eras: representation, learning, search, uncertainty, tractability
  • The symbolic-connectionist divide was not present at the founding
  • AI winters were driven by expectation gaps, not purely by technical failure
  • The Dreyfus critique (1965, 1972) and Lighthill Report (1973) preceded and precipitated the first AI winter
  • Hopfield (1982) and backpropagation (1986) drove the connectionist resurgence
  • Watkins (1989) Q-learning and Sutton-Barto (1998) codified reinforcement learning
  • Model-based RL and neural world models predate the current JEPA/world-model discussion
  • The statistical reform brought principled uncertainty handling
  • AlphaGo (2016) demonstrated RL plus deep learning at human-champion scale
  • Scale (data + compute + architecture) changed what is tractable, but did not eliminate the need for structure
  • Training dynamics still surprise us (grokking, double descent)

Exercises

ExerciseCore

Problem

For each of the five recurring questions (representation, learning, search, uncertainty, tractability), name the dominant answer in the expert systems era (1974-1990) and the scale era (2012-present).

ExerciseAdvanced

Problem

Minsky and Papert showed single-layer perceptrons cannot compute XOR. Construct a two-layer network (with exact weights) that computes XOR on inputs $x_1, x_2 \in \{0, 1\}$.

References

Reference note: this page is a pedagogical history, not an exhaustive professional historiography. The references below prioritize primary technical sources and a few standard historical surveys for disputed claims.

Canonical:

  • Turing, "On Computable Numbers" (1936)
  • McCulloch & Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity" (1943)
  • Hebb, The Organization of Behavior (1949), Chapter 4 (synaptic modification rule, p. 62)
  • Löwel & Singer, "Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity" (1992), Science 255:209-212 (source of the "fire together wire together" paraphrase)
  • Novikoff, "On Convergence Proofs on Perceptrons" (1962), Symposium on the Mathematical Theory of Automata, 12:615-622 (tight finite-step bound)
  • Minsky & Papert, Perceptrons (1969), Chapters 1-5 (XOR, order results); Chapters 11-13 (Group Invariance, Parity, Connectedness Theorems)
  • Rumelhart, Hinton & Williams, "Learning Representations by Back-propagating Errors" (1986)
  • Hopfield, "Neural networks and physical systems with emergent collective computational abilities" (1982), PNAS 79:2554-2558
  • LeCun et al., "Backpropagation Applied to Handwritten Zip Code Recognition" (1989)
  • LeCun, Bottou, Bengio & Haffner, "Gradient-Based Learning Applied to Document Recognition" (1998)
  • Hochreiter & Schmidhuber, "Long Short-Term Memory" (1997)
  • Pearl, Probabilistic Reasoning in Intelligent Systems (1988)
  • Watkins, "Learning from Delayed Rewards" (1989), Cambridge PhD thesis
  • Sutton, "Dyna, an Integrated Architecture for Learning, Planning, and Reacting" (1991), ACM SIGART Bulletin 2(4):160-163
  • Sutton & Barto, Reinforcement Learning: An Introduction (1998, 2nd ed. 2018)
  • Brooks, "Intelligence Without Representation" (1991), Artificial Intelligence 47:139-159
  • Silver et al., "Mastering the game of Go with deep neural networks and tree search" (2016), Nature 529:484-489

Model-based RL and world models:

  • Nguyen & Widrow, "Neural Networks for Self-Learning Control Systems" (1990), IEEE Control Systems Magazine
  • Schmidhuber, "Making the World Differentiable" (1990), Technical Report FKI-126-90, TU Munich
  • Ha & Schmidhuber, "World Models" (2018)
  • LeCun, "A Path Towards Autonomous Machine Intelligence" (2022)

Philosophical critiques:

  • Dreyfus, "Alchemy and Artificial Intelligence" (1965), RAND Memorandum P-3244
  • Dreyfus, What Computers Can't Do (1972)
  • Lighthill, "Artificial Intelligence: A General Survey" (1973), UK Science Research Council

Historical surveys:

  • Nilsson, The Quest for Artificial Intelligence (2010), Chapters 1-20
  • Crevier, AI: The Tumultuous History of the Search for Artificial Intelligence (1993)
  • Russell & Norvig, Artificial Intelligence: A Modern Approach (4th ed., 2021), Chapter 1

Current:

  • Vaswani et al., "Attention Is All You Need" (2017)
  • Kaplan et al., "Scaling Laws for Neural Language Models" (2020)
  • Power et al., "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets" (2022)

Deeper Threads

  • Symbolic-neural hybrids: from rule-based QA systems to current attempts to combine neural representations with explicit structure.
  • Embodied intelligence: from Brooks's subsumption architecture to robotics foundation models and experience-driven agents.
  • Philosophical critiques: Dreyfus, Searle, Dennett, and the technical responses their arguments forced AI researchers to sharpen.
  • World models and model-based RL: Dyna, neural emulators, Schmidhuber 1990, Ha-Schmidhuber 2018, LeCun's JEPA program, and the attribution discipline needed to separate concept, implementation, demonstration, and scaling.

Last reviewed: April 22, 2026
