You Can’t Skip Exploration
Why understanding experimentation and taste is key to understanding AI
This essay is part 1 of a series on the role of exploration in AI and the implications for AI development and governance.
This part introduces exploration and research taste, discusses their role in research and development, and considers the ways AI could change that picture. This gives rise to some exciting and underexplored (!) opportunities for beneficial and defensive contributions to research.
A second essay will discuss more implications for AI development and governance, including the potential for AI to accelerate the pace of development of AI itself, and some implications for safety and security.
Neither essay will be especially technical, but I will gesture to the technical and mathematical aspects that I find to be illuminating. As ever when I write I raise more questions than I answer! But I hope to provide some initial useful takeaways as well as productive directions for thinking about these issues.
Introducing exploration and experimentation
Scientific and technological progress are driven by experimentation: that is, doing things to find out how the world works. In the field of AI we call this 'exploration'.
Exploration for learning is not just a human phenomenon: it's ubiquitous in natural systems at various scales (from evolution itself to the play of young animals), in individual human lifetimes (as we learn skills or contribute to novel discoveries), in human institutions and societies (which also learn through experience), and in computer science and AI (where exploration for discovery and problem-solving is common).
We care a lot about scientific and technological potential - it can yield enormous risks (from accident, misuse, or societal destabilisation) or enormous benefits (solving major problems in medicine, climate, energy, or even defending against other risky technologies). So exploration isn't just of academic interest.
When we forget to consider how new knowledge is generated and how novel technologies are developed, or when we conflate 'knowledge' with 'learning' or 'learning' with 'exploring', our predictions go astray. Especially when those predictions are action-guiding, we can end up taking misguided or even harmful actions, or missing opportunities to intervene in beneficial ways. So let's do some unpacking!
What factors make exploration (and by extension, research) more or less effective? What are the bottlenecks and limits to exploration? How could AI change the picture? And how can we apply insights from this lens to contribute to a better future?
Why does exploration matter?
Knowledge production loop: activity yields observations, improving knowledge — Exploration drives the loop — Lack of exploration means knowledge stagnation — Exploration is key to understanding technological progress
Learning systems gather new knowledge and insights from observations/data. Random flailing or arbitrary data aren't especially helpful: you want the data to tell you something you didn't already know - so it pays to deliberately seek out novel and informative observations1. This applies at the grandest scales of scientific endeavour as well as in mundane scenarios like navigating an unfamiliar building or learning a new skill.
Owen Cotton-Barratt recently discussed the 'knowledge production loop': activity and observations generate data (captured in datasets and models as 'crystallised intelligence') and combine with thinking algorithms ('fluid intelligence') to in turn drive new activity and observations.
I'd additionally characterise exploration as the way that a crystallised world model and a taste for novelty interact with fluid reasoning and planning to judiciously choose the activities yielding the most informative observations... which in turn improve world models and taste, ad infinitum.
Owen Cotton-Barratt's diagram of crystallised knowledge and fluid reasoning ('capacity for thought') giving rise to a 'knowledge production loop'. Here I discuss exploration as the difference between occasionally chancing upon informative new data and proactively seeking it out (or deliberately producing serendipitous conditions for making new discoveries).
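To make the loop shape concrete, here's a deliberately tiny toy sketch (my illustration, not part of Owen's framing): the 'world model' is a set of Beta posteriors over some unknown phenomena, 'taste' is reduced to a crude value-of-information heuristic (probe whatever you're most uncertain about), and each pass around the loop runs an experiment and updates the model.

```python
import random

# Toy world: each 'phenomenon' is a coin with an unknown bias.
true_biases = {"A": 0.9, "B": 0.5, "C": 0.05}

# Crystallised world model: a Beta(alpha, beta) posterior per phenomenon.
posterior = {name: [1.0, 1.0] for name in true_biases}

def posterior_variance(a, b):
    # Variance of a Beta(a, b) distribution - a proxy for remaining uncertainty.
    return (a * b) / ((a + b) ** 2 * (a + b + 1))

def choose_experiment():
    # 'Taste' here is just a value-of-information heuristic:
    # probe the phenomenon we're currently most uncertain about.
    return max(posterior, key=lambda n: posterior_variance(*posterior[n]))

for step in range(60):
    name = choose_experiment()                     # exploratory planning
    outcome = random.random() < true_biases[name]  # activity -> observation
    posterior[name][0 if outcome else 1] += 1      # update the world model

for name, (a, b) in posterior.items():
    print(f"{name}: estimated bias ~ {a / (a + b):.2f} (true {true_biases[name]})")
```

The point is only the shape of the loop - model, heuristic for what would be informative, action, update - not any claim that real research taste reduces to posterior variance.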
Quality and quantity of exploration mark the vast difference between a civilisation with vibrant progress in science and technology and one with a near-static (or even regressing) capability base - and on an individual level, it's often the difference between rapidly developing new skills or knowledge and getting stuck in a rut.
Understanding exploration is therefore key to understanding technological progress, with all the risks and benefits that entails.
Research and taste
Research is world-model-refinement — Exploration quality drives research — Taste is a learned feel for value of information — Reasoning and world modelling augment taste for exploratory planning
'Research' can be thought of broadly as refining one's world model in a particular domain. We want to know things like: how does electromagnetism work and what can it do for us? How can we prevent diseases from ruining lives? Or (more mundane) how can I get better at playing the piano or juggling? When I say 'research', I equally refer to personal learning and skill-building, scientific research, entrepreneurialism, and business development: all involve exploration and learning from experience.
We can describe three factors determining research production:
Throughput: doing more practice, running more experiments, gathering more data faster, etc.
Modelling efficiency: gathering more generalisable insight from a given experiment or observation.
Exploration quality: choosing better experiments and routines to get more informative observations.2
We'll mostly talk about exploration quality here3, which is in turn governed by taste and exploratory planning.
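As a caricature (my toy framing, not a precise model from the essay), you can think of these factors as roughly multiplying together: insights per unit time scale with how many experiments you run, what fraction of them turn out to be informative, and how much generalisable insight you squeeze from each informative one. A sketch with made-up numbers:

```python
def insight_rate(throughput, exploration_quality, modelling_efficiency):
    """Toy multiplicative model of research production.

    throughput           - experiments run per unit time
    exploration_quality  - fraction of chosen experiments that prove informative
    modelling_efficiency - insight extracted per informative observation
    """
    return throughput * exploration_quality * modelling_efficiency

print(insight_rate(10, 0.2, 0.5))  # baseline: 1.0 'insight' per unit time
print(insight_rate(20, 0.2, 0.5))  # doubling throughput: 2.0
print(insight_rate(10, 0.4, 0.5))  # doubling exploration quality: 2.0
```

In reality the factors interact (better taste changes which experiments are worth running at all), but the caricature is enough to see why a bottleneck in any one factor caps the whole pipeline.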
What do I mean by 'taste'? Sometimes people refer to 'research taste' as a sense, developed from domain experience, for the types of experiments and other activities which are most likely to be interesting or informative, or to otherwise move forward the state of understanding. Clearly this is an essential component of any deliberate exploration - otherwise you're back to flailing randomly!
The taste that's being developed is exactly analogous to a taste for activity which is liable to yield good outcomes of other kinds. We're just considering the value of information as the good in question. So this decomposes into an ability to come up with promising proposals more often, perhaps together with abilities to discriminate more accurately between better and worse proposals or to determine refinements and improvements to proposals4.
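A minimal sketch of that decomposition (purely illustrative - the scoring function and 'refinement' here are arbitrary stand-ins, not a model of real taste):

```python
import random

def propose_experiment():
    # Stand-in generator: real taste produces promising proposals
    # more often than chance; here they're just random.
    return {"novelty": random.random(), "feasibility": random.random()}

def critic_score(proposal):
    # Stand-in discriminator: a crude estimate of value of information,
    # trading off novelty against feasibility.
    return 0.7 * proposal["novelty"] + 0.3 * proposal["feasibility"]

def refine(proposal):
    # Stand-in refiner: nudge a promising proposal towards feasibility.
    improved = dict(proposal)
    improved["feasibility"] = min(1.0, improved["feasibility"] + 0.1)
    return improved

candidates = [propose_experiment() for _ in range(20)]  # generate
best = max(candidates, key=critic_score)                # discriminate
best = refine(best)                                     # refine
print(best, critic_score(best))
```

Better taste means better candidates in the first place, a sharper critic, or both.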
Now, imagine - for the sake of the argument - you're a human. Even better, in fact, imagine you're inhumanly fast and detail-sensitive, the best reasoner in the world, and you have general knowledge matching the rest of the world combined. You still need to do research in order to make new discoveries. If you don't know the details yet, experimentation isn't something you can skip!5 Your especially effective reasoning merely acts as another input to exploration quality, alongside domain research taste, perhaps allowing you to choose better experiments, and achieve results sooner. Reasoning applied effectively in this way is exploratory planning.
So present exploration quality depends on your current level of taste, while future exploration quality will also depend on taste accrual6. Reasoning and planning of course also feed into this, as we improve proposals and discard designs in favour of better-looking ones - but this has to ground out in a taste for what makes a good proposal in the first place.
From play to experimentation
Play is proto-exploration — Fun is proto-taste — Humans adaptably accrue taste in novel domains — Taste is domain-specific but exploratory principles generalise
These aspects of taste are discovered and refined through experience. Research taste is domain-specific!
Many humans and animals, especially youngsters, have built-in instincts for play, curiosity, and novelty. These have been tuned by painstaking natural selection to aid in orienting to the range of body configurations, environments, and communities those animals usually inhabit, precisely by exploring: gathering evidence and information about how things work. In this case, evolution did the slow, gradual work of determining the 'taste' - the recognisable hallmarks of good exploratory behaviour - and wired up the 'fun' sense to those hallmarks7.
Two fox cubs play. For diverse animals, discovering the particular ways your body and brain interact, and how those affect and are affected by your surroundings, is a key part of learning adaptable and dynamic behaviours. Individual playfulness delivers novelty and exploration, while group play, especially mock contests, provides a rich 'curriculum' for development (much like the 'self-play' of some AI training system designs). (Image from freepik.com)
We call children intelligent because of what they can learn, not what they can do
Playful young animals and humans thereby become adept at controlling their bodies and engaging in effective social interaction. But humans move past mere bodily control and socialisation: we use and develop tools, technologies, and diverse and innovative social structures.
For many researchers and others engaged in creation, experimenting is a lot like playing! - the rich and sophisticated kinds of play that humans engage in somewhat instinctively. But because research and development, science, and industry move beyond the historic realms of human activity, the 'taste' bestowed by evolution is rarely well suited. An untrained human has no instinct at all for the kinds of experiments that are most likely to yield useful information about the behaviour of a new material or the structure of an unseen mathematical object! This applies equally to business activities and entrepreneurialism. Substantial experience is needed.
Do we see areas where 'taste' generalises, pointing against the claim of domain-specificity? The broad principles of science and engineering appear to generalise across domains, and evidence suggests that individual humans and human organisations vary in their latent potential to accrue and apply research taste. This might be down to being more or less motivated to explore, having different capacity to learn from experience, or varying procedures for planning next steps. This gives rise to an appearance of research taste generality. But domain-specific research taste is mastered only through domain-specific experience. Expert researchers in one area may contribute to other areas - but almost always only after gaining some depth of familiarity with the new area as well.
So it's reasonable to think of exploration quality as comprising two subfactors. First, the somewhat transferable general principles of exploration: playfulness, open-mindedness, planning for novelty and interestingness. And second, domain-specific research taste: the experience that guides determination of what situations count as novel or interesting8, and what types of planning are most likely to uncover them.
Exploration in AI, past and future
First: humans curate data — Now: RL allows automatic data generation — Next?: in-context exploration characterises R&D tasks — Perhaps this is ‘AGI’?
In contemporary frontier AI systems, humans have mostly been responsible for gathering 'high quality' informative data, often in quite hands-off ways like scraping huge datasets from the internet, but latterly with more attention on the procurement and curation of especially informative or exemplary data.
With reinforcement learning (RL), the data coming in starts to rely increasingly on the activity of the system itself - together with whatever grading mechanism is in place. That's why lots of RL conversations of the past were so obsessed with exploration: taking judicious actions to get the most informative observations! So earlier AI research actually foregrounded exploration somewhat more. Helen Toner recently discussed the return of RL to centre stage in contemporary frontier AI, asking what properties of a domain make it more or less amenable to gains from reinforcement learning.
Still, in many RL settings, the human engineers are able to curate training environments with high-signal automated feedback systems, as Toner discusses. On the other hand, once we're talking about activities like R&D of various kinds, exploring is inherently most of the task itself, making within-context exploration essential!
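For readers who haven't met the classic version of 'taking judicious actions to get the most informative observations': in a multi-armed bandit, a UCB-style rule adds an optimism bonus to under-tried actions, so the agent deliberately spends some of its budget gathering information rather than only exploiting whatever already looks best. A minimal sketch of textbook UCB1 (nothing specific to the frontier systems discussed here):

```python
import math
import random

true_payoffs = [0.2, 0.5, 0.8]        # unknown to the agent
counts = [0] * len(true_payoffs)      # times each action has been tried
totals = [0.0] * len(true_payoffs)    # summed reward per action

def ucb1(t):
    # Try every action at least once, then pick by mean reward plus
    # an exploration bonus that shrinks as an action is tried more.
    for action, n in enumerate(counts):
        if n == 0:
            return action
    return max(
        range(len(counts)),
        key=lambda a: totals[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]),
    )

for t in range(1, 1001):
    action = ucb1(t)
    reward = 1.0 if random.random() < true_payoffs[action] else 0.0
    counts[action] += 1
    totals[action] += reward

print(counts)  # pulls concentrate on the best arm, without ever abandoning the others
```

R&D-like domains are much harder than bandits precisely because there's no hand-built reward signal telling you which 'arm' paid off - but the underlying trade-off between exploiting what you know and probing what you don't is the same.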
This makes 'learning to learn', or in particular 'learning to explore/experiment', among the most useful ways to operationalise 'AGI', from my perspective9. I'm not sure how best to track this, and I'm not aware of any benchmarks or studies which take this view on frontier general AI10. My personal experience with LM agents anecdotally suggests they have improved over time at orienting to uncertainties in their environment, and were a little more creative at trying and testing things in 2025 than in 2024 or 2023 - but not vastly; progress to date appears much more rapid in 'crystallised' intelligence.
Research by AI: AI with research taste?
Bootstrapping research taste from humans — AI advantages from speed and copying — AI learning by doing — Human advantages and bottlenecks to AI — Human-AI complementary workflows
There may be ways for AI training datasets to 'hoover up' research taste from existing experts and institutions, perhaps from lab notes or interviews, though humans at least usually learn more from actually trying research than merely from reading or talking about it. (This presumably reflects the fact that merely communicating about research experience is a much less rich source of information than actually experiencing it directly: the same issue faced by all kinds of knowledge transfer through limited media like language.)
So research taste in AI is not starting from scratch: already AI can talk in sensible, albeit sometimes basic, ways about experiment design. The taste is bootstrapped from the taste implied by all the hints and observations in training data.
Could AI surpass the research taste exhibited by expert humans and human organisations? It's unclear where the ceiling is, but AI would certainly appear to have several advantages in principle: direct sharing of observations and experiences between instances, potentially far larger effective 'researcher headcount', and total observation quantity far outstripping the longest-lived human experts (to date)11, all adding up to a far greater opportunity to accrue and accumulate taste. Additionally, computer speed affords the opportunity to confer and deliberate in far greater depth on the implications of each experiment and on the design of future experiments, so exploratory planning could also be boosted.
Crucially, acquiring frontier-applicable research taste would require either finding ways to bootstrap from existing research taste, which is often implicit (or even proprietary!), or enabling AI to learn by doing, perhaps aided by expert supervisors (just as human trainee researchers are), by instrumenting research processes and equipment with sensors and manipulators. Like hiring junior researchers, this would come with some upfront costs to any organisation attempting it12!
ChatGPT's interpretation of an AI with better research taste than human organisations.
Human researchers begin with some advantages today: easier physical manipulation of experimental materials (for now), a capital base of experimental equipment designed for human use, and an ecosystem designed around the training, retention, and interaction of human experts. These aren't fundamental barriers to researcher AIs, but represent some hurdles or bottlenecks that might take time and other resources to reach past.
Of course, the capacities to interpret evidence, propose experiments, design and refine proposals, and implement experiments need not reside 'in the same mind', just as human organisations already exhibit this division of labour. But the better fitted these pieces are to each other, the more efficient the overall system will be. Drexler's 'large knowledge models' discussion treats knowledge as a resource, to be combined with planning capacity and discernment from disparate sources. Similar agendas, for example from the UK's Advanced Research and Invention Agency (ARIA), perhaps promise both a more effective and a more safely manageable way to integrate AI into research processes than wholesale development of autonomous researcher AIs.
Opportunities
Recapping research, experimentation, exploration, taste — Implications for AI forecasting and ‘intelligence explosion’ — Differentially bootstrapping AI taste — Differentially complementing AI exploration — Detecting dangerous research — Exploring AI applications for flourishing
Deliberate experimentation, consisting of exploratory planning and research taste, is a critical component of efficient learning - and since R&D-heavy domains, at least, inherently butt against the boundaries of the known, efficient learning there is foundational to progress.
Much more can and should be said about the implications of an experimentation-oriented view of R&D, both on AI and facilitated by AI in other domains. Here are some initial directions:
First, in forecasting AI capabilities and timelines, we should account for the costs of experimentation. This can include quantifying the relevant variables (iteration speed; quality of simulation, modelling, and exploratory planning; accrual and accumulation of research taste; the cost of experimental resources including compute and real-world interactions; etc.). Of particular interest, this could help to characterise the potential for 'self' improvement and the possibility of an intelligence explosion (which matter for their implications for other R&D and for loss of control over AI systems).
You can't skip exploration! But greater intelligences (individual or collective) can be more efficient at it in general, and domain-specific taste in particular certainly yields improved rate of progress.
This cuts both ways for safety. You can't develop dangerous nanotech purely from first principles: you have to experiment, either in vitro or in silico. Unfortunately, nor can you generate new defensive vaccination, sterilisation, or biomonitoring paradigms without putting in the experimental legwork.
This may be revealing for those seeking to differentially drive beneficial and defensive research ahead of risky research. For example, exposing research logs and expert interviews to AI systems may yield a way to bootstrap specific kinds of research taste in AI. Alternatively, recognising the default taste-weakness but speed-advantage and general knowledge breadth of AI systems may suggest strategies for complementary human-AI workflows which could be both more effective and more manageable than naively attempting to create researcher AIs wholesale.
Beyond AI-driven exploratory planning and research taste, we should expect strong synergy with robotics, sensors, simulation, modelling, and other automation technologies, as complementary production factors in R&D progress. This is likely to naturally drive investment into these technologies, but may also provide opportunities to differentially unbottleneck AI multipliers in beneficial areas by devoting development to their specific complements.
Further, noting that technology can rarely be developed purely from first principles, intelligence and security organisations concerned about risky research directions may be able to anticipate the kinds of experiments that are likely to be useful, and therefore the kinds of resources and activities required to make progress in those areas. This may include flows and concentrations of certain machines or components, movements of specific rare materials, movement of human talent, or known side-effects of experiments. Where materials are very dual-use (such as concentrations of computing clusters), structured inspection, auditing, or transparency tools may aid in guaranteeing that only safe and sanctioned experiments are being carried out.
Finally, now is a great time to be experimenting with AI systems and their applications, especially for people who haven't traditionally paid attention to AI. Rapid developments mean that the extent of possibilities with current tech remains underexplored, and boosting defensive and beneficial applications ahead of risky ones is a great way to ensure that the future is better than it otherwise would be!
Thanks to Owen Cotton-Barratt and Jay Bailey for feedback and conversations on this topic.
This is because just flailing, or even just 'doing routine activities', gets you some novelty of observations; but directedly seeking informative circumstances at the boundaries of the known (which includes making novel, unpredictable events happen, as well as getting equipped with richer means to observe and record them, and perhaps preparing to deliberatively extract insight) turns out to mine vastly more insight per resource (time, materials, etc.). Hence science, but also hence individual human and animal playfulness, curiosity, adversarial exercises and drills (self-play-ish), and whatnot.
Notably, modelling efficiency and exploration quality are sometimes conflated as 'sample efficiency'. In the case of modelling efficiency it's about forming accurate and generalisable models from fewer observations (the classic machine learning sense of sample efficiency). For exploration quality, it's about gathering more informative observations from fewer environment interactions (a kind of 'sample efficiency' familiar from reinforcement learning).
Incidentally, throughput should not be underestimated - this is why industrial expansion often precedes and drives innovation progress as well as being a product of it. There are some very general patterns in 'industrial learning', such as Wright's Law, which describes consistent statistical relationships between the number of units produced and reductions in production cost. We might speculate that Wright's Law applies most in domains where the existing human research and development organisations are at the limits of their modelling efficiency and exploration quality, and that the remaining bottlenecks are mostly in experimental throughput.
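(For concreteness, the usual statement - my gloss, with notation not from the essay - is C(n) ≈ C(1) · n^(−b), where C(n) is the cost of the n-th unit produced and b is an empirically fitted learning exponent, so each doubling of cumulative production multiplies unit cost by roughly 2^(−b).)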
Speaking of 'taste', this is a little like the difference between a good chef and a good food critic. The chef needs to be able to come up with good recipes, while the critic needs to be able to tell which recipes are good and which are bad. In concert (perhaps adversarially!), they can create and refine recipes that are more likely to be successful.
If you have a perfect simulation of the relevant domain, you can run experiments in the simulation. This looks a bit like skipping experimentation: certainly it can be faster. In a softer sense, a useful but imperfect model can also support reasoning about experiments and potential outcomes. In my taxonomy, both of these are part of the continuum of using world modelling, planning, and some amount of taste to guide exploration.
While we're talking in economic terms, it's worth noting that research taste is a kind of capital. It can even depreciate over time! This happens in two ways. Intrinsically, as the frontier of research moves, what were formerly good intuitions may become outdated. Additionally, individual humans, currently major (though not exclusive) repositories of research taste, age, get distracted, or otherwise lose their edge. In steady fields, depreciation is slow. In fast-moving fields like AI, the frontier advances quickly and taste depreciation can be very rapid, making accrual and accumulation of taste especially important.
My baby son is evidently thrilled by the challenge of 'balancing' (with some support) upright, a feat he can't yet accomplish, but which is unsurprisingly the kind of activity his brain is eager to get practice at. He instinctively pays close attention to new sights and sounds. His once-flailing hands now grasp interesting objects and begin to manipulate them. When he begins crawling and then toddling, he'll join generations of baby humans in enjoying the most prolonged and diversely playful childhoods of any young animal.
interesting, i.e. carrying high value of information for the domain in question.
(Of course there are nevertheless also many transformative impacts that can come from AI merely with heaps of crystallised intelligence and less R&D ability. For example, we could imagine an interesting possible paradigm in which humans continue for some time to provide input on informative experiment design, while delegating aspects like experiment implementation and interpretation to automated systems. Also note that some crystallised knowledge is currently very rare and concentrated, while if present in AI systems could be much more widely accessible, for better or worse.)
Scattered RL studies set out to evaluate or demonstrate the exploration potential of various RL algorithms, usually in toy environments. The ARC-AGI benchmarks test sample efficiency, which may be an important component of effectively accruing 'taste', but is not directly about exploration.
In fact another relevant comparison may not be between AI and individual humans, but between AI and human research organisations and institutions. Human organisations can of course already outlive individual humans: to say nothing of the broader intergenerational projects of science and research. But communication of research taste and experience between humans is constrained, and while committees of experts sometimes outperform individuals, they are slow and far from able to directly share their relevant experiences. When will AI services be able to supplement or replace particular human research tasks? And what about entire research organisations?
The raw sample efficiency of base machine learning methods like gradient descent is famously much lower than that of humans, meaning that AI 'junior researchers' could naively be even more costly to upskill than human ones. But as model capacity is scaled up, this may be changing. And, speculatively, the possibilities of lightweight finetuning, 'in-context' learning, and distillation point towards AI systems matching or exceeding human sample efficiency.