Are you more adventurous or predictable in your decision making?
The explore-exploit dilemma exists in all areas of life. When you open a streaming service to choose a film, you might explore different shows in search of an enjoyable one, or you might exploit one you’ve already seen and know you’ll enjoy again. Another example is choosing to go to a favorite restaurant over trying a new one that could be better—or worse.
One way to decide is to weigh the benefits of a known option (I know this restaurant does this dish well and I will enjoy it) against the information available for sampling something new (I have never been to this restaurant but the reviews look good). The other way to decide is much more random, say with the toss of a coin. The first approach is ‘directed exploration’; the second is ‘random exploration.’
Choosing at Random
Random exploration in human behavior has been studied in less detail than directed exploration. But in a new study conducted by Khalifa University’s Dr. Samuel Feng, Assistant Professor of Mathematics, in collaboration with the Neuroscience of Reinforcement Learning and Decision Making Lab led by Dr. Robert Wilson at the University of Arizona in the United States, researchers investigated the dynamics of random exploration in humans, asking how and why the brain sometimes makes seemingly random choices.
Their study was published recently in Nature’s Scientific Reports, and has implications for reinforcement learning, a rapidly growing paradigm of machine learning and artificial intelligence.
Deciding when to explore more and when to stop and exploit what is available is a key factor in many decisions, and understanding what controls random exploration can help decision making in all aspects of life, from dinner to business. In this research, Dr. Feng and colleagues shed light on the mystery of how to design machines and algorithms that mimic human decision making for the challenging class of explore-exploit decisions.
In their words: “When choosing a class in college, should you exploit the math class you are sure to ace, or explore the photography class you know nothing about? Exploiting math may be the way to a better grade, but exploring photography—and finding that it scratches an itch you never knew you had—could be the path to a better life. As with all such ‘explore-exploit’ decisions, picking the optimal class is hard—explore too much and you’ll never finish your degree. Exploit too much, and like us, you will do math for the rest of your life.”
The explore-exploit trade-off has a rich history in computational neuroscience research. It involves choosing between a familiar option with a known reward value and an unfamiliar option with an unknown or uncertain reward value. Exploitation maximizes rewards in the near-term, while the information obtained during exploration can be used to maximize rewards in the long-term.
But exploration is labor-intensive and time-consuming. How long should we explore? When should we start exploiting? In other words, in an uncertain and changing environment, where values of all potential options are unknown or the values of these options change over time, maintaining efficient performance requires flexibly alternating between exploration and exploitation.
“From a computational perspective, the difficulty of explore-exploit decisions arises due to uncertainty about the outcome of each choice (will I like photography?) and the long time horizon over which the consequences of a choice can play out (if I like photography, should I change my major?),” the researchers write in their paper. “To make an ‘optimal decision’ that maximizes our expected future reward, we need to average over all possible futures out to some time horizon. However, this requires us to mentally simulate all possible futures, and that is surely beyond what any human brain can perform.”
Modeling Decision Making
Dr. Feng and the research team used mathematical models to break down the problem: “Mathematical modeling of these decisions provides quantitative assessment of underlying behavioral mechanisms. Inspired by research in machine learning, recent findings in psychology suggest that humans use two strategies to make explore-exploit decisions: an explicit bias for information (‘directed exploration’) and the randomization of choice (‘random exploration’). Despite differences in implementation, both strategies are driven by the same goal: increasing reward in the long run.”
They developed a model of explore-exploit behavior to investigate how random exploration could be controlled, assuming that the decision between exploitation and exploration is accomplished by accumulating evidence over time. In this context, behavioral variability when making choices can be controlled by three different parameters. Current experimental data, however, cannot distinguish which of these parameters the brain actually adjusts. Instead, the results show that these explore-exploit decisions are likely driven by a ‘signal-to-noise’ mechanism within the brain.
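The evidence-accumulation idea can be illustrated with a simple drift-diffusion sketch. This is not the authors’ actual model, and the parameter names (`drift`, `noise`, `threshold`) and numbers are invented for illustration; the point is only that choice variability depends on the ratio of the drift signal to the accumulator noise, so weakening the signal relative to the noise produces more random-looking choices.

```python
import math
import random

def simulate_choice(drift, noise, threshold, rng, dt=0.01):
    """Accumulate noisy evidence until one of two bounds is hit.

    Returns True if the +threshold ('exploit') bound is reached,
    False if the -threshold ('explore') bound is reached.
    Parameter names are illustrative, not from the paper.
    """
    x = 0.0
    while abs(x) < threshold:
        x += drift * dt + noise * math.sqrt(dt) * rng.gauss(0, 1)
    return x > 0

def exploit_rate(drift, noise, threshold, trials=2000, seed=1):
    """Fraction of simulated decisions that end at the 'exploit' bound."""
    rng = random.Random(seed)
    hits = sum(simulate_choice(drift, noise, threshold, rng) for _ in range(trials))
    return hits / trials

# Strong signal relative to noise: choices are consistent (mostly exploit).
consistent = exploit_rate(drift=1.0, noise=1.0, threshold=1.0)
# Weaker signal, same noise: choices become more variable,
# i.e. more random exploration.
variable = exploit_rate(drift=0.2, noise=1.0, threshold=1.0)
print(consistent, variable)
```

Because only the signal-to-noise ratio matters here, the same increase in variability could come from lowering the drift or raising the noise, which mirrors why behavioral data alone cannot pin down a single parameter.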
“In directed exploration, a decision is made by comparing the expected values of exploring and exploiting,” explained Dr. Feng. “These expected values combine the predicted short-term payoff from picking an option once, the ‘expected reward,’ with an estimate of the long-term value of the information obtained from choosing that option, the ‘information bonus,’ also known as the future expected value. The information bonus increases the value of exploratory options such that a directed explorer will find exploration more appealing. In contrast, in random exploration, the tendency to exploit the option with the highest short-term expected reward is countered by ‘noise’ in the decision process. This noise introduces random variability to the decision, which leads to exploration by chance.”
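Dr. Feng’s description of the two mechanisms can be sketched in code. The reward values, information bonus, and noise levels below are hypothetical illustrations, not quantities fitted in the study: directed exploration adds a bonus to the uncertain option’s value, while random exploration leaves the values alone and injects noise into the choice rule.

```python
import math
import random

def softmax_choice(values, noise, rng):
    """Pick an option index with probability proportional to exp(value / noise).

    Higher 'noise' flattens the probabilities, so lower-valued options
    get chosen by chance more often (random exploration).
    """
    scaled = [v / noise for v in values]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(weights) - 1

rng = random.Random(0)

# Short-term expected rewards: option 0 is familiar and good,
# option 1 is unfamiliar with a lower reward estimate.
expected_reward = [1.0, 0.6]

# Directed exploration: an information bonus raises the value of the
# uncertain option, making exploration explicitly more attractive.
info_bonus = [0.0, 0.5]
directed_values = [r + b for r, b in zip(expected_reward, info_bonus)]

# Random exploration: no bonus, but a noisy decision rule sometimes
# selects the lower-valued option anyway.
n = 10000
noisy_rate = sum(softmax_choice(expected_reward, noise=0.5, rng=rng) for _ in range(n)) / n
greedy_rate = sum(softmax_choice(expected_reward, noise=0.05, rng=rng) for _ in range(n)) / n
print(directed_values, noisy_rate, greedy_rate)
```

With the bonus, the unfamiliar option’s total value overtakes the familiar one; with noise, the unfamiliar option is still chosen a substantial fraction of the time even though its value never changes.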
Getting through the Noise
Noise is common in human decision making because humans are unreliable decision makers, strongly influenced by irrelevant factors such as their current mood, hunger, and even the weather. Some decisions are nearly noise-free because they follow strict rules that limit subjective judgement. Others are ‘a matter of judgement.’
A key feature of both types of exploration is that they appear to be subject to cognitive control: when it is more valuable to explore, people exhibit more information seeking (directed) and more variability in their behavior (random). Exactly how the brain achieves this control of directed and random exploration is unknown, and the signal-to-noise mechanism suggested by this study sheds some light on how the brain processes these types of decisions.
“When it is valuable to explore, the representation of reward cues—or at least, the extent to which these cues are incorporated into the decision—is reduced, leading to more random exploration overall,” explained Dr. Feng.
These findings are useful for programming machines that mimic human decision making.
“In reinforcement learning, an agent tries to maximize their reward by choosing beneficial actions within some environment. It is a growing paradigm driving many AI applications including gaming (computers playing chess/go), robotics, advertising, and computational chemistry,” explained Dr. Feng. “In such applications, our goal is not only to maximize numeric performance but also to design machines and algorithms that learn and decide like humans.”
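The reinforcement-learning setting Dr. Feng describes can be illustrated with a minimal two-armed bandit agent. The epsilon-greedy rule below is a standard textbook example of exploration by randomization, not the method from the study, and the arm payoffs and parameters are invented for illustration.

```python
import random

def run_bandit(epsilon, arm_means=(0.3, 0.7), steps=5000, seed=0):
    """Epsilon-greedy agent on a two-armed Bernoulli bandit.

    With probability epsilon the agent explores a random arm;
    otherwise it exploits the arm with the highest reward estimate.
    Returns the average reward per step.
    """
    rng = random.Random(seed)
    estimates = [0.0, 0.0]
    counts = [0, 0]
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(arm_means))    # explore
        else:
            arm = estimates.index(max(estimates))  # exploit
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        # Incremental running average of observed rewards for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total / steps

never_explores = run_bandit(epsilon=0.0)       # can get stuck on the worse arm
sometimes_explores = run_bandit(epsilon=0.1)   # discovers the better arm
print(never_explores, sometimes_explores)
```

The purely exploiting agent can lock onto the first arm it tries, while a small amount of random exploration lets the agent discover and then exploit the better arm—the same trade-off the human study probes.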
15 March 2021