A Brief Introduction to Multi-Armed Bandit Problem
Speaker: Setareh Maghsudi, TU Berlin
The multi-armed bandit (MAB) problem is a class of sequential optimization problems in which, in successive rounds, a gambler pulls an arm from a given set of arms to receive some a priori unknown reward. The gambler observes only the reward of the played arm. As a result of this information shortage, there might be a difference between the maximum achievable reward and the reward of the arm actually played, which is referred to as regret. The gambler intends to optimize some regret-based objective function over the horizon by selecting arms according to an online decision-making policy. The policy must therefore balance gathering immediate rewards against obtaining information that enables large rewards in the future, a trade-off known as the exploration-exploitation dilemma. The MAB problem finds application in a wide variety of fields, ranging from science to engineering and the computational humanities. Although the problem has been under study for nearly a century, research effort has grown immensely in the past decade, driven by the rise of artificial intelligence, big data, and digitalization in sectors such as health and education. In this talk, I provide a brief introduction to the multi-armed bandit problem and review its basic settings and applications. I then discuss some of the seminal decision-making strategies that often serve as the basis for solving MAB problems.
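To make the setting concrete, the following is a minimal sketch of one seminal index policy of the kind referred to above, UCB1, run on a simulated Bernoulli bandit. The arm means, horizon, and function name are illustrative assumptions, not material from the talk, and the regret reported is the expected (pseudo-)regret with respect to the best arm.

    import math
    import random

    def ucb1(arm_means, horizon, seed=0):
        """Play a simulated Bernoulli bandit with the UCB1 index policy.

        arm_means: true success probabilities of each arm (unknown to the policy).
        horizon:   number of rounds to play.
        Returns the cumulative expected regret of the arms actually played.
        """
        rng = random.Random(seed)
        k = len(arm_means)
        counts = [0] * k       # number of times each arm was played
        sums = [0.0] * k       # sum of observed rewards per arm
        best_mean = max(arm_means)
        regret = 0.0

        for t in range(1, horizon + 1):
            if t <= k:
                arm = t - 1    # play each arm once to initialize
            else:
                # UCB1 index: empirical mean plus an exploration bonus
                arm = max(
                    range(k),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]),
                )
            reward = 1.0 if rng.random() < arm_means[arm] else 0.0
            counts[arm] += 1
            sums[arm] += reward
            regret += best_mean - arm_means[arm]

        return regret

    if __name__ == "__main__":
        # Arm means below are arbitrary illustrative values.
        print(ucb1([0.2, 0.5, 0.7], horizon=10_000))

The exploration bonus shrinks as an arm is played more often, so the policy gradually shifts from exploring all arms to exploiting the empirically best one.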