Best Arm Identification in Multi-Armed Bandits

Jean-Yves Audibert 1, 2, 3 Sébastien Bubeck 4
1 IMAGINE [Marne-la-Vallée]
CSTB - Centre Scientifique et Technique du Bâtiment, LIGM - Laboratoire d'Informatique Gaspard-Monge, ENPC - École des Ponts ParisTech
2 WILLOW - Models of visual object recognition and scene understanding
CNRS - Centre National de la Recherche Scientifique : UMR8548, Inria Paris-Rocquencourt, DI-ENS - Département d'informatique de l'École normale supérieure
4 SEQUEL - Sequential Learning
LIFL - Laboratoire d'Informatique Fondamentale de Lille, Inria Lille - Nord Europe, LAGIS - Laboratoire d'Automatique, Génie Informatique et Signal
Abstract : We consider the problem of finding the best arm in a stochastic multi-armed bandit game. The regret of a forecaster is here defined by the gap between the mean reward of the optimal arm and the mean reward of the ultimately chosen arm. We propose a highly exploring UCB policy and a new algorithm based on successive rejects. We show that these algorithms are essentially optimal since their regret decreases exponentially at a rate which is, up to a logarithmic factor, the best possible. However, while the UCB policy needs the tuning of a parameter depending on the unobservable hardness of the task, the successive rejects policy benefits from being parameter-free, and also independent of the scaling of the rewards. As a by-product of our analysis, we show that identifying the best arm (when it is unique) requires a number of samples of order (up to a log(K) factor) Σ i 1/Δ2i, where the sum is on the suboptimal arms andΔi represents the difference between the mean reward of the best arm and the one of arm i. This generalizes the well-known fact that one needs of order of 1/Δ2 samples to differentiate the means of two distributions with gap Δ.
Type de document :
Communication dans un congrès
Liste complète des métadonnées

Littérature citée [7 références]  Voir  Masquer  Télécharger

https://hal-enpc.archives-ouvertes.fr/hal-00654404
Contributeur : Jean-Yves Audibert <>
Soumis le : mercredi 21 décembre 2011 - 18:30:59
Dernière modification le : jeudi 21 février 2019 - 10:52:49
Document(s) archivé(s) le : jeudi 22 mars 2012 - 02:30:55

Fichier

COLT10.pdf
Fichiers produits par l'(les) auteur(s)

Identifiants

  • HAL Id : hal-00654404, version 1

Citation

Jean-Yves Audibert, Sébastien Bubeck. Best Arm Identification in Multi-Armed Bandits. COLT - 23th Conference on Learning Theory - 2010, Jun 2010, Haifa, Israel. 13 p. ⟨hal-00654404⟩

Partager

Métriques

Consultations de la notice

2421

Téléchargements de fichiers

2045