Regret Bounds and Minimax Policies under Partial Monitoring

Jean-Yves Audibert 1, 2, 3 Sébastien Bubeck 4
1 IMAGINE [Marne-la-Vallée]
LIGM - Laboratoire d'Informatique Gaspard-Monge, CSTB - Centre Scientifique et Technique du Bâtiment, ENPC - École des Ponts ParisTech
3 WILLOW - Models of visual object recognition and scene understanding
DI-ENS - Département d'informatique de l'École normale supérieure, Inria Paris-Rocquencourt, CNRS - Centre National de la Recherche Scientifique : UMR8548
4 SEQUEL - Sequential Learning
LIFL - Laboratoire d'Informatique Fondamentale de Lille, LAGIS - Laboratoire d'Automatique, Génie Informatique et Signal, Inria Lille - Nord Europe
Abstract : This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x)=exp(η x) + γ/K, INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results while for the expected regret we slightly tighten the bounds. On the other hand with ψ(x)=(η/-x)q + γ/K, which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandits games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays.
Type de document :
Article dans une revue
Liste complète des métadonnées

Littérature citée [15 références]  Voir  Masquer  Télécharger
Contributeur : Jean-Yves Audibert <>
Soumis le : mercredi 21 décembre 2011 - 17:08:02
Dernière modification le : jeudi 27 juin 2019 - 13:36:06
Document(s) archivé(s) le : jeudi 22 mars 2012 - 02:31:29


Accord explicite pour ce dépôt


  • HAL Id : hal-00654356, version 1


Jean-Yves Audibert, Sébastien Bubeck. Regret Bounds and Minimax Policies under Partial Monitoring. Journal of Machine Learning Research, Microtome Publishing, 2010, 11, pp.2785-2836. ⟨hal-00654356⟩



Consultations de la notice


Téléchargements de fichiers