Regret Bounds and Minimax Policies under Partial Monitoring

Jean-Yves Audibert 1, 2, 3 Sébastien Bubeck 4
1 IMAGINE [Marne-la-Vallée]
LIGM - Laboratoire d'Informatique Gaspard-Monge, CSTB - Centre Scientifique et Technique du Bâtiment, ENPC - École des Ponts ParisTech
3 WILLOW - Models of visual object recognition and scene understanding
CNRS - Centre National de la Recherche Scientifique : UMR8548, Inria Paris-Rocquencourt, DI-ENS - Département d'informatique de l'École normale supérieure
4 SEQUEL - Sequential Learning
LIFL - Laboratoire d'Informatique Fondamentale de Lille, Inria Lille - Nord Europe, LAGIS - Laboratoire d'Automatique, Génie Informatique et Signal
Abstract : This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x)=exp(η x) + γ/K, INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results while for the expected regret we slightly tighten the bounds. On the other hand with ψ(x)=(η/-x)q + γ/K, which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandits games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays.
Type de document :
Article dans une revue
Journal of Machine Learning Research, Journal of Machine Learning Research, 2010, 11, pp.2785-2836
Liste complète des métadonnées

Littérature citée [15 références]  Voir  Masquer  Télécharger

https://hal-enpc.archives-ouvertes.fr/hal-00654356
Contributeur : Jean-Yves Audibert <>
Soumis le : mercredi 21 décembre 2011 - 17:08:02
Dernière modification le : jeudi 7 février 2019 - 17:18:54
Document(s) archivé(s) le : jeudi 22 mars 2012 - 02:31:29

Fichier

JMLR10.pdf
Accord explicite pour ce dépôt

Identifiants

  • HAL Id : hal-00654356, version 1

Citation

Jean-Yves Audibert, Sébastien Bubeck. Regret Bounds and Minimax Policies under Partial Monitoring. Journal of Machine Learning Research, Journal of Machine Learning Research, 2010, 11, pp.2785-2836. 〈hal-00654356〉

Partager

Métriques

Consultations de la notice

949

Téléchargements de fichiers

349