D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, 2015.

Y. Bengio, J. Louradour, R. Collobert, and J. Weston, Curriculum learning, Proceedings of the 26th annual international conference on machine learning, pp.41-48, 2009.

H. Bilen and A. Vedaldi, Weakly supervised deep detection networks, CVPR, 2016.

J. Carreira and A. Zisserman, Quo vadis, action recognition? a new model and the kinetics dataset, CVPR, 2017.

R. G. Cinbis, J. Verbeek, and C. Schmid, Weakly supervised object localization with multi-fold multiple instance learning, IEEE transactions on pattern analysis and machine intelligence, vol.39, pp.189-203, 2017.

R. Girdhar and D. Ramanan, Attentional pooling for action recognition, Advances in Neural Information Processing Systems, pp.33-44, 2017.

G. Gkioxari, R. Girshick, and J. Malik, Contextual action recognition with R*CNN, Proceedings of the IEEE international conference on computer vision, pp.1080-1088, 2015.
DOI : 10.1109/iccv.2015.129
URL : http://arxiv.org/pdf/1505.01197

W. Hamilton, Z. Ying, and J. Leskovec, Inductive representation learning on large graphs, Advances in Neural Information Processing Systems, pp.1025-1035, 2017.

F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, Activitynet: A large-scale video benchmark for human activity understanding, CVPR, 2015.
DOI : 10.1109/cvpr.2015.7298698
URL : https://repository.kaust.edu.sa/bitstream/10754/556141/1/ActivityNet_CVPR2015.pdf

K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay et al., Teaching machines to read and comprehend, Advances in Neural Information Processing Systems, pp.1693-1701, 2015.

Y. Jiang, J. Liu, A. Zamir, G. Toderici, I. Laptev et al., THUMOS challenge: Action recognition with a large number of classes, 2014.

V. Kantorov, M. Oquab, M. Cho, and I. Laptev, ContextLocNet: Context-aware deep network models for weakly supervised localization, ECCV, 2016.
DOI : 10.1007/978-3-319-46454-1_22
URL : https://hal.archives-ouvertes.fr/hal-01421772

W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier et al., The kinetics human action video dataset, 2017.

Y. Kim, C. Denton, L. Hoang, and A. Rush, Structured attention networks, 2017.

D. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

S. Kong and C. Fowlkes, Low-rank bilinear pooling for fine-grained classification, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.7025-7034, 2017.
DOI : 10.1109/cvpr.2017.743
URL : http://arxiv.org/pdf/1611.05109

A. Mensch and M. Blondel, Differentiable dynamic programming for structured prediction and attention, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01809550

P. Nguyen, T. Liu, G. Prasad, and B. Han, Weakly supervised action localization by sparse temporal pooling network, 2018.
DOI : 10.1109/cvpr.2018.00706
URL : http://arxiv.org/pdf/1712.05080

M. Oquab, L. Bottou, I. Laptev, and J. Sivic, Is object localization for free? - weakly-supervised learning with convolutional neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.685-694, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01015140

P. O. Pinheiro and R. Collobert, From image-level to pixel-level labeling with convolutional networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1713-1721, 2015.

A. Richard and J. Gall, Temporal action detection using a statistical language model, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.3131-3140, 2016.
DOI : 10.1109/cvpr.2016.341

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., Imagenet large scale visual recognition challenge, International Journal of Computer Vision, vol.115, issue.3, pp.211-252, 2015.
DOI : 10.1007/s11263-015-0816-y
URL : http://arxiv.org/pdf/1409.0575

S. Sharma, R. Kiros, and R. Salakhutdinov, Action recognition using visual attention, 2015.

Z. Shou, D. Wang, and S. Chang, Temporal action localization in untrimmed videos via multi-stage cnns, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1049-1058, 2016.

Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S. Chang, CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos, CVPR, 2017.

Z. Shou, H. Gao, L. Zhang, K. Miyazawa, and S. Chang, Autoloc: Weakly-supervised temporal action localization in untrimmed videos, Proceedings of the European Conference on Computer Vision (ECCV), pp.154-171, 2018.

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, Advances in neural information processing systems, pp.568-576, 2014.

G. Singh and F. Cuzzolin, Untrimmed video classification for activity detection: submission to activitynet challenge, 2016.

K. K. Singh and Y. Lee, Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization, The IEEE International Conference on Computer Vision (ICCV), 2017.

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning spatiotemporal features with 3d convolutional networks, Proceedings of the IEEE international conference on computer vision, pp.4489-4497, 2015.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones et al., Attention is all you need, Advances in Neural Information Processing Systems, pp.6000-6010, 2017.

C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, The Caltech-UCSD Birds-200-2011 Dataset, 2011.

L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin et al., Temporal segment networks: Towards good practices for deep action recognition, European Conference on Computer Vision, pp.20-36, 2016.

L. Wang, Y. Xiong, D. Lin, and L. Van Gool, Untrimmednets for weakly supervised action recognition and detection, 2017.

R. Wang and D. Tao, UTS at ActivityNet 2016, ActivityNet Large Scale Activity Recognition Challenge, 2016.

Y. Wei, J. Feng, X. Liang, M. Cheng, Y. Zhao et al., Object region mining with adversarial erasing: A simple classification to semantic segmentation approach, IEEE CVPR, 2017.

Y. Xiong, Y. Zhao, L. Wang, D. Lin, and X. Tang, A pursuit of temporal accuracy in general activity detection, 2017.

H. Xu, A. Das, and K. Saenko, R-C3D: Region convolutional 3d network for temporal activity detection, 2017.

S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei, End-to-end learning of action detection from frame glimpses in videos, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.2678-2687, 2016.

J. Yuan, B. Ni, X. Yang, and A. A. Kassim, Temporal action localization with pyramid of score distribution features, CVPR, 2016.

Y. Yuan, X. Liang, X. Wang, D. Yeung, and A. Gupta, Temporal dynamic graph LSTM for action-driven video object detection, ICCV, pp.1819-1828, 2017.

Z. Yuan, J. Stroud, T. Lu, and J. Deng, Temporal action localization by structured maximal sums, CVPR, 2017.

D. Zhang, D. Meng, and J. Han, Co-saliency detection via a self-paced multiple-instance learning framework, IEEE transactions on pattern analysis and machine intelligence, vol.39, pp.865-878, 2017.

J. Zhang, X. Shi, J. Xie, H. Ma, I. King et al., Gaan: Gated attention networks for learning on large and spatiotemporal graphs, 2018.

X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. Huang, Adversarial complementary learning for weakly supervised object localization, 2018.
DOI : 10.1109/cvpr.2018.00144
URL : http://arxiv.org/pdf/1804.06962

Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang et al., Temporal action detection with structured segment networks, ICCV, 2017.
DOI : 10.1109/iccv.2017.317
URL : http://arxiv.org/pdf/1704.06228

B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, Object detectors emerge in deep scene cnns, 2014.

B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, Learning deep features for discriminative localization, Computer Vision and Pattern Recognition, 2016.
DOI : 10.1109/cvpr.2016.319
URL : http://arxiv.org/pdf/1512.04150

Y. Zhu, Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao, Soft proposal networks for weakly supervised object localization, 2017.
DOI : 10.1109/iccv.2017.204
URL : http://arxiv.org/pdf/1709.01829