Javier Selva

PhD Student at Universitat de Barcelona

I received my Bachelor's degree in Computer Science from UPV in 2015. Two years later I obtained my MSc in Artificial Intelligence from UB, UPC and URV. I am currently pursuing my PhD on learning video representations at the HuPBA lab under Prof. Sergio Escalera's supervision. My main research interests include Machine Learning, Natural Language Processing and Computer Vision. I am especially interested in learning algorithms, and in understanding and analyzing their properties for a more informed development of new techniques. I find unsupervised, multi-modal and generative approaches particularly interesting. More...


(arXiv Pre-print 2022) Video Transformers: A Survey
Javier Selva, Anders S. Johansen, Sergio Escalera, Kamal Nasrollahi, Thomas B. Moeslund and Albert Clapés
In this survey we analyse contributions and trends for adapting Transformers to model videos. We find a widespread use of large CNN backbones to reduce dimensionality, and the use of efficient Transformer layers focusing on reducing the number of tokens in single attention operations. Furthermore, we analyse the self-supervised losses used to train Video Transformers (VTs), explore how other modalities are integrated with video and conduct a performance comparison on action classification, finding VTs to outperform 3D CNN counterparts with equivalent computational complexity.
 @article{selva2022video,
  title={Video Transformers: A Survey},
  author={Javier Selva and Anders S. Johansen and Sergio Escalera and Kamal Nasrollahi and Thomas B. Moeslund and Albert Clap{\'e}s},
  journal={arXiv preprint},
  year={2022}
 }
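One family of efficiency tricks the survey covers, reducing the number of tokens inside a single attention operation, can be illustrated with a minimal NumPy sketch (single head, no learned projections; `pooled_attention` and the pooling factor are illustrative, not any specific architecture from the survey). Average-pooling the keys and values shrinks the score matrix from N x N to N x (N / pool):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pooled_attention(q, k, v, pool=4):
    """Single-head attention whose keys/values are average-pooled in
    groups of `pool` consecutive tokens, so each query attends over
    N / pool tokens instead of N."""
    n, d = k.shape
    k = k[: n - n % pool].reshape(-1, pool, d).mean(axis=1)
    v = v[: n - n % pool].reshape(-1, pool, d).mean(axis=1)
    scores = q @ k.T / np.sqrt(d)  # shape (N, N / pool)
    return softmax(scores) @ v     # shape (N, d)
```

The output keeps one token per query; only the attended set is compressed, which is why the cost drops linearly with the pooling factor.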
(ICCV 2021 - DYAD Workshop) Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions
David Curto*, Albert Clapés*, Javier Selva*, Sorina Smeureanu, Julio C. S. Jacques Junior, David Gallardo-Pujol, Georgina Guilera, David Leiva, Thomas B. Moeslund, Sergio Escalera and Cristina Palmero
We present the Dyadformer, a novel multi-modal multi-subject Transformer architecture to model individual and interpersonal features in dyadic interactions using variable time windows, thus allowing the capture of long-term interdependencies. Our proposed cross-subject layer allows the network to explicitly model interactions among subjects through attentional operations. This proof-of-concept approach shows how multi-modality and joint modeling of both interactants for longer periods of time helps to predict individual attributes.
 @inproceedings{curto2021dyadformer,
  author = {Curto, David and Clap{\'e}s, Albert and Selva, Javier and Smeureanu, Sorina and Junior, Julio C. S. Jacques and Gallardo-Pujol, David and Guilera, Georgina and Leiva, David and Moeslund, Thomas B. and Escalera, Sergio and Palmero, Cristina},
  title = {Dyadformer: A Multi-Modal Transformer for Long-Range Modeling of Dyadic Interactions},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month = {October},
  year = {2021},
  pages = {2177-2188}
 }
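The cross-subject idea above can be sketched as two cross-attention passes, one per interactant. This is a minimal NumPy sketch with a single head and no learned projections, residuals or normalization, and all names are hypothetical; the actual Dyadformer layer is considerably more elaborate:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Tokens in `queries` attend over tokens in `context`.
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores) @ context

def cross_subject_layer(x_a, x_b):
    # Each subject's representation is updated by attending over the
    # OTHER subject's tokens, explicitly modeling the interaction.
    return cross_attention(x_a, x_b), cross_attention(x_b, x_a)
```

Because queries come from one subject and keys/values from the other, each output token mixes in information from the partner's behavior, which is what lets the network model interpersonal dependencies.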
(WACV 2021 - HBU Workshop) Context-Aware Personality Inference in Dyadic Scenarios: Introducing the UDIVA Dataset
Cristina Palmero*, Javier Selva*, Sorina Smeureanu*, Julio C. S. Jacques Junior, Albert Clapés, Alexa Moseguí, Zejian Zhang, David Gallardo-Pujol, Georgina Guilera, David Leiva and Sergio Escalera
This paper introduces UDIVA, a new non-acted dataset of face-to-face dyadic interactions, where interlocutors perform competitive and collaborative tasks with different behavior elicitation and cognitive workload. The dataset consists of 90.5 hours of dyadic interactions among 147 participants distributed in 188 sessions, recorded using multiple audiovisual and physiological sensors. Currently, it includes sociodemographic, self- and peer-reported personality, internal state, and relationship profiling from participants.
[PDF] [arXiv] [Website]
 @inproceedings{palmero2021context,
  title={Context-Aware Personality Inference in Dyadic Scenarios: Introducing the UDIVA Dataset},
  author={Palmero, Cristina and Selva, Javier and Smeureanu, Sorina and Junior, Julio CS Jacques and Clap{\'e}s, Albert and Mosegu{\'\i}, Alexa and Zhang, Zejian and Gallardo-Pujol, David and Guilera, Georgina and Leiva, David and Escalera, Sergio},
  booktitle={2021 IEEE Winter Conference on Applications of Computer Vision Workshops (WACVW)},
  year={2021}
 }
(BMVC 2018) Recurrent CNN for 3D Gaze Estimation using Appearance and Shape Cues
Cristina Palmero, Javier Selva, Mohammad Ali Bagheri and Sergio Escalera
In this paper, we tackle the problem of person- and head pose-independent 3D gaze estimation from remote cameras, using a multi-modal recurrent convolutional neural network (CNN). We propose to combine face, eyes region, and face landmarks as individual streams in a CNN to estimate gaze in still images. Then, we exploit the dynamic nature of gaze by feeding the learned features of all the frames in a sequence to a many-to-one recurrent module that predicts the 3D gaze vector of the last frame.
[PDF] [arXiv] [Code]
 @inproceedings{palmero2018recurrent,
  title={Recurrent CNN for 3D Gaze Estimation using Appearance and Shape Cues},
  author={Palmero, Cristina and Selva, Javier and Bagheri, Mohammad Ali and Escalera, Sergio},
  booktitle={Proceedings of the British Machine Vision Conference (BMVC)},
  year={2018}
 }
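The two-stage design above (per-frame multi-stream features, then a many-to-one recurrent module) can be sketched as follows. This is a toy NumPy version in which the learned CNN streams are replaced by flattening and the recurrent module by a plain tanh RNN, so every name and shape here is illustrative rather than the paper's actual network:

```python
import numpy as np

def frame_features(face, eyes, landmarks):
    # Stand-in for the three per-frame streams (face crop, eyes
    # region, face landmarks); the paper learns these with CNNs.
    return np.concatenate([face.ravel(), eyes.ravel(), landmarks.ravel()])

def gaze_from_sequence(frames, w_x, w_h, w_out):
    # Many-to-one recurrence: consume every frame's features and
    # predict the 3D gaze vector of the LAST frame only.
    h = np.zeros(w_h.shape[0])
    for face, eyes, landmarks in frames:
        x = frame_features(face, eyes, landmarks)
        h = np.tanh(w_x @ x + w_h @ h)
    g = w_out @ h
    return g / np.linalg.norm(g)  # unit-norm 3D gaze direction
```

The many-to-one shape is the key point: the hidden state accumulates gaze dynamics across the whole sequence, but only one 3D vector, for the final frame, is emitted.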
(ECCV 2018) Folded Recurrent Neural Networks for Future Video Prediction
Marc Oliu, Javier Selva, and Sergio Escalera
This work introduces double-mapping Gated Recurrent Units (dGRU), an extension of standard GRUs where the input is considered as a recurrent state. An extra set of logic gates is added to update the input given the output. Stacking multiple such layers results in a recurrent auto-encoder: the operators updating the outputs comprise the encoder, while the ones updating the inputs form the decoder. Since the states are shared between corresponding encoder and decoder layers, the representation is stratified during learning: some information is not passed to the next layers.
[PDF] [arXiv] [Code]
 @inproceedings{oliu2018folded,
  title={Folded recurrent neural networks for future video prediction},
  author={Oliu, Marc and Selva, Javier and Escalera, Sergio},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2018}
 }
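The double mapping can be sketched as a standard GRU update applied in both directions: the usual gates update the output state from the input, and the extra set of gates updates the input state from the new output. This is a minimal NumPy sketch (single cell, biases omitted, parameter names hypothetical), not the paper's full folded auto-encoder:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_update(x, h, params):
    # Standard GRU cell: state h is updated given input x.
    Wz, Uz, Wr, Ur, Wc, Uc = params
    z = sigmoid(Wz @ x + Uz @ h)        # update gate
    r = sigmoid(Wr @ x + Ur @ h)        # reset gate
    c = np.tanh(Wc @ x + Uc @ (r * h))  # candidate state
    return (1.0 - z) * h + z * c

def dgru_step(x, h, fwd, bwd):
    # Double-mapping GRU (dGRU): the forward gates map input -> output
    # as usual, and an extra set of gates maps the new output back onto
    # the input state, so the input itself becomes recurrent.
    h_new = gru_update(x, h, fwd)
    x_new = gru_update(h_new, x, bwd)
    return x_new, h_new
```

Stacking such layers yields the recurrent auto-encoder described above: the forward updates play the role of the encoder and the backward updates the role of the decoder, with states shared between the two.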


Lecture Slides

Bio and Interests (Resume)

From a very early age I wanted to be a scientist, and the madder the better. As I grew older I gradually narrowed the path down to robots: I was especially interested in making their brains work. From there I went on to study Computer Science and obtained my Bachelor's degree at Universitat Politècnica de València (UPV - 2015), where I started working on NLP, performing sentiment analysis on Twitter content. I then moved to Barcelona, where I am currently based, to study a Master's in Artificial Intelligence. I was very passionate about the program, jointly offered by Universitat Politècnica de Catalunya (UPC), Universitat de Barcelona (UB) and Universitat Rovira i Virgili (URV), as it ranged from traditional symbolic AI to deep learning, going through multi-agent systems. This gave me an understanding of a wide range of AI techniques and finally allowed me to better define and understand what I was interested in: allowing machines to learn.

This got me looking back to neuroscience and how we humans actually learn. That interest steered me towards my Master's thesis, a survey on deep video frame prediction (2018), a task which follows predictive coding ideas to train neural networks for computer vision. After that I started a PhD, on which I am currently working under Prof. Sergio Escalera's supervision within the HuPBA lab: I am trying to find ways to learn better representations from video. I am curious by nature, and I really look forward to integrating knowledge from different fields into our ML algorithms and to working on interdisciplinary projects. So far I have been working on the UDIVA project, which involved very enriching work with psychologists. While I have found that human analysis may not be my preferred line of research, I have strongly enjoyed working with researchers from other fields, as it has helped me widen my research views beyond my own field.

My interests lie mostly within the NLP and CV fields, as I think their joint use could help machines not only gain a better understanding of the world, but also acquire symbolic reasoning capabilities. I am especially interested in learning representations, a process which I think could greatly benefit from deeply analyzing and understanding current models. I am particularly drawn to approaches leveraging reinforcement or (self/un)supervised techniques, possibly aided by strong generative models. I think we should be working towards more computationally efficient architectures that generalize better while requiring less data. This, in turn, is a step towards democratizing AI, which is currently heavily dependent on data that is not equally accessible to everyone. Within applied research, I would love to help humanity make better use of our resources and to work on problems that let us advance while taking better care of our environment.