Motion Embedded Images: An Approach to Capture Spatial and Temporal Features for Action Recognition
DOI: https://doi.org/10.31449/inf.v47i3.4755

Abstract
The demand for human activity recognition (HAR) from videos has surged across real-life applications, including video surveillance, healthcare, and elderly care. The explosion of short-form videos on social media platforms has further intensified interest in this domain. This research focuses on the problem of HAR in general short videos. In contrast to still images, video clips carry both spatial and temporal information, which makes it challenging to extract complementary cues about appearance from still frames and motion between frames. This research makes a two-fold contribution. First, we investigate the use of motion-embedded images in a variant of the two-stream Convolutional Neural Network architecture, in which one stream captures motion from combined batches of frames, while the other applies an ordinary image-classification ConvNet to static appearance. Second, we create a novel dataset of Southeast Asian sports short videos that contains both videos with and without visual effects, a modern factor lacking in all currently available benchmark datasets. The proposed model is trained and evaluated on two benchmarks, UCF-101 and SEAGS-V1. The results reveal that the proposed model yields competitive performance compared with prior attempts to address the same problem.
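The abstract only outlines the two-stream design, so the following is a minimal PyTorch sketch of one plausible reading of it, not the authors' implementation: the class name TwoStreamNet, the ResNet-18 backbones, the choice of the middle frame for the spatial stream, channel-wise frame stacking as the motion embedding, and score averaging for fusion are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): one stream classifies a
# still frame for appearance; the other consumes a "motion-embedded image"
# built by fusing a batch of consecutive frames into a single tensor.
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes: int, frames_per_clip: int = 5):
        super().__init__()
        # Spatial stream: an ordinary image-classification ConvNet on one frame.
        self.spatial = models.resnet18(weights=None)
        self.spatial.fc = nn.Linear(self.spatial.fc.in_features, num_classes)
        # Motion stream: same backbone, but its first convolution is widened to
        # accept `frames_per_clip` RGB frames stacked along the channel axis.
        self.motion = models.resnet18(weights=None)
        self.motion.conv1 = nn.Conv2d(3 * frames_per_clip, 64,
                                      kernel_size=7, stride=2, padding=3,
                                      bias=False)
        self.motion.fc = nn.Linear(self.motion.fc.in_features, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, frames, 3, H, W)
        b, t, c, h, w = clip.shape
        still = clip[:, t // 2]                # middle frame for appearance
        motion = clip.reshape(b, t * c, h, w)  # frames fused channel-wise
        # Late fusion: average the two streams' class scores.
        return (self.spatial(still) + self.motion(motion)) / 2

model = TwoStreamNet(num_classes=101)            # e.g. 101 classes for UCF-101
scores = model(torch.randn(2, 5, 3, 224, 224))   # dummy clip batch
print(scores.shape)                              # torch.Size([2, 101])
```

The fusion strategy and backbone in the actual model may differ; the sketch only illustrates how a motion-embedded image can reuse a standard image-classification backbone once the first convolution is widened to take the stacked frames.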