Author:

Supervisor: Prof. Gudrun Klinker, Ph.D.
Advisor: Sandro Weber
Submission Date: 15.04.2024

Abstract

This thesis addresses the problem of recognizing user-performed actions in a Virtual Reality (VR) environment. Using the Unity game engine, a dataset of six actions was recorded with human participants. The six actions consist of three movements, throwing, waving, and pointing, each performed with the left and the right hand. The dataset was constructed using the OpenVR Unity plugin API, which provides the position, velocity, and rotation of the VR headset and the handheld controllers. These values were saved to CSV files and fed into a neural network with an LSTM (Long Short-Term Memory) layer and three dense layers. The network was trained multiple times with varying parameters, such as higher node counts in the layers, added dropout layers, and a more extensive dataset, and was then exported in the ONNX format and imported back into Unity, where the Barracuda framework runs it. Once imported, the network can be fed live input; it then predicts the performed action and prints the result to the console. This work serves as a proof of concept showing that action recognition in VR is possible. Topics for further research include instructing a robot to perform actions based on the network's output and extending the network to recognize more actions.
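
As a rough illustration of the input described above, the following Python sketch assembles one sampled frame from the position, velocity, and rotation of the headset and the two controllers. The quaternion rotation format and the resulting 30-value layout are assumptions made for illustration, not necessarily the exact feature layout used in the thesis.

import numpy as np

def frame_features(devices):
    """devices maps a device name to its (position[3], velocity[3], rotation quaternion[4])."""
    parts = []
    for name in ("headset", "left_controller", "right_controller"):
        position, velocity, rotation = devices[name]
        parts.extend([*position, *velocity, *rotation])
    return np.asarray(parts, dtype=np.float32)  # 3 devices * (3 + 3 + 4) = 30 values per frame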

Conclusion

This thesis set out to construct a dataset and propose a proof of concept for a machine-learning approach to action recognition in VR. The dataset covers three actions, each performed with either the left or the right hand: throwing, waving, and pointing. A Unity script was created that samples the position, velocity, and rotation of the handheld controllers and the VR headset at a frequency of 30 Hz and saves the recorded information to a CSV file. Seven participants recorded actions for the dataset. Initially, the test set was recorded by a participant who was already present in the training set; later, the "leave-one-subject-out" approach was adopted instead. With this approach, one participant is held out as the test set while the others form the training set, and this is repeated for each participant.
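
A minimal Python sketch of this split follows, assuming one CSV file per participant, an "action" label column, and fixed-length windows of 60 frames (2 s at 30 Hz); the actual file layout, label names, and window length in the thesis may differ.

import glob
import numpy as np
import pandas as pd

WINDOW = 60  # frames per sample, assuming 2 s windows at 30 Hz
ACTIONS = ["throw_left", "throw_right", "wave_left", "wave_right",
           "point_left", "point_right"]  # assumed label names

def load_participant(path):
    """Read one participant's CSV and cut it into fixed-length windows."""
    df = pd.read_csv(path)
    labels = df.pop("action").map(ACTIONS.index).to_numpy()  # assumed label column
    feats = df.to_numpy(dtype=np.float32)                    # pos/vel/rot columns
    n = len(feats) // WINDOW
    X = feats[: n * WINDOW].reshape(n, WINDOW, -1)
    y = labels[: n * WINDOW : WINDOW]                        # one label per window
    return X, y

# One entry per participant, keyed by file path (e.g. data/participant_3.csv).
data = {p: load_participant(p) for p in sorted(glob.glob("data/participant_*.csv"))}

def leave_one_out(test_subject):
    """Hold one participant out as the test set and train on the remaining ones."""
    X_test, y_test = data[test_subject]
    train = [xy for subject, xy in data.items() if subject != test_subject]
    X_train = np.concatenate([x for x, _ in train])
    y_train = np.concatenate([y for _, y in train])
    return X_train, y_train, X_test, y_test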

A neural network was chosen as the machine-learning approach because of its widespread use in recognition and classification tasks. An LSTM layer was included in the network because of its ability to capture spatial and temporal context. The network was constructed and trained in the Google Colab cloud environment on an NVIDIA T4 GPU. Several adjustments were made to increase the network's accuracy.
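
A minimal Keras sketch of the architecture described above, one LSTM layer followed by three dense layers; the layer sizes, dropout rate, and window length are assumed values, not the exact hyperparameters used in the thesis.

import tensorflow as tf

WINDOW = 60        # frames per input window, assuming 2 s at 30 Hz
NUM_FEATURES = 30  # assumed: 3 devices x (position, velocity, rotation quaternion)
NUM_CLASSES = 6    # three movements, each with the left and the right hand

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, NUM_FEATURES)),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])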

After multiple iterations with different dataset sizes and hyperparameter tuning, the network achieved a peak accuracy of 97% under "leave-one-subject-out" cross-validation. Although this is an excellent result, improvements could still be made to both the dataset and the network. The dataset could be extended with more actions and more participants, and the network could be expanded with additional and larger layers. Further research could also explore multi-modal approaches.
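
Building directly on the data-split and model sketches above, a leave-one-subject-out cross-validation loop could look roughly like this; the epoch count and batch size are placeholder values rather than the settings used in the thesis.

import numpy as np
import tensorflow as tf

accuracies = []
for subject in data:                                  # one fold per participant
    X_tr, y_tr, X_te, y_te = leave_one_out(subject)
    fold_model = tf.keras.models.clone_model(model)   # fresh weights per fold
    fold_model.compile(optimizer="adam",
                       loss="sparse_categorical_crossentropy",
                       metrics=["accuracy"])
    fold_model.fit(X_tr, y_tr, epochs=30, batch_size=32, verbose=0)
    _, acc = fold_model.evaluate(X_te, y_te, verbose=0)
    accuracies.append(acc)
    print(f"{subject}: {acc:.2%}")

print("mean accuracy over all held-out subjects:", np.mean(accuracies))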

A Unity script was created that uses the trained neural network to perform live predictions in VR. The performed action is fed to the network inside Unity, and the prediction is displayed in the console. This script can serve as a basis for other researchers to expand on, and it could also be ported to other game engines.
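
The live-prediction script itself is written in C# against Barracuda; on the Python side, exporting the trained model to ONNX and sanity-checking it before the Unity import could look roughly like the sketch below. The use of tf2onnx and onnxruntime here is an assumption about the tooling, not a description of the thesis code, and the sketch reuses the model defined above.

import numpy as np
import onnxruntime as ort
import tensorflow as tf
import tf2onnx

# Export the trained Keras model to ONNX so Barracuda can import it in Unity.
spec = (tf.TensorSpec((None, WINDOW, NUM_FEATURES), tf.float32, name="input"),)
tf2onnx.convert.from_keras(model, input_signature=spec,
                           output_path="action_model.onnx")

# Quick sanity check outside Unity: run one dummy window through the exported model.
session = ort.InferenceSession("action_model.onnx")
dummy = np.zeros((1, WINDOW, NUM_FEATURES), dtype=np.float32)
probabilities = session.run(None, {"input": dummy})[0]
print("predicted class:", int(probabilities.argmax()))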

This thesis provides a proof of concept for VR-based action recognition that uses only the information provided by the VR headset and the handheld controllers. It lays the foundation for future work on using VR-based action recognition to perform tasks in dangerous working environments, as well as for communication in a VR space.


PDF Thesis

Repository

https://gitlab.lrz.de/vyno/bachelor-thesis


Presentation Slides