Markerless Versus Marker-Based 3D Human Pose Estimation for Strength and Conditioning Exercise Identification

Computer vision is how computers analyse image and video content to gather meaningful information. Recent advancements in deep learning have significantly improved computer vision, allowing computers to rival human vision. This progress is due to the availability of extensive training data and increased processing power from modern graphics processing units. Given enough data and computational resources, these deep learning models can excel at tasks like image classification, object segmentation, action recognition, and object localisation.

In the research discussed here, Dr Michael Deyzel, under supervision of Dr Rensu Theart, both of the Department of Electrical and Electronic Engineering, aims to contribute to the engineering of a virtual AI-based personal trainer that leverages modern advancements in computer vision.

An artificially intelligent personal trainer has the potential to identify, record and assess performances of strength and conditioning exercises where a human trainer is not available. By viewing users through cameras, it could perform 3D human pose estimation to extract the motion of key joints through space. This results in a sequence of skeletons that is a compact representation of the exercise action being performed.

Background to the Research

This research contributes to the abovementioned concept by developing a motion capture system that records a subject and reconstructs their 3D pose. This can be called markerless motion capture to distinguish it from optical marker-based motion capture used in research and industry.

Markerless solutions have the potential to make motion capture more accessible because they are non-intrusive, easy to use and affordable. This would be especially valuable in medical settings where clinicians use motion capture to diagnose and treat neuromuscular disorders.

Dr Deyzel worked with two motion capture systems – one traditional, using markers, and a newer, markerless one. He used these to gather data from seven different types of strength and conditioning exercises. Then, he compared how well each system recreated the 3D poses.

He found some issues with the markerless system that need sorting out before it can be used more widely. From his research, he came up with ways to make the markerless system more accurate, factoring in the normal movement path of body joints.

With this method, he was able to cut down the major errors in the markerless system by over 25%.

 Figure 1: Real-world applications of human pose estimation

Research Method

From the skeleton sequences of different exercises he captured, he developed a skeleton-based exercise recognition system using deep learning models. Then,  to learn spatial-temporal features for action identification, Dr Deyzel used a powerful graph convolutional network (GCN) architecture. 

First, he explored transfer learning by pre-training on a large skeleton-based action dataset, which achieves perfect or near-perfect classification accuracy in our seven exercise classes.

Second, he explored the more challenging task of one-shot action recognition. This will be a more useful exercise identification system since enrolment in exercises will only require one example per exercise. He used the GCN model as a feature extractor to learn a metric that projects similar actions closer together in an embedding space and dissimilar actions further apart.

Research Results

Our model achieved a classification accuracy of 87.4% on the seven never-before-seen exercise classes. Our research proves that a markerless motion capture system is sufficient for the capture of 3D pose for application where accuracy and consistency are not of utmost importance, such as for exercise identification, but that more research and development is required before markerless methods can replace the traditional maker-based motion capture used in clinical settings.

Read the full research at: