Egovision4Health: Assessing Activities of Daily Living from a Wearable RGB-D Camera for In-Home Health Care Applications


Abstract. Camera miniaturization and mobile computing now make it feasible to capture and process videos from body-worn cameras such as the Google Glass headset. This egocentric perspective is particularly well-suited to recognizing objects being handled or observed by the wearer, as well as analysing the gestures and tracking the activities of the wearer. Egovision4Health is a joint research project between the University of Zaragoza, Spain, and the University of California, Irvine, USA. The objective of this three-year project, currently in its second year, is to investigate new egocentric computer vision techniques to automatically provide health professionals with an assessment of their patients’ ability to manipulate objects and perform daily activities.

Keywords. Egocentric Vision. Wearable Cameras. Activity Analysis.

Description of the work performed.

In the first months of the project, we created a first prototype of a wearable RGB-D camera by chest-mounting an Intel Creative camera using a GoPro harness. We then collected the first RGB-D benchmark dataset of real egocentric object manipulation scenes and annotated it with full 3D hand poses. To do so, we developed a semi-automatic labelling tool that allows partially occluded hands and fingers to be accurately annotated in 3D.

Figure 1. Annotated test dataset.


We then developed our own rendering engine, which synthesizes photorealistic RGB-D images of egocentric object manipulation scenes. This led to the creation of a large-scale training dataset of synthetic egocentric RGB-D images. We used this dataset to train several new computer vision algorithms for the detection and recognition of hands during everyday object manipulations.

Figure 2. Egocentric pose estimation. (a) Synthetic egocentric camera mounted on a virtual avatar and egocentric workspace. (b) Examples of synthetic training depth images. (c) Depth features computed on the whole egocentric workspace for classification. (d) Our prototype (upper left) of a wearable RGB-D camera and pose estimation results in real egocentric RGB-D images.

In a second phase, we analyzed functional object manipulations, this time focusing on fine-grained hand-object interactions. We made use of a recently developed fine-grained taxonomy covering everyday interactions and created a large dataset of 12,000 RGB-D images covering 71 everyday grasps in natural interactions.

Figure 3. GUN-71 dataset.

In the last period, we addressed the more general problem of full-body 3D pose estimation in third-person RGB images and developed a new data synthesis technique to generate large-scale training data (2 million images), which were later used to train Deep Convolutional Neural Networks.

Figure 4. Mocap-guided data augmentation.


Description of the main results achieved so far.

We introduced the use of wearable RGB-D cameras and advanced existing knowledge on hand and object detection in first-person views. In particular, we defined and developed the new concept of the Egocentric Workspace and the associated spherical encoding of depth features. This concept allowed us to develop a new computer-vision-based method that estimates the 3D pose of an individual’s upper limbs (arms and hands) from a chest-mounted depth camera, reaching state-of-the-art results in real time.
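As a rough illustration of what a spherical encoding of depth features could look like, the sketch below back-projects depth pixels into 3D, converts them to spherical coordinates around the camera, and marks which spherical bins are occupied. It is only a minimal sketch under stated assumptions, not the project's implementation: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the bin counts, the 1 m range and all function names are hypothetical choices made for this example.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth in metres to camera-frame XYZ."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return x, y, depth

def spherical_bin(point, n_az, n_el, n_r, max_range):
    """Map a 3D point to an (azimuth, elevation, radius) bin, or None if out of range."""
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0 or r >= max_range:
        return None
    az = math.atan2(x, z)   # 0 along the optical axis
    el = math.asin(y / r)   # in [-pi/2, pi/2]
    i_az = min(int((az + math.pi) / (2 * math.pi) * n_az), n_az - 1)
    i_el = min(int((el + math.pi / 2) / math.pi * n_el), n_el - 1)
    i_r = int(r / max_range * n_r)  # r < max_range, so i_r <= n_r - 1
    return i_az, i_el, i_r

def binary_occupancy(depth_image, fx, fy, cx, cy,
                     n_az=8, n_el=8, n_r=4, max_range=1.0):
    """Return one 0/1 flag per spherical bin, set when any valid depth pixel falls in it.

    Bin counts and range are illustrative assumptions, not values from the project.
    """
    feat = [0] * (n_az * n_el * n_r)
    for v, row in enumerate(depth_image):
        for u, d in enumerate(row):
            if d <= 0:  # non-positive depth is treated as a missing measurement
                continue
            b = spherical_bin(backproject(u, v, d, fx, fy, cx, cy),
                              n_az, n_el, n_r, max_range)
            if b is not None:
                i_az, i_el, i_r = b
                feat[(i_az * n_el + i_el) * n_r + i_r] = 1
    return feat
```

Binning by direction and distance from the camera, rather than by image row and column, keeps each feature tied to a fixed region of the space in front of the wearer, which is one plausible motivation for a spherical layout with a chest-mounted sensor.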

We then analyzed functional object manipulations during daily activities and explored the problem of predicting contact and force (crucial concepts in functional grasp analysis) from perceptual cues. This analysis reveals the importance of depth for segmentation and detection, and the effectiveness of state-of-the-art deep RGB features for detailed grasp understanding. Finally, we artificially augmented a dataset of real images with new synthetic images and showed that Convolutional Neural Networks (CNNs) can be trained on artificial images and generalize well to real images. This end-to-end CNN classifier outperforms the state of the art for 3D pose estimation in controlled environments and shows promising results in the wild.


Perspective-aware binary depth features computed on egocentric workspace and estimated first person 3D pose


First person 3D pose estimation in egocentric RGB-D video.


Understanding Everyday Hands in Action from Egocentric (First-person) RGB-D Images


This research is supported by the European Commission under FP7-PEOPLE-2012-IOF – Marie Curie Action: “International Outgoing Fellowships for Career Development”. Grant “Egovision4Health”. (PIOF-GA-2012-328288).