Reconstructing Hands in 3D with Transformers

Georgios Pavlakos1     Dandan Shan2     Ilija Radosavovic1     Angjoo Kanazawa1     David Fouhey3     Jitendra Malik1
1University of California, Berkeley   2University of Michigan   3New York University
CVPR 2024

Paper
Dataset (coming soon)
GitHub Code
Hugging Face Demo
Google Colab Demo

We present HaMeR, an approach for Hand Mesh Recovery from a single image. Given a bounding box of the hand and the hand side (left/right), we use a deep network to reconstruct the hand in 3D in the form of a MANO mesh. Our approach is applied to each frame independently, yet it recovers temporally smooth results. HaMeR is accurate and particularly robust in cases with occlusions, truncations, different skin colors, hands with different appearances (e.g., wearing gloves), as well as hands interacting with other hands or objects.



Abstract

We present an approach that can reconstruct hands in 3D from monocular input. Our approach for Hand Mesh Recovery, HaMeR, follows a fully transformer-based architecture and can analyze hands with significantly increased accuracy and robustness compared to previous work. The key to HaMeR's success lies in scaling up both the data used for training and the capacity of the deep network for hand reconstruction. For training data, we combine multiple datasets that contain 2D or 3D hand annotations. For the deep model, we use a large-scale Vision Transformer architecture. Our final model consistently outperforms the previous baselines on popular 3D hand pose benchmarks. To further evaluate the effect of our design in non-controlled settings, we annotate existing in-the-wild datasets with 2D hand keypoint annotations. On this newly collected dataset of annotations, HInt, we demonstrate significant improvements over existing baselines. We will make our code, data, and models publicly available upon publication.



HaMeR Approach

HaMeR uses a fully transformer-based network design. It takes as input a single image of a hand and predicts the MANO model parameters, which are used to obtain the 3D hand mesh.
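To make the design above concrete, here is a minimal PyTorch sketch of a HaMeR-style regressor: a ViT-like encoder over image patches followed by a small transformer head that outputs MANO parameters. The layer sizes, the single query token, the 6D rotation parameterization, and the weak-perspective camera are illustrative assumptions, not the released architecture.

```python
# Minimal sketch (PyTorch) of a HaMeR-style regressor: a ViT-like encoder over
# image patches followed by a transformer head that predicts MANO parameters.
# All sizes, layer counts, and the 6D rotation representation are assumptions
# for illustration, not the released architecture.
import torch
import torch.nn as nn


class HandTransformer(nn.Module):
    def __init__(self, img_size=256, patch=16, dim=512, depth=6, heads=8):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        # A single learnable query token is decoded against the image tokens.
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        dec_layer = nn.TransformerDecoderLayer(dim, heads, dim * 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, 2)
        # MANO: 16 joint rotations (global + 15 fingers) in 6D form, 10 shape
        # coefficients, and a weak-perspective camera (s, tx, ty).
        self.pose_head = nn.Linear(dim, 16 * 6)
        self.shape_head = nn.Linear(dim, 10)
        self.cam_head = nn.Linear(dim, 3)

    def forward(self, img):  # img: (B, 3, 256, 256) hand crop
        tokens = self.patch_embed(img).flatten(2).transpose(1, 2) + self.pos_embed
        memory = self.encoder(tokens)
        q = self.decoder(self.query.expand(img.size(0), -1, -1), memory)[:, 0]
        return self.pose_head(q), self.shape_head(q), self.cam_head(q)


model = HandTransformer()
pose6d, betas, cam = model(torch.randn(2, 3, 256, 256))
print(pose6d.shape, betas.shape, cam.shape)  # (2, 96) (2, 10) (2, 3)
```

In practice, the predicted pose and shape parameters would be passed to a MANO layer to obtain the 3D mesh vertices, and the camera parameters would project the mesh back onto the image.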


HInt Dataset

We introduce HInt, a dataset focusing on Hand Interactions. We sample frames from New Days of Hands, EpicKitchens-VISOR, and Ego4D, and we annotate the hands with 2D keypoints. Since the frames come from video, they capture more natural hand interactions.
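Since HInt provides 2D keypoint annotations, a natural way to use it is to measure how well projected predictions match the annotated 2D keypoints, e.g., with a PCK-style score. The sketch below is a hedged illustration: the 21-keypoint layout, the visibility mask, and normalizing by the hand bounding-box size are assumptions here, not the official evaluation protocol.

```python
# Hedged sketch: a PCK-style score against HInt-style 2D keypoint annotations.
# The 21-keypoint layout, the visibility flag, and the use of the hand
# bounding-box size as the normalization factor are illustrative assumptions.
import numpy as np


def pck(pred_2d, gt_2d, visible, bbox_size, thresh=0.05):
    """pred_2d, gt_2d: (N, 21, 2) pixel coords; visible: (N, 21) bool;
    bbox_size: (N,) per-image hand box scale in pixels."""
    dist = np.linalg.norm(pred_2d - gt_2d, axis=-1)   # (N, 21) pixel errors
    norm_dist = dist / bbox_size[:, None]             # scale-normalized errors
    correct = (norm_dist < thresh) & visible          # only score visible joints
    return correct.sum() / max(visible.sum(), 1)


# Toy usage with random numbers standing in for real predictions/annotations.
rng = np.random.default_rng(0)
gt = rng.uniform(0, 256, size=(4, 21, 2))
pred = gt + rng.normal(0, 3, size=gt.shape)
vis = rng.random((4, 21)) > 0.2
print(f"PCK@0.05: {pck(pred, gt, vis, np.full(4, 256.0)):.3f}")
```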


Results

Comparisons to existing approaches

We present comparisons to the previous state-of-the-art approaches for 3D hand mesh reconstruction, FrankMocap and Mesh Graphormer. HaMeR is more robust and accurate across a variety of hand poses and viewpoints. The baselines fail under occlusions, truncations, or unusual illumination, and are significantly more jittery than HaMeR.
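As a hedged illustration of the jitter claim above, one common proxy for temporal smoothness is the mean per-joint acceleration of the predicted 3D keypoints across frames; lower values indicate smoother, less jittery predictions. This is a sketch of that proxy, not a metric reported in the paper.

```python
# Sketch of a common jitter proxy: mean per-joint acceleration of predicted
# 3D keypoints over a video. Units and frame rate are illustrative assumptions.
import numpy as np


def mean_acceleration(joints_3d, fps=30.0):
    """joints_3d: (T, 21, 3) per-frame 3D keypoints in meters."""
    vel = np.diff(joints_3d, axis=0) * fps      # (T-1, 21, 3) m/s
    acc = np.diff(vel, axis=0) * fps            # (T-2, 21, 3) m/s^2
    return np.linalg.norm(acc, axis=-1).mean()  # scalar, m/s^2


# Toy usage on a random-walk sequence standing in for real predictions.
joints = np.cumsum(np.random.default_rng(0).normal(0, 1e-3, size=(120, 21, 3)), axis=0)
print(f"mean acceleration: {mean_acceleration(joints):.4f} m/s^2")
```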

Side-View Visualizations

In this video, we also visualize the reconstructed hand meshes from a novel viewpoint (specifically a top view, as opposed to the camera view). The reconstructions from HaMeR are temporally stable even when observed from this novel viewpoint. As in all videos, the results are estimated on a per-frame basis, without any additional smoothing.
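Producing such a novel-view visualization amounts to rotating the recovered mesh before rendering it. The sketch below rotates the vertices about the x-axis around their centroid; the 90-degree angle and the choice of axis are illustrative assumptions, not the exact setup used for the videos.

```python
# Hedged sketch of the novel-view visualization idea: rotate the recovered
# mesh vertices before passing them to any mesh renderer. The rotation axis,
# angle, and centering on the centroid are illustrative choices.
import numpy as np


def rotate_for_top_view(vertices, angle_deg=90.0):
    """vertices: (V, 3) mesh vertices in camera coordinates."""
    a = np.deg2rad(angle_deg)
    Rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(a), -np.sin(a)],
                   [0.0, np.sin(a),  np.cos(a)]])
    center = vertices.mean(axis=0)              # rotate about the mesh centroid
    return (vertices - center) @ Rx.T + center
```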


Limitations

The main failure modes include spurious hand detections, errors in left/right hand classification, and extremely hard poses. You can see more results on long videos here.
Similar to prior work, HaMeR requires the hand side (left/right) for the input image. When the given hand side is correct (left), the reconstructions align well with the 2D hands; when the given hand side is incorrect (right), the reconstruction is expected to be wrong, since the model reconstructs a hand of the opposite side. Even so, HaMeR often returns a reasonable interpretation of the input image.
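A common way single-hand models consume the left/right input, which would also explain the behavior above, is to mirror left-hand crops so that the network only ever sees right hands, and then mirror the prediction back. Whether HaMeR uses exactly this convention is an assumption here; the sketch below shows the standard trick.

```python
# Hedged sketch of a common left/right convention: flip left-hand crops so the
# network always sees right hands, then mirror the prediction back. Whether
# HaMeR uses exactly this convention is an assumption in this sketch.
import numpy as np


def prepare_crop(crop, is_right):
    """crop: (H, W, 3) image of the hand; is_right: bool from the detector."""
    return crop if is_right else crop[:, ::-1]  # horizontal flip for left hands


def unflip_vertices(vertices, is_right):
    """vertices: (V, 3) predicted mesh in camera coordinates."""
    if is_right:
        return vertices
    flipped = vertices.copy()
    flipped[:, 0] *= -1                         # mirror back across the x-axis
    return flipped
```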
We include representative failure cases of our approach. HaMeR may fail under extreme finger poses, unnatural appearance, extreme occlusions, or unnatural shape (e.g., a robotic hand with finger sizes that do not follow typical human proportions).


Citation
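A BibTeX entry assembled from the title, authors, and venue listed above (the citation key and booktitle formatting are placeholders; please check the published version):

```bibtex
@inproceedings{pavlakos2024hamer,
  title     = {Reconstructing Hands in 3D with Transformers},
  author    = {Pavlakos, Georgios and Shan, Dandan and Radosavovic, Ilija and
               Kanazawa, Angjoo and Fouhey, David and Malik, Jitendra},
  booktitle = {CVPR},
  year      = {2024}
}
```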



Acknowledgements

This research was supported by the DARPA Machine Common Sense program, ONR MURI, as well as BAIR/BDD sponsors. We thank members of the BAIR community for helpful discussions. We also thank StabilityAI for supporting us through a compute grant. DF and DS were supported by the National Science Foundation under Grant No. 2006619. This webpage template was borrowed from some colorful folks. Music credits: SLAHMR. Icons: Flaticon.