|
|
Paper · Data · Code · Demo · Colab
|
We present HaMeR, an approach for Hand Mesh Recovery from a single image. Given a bounding box of the hand and the hand side (left/right), we use a deep network to reconstruct the hand in 3D as a MANO mesh. Our approach is applied to each frame independently, yet it recovers temporally smooth results. HaMeR is accurate and particularly robust under occlusions and truncations, across different skin tones and hand appearances (e.g., gloved hands), and for hands interacting with other hands or objects.
HaMeR uses a fully transformer-based network design: it takes a single image of a hand as input and predicts the MANO model parameters, from which the 3D hand mesh is obtained.
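To make the design concrete, here is a minimal sketch of such a transformer regressor in PyTorch. The layer sizes, the mean pooling, and the axis-angle pose parameterization are illustrative placeholders, not the released HaMeR architecture:

```python
import torch
import torch.nn as nn

class HandTransformer(nn.Module):
    """Minimal sketch: a transformer that regresses MANO parameters
    from a hand crop. Hyperparameters and the output parameterization
    are illustrative, not the released HaMeR architecture."""

    def __init__(self, img_size=256, patch=16, dim=512, depth=6, heads=8):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # MANO: global orientation + 15 joint rotations (axis-angle here),
        # 10 shape betas, and a weak-perspective camera (scale, tx, ty).
        self.head = nn.Linear(dim, 16 * 3 + 10 + 3)

    def forward(self, img):                        # img: (B, 3, H, W)
        tokens = self.patch_embed(img).flatten(2).transpose(1, 2)
        tokens = self.encoder(tokens + self.pos_embed)
        feat = tokens.mean(dim=1)                  # pool over patch tokens
        pose, betas, cam = self.head(feat).split([48, 10, 3], dim=-1)
        return pose, betas, cam                    # feed pose/betas to MANO

model = HandTransformer()
pose, betas, cam = model(torch.randn(1, 3, 256, 256))
```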
We introduce HInt, a dataset focusing on Hand Interactions. We sample frames from New Days of Hands, EpicKitchens-VISOR and Ego4D, and annotate the hands with 2D keypoints. Since the frames come from video, they capture more natural hand interactions.
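For illustration, one annotation record could be organized as below; the field names and layout here are hypothetical and may not match the released dataset:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HIntAnnotation:
    """Hypothetical layout of one HInt annotation (names are illustrative)."""
    image_path: str           # frame sampled from the source video
    hand_side: str            # "left" or "right"
    keypoints_2d: np.ndarray  # (21, 2) pixel coordinates, one per hand joint
    visible: np.ndarray       # (21,) bool, whether each keypoint is visible

ann = HIntAnnotation(
    image_path="ego4d/frame_000123.jpg",   # hypothetical path
    hand_side="right",
    keypoints_2d=np.zeros((21, 2)),
    visible=np.ones(21, dtype=bool),
)
```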
We present comparisons to previous state-of-the-art approaches for 3D hand mesh reconstruction, FrankMocap and Mesh Graphormer. HaMeR is more robust and accurate across a variety of hand poses and viewpoints. The baselines fail under occlusions, truncations, or unusual illumination, and are significantly more jittery than HaMeR.
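One simple way to quantify jitter is the mean magnitude of the second finite difference (acceleration) of the predicted 3D joints over time. This is a common smoothness proxy, not necessarily the exact metric used in the paper:

```python
import numpy as np

def mean_joint_acceleration(joints):
    """Jitter proxy: mean norm of the second finite difference of
    per-frame 3D joints with shape (T, 21, 3). Lower is smoother."""
    accel = joints[2:] - 2 * joints[1:-1] + joints[:-2]   # (T-2, 21, 3)
    return np.linalg.norm(accel, axis=-1).mean()

# Compare a smooth trajectory with a noisy copy of it (synthetic data).
rng = np.random.default_rng(0)
smooth = np.cumsum(rng.normal(scale=1e-3, size=(100, 21, 3)), axis=0)
jittery = smooth + rng.normal(scale=5e-3, size=(100, 21, 3))
assert mean_joint_acceleration(smooth) < mean_joint_acceleration(jittery)
```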
|
In this video, we also visualize the reconstructed hand meshes from a novel viewpoint (specifically a top view, in addition to the camera view). The reconstructions from HaMeR are temporally stable even when observed from this novel viewpoint. As in all videos, the results are estimated on a per-frame basis, without any additional smoothing.
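As a sketch of how such a view can be produced, one can rigidly rotate the recovered mesh vertices before rendering; the actual videos presumably use a full mesh renderer, but the geometry is just a rotation:

```python
import numpy as np

def top_view(vertices, angle=-np.pi / 2):
    """Rotate a hand mesh (V, 3) about the x-axis around its centroid,
    so a camera looking along the original z-axis sees it from above.
    A minimal sketch; a renderer such as pyrender would consume the result."""
    center = vertices.mean(axis=0)
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[1, 0,  0],
                  [0, c, -s],
                  [0, s,  c]])
    return (vertices - center) @ R.T + center

verts_top = top_view(np.random.rand(778, 3))   # MANO meshes have 778 vertices
```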
The main failure modes include spurious hand detections, errors in left/right hand classification, and extremely hard poses. You can see more results on long videos here. |
Similar to prior work, HaMeR requires the hand side (left/right) information for the input image. When the given hand side is correct (left), the reconstructions align well with the 2D hands; when the given hand side is incorrect (right), the reconstructions are expected to be wrong, since the model reconstructs a hand of the opposite side, but HaMeR often still returns a reasonable interpretation of the input image.
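A common way to use the hand-side label (since MANO is defined for the right hand) is to mirror left-hand crops before inference and mirror the predictions back afterwards. The sketch below illustrates this convention and is not the exact HaMeR preprocessing:

```python
import numpy as np

def prepare_crop(crop, is_right):
    """Mirror left-hand crops so the network only ever sees right hands.
    Sketch of a common convention, not the exact HaMeR pipeline."""
    if not is_right:
        crop = crop[:, ::-1].copy()   # horizontal flip of an (H, W, 3) image
    return crop

def unflip_vertices(vertices, is_right):
    """Mirror predicted 3D vertices back across the x-axis for left hands."""
    if not is_right:
        vertices = vertices * np.array([-1.0, 1.0, 1.0])
    return vertices
```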
We include representative failure cases of our approach. HaMeR may fail under extreme finger poses, unnatural appearance, extreme occlusions, or unnatural shape (e.g., a robotic hand with finger sizes that do not follow typical human proportions).
Citation |
Acknowledgements
This research was supported by the DARPA Machine Common Sense program, ONR MURI, as well as BAIR/BDD sponsors. We thank members of the BAIR community for helpful discussions. We also thank StabilityAI for supporting us through a compute grant. DF and DS were supported by the National Science Foundation under Grant No. 2006619. This webpage template was borrowed from some colorful folks. Music credits: SLAHMR. Icons: Flaticon.