🍐 PEAR: Pixel-aligned Expressive humAn mesh Recovery

International Digital Economy Academy (IDEA)
PEAR teaser image
Figure 1: We propose PEAR, which achieves pixel-level alignment beyond previous human mesh recovery methods and demonstrates stronger generalization. It captures more accurate facial details, adapts to diverse inputs, and recovers pose from a single image within 0.01 s without body-part cropping, offering significant convenience for downstream real-time animation tasks.
PEAR pipeline image
Figure 2: PEAR is trained in two stages. In the first stage, we train a ViT-based architecture on single-frame images (employing only the target flow) to estimate EHM parameters, comprising SMPL-X body and FLAME head parameters. In the second stage, we train on image pairs drawn from video datasets and incorporate a neural renderer to provide photometric supervision, enabling joint training of the two modules and improving human detail reconstruction and rendering quality.
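For concreteness, here is a minimal PyTorch sketch of this two-stage design under our own assumptions: PEARRegressor, the head output sizes, and the renderer argument are illustrative stand-ins, not the authors' released code, and the shallow transformer stands in for the full ViT backbone.

import torch
import torch.nn as nn

class PEARRegressor(nn.Module):
    """Stage 1: a ViT-style encoder regresses EHM parameters
    (SMPL-X body + FLAME head) from a single, uncropped image."""

    def __init__(self, feat_dim=768, patch=16):
        super().__init__()
        # Patch embedding plus a shallow transformer stand in for the ViT.
        self.patch_embed = nn.Conv2d(3, feat_dim, kernel_size=patch, stride=patch)
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Hypothetical output sizes: SMPL-X body pose + shape, FLAME expression.
        self.smplx_head = nn.Linear(feat_dim, 63 + 10)
        self.flame_head = nn.Linear(feat_dim, 50)

    def forward(self, img):                        # img: (B, 3, H, W)
        tokens = self.patch_embed(img).flatten(2).transpose(1, 2)
        feat = self.blocks(tokens).mean(dim=1)     # pool patch tokens
        return self.smplx_head(feat), self.flame_head(feat)

def stage2_photometric_loss(renderer, params, target_img):
    """Stage 2: render the predicted mesh with a neural renderer and
    supervise it photometrically against the paired video frame."""
    rendered = renderer(*params)                   # (B, 3, H, W)
    return (rendered - target_img).abs().mean()    # L1 photometric term

Per the caption, stage 1 supervises the regressed parameters directly; stage 2 adds the photometric term and trains the regressor and renderer jointly, which is what couples geometry and appearance.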

Abstract

Reconstructing detailed 3D human meshes from a single in-the-wild image remains a fundamental challenge in computer vision. Existing methods often suffer from slow inference, produce only coarse body poses, and exhibit misalignments or unnatural artifacts in fine-grained regions such as the face and hands. These issues make current approaches difficult to apply to downstream tasks. To address these challenges, we propose PEAR, a unified framework for human mesh recovery and rendering. PEAR explicitly tackles three major limitations of existing methods: slow inference, inaccurate localization of fine-grained human pose details, and insufficient photometric supervision during self-reconstruction. Specifically, we train a clean, unified ViT-based model capable of recovering expressive 3D human geometry (SMPL-X + FLAME) from a single image without cropping any specific body parts. This preprocessing-free design enables real-time inference at over 100 FPS. In addition, we integrate the model with a neural renderer to jointly optimize geometry and appearance, significantly improving the reconstruction accuracy of fine-grained human geometry and producing higher-quality rendering results. Finally, we propose a modular data annotation strategy that enriches the training data and enhances the robustness of the model. Extensive experiments on multiple benchmark datasets demonstrate that our approach achieves substantial improvements in both geometric reconstruction accuracy and rendering quality.

TL;DR

We propose PEAR, a unified framework for real-time expressive 3D human mesh recovery. It is the first method capable of simultaneously predicting SMPL-X and FLAME parameters at 100 FPS, and it exhibits strong robustness to diverse inputs, accurately handling face-only, upper-body, and full-body images.

PEAR vs. SAM3D

PEAR's fine-grained pose estimation for the upper body is more accurate than SAM3D's, yielding an estimated human mesh that is more pixel-aligned. Most importantly, PEAR's inference is roughly 100× faster than SAM3D's, enabling real-time processing.
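As a rough illustration of how such a throughput claim can be verified, the sketch below times repeated forward passes on a GPU. The name pear_model and the 448×448 input resolution are assumptions for illustration, not the released interface.

import time
import torch

@torch.no_grad()
def measure_fps(model, n_iters=200, device="cuda"):
    # Dummy input; the real model takes an uncropped RGB image.
    x = torch.randn(1, 3, 448, 448, device=device)
    for _ in range(10):           # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    torch.cuda.synchronize()      # wait for all queued kernels to finish
    return n_iters / (time.perf_counter() - t0)   # frames per second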

Multi-HMR

SAM3D-Body

Ours

☺️ Head mesh recovery

Our approach attains highly detailed facial alignment, enabling the capture of more nuanced expressions.

OSX

SMPLest

Multi-HMR

Multi-HMR (failed)

Ours

🤷‍♂️ UBody mesh recovery

Our method achieves more accurate alignment with the actual motion in both the face and hands.

OSX

SMPLest

Multi-HMR

Ours

🏃 WholeBody mesh recovery

Our method achieves finer pixel-level alignment across the entire moving body, rather than exhibiting the large offsets seen in other approaches.

OSX

SMPLest

Multi-HMR

Ours

⚙️ Downstream application

Benefiting from PEAR’s fast inference speed (100 FPS), the system functions as a real-time animation interface, estimating SMPL-X and FLAME parameters from video streams and driving animations at 50 FPS.
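As a hedged sketch of such an interface, the loop below reads frames from a webcam, runs a single uncropped forward pass per frame, and hands the predicted parameters to an animation consumer. pear_model and apply_to_avatar are hypothetical names standing in for the actual model and retargeting code.

import cv2
import torch

@torch.no_grad()
def drive_avatar(pear_model, apply_to_avatar, device="cuda"):
    cap = cv2.VideoCapture(0)                 # live webcam stream
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # Whole-frame input: no face or hand cropping is required.
        img = torch.from_numpy(frame[:, :, ::-1].copy())        # BGR -> RGB
        img = img.permute(2, 0, 1).float().unsqueeze(0) / 255.0
        smplx_params, flame_params = pear_model(img.to(device))
        apply_to_avatar(smplx_params, flame_params)             # retarget pose
    cap.release()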

Real-time Animation

Drive a wider variety of identities

Cartoon Animation

Some extreme cases

We showcase several extreme cases, including motion blur, occlusion, strong illumination, loose clothing, and long hair.

Loose clothing and hair


BibTeX

@article{}