A Plug-and-Play Network for Refining Human Poses in Videos

European Conference on Computer Vision (ECCV), 2022
Ailing Zeng¹, Lei Yang², Xuan Ju¹, Jiefeng Li³, Jianyi Wang⁴, Qiang Xu¹
¹The Chinese University of Hong Kong, ²Sensetime Group Ltd.,
³Shanghai Jiao Tong University, ⁴Nanyang Technological University


When analyzing human motion videos, the output jitter from existing pose estimators is highly unbalanced, with estimation errors that vary across frames. Most frames in a video are relatively easy to estimate and suffer only slight jitter. In contrast, for rarely seen or occluded actions, the estimated positions of multiple joints deviate largely from the ground truth over a consecutive sequence of frames, producing significant jitter. To tackle this problem, we propose SmoothNet, a dedicated temporal-only refinement network that attaches to existing pose estimators to mitigate jitter. Unlike existing learning-based solutions that employ spatio-temporal models to co-optimize per-frame precision and temporal smoothness at all joints, SmoothNet models the natural smoothness of body movements by learning the long-range temporal relations of every joint without considering the noisy correlations among joints. With a simple yet effective motion-aware fully-connected network, SmoothNet significantly improves the temporal smoothness of existing pose estimators and, as a side effect, enhances estimation accuracy on those challenging frames. Moreover, as a temporal-only model, SmoothNet has the unique advantage of strong transferability across various types of estimators and datasets. Comprehensive experiments on five datasets with eleven popular backbone networks across 2D and 3D pose estimation and body recovery tasks demonstrate the efficacy of the proposed solution.
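
As a rough illustration of what "temporal-only" means here, the sketch below applies a single linear map along the time axis within a sliding window, using the same weights for every joint coordinate and never mixing channels. It is not the SmoothNet architecture or its trained weights: the function name `temporal_only_refine`, the window length, and the uniform averaging matrix standing in for learned weights are all assumptions made for illustration.

```python
import numpy as np

def temporal_only_refine(poses, weights, bias=None):
    """Refine a pose sequence along the time axis only (illustrative sketch).

    poses:   (T, C) array; T frames, C = joints * dims flattened into channels.
    weights: (W, W) matrix mapping a window of W input frames to W output frames.
             In a learned model this would be trained; here it is supplied by the caller.
    Outputs of overlapping windows are averaged per frame.
    """
    poses = np.asarray(poses, dtype=float)
    T, C = poses.shape
    W = weights.shape[0]
    if T < W:
        raise ValueError("sequence is shorter than the temporal window")
    if bias is None:
        bias = np.zeros(W)

    out = np.zeros_like(poses)
    counts = np.zeros((T, 1))
    for start in range(T - W + 1):
        window = poses[start:start + W]              # (W, C)
        # The matrix product acts on the time axis; each channel (column)
        # is transformed independently with the same weights.
        refined = weights @ window + bias[:, None]   # (W, C)
        out[start:start + W] += refined
        counts[start:start + W] += 1
    return out / counts

# Hypothetical usage: a noisy 2-channel trajectory and a uniform averaging
# matrix as a stand-in for learned temporal weights.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 64)
clean = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
noisy = clean + 0.05 * rng.standard_normal(clean.shape)
window_len = 8
smoothed = temporal_only_refine(noisy, np.full((window_len, window_len), 1.0 / window_len))
```

Because the weights act only on the time axis, the same function works unchanged whether the channels hold 2D positions, 3D positions, or rotation parameters, which is the property the abstract credits for transferability across estimators.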

Demo of SmoothNet

Qualitative Results

  • Figure 1. The first row compares an existing 2D pose estimation method with and without the proposed SmoothNet (smoothing on 2D positions) on Subject 9, Sitting Down, Camera 1 of the Human3.6M dataset. The second row shows the Accel (mean acceleration error) for each frame. The third row compares the MPJPE (mean per joint position error) for each frame.

  • Figure 2. Comparison of 2D pose estimation (smoothing on 2D positions) on Subject 11, Sitting Down, Camera 1 of the Human3.6M dataset.

  • Figure 3. Comparison of 3D pose estimation (smoothing on 3D positions) on Subject 9, Posing, Camera 0 of the Human3.6M dataset. Details of the actions are well preserved.

  • Figure 4. Comparison of 3D pose estimation (smoothing on 3D positions) on Subject 9, Walking Together, Camera 3 (lower) of the Human3.6M dataset. Details of the actions are well preserved.

  • Figure 5. Comparison of human body recovery (SPIN) with and without SmoothNet (smoothing on 6D rotation matrices) on the 3DPW dataset.

  • Figure 6. Comparison of human body recovery (EFT) with and without SmoothNet (smoothing on 6D rotation matrices) on the 3DPW dataset.

  • Figure 7. Comparison of human body recovery (VIBE) with and without SmoothNet (smoothing on 6D rotation matrices) on the AIST++ dataset.
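
The two per-frame metrics plotted in Figure 1 can be sketched as follows, using their commonly used definitions: MPJPE as the mean Euclidean distance between predicted and ground-truth joints, and Accel as the mean distance between predicted and ground-truth joint accelerations, approximated by second-order finite differences. The function names, the `(T, J, D)` array layout, and the absence of any root alignment or unit conversion are assumptions of this sketch, not details taken from the evaluation code behind the figures.

```python
import numpy as np

def mpjpe_per_frame(pred, gt):
    """Mean per joint position error for each frame.

    pred, gt: (T, J, D) arrays with T frames, J joints, D coordinates (2 or 3).
    Returns a (T,) array.
    """
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=-1)

def accel_error_per_frame(pred, gt):
    """Mean acceleration error for each interior frame.

    Acceleration is approximated by the second finite difference
    x[t+1] - 2*x[t] + x[t-1], so the result has length T - 2.
    """
    acc_pred = pred[2:] - 2.0 * pred[1:-1] + pred[:-2]
    acc_gt = gt[2:] - 2.0 * gt[1:-1] + gt[:-2]
    return np.linalg.norm(acc_pred - acc_gt, axis=-1).mean(axis=-1)

# Hypothetical check: a constant offset changes position error but not acceleration error.
rng = np.random.default_rng(0)
gt_demo = rng.standard_normal((10, 17, 3))
pred_demo = gt_demo + np.array([3.0, 4.0, 0.0])
pos_err = mpjpe_per_frame(pred_demo, gt_demo)
acc_err = accel_error_per_frame(pred_demo, gt_demo)
```

The check illustrates why the two curves in Figure 1 carry different information: a prediction can be consistently offset from the ground truth (high MPJPE) yet perfectly smooth relative to it (zero Accel), and the reverse also holds for a jittery but unbiased prediction.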

Downstream task: Skeleton-based action recognition

Figure 8. (a) Existing 3D Position Ground Truth on NTU-RGBD-60.

Figure 8. (b) Refined 3D Positions with SmoothNet.

      title={SmoothNet: A Plug-and-Play Network for Refining Human Poses in Videos},
      author={Zeng, Ailing and Yang, Lei and Ju, Xuan and Li, Jiefeng and Wang, Jianyi and Xu, Qiang},
      booktitle={European Conference on Computer Vision},

For any questions, please contact Ailing Zeng (