A Hybrid Photorealistic Architecture Based on Generating Facial Features and Body Reshaping for Virtual Try-on Applications

Tran Van Duc; Pham Quang Tien; Hoang Duc Minh Trieu; Nguyen Thi Ngoc Anh; Dat Tien Nguyen

doi:10.13164/mendel.2023.2.097

Tran Van Duc, Viettel High Technology Industries Corporation, Hanoi, Vietnam
Pham Quang Tien, Viettel High Technology Industries Corporation, Hanoi, Vietnam
Hoang Duc Minh Trieu, Viettel High Technology Industries Corporation, Hanoi, Vietnam
Nguyen Thi Ngoc Anh, VNU University of Engineering and Technology, Hanoi, Vietnam
Dat Tien Nguyen, Viettel High Technology Industries Corporation, Hanoi, Vietnam

Keywords: Adaptive Skin Color, Body Reshaping, Head Swapping, Photorealistic, Virtual Try-on

Abstract

Online shopping with virtual try-on technology is becoming popular and is widely used in digital transformation, owing to sustainably sourced materials and an enhanced customer experience. To be applicable in practice, the process must satisfy two main requirements: (1) accuracy and reliability, and (2) processing time. To meet these requirements, we propose a state-of-the-art technique that generates a visualization of the user wearing model costumes from only a single user portrait and basic anthropometric measurements. We first summarize the main families of virtual try-on approaches: (1) interactive simulation between 3D models, and (2) 2D photorealistic generation. Although these approaches produce feasible visualizations, they face issues of efficiency and performance. Furthermore, the complexity of their input requirements and of the user experiments leads to difficulties in practical application and future scalability. To address this, our study combines (1) a head-swapping technique that uses a face alignment model to locate, segment, and swap heads, taking only a pair of source and target images as input; (2) a photorealistic body-reshaping pipeline that directly resizes the user's visualization; and (3) an adaptive skin color model that changes the user's skin tone while keeping the facial structure intact and natural-looking. The proposed technique was evaluated quantitatively and qualitatively on three types of data: (1) VoxCeleb2, (2) datasets from the Viettel collection, and (3) user testing, to demonstrate its feasibility and efficiency in real-world applications.
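
The abstract does not say how the adaptive skin color model works internally. As a purely illustrative sketch, and not the authors' model, the snippet below assumes a simple statistics-matching approach: the skin pixels of a target image are shifted and scaled so that their per-channel mean and standard deviation match those of the skin pixels in a source image. The function name `match_skin_tone`, the boolean skin masks, and the `[0, 1]` value range are all assumptions introduced here for illustration.

```python
import numpy as np


def match_skin_tone(target, target_mask, source, source_mask):
    """Hypothetical helper: match per-channel mean/std of masked skin pixels.

    target, source: float arrays of shape (H, W, 3) with values in [0, 1].
    target_mask, source_mask: boolean arrays of shape (H, W) marking skin.
    Returns a copy of `target` whose masked pixels follow the source statistics.
    """
    out = target.astype(np.float64).copy()
    src_pixels = source[source_mask].astype(np.float64)  # shape (N_src, 3)
    tgt_pixels = out[target_mask]                         # shape (N_tgt, 3), a copy

    # Nothing to do if either mask selects no pixels.
    if src_pixels.size == 0 or tgt_pixels.size == 0:
        return out

    src_mean, src_std = src_pixels.mean(axis=0), src_pixels.std(axis=0)
    tgt_mean, tgt_std = tgt_pixels.mean(axis=0), tgt_pixels.std(axis=0)

    # Guard against division by zero on perfectly flat regions.
    scale = src_std / np.maximum(tgt_std, 1e-6)

    # Normalize the target skin pixels, then re-express them in source statistics.
    adjusted = (tgt_pixels - tgt_mean) * scale + src_mean
    out[target_mask] = np.clip(adjusted, 0.0, 1.0)
    return out
```

Only pixels inside `target_mask` are modified, so clothing and background stay untouched. A learned model, as the abstract suggests, would presumably handle lighting and texture more carefully than this global statistic.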

References

Bender, J., Müller, M., Otaduy, M. A., Teschner, M., and Macklin, M. A survey on position-based simulation methods in computer graphics. In Computer graphics forum (2014), vol. 33, Wiley Online Library, pp. 228–251.

Bulat, A., and Tzimiropoulos, G. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In International Conference on Computer Vision (2017).

Burkov, E., Pasechnik, I., Grigorev, A., and Lempitsky, V. Neural head reenactment with latent pose descriptors. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020).

Cao, Z., Hidalgo Martinez, G., Simon, T., Wei, S., and Sheikh, Y. A. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).

Chung, J. S., Nagrani, A., and Zisserman, A. Voxceleb2: Deep speaker recognition. In INTERSPEECH (2018).

Fratarcangeli, M., Tibaldo, V., and Pellacini, F. Vivace: A practical gauss-seidel method for stable soft body dynamics. ACM Transactions on Graphics (TOG) 35, 6 (2016), 1–9.

Hong, F.-T., Zhang, L., Shen, L., and Xu, D. Depth-aware generative adversarial network for talking head video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022), pp. 3397–3406.

Li, Y., et al. N-Cloth: Predicting 3D cloth deformation with mesh-based networks. Computer Graphics Forum (Proceedings of Eurographics) 41, 2 (May 2022), 547–558.

Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A. A. A., Tzionas, D., and Black, M. J. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2019).

Perov, I., et al. Deepfacelab: Integrated, flexible and extensible face-swapping framework. arXiv preprint arXiv:2005.05535 (2020).

Ranzato, M., Huang, F. J., Boureau, Y.-L., and LeCun, Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In 2007 IEEE conference on computer vision and pattern recognition (2007), IEEE, pp. 1–8.

Razafindrazaka, F. H. Delaunay triangulation algorithm and application to terrain generation. International Institute for Software Technology, United Nations University, Macao (2009).

Ribeiro, S., Maximo, A., Bentes, C., Oliveira, A., and Farias, R. Memory-aware and efficient ray-casting algorithm. In XX Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI 2007) (2007), IEEE, pp. 147–154.

Santesteban, I., Otaduy, M. A., and Casas, D. Learning-based animation of clothing for virtual try-on. In Computer Graphics Forum (2019), vol. 38, Wiley Online Library, pp. 355–366.

Sara, U., Akter, M., and Uddin, M. S. Image quality assessment through fsim, ssim, mse and psnr—a comparative study. Journal of Computer and Communications 7, 3 (2019), 8–18.

Sarkar, K., Golyanik, V., Liu, L., and Theobalt, C. Style and pose control for image synthesis of humans from a single monocular view. arXiv preprint arXiv:2102.11263 (2021).

Shu, C., et al. Few-shot head swapping in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022), pp. 10789–10798.

Siarohin, A., Lathuilière, S., Tulyakov, S., Ricci, E., and Sebe, N. First order motion model for image animation. In Conference on Neural Information Processing Systems (NeurIPS) (December 2019).

Siarohin, A., Woodford, O., Ren, J., Chai, M., and Tulyakov, S. Motion representations for articulated animation. In CVPR (2021).

Xiang, D., et al. Dressing avatars: Deep photorealistic appearance for physically simulated clothing. ACM Transactions on Graphics (TOG) 41, 6 (2022), 1–15.

Xiao, T., Hong, J., and Ma, J. Elegant: Exchanging latent encodings with gan for transferring multiple face attributes. In Proceedings of the European Conference on Computer Vision (ECCV) (September 2018), pp. 172–187.

Yang, H., Zhang, R., Guo, X., Liu, W., Zuo, W., and Luo, P. Towards photo-realistic virtual try-on by adaptively generating-preserving image content. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020).

Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR (2018).

Zhao, J., and Zhang, H. Thin-plate spline motion model for image animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022), pp. 3657–3666.