Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All
Figure 1. Diffuse to Choose (DTC) allows users to virtually place any e-commerce item in any setting, ensuring detailed, semantically coherent blending with realistic lighting and shadows.

Abstract
Few-shot personalization models such as DreamPaint are good at preserving the item's details, but they are not optimized for real-time applications. We present "Diffuse to Choose," a novel diffusion-based, image-conditioned inpainting model that efficiently balances fast inference with the retention of high-fidelity details of a given reference item, while ensuring accurate semantic manipulation of the given scene content. Our approach incorporates fine-grained features from the reference image directly into the latent feature maps of the main diffusion model, alongside a perceptual loss that further preserves the reference item's details. We conduct extensive testing on both in-house and publicly available datasets and show that Diffuse to Choose is superior to existing zero-shot image-conditioned inpainting models, while matching the performance of non-real-time, few-shot personalization methods such as DreamPaint.
1. Introduction
The ever-growing demand for online shopping underscores the need for a more immersive shopping experience, allowing shoppers to virtually 'try' any product from any category (clothes, shoes, furniture, decoration, etc.) within their personal environments. The concept of a Virtual Try-All (Vit-All) model hinges on its functionality as an advanced semantic image composition tool. In practice, this involves taking an image from a user, selecting a region within that image, and using a reference product image from an online catalog to semantically insert the product into the selected area while preserving its details. For such a model to be effective, it must fulfill three primary conditions: 1) operate on any 'in-the-wild' user image and reference image (not only staged studio shots or professional human-model images with predefined poses), 2) integrate the reference product harmoniously with the surrounding context while maintaining the product's identity (not replacing the product with a generic item from a similar category), and 3) perform fast inference to facilitate real-time usage across billions of products and millions of users.
Existing solutions tend to be specialized. Instead of aiming for a general-purpose Vit-All approach, models are often developed for specific tasks and domains (a model for clothing, a model for furniture, a model for eyeglasses, etc.). For example, early GAN-based works focused primarily on virtual try-on of clothing on human models in limited contexts or controlled environments (such as only certain clothing segments and no in-the-wild user or product images) [3, 5, 12, 17, 18, 38]. Other approaches utilize somewhat expensive 3D AR/VR technologies for items like furniture in rooms [1, 29], which are hard to scale to catalogs of billions of products that often lack 3D models. Consequently, a unified model offering a comprehensive Vit-All experience (one that enables consumers to digitally interact with any product from any category in any setting) is currently not available.
The emergence of diffusion models has marked a significant breakthrough in the generative capabilities of complex image modeling [25, 27, 33]. Unlike GANs, diffusion models inherently grasp the nuances of the 3D world, exhibiting a degree of geometry and physics awareness, as demonstrated in inpainting tasks by [41], which establishes their usefulness for Vit-All applications. A DreamBooth-based [26] technique called DreamPaint [28] showed that Stable Diffusion [25] can be few-shot fine-tuned for the Vit-All use case: it can infer how to warp clothes to a person's body, or how to place a piece of furniture in a particular spot in a semantically correct manner, without being explicitly trained to do so. While DreamPaint meets the first two criteria for an effective Vit-All model, it requires few-shot fine-tuning for each product separately, compromising its suitability for real-time applications and thus failing to meet the third criterion.
A recently introduced image-referenced inpainting model, Paint By Example (PBE) [40], operates in a zero-shot setting and can handle in-the-wild images, meeting criteria one and three. However, it encounters a limitation due to the information bottleneck in its conditioning process, which utilizes only the [CLS] token of the reference image. This constraint leads to an over-generalization of the reference image, degrading the model's ability to maintain the fine-grained details essential for the Vit-All context; thus PBE fails to meet criterion two. Additionally, operating within a latent space, PBE struggles to retain fine-grained details of each item, underscoring the necessity of incorporating some form of pixel-level guidance.

In this work, we introduce "Diffuse to Choose" (DTC), a novel diffusion inpainting approach designed for the Vit-All application. DTC, a latent diffusion model, effectively incorporates fine-grained cues from the reference image into the main U-Net decoder using a secondary U-Net encoder. Inspired by ControlNet [42], we integrate a pixel-level "hint" into the masked region of an empty image, which is then processed through a shallow convolutional network to ensure dimensional alignment with the masked image processed by the Variational Autoencoder (VAE). DTC harmoniously blends the source and reference images, maintaining the integrity and details of the reference image. To further enhance alignment of basic features such as color, we employ a perceptual loss using a pre-trained VGG model [7]. The complete architecture is illustrated in Fig. 2, with examples showcased in Fig. 1 and Fig. 5.
DTC effectively fulfills all three criteria for the Vit-All use case: 1) it efficiently handles in-the-wild images and references, 2) it adeptly preserves the fine-grained details of products while ensuring their seamless integration into the scene, and 3) it facilitates rapid zero-shot inference. We trained DTC on an in-house training dataset of 1.2M sampled source-reference pairs and on a smaller public dataset, VITON-HD-NoFace [5]. Our quantitative evaluations and human studies demonstrate that DTC surpasses all PBE variants (for which we implemented several enhancements to enable a fair comparison against DTC) and matches the performance of non-real-time, few-shot personalization models like DreamPaint [28] within the Vit-All context.
2. Related Work
Virtual Try-On. The primary goal of virtual try-on approaches is to create an image of a person wearing a target garment. However, such approaches are typically limited to clothing and controlled settings, and are thus not fully aligned with the Vit-All task.

Image Editing, particularly inpainting, has been extensively explored in diffusion models. Initially, there were text-based image editing models [2, 9, 13, 14, 25]. However, text alone cannot capture the fine-grained details necessary for accurately describing a product, necessitating the use of image conditioning. DCCF [38] introduced pyramid filters for image composition, followed by Paint by Example [40], which conditions the diffusion model using CLIP embeddings of the reference image. However, relying solely on the [CLS] token often leads to an over-generalization of the reference image, making it unsuitable for the Vit-All task. Similarly, ObjectStitch [34] combines image and text embeddings to guide the model but faces challenges in conveying fine-grained details. In response to these challenges, we introduce Diffuse to Choose, a novel latent diffusion inpainting model. Our model effectively leverages fine-grained details from the reference image, ensuring both the preservation of the product's fine-grained details and its seamless integration into the chosen location, while working in a zero-shot setting with any in-the-wild image.
3. Method
We formulate Vit-All as an image-conditioned inpainting task, wherein a single product image is integrated into a user-specified region within a user-specified image, ensuring the preservation of the product's fine-grained details and its harmonious blend with the target image. A naive approach would be to use the conventional PBE method, shown in Fig. 3. However, due to the information bottleneck caused by PBE's reliance on only the [CLS] token for image conditioning, it tends to lose significant details of the reference image, resulting in unsatisfactory performance.
To rectify the shortcomings of PBE in preserving the reference image's details, we introduce the "Diffuse to Choose" (DTC) method. DTC leverages an auxiliary U-Net alongside the primary U-Net within a latent diffusion model, specifically Stable Diffusion v1.5 [25]. The purpose of the auxiliary U-Net is to protect the details of the reference image that might otherwise be lost due to both the latent nature of the Stable Diffusion model and the limitations of image conditioning. To this end, we directly infuse fine-grained details of the reference image into the main U-Net's decoder via affine transformations, ensuring preservation of the reference product's details in the generated image. Our pipeline is shown in Fig. 2.
3.1. Diffusion Inpainting Models
For the Vit-All inpainting task, our objective is as follows. Given a user-provided source image $x_s \in \mathbb{R}^{H \times W \times 3}$, a user-defined binary mask $m \in \{0, 1\}^{H \times W}$ with zeros indicating editable regions, and a reference image $x_r$ showcasing the desired product, the goal is to seamlessly incorporate the product image $x_r$ within the mask-defined region of $x_s$ while preserving $x_r$'s details. Diffusion models provide unparalleled success in image generation and in specific tasks such as inpainting [24, 25, 27, 31]. These models follow a Markovian process, gradually adding noise $\epsilon \sim \mathcal{N}(0, 1)$ to $x_s$ over timesteps $t$ to transform it into an isotropic Gaussian sample $z_t$. The process is then reversed by iteratively predicting and subtracting the added noise to convert $z_t$ back to $x_s$, conditioned on $c$. In the context of inpainting, this can be expressed as:
$$\mathcal{L} = \mathbb{E}_{z_t, \epsilon, t} \left\| \epsilon_\theta\big((m \odot x_s), z_t, t, c\big) - \epsilon \right\|_2^2 \qquad (1)$$
Here, $x_s$ is the source image, $m$ the user-defined mask, $z_t$ the noise-added version of $x_s$, and $c$ denotes the embeddings of $x_r$. PBE uses the [CLS] token of CLIP [23] for $c$, a deliberate information bottleneck: because PBE relies on self-referencing, additional patch tokens often lead to copy-paste artifacts. However, for the Vit-All task it is practical to compile a dataset with distinct source and reference images of the same object, thereby eliminating this bottleneck. Consequently, we introduced a series of enhancements to PBE to explore the upper limits of basic image-conditioned inpainting models in the Vit-All context and to establish a stronger baseline. Our modifications included using all CLIP patches instead of just the [CLS] token, employing a larger image encoder, DINOv2 [21], and adding a refinement loss similar to [7] alongside the diffusion loss given in Eq. 1. Each of these alterations incrementally improved the performance of the PBE approach.

Figure 4. The hint signal is stitched into a blank image within the masked region, then summed with the latent masked input before being fed into the auxiliary U-Net.
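To make the training objective in Eq. 1 concrete, the following is a minimal PyTorch-style sketch of one training step, not the paper's actual implementation; `unet`, `vae`, `image_encoder`, and `scheduler` are placeholder components assumed to expose the indicated interfaces.

```python
import torch
import torch.nn.functional as F

def inpainting_diffusion_loss(unet, vae, image_encoder, scheduler, x_s, mask, x_r):
    """One training step of the masked, image-conditioned noise-prediction loss (Eq. 1).

    x_s:  source image,    (B, 3, H, W)
    mask: binary mask,     (B, 1, H, W), zeros mark the editable region
    x_r:  reference image, (B, 3, H, W)
    """
    z0 = vae.encode(x_s)                                # clean latent of the source image
    z_masked = vae.encode(x_s * mask)                   # latent of the masked source, m ⊙ x_s
    m_lat = F.interpolate(mask, size=z0.shape[-2:])     # mask resized to the latent resolution

    # Forward (noising) process: sample a timestep and add Gaussian noise to the clean latent.
    t = torch.randint(0, scheduler.num_train_timesteps, (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)
    z_t = scheduler.add_noise(z0, noise, t)

    # Conditioning c: embeddings of the reference image (e.g., DINOv2 patch tokens).
    c = image_encoder(x_r)

    # The U-Net sees the noisy latent together with the masked source and mask,
    # and is trained to predict the injected noise.
    eps_pred = unet(torch.cat([z_t, z_masked, m_lat], dim=1), t, c)
    return F.mse_loss(eps_pred, noise)
```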
3.2. Design Of Diffuse To Choose
Creating the Hint Signal. Drawing inspiration from ControlNet [42], we propose the incorporation of a secondary U-Net encoder, which serves as a trainable replica of the main U-Net encoder. In the ControlNet architecture, however, the secondary U-Net is integrated directly into the main U-Net decoder, providing spatial conditioning. In contrast, DTC demonstrates that the secondary U-Net, rather than providing a spatial layout, can guide the main U-Net by exerting a potent pixel-wise influence from the reference image during the decoding process. To generate the hint signal, we start by creating an image of zeros identical in size to the source image $x_s \in \mathbb{R}^{H \times W \times 3}$. We then resize the reference image and insert it at the designated mask coordinates within the image of zeros. The same mask is applied to $x_s$, and this masked source image is processed by the VAE encoder to yield a latent representation of size 64×64×4. The hint image is subsequently processed by the Adapter module, a shallow convolutional network comprising four layers, to match the dimensions of the masked latent inpaint image. Finally, the hint image and the masked source latent are added element-wise to produce the final representation of the hint input, which is then processed by the replicated U-Net encoder. This process is omitted from Fig. 2 for conciseness but is illustrated in Fig. 4. Through a series of ablation studies, we demonstrate that maintaining a distinct representation for the hint image at the pixel level, while keeping the inpaint image in latent form, provides complementary signals that yield superior results.
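The hint construction described above can be sketched roughly as follows; the channel widths and strides of the adapter, the rectangular mask approximation, and the `vae.encode` interface are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HintAdapter(nn.Module):
    """Shallow four-layer conv net mapping the pixel-space hint (3, 512, 512)
    to the shape of the VAE latent (4, 64, 64)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),    # 512 -> 256
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),   # 256 -> 128
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),  # 128 -> 64
            nn.Conv2d(128, 4, 3, stride=1, padding=1),               # match latent channels
        )

    def forward(self, hint):
        return self.net(hint)

def build_hint_input(x_s, x_r, mask_box, vae, adapter):
    """x_s: (B, 3, 512, 512) source image, x_r: (B, 3, h, w) reference image,
    mask_box: (y0, y1, x0, x1) rectangle inside the user-defined mask."""
    y0, y1, x0, x1 = mask_box
    hint = torch.zeros_like(x_s)                           # image of zeros, same size as x_s
    hint[:, :, y0:y1, x0:x1] = F.interpolate(              # paste the resized reference into the mask
        x_r, size=(y1 - y0, x1 - x0), mode="bilinear")

    mask = torch.ones_like(x_s[:, :1])
    mask[:, :, y0:y1, x0:x1] = 0                           # zeros mark the editable region
    z_masked = vae.encode(x_s * mask)                      # latent masked source, 64x64x4

    return adapter(hint) + z_masked                        # element-wise sum fed to the hint U-Net
```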
Combining the Hint Signal with the Main U-Net. The Stable Diffusion U-Net encoder generates feature maps of varying resolutions at each level, comprising 13 blocks including the middle layer. Directly adding the Hint Encoder's outputs to the skip connections of the main U-Net encoder at every level tends to exert a pronounced spatial influence from the hint signal, which is often not spatially aligned with the source image and thus hurts performance. In addition to direct addition, we explore two further strategies for integrating the hint signal into the main U-Net: Feature-wise Linear Modulation (FiLM) [22] and cross attention, computed in a manner akin to [43]. Among these three approaches (direct addition, FiLM, and cross attention), FiLM emerges as the most effective. We argue this is because the image conditioning already captures the majority of low-level details from the reference image, with mostly fine-grained details being absent; FiLM specifically enhances the feature channels that are essential for preserving those fine-grained details of the reference image.
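As an illustration of the FiLM-based fusion, the sketch below modulates a main-encoder feature map with per-channel scale and shift parameters predicted from the corresponding hint-encoder feature map; the exact parameterization and its placement inside the U-Net are assumptions.

```python
import torch
import torch.nn as nn

class FiLMFusion(nn.Module):
    """FiLM-style fusion of hint-encoder features into a main U-Net skip connection."""

    def __init__(self, channels):
        super().__init__()
        # Predict one (gamma, beta) pair per channel from globally pooled hint features.
        self.to_gamma_beta = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, 2 * channels, kernel_size=1),
        )

    def forward(self, main_feat, hint_feat):
        # main_feat, hint_feat: (B, C, H, W) feature maps at the same encoder level.
        gamma, beta = self.to_gamma_beta(hint_feat).chunk(2, dim=1)   # (B, C, 1, 1) each
        # Channel-wise affine modulation: channels carrying fine-grained reference
        # details can be amplified while others are left largely unchanged.
        return main_feat * (1 + gamma) + beta
```

One such module would sit at each encoder level whose output feeds a skip connection of the main U-Net decoder.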
Hinting Strategies, Refinement Loss, and Image Encoder. Our objective is to convey pixel-level, fine-grained details from the reference image into the main U-Net, and there are several ways to achieve this. One approach is to focus on high-frequency details by employing techniques such as Canny edges or Holistically-Nested Edge Detection (HED) features. Alternatively, we can use the reference image itself directly. In our experiments, we tested Canny edge extraction using the implementation in the OpenCV library [4], with minimum and maximum threshold values of 30 and 140; for slightly softer edges, we used the HED model [36]. Although these strategies yield comparable results, directly using the reference image proved the most effective, as it conveys the entire spectrum of details from the reference image rather than focusing solely on high-frequency content. Thus, instead of pre-filtering to convey only high-frequency details, it is better to let the FiLM layer decide which channels matter most, thereby capturing the essential nuances of the reference image.
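For reference, here is a minimal sketch of the hint-content variants, using OpenCV's Canny detector with the thresholds quoted above; the HED variant requires a separately loaded edge-detection model and is omitted, and the function name is illustrative.

```python
import cv2
import numpy as np

def make_hint_content(reference_bgr: np.ndarray, mode: str = "image") -> np.ndarray:
    """Build the pixel-level hint content from a reference image (H, W, 3, uint8).

    'image' uses the reference directly (the best-performing option in our ablations);
    'canny' keeps only high-frequency edges.
    """
    if mode == "image":
        return reference_bgr
    if mode == "canny":
        gray = cv2.cvtColor(reference_bgr, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 30, 140)                  # min/max thresholds from the text
        return np.repeat(edges[:, :, None], 3, axis=2)    # replicate edges to 3 channels
    raise ValueError(f"unsupported hint mode: {mode}")
```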
For our image encoder, we use DINOv2 [21], which outputs 256×1536-dimensional embeddings to represent a reference image; these are then reduced to 256×768 by a trainable MLP layer. Finally, we utilize a perceptual loss based on a pre-trained VGGNet [30], computed by comparing the feature maps from the first five layers of VGGNet for the source and generated images, thereby implicitly ensuring the alignment of basic features such as color.
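A rough sketch of such a perceptual loss is given below; the VGG-16 variant, the reading of "first five layers" as the first five modules of torchvision's feature extractor, and the L1 distance are assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

class PerceptualLoss(torch.nn.Module):
    """Compares early VGG feature maps of the generated and source images."""

    def __init__(self, num_layers: int = 5):
        super().__init__()
        features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:num_layers]
        self.layers = torch.nn.ModuleList(features).eval()
        for p in self.parameters():
            p.requires_grad_(False)          # the VGG backbone stays frozen

    def forward(self, generated: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        loss, g, s = 0.0, generated, source
        for layer in self.layers:
            g, s = layer(g), layer(s)
            loss = loss + F.l1_loss(g, s)    # accumulate per-layer feature differences
        return loss
```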
4. Experiments

4.1. Dataset And Implementation Details
Data. We compiled an in-house training dataset composed of product images. E-commerce products often have multiple images available, so during training we do not need to adhere to the self-reference setting employed by PBE, where the reference image is derived from $x_s$, leading to potential overfitting. However, not all products yield useful $(x_s, x_r)$ pairs, as many product images feature only the product against a white background. While such images are apt for use as $x_r$, they are unsuitable for $x_s$, since we require images of products in contextual settings (with a natural background). To address this, we employed an in-house model to identify products that have at least one $x_s$ image depicting the product in a natural setting, interacting with other elements in the scene, and one image of the product itself, $x_r$, for which we collect images against a white background when they exist. Finally, we use GroundingDINO [19] and SAM [15], together with the product type of $x_s$, to create the inpainting mask within $x_s$. From the resulting dataset, we sampled a training set of 1.2M examples, evenly split between wearables and furniture. To ensure accessibility and reproducibility, we also train and test our model on a public dataset modified to remove model faces, VITON-HD-NoFace [5], which provides $x_r$ against a white background, masks, and $x_s$ images in which individuals (with removed faces) wear $x_r$.

Implementation details. We use a latent diffusion model, Stable Diffusion [25] v1.5, as the backbone in our experiments. Our image resolution is 512×512, and we train with DDPM [11] using a constant learning rate of 1e-5 in both the PBE and DTC implementations. We use simple augmentations such as rotation and flipping but avoid the strong augmentations of [40], as we do not rely on self-referencing. We also use classifier-free guidance [10] in a fashion similar to [42]. During inference, we use DDIM [32] with a guidance scale of 5, and for the hint input we stitch the reference image into the largest rectangular bounding box within the arbitrarily shaped binary mask. We train our model for 40 epochs on 8 NVIDIA A100 40G GPUs.
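As an illustration of the classifier-free guidance used at inference, the following sketch shows a single guided noise prediction with the guidance scale of 5; the function and argument names are placeholders, and the surrounding DDIM update is omitted.

```python
import torch

@torch.no_grad()
def guided_noise_prediction(unet, z_t, t, cond, uncond, guidance_scale: float = 5.0):
    """Classifier-free guidance for one denoising step.

    cond:   conditioning for the reference image (embeddings plus hint features)
    uncond: the corresponding null / empty conditioning learned during training
    """
    eps_cond = unet(z_t, t, cond)
    eps_uncond = unet(z_t, t, uncond)
    # Move the prediction away from the unconditional estimate toward the conditional one.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```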
4.2. Paint By Example Ablations
To ascertain the optimal performance of the naive image-conditioned inpainting approach [40] and create the strongest possible baseline, we implemented a series of modifications to the architecture illustrated in Fig. 3. Originally, PBE utilized self-reference conditioning, which involved cropping $x_r$ from within $x_s$. However, in the Vit-All context we circumvent this limitation by having separate images for $x_r$. As a result, instead of using only the [CLS] token from CLIP [23] in the image encoder, we incorporated all CLIP patches alongside the [CLS] token.
Subsequently, we increased the capacity of the image encoder by adopting DINOv2 [21], a larger and purely image-based model. Furthermore, akin to [7], we integrated a perceptual loss with the diffusion loss, using an ImageNet pre-trained VGGNet [30] to foster alignment of basic features, such as color and certain textures, between the generated and source images. A qualitative example of this implementation is presented in Fig. 6.
Figure 6. Effect of perceptual loss: without perceptual loss vs. with perceptual loss.
4.3. Diffuse-To-Choose Ablations
Table 1. CLIP Score (higher is better) and FID (lower is better) on our Vit-All dataset for the best PBE baseline and DTC variants with different hint-integration strategies.

| Method | CLIP Score ↑ | FID ↓ |
|---|---|---|
| PBE (best) | 85.43 | 6.65 |
| Ours (addition) | 86.94 | 6.19 |
| Ours (cross attention) | 88.01 | 5.68 |
| Ours (FiLM) | 88.14 | 5.72 |
Hint Pathway Ablations. It is possible to directly insert the reference image $x_r$ into the masked source image $m \odot x_s$ and process it with the VAE, circumventing the adapter network that aligns the hint image's resolution with the latent masked image prior to addition. However, this approach produces suboptimal results, yielding a CLIP Score of 86.97 and an FID of 6.26 on our Vit-All dataset, in contrast to the non-latent hint insertion's CLIP Score of 88.14 and FID of 5.72. We hypothesize that maintaining the hint signal at the pixel level introduces additional information that is overlooked in its VAE-encoded latent counterpart, indicating that the VAE may discard certain features during encoding.

Ablations on Hinting Strategies. We explored alternatives to directly inserting the reference image into the masked source image, namely Canny edges and HED features, both of which are designed to convey the high-frequency details that are absent in image-only conditioning. However, we observed a slight underperformance with both HED and Canny edges compared to the direct use of the reference image, as evidenced by CLIP scores of 87.85 for Canny and 86.98 for HED, compared to 88.14 for direct usage on our Vit-All dataset. Similarly, the FID scores were 6.11 for Canny and 6.57 for HED, against 5.72 for direct insertion.

Ablations of Techniques for Integrating the Hint Signal into the Main U-Net. There are multiple ways to merge the signals from the hint U-Net and the main U-Net before incorporating the combined signal into the main U-Net decoder. We explored three approaches: direct addition, affine transformation layers (FiLM) [22], and the more computationally expensive cross-attention layers [43]. Results shown in Tab. 1 reveal that both FiLM and cross-attention layers outperform direct addition. Moreover, cross attention and FiLM yield comparable results, and FiLM is cheaper to compute; we therefore use FiLM in our final model.