Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All
Abstract
Figure 1. Diffuse to Choose (DTC) allows users to virtually place any e-commerce item in any setting, ensuring detailed, semantically coherent blending with realistic lighting and shadows.
Few-shot personalization methods such as DreamPaint are good at preserving the item's details, but they are not optimized for real-time applications. We present "Diffuse to Choose," a novel diffusion-based image-conditioned inpainting model that efficiently balances fast inference with the retention of high-fidelity details in a given reference item while ensuring accurate semantic manipulations in the given scene content. Our approach is based on incorporating fine-grained features from the reference image directly into the latent feature maps of the main diffusion model, along with a perceptual loss to further preserve the reference item's details. We conduct extensive testing on both in-house and publicly available datasets, and show that Diffuse to Choose is superior to existing zero-shot image-conditioned inpainting models while matching the performance of few-shot personalization methods.
1. Introduction
The ever-growing demand for online shopping underscores the need for a more immersive shopping experience, allowing shoppers to virtually 'try' any product from any category (clothes, shoes, furniture, decoration, etc.) within their personal environments. The concept of a Virtual Try-All (Vit-All) model hinges on its functionality as an advanced semantic image composition tool. In practice, this involves taking an image from a user, selecting a region within that image, and using a reference product image from an online catalog to semantically insert the product into the selected area while preserving its details. For such a model to be effective, it must fulfill three primary conditions: 1) operate on any 'in-the-wild' user image and reference image (not only staged studio shots or professional human-model images with predefined poses), 2) integrate the reference product harmoniously with the surrounding context while maintaining the product's identity (not replacing the product with a generic item from a similar category), and 3) perform fast inference to facilitate real-time usage across billions of products and millions of users.
Existing solutions tend to be specialized. Instead of aiming for a general-purpose Vit-All approach, models are often developed for specific tasks and domains (a model for clothing, a model for furniture, a model for eyeglasses, etc.). For example, early GAN-based works focused primarily on virtual try-on of clothing on human models in limited contexts or controlled environments (such as only certain clothing segments and no in-the-wild user or product images) [3, 5, 12, 17, 18, 38]. Other approaches utilize comparatively expensive 3D AR/VR technologies for items like furniture in rooms [1, 29], which is hard to scale to catalogs of billions of products that often lack 3D models. Consequently, a unified model offering a comprehensive Vit-All experience, one that enables consumers to digitally interact with any product from any category in any setting, is currently not available.
The emergence of diffusion models has marked a significant breakthrough in the generative capabilities of complex image modeling [25, 27, 33]. Unlike GANs, diffusion models inherently grasp the nuances of the 3D world, exhibiting a degree of geometry and physics awareness, as demonstrated in inpainting tasks by [41], which establishes their usefulness for Vit-All applications. A DreamBooth-based [26] technique, called DreamPaint [28], showed that Stable Diffusion [25] can be few-shot fine-tuned for the Vit-All use case: it can infer how to warp clothes to a person's body, or how to place a piece of furniture in a given spot in a semantically correct manner, without being explicitly trained to do so. While DreamPaint meets the first two criteria for an effective Vit-All model, it requires few-shot fine-tuning for each product separately, compromising its suitability for real-time applications and thus failing to meet the third criterion.
A recently introduced image-referenced inpainting model, Paint By Example (PBE) [40], operates in a zero-shot setting and can handle in-the-wild images, meeting criteria one and three. However, it encounters a limitation due to its reliance on an information bottleneck in its conditioning process, utilizing only the [CLS] token of the reference image. This constraint leads to an over-generalization of the reference image, degrading the model's ability to maintain the fine-grained details essential in the Vit-All context; thus PBE fails to meet criterion two. Additionally, operating within a latent space, PBE struggles to retain fine-grained details of each item, underscoring the necessity of incorporating some form of pixel-level guidance.
In this work, we introduce "Diffuse to Choose" (DTC), a novel diffusion inpainting approach designed for the Vit-All application. DTC, a latent diffusion model, effectively incorporates fine-grained cues from the reference image into the main U-Net decoder using a secondary U-Net encoder. Inspired by ControlNet [42], we integrate a pixel-level "hint" into the masked region of an empty image, which is then processed through a shallow convolutional network to ensure dimensional alignment with the masked image processed by the Variational Autoencoder (VAE). DTC harmoniously blends the source and reference images, maintaining the integrity and details of the reference image. To further enhance alignment of basic features such as color, we employ a perceptual loss using a pre-trained VGG model [7]. The complete architecture is illustrated in Fig. 2, with examples showcased in Fig. 1 and Fig. 5.
DTC effectively fulfills all three criteria for the Vit-All use case: 1) it efficiently handles in-the-wild images and references, 2) it adeptly preserves the fine-grained details of products while ensuring their seamless integration into the scene, and 3) it facilitates rapid zero-shot inference. We trained DTC on an in-house training dataset of 1.2M sampled source-reference pairs and on a smaller public dataset, VITON-HD-NoFace [5]¹. Our quantitative evaluations and human studies demonstrate that DTC surpasses all PBE variants (for which we implemented several enhancements to facilitate a fair comparison against DTC) and matches the performance of non-real-time, few-shot personalization models like DreamPaint [28] within the Vit-All context.
2. Related Work
Virtual Try-On. The primary goal of virtual try-on approaches is to create an image of a person wearing a target garment. As discussed above, these methods are typically confined to specific clothing domains and controlled settings, and are therefore not fully aligned with the Vit-All task.

Image Editing, particularly inpainting, has been extensively explored in diffusion models. Initially, there were text-based image editing models [2, 9, 13, 14, 25]. However, text alone cannot capture the fine-grained details necessary for accurately describing a product, necessitating the use of image conditioning. DCCF [38] introduced pyramid filters for image composition, followed by Paint by Example [40], which conditions the diffusion model using CLIP embeddings of the reference image. However, relying solely on the [CLS] token often leads to an over-generalization of the reference image, making it unsuitable for the Vit-All task. Similarly, ObjectStitch [34] combines image and text embeddings to guide the model but faces challenges in conveying fine-grained details. In response to these challenges, we introduce Diffuse to Choose, a novel latent diffusion inpainting model. Our model effectively leverages fine-grained details from the reference image, ensuring both the preservation of the product's fine-grained details and its seamless integration into the chosen location, while working in a zero-shot setting with any in-the-wild image.
3. Method
We formulate Vit-All as an image-conditioned inpainting task, wherein a single product image is integrated into a user-specified region within a user-specified image, ensuring the preservation of the product's fine-grained details and its harmonious blend with the target image. A naive approach would be to use the conventional PBE method, shown in Fig. 3. However, due to the information bottleneck caused by PBE's reliance on only the [CLS] token for image conditioning, it tends to lose significant details of the reference image, resulting in unsatisfactory performance.
To rectify the shortcomings of the PBE in preserving the reference image's details, we introduce the "Diffuse to Choose" (DTC) method. DTC leverages an auxiliary U-Net alongside the primary U-Net within a latent diffusion model, specifically Stable Diffusion v1.5 [25] . The purpose of the auxiliary U-Net is to protect the details of the reference image that might be lost due to both the latent nature of the Stable Diffusion model and the limitations of image conditioning. To this end, we directly infuse fine-grained details of the reference image into the main U-Net's decoder via affine transformations, ensuring preservation of the reference product's details in the generated image. Our pipeline is shown in Fig. 2 .
3.1. Diffusion Inpainting Models
For the Vit-All inpainting task, our objective is as follows: given a user-provided source image x_s ∈ R^{H×W×3}, a user-defined binary mask m ∈ {0, 1}^{H×W} with zeros indicating editable regions, and a reference image x_r showcasing the desired product, the goal is to seamlessly incorporate the product image x_r within the mask-defined region of x_s while preserving x_r's details. Diffusion models provide unparalleled success in image generation and in specific tasks such as inpainting [24, 25, 27, 31]. These models follow a Markovian process, gradually adding noise ϵ ∼ N(0, I) to x_s over timesteps t until it becomes an isotropic Gaussian z_t. The process is then reversed by iteratively predicting and subtracting the added noise to convert z_t back to x_s, conditioned on c. In the context of inpainting, this can be mathematically expressed as:
$\mathcal{L} = \mathbb{E}_{z_t, \epsilon, t}\,\big\| \epsilon_\theta\big((m \odot x_s),\, z_t,\, t,\, c\big) - \epsilon \big\|_2^2 \qquad (1)$
Here, x_s is the source image, m the user-defined mask, z_t the noise-added version of x_s, and c denotes the embeddings of x_r. PBE uses the [CLS] token of CLIP [23] for c, a deliberate information bottleneck: because it relies on self-referencing, conditioning on additional patches often leads to copy-paste artifacts. However, for the Vit-All task, it is practical to compile a dataset with distinct source and reference images of the same object, thereby eliminating this bottleneck. Consequently, we introduced a series of enhancements to PBE to explore the upper limits of basic image-conditioned inpainting models in the Vit-All context and establish a stronger baseline. Our modifications included using all CLIP patches instead of just the [CLS] token, employing a larger image encoder, DINOv2 [21], and adding a refinement loss similar to [7] alongside the diffusion loss given in Eq. 1. Each of these alterations incrementally improved the performance of the PBE approach.

Figure 4. The hint signal is stitched into a blank image within the masked region, then summed with the latent masked input before being fed into the auxiliary U-Net.
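To make Eq. 1 concrete, the following is a minimal PyTorch-style sketch of the masked-inpainting objective. The `unet`, `vae`, `cond_encoder`, and `scheduler` names are assumed, simplified interfaces (for example, `vae.encode` returning scaled latents and `unet` returning the predicted noise directly), and the 4+1+4 channel layout mirrors common Stable Diffusion inpainting setups rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def inpainting_diffusion_loss(unet, vae, cond_encoder, scheduler, x_s, mask, x_r):
    """Sketch of Eq. 1: predict the noise added to the source latent, conditioned on
    the masked source and the reference-image embeddings. All interfaces are assumed."""
    z0 = vae.encode(x_s)                      # clean latent of the source image
    z_masked = vae.encode(mask * x_s)         # latent of m ⊙ x_s (zeros mark the editable region)

    # Sample a timestep and Gaussian noise, then form the noisy latent z_t.
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (x_s.shape[0],), device=x_s.device)
    eps = torch.randn_like(z0)
    z_t = scheduler.add_noise(z0, eps, t)

    c = cond_encoder(x_r)                     # reference-image embeddings (conditioning c)

    # Inpainting U-Nets typically see the noisy latent, the resized mask, and the masked latent.
    m_lat = F.interpolate(mask, size=z0.shape[-2:], mode="nearest")
    eps_pred = unet(torch.cat([z_t, m_lat, z_masked], dim=1), t, encoder_hidden_states=c)
    return F.mse_loss(eps_pred, eps)
```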
3.2. Design Of Diffuse To Choose
Creating the Hint Signal. Drawing inspiration from ControlNet [42], we propose the incorporation of a secondary U-Net encoder, which serves as a trainable replica of the main U-Net encoder. In the ControlNet architecture, however, the secondary U-Net is integrated directly into the main U-Net decoder, providing spatial conditioning. In contrast, DTC demonstrates that the secondary U-Net, rather than providing a spatial layout, can guide the main U-Net by exerting a potent pixel-wise influence from the reference image during the decoding process. To generate the hint signal, we start by creating an image of zeros, identical in size to the source image x_s ∈ R^{H×W×3}. Subsequently, we resize the reference image and insert it within the designated mask coordinates in the image of zeros. The same mask is then applied to x_s, and this masked source image is processed by the VAE encoder to yield a latent representation of size 64×64×4. The hint image is subsequently processed by the Adapter module, a shallow convolutional network comprising four layers, to match the dimensions of the masked latent inpaint image. Finally, the hint image and the masked source are added element-wise to produce the final representation of the hint input, which is then processed by the replicated U-Net encoder. This process is not shown in Fig. 2 to keep it concise, but is illustrated in Fig. 4 for clarity. Through a series of ablation studies, we demonstrate that maintaining a distinct representation for the hint image at the pixel level, while keeping the inpaint image in latent form, provides complementary signals that yield superior results.
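The hint construction described above can be sketched as follows. The adapter's channel widths and strides, the `vae.encode` interface, and the use of the mask's largest inner rectangle (`bbox`) for stitching are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HintAdapter(nn.Module):
    """Shallow 4-layer conv net mapping the pixel-level hint (3 x 512 x 512) to the
    shape of the latent masked source (4 x 64 x 64). Widths are illustrative."""
    def __init__(self, latent_channels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),    # 512 -> 256
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),   # 256 -> 128
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),  # 128 -> 64
            nn.Conv2d(128, latent_channels, 3, padding=1),          # 64 x 64, 4 channels
        )

    def forward(self, hint):
        return self.net(hint)

def build_hint_input(x_s, x_r, mask, bbox, vae, adapter):
    """Stitch the resized reference into an all-zeros image at the largest rectangle
    inside the mask, project it with the adapter, and add it element-wise to the VAE
    latent of the masked source. The result feeds the replicated (hint) U-Net encoder."""
    y0, x0, y1, x1 = bbox
    hint = torch.zeros_like(x_s)                                   # image of zeros, H x W x 3
    hint[:, :, y0:y1, x0:x1] = F.interpolate(
        x_r, size=(y1 - y0, x1 - x0), mode="bilinear", align_corners=False)
    z_masked = vae.encode(mask * x_s)                              # m ⊙ x_s in latent space
    return z_masked + adapter(hint)                                # element-wise sum
```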
Combining the Hint Signal with the Main U-Net. The Stable Diffusion U-Net encoder generates feature maps of varying resolutions at each level, consisting of 13 blocks including the middle layer. Directly adding the Hint Encoder's outputs to the skip connections of the main U-Net encoder at every level tends to exert a pronounced spatial influence from the hint signal, which is often not spatially aligned with the source image and thus negatively affects performance. In addition to direct addition, we explore two strategies for integrating the hint signal into the main U-Net: Feature-wise Linear Modulation (FiLM) [22], and Cross Attention, computed in a manner akin to [43]. Among these three approaches (direct addition, FiLM, and Cross Attention), FiLM emerges as the most effective. We argue this is because the image conditioning already captures the majority of low-level details from the reference image, with mostly fine-grained details being absent; FiLM specifically enhances those feature channels that are essential for preserving the fine-grained details of the reference image.
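A FiLM-style fusion of a hint-encoder feature map with the matching main U-Net skip connection might look like the sketch below; the (1 + gamma) residual form, the 1x1 convolution, and the per-level placement are our assumptions about a reasonable instantiation.

```python
import torch
import torch.nn as nn

class FiLMFusion(nn.Module):
    """Hint features predict a per-channel scale (gamma) and shift (beta) that modulate
    the main U-Net skip connection of the same resolution before it reaches the decoder."""
    def __init__(self, channels):
        super().__init__()
        self.to_gamma_beta = nn.Sequential(
            nn.SiLU(),
            nn.Conv2d(channels, 2 * channels, kernel_size=1),
        )

    def forward(self, main_skip, hint_feat):
        gamma, beta = self.to_gamma_beta(hint_feat).chunk(2, dim=1)
        # (1 + gamma) keeps the modulation close to identity early in training.
        return main_skip * (1 + gamma) + beta
```

In this arrangement, the modulated feature map simply replaces the plain skip connection at the corresponding decoder level, so only channels that the hint deems important are amplified.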
Hinting Strategies, Refinement Loss and Image Encoder. Our objective is to convey pixel-level, fine-grained details from the reference image into the main U-Net, and there are several methods to achieve this. One approach is to focus on high-frequency details by employing techniques like Canny Edges or Holistically-Nested Edge Detection (HED) features. Alternatively, we can directly use the reference image itself. In our experiments, we tested Canny edge extraction using the implementation in the OpenCV library [4] , with minimum and maximum threshold values of 30 and 140, and for slightly softer edges, we used the HED model [36] . Despite these strategies yielding comparable results, directly using the reference image proved to be the most effective as it conveys the entire spectrum of details from the reference image, rather than focusing solely on high-frequency details. Thus, instead of pre-filtering to only convey the high-frequency details, it is a better approach to let the FiLM layer decide the most important channels, thus capturing the essential nuances of the reference image.
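For reference, a hedged sketch of the Canny-based hint variant is shown below, using the 30/140 thresholds mentioned above. The per-channel option mirrors the RGB edge map described in the appendix; the exact pre- and post-processing used in the experiments is an assumption.

```python
import cv2
import numpy as np

def canny_hint(reference_bgr, low=30, high=140, per_channel=True):
    """Edge-based hint variant from the ablations: Canny edges of the reference image
    (expects a uint8 BGR array). Returns an H x W x 3 edge map."""
    if per_channel:
        edges = [cv2.Canny(np.ascontiguousarray(reference_bgr[:, :, c]), low, high)
                 for c in range(3)]
        return np.stack(edges, axis=-1)
    gray = cv2.cvtColor(reference_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)
    return np.repeat(edges[:, :, None], 3, axis=-1)
```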
For our image encoder, we use DINOv2 [21], which outputs 256×1536-dimensional embeddings for a reference image; these are then reduced to 256×768 by a trainable MLP layer. Finally, we utilize a perceptual loss based on a pre-trained VGGNet [30], computed by comparing the feature maps from the first five layers of VGGNet for the source and generated images, thereby implicitly ensuring the alignment of basic features like color.
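A sketch of such a perceptual loss is given below. We pick VGG-16 and an L1 distance over the first five feature modules as plausible choices; the paper does not specify the exact VGG variant, layer slicing, or distance, so these are assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision

class VGGPerceptualLoss(torch.nn.Module):
    """Compare early VGG feature maps of the generated and source images so that basic
    features (e.g. color) stay aligned. Frozen, ImageNet pre-trained backbone."""
    def __init__(self, num_layers=5):
        super().__init__()
        vgg = torchvision.models.vgg16(
            weights=torchvision.models.VGG16_Weights.IMAGENET1K_V1)
        self.features = vgg.features[:num_layers].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)

    def forward(self, generated, source):
        loss, g, s = 0.0, generated, source
        for layer in self.features:          # accumulate distances layer by layer
            g, s = layer(g), layer(s)
            loss = loss + F.l1_loss(g, s)
        return loss
```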
4.1. Dataset And Implementation Details
Data. We compiled an in-house training dataset composed of product images. Fortunately, e-commerce products often have multiple images available, so during training we do not need to adhere to the self-reference setting employed by PBE, where the reference image is derived from x_s, leading to potential overfitting. However, not all products yield useful (x_s, x_r) pairs, as many product images feature only the product against a white background. While these images are apt for use as x_r, they are unsuitable for x_s, since we require images of products in contextual settings (with a natural background). To address this, we employed an in-house model to identify products that have at least one x_s image depicting the product in a natural setting, interacting with other elements in the scene, and one image of the product itself, x_r, for which we collect an image against a white background if one exists. Finally, we use GroundingDINO [19] and SAM [15], along with the product type of x_s, to create the inpainting mask within x_s. From the resulting data, we sampled a training dataset of 1.2M samples, evenly split between wearables and furniture. To ensure accessibility and reproducibility, we also train and test our model on a public dataset modified to remove model faces, VITON-HD-NoFace [5], which provides x_r against a white background, masks, and x_s where individuals (with removed faces) are wearing x_r.

Implementation details. We use a latent diffusion model, Stable Diffusion [25] v1.5, as the backbone in our experiments. Our image resolution is 512×512, and we train with DDPM [11] using a constant learning rate of 1e-5 in both the PBE and DTC implementations. We use simple augmentations such as rotation and flipping but avoid the strong augmentations used in [40], as we do not rely on self-referencing. We also use classifier-free guidance [10] in a similar fashion to [42]. During inference, we use DDIM [32] with a guidance scale of 5, and for the hint input we stitch the reference image into the largest rectangular bounding box within the arbitrarily shaped binary mask. We use 8 NVIDIA A100 40G GPUs to train our model for 40 epochs.
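For illustration, classifier-free guidance at inference can be sketched as below. The batched conditional/unconditional call and the argument names are assumptions; the guidance scale of 5 is taken from the text.

```python
import torch

@torch.no_grad()
def cfg_noise_prediction(unet, z_t, t, cond, uncond, guidance_scale=5.0):
    """Classifier-free guidance for one denoising step: combine the conditional and
    unconditional noise predictions. `cond` is the reference-image conditioning and
    `uncond` its dropped/null counterpart used during training."""
    z_in = torch.cat([z_t, z_t], dim=0)
    ctx = torch.cat([uncond, cond], dim=0)
    eps_uncond, eps_cond = unet(z_in, t, encoder_hidden_states=ctx).chunk(2, dim=0)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```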
4.2. Paint By Example Ablations
To ascertain the optimal performance of the naive image-conditioned inpainting approach [40] and create the strongest possible baseline, we implemented a series of modifications to the architecture illustrated in Fig. 3. Originally, PBE utilized self-reference conditioning, which involved cropping x_r from within x_s. In the Vit-All context, however, we circumvent this limitation by having separate images for x_r. As a result, instead of using only the [CLS] token from CLIP [23] in the image encoder, we incorporated all CLIP patches alongside the [CLS] token.
Subsequently, we increased the capacity of the image encoder by adopting DINOv2 [21], a larger and purely image-based model. Furthermore, akin to [7], we integrated a perceptual loss with the diffusion loss. This involved using an ImageNet pre-trained VGGNet [30] to foster alignment of basic features, such as color and certain textures, between the generated and source images. A qualitative example of this implementation is presented in Fig. 6.
Figure 6. Effect of the perceptual loss (without vs. with perceptual loss).
4.3. Diffuse-To-Choose Ablations
Table 1. Ablation of hint-integration strategies on our Vit-All dataset (CLIP Score, higher is better; FID, lower is better).

Method | CLIP Score | FID
PBE_best | 85.43 | 6.65
Ours (addition) | 86.94 | 6.19
Ours (Cross Attention) | 88.01 | 5.68
Ours (FiLM) | 88.14 | 5.72
Hint Pathway Ablations. It is possible to directly insert the reference image x_r into the masked source image m ⊙ x_s and process it with the VAE, circumventing the adapter network that aligns the hint image's resolution with the latent masked image prior to addition. However, this approach produces suboptimal results, yielding a CLIP Score of 86.97 and an FID of 6.26 on our Vit-All dataset, in contrast to the non-latent hint insertion's CLIP Score of 88.14 and FID of 5.72. We hypothesize that maintaining the hint signal at the pixel level introduces additional information that is overlooked in its VAE-encoded latent counterpart, indicating that the VAE may be discarding certain features during encoding.

Ablations on Hinting Strategies. We explored alternatives to directly inserting the reference image into the masked source image. These alternatives included Canny Edges and HED features, both of which are designed to convey the high-frequency details that are absent in image-only conditioning. However, we observed a slight underperformance with both HED and Canny edges compared to the direct use of the reference image. This was evidenced by the CLIP scores, which were 87.85 for Canny and 86.98 for HED, compared to 88.14 for direct usage on our Vit-All dataset. Similarly, the FID scores were 6.11 for Canny and 6.57 for HED, against 5.72 for direct insertion.

Ablations of Techniques for Integrating the Hint Signal into the Main U-Net. There are multiple ways to merge the signals from the hint U-Net and the main U-Net before incorporating the combined signal into the main U-Net decoder. We explored three approaches: direct addition, affine transformation (FiLM) layers [22], and the integration of more computationally expensive Cross Attention layers [43]. The results in Tab. 1 reveal that both FiLM and Cross Attention outperform direct addition. Cross Attention and FiLM yield comparable results, and FiLM is cheaper to compute; therefore we use FiLM in our final model.
4.4. Evaluation And Comparisons
Comparison Against Paint by Example Variants. We implemented a series of enhancements to PBE and trained each variant on the VITON-HD-NoFace dataset. The results are presented in Table 2. As anticipated, using all CLIP patches surpasses the performance of using only the [CLS] token, which is limited to encoding a generalized version of x_r. Furthermore, increasing the size of the image encoder by using DINOv2 notably enhances performance. Notably, the addition of perceptual loss provides a marginal improvement in scenarios where the model initially struggled with basic features, such as color. While PBE, particularly with DINOv2 and perceptual loss, is adept at handling basic items with minimal details, it often falls short in the inpainting of detailed items. In contrast, DTC exhibits superior performance, especially in preserving the fine-grained details of items. Figure 9 illustrates the outcomes achieved with certain enhancements.

Comparisons Against Few-Shot Personalization Methods. While personalization methods such as DreamBooth [26] do not support inpainting, the recently introduced DreamPaint approach [28] enables similar few-shot fine-tuning of the U-Net in a masked setting, allowing for the generation of specified concepts at user-defined locations. However, DreamPaint requires few-shot fine-tuning with multiple product images, taking about 40 minutes per product. We manually selected 30 samples to compare DTC with DreamPaint and PBE. Visual comparisons are presented in Fig. 8. Furthermore, we conducted a subjective human survey, the results of which are tabulated in Table 3. A total of 20 participants scored each image on a scale from 1 to 5, with 1 being the best, based on both the inpainted region's similarity to the reference image and its contextual blending. The results show that DTC, despite being a zero-shot model, performs on par with DreamPaint, which requires few-shot fine-tuning with multiple x_r.
5. Conclusion And Limitations
Limitations. DTC has limitations. Despite our efforts to inject fine-grained details, the model may still overlook fine-grained details, particularly in text engravings, a challenge inherent to Stable Diffusion (see Fig. 7). Additionally, the model might alter human poses since it does not consider pose, leading to discrepancies with pose-agnostic masking, especially for full-body coverage (see Fig. 10 in the Appendix). Introducing pose conditioning could mitigate this, but we prioritized a general-purpose model over task-specific auxiliary inputs for broader applicability.

Conclusion. In this paper, we introduced "Diffuse to Choose," a novel image-conditioned diffusion inpainting model designed for Virtual Try-All, aiming to integrate e-commerce items into user images while preserving item details. Our main contributions include employing a secondary U-Net to infuse fine-grained signals from the reference image into the primary U-Net decoder using basic affine transformation layers within a latent diffusion model. Moreover, we refined the PBE model toward the peak performance achievable with straightforward image-conditioned inpainting models. We compared DTC with upgraded PBE variants and a few-shot personalization method using the VITON-HD-NoFace dataset and a larger in-house dataset, and showed that DTC outperforms existing diffusion-based inpainting baselines while matching the performance of few-shot personalization.

5.1. Masking Strategy

During training, with equal probability, we alternate between a fine-grained mask (where we mask only the item specifically) and bounding-box-shaped masks (covering the largest bounding box spanned by the fine-grained mask).
For each case, we stitch the reference image within the largest rectangular shape inside the mask. This approach is straightforward in the case of rectangular masks. However, for fine-grained masks, we calculate the largest rectangular area within the binary mask. Initially, we construct a histogram for each row of the matrix, with each entry in the histogram representing the cumulative height of masked areas in the column up to that row. We then calculate the maximum area rectangle that can be formed in each histogram, updating the coordinates of the largest rectangle as we iterate through the rows. This process ultimately yields the top-left and bottom-right coordinates of the largest rectangle fitting inside the mask. An example is shown in Fig. 11 . During inference, we stitch the hint image within the largest rectangular region of the mask.
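A sketch of this histogram-based maximal-rectangle computation is given below. It assumes a boolean mask where True marks masked (editable) pixels, so a mask using the paper's {0, 1} convention (zeros editable) would first be inverted.

```python
import numpy as np

def largest_rectangle_in_mask(mask):
    """Return (top, left, bottom, right) of the largest axis-aligned rectangle that fits
    entirely inside the masked region. `mask` is a boolean H x W array, True = masked."""
    H, W = mask.shape
    heights = np.zeros(W, dtype=int)
    best_area, best = 0, (0, 0, 0, 0)

    for row in range(H):
        # Histogram: cumulative height of consecutive masked pixels ending at this row.
        heights = np.where(mask[row], heights + 1, 0)

        # Largest rectangle in the current histogram (monotonic stack + sentinel column).
        stack = []
        for col in range(W + 1):
            h = heights[col] if col < W else 0
            while stack and heights[stack[-1]] >= h:
                top_h = heights[stack.pop()]
                left = stack[-1] + 1 if stack else 0
                area = top_h * (col - left)
                if area > best_area:
                    best_area = area
                    best = (row - top_h + 1, left, row, col - 1)  # top, left, bottom, right
            stack.append(col)
    return best
```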
Figure 10. Pose-agnostic masking case (garment-only mask vs. pose-agnostic mask).
5.2. Implementation Details And Inference Performance
In all our experiments, we used Stable Diffusion v1.5 [25]. For our image encoder, we employed DINOv2 [21], which outputs a 1536-dimensional vector for every patch of the 224×224×3 reference image, yielding 256×1536-dimensional outputs. Additionally, we appended the CLS token to obtain 257×1536 image-conditioning vectors. These vectors were then processed through a single fully connected layer, trained from scratch, to reduce them to 257×768 dimensions. We trained our model using AdamW [20] with a constant learning rate of 1e-5 and used horizontal flipping and rotation as augmentations. To calculate the CLIP score, we used ViT-B/32 [23]. Finally, the model is efficient at inference, taking ≈6 seconds for a single pass on an A100 (40GB) GPU with 40 DDIM steps.
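The conditioning path described here can be sketched as follows. The torch.hub entry point and feature-dictionary keys, the frozen backbone, and the module names are our assumptions about how such an encoder could be wired up.

```python
import torch
import torch.nn as nn

class ReferenceConditioner(nn.Module):
    """DINOv2 (ViT-g/14, 1536-dim) CLS + patch tokens, projected to 768 dims by a single
    trainable linear layer, giving the 257 x 768 cross-attention context for the U-Net."""
    def __init__(self):
        super().__init__()
        # Assumed public hub entry point for the ViT-g/14 DINOv2 backbone.
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14")
        self.backbone.eval()
        for p in self.backbone.parameters():   # frozen here; an assumption, not stated in the paper
            p.requires_grad_(False)
        self.proj = nn.Linear(1536, 768)       # trained from scratch

    @torch.no_grad()
    def encode(self, x_r):                     # x_r: (B, 3, 224, 224)
        feats = self.backbone.forward_features(x_r)
        cls = feats["x_norm_clstoken"].unsqueeze(1)    # (B, 1, 1536)
        patches = feats["x_norm_patchtokens"]          # (B, 256, 1536)
        return torch.cat([cls, patches], dim=1)        # (B, 257, 1536)

    def forward(self, x_r):
        return self.proj(self.encode(x_r))             # (B, 257, 768)
```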
5.3. The Effect Of Masking
Since our approach relies on collages, the mask serves as a strong prior for the DTC model. As illustrated in Fig. 12 , the use of masking enables users during inference to manipulate clothing styles. Consequently, users can guide the model to generate a t-shirt in a tucked-in style, or with sleeves rolled up, among other variations.
5.4. Iterative Inpainting
DTC enables a range of enjoyable applications. For instance, users can begin with an empty room and iteratively decorate it, as shown in Fig. 13. The same principle applies to clothing: users can generate multiple items of clothing in combination with one another to experiment with different outfit combinations, as shown in Fig. 14.
5.5. Visualization Of Hint Signal
As mentioned, in addition to direct stitching, we also utilized Canny edges and HED edges in our hint pathway, as demonstrated in Fig. 16. For Canny edges, we used Sobel filters on each color channel independently and then combined the results to obtain RGB edge information, which we believed could more effectively convey the details of e-commerce items.
5.6. More On Limitations
For certain items, such as shoes, the model frequently fails to generate satisfactory results. We argue that this issue stems from SAM's [15] inability to generate appropriate masks specifically for shoes or, more broadly, for items presented in pairs. SAM often masks only one shoe of a pair, leading the model to learn shortcut features from the unmasked shoe during training rather than acquiring useful, generalizable features. Moreover, as mentioned, since we use a latent diffusion model as our backbone, no matter how much extra information we guide it with, we are subject to the capacity of the VAE decoder, which often fails to generate very fine-grained concepts such as detailed engravings.
5.7. Comparison Against Other Methods.
Most state-of-the-art GAN-based methods are tailored for single-domain applications, such as virtual try-ons in controlled environments with sanitized backgrounds, and often necessitate additional inputs like pose or depth maps. Moreover, it is already established that diffusion-based approaches are superior to GANs in performance, possessing more comprehensive world models [41]. Consequently, diffusion-based models are more apt for the Vit-All use case.

Figure 11. We find the largest rectangular bounding box inside a fine-grained binary mask; the same coordinates are then used to stitch the reference image into an image of zeros to create the initial hint signal.

Figure 12. DTC allows users to manipulate different styles of the same clothing by adjusting the mask (shown in the row above each image). The first two columns display variations of the same t-shirt, showcasing it both tucked out and tucked in. The third and fourth columns illustrate the same shirt with normal sleeves and with sleeves rolled up.
The scope of our comparison models is intentionally limited. Personalization models such as ELITE [35], Custom Diffusion [16], DreamBooth [26], and Textual Inversion [6] lack inpainting capabilities, as they aim to directly generate entire views; DreamPaint [28] is the only exception with inpainting support. Among the models that facilitate inpainting, including PBE [40] and DreamPaint, we also attempted to employ DCCF [38]. However, its tendency to create copy-paste artifacts made it unsuitable for the Vit-All task, where semantically blending the item with its environment is as crucial as preserving its detailed features.
5.8. More Examples
We provide more qualitative examples to showcase DTC's capabilities; please see Figs. 17, 18, and 15 for more examples.
¹ The VITON-HD public dataset was modified to remove and crop all model faces from the images in the dataset.