Abstract
As the computing power of modern hardware increases rapidly, pre-trained deep learning models (e.g., BERT, GPT-3) learned on large-scale datasets have shown their effectiveness over conventional methods. This progress is mainly attributed to the representation ability of the transformer and its variant architectures. In this paper, we study low-level computer vision tasks (e.g., denoising, super-resolution and deraining) and develop a new pre-trained model, namely, the image processing transformer (IPT). To maximally excavate the capability of the transformer, we propose to utilize the well-known ImageNet benchmark to generate a large amount of corrupted image pairs. The IPT model is trained on these images with multiple heads and multiple tails. In addition, contrastive learning is introduced to adapt well to different image processing tasks. The pre-trained model can therefore be efficiently employed on a desired task after fine-tuning. With only one pre-trained model, IPT outperforms the current state-of-the-art methods on various low-level benchmarks. Code is available at https://github.com/huawei-noah/Pretrained-IPT and https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/cv/IPT
1. Introduction
Image processing is one component of the low-level part of a more global image analysis or computer vision system. Results from image processing can largely influence the subsequent high-level parts that perform recognition and understanding of the image data. Recently, deep learning has been widely applied to solve low-level vision tasks, such as image super-resolution, inpainting, deraining and colorization. As many image processing tasks are related, it is natural to expect that a model pre-trained on one dataset can be helpful for another. But few studies have generalized pre-training across image processing tasks. Pre-training has the potential to provide an attractive solution to image processing tasks by addressing the following two challenges. First, task-specific data can be limited. This problem is exacerbated in image processing tasks that involve paid-for data or data privacy, such as medical images [8] and satellite images [83]. Various inconsistent factors (e.g., camera parameters, illumination and weather) can further perturb the distribution of the captured data for training. Second, it is unknown which type of image processing job will be requested until the test image is presented. We therefore have to prepare a series of image processing modules at hand. They have distinct aims, but some underlying operations could be shared.
It is now common to have pre-training in natural language processing and computer vision [12]. For example, the backbones of object detection models [98, 97] are often pre-trained on ImageNet classification [18]. A number of well-trained networks can now be easily obtained from the Internet, including AlexNet [43], VGGNet [63] and ResNet [34]. The seminal transformer [70] has been widely used in many natural language processing (NLP) tasks, such as translation [73] and question answering [66]. The secret of its success is to pre-train transformer-based models on a large text corpus and fine-tune them on task-specific datasets. Variants of transformers, like BERT [19] and GPT-3 [5], further enriched the training data and improved the pre-training skills. There have been interesting attempts at extending the success of transformers to the computer vision field. For example, Wang et al. [71] and Fu et al. [25] applied self-attention based models to capture global information on images. Carion et al. [7] proposed DETR to use transformer architectures for end-to-end object detection. Most recently, Dosovitskiy et al. [22] introduced the Vision Transformer (ViT) to treat input images as 16×16 words and attained excellent results on image recognition.
The aforementioned pre-training schemes in computer vision and natural language mostly investigate a pretext classification task, but both the input and the output of an image processing task are images. A straightforward application of these existing pre-training strategies might therefore not be feasible. Further, how to effectively address different target image processing tasks in the pre-training stage remains a hard challenge. It is also instructive to note that the pre-training of image processing models enjoys the convenience of self-generating training instances from the original real images: the synthetically manipulated images are taken for training, while the original image itself is the ground truth to be reconstructed.
In this paper, we develop a pre-trained model for image processing using the transformer architecture, namely, the Image Processing Transformer (IPT). As the pre-trained model needs to be compatible with different image processing tasks, including super-resolution, denoising, and deraining, the entire network is composed of multiple pairs of heads and tails corresponding to different tasks and a single shared body. Since the potential of the transformer needs to be excavated using a large-scale dataset, we should prepare a great number of images with considerable diversity for training the IPT model. To this end, we select the ImageNet benchmark, which contains various high-resolution images from 1,000 categories. For each image in ImageNet, we generate multiple corrupted counterparts using several carefully designed operations to serve different tasks. For example, training samples for the super-resolution task are generated by downsampling original images. The entire dataset we used for training IPT contains over 10 million images.
Then, the transformer architecture is trained on the huge dataset as follows. The training images are input to the task-specific head, and the generated features are cropped into patches (i.e., "words") and subsequently flattened into sequences. The transformer body is employed to process the flattened features, in which position and task embeddings are utilized for the encoder and decoder, respectively. In addition, the tails are forced to predict the original images with different output sizes according to the specific task. Moreover, a contrastive loss on the relationship between patches of different inputs is introduced for adapting well to different image processing tasks. The proposed image processing transformer is learned in an end-to-end manner. Experimental results on several benchmarks show that, after fine-tuning, the pre-trained IPT model can surpass most existing methods on their own tasks by a significant margin.
2. Related Works
2.1. Image Processing
Image processing consists of the manipulation of images, including super-resolution, denoising, dehazing, deraining, deblurring, etc. A variety of deep-learning-based methods have been proposed to conduct one or several kinds of image processing tasks. For super-resolution, Dong et al. propose SRCNN [20, 21], which is considered a pioneering work introducing end-to-end models that reconstruct HR images from their LR counterparts. Kim et al. [41] further explore the capacity of deep neural networks with a deeper convolutional network. Ahn et al. [2] and Lim et al. [50] introduce residual blocks into the SR task. Zhang et al. [92] and Anwar and Barnes [3] utilize the power of attention to enhance performance on the SR task. Various excellent works have also been proposed for other tasks, such as denoising [68, 32, 37, 45, 24], dehazing [6, 46, 85, 80], deraining [36, 78, 62, 29, 74, 47], and deblurring [67, 53, 23, 10]. Different from the above methods, we exploit the capacity of both big models and a huge volume of data, and introduce a pre-trained model that handles several image processing tasks.
2.2. Transformer
Transformer [70] and its variants have proven successful as powerful unsupervised or self-supervised pre-training frameworks in various natural language processing tasks. For example, GPTs [59, 60, 5] are pre-trained in an autoregressive way by predicting the next word on huge text datasets. BERT [19] learns from data without explicit supervision and predicts masked words based on context. Colin et al. [61] propose a universal pre-training framework for several downstream tasks. Yinhan et al. [52] propose a robust variant of the original BERT.
Due to the success of transformer-based models in the NLP field, there have been many attempts to explore the benefits of transformers in computer vision. For example, [96] propose pre-training methods for transformer-based models on the image recognition task, and Jiang et al. [39] propose TransGAN to generate images using transformers. However, few related works focus on low-level vision tasks. In this paper, we explore a universal pre-training approach for image processing tasks.
3. Image Processing Transformer
To excavate the potential of the transformer on image processing tasks and achieve better results, we present the image processing transformer, which is pre-trained on a large-scale dataset.
3.1. IPT Architecture
The overall architecture of our IPT consists of four components: heads for extracting features from the input corrupted images (e.g., noisy images and low-resolution images), a transformer encoder and decoder for recovering the missing information in the input data, and tails for mapping the features into restored images. Here we briefly introduce our architecture; details can be found in the supplementary material.
Heads. To handle different image processing tasks, we use a multi-head architecture to deal with each task separately, where each head consists of three convolutional layers. Denoting the input image as $x \in \mathbb{R}^{3\times H\times W}$ (3 means R, G, and B), the head generates a feature map $f_H \in \mathbb{R}^{C\times H\times W}$ with $C$ channels and the same height and width (typically we use $C = 64$). The calculation can be formulated as $f_H = H^i(x)$, where $H^i$ ($i = \{1, \dots, N_t\}$) denotes the head for the $i$-th task and $N_t$ denotes the number of tasks.
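As a rough illustration of this multi-head stage, the sketch below (PyTorch) builds one small convolutional head per task and applies only the head selected by the task index. The three-layer structure follows the description above, but the kernel sizes and the use of ReLU are illustrative assumptions (the appendix describes the released head as one convolution followed by two ResBlocks).

```python
import torch
import torch.nn as nn

class MultiHead(nn.Module):
    """Minimal sketch of the multi-head stage: one small convolutional head
    per task, mapping a 3xHxW image to a CxHxW feature map f_H = H^i(x)."""
    def __init__(self, num_tasks: int, channels: int = 64):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(3, channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            )
            for _ in range(num_tasks)
        ])

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # Only the head belonging to the selected task is applied.
        return self.heads[task_id](x)

# Example: a 48x48 input patch produces a 64x48x48 feature map.
heads = MultiHead(num_tasks=6)
f_h = heads(torch.randn(1, 3, 48, 48), task_id=0)  # -> (1, 64, 48, 48)
```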
Transformer encoder. Before inputting features into the transformer body, we split the given features into patches, and each patch is regarded as a "word". Specifically, the features $f_H \in \mathbb{R}^{C\times H\times W}$ are reshaped into a sequence of patches $f_{p_i} \in \mathbb{R}^{P^2\times C}$, $i = \{1, \dots, N\}$, where $N = \frac{HW}{P^2}$ is the number of patches (i.e., the length of the sequence) and $P$ is the patch size. To maintain the position information of each patch, we add learnable position encodings $E_{p_i} \in \mathbb{R}^{P^2\times C}$ for each patch feature $f_{p_i}$ following [22, 7], and $E_{p_i} + f_{p_i}$ is directly input into the transformer encoder. The architecture of the encoder layer follows the original structure in [70], which has a multi-head self-attention module and a feed-forward network. The output of the encoder, $f_{E_i} \in \mathbb{R}^{P^2\times C}$ for each patch, has the same size as the input patch $f_{p_i}$. The calculation can be formulated as
$$
\begin{aligned}
y_0 &= [\,E_{p_1} + f_{p_1},\; E_{p_2} + f_{p_2},\; \dots,\; E_{p_N} + f_{p_N}\,], \\
q_i = k_i = v_i &= \mathrm{LN}(y_{i-1}), \\
y_i' &= \mathrm{MSA}(q_i, k_i, v_i) + y_{i-1}, \\
y_i &= \mathrm{FFN}(\mathrm{LN}(y_i')) + y_i', \qquad i = 1, \dots, l, \\
[\,f_{E_1}, f_{E_2}, \dots, f_{E_N}\,] &= y_l,
\end{aligned}
\tag{1}
$$
where l denotes the number of layers in the encoder, MSA denotes the multi-head self-attention module in the conventional transformer model [70] , LN denotes the layer normalization [4] and FFN denotes the feed forward network, which contains two fully connected layers.
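To make the patch-splitting and encoding step concrete, the following sketch reshapes a $C\times H\times W$ feature map into $N = HW/P^2$ tokens of dimension $P^2 C$, adds learnable position encodings, and runs them through a stack of standard encoder layers. PyTorch's built-in `nn.TransformerEncoderLayer` is used as a stand-in for the layer structure in [70]; the patch size $P=4$ is an assumption, and the 12-layer depth follows the appendix.

```python
import torch
import torch.nn as nn

P, C, H, W = 4, 64, 48, 48            # assumed patch size and feature shape
N, D = (H * W) // (P * P), P * P * C  # number of tokens and token dimension

def to_patches(f_h: torch.Tensor) -> torch.Tensor:
    """Reshape a (B, C, H, W) feature map into a (B, N, P^2*C) token sequence."""
    b = f_h.shape[0]
    patches = f_h.unfold(2, P, P).unfold(3, P, P)          # (B, C, H/P, W/P, P, P)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(b, N, D)

# Learnable position encodings E_p, one per patch position.
pos_embed = nn.Parameter(torch.zeros(1, N, D))

# Stand-in encoder: multi-head self-attention + feed-forward per layer, as in [70].
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, dim_feedforward=4 * D,
                               batch_first=True),
    num_layers=12,
)

f_h = torch.randn(2, C, H, W)
f_e = encoder(to_patches(f_h) + pos_embed)  # (2, N, D): same size as the input tokens
```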
Transformer decoder. The decoder follows the same architecture and takes the output of the encoder as input in the transformer body; each layer consists of two multi-head self-attention (MSA) modules and one feed-forward network (FFN). The difference from the original transformer here is that we utilize a task-specific embedding as an additional input to the decoder. These task-specific embeddings $E_t^i \in \mathbb{R}^{P^2\times C}$, $i = \{1, \dots, N_t\}$, are learned to decode features for different tasks. The calculation of the decoder can be formulated as:
$$
\begin{aligned}
z_0 &= [\,f_{E_1}, f_{E_2}, \dots, f_{E_N}\,], \\
q_i = k_i &= \mathrm{LN}(z_{i-1}) + E_t, \qquad v_i = \mathrm{LN}(z_{i-1}), \\
z_i' &= \mathrm{MSA}(q_i, k_i, v_i) + z_{i-1}, \\
q_i' &= \mathrm{LN}(z_i') + E_t, \qquad k_i' = v_i' = \mathrm{LN}(z_0), \\
z_i'' &= \mathrm{MSA}(q_i', k_i', v_i') + z_i', \\
z_i &= \mathrm{FFN}(\mathrm{LN}(z_i'')) + z_i'', \qquad i = 1, \dots, l, \\
[\,f_{D_1}, f_{D_2}, \dots, f_{D_N}\,] &= z_l,
\end{aligned}
\tag{2}
$$
where $f_{D_i} \in \mathbb{R}^{P^2\times C}$ denotes the outputs of the decoder. The $N$ decoded patch features of size $P^2 \times C$ are then reshaped into a feature map $f_D$ of size $C \times H \times W$.

Tails. The properties of the tails are the same as those of the heads: we use multiple tails to deal with different tasks. The calculation can be formulated as $f_T = T^i(f_D)$, where $T^i$ ($i = \{1, \dots, N_t\}$) denotes the tail for the $i$-th task and $N_t$ denotes the number of tasks. The output $f_T$ is the restored image of size $3 \times H' \times W'$, which is determined by the specific task. For example, $H' = 2H$, $W' = 2W$ for a 2× super-resolution task.
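The sketch below illustrates the decoding step: a learnable task embedding $E_t$ is added to the decoder queries, and the encoder tokens serve as memory. It uses PyTorch's built-in decoder layer as a structural stand-in and adds the task embedding only at the decoder input rather than inside every layer as in Eq. (2), so treat it as an approximation rather than the exact IPT body.

```python
import torch
import torch.nn as nn

N, D, num_tasks = 144, 1024, 6        # assumed token count, token dim, task count

# One learnable task embedding E_t per task (initialized to zero here).
task_embed = nn.Parameter(torch.zeros(num_tasks, N, D))

# Stand-in decoder: two attention blocks + feed-forward per layer, 12 layers.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=D, nhead=8, dim_feedforward=4 * D,
                               batch_first=True),
    num_layers=12,
)

def decode(f_e: torch.Tensor, task_id: int) -> torch.Tensor:
    """Decode encoder tokens f_E into f_D, conditioning on the task embedding."""
    queries = f_e + task_embed[task_id]        # task-specific embedding as extra input
    return decoder(tgt=queries, memory=f_e)    # (B, N, D)

f_e = torch.randn(2, N, D)
f_d = decode(f_e, task_id=1)
# f_D is reshaped back to (B, C, H, W) and fed to the task-specific tail,
# e.g. a PixelShuffle upsampler for super-resolution (see the appendix sketch).
```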
3.2. Pre-Training On ImageNet
Besides the architecture of the transformer itself, one of the key factors for successfully training an excellent transformer is the use of large-scale datasets. Compared with image classification, the amount of data available for image processing tasks is relatively small (e.g., only 2,000 images in the DIV2K dataset for the image super-resolution task). We therefore propose to utilize the well-known ImageNet as the baseline dataset for pre-training our IPT model, and then generate the entire dataset for several tasks (e.g., super-resolution and denoising) as follows.
The images in the ImageNet benchmark are of high diversity: the dataset contains over 1 million natural images from 1,000 different categories, with abundant texture and color information. We first remove the semantic labels and manually synthesize a variety of corrupted images from these unlabeled images with a variety of degradation models for different tasks. Note that synthesized datasets are commonly used in these image processing tasks, and we use the same degradation methods as suggested in [31, 1]. For example, super-resolution tasks often use bicubic degradation to generate low-resolution images, and denoising tasks add Gaussian noise with different noise levels to clean images to generate noisy images. These synthesized images can significantly improve the performance of the learned deep networks, including both CNN and transformer architectures, as will be shown in the experiment section. Basically, the corrupted images are synthesized as:
$$
I_{\text{corrupted}} = f(I_{\text{clean}}),
\tag{3}
$$
where $f$ denotes the degradation transformation, which depends on the specific task: for the super-resolution task, $f_{sr}$ is exactly the bicubic interpolation; for image denoising, $f_{noise}(I) = I + \eta$, where $\eta$ is additive Gaussian noise; for deraining, $f_{rain}(I) = I + r$, in which $r$ is a hand-crafted rain streak. The loss function for learning our IPT in a supervised fashion can be formulated as:
$$
\mathcal{L}_{\text{supervised}} = \sum_{i=1}^{N_t} L_1\!\left(\mathrm{IPT}\!\left(I^i_{\text{corrupted}}\right),\, I_{\text{clean}}\right),
\tag{4}
$$
where $L_1$ denotes the conventional L1 loss for reconstructing the desired images and $I^i_{\text{corrupted}}$ denotes the corrupted image for task $i$. Eq. 4 implies that the proposed framework is trained with multiple image processing tasks simultaneously. Specifically, for each batch, we randomly select one task from the $N_t$ supervised tasks for training, and each task is processed using the corresponding head, tail and task embedding. After pre-training, the IPT model captures the intrinsic features and transformations of a large variety of image processing tasks and can thus be further fine-tuned on the desired task using the newly provided dataset. During fine-tuning, the other heads and tails are dropped to save computation cost, and the parameters of the remaining head, tail and body are updated via back-propagation.
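As an illustration of Eq. (3), the sketch below synthesizes corrupted inputs from a clean image for three representative tasks. The bicubic downsampling and additive Gaussian noise follow the text; the rain-streak generator here is only a crude stand-in for the hand-crafted streaks actually used.

```python
import torch
import torch.nn.functional as F

def degrade(clean: torch.Tensor, task: str) -> torch.Tensor:
    """Minimal sketch of Eq. (3): synthesize a corrupted image from a clean one.
    The scale, noise level and rain-streak pattern are illustrative assumptions."""
    if task == "sr_x2":
        # f_sr: bicubic downsampling (the LR input for 2x super-resolution).
        return F.interpolate(clean, scale_factor=0.5, mode="bicubic",
                             align_corners=False)
    if task == "denoise_30":
        # f_noise(I) = I + eta, eta ~ N(0, sigma^2), sigma = 30 on a 0-255 scale.
        return clean + torch.randn_like(clean) * (30.0 / 255.0)
    if task == "derain":
        # f_rain(I) = I + r, with r a rain-streak layer; a sparse random pattern
        # stretched vertically stands in for the real hand-crafted generator.
        rain = (torch.rand_like(clean[:, :1]) > 0.99).float() * 0.8
        return clean + F.max_pool2d(rain, kernel_size=(9, 1), stride=1, padding=(4, 0))
    raise ValueError(f"unknown task: {task}")

clean = torch.rand(1, 3, 96, 96)        # image in [0, 1]
lr = degrade(clean, "sr_x2")            # (1, 3, 48, 48)
noisy = degrade(clean, "denoise_30")    # (1, 3, 96, 96)
```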
However, due to the variety of degradation models, we cannot synthesize images for all image processing tasks.
For example, there is a wide range of possible noise levels in practice. Therefore, the generalization ability of the resulting IPT should be further enhanced. Similar to pre-training natural language processing models, the relationship between patches of images is also informative: a patch in the image scenario can be considered as a word in natural language processing. For example, patches cropped from the same feature map are more likely to appear together and should be embedded into similar positions. Therefore, we introduce contrastive learning [13, 33] for learning universal features so that the pre-trained IPT model can be utilized for unseen tasks. In practice, denote the patch features generated by the IPT decoder for a given input $x^j$ as $f^j_{D_i} \in \mathbb{R}^{P^2\times C}$, $i = \{1, \dots, N\}$, where $x^j$ is selected from a batch of training images $X = \{x^1, x^2, \dots, x^B\}$. We aim to minimize the distance between patch features from the same image while maximizing the distance between patches from different images. The loss function for contrastive learning is formulated as:
$$
l\!\left(f^j_{D_{i_1}}, f^j_{D_{i_2}}\right) = -\log \frac{\exp\!\big(d(f^j_{D_{i_1}}, f^j_{D_{i_2}})\big)}{\sum_{k=1}^{B} \mathbb{1}_{k \neq j}\, \exp\!\big(d(f^j_{D_{i_1}}, f^k_{D_{i_2}})\big)},
$$
$$
\mathcal{L}_{\text{contrastive}} = \frac{1}{BN^2} \sum_{i_1=1}^{N} \sum_{i_2=1}^{N} \sum_{j=1}^{B} l\!\left(f^j_{D_{i_1}}, f^j_{D_{i_2}}\right),
\tag{5}
$$
where $d(a, b) = \frac{a^{\top} b}{\|a\|\,\|b\|}$
denotes the cosine similarity. Moreover, to make full use of both supervised and self-supervised information, we reformulate the loss function as:
$$
\mathcal{L}_{IPT} = \lambda \cdot \mathcal{L}_{\text{contrastive}} + \mathcal{L}_{\text{supervised}},
\tag{6}
$$
where we combine the λ-balanced contrastive loss with the supervised loss as the final objective function of IPT. Thus, the proposed transformer network trained using Eq. 6 can be effectively exploited on various existing image processing tasks.
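A direct, unvectorized transcription of Eqs. (5) and (6) could look as follows; the feature shapes and the λ value are assumptions, and a practical implementation would vectorize the double loop over patch indices.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f_d: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (5) for decoder features f_d of shape (B, N, D): patch
    features from the same image are pulled together, patches from different
    images are pushed apart, with cosine similarity as d(a, b)."""
    B, N, _ = f_d.shape
    f = F.normalize(f_d, dim=-1)                  # so that a.b is cosine similarity
    mask = torch.eye(B, dtype=torch.bool)
    total = 0.0
    for i1 in range(N):
        for i2 in range(N):
            # sim[j, k] = d(f^j_{D_i1}, f^k_{D_i2}) for all image pairs (j, k).
            sim = f[:, i1] @ f[:, i2].t()         # (B, B)
            pos = sim.diag()                      # same image: k == j
            neg = sim.masked_fill(mask, float("-inf"))
            denom = torch.logsumexp(neg, dim=1)   # sum over k != j
            total = total + (denom - pos).sum()   # -log(exp(pos) / sum_k exp(neg))
    return total / (B * N * N)

def ipt_loss(pred, target, f_d, lam=0.1):
    """Eq. (6): lambda-balanced contrastive loss plus the supervised L1 loss."""
    return lam * contrastive_loss(f_d) + F.l1_loss(pred, target)
```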
4. Experiments
In this section, we evaluate the performance of the proposed IPT on various image processing tasks including super-resolution and image denoising. We show that the pre-trained IPT model can achieve state-of-the-art performance on these tasks. Moreover, extensive experiments for ablation study show that the transformer-based models perform better than convolutional neural networks when using the large-scale dataset for solving the image processing problem.
Datasets. To obtain better pre-trained results for the IPT model, we use the well-known ImageNet dataset, which consists of over 1M color images of high diversity. The training images are cropped into 48 × 48 patches with 3 channels for training, i.e., there are over 10M patches for training the IPT model. We then generate the corrupted images with 6 types of degradation: bicubic downsampling at 2×, 3× and 4× scales, Gaussian noise with levels 30 and 50, and added rain streaks, respectively. For the rain-streak generation, we follow the method described in [79]. During the test, we crop the images in the test set into 48 × 48 patches with a 10-pixel overlap. Note that the same testing strategy is also adopted for CNN-based models for a fair comparison, and the resulting PSNR values of the CNN models are the same as those of their baselines.
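A minimal sketch of the test-time tiling described above (48 × 48 patches with a 10-pixel overlap) is given below; border handling and the stitching of overlapping outputs back into a full image are omitted.

```python
import torch

def crop_with_overlap(img: torch.Tensor, patch: int = 48, overlap: int = 10):
    """Slide a patch x patch window over a (C, H, W) image with the given
    overlap (stride = patch - overlap) and return the stack of crops."""
    stride = patch - overlap
    tiles = img.unfold(1, patch, stride).unfold(2, patch, stride)  # (C, nH, nW, p, p)
    nh, nw = tiles.shape[1], tiles.shape[2]
    return tiles.permute(1, 2, 0, 3, 4).reshape(nh * nw, img.shape[0], patch, patch)

patches = crop_with_overlap(torch.rand(3, 481, 321))   # e.g. a BSD-sized test image
print(patches.shape)                                    # (num_patches, 3, 48, 48)
```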
Training & Fine-tuning. We use 32 NVIDIA Tesla V100 cards to train our IPT model using the conventional Adam optimizer with β1 = 0.9, β2 = 0.999 for 300 epochs on the modified ImageNet dataset. The initial learning rate is set to 5e-5 and decayed to 2e-5 at epoch 200, with a batch size of 256. Since the training set consists of different tasks, we cannot put all of them into a single batch due to the expensive memory cost. Therefore, we stack a batch of images from a randomly selected task in each iteration. After pre-training on the entire synthesized dataset, we fine-tune the IPT model on the desired task (e.g., ×3 single image super-resolution) for 30 epochs with a learning rate of 2e-5. Note that SRCNN [20] also found that training on ImageNet can improve the performance of the super-resolution task, while we propose a model fitting general low-level vision tasks.
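The following toy loop sketches the reported optimization setup and the one-task-per-batch sampling strategy; the model, batch contents and iteration count are placeholders, not the actual training pipeline.

```python
import random
import torch

# Reported settings: Adam(beta1=0.9, beta2=0.999), lr 5e-5 decayed to 2e-5
# after epoch 200 of 300, batch size 256. The convolution below merely stands
# in for the full IPT model, and the random tensors stand in for a data loader.
model = torch.nn.Conv2d(3, 3, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[200],
                                                 gamma=2e-5 / 5e-5)

tasks = ["sr_x2", "sr_x3", "sr_x4", "denoise_30", "denoise_50", "derain"]

for epoch in range(300):
    for _ in range(10):                          # iterations per epoch (toy number)
        task = random.choice(tasks)              # one randomly selected task per batch
        corrupted = torch.rand(4, 3, 48, 48)     # placeholder batch for that task
        clean = torch.rand(4, 3, 48, 48)
        loss = torch.nn.functional.l1_loss(model(corrupted), clean)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```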
4.1. Super-Resolution
Table 1. Quantitative super-resolution results (PSNR in dB) on benchmark datasets.

Method | Scale | Set5 | Set14 | B100 | Urban100
---|---|---|---|---|---
VDSR [41] | ×2 | 37.53 | 33.05 | 31.90 | 30.77
EDSR [51] | ×2 | 38.11 | 33.92 | 32.32 | 32.93
RCAN [92] | ×2 | 38.27 | 34.12 | 32.41 | 33.34
RDN [94] | ×2 | 38.24 | 34.01 | 32.34 | 32.89
OISR-RK3 [35] | ×2 | 38.21 | 33.94 | 32.36 | 33.03
RNAN [93] | ×2 | 38.17 | 33.87 | 32.32 | 32.73
SAN [17] | ×2 | 38.31 | 34.07 | 32.42 | 33.10
HAN [55] | ×2 | 38.27 | 34.16 | 32.41 | 33.35
IGNN [99] | ×2 | 38.24 | 34.07 | 32.41 | 33.23
IPT (ours) | ×2 | 38.37 | 34.43 | 32.48 | 33.76
VDSR [41] | ×3 | 33.67 | 29.78 | 28.83 | 27.14
EDSR [51] | ×3 | 34.65 | 30.52 | 29.25 | 28.80
RCAN [92] | ×3 | 34.74 | 30.65 | 29.32 | 29.09
RDN [94] | ×3 | 34.71 | 30.57 | 29.26 | 28.80
OISR-RK3 [35] | ×3 | 34.72 | 30.57 | 29.29 | 28.95
RNAN [93] | ×3 | 34.66 | 30.52 | 29.26 | 28.75
SAN [17] | ×3 | 34.75 | 30.59 | 29.33 | 28.93
HAN [55] | ×3 | 34.75 | 30.67 | 29.32 | 29.10
IGNN [99] | ×3 | 34.72 | 30.66 | 29.31 | 29.03
IPT (ours) | ×3 | 34.81 | 30.85 | 29.38 | 29.49
VDSR [41] | ×4 | 31.35 | 28.02 | 27.29 | 25.18
EDSR [51] | ×4 | 32.46 | 28.80 | 27.71 | 26.64
RCAN [92] | ×4 | 32.63 | 28.87 | 27.77 | 26.82
SAN [17] | ×4 | 32.64 | 28.92 | 27.78 | 26.79
RDN [94] | ×4 | 32.47 | 28.81 | 27.72 | 26.61
OISR-RK3 [35] | ×4 | 32.53 | 28.86 | 27.75 | 26.79
RNAN [93] | ×4 | 32.49 | 28.83 | 27.72 | 26.61
HAN [55] | ×4 | 32.64 | 28.90 | 27.80 | 26.85
IGNN [99] | ×4 | 32.57 | 28.85 | 27.77 | 26.84
IPT (ours) | ×4 | 32.64 | 29.01 | 27.82 | 27.26
We compare our model with several state-of-the-art CNN-based SR methods. As shown in Table 1, our pre-trained IPT outperforms all the other methods and achieves the best performance at the ×2, ×3 and ×4 scales on all datasets. It is worth highlighting that our model achieves 33.76dB PSNR on the ×2 scale Urban100 dataset, surpassing the other methods by more than ∼0.4dB, while previous SOTA methods could only achieve a <0.2dB improvement over one another, which indicates the superiority of the proposed model enabled by large-scale pre-training.
We further present visualization results of our model at the 4× scale on the Urban100 dataset. As shown in Figure 3, it is difficult to recover the original high-resolution images since a lot of information is lost due to the high scaling factor. Previous methods generate blurry images, while the super-resolution images produced by our model recover the details well from the low-resolution inputs.
4.2. Denoising
Since our pre-trained model can be well adapted to many tasks, we then evaluate its performance on the image denoising task. The training and testing data are generated by adding Gaussian noise with σ = 30, 50 to the clean images.
Table 2. Color image denoising results (PSNR) on BSD68 and Urban100 with noise levels 30 and 50.

Method | BSD68 (σ=30) | BSD68 (σ=50) | Urban100 (σ=30) | Urban100 (σ=50)
---|---|---|---|---
CBM3D [16] | 29.73 | 27.38 | 30.36 | 27.94
TNRD [14] | 27.64 | 25.96 | 27.40 | 25.52
DnCNN [87] | 30.40 | 28.01 | 30.28 | 28.16
MemNet [65] | 28.39 | 26.33 | 28.93 | 26.53
IRCNN [88] | 30.22 | 27.86 | 30.28 | 27.69
FFDNet [89] | 30.31 | 27.96 | 30.53 | 28.05
SADNet [9] | 30.64 | 28.32 | N/A | N/A
RDN [95] | 30.67 | 28.31 | 31.69 | 29.29
IPT (ours) | 30.75 | 28.39 | 32.00 | 29.71
To verify the effectiveness of the proposed method, we compare our results with various state-of-the-art models. Table 2 reports the color image denoising results on the BSD68 and Urban100 datasets. Our IPT achieves the best results among all denoising methods at the different Gaussian noise levels. Moreover, we find that our model improves the state-of-the-art performance by ∼0.3dB on the Urban100 dataset, which demonstrates the effectiveness of pre-training and the superiority of our transformer-based model. Figure 4 shows the visualization of the resulting images (ground truth, noisy input with σ=50, and outputs of CBM3D [16], TNRD [14], RDN [94], DnCNN [87], MemNet [65], IRCNN [88], FFDNet [89] and IPT). As shown in the figure, the noisy images are hard to recognize and it is difficult to recover the clean images; existing methods consequently fail to reconstruct enough details and generate abnormal pixels. In contrast, our pre-trained model recovers several details in the hair of the cat well, and its visual quality clearly beats all previous models.
4.3. Deraining
Table 3. Deraining results (PSNR/SSIM) on the Rain100L dataset.

Method | PSNR | SSIM
---|---|---
Input | 26.90 | 0.8384
DSC [28] | 27.34 | 0.8494
GMM [49] | 29.05 | 0.8717
JCAS [31] | 28.54 | 0.8524
Clear [27] | 30.24 | 0.9344
DDN [28] | 32.38 | 0.9258
RESCAN [48] | 38.52 | 0.9812
PReNet [62] | 37.45 | 0.9790
JORDER-E [79] | 38.59 | 0.9834
SPANet [74] | 35.33 | 0.9694
SSIR [76] | 32.37 | 0.9258
RCDNet [72] | 40.00 | 0.9860
IPT (ours) | 41.62 | 0.9880
For the image deraining task, we evaluate our model on the synthesized Rain100L dataset [79], which consists of 100 rainy images. Quantitative results are given in Table 3. Compared with the state-of-the-art methods, we achieve the best performance (41.62dB) with a 1.62dB improvement. Figure 5 shows the visualization results. Previous methods fail to reconstruct the original clean images since they lack image priors. In contrast, our IPT model produces images that closely match the ground truth and surpasses all previous algorithms in visual quality. This result substantiates the generality of the proposed model.
4.4. Generalization Ability
Although we can generate various corrupted images, natural images are of high complexity and we cannot synthesize all possible images for pre-training the transformer model. However, a good pre-trained model should have the capacity to adapt well to other tasks, as in the field of NLP. To this end, we conduct several experiments to verify the generalization ability of our model. In practice, we test corrupted images that were not included in our synthesized ImageNet dataset, i.e., image denoising with noise levels 10 and 70, respectively. We use the heads and tails for the image denoising tasks from the pre-trained model.

Table 4. Generalization to unseen noise levels: denoising results (PSNR) on BSD68 and Urban100 with noise levels 10 and 70.

Method | BSD68 (σ=10) | BSD68 (σ=70) | Urban100 (σ=10) | Urban100 (σ=70)
---|---|---|---|---
CBM3D [16] | 35.91 | 26.00 | 36.00 | 26.31
TNRD [14] | 33.36 | 23.83 | 33.60 | 22.63
DnCNN [87] | 36.31 | 26.56 | 36.21 | 26.17
MemNet [65] | N/A | 25.08 | N/A | 24.96
IRCNN [88] | 36.06 | N/A | 35.81 | N/A
FFDNet [89] | 36.14 | 26.53 | 35.77 | 26.39
RDN [95] | 36.47 | 26.85 | 36.69 | 27.63
IPT (ours) | 36.53 | 26.92 | 36.99 | 27.90
The detailed results are shown in Table 4, where we compare the performance of the pre-trained IPT model with state-of-the-art methods for image denoising. Obviously, the IPT model outperforms the other conventional methods, which demonstrates that the pre-trained model can capture more useful information and features from the large-scale dataset.

Figure 6. The performance of CNN and IPT models using different percentages of data.
4.5. Ablation Study
Table 5. Impact of λ for contrastive learning.

λ | 0 | 0.05 | 0.1 | 0.2 | 0.5
---|---|---|---|---|---
PSNR | 38.27 | 38.32 | 38.37 | 38.33 | 38.26
Impact of data percentage. To evaluate the effectiveness of the transformer architecture, we conduct experiments to analyse the improvement brought by pre-training for CNN-based and transformer-based models. We use 20%, 40%, 60%, 80% and 100% of the synthesized ImageNet dataset to analyse the impact of the amount of training data on the resulting performance. Figure 6 shows the results of the different pre-trained models. When the models are not pre-trained or pre-trained with a small amount (< 60%) of the entire dataset, the CNN models achieve better performance. In contrast, when using large-scale data, the transformer-based model overwhelms the CNN models, which demonstrates the effectiveness of our IPT model for pre-training.

Impact of contrastive learning. As discussed above, to improve the representation ability of our pre-trained model, we embed the contrastive learning loss (Eq. 6) into the training procedure. We then evaluate its effectiveness on the ×2 scale super-resolution task using the Set5 dataset. Table 5 shows the impact of the hyper-parameter λ for balancing the two terms in Eq. 6. When λ=0, the IPT model is trained using only the supervised learning approach, and the resulting PSNR value is 38.27dB. When employing the contrastive loss for self-supervised learning, the model achieves a 38.37dB PSNR value (λ = 0.1), which is about 0.1dB higher than that of the model trained with λ = 0. These results further demonstrate the effectiveness of contrastive learning for learning a better pre-trained IPT model.
5. Conclusions And Discussions
This paper addresses image processing problems using a pre-trained transformer model (IPT). The IPT model is designed with multiple heads, multiple tails and a shared transformer body for serving different image processing tasks such as image super-resolution and denoising. To maximally excavate the performance of the transformer architecture on various tasks, we explore a synthesized ImageNet dataset, wherein each original image is degraded to a series of counterparts as paired training data. The IPT model is then trained using supervised and self-supervised approaches, which shows a strong ability to capture intrinsic features for low-level image processing. Experimental results demonstrate that our IPT can outperform the state-of-the-art methods using only one pre-trained model after quick fine-tuning. In future work, we will extend our IPT model to more tasks such as inpainting, dehazing, etc.
A. Results On Deblurring

Table 6. Deblurring results (PSNR) on the GoPro dataset, where + denotes the self-ensemble technique.

Method | PSNR
---|---
MSCNN [54] | 30.40
SRN [67] | 30.25
DSD [30] | 30.96
DeblurGANv2 [44] | 29.55
DMPHN [84] | 31.36
LEBMD [40] | 31.79
EDSD [81] | 29.81
DBGAN [86] | 31.10
MTRNN [57] | 31.13
RADN [58] | 31.85
SAPHN [64] | 32.02
BANET [69] | 32.44
MB2D [56] | 32.16
IPT (ours) | 32.58
IPT+ (ours) | 32.91
We further evaluate the performance of our model on the image deblurring task. We use the GoPro dataset [54] to fine-tune and test our model. We modify the patch size to 256, the patch dimension to 8 and the number of features to 9 to achieve a larger receptive field. Table 6 reports the deblurring results, where + denotes applying the self-ensemble technique. As a result, our IPT achieves the best results among all deblurring methods. Figure 8 shows the visualization of the resulting images. As shown in the figure, our pre-trained model achieves the best visual quality among all the previous models.
B. Architecture Of IPT
In the main paper, we propose the image processing transformer (IPT). Here we show the detailed architecture of IPT, which consists of heads, body and tails. Each head has one convolutional layer (with 3 × 3 kernel size, 3 input channels and 64 output channels) and two ResBlocks. Each ResBlock consists of two convolutional layers (with 5 × 5 kernel size, 64 input channels and 64 output channels) wrapped by a single shortcut. The body has 12 encoder layers and 12 decoder layers. The tail for denoising or deraining is a convolutional layer with 3 × 3 kernel size, 64 input channels and 3 output channels. For super-resolution, the tail consists of one pixel-shuffle layer with upsampling scale 2 or 3 for ×2 and ×3 SR, and two pixel-shuffle layers with upsampling scale 2 for ×4 SR.
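A sketch of these head and tail modules is given below; the placement of the final RGB convolution in the ×4 tail is an assumption, and the transformer body between head and tail is omitted for brevity.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 5x5 convolutions wrapped by a single shortcut, as described above."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 5, padding=2),
        )
    def forward(self, x):
        return x + self.body(x)

# Head: one 3x3 convolution (3 -> 64) followed by two ResBlocks.
head = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), ResBlock(64), ResBlock(64))

# Tail for x4 SR: two PixelShuffle stages of scale 2, each preceded by a
# convolution expanding channels by 4, then a 3x3 convolution back to RGB.
tail_x4 = nn.Sequential(
    nn.Conv2d(64, 64 * 4, 3, padding=1), nn.PixelShuffle(2),
    nn.Conv2d(64, 64 * 4, 3, padding=1), nn.PixelShuffle(2),
    nn.Conv2d(64, 3, 3, padding=1),
)

x = torch.rand(1, 3, 48, 48)
# The transformer body is skipped here purely to check the shapes.
print(tail_x4(head(x)).shape)   # torch.Size([1, 3, 192, 192])
```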
The whole IPT has 114M parameters and 33G FLOPs, i.e., more parameters but fewer FLOPs compared with traditional CNN models (e.g., EDSR has 43M parameters and 99G FLOPs).
C. Impact Of Multi-Task Training
We train IPT in a multi-task manner and then fine-tune it on 6 different tasks, including ×2, ×3, ×4 super-resolution, denoising with noise levels 30 and 50, and deraining. We find that this training strategy does not harm the performance on these tasks, which have been pre-trained on the large-scale dataset (ImageNet). In other words, the performance of multi-task training and single-task training remains almost the same. However, when transferring to other tasks (e.g., Section 4.4 in the main paper), the pre-trained model using multi-task training is better than that of single-task training by about 0.3dB, which suggests that multi-task training learns a universal representation of image processing tasks.
D. Visualization Of Embeddings
We visualize the learned embeddings of IPT. Figure 7 visualizes the cosine similarity of the position embeddings. We find that patches in similar columns or rows have similar embeddings, which indicates that they learn useful information for discovering their positions during image processing. We also test using fixed embeddings or no embeddings at all, whose performance is lower than that of using learnable position embeddings (by 0.2dB to 0.3dB for different tasks).
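A simple way to reproduce this kind of visualization, assuming access to the learned position-encoding table, is sketched below; the tensor here is a random placeholder for the trained weights.

```python
import torch
import torch.nn.functional as F

# Assume `pos_embed` is the learned position-encoding table of shape (N, D),
# one row per patch (e.g. N = 144 tokens for a 48x48 feature map with P = 4).
pos_embed = torch.randn(144, 1024)           # placeholder for the trained weights

# Cosine similarity between every pair of position embeddings; rows and columns
# belonging to patches in the same image row or column should look alike.
e = F.normalize(pos_embed, dim=-1)
similarity = e @ e.t()                        # (N, N), values in [-1, 1]

# One way to inspect it: take one query patch and reshape its similarity to all
# other positions back onto the 12x12 patch grid to plot it as a heat map.
query = 0
heatmap = similarity[query].reshape(12, 12)
```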
Moreover, we visualize the task embeddings in Figure 9. We find that for the ×2 super-resolution task, the similarity between the embedding at each position and its neighbours is higher than for ×3 super-resolution, while that of ×4 super-resolution is the smallest. This indicates that each patch in ×2 super-resolution can focus on other patches at a farther distance than in ×3 and ×4, since the downsampling scale is smaller and the relationship between different patches is closer. The similarity of the task embedding for deraining in Figure 9 (d) shows that the patches pay more attention to the vertical direction than to the horizontal direction, which is reasonable as rain falls vertically. The similarity of the task embeddings for denoising resembles Gaussian noise, and Figure 9 (f) with the higher noise level (50) shows higher similarity between neighbours than Figure 9 (e) with noise level 30. These visualization results suggest that our task embeddings can indeed learn useful information for different tasks. We also test not using task embeddings, which results in a significant accuracy drop (0.1dB to 0.5dB for different tasks).