VISUAL ATTENTION NETWORK FOR LOW DOSE CT

Wenchao Du1,2, Hu Chen1,2,*, Peixi Liao3, Hongyu Yang1,2, Ge Wang4, Fellow, IEEE, Yi Zhang1,*, Senior Member, IEEE

1 College of Computer Science, Sichuan University, China
2 National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, China
3 Department of Scientific Research and Education, The Sixth People's Hospital of Chengdu, China
4 Department of Biomedical Engineering, Rensselaer Polytechnic Institute, USA

ABSTRACT

Noise and artifacts are intrinsic to low-dose CT (LDCT) data acquisition and significantly degrade imaging performance. Perfect noise removal and image restoration are intractable in the context of LDCT because of statistical and technical uncertainties. In this paper, we apply the generative adversarial network (GAN) framework with a visual attention mechanism to address this problem in a data-driven/machine learning fashion. Our main idea is to inject visual attention knowledge into the learning process of the GAN to provide a powerful prior on the noise distribution. In this way, both the generator and discriminator networks are empowered with visual attention information, so that they not only pay special attention to noisy regions and surrounding structures but also explicitly assess the local consistency of the recovered regions. Our experiments qualitatively and quantitatively demonstrate the effectiveness of the proposed method on clinical CT images.

Index Terms — Low-dose CT (LDCT), visual attention, generative adversarial network

1. INTRODUCTION

Improving the image quality of low-dose CT (LDCT) scans has recently become a hot topic, and a large number of papers have been published on it. Early methods were based on filtering in the sinogram domain, where the noise properties are statistically known. However, any structural distortion in the sinogram domain may lead to disturbing artifacts and resolution loss in the image domain. On the other hand, iterative reconstruction methods can mitigate this problem to a certain degree by optimizing an objective function that contains handcrafted prior terms, such as a roughness penalty or the nuclear norm. However, the extensive computational cost and the difficulty of designing proper regularization terms and relaxation coefficients present challenges for practical use. As a competitive alternative, post-processing methods [1-4] do not need access to the raw data and are more convenient to deploy in current CT systems.

Recently, deep learning has been recognized as a promising post-processing strategy for LDCT image denoising/restoration. Generally, deep-learning-based methods attempt to learn a nonlinear mapping from an LDCT image to an improved counterpart by minimizing the mean squared error (MSE) loss function, which, however, tends to over-smooth structural details [3,4]. In this paper, inspired by human visual perception, we propose to incorporate a visual attention network into the GAN framework (VAOGAN) to remove image noise while preserving structural details.

The rest of the paper is organized as follows. Section 2 introduces the proposed method. Section 3 presents experimental results. Finally, the conclusion is drawn in Section 4.

2. METHOD

2.1. CT Image Restoration Formulation

An LDCT image can be modeled as a combination of a normal-dose CT (NDCT) image and noise:

L = H + n,    (1)

where L ∈ R^(N×N) denotes an LDCT image and H ∈ R^(N×N) is the corresponding NDCT image. n represents the noise, which mainly comes from Poisson data statistics and detector electronic fluctuations. To characterize the impact of noise on different regions of an image, Eq. (1) can be refined as

L = (1 − M) ⊙ H + n,    (2)

where M is a binary 2D mask: M(x) = 1 means that pixel x is contaminated by noise, and otherwise it is noiseless; the operator ⊙ denotes element-wise multiplication. Our goal then becomes recovering H from a given L. To achieve this goal, we need to estimate the binary mask M. Note that during the inference stage we do not have a reference H to estimate M. Hence, the mean value of H − L over the training data is thresholded to generate a proper M.

2.2. Visual Attention Oriented GAN (VAOGAN)
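To make this mask-generation step concrete, here is a minimal PyTorch sketch that thresholds the mean absolute NDCT-LDCT residual over a batch of paired training patches; the use of the absolute residual, the intensity scaling, and the threshold value tau are assumptions not specified in the paper.

```python
import torch

def estimate_noise_mask(ndct: torch.Tensor, ldct: torch.Tensor, tau: float = 0.01) -> torch.Tensor:
    """Estimate the binary mask M of Eq. (2) from paired training patches.

    ndct, ldct: tensors of shape (B, 1, H, W); tau is a hypothetical threshold.
    """
    # Mean absolute residual between normal-dose and low-dose images over the
    # batch, so that M reflects the typical noise pattern in the training data.
    residual = (ndct - ldct).abs().mean(dim=0, keepdim=True)  # shape (1, 1, H, W)
    # Pixels whose mean residual exceeds the threshold are marked as noisy.
    return (residual > tau).float()
```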

Fig. 1. Architecture of the proposed visual attention network.

The proposed network is based on GAN [5], which mainly consists of two parts: the generator network (GN) and the discriminator network (DN). As illustrated in Fig. 1, given an LDCT image, GN attempts to predict an image as close as possible to the corresponding NDCT image. In competition, DN evaluates whether the image generated by GN is counterfeit. The loss function of GAN is defined as:

min_G max_D  E_{H~P_NDCT}[log(D(H))] + E_{L~P_LDCT}[log(1 − D(G(L)))],    (3)

where G stands for GN, D denotes DN, and L and H are LDCT and NDCT samples, respectively.

2.2.1. Generative Network with Visual Attention

GN seeks a function G that maps L to H:

G: L → H.    (4)

However, GN is a weakly supervised generative model guided only by DN, so it is difficult to learn G very accurately. To address this problem, inspired by the idea behind the visual attention network [6,7], we introduce visual attention information to assist GN. Eq. (4) is then rewritten as

G: (L ⊕ m) → H,    (5)

where m is an attention feature map corresponding to M in Eq. (2), and the operator ⊕ denotes channel-wise concatenation. As illustrated in Fig. 1, the proposed GN consists of two components: an attentive block and a generator. The attentive block aims to locate the contaminated pixels and extract the surrounding structures in the LDCT image. The output of the attentive block serves as prior knowledge to guide GN toward noise suppression and detail preservation. Meanwhile, this attention map also helps DN focus on noisy regions. As shown in the left part of Fig. 1, the attentive block contains six convolutional layers and one LSTM unit [8], a special form of RNN that avoids the long-term dependency problem. Unlike the binary mask M, the generated attention map is a matrix with continuous values ranging from 0 to 1: the greater the value, the noisier the corresponding region.
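Below is a minimal PyTorch sketch of one plausible realization of this attentive recurrent block: six convolutional layers feed a convolutional LSTM cell, and a final sigmoid-activated convolution produces the attention map. It uses the T = 4 recurrent steps and the 0.5 initialization reported in Section 3.2; the channel width, kernel sizes, and the exact ConvLSTM gate layout are assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell in the spirit of [8]."""
    def __init__(self, channels: int):
        super().__init__()
        # A single convolution produces the input, forget, output and cell gates.
        self.gates = nn.Conv2d(2 * channels, 4 * channels, kernel_size=3, padding=1)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)


class AttentiveBlock(nn.Module):
    """Recurrent attentive block: six conv layers + ConvLSTM -> attention map in [0, 1]."""
    def __init__(self, channels: int = 32, steps: int = 4):
        super().__init__()
        self.channels, self.steps = channels, steps
        # Six convolutional layers; the input is the LDCT image concatenated
        # with the current attention map (1 + 1 channels).
        layers = [nn.Conv2d(2, channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(5):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        self.features = nn.Sequential(*layers)
        self.lstm = ConvLSTMCell(channels)
        self.to_map = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, ldct):
        b, _, h, w = ldct.shape
        # The attention map is initialized to 0.5 everywhere (cf. Section 3.2).
        attention = torch.full((b, 1, h, w), 0.5, device=ldct.device)
        state = (torch.zeros(b, self.channels, h, w, device=ldct.device),
                 torch.zeros(b, self.channels, h, w, device=ldct.device))
        maps = []
        for _ in range(self.steps):
            # Each time step sees the LDCT input together with the current map.
            x = self.features(torch.cat([ldct, attention], dim=1))
            x, state = self.lstm(x, state)
            attention = torch.sigmoid(self.to_map(x))
            maps.append(attention)
        return maps  # m_1, ..., m_T
```

Feeding the final map m_T, concatenated with L, into a U-Net-style generator then realizes the mapping of Eq. (5).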

At each time step t, we concatenate the input with the generated attention map and feed them into the attentive block at the next step. Since the binary mask M has already been acquired, it serves as a prior to supervise the generation of the attention feature map m. Therefore, a mean squared error (MSE) loss is defined between m and M at each time step t, and T steps are combined to form the complete attention loss:

L_A({m}, M) = Σ_{t=1}^{T} θ^(T−t) L_MSE(m_t, M),    (7)

where m_t is the attention map produced by the attentive block at time step t. Considering computational efficiency, we set T to 4 and the weight θ to 0.8.

The U-Net architecture has proven effective for image denoising, deblurring and super-resolution [9]. Thus, we use a similar architecture as the backbone of GN. A hybrid loss function is proposed for GN as follows:

L_G = L_GAN(G(L)) + L_A({m}, M) + L_M({S}, {H}) + L_P(G(L), H),    (8)

where L_GAN(G(L)) = log(1 − D(G(L))), and L_M is the multi-scale loss used to capture structural and contextual information at different scales, defined as

L_M({S}, {H}) = Σ_{i=1}^{M} λ_i L_MSE(S_i, H_i),    (9)

where S_i denotes the ith output extracted from the corresponding deconvolutional layer, H_i is the ground truth at the same scale as S_i, and λ_i is the weight for the ith scale, which increases with the scale. Specifically, the outputs of the 1st, 3rd and 5th layers are extracted, and the corresponding weights are set to 0.6, 0.8 and 1.0 in this study. L_P is the perceptual loss, implemented with a pre-trained VGG model and commonly used to measure similarity in feature space:

L_P(G(L), H) = L_MSE(V(G(L)), V(H)),    (10)

where V is the pre-trained VGG model.
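A minimal PyTorch sketch of how Eqs. (7)-(10) could be combined into the generator objective of Eq. (8) is given below. The choice of VGG-16 features (up to relu3_3), the handling of single-channel CT inputs, and the assumption that the caller supplies the multi-scale side outputs and matching downscaled ground-truth images are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn.functional as F
import torchvision

# Frozen VGG-16 feature extractor for the perceptual loss of Eq. (10);
# the specific depth (up to relu3_3) is an assumption.
vgg = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)


def attention_loss(maps, M, theta=0.8):
    """Eq. (7): exponentially weighted MSE between the attention maps m_t and the mask M."""
    T = len(maps)
    return sum(theta ** (T - t) * F.mse_loss(m, M) for t, m in enumerate(maps, start=1))


def multiscale_loss(side_outputs, targets, weights=(0.6, 0.8, 1.0)):
    """Eq. (9): weighted MSE between intermediate outputs S_i and ground truth H_i at each scale."""
    return sum(w * F.mse_loss(s, h) for w, s, h in zip(weights, side_outputs, targets))


def perceptual_loss(fake, real):
    """Eq. (10): MSE in VGG feature space; single-channel CT images are repeated to 3 channels."""
    return F.mse_loss(vgg(fake.repeat(1, 3, 1, 1)), vgg(real.repeat(1, 3, 1, 1)))


def generator_loss(d_fake, maps, M, side_outputs, targets, fake, real):
    """Eq. (8): adversarial + attention + multi-scale + perceptual terms."""
    l_gan = torch.log(1.0 - d_fake + 1e-8).mean()  # L_GAN(G(L)) = log(1 - D(G(L)))
    return (l_gan
            + attention_loss(maps, M)
            + multiscale_loss(side_outputs, targets)
            + perceptual_loss(fake, real))
```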

2.2.2. Discriminative Network with Visual Attention

The local discriminator is designed to perform region-specific validation [10], which is particularly useful for restoring texture-rich regions. A problem, however, is that in the testing stage the noise distribution is unknown. To deal with this, the attention map A_T generated by the attentive block in GN is used to construct the loss function

L_D(G(L), R, A_T) = −log(D(R)) − log(1 − D(G(L))) + γ L_A(G(L), R, A_T),    (11)

where L_A is the loss between the features extracted from interior layers of DN and A_T:

L_A(G(L), R, A_T) = L_MSE(D_map(G(L)), A_T) + L_MSE(D_map(R), 0),    (12)

where D_map represents the process of generating the attention map in DN. In our study, γ was empirically set to 0.05, R is a randomly selected NDCT sample, and 0 denotes an all-zero map. Thus, the second term in Eq. (12) implies that for R there is no region that requires attention. Specifically, in this work the proposed attentive DN has eight convolutional layers (kernel size 3 × 3) followed by a fully connected layer with 512 neurons and a sigmoid activation function.
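A minimal sketch of Eqs. (11)-(12) follows, assuming the discriminator exposes both its scalar real/fake score and the attention map produced by its interior layers; the argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake, dmap_fake, dmap_real, attention_map, gamma=0.05):
    """Eqs. (11)-(12): adversarial loss plus the attentive consistency term.

    d_real, d_fake       -- discriminator scores for an NDCT sample R and for G(L)
    dmap_fake, dmap_real -- attention maps produced by the interior layers of DN
    attention_map        -- A_T, the final map from the generator's attentive block
    gamma                -- weight of the attentive term (0.05 in the paper)
    """
    adversarial = -torch.log(d_real + 1e-8).mean() - torch.log(1.0 - d_fake + 1e-8).mean()
    # Eq. (12): DN's map for the restored image should match A_T, while its map
    # for a clean NDCT sample should be all zeros (nothing to attend to).
    attentive = (F.mse_loss(dmap_fake, attention_map)
                 + F.mse_loss(dmap_real, torch.zeros_like(dmap_real)))
    return adversarial + gamma * attentive
```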

3. EXPERIMENTS

3.1. Dataset

In this study, the Mayo public clinical CT dataset was used [11]; it was prepared for "the 2016 NIH-AAPM-Mayo Clinic Low Dose CT Grand Challenge" to evaluate competing LDCT image reconstruction algorithms. The dataset consists of NDCT images of 10 anonymized patients and the corresponding simulated quarter-dose LDCT images obtained after realistic noise insertion. In our experiments, we randomly extracted 140,000 pairs of 64 × 64 image patches from 4,000 slices as the training set, and another 7,744 pairs of patches from 1,936 images for testing.

3.2. Parameters

In our experiments, the initial attention map was set to 0.5 everywhere. The networks were optimized with the Adam algorithm [12], with a batch size of 32 and an initial learning rate of 0.0005. The method was implemented in PyTorch [13] and trained on an NVIDIA Titan V GPU. To evaluate the effectiveness of the proposed method, we compare different network structures: GN (a CNN-based generative network), GN+DN (a typical GAN structure), GN+ADN (a GAN with an attentive discriminator), and AGN+ADN (the proposed visual attentive network, VAOGAN). In addition, BM3D [1] was selected as a representative post-processing technique.
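As an illustration of the patch extraction described in Section 3.1, the sketch below randomly crops aligned 64 × 64 LDCT/NDCT patch pairs from a stack of co-registered slices; the random-cropping strategy itself is an assumption, since the paper only reports the number of pairs and the patch size.

```python
import numpy as np

def extract_patch_pairs(ldct_slices, ndct_slices, n_patches, patch_size=64, seed=0):
    """Randomly crop aligned LDCT/NDCT patch pairs from a stack of slices.

    ldct_slices, ndct_slices: arrays of shape (S, H, W) with matching geometry.
    """
    rng = np.random.default_rng(seed)
    S, H, W = ldct_slices.shape
    pairs = []
    for _ in range(n_patches):
        s = rng.integers(S)                      # pick a slice
        y = rng.integers(H - patch_size + 1)     # top-left corner of the crop
        x = rng.integers(W - patch_size + 1)
        pairs.append((ldct_slices[s, y:y + patch_size, x:x + patch_size],
                      ndct_slices[s, y:y + patch_size, x:x + patch_size]))
    return pairs
```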

Fig. 2. Results from a transverse thoracic CT image. The display window is [-160, 240] HU.

3.3. Results

Figs. 2 and 3 show the results for two selected slices processed with the different methods. The BM3D results are over-smoothed with some waxy artifacts, while all the networks show superior denoising/restoration ability. Two zoomed regions of interest (ROIs) are shown in Figs. 4 and 5. The red circles indicate lesions, and the arrows point to representative regions where only our method revealed some critical details. Compared with BM3D, GN achieved a better visual impression but blurred some details that are resolved in the NDCT images. The networks with a discriminator performed roughly similarly, and the proposed VAOGAN achieved the best results. For further evaluation, quantitative results in terms of peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [14] are summarized in Tables 1 and 2 for the selected slices and for all images in the testing set, respectively. VAOGAN achieved the best scores for both indices.
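For reference, both indices can be computed with standard implementations; a minimal sketch using scikit-image is shown below. Whether the paper evaluates the metrics on HU values or on normalized intensities is not stated, so the data_range handling is an assumption.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(restored, reference, data_range=None):
    """Compute PSNR and SSIM [14] for a restored CT image against its NDCT reference."""
    if data_range is None:
        # Default to the dynamic range of the reference image (an assumption).
        data_range = float(reference.max() - reference.min())
    psnr = peak_signal_noise_ratio(reference, restored, data_range=data_range)
    ssim = structural_similarity(reference, restored, data_range=data_range)
    return psnr, ssim
```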

Fig. 3. Results from a transverse abdomen CT image. The display window is [-160, 240] HU.

Fig. 4. Results from the zoomed ROIs in Fig. 2. The display window is [-160, 240] HU.

Fig. 6. Visualization of the attention maps generated by the attentive block. The first row shows LDCT images, the second row the corresponding difference images between NDCT and LDCT images, the third row the attention maps acquired by the attentive block, and the last row the denoised images.

Fig. 5. Results from the zoomed ROIs in Fig. 3. The display window is [-160, 240] HU.

4. DISCUSSIONS AND CONCLUSION

To demonstrate the benefits of the visual attention mechanism, Fig. 6 shows four cases with their estimated attention maps. The attention maps are quite similar to the real noise distributions, which allows them to guide the denoising procedure efficiently.

Inspired by the visual attention mechanism [7], in this paper we have introduced visual attention information into the GAN framework for low-dose CT image denoising/restoration. Considering that GAN is a weakly supervised generative model, it is difficult to precisely recover the corresponding NDCT images without additional information. As reported above, our GAN-based denoising results, aided by the learned visual attention cues, are encouraging. The experimental results demonstrate that the proposed method outperforms the competing methods both qualitatively and quantitatively, and appears superior in both noise suppression and detail preservation. Systematic evaluation and task-based optimization are in progress.

TABLE 1. Quantitative results (PSNR in dB / SSIM) of different methods for the two selected slices in Figs. 2 and 3.

Method    Slice 1 PSNR   Slice 1 SSIM   Slice 2 PSNR   Slice 2 SSIM
LDCT      38.55          0.9671         35.77          0.9411
BM3D      42.04          0.9846         40.71          0.9790
GN        42.37          0.9874         40.39          0.9790
GN+DN     42.68          0.9876         40.55          0.9795
GN+ADN    41.75          0.9859         39.84          0.9750
VAOGAN    42.80          0.9882         40.78          0.9811

TABLE 2. Quantitative results (PSNR in dB / SSIM) of different methods for the testing set.

Method    PSNR     SSIM
LDCT      38.08    0.9600
BM3D      41.92    0.9820
GN        42.17    0.9855
GN+DN     42.27    0.9852
GN+ADN    41.40    0.9822
VAOGAN    42.54    0.9864

5. REFERENCES

[1] F. P. Fumene, C. Vinegoni, J. Gros, A. Sbarbati, and R. Weissleder, "Block matching 3D random noise filtering for absorption optical projection tomography," Phys. Med. Biol., vol. 55, no. 18, pp. 5401-5415, 2010.
[2] Y. Chen et al., "Artifact suppressed dictionary learning for low-dose CT image processing," IEEE Trans. Med. Imaging, vol. 33, no. 12, pp. 2271-2292, 2014.
[3] E. Kang, J. Min, and J. C. Ye, "A deep convolutional neural network using directional wavelets for low-dose X-ray CT reconstruction," Med. Phys., vol. 44, no. 10, pp. e360-e375, 2017.
[4] H. Chen, Y. Zhang, M. K. Kalra, F. Lin, Y. Chen, P. Liao, J. Zhou, and G. Wang, "Low-dose CT with a residual encoder-decoder convolutional neural network," IEEE Trans. Med. Imaging, vol. 36, no. 12, pp. 2524-2535, 2017.
[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2014, pp. 2672-2680.
[6] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, "Recurrent models of visual attention," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2014, pp. 2204-2212.
[7] B. Zhao, X. Wu, J. Feng, Q. Peng, and S. Yan, "Diversified visual attention networks for fine-grained object classification," IEEE Trans. Multimedia, vol. 19, no. 6, pp. 1245-1256, 2017.
[8] X. Shi, Z. Chen, H. Wang, D. Y. Yeung, W. K. Wong, and W. C. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2015, pp. 802-810.
[9] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput. Assist. Interv. (MICCAI), 2015, pp. 234-241.
[10] S. Iizuka, E. Simo-Serra, and H. Ishikawa, "Globally and locally consistent image completion," ACM Trans. Graphics, vol. 36, no. 4, article 107, 2017.
[11] AAPM, "Low dose CT grand challenge," 2017. [Online]. Available: http://www.aapm.org/GrandChallenge/LowDoseCT/#
[12] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980, 2014.
[13] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2017.
[14] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600-612, 2004.