Style Neutralization Generative Adversarial Classifier

Haochuan Jiang1, Kaizhu Huang1, Rui Zhang2, and Amir Hussain3

1 Dept. of EEE, 2 Dept. of MS, Xi’an Jiaotong-Liverpool University, 111 Ren’ai Rd., Suzhou, Jiangsu, P.R. China; 3 Div. of Computing Science and Maths, University of Stirling, Stirling, FK9 4LA, Scotland, U.K.

Abstract. The recently proposed deep Generative Adversarial Network (GAN) has brought remarkable improvements. Most existing GAN-based models concentrate on generating realistic and vivid patterns with a pattern generator aided by a binary discriminator. However, few studies have used the generated patterns to improve classification performance. In this paper, a novel and generalized classification framework called Style Neutralization Generative Adversarial Classifier (SN-GAC), based on the GAN framework, is introduced to enhance classification accuracy by neutralizing possibly inconsistent style information in the original data. In the proposed model, the generator of SN-GAC is trained to map original patterns with certain styles (source) to their style-neutralized or standard counterparts (standard-target), so that it can generate the targeted style-neutralized pattern (generated-target). Pairs of both standard (source + standard-target) and generated (source + generated-target) patterns are then fed into the discriminator, which is optimized not only to distinguish real from fake but also to assign the correct class label to each input pair. Empirical experiments demonstrate the effectiveness of the proposed SN-GAC framework, which achieves the highest accuracy so far on two benchmark classification databases, covering face images and Chinese handwritten characters, and outperforms several relevant state-of-the-art baseline approaches.

1 Introduction

Traditional Generative Adversarial Network (GAN) [5] based approaches aim at generating realistic patterns with the aid of a discriminative model by implicitly approximating the high-dimensional real data distribution. In contrast, this paper investigates a novel GAN-based classifier named Style Neutralization Generative Adversarial Classifier (SN-GAC), which neutralizes the diverse style information embedded in the original patterns and promotes classification performance with the aid of the samples produced by the generator.

Related problems have mostly been considered for data generated from multiple sources, where each source carries its own style information and styles differ across groups. Two major approaches have been used: Multi-task Learning [4] (MTL, where one classifier is obtained for each group while the inter-relationship between groups is taken into account), and field classification [21, 8, 10] (where style-free data is produced by a style normalization transformation, represented by either a linear or a nonlinear kernelized mapping).

As a generalized framework, the proposed SN-GAC model is capable of obtaining standard patterns by neutralizing the style information attached to the original data. Importantly, the generation of standard patterns is designed for and integrated with classifier optimization. It can hence neutralize styles in data and consequently benefit classification performance in many real applications. Such scenarios include face recognition, where photos are assigned to several groups and those in each group are taken with a specific head pose [6], and handwritten character classification, where characters are written by multiple writers according to their own writing habits [15]. Traditional approaches may suffer from degraded performance because of multiple, diverse, and inconsistent style information.

Fig. 1: Traditional Classifier and the proposed SN-GAC Model. (a) Traditional Classification Model; (b) Proposed SN-GAC Model.

Specifically, inspired by the GAN-based Pix2Pix framework recently introduced in [9], the proposed SN-GAC neutralizes diverse styles in data by learning standard patterns, with the final purpose of promoting classification performance. In more detail, the SN-GAC model consists of two independent networks: a U-Net [9] based generator (G) and a discriminator with an auxiliary classifier (D-C [16]).¹ G is responsible for producing high-quality style-neutralized or standard patterns given input patterns with various style information, while D-C assigns class labels given pairs of styled and style-neutralized patterns, as depicted in Fig. 1(b). The proposed classification framework differs significantly from many traditional approaches (Fig. 1(a)), where all samples are simply fed into the classifier. Additionally, in the proposed SN-GAC model, style neutralization is performed by the nonlinear neural network G, which enables the representation of sufficiently complicated style information. Moreover, as an inherent merit of GAN approaches, no assumption on the data distribution is required.

¹ The discriminator with auxiliary classifier is termed D-C in this paper since it differs from the D of the traditional GAN in [5]. Moreover, the proposed D-C also differs from [16]: the classifier in the SN-GAC model can be directly applied to normal classification once it is well trained, whereas the auxiliary classifier in [16] is only used to provide supervising information for better GAN training.

The optimization of the proposed SN-GAC model is a two-stage effort. Initially, G is trained adversarially to generate realistic images, while D-C is optimized with both adversarial and categorical losses. Once G has saturated and produces high-quality style-neutralized patterns, D-C is fine-tuned with only the categorical objective to further improve classification accuracy. Both steps serve clear purposes and are necessary for high-quality style-neutralized examples and accurate classification.

The proposed SN-GAC model is an end-to-end framework that improves recognition accuracy jointly with the adversarial optimization while producing realistic samples, saving both model learning time and storage. It is a generalized framework: it is not only capable of transforming groups of patterns, but can also be applied more generally to any classification situation involving examples with multiple styles.²

Major contributions of this paper are listed as follows:
– A novel classification framework named SN-GAC is introduced, which is significantly different from traditional classification models;
– A two-step training strategy is specifically designed for the purpose of generating high-quality style-neutralized patterns as well as achieving high classification accuracy;
– The classification performance is promoted without any extra training effort except the GAN optimization itself.

2 Model Architecture

The SN-GAC model is built on the GAN-based Pix2Pix framework [9], with an auxiliary classifier attached to the discriminator to assign class labels. This section first briefly defines several preliminaries, then presents the detailed model architecture, followed by the two-stage training strategy. The SN-GAC model is illustrated in Fig. 2.

² The proposed SN-GAC model is evaluated in this paper only on datasets that specify groups of style patterns, for simplicity.

Fig. 2: The SN-GAC Architecture, which includes a generator (G) and a discriminator with an auxiliary classifier (D-C). G consists of an embedder network (E) for style vector inference, a convolutional encoder, and a deconvolutional decoder. It generates a Generated-Target (G(x)) when given a Source (x). D-C is a convolutional network: its discriminative output distinguishes whether the input pair comes from the real or from the generated data, while its categorical output assigns the class label of the input pair.

2.1 Preliminaries

Definition 1. Source denotes data equipped with style information, namely, x. It is associated with a class label y.

Definition 2. Standard-Target is defined as the pattern with the standard style that corresponds to the source x, denoted as x∗. Note that the standard target style needs to be specified before training. The proposed SN-GAC model builds a nonlinear neural mapping from the multiple and diverse styles to this standard target.

Definition 3. Pair of Source & Standard-Target is then denoted as {x, x∗}. In fact, each target (x∗) can correspond to multiple sources (x) originating from more than one data generator.

Definition 4. Generated-Target is defined as the style-neutralized output of G in the proposed SN-GAC model given the source x. It is denoted as G(x).

Definition 5. Pair of Source & Generated-Target is denoted as {x, G(x)} to represent the correspondence.
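The pairing in Definitions 1 to 5 can be illustrated with a small sketch. Everything named here (`Sample`, `build_pairs`, `toy_generator`, the string stand-ins for images) is hypothetical and only serves the illustration; it also assumes that the Standard-Target can be looked up by class label, which holds for the two experiments reported later but is not stated as a general rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Sample:
    """A Source pattern x with its class label y (Definition 1)."""
    x: str      # stand-in for the image data
    y: int      # class label
    style: str  # e.g. a head pose or a writer id


Pair = Tuple[str, str, int]  # (source, target, class label)


def build_pairs(
    sources: List[Sample],
    standard_targets: Dict[int, str],
    generator: Callable[[str], str],
) -> Tuple[List[Pair], List[Pair]]:
    """Return the {x, x*} pairs and the {x, G(x)} pairs, each tagged with y."""
    # Assumption: one Standard-Target x* per class label.
    real_pairs = [(s.x, standard_targets[s.y], s.y) for s in sources]
    generated_pairs = [(s.x, generator(s.x), s.y) for s in sources]
    return real_pairs, generated_pairs


def toy_generator(x: str) -> str:
    """Hypothetical stand-in for G; a real G would be a trained network."""
    return x.split("_")[0] + "_generated"


# Two sources of class 0 with different styles share one Standard-Target.
sources = [
    Sample("face0_yaw-30", 0, "yaw-30"),
    Sample("face0_yaw+45", 0, "yaw+45"),
    Sample("face1_yaw-30", 1, "yaw-30"),
]
standard_targets = {0: "face0_yaw0", 1: "face1_yaw0"}

real_pairs, generated_pairs = build_pairs(sources, standard_targets, toy_generator)
for pair in real_pairs + generated_pairs:
    print(pair)
```

The first two real pairs end with the same target, which is the multiple-sources-to-one-target correspondence noted under Definition 3.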

2.2 U-Net based Generator

Similar to [9], the G network of the SN-GAC model is based on the U-Net [9] with skip connections. Given a source pattern x (Definition 1), the G network is capable of generating a high-quality Generated-Target pattern G(x) (Definition 4). The G network consists of a convolutional encoder network (Enc), which maps x to high-level encoded features (denoted as Enc(x)), and an upsampling decoder network, which transforms the high-level encoded features into the targeted style-neutralized counterpart G(x). The deconvolutional operation, seen as the reverse of the convolution, is employed as the upsampling function. Moreover, skip connections from the encoder to the decoder are applied to align structures and features at the equivalent level. As a result, the input feature of each decoding layer comes not only from the previous decoding layer but also from the encoding layer at the same level.

The quality of the Generated-Target G(x) is maintained by penalizing the adversarial loss proposed in [5], so that G maximally confuses D. Additionally, the L1 reconstruction error between x∗ and G(x), namely, $L_{l1} = \|x^* - G(x)\|_1$, which encourages sharp and clear image details [9], is also applied. Moreover, the constant loss introduced in [18] is engaged as an additional restriction to encourage high-quality output patterns. Specifically, it penalizes the L2 difference between the encoded representations of two input patterns. In the proposed SN-GAC model, two constant losses, $L_{const1} = \|\mathrm{Enc}(x) - \mathrm{Enc}(G(x))\|_2$ and $L_{const2} = \|\mathrm{Enc}(x^*) - \mathrm{Enc}(G(x))\|_2$, are summed to form the total constant error, namely, $L_{const} = L_{const1} + L_{const2}$.

Instead of the explicit random noise fed into G in traditional GANs, dropout [?], which serves as implicit random noise [9], is applied to several layers of both the encoder and the decoder during training. It is switched off when performing network inference.

2.3 Embedder Network for Style Representation

An extra embedder network (denoted as E) is employed as part of G. It represents the embedded style information of the input source pattern, so that the multi-to-one mapping can be modeled. According to [11], patterns in the same style tend to be clustered closer in the deep feature space. In this sense, E can be realized with an extra deep model, either fine-tuned from a model pre-trained on a similar classification task or trained from scratch. It takes the features from the final layer (the inferred logits before the sigmoid or softmax function) as the style vector (E(x)). These are then concatenated with the output of the encoder (Enc(x)) before being fed into the decoder together.

2.4 Discriminator with Auxiliary Classifier

The discriminator with an auxiliary classifier [16] (D-C network) is applied in the proposed SN-GAC model. It is a CNN classifier embedded in the DC-GAN framework. As depicted in Fig. 2, in each training iteration, batches of pattern pairs consisting of both Source & Standard-Target and Source & Generated-Target are fed into D-C, which is optimized not only to differentiate between the two kinds of pairs (the same as in the vanilla GAN [5], denoted as D training), but also to assign the correct class label (y) to the given pair (C training). The input pair of D (x, x∗) can be considered an implicit regularization, penalizing over-flexible style transformation between the Source (with style) and the style-neutralized targets. Similar ideas are implemented in [8, 21, 10] with explicit expressions.³

³ Paired input is not evaluated for conventional baselines in Section 3, since style neutralization cannot be achieved with traditional approaches.

2.5 Two-Phase Training Strategy with Multiple Losses

Initial training for both G and D-C. Both networks are updated in an iterative fashion. G is trained by minimizing the following sum of losses, for both sufficiently confusing D and assigning the correct label by C:

$$L_G = (L_C - L_D) + \alpha \cdot L_{l1} + \beta \cdot L_{const} \qquad (1)$$

where α and β are hyper-parameters. The adversarial and the categorical losses are given in Eq. (2) and Eq. (3), respectively:

$$L_D = \mathbb{E}_{x,x^*}[\log D(x, x^*)] + \mathbb{E}_{x}[\log(1 - D(x, G(x)))] \qquad (2)$$

$$L_C = \mathbb{E}_{x,x^*}[\log C(x, x^*)] + \mathbb{E}_{x}[\log C(x, G(x))] \qquad (3)$$

Meanwhile, the D-C network is optimized not to be fooled by G while assigning correct class labels, by maximizing the combined loss $L_{D-C} = L_C + L_D$. In each training iteration, G only has access to one batch of pairs (Source & Standard-Target), while two batches of pairs, (Source & Standard-Target) and (Source & Generated-Target), are fed into D-C. As suggested in [13], G is updated twice and D-C once for balanced training.

Fine-tuning for C. When G has stabilized, it is capable of generating high-quality style-neutralized patterns. The D-C network is then further fine-tuned by fixing the G network while minimizing only the categorical objective $L_C$.
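To make the combination of losses in Eqs. (1) to (3) concrete, the following minimal sketch evaluates them on toy numbers using only the Python standard library. It is an illustration under stated assumptions, not the training code of the paper: the expectations are approximated by batch means, the constant loss uses the non-squared L2 norm, the values of `alpha` and `beta` are arbitrary, and all function names are hypothetical. The expressions are evaluated exactly as printed; how their signs are handled inside an optimizer is not addressed here.

```python
import math
from typing import Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """||a - b||_1 for flattened patterns."""
    return sum(abs(p - q) for p, q in zip(a, b))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """||a - b||_2 (assumed non-squared) for encoded features."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def adversarial_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Eq. (2): d_real holds D(x, x*), d_fake holds D(x, G(x))."""
    return mean([math.log(p) for p in d_real]) + mean(
        [math.log(1.0 - p) for p in d_fake]
    )


def categorical_loss(c_real: Sequence[float], c_fake: Sequence[float]) -> float:
    """Eq. (3): inputs are the probabilities assigned to the correct label y."""
    return mean([math.log(p) for p in c_real]) + mean(
        [math.log(p) for p in c_fake]
    )


def generator_loss(l_c, l_d, l_l1, l_const, alpha, beta) -> float:
    """Eq. (1), evaluated as printed."""
    return (l_c - l_d) + alpha * l_l1 + beta * l_const


# Toy values standing in for one pair (x, x*) and the outputs of G, Enc, D, C.
x_star, g_x = [0.0, 1.0, 2.0], [1.0, 1.0, 0.0]
enc_x, enc_x_star, enc_g_x = [0.0, 0.0], [3.0, 4.0], [3.0, 4.0]

l_l1 = l1_distance(x_star, g_x)
l_const = l2_distance(enc_x, enc_g_x) + l2_distance(enc_x_star, enc_g_x)
l_d = adversarial_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
l_c = categorical_loss(c_real=[1.0, 1.0], c_fake=[1.0, 1.0])

l_g = generator_loss(l_c, l_d, l_l1, l_const, alpha=2.0, beta=0.5)  # arbitrary weights
l_dc = l_c + l_d  # combined D-C objective

print(f"L_l1={l_l1}, L_const={l_const}, L_D={l_d:.4f}, L_C={l_c:.4f}")
print(f"L_G={l_g:.4f}, L_D-C={l_dc:.4f}")
```

With a discriminator that outputs 0.5 everywhere, as in the toy values, $L_D$ equals $2\log 0.5$, the value reached when D cannot tell the two kinds of pairs apart.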

3 Experiments

Two benchmark data sets, the Point’04 database [6] for face recognition and the CASIA offline database [15] for Chinese handwritten character classification, are used to evaluate the proposed SN-GAC model. Relevant baselines are compared on both sets: the Support Vector Machine (SVM) [3], the Mean Regularized MTL (MR-MTL) model [4], and several field classification approaches, including one special case of the Field Bayesian (F-BM) model [21], namely the Field Nearest Class Mean (F-NCM), and the Field Support Vector Classification (F-SVC) model [8]. Moreover, several conventional state-of-the-art techniques that do not consider style information are also implemented and compared. These include the Nearest Class Mean (NCM), the Support Vector Classification (SVC) [3], and two specific deep convolutional neural networks, i.e., the Vgg-Face [2] and the Alexnet [14], for the face and handwriting data respectively. For the SVC-based models (SVC, F-SVC, and MR-MTL), only the lowest error obtained is reported (over the linear and RBF kernels, Ln and RBF for short respectively). On the face set, the F-SVC model is compared both with and without style transfer (ST and Non-ST respectively).

For each set, some further state-of-the-art models are also compared: the Style Mixture Model (SMM) [17], the Bilinear Model (BM) [19], and the Fisherface Discriminant Analysis (FDA) [12] for the face set; and the F-BM [21], the Field Modified Quadratic Discriminant Function (F-MQDF), and the Modified Quadratic Discriminant Function (MQDF) for the Chinese handwriting task.

The basic Pix2Pix framework follows [20], while the choice of E depends on the data set. For each set, the Standard-Target needs to be specified, as detailed in the following sections. The whole model is built on the Google Tensorflow Deep Learning Library (r1.4) [1].


3.1 Face Classification across Head Yaw Poses

Fig. 3: Examples from the Point’04 Database. Each column represents a specific head pose (a style). 1st Row: Source (x), where the image with a red box is chosen as the Standard-Target (x∗); 2nd Row: Generated-Target (G(x)) generated by the generator G.

Table 1: Error Rate on the Point’04 Database

Method                     Error Rate
FDA                        30.67%
SMM                        26.67%
BM                         40.00%
NCM                        40.00%
CNN (Vgg-Face)             9.33%
F-NCM                      21.33%
SVM (RBF)                  14.67%
MR-MTL (Ln / RBF)          14.67%
F-SVC (Non-ST, RBF)        12.00%
F-SVC (ST, Ln / RBF)       0.00%
SN-GAC                     0.00%

The experiment involves 15 people in total from the Point’04 Database [6]. For each person, only the faces with zero pitch are selected. 13 different yaw angles in the range [−90°, +90°], spaced 15° apart, are chosen, resulting in 195 images. The experimental setting follows [8]. For the proposed SN-GAC model, the images are resized to 256 × 256 so that they can be easily processed by both the G and the D-C networks. It is straightforward to select the images with zero yaw angle as the style-neutralized Standard-Targets. Images taken at each yaw pose are regarded as sharing consistent style information (each column in Fig. 3). The classification is conducted on faces, examples of which are displayed in the first row of Fig. 3. Images from the first 8 poses (the left 8 columns in Fig. 3) are put into the training set, while those from the remaining 5 poses are placed into the testing set. For the SN-GAC model, the E network is obtained by fine-tuning the last fully connected and the first convolutional layers of the Vgg-Face model [2].
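The sizes implied by this split can be checked with a short sketch. The ordering of the poses is an assumption: the code takes the "first 8 poses" to be the yaw angles from −90° upward, which is one plausible reading of the column order in Fig. 3. The identifiers (`images`, `train_yaws`, and so on) are hypothetical and stand in for the actual image files.

```python
# 13 yaw angles from -90 to +90 degrees in steps of 15 degrees.
yaw_angles = list(range(-90, 91, 15))
num_people = 15

# One zero-pitch image per (person, yaw) combination.
images = [(person, yaw) for person in range(num_people) for yaw in yaw_angles]

# Assumption: "first 8 poses" means the 8 smallest yaw angles.
train_yaws = yaw_angles[:8]
test_yaws = yaw_angles[8:]

train_set = [(p, yaw) for (p, yaw) in images if yaw in train_yaws]
test_set = [(p, yaw) for (p, yaw) in images if yaw in test_yaws]

# The Standard-Target of each person is that person's zero-yaw image.
standard_targets = {person: (person, 0) for person in range(num_people)}

print(len(yaw_angles), len(images), len(train_set), len(test_set))
```

Whatever the pose ordering, 8 training poses and 5 testing poses over 15 people give 120 training and 75 testing images, so each test error corresponds to a step of 1/75 in the error rate.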



It can be seen clearly from the results in Table 1 that the proposed SN-GAC model and the F-SVC model with style transfer both achieve a zero error rate. However, F-SVC obtains this performance by taking advantage of the test data, using a self-training strategy to transfer the trained style to the unseen one. There is no such setting in the proposed SN-GAC framework. Moreover, as the images in the second row of Fig. 3 show, the style-neutralized images generated by the nonlinear mapping of the proposed SN-GAC model can be readily understood by human observers, with only insignificant defects. In comparison, the standard images obtained by the F-SVC model [8] are usually less similar to a real image. In addition, F-SVC can only produce style-normalized data with the linear kernel, which is insufficient to represent the multiple, diverse, and complicated style information in real scenarios.

3.2 Chinese Handwriting Classification across Writers

The offline version of the CASIA dataset [15] is also used to evaluate the proposed SN-GAC model. The original data include 3,755 categories of Chinese characters. As described in [21], 100 writers (no. 1,101 to no. 1,200) are involved in this experiment. For simplicity, only the first 30 characters are chosen. Since people are more likely to write cursively in running text than in isolated characters, the isolated set (CASIA-HWDB1.1) is chosen as the training set, while the cursive text set (CASIA-HWDB2.1) is used for testing. The total number of samples is 2,995 for the training set and 288 for the testing set. Note that each testing sample shares its style with some of the training samples; however, a style difference remains between isolated characters and their cursive counterparts.

Different from the Point’04 data, the Standard-Target in this evaluation does not come from the CASIA database. Instead, the standard ’Heiti’ font is chosen, as illustrated in the second column of Fig. 4. The Alexnet [14], with both batch normalization and dropout, is used to form E. It is optimized from scratch without any pre-training. Pixel values are fed directly into the Alexnet after resizing to 227 × 227. Similarly, images are resized to 256 × 256 for the SN-GAC model. For the other baselines, the original 256 × 256 features are compressed to 512 dimensions with PCA.

As seen in Table 2, the proposed SN-GAC model attains the highest accuracy, along with the self-training F-SVC model. Examining the incorrectly classified samples shown in Fig. 4 suggests that most of the errors come from confusing, cursively written Sources. Some of them are difficult even for a human to recognize. In such cases, G generates incorrect or even unclear Generated-Target examples. However, even if G does not perform well, D-C may still give a reliable class label based on the generated sample.
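Because the testing set is small, the error rates in Table 2 correspond to only a handful of misclassified samples. The sketch below converts each reported rate into the implied number of errors, assuming that every rate is computed over all 288 testing samples; the helper name `implied_errors` is hypothetical.

```python
TEST_SET_SIZE = 288  # number of testing samples stated in the text

# Error rates (in percent) as reported in Table 2.
error_rates_percent = {
    "NCM": 5.56,
    "F-NCM": 4.51,
    "MQDF": 5.56,
    "F-MQDF": 4.51,
    "CNN (Alexnet)": 2.78,
    "SVM (Ln / RBF)": 3.47,
    "MR-MTL (RBF)": 2.78,
    "F-SVC (ST, Ln)": 2.08,
    "SN-GAC": 2.08,
}


def implied_errors(rate_percent: float, num_test: int) -> int:
    """Nearest whole number of misclassified samples for a given error rate."""
    return round(rate_percent / 100.0 * num_test)


implied = {m: implied_errors(r, TEST_SET_SIZE) for m, r in error_rates_percent.items()}

# Sanity check: each count reproduces the reported rate to two decimals.
consistent = all(
    abs(100.0 * implied[m] / TEST_SET_SIZE - r) < 0.005
    for m, r in error_rates_percent.items()
)

for method, count in implied.items():
    print(f"{method}: {count} of {TEST_SET_SIZE}")
print("rates consistent with whole-sample counts:", consistent)
```

Under this assumption, neighbouring rows of the table differ by only a few samples, which is worth keeping in mind when comparing methods on this set.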

Table 2: Error Rate on the CASIA Offline Database

Method                     Error Rate
NCM                        5.56%
F-NCM                      4.51%
MQDF                       5.56%
F-MQDF                     4.51%
CNN (Alexnet)              2.78%
SVM (Ln / RBF)             3.47%
MR-MTL (RBF)               2.78%
F-SVC (ST, Ln)             2.08%
SN-GAC                     2.08%

Fig. 4: Some incorrectly classified examples on the CASIA Offline Database. 1st Column: Source (x); 2nd Column: Standard-Target (x∗); 3rd Column: Generated-Target (G(x)); 4th Column: class label assigned by D-C from the generated sample in the 3rd Column.

4 Conclusion and Future Work

A novel classification framework, named Style Neutralization Generative Adversarial Classifier (SN-GAC), based on the emerging Generative Adversarial Network (GAN), is proposed in this paper. It is designed to neutralize diverse and inconsistent style information in the original data by mapping the data to patterns with a standard style. The style-neutralized features are believed to be more compact and centralized, which benefits the subsequent classification task [7]. Aiming at promoting recognition accuracy directly, it trains no extra classification model except the SN-GAC itself. Empirical experiments on two benchmark datasets have demonstrated that the proposed SN-GAC model achieves the highest classification performance so far without taking advantage of the test data during training through a self-training strategy, while also generating high-quality, human-understandable style-neutralized patterns. Future work includes the extension of the SN-GAC model to large-category classification (e.g. recognition of the 3,755 classes in the whole CASIA dataset [15]), as well as a style transfer scheme to further reduce the classification error caused by the style shift between training and validation.

Acknowledgements. The work reported here was partially supported by the following: National Natural Science Foundation of China under grant no. 61473236; Natural Science Fund for Colleges and Universities in Jiangsu Province under grant no. 17KJD520010; Suzhou Science and Technology Program under grant no. SYG201712, SZS201613; Jiangsu University Natural Science Research Programme under grant no. 17KJB520041; Key Program Special Fund in XJTLU (KSF-A-01).

References

1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: Tensorflow: A system for large-scale machine learning. In: OSDI. vol. 16, pp. 265–283 (2016)
2. Cate, H., Dalvi, F., Hussain, Z.: Deepface: Face generation using deep learning. arXiv preprint arXiv:1701.01876 (2017)
3. Cortes, C., Vapnik, V.: Support-vector networks. Machine learning 20(3), 273–297 (1995)
4. Evgeniou, T., Pontil, M.: Regularized multi–task learning. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 109–117. ACM (2004)
5. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. pp. 2672–2680 (2014)
6. Gourier, N., Hall, D., Crowley, J.L.: Estimating face orientation from robust detection of salient facial structures. In: FG Net Workshop on Visual Observation of Deictic Gestures. vol. 6, p. 7 (2004)
7. Huang, K.Z., Yang, H., King, I., Lyu, M.R.: Machine learning: modeling data locally and globally. Springer Science & Business Media (2008)
8. Huang, K., Jiang, H., Zhang, X.Y.: Field support vector machines. IEEE Transactions on Emerging Topics in Computational Intelligence 1(6), 454–463 (2017)
9. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. arXiv preprint (2017)
10. Jiang, H., Huang, K., Zhang, R.: Field support vector regression. In: International Conference on Neural Information Processing. pp. 699–708. Springer (2017)
11. Jiang, Y., Lian, Z., Tang, Y., Xiao, J.: Dcfont: an end-to-end deep chinese font generation system. In: SIGGRAPH Asia 2017 Technical Briefs. p. 22. ACM (2017)
12. Jing, X.Y., Wong, H.S., Zhang, D.: Face recognition based on 2d fisherface approach. Pattern Recognition 39(4), 707–710 (2006)
13. Kim, T.: Github dcgan-tensorflow (2016), https://github.com/carpedm20/DCGAN-tensorflow
14. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. pp. 1097–1105 (2012)
15. Liu, C.L., Yin, F., Wang, D.H., Wang, Q.F.: Casia online and offline chinese handwriting databases. In: Document Analysis and Recognition (ICDAR), 2011 International Conference on. pp. 37–41. IEEE (2011)
16. Odena, A., Olah, C., Shlens, J.: Conditional image synthesis with auxiliary classifier gans. arXiv preprint arXiv:1610.09585 (2016)
17. Sarkar, P., Nagy, G.: Style consistent classification of isogenous patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(1), 88–98 (2005)
18. Taigman, Y., Polyak, A., Wolf, L.: Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200 (2016)
19. Tenenbaum, J.B., Freeman, W.T.: Separating style and content with bilinear models. Neural Computation 12(6), 1247–1283 (June 2000)
20. Tian, Y.: Github zi2zi-tensorflow (2017), https://kaonashi-tyc.github.io/2017/04/06/zi2zi.html
21. Zhang, X.Y., Huang, K., Liu, C.L.: Pattern field classification with style normalized transformation. In: IJCAI Proceedings-International Joint Conference on Artificial Intelligence. vol. 22, p. 1621 (2011)