
A Deep Similarity Metric Method Based on Incomplete Data for Traffic Anomaly Detection in IoT

Xu Kang, Bin Song * and Fengyao Sun

The State Key Laboratory of Integrated Services Networks, Xidian University, Xi’an 710071, China; [email protected] (X.K.); [email protected] (F.S.) * Correspondence: [email protected]; Tel.: +86-29-8820-4409 Received: 31 October 2018; Accepted: 23 December 2018; Published: 2 January 2019


Abstract: In recent years, with the development of the Internet of Things (IoT) technology, a large amount of data can be captured from sensors for real-time analysis. By monitoring the traffic video data from the IoT, we can detect the anomalies that may occur and evaluate the security. However, the number of traffic anomalies is extremely limited, so there is a severe over-fitting problem when using traditional deep learning methods. In order to solve the problem above, we propose a similarity metric Convolutional Neural Network (CNN) based on a channel attention model for the traffic anomaly detection task. The method mainly includes (1) a Siamese network with a hierarchical attention model by word embedding, so that it can selectively measure similarities between anomalies and the templates; (2) a deep transfer learning method that can automatically annotate an unlabeled set while fine-tuning the network; (3) a background modeling method combining spatial and temporal information for anomaly extraction. Experiments show that the proposed method is three percentage points higher than the deep convolutional generative adversarial network (DCGAN) and five percentage points higher than the AutoEncoder in accuracy. No additional time is needed for the annotation process. The extracted candidates can be classified correctly through the proposed method.

Keywords: anomaly detection; incomplete data; similarity metric; transfer learning; background modeling

1. Introduction

Pattern recognition is one of the essential components of the Internet of Things (IoT) system and service. The IoT provides an extensive range of services with a complex architecture of multiple sensors integrated together; one of the important applications is the intelligent transportation service. With the development of urbanization, more and more people choose to drive their own cars. The fast-growing number of private cars brings more convenience to ordinary families, but it also brings great pressure to urban road conditions. Bad road conditions waste drivers' time as well as threaten the safety of people on roads. With all kinds of image sensors installed at crossroads and bayonets, the cloud computing center can gather and process the image data through the IoT. Based on the analysis of the captured image and video data with computer vision and image processing algorithms, the traffic anomalies can be automatically detected and recognized in the IoT. When anomalies take place, the sensors can catch the unusual situation and notify the drivers and passengers in time through the IoT to help them make the right decisions. Meanwhile, it also ensures the safety of other travelers on the road. Besides, the detected anomalies can be sent to the traffic center in time, assisting the traffic police to arrive at the scene in time to regulate the traffic.

Various behaviors in the IoT can be considered as abnormal events. In the traffic scene, the anomaly can be a car accident, an illegally stopped vehicle, or abnormal falling objects on the road such as a falling tire, a floating paper, or a road cone. These anomalies have some common characteristics.


They usually take place with a small probability rather than happening frequently. The anomalies on the road either do not appear in the previous frames or have stopped moving in previous frames. Anomalous objects are usually small, and their visual features are not easily distinguishable. Considering the characteristics of video abnormal events, computer vision methods like background modeling can be employed to model specific areas to find objects going from moving to stationary in videos. Then machine learning methods can be exploited to discriminate which categories the anomalies belong to. However, the task seems to be challenging because of the complex traffic environment. Moreover, anomalies are difficult to capture, and the number of training samples for them is insufficient compared with the normal objects. Therefore, traditional machine learning methods cannot deal with this kind of incomplete data with few effective samples. Some transfer learning and unsupervised learning methods can be applied to detect multiple kinds of anomalies.

The IoT is a significant part of the new generation of information technology. According to the agreed protocol, it links any articles with the Internet to implement information exchange and communication. Therefore, intelligent identification, location, tracking, monitoring, and management can be realized on the Internet. In consideration of IoT security, a number of challenges are indicated in [1–3]. An in-depth analysis of four major categories of anomaly detection techniques, including classification, statistics, information theory, and clustering, is summarized in [4]. Zhang et al. utilized D2D communication and the IoV to analyze the preferences of vehicle users, so as to protect the users' privacy and prevent accidents [5]. Wang et al. introduced deep reinforcement learning methods in 5G and IoT technology for users' security and privacy [6]. Riveiro et al. presented a visual analytics framework for multidimensional road traffic data and explored the detection of anomalous events [7]. Hoang proposed a new PCA technique and a new detection method for IoT with lower complexity [8]. A spatial localization constrained sparse coding approach for anomaly detection in traffic scenes is introduced in [9]. A fast and reliable method for anomaly detection and localization in video data showing crowded scenes is described in [10].

In recent years, the Convolutional Neural Network (CNN) has been popularized in computer vision and image processing [11–13]. A new CNN architecture employing an adaptation layer and an additional domain confusion loss to learn a representation that is both semantically meaningful and domain invariant is proposed in [13]. Increasingly, researchers apply CNN to anomaly detection tasks [14–18]. Combining hand-crafted appearance and motion features with deep features, unsupervised deep learning methods are presented in [14,17]. Erfani et al. reduce the feature dimension when the models are fused and improve the nonlinear expression ability of the model [16]. Carrera et al. exploit a convolutional sparse model by learning a dictionary of filters from normal images [15]. Anomalies are then detected by identifying outliers from the model. Hasan et al. learn a generative model for regular motion patterns using multiple sources with very limited supervision [17]. Li et al. apply compressed-sensing (CS) theory to generate the saliency map to label the vehicles and exploit the CNN scheme to classify them [18].
Then a fully convolutional feed-forward AutoEncoder is built to learn both the local features and the classifiers. Several works based on similarity metric CNN have achieved excellent results in many scenarios and tasks [19–23].

Generative Adversarial Networks (GAN) can continuously learn the distribution of simulated data and generate samples with a similar distribution to the training data [24–27]. Radford et al. proposed a class of CNNs called deep convolutional generative adversarial networks (DCGANs), which can learn a hierarchy of representations from object parts to scenes in both the generator and discriminator [24]. Odena et al. extended GAN to the semi-supervised context by outputting class labels from the discriminator network [25]. Bousmalis et al. utilized GAN to learn a transformation in the pixel domain from one domain to the other [26]. Kim et al. proposed the DiscoGAN to exploit relations between different domains so that the style from one domain to another can be successfully transferred [27].

In the past two years, AutoEncoders (AEs) have often been used to detect abnormal events, especially in video surveillance [28–33]. Kodirov et al. proposed the Semantic AutoEncoder (SAE), which projects a visual feature vector into the semantic space for zero-shot learning [28]. Their model can also be applied to the supervised problem.


Potapov et al. combined CNN and AE for cross-dataset transfer learning in person re-identification [29]. Spatio-temporal AutoEncoders are applied to detect abnormal events [31,33]. Zhao exploited 3-dimensional convolutions to extract features from both spatial and temporal dimensions for video representation [31]. Chong et al. proposed an architecture including a component for spatial feature representation and another component for temporal learning [33]. Fan exploited the Gaussian Mixture Variational AutoEncoder (GMVAE) to model the video anomalies as the Gaussian components of the Gaussian Mixture Model (GMM) [30]. A two-stream network framework is employed to combine the appearance and motion anomalies. A dynamic anomaly detection and localization system including a sparse denoising AutoEncoder is proposed in [32].

In this article, we propose a similarity metric CNN based on a channel attention model for a road anomaly detection task. The method mainly includes (1) a Siamese network with a hierarchical attention model by word embedding, so that it can selectively measure similarities between anomalies and the template; (2) a deep transfer learning method that can automatically annotate unlabeled data while fine-tuning the network, improving the accuracy of transfer learning and reducing the additional time consumption compared with the DCGAN method and the AutoEncoder method; (3) a background modeling method combining spatial-temporal information which can detect the anomaly areas accurately.

The rest of the article is organized as follows: Section 2 introduces the traditional CNN and the similarity neural network. Section 3 describes the proposed channel attention model based on a Siamese structure and the transfer learning method about it. Section 4 shows the effectiveness of our model on incomplete datasets and some contrast experiments. We conclude this article in Section 5.

2. Materials

A large number of images are generated from the distributed sensors all over the IoT system. The CNN is one of the effective methods to deal with these massive data. It imitates the construction of a biological visual perception mechanism and can be used for supervised and unsupervised learning. By calculating grid-like topology features, CNN can learn a variety of data distributions steadily without additional feature engineering. In this section, we first introduce the basic structure and principle of CNN, as shown in Section 2.1. Then a CNN based on similarity measurement is described in Section 2.2. It can not only learn the probability distribution of a single picture, but can also measure the similarity of two input pictures. Several tasks like human face verification or object tracking have employed this structure to get the most similar samples given a template image.

2.1. CNN

The traditional CNN structure is shown in Figure 1. In general, the basic structure of CNN consists of two layers: one is the feature extraction layer, the other is the feature mapping layer. The feature extraction layer is mainly composed of a convolutional layer and a pooling layer. The convolutional layer consists of several convolution units, and the parameters of each convolution unit are optimized by the backpropagation algorithm. The convolution kernel has the same number of channels as the feature maps.
When implementing convolution, convolutional kernels slide on the feature map with a fixed stride, and the sum of the inner products at the corresponding positions is mapped to the corresponding position of the new feature map. Each pixel in the feature map is only connected to a local part of the previous feature map, and the same kernel weights are reused at every position; this is called weight sharing. If the stride of convolution increases, the amount of computation in convolution will be remarkably reduced. Once a local feature is extracted, the location relationship between it and other features is also determined. The pooling layer often reduces the size of the feature map output by the convolution layer through down-sampling. It can alleviate the over-fitting problem of the model while reducing the subsequent computation without changing the feature structure of the image. The sliding method of the down-sampling area is the same as that of the convolution kernel sliding in the convolution layer. When the current sampling area slides to a position, the maximum value or the average value of all the pixels in the sampled area is taken instead of the whole region. The feature mapping layer usually refers to the fully connected layer.


Each neuron of the fully connected layer is connected to all the neurons of the previous layer to integrate the features from multiple feature maps extracted by the convolutional layer. Since the fully connected layer is connected to all the output of the previous layer, the number of parameters it contains is much larger than that of the convolution layer. In order to reduce the redundancy, some neurons need to be suppressed with some specific activation functions during training. For example, a dropout layer can inhibit neurons from propagating forward with a certain probability (usually 0.5). In the classification task, the length of the output vector from the last fully connected layer is the number of categories. The output vector is mapped to values between 0 and 1 through a softmax activation function, and its meaning is the predicted probability of each category.


Figure 1. Architecture of a Convolutional Neural Network (CNN). The traditional CNN structure is mainly composed of convolution layers, pooling layers, fully connected layers, and some activation functions. Each convolution kernel is connected to a part of the feature maps. The input is connected to all of the output elements in the fully connected layer.
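As a concrete illustration of the structure described in Section 2.1, the following is a minimal PyTorch sketch of a small CNN classifier (convolution, pooling, dropout, fully connected layer, and softmax). The layer sizes and the five example classes are illustrative assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Toy CNN: feature extraction (conv + pool) followed by feature mapping (fully connected)."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),  # local connections, shared weights
            nn.ReLU(),
            nn.MaxPool2d(2),                                        # down-sampling by taking the max
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),                 # randomly suppress neurons during training
            nn.Linear(32 * 56 * 56, num_classes),
        )

    def forward(self, x):
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=1)    # predicted probability of each category

if __name__ == "__main__":
    probs = SmallCNN()(torch.randn(1, 3, 224, 224))
    print(probs.shape, probs.sum().item())     # (1, 5), probabilities sum to 1
```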

2.2. Similarity Neural Network

The structure of a similarity neural network is a little more complicated than that of the traditional CNN. Two image blocks are input to the convolution layers and pooling layers with the same or different parameters, and then the high-dimensional features of the maps are obtained. Then the decision network with shared parameters figures out the similarity of the two image blocks [19]. The decision network is usually a fully connected layer. The output result is a similarity value, representing the matching degree of the two image blocks. The matching degree of two image blocks needs to be manually annotated before training. If the two image blocks are matched, the ideal output value is y = 1. On the contrary, if the two image blocks are of different categories, the desired decision network output is y = 0. When dealing with classification tasks, the loss function of CNN is the cross-entropy between the real probability distribution of the samples and the predicted probability distribution. It consists of two parts: one is the information entropy of the real distribution of the samples, the other is the Kullback–Leibler divergence between the real distribution of the sample and the predicted distribution:

H(p, q) = −∑_x p(x) log q(x) = −∑_x p(x) log p(x) − ∑_x p(x) log( q(x)/p(x) ) = H(p) + KL(p‖q)   (1)

Equation (1) represents the information difference between the two probability distributions. If the value is smaller, the two probability distributions are closer to each other.

Figure 2a indicates the schematic diagram of the bilinear similarity neural network. The main idea is to make two image blocks go through different branches to a fully connected decision network with shared weights. If the two branches have completely different structures and weight values, the mapping functions of the two branches are irrelevant. This structure is called Pseudo-Siamese. Then the values and the size of the extracted features are also different. The information from the two images is fused together when the two high-level feature maps are transmitted to the first fully connected layer. The number of weights is over two times larger than that of the Siamese structure with shared weights. In addition, there are demands for a greater amount of data and a longer training time.
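As a quick numerical check of Equation (1), the short PyTorch snippet below verifies that the cross-entropy between two example distributions equals the entropy of the true distribution plus the KL divergence; the two small distributions are made-up values used only for the check.

```python
import torch

p = torch.tensor([0.7, 0.2, 0.1])   # "real" distribution (example values)
q = torch.tensor([0.5, 0.3, 0.2])   # predicted distribution (example values)

cross_entropy = -(p * q.log()).sum()                 # H(p, q)
entropy = -(p * p.log()).sum()                       # H(p)
kl = (p * (p / q).log()).sum()                       # KL(p || q)

print(cross_entropy.item(), (entropy + kl).item())   # the two numbers coincide
```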


Figure 2. Similarity network. (a) Bilinear structure. The two image blocks are input into two branches respectively. (b) Double channel structure. The two fused blocks are input to the network with shared weights and shared feature maps to generate two features respectively.
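The double channel structure of Figure 2b, discussed in the next paragraph, can be sketched in PyTorch as below: the two RGB patches are stacked into a single 6-channel input, so the weights and feature maps are shared from the first convolution onwards. The layer widths are illustrative assumptions rather than the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class TwoChannelSimilarityNet(nn.Module):
    """Figure 2b style: the two patches are superimposed along the channel axis."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.decision = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 1),   # decision network: one matching score
            nn.Sigmoid(),
        )

    def forward(self, patch1, patch2):
        fused = torch.cat([patch1, patch2], dim=1)   # 3 + 3 = 6 input channels
        return self.decision(self.backbone(fused))   # value in [0, 1], 1 = matched

score = TwoChannelSimilarityNet()(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
print(score.shape)   # torch.Size([1, 1])
```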

Figure 2b shows the illustration of the dual channel similarity neural network. The two image blocks are superimposed as the input. They share both the weights and the feature maps during forward propagation. The scale of the model is remarkably reduced. In the process of forward propagation, the feature map of the first layer is generated by the convolution of the superimposed images, which contains common information of both images. After that, each feature is generated by the convolution of the previous layer, which integrates the information of both in the same way.

Several typical structures of CNN to measure the similarity between two image patches are indicated in [19]. A similar deep metric learning framework is employed for face verification in [20]. The works in [17,18] both use the Siamese structure for object tracking by measuring the similarity between the instance and a searching area. A cascade Siamese architecture which can learn spatial and temporal information separately is presented for person re-identification in [23].

3. Methodology

The traditional CNN has been introduced in the previous section. Adopting the bilinear structure shown in Figure 2a as the baseline, we provide our similarity metric CNN based on a channel attention model for the road anomaly detection task in this section. At the beginning of this section, we start with the channel attention model for the Siamese architecture. The two main components of the metric model are the residual blocks and the hierarchical channel attention model. Then, aiming at our incomplete data-sets, we propose a deep transfer learning method. First, the model is trained on the data-set of limited samples, and then the remaining similarity matching network is trained on the new samples. At the end of this section, an anomaly detection method based on background modeling is introduced. By modeling the pixel structure of video frames, the objects going from moving to stationary on the road can be detected as candidates. Then the proposed model can classify these candidates selectively.


3.1. Channel Attention Model for Siamese Architecture

In recent years, ResNet [12] has been used in many fields such as detection, segmentation, recognition, and so on. It was first introduced in 2015 and won first place in the ImageNet classification competition. This kind of structure is designed for the problem that the accuracy of the network decreases on the validation set with the deepening of the network. As with the block module shown in Figure 3, two feature mapping functions are proposed in ResNet: one is the identity mapping, the other is the residual mapping. The block modules are cascaded to form the overall structure of ResNet. Because only very few samples of anomaly objects on the road have been provided for training the model, there will be a serious over-fitting problem if a single ResNet structure is chosen to repeatedly train the net on this incomplete data-set. Under this circumstance, the model trained in this way has insufficient generalization ability to accurately predict the new test samples on the road.


Figure 3. Overall framework of the proposed method. The simplest version of ResNet is selected as the main body of the two branches. The block indicates a deep residual structure in the dotted line. A soft attention model is appended after each block for feature location (att1, att2, att3, att4). The word vectors for each block have different lengths (word64, word128, word256, word512) according to the hierarchical feature channels.

As shown in Figure 3, we choose ResNet18 as the main body of the two branches, where the four blocks are interconnected with each other. The weights of the two branches are shared in both the training and testing stages. The features output from the last block are down-sampled by an average pooling layer. Then the pooled features are concatenated together for high dimensional feature fusion. Afterwards, the fused feature is input into the fully connected layer to get a vector for discriminating whether the two image patches belong to the same category. The two inputs indicate the image to be detected and the image already existing in the incomplete database, respectively. The candidate image is extracted from a specific area in the successive video frames when an anomaly object is detected in the subsequent frames.

An attention model is designed after each block, including a channel attention layer and a soft attention layer. The visual attention model is the brain signal processing mechanism peculiar to human vision. Human vision scans the global image quickly to get the focus area, which is generally called the focus of attention. Then, more attention resources are invested in this area to obtain more details of the target and suppress other useless information. We hope to input a word vector to the model through the attention mechanism, which can be used to select the categories of the anomaly object, and indirectly assist in identifying the similarity between the two images. The attention map generated from the attention model can mask the irrelevant areas on the feature map and highlight the salient region to calculate the similarity between the two branches. The word vectors of different levels have different lengths while the attentional maps of different hierarchies have different resolutions.
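A minimal sketch of this two-branch arrangement is given below, assuming the torchvision ResNet18 as the shared backbone and omitting the attention modules and word embedding, which are detailed in the following paragraphs. The code illustrates the weight-sharing and feature-concatenation idea and is not the authors' released implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SiameseMetricBackbone(nn.Module):
    """Two branches with shared ResNet18 weights; pooled features are concatenated and classified."""
    def __init__(self):
        super().__init__()
        backbone = resnet18()
        # keep conv1 .. layer4 + avgpool, drop the original 1000-way fc
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(512 * 2, 2)   # similar / dissimilar

    def forward(self, candidate, template):
        f1 = self.encoder(candidate).flatten(1)   # the same weights are used for both inputs
        f2 = self.encoder(template).flatten(1)
        return self.fc(torch.cat([f1, f2], dim=1))

logits = SiameseMetricBackbone()(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(logits.shape)   # torch.Size([2, 2])
```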
The implementation details of the hierarchical attention mechanism are shown in Figure 4. R1 and R2 are the raw features extracted by the blocks, and R^l_(i,j,k) indicates the feature value at the position (i, j) of the k-th channel map extracted by the l-th block. Att^l_(i,j) denotes the attention value at the position (i, j) on the soft attentional map generated after the l-th block. Ratt1 and Ratt2 represent the attentional features masked by the attention map of both branches, while Ratt^l_(i,j,k) is the value at the position (i, j) of the k-th feature map after the l-th block. R and Ratt are both three-dimensional and Att is two-dimensional. The word vector W^l is a one-dimensional vector generated from a dictionary. Dictionaries of different hierarchies are randomly generated, and each dictionary has several row vectors, the number of which is the same as the number of categories defined for the anomaly objects. The dictionary is generic; it will be used together in both the training and testing stages. Once the dictionaries are initialized, the values in them are determined and will not change again. The proposed model has four dictionaries for word embedding. In the training stage, when the inputs of the two branches are of the same category, the word vectors are consistent with the image categories. When the inputs of the two branches are of different categories, the word vectors are consistent with the image category of one branch. In the testing stage, one input of the network is a template image block and the other is an image block to be detected. The word vectors are consistent with the category of the template image block. The lengths of vectors in the dictionary are the same as the number of channels in the corresponding level feature map for subsequent channel-attention calculations. The weight W^l_k is the k-th value of the word vector corresponding to the k-th feature map in the raw features R after each block. Then, the attentional map can be calculated as:

Att^l = σ( CONV_{3×3}( W^l ⊕ R1^l ⊕ R2^l ) )   (2)

Figure 4. Architecture of the multiresolution attention model. A word vector is embedded in the fused features of two branches for the generation of the shared attention map. Then, soft attention is implemented for both branches with the shared attention map. The same process is performed after each block.


The symbol ⊕ denotes the element-wise summation. The summation of R1^l and R2^l represents the feature fusion of the two branches. The addition of W^l and R^l can be implemented as the summation of W^l_k and each element of R^l_(i,j,k), i ∈ [0, w − 1], j ∈ [0, h − 1], indicating that the word feature is embedded in the fused features. A 3 × 3 convolution is employed after these operations, meaning that each element Att^l_(i,j) on the map is generated from the 3 × 3 × C neighbourhood pixels of the feature at the same location (i, j) on R^l. The values on the attention map reflect the influence power of the fusion feature to the contribution of prediction in the adjacent area. σ is the sigmoid activation function; it maps all the values to [0, 1]. After that, the attentional features can be calculated as:

Ratt1^l = R1^l ⊗ Att^l ,   Ratt2^l = R2^l ⊗ Att^l   (3)

The symbol ⊗ denotes the element-wise multiplication, which can be implemented as the multiplication of Att^l_(i,j) and each element of R^l_(i,j,k), k ∈ [0, C − 1]. It indicates that the raw features are filtered by the attention map. Information that is useful for classification and discrimination is preserved, while useless information is suppressed.


It is worth mentioning that the attention map is shared by both branches, and the areas that the model pays attention to are the same at each hierarchy. The parameters, kernel size, kernel stride, and the feature dimensions of our deep similarity metric model are illustrated in Table 1.

Table 1. The parameters, kernel size, kernel stride, and the feature dimensions of the proposed model.

Layer   | Kernel Size                                      | Stride     | Feature Size
Input   | -                                                | -          | 224 × 224 × 3
Conv1   | 3 × 7 × 7 × 64                                   | 2          | 64 × 112 × 112
Pool1   | 3 × 3                                            | 2          | 64 × 56 × 56
Block1  | conv: 64 × 3 × 3 × 64 (×4)                       | 1, 1, 1, 1 | 64 × 56 × 56
Att1    | 64 × 3 × 3 × 1                                   | 1          | 56 × 56 × 1
Block2  | conv: 64 × 3 × 3 × 128, 128 × 3 × 3 × 128 (×3)   | 2, 1, 1, 1 | 128 × 28 × 28
Att2    | 128 × 3 × 3 × 1                                  | 1          | 28 × 28 × 1
Block3  | conv: 128 × 3 × 3 × 256, 256 × 3 × 3 × 256 (×3)  | 2, 1, 1, 1 | 256 × 14 × 14
Att3    | 256 × 3 × 3 × 1                                  | 1          | 14 × 14 × 1
Block4  | conv: 256 × 3 × 3 × 512, 512 × 3 × 3 × 512 (×3)  | 2, 1, 1, 1 | 512 × 7 × 7
Att4    | 512 × 3 × 3 × 1                                  | 1          | 7 × 7 × 1
AvgPool | 7 × 7                                            | 1          | 512 × 1 × 1
FC      | 1024 × 2                                         | -          | 2
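The word-embedded channel attention of Equations (2) and (3) can be sketched as a small PyTorch module, shown below. It assumes one hierarchy with C channels: the word vector is broadcast over the spatial dimensions, added to the fused raw features of the two branches, and passed through a 3 × 3 convolution and a sigmoid to produce the shared attention map, which then masks both branches. It is one reading of the equations, not the authors' code.

```python
import torch
import torch.nn as nn

class ChannelWordAttention(nn.Module):
    """One attention hierarchy: Att = sigmoid(conv3x3(W + R1 + R2)); Ratt_i = R_i * Att."""
    def __init__(self, channels, num_classes):
        super().__init__()
        # one random, fixed word vector (row) per anomaly category
        self.register_buffer("dictionary", torch.randn(num_classes, channels))
        self.conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, r1, r2, category):
        word = self.dictionary[category].view(1, -1, 1, 1)   # broadcast over positions (i, j)
        att = torch.sigmoid(self.conv(word + r1 + r2))       # shared attention map, Eq. (2)
        return r1 * att, r2 * att                            # masked features, Eq. (3)

att1 = ChannelWordAttention(channels=64, num_classes=4)
r1, r2 = torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56)
ratt1, ratt2 = att1(r1, r2, category=2)
print(ratt1.shape)   # torch.Size([1, 64, 56, 56])
```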

3.2. Deep Transfer Learning Method

The goal of machine learning is to build a model as general as possible, so that the model can satisfy different users, different devices, different environments and different requirements. However, the training and updating of machine learning models rely on data annotation. For our anomaly detection task, the most difficult part is to continuously capture the abnormal events in the monitoring video and label them manually in turn. Through the transfer learning method, we can find the similarity between annotated data and unannotated data, and then transfer the new model through knowledge reasoning. Fine-tuning in a deep neural network can help us save training time and improve learning accuracy, but it cannot deal with the different distributions of training data and test data. Deep neural network adaptation can make the data distributions of the source domain and target domain closer, so that the classification accuracy of the network is better. Therefore, we propose a deep transfer learning method with two stages for our similarity metric network.

Our model transfer learning method is mainly divided into two training stages. The first stage is shown in Figure 5a. Al and Au in the figure indicate the labeled anomalies and the unlabeled anomalies later acquired. According to [12], an adaptive layer set up before the classification layer achieves a good transfer learning effect. The prediction layer in the figure is initialized with the fully connected layer from the original ResNet. We implement an adaptive layer after the average pooling layer in order to prevent the model from overfitting to the distribution of the labeled anomalies.

According to the maximum mean discrepancy (MMD) criterion, we want to achieve the same distribution as far as possible after the features of the two branches are mapped through the adaptive layer. L_mmd is used to measure the approximate degree of the distribution of annotated and unannotated anomalies in a mini-batch. L_cls is the softmax with cross-entropy of the classification loss.

L_stage1 = L_cls + λ1·L_mmd + λ2·‖W‖₂²   (4)

L_mmd = ‖ (1/B) ∑_{i=1}^{B} ( f_ca(Al^i) − f_ca(Au^i) ) ‖²   (5)

L_cls = −(1/B) ∑_{i=1}^{B} ∑_{c=0}^{C−1} δ(Yi = c) · log p_cls^c   (6)

λ1 and λ2 in Formula (4) are two balance parameters. ‖W‖₂² is the regularization term to prevent overfitting. B is the size of the mini-batch during training. C is the number of categories defined for anomalies. f_ca(·) represents the output of the adaption layer. p_cls^c is the predicted probability of the c-th category. δ is a 0–1 indicator function. Yi is the label of the i-th anomaly.
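A minimal sketch of the first-stage objective of Equations (4)–(6) is given below, assuming paired mini-batches of labeled and unlabeled features that have already passed through the adaptive layer. The λ values are illustrative, and the linear MMD form follows Equation (5) rather than a kernelized variant.

```python
import torch
import torch.nn.functional as F

def stage1_loss(feat_labeled, feat_unlabeled, logits, labels,
                lambda1=0.5, lambda2=1e-4, weights=None):
    """L_stage1 = L_cls + lambda1 * L_mmd + lambda2 * ||W||^2 (Eqs. 4-6)."""
    # Eq. (6): softmax cross-entropy on the labeled anomalies
    l_cls = F.cross_entropy(logits, labels)
    # Eq. (5): squared norm of the mean feature discrepancy over the mini-batch
    l_mmd = (feat_labeled - feat_unlabeled).mean(dim=0).pow(2).sum()
    # Eq. (4): optional L2 regularization over the adapted weights
    l_reg = sum(w.pow(2).sum() for w in weights) if weights else torch.tensor(0.0)
    return l_cls + lambda1 * l_mmd + lambda2 * l_reg

# toy example: batch of 8, adaptive features of length 2048, 4 anomaly categories
fl, fu = torch.randn(8, 2048), torch.randn(8, 2048)
logits, labels = torch.randn(8, 4), torch.randint(0, 4, (8,))
print(stage1_loss(fl, fu, logits, labels).item())
```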


Figure 5. Transfer learning in two stages. (a) First stage transfer: fine-tuning with an adaptive fully connected layer. (b) Second stage transfer: knowledge transfer with soft labels for joint learning. In the first stage, the network is fine-tuned by the labeled image while predicting the soft labels for the unlabeled image. In the second stage, the network is trained by both labeled image and unlabeled image. A fully connected adaptive layer is adopted for feature transfer.

In the first training stage, we use the classification layer to predict the unannotated anomalies simultaneously, in order to presumably infer the categories of anomalies. A soft label is defined as the category of the maximum probability predicted by the model in the whole process of transfer learning. The mathematical form is shown as follows:

SL = argmax_{c∈[0, C−1]} [ (1/T) ∑_{i=1}^{T} p_soft^c ]   (7)

where T is the number of testing times of all the unannotated anomalies. p_soft is the predicted probability vector of the unannotated data.
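A small sketch of how the soft labels of Equation (7) can be accumulated is shown below: the softmax outputs for each unlabeled sample are averaged over the T passes made during the first-stage training, and the argmax of the average is taken as the soft label. The tensor shapes are illustrative assumptions.

```python
import torch

# predicted probability vectors collected over T passes of the unlabeled set
# shape: (T, num_unlabeled, num_classes)
T, num_unlabeled, num_classes = 100, 72, 4
probs_history = torch.softmax(torch.randn(T, num_unlabeled, num_classes), dim=-1)

mean_probs = probs_history.mean(dim=0)        # (1/T) * sum over the T predictions
soft_labels = mean_probs.argmax(dim=-1)       # Eq. (7): category of the maximum mean probability
print(soft_labels.shape)                      # torch.Size([72])
```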


In fact, the unlabeled data is predicted by the model every time the transfer training is performed. The output of the softmax function is a probability vector, the values of which reflect the possibility that the sample belongs to each category. Finally, the categories corresponding to the maximum values are considered as the soft labels according to the mean predictions throughout the training process.

In the second training stage, both annotated data and unannotated data are used to train the similarity metric network. Except for the FC layer, the parameters of each layer are initialized with those trained in the first stage. When transferring the model from the first stage to the second stage, only the FC layer needs to be completely trained while the other layers just need to be fine-tuned. The outputs of the FC layer are two values, indicating the similarity and dissimilarity probabilities of the two inputs respectively. The softmax with cross-entropy loss is used to measure the similarity:

L_simi = −(1/B) ∑_{i=1}^{B} ∑_{s=0}^{1} δ(Si = s) · log p_sim^s( A1_l^i, A2_l^i ) − (λ/B) ∑_{i=1}^{B} ∑_{s=0}^{1} δ(Si = s) · log p_sim^s( A1_l^i, A2_u^i )   (8)

where A1 and A2 are the two inputs of the two branches. The loss function contains two parts. If one of the inputs is an anomaly with a soft label, a coefficient λ is set to weaken the influence of this part. Si is the target similarity. If the two inputs and the embedded word vector belong to the same category, we set Si = 1. In order to make the network converge quickly on the incomplete data-sets, we control the class of the input samples of one branch to be the same as that of the word vectors. In the same way, the categories of the samples for comparison and the word vector are unified. In this way, the similarity measurement of anomalies based on the complementary text is realized.

3.3. Anomaly Detection Based on Background Modeling

For anomaly detection, locating the object is one of the most critical parts. In this paper, the image difference between background masks based on background modeling is adopted to locate the abnormal objects. In surveillance videos, the traditional backgrounds usually include the road surface, roadside cornerstones, and green belts. All kinds of image blocks have the characteristics of pixel similarity and continuity. According to the situation above, the combination of spatial-temporal information for background modeling is adopted. The conventional background modeling methods such as the Gaussian Mixture Model (GMM) depend on the statistics of a single pixel, without considering the spatial distribution of pixels in the background. In view of this problem, we propose the spatial-temporal information modeling algorithm based on improved K-Means pixel clustering to process the pixels in videos. The image is converted to greyscale and processed by average down-sampling. Then n random pixels are considered as the initial clustering centers. K-Means clustering on all pixels is performed through the Euclidean distance. The efficiency of the category centers is determined by the Calinski–Harabasz criterion:

CH(k) = [ trB(k)/(k − 1) ] / [ trW(k)/(n − 1) ]   (9)

where n is the number of clusters, k is the current class, trB(k) is the trace of the inter-class covariance matrix, and trW(k) is the trace of the intra-class covariance matrix. A larger CH indicates that the inter-class distance is larger while the intra-class distance is smaller, leading to a better clustering result. The pixels closest to the center of each category are recorded. Then the center points of the n transferred categories are found in the original image, and their RGB pixel values are recorded as the category centers in the subsequent clustering process. Afterwards, the n transferred category pixel values are used to classify each block in the original image, and the location distribution of each category on the modeled image is obtained.

In the proposed method, the temporal background information is modeled by the K-nearest neighbor method. The nearest K instances in the training data-set are found to classify a new input instance.
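The spatial clustering step can be illustrated with scikit-learn, as in the hedged sketch below: pixels of a down-sampled grey frame are clustered by K-Means for several candidate values of n, and the Calinski–Harabasz score is used to pick the number of category centers. The frame is random data and the candidate range is an assumption, since the paper does not state how n is swept.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# a down-sampled greyscale frame, flattened into one intensity value per pixel
frame = np.random.rand(60, 80)
pixels = frame.reshape(-1, 1)

best_n, best_score = None, -np.inf
for n in range(2, 7):                       # candidate numbers of category centers (assumed range)
    labels = KMeans(n_clusters=n, n_init=10, random_state=0).fit_predict(pixels)
    score = calinski_harabasz_score(pixels, labels)   # Eq. (9) style criterion
    if score > best_score:
        best_n, best_score = n, score

print(best_n, round(best_score, 1))
```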


An adaptive background modeling method is performed to get a real-time model and a time-delayed model. Specifically, the KNN background modeling is to find a hypersphere with volume V and diameter D centered on a sample point in the RGB space, which satisfies the requirement that there are more than K (generally 0.1 ∗ T) of the T samples inside the sphere. Formula (10) is used to calculate whether the sample points x⃗ are background points. That is, each element in the sample set XT should be compared to determine whether it belongs to the background. The comparison between two pixel points is to calculate the Euclidean distance between them, and the initialization probability is as follows:

P_init( x⃗ | X_T, BG + Init ) = (1/(TV)) ∑_{m=t−T}^{t} K( ‖ x⃗^(m) − x⃗ ‖ / D ) = k/(TV)   (10)

where T is the number of historical frames, t is the index of the current video frame, and K is a Gaussian kernel function. The point is considered as the background when its probability is greater than the probability of the background. Each pixel is classified into different categories according to its distance to each class center acquired from the spatial information. Whether each category is in the background depends on the initialization of the spatial information and the spatial distribution. Then, we can discriminate whether the pixel is a background point by the class and location of the pixels.

A comparison is made between the binary instant mask and the binary delayed mask in order to get the target bounding boxes in the foreground whose overlap ratio is lower than a preset value (usually set to 0.1). The target bounding boxes appearing in the instant mask but not appearing in the time-delayed mask can be considered as the potential anomalous targets. The overlap ratio formula is as follows:

IOU = ( region1 ∩ region2 ) / ( region1 ∪ region2 )   (11)

where IOU is the overlap ratio of two bounding boxes in the image, defined as the ratio of the intersection area to the merged area. The bounding boxes of the target are continuously recorded in N successive frames. When the number of occurrences of a detected object exceeds ⌊0.9 ∗ N⌋ in the N successive frames, the target is identified as an anomaly. Here, N needs to be set autonomously according to the given video and requirements, and ⌊·⌋ is the rounding down symbol. At the same time, the anomaly in the image is extracted from the original frame to the local hard disk for subsequent tasks such as classification and recognition.

4. Experiments

In the previous section, we described the channel attention model based on the Siamese architecture for traffic anomaly classification, our deep transfer learning method, and the background modeling method for anomaly detection. In this section we first introduce the incomplete data for our experiments. Then several experiments are implemented to validate the effectiveness of the proposed model and transfer learning method. After testing under various conditions, we finally find that the proposed method is superior to the DCGAN method and the AutoEncoder method in practical application. Finally, we show the anomalies detected and classified in real traffic data-sets.

The experiments are all performed with the Pytorch (Pytorch 0.4.1: Facebook, Menlo Park, CA, USA) deep learning framework on the Windows 10 Pro (Microsoft, Redmond, WA, USA) operating system. The experimental platform is equipped with an Intel® Core(TM) i7-6700 CPU (Intel, Santa Clara, CA, USA) @ 3.40GHz, 32GB RAM and an NVIDIA GeForce GTX 1060 GPU (NVIDIA, Santa Clara, CA, USA). Part of the training samples are shown in Figure 6. In the process of transfer learning, 39 images were selected as the labeled anomalies for training (11 cars, 10 trucks, 10 cones, and 8 tyres).


In addition, 72 images were selected as unlabeled data, and 746 images were considered as the testing data (199 cars, 225 trucks, 210 cones, and 112 tyres). The target is to transfer from labeled data to unlabeled data. The generalization ability depends on the performance of the model on the testing data-sets.


Figure 6. Several categories of training samples: (a) cars, (b) trucks, (c) cones, (d) tyres. 11 cars, 10 trucks, 10 cones, and 8 tyres are selected as the incomplete training samples. The testing data contain 199 cars, 225 trucks, 210 cones, and 112 tyres. 72 images were selected as unlabeled data.

The first stage fine-tuning with different layers is shown in Figure 7. According to the overall framework of the proposed model, the layers are divided into six parts: conv1, block1, block2, block3, block4, and the fc layer. The contents displayed in the legend of each line represent the layers fine-tuned, while the other layers not listed in the legend have their parameters fixed during the first stage fine-tuning. The adaptive moment estimation (Adam) method is employed to optimize the objective problem. The batch size was set to 16 and the learning rate of the fine-tuned parameters is set to 0.0001. The model is trained for 100 epochs. It is obvious from the figure that fine-tuning only the fully connected layer leads to poor performance on the testing set. No significant improvement is achieved when fine-tuning all layers. The two situations above both converge at a very slow rate. Fine-tuning only block4 and fc achieves the fastest convergence rate and the highest accuracy. Therefore, the parameters of conv1, block1, block2, and block3 are fixed in the subsequent experimental stages.

subsequent experimental stages.
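As an illustration of this first-stage setup, a minimal PyTorch-style sketch is given below. It assumes a torchvision ResNet-50 backbone (whose layer1–layer3 correspond to block1–block3) and a new four-class head replacing the original fc layer; the exact backbone and head used in the paper may differ.

```python
# Minimal sketch of the first-stage fine-tuning described above (assumption:
# torchvision ResNet-50, whose conv1/layer1-layer3 play the role of
# conv1/block1-block3; only layer4 ("block4") and the new fc head are trained).
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=None)           # pretrained weights would normally be loaded here
model.fc = nn.Linear(model.fc.in_features, 4)   # cars / trucks / cones / tyres (assumed 4-class head)

# Freeze everything except block4 and the fully connected head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("layer4", "fc"))

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)  # lr = 0.0001 as in the paper
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One fine-tuning step on a mini-batch (batch size 16 in the paper)."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```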


Figure 7. The first stage fine-tuning with different layers. When fine-tuning our network in the first stage, six situations were tested. The layers in the legend indicate that these layers were trained while the parameters of the other layers were fixed during fine-tuning. Fine-tuning only block4 and fc achieves the best classification results.

The effects of different adaptive layer sizes on the performance of the model are tested in Figure 8. Six different sizes of adaptive layers (128, 256, 512, 1024, 2048, 4096) are attempted in the two stages of transferring, respectively. The mini-batch gradient descent (MBGD) method is employed to optimize the objective problem. The batch size is set to 16 and the learning rate of the fine-tuned parameters is set to 0.0002. The model is trained for 100 epochs. In the first stage, the models with adaptive layers of lengths 2048 and 4096 have the fastest convergence speed. The adaptive layer with a length of 2048 achieves the highest accuracy (just over 0.9), and the curve of the adaptive layer 4096 is next only to that of 2048. In the second stage, the model with an adaptive layer of length 4096 has the fastest pace of convergence, but the adaptive layer of length 2048 achieves the highest accuracy (just over 0.95). The accuracy curves of adaptive layers 128 and 256 in the second stage decrease compared to their corresponding curves in the first stage. This is because the soft labels predicted in the first stage are not accurate, which directly affects the performance of transfer learning in the second stage.
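The adaptive layer itself is defined earlier in the paper; purely as a hedged sketch, the code below inserts a fully connected "adaptive" layer of configurable width on top of the fixed ResNet features and adds a linear-kernel maximum mean discrepancy (MMD) term, in the spirit of the domain-confusion idea of [13], to align labeled and unlabeled features. The MMD form and the `mmd_weight` hyperparameter are assumptions, not necessarily the paper's exact adaptation loss.

```python
# Hedged sketch of an "adaptive" fully connected layer of configurable width
# (128 ... 4096) placed on top of the 2048-d ResNet features. The linear-kernel
# MMD term follows the domain-confusion idea of [13] and is an assumption;
# mmd_weight is a hypothetical hyperparameter.
import torch
import torch.nn as nn

class AdaptiveHead(nn.Module):
    def __init__(self, in_dim=2048, adapt_dim=2048, num_classes=4):
        super().__init__()
        self.adapt = nn.Sequential(nn.Linear(in_dim, adapt_dim), nn.ReLU())
        self.classifier = nn.Linear(adapt_dim, num_classes)

    def forward(self, feats):
        z = self.adapt(feats)            # adaptive-layer activation
        return self.classifier(z), z

def linear_mmd(source_z, target_z):
    """Squared distance between the mean source and target activations."""
    return (source_z.mean(dim=0) - target_z.mean(dim=0)).pow(2).sum()

head = AdaptiveHead(adapt_dim=2048)
optimizer = torch.optim.SGD(head.parameters(), lr=2e-4)   # mini-batch gradient descent, lr = 0.0002
criterion = nn.CrossEntropyLoss()

def transfer_step(src_feats, src_labels, tgt_feats, mmd_weight=0.25):
    optimizer.zero_grad()
    logits, src_z = head(src_feats)
    _, tgt_z = head(tgt_feats)
    loss = criterion(logits, src_labels) + mmd_weight * linear_mmd(src_z, tgt_z)
    loss.backward()
    optimizer.step()
    return loss.item()
```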


Figure 8. Transfer learning with different adaptive layers: (a) Fine-tuning with different adaptive layers in the first stage; (b) Transfer learning with different adaptive layers in the second stage. Different lengths of adaptive fully connected layers were tested in both stages. Transfer learning with adaptive layers of length 2048 achieves the best results.

The comparison of the testing accuracy of the proposed method, DCGAN, and AutoEncoder is shown in Figure 9. The accuracy curve of the proposed method is the curve of the adaptive layer 2048 in the last experiment in Figure 8. Because the label generation process for the incomplete data is simultaneous with the first stage of fine-tuning, no extra time consumption is needed.

Figure 9. Comparison of the testing accuracy of the proposed method, DCGAN, and AutoEncoder in the second transferring stage. The batch size is set to 16 while the learning rate of the fine-tuned parameters is set to 0.0002. The three methods are all tested for 100 epochs.

The DCGAN is implemented by the methods in [24]. The labeled anomalies described in Figure 6 are used to train the discriminator and the generator. After 800 iterations, more than a thousand images are generated for these categories. Because most generated images are too similar to the original images, only a few images (2 cars, 8 cones, 3 trucks, and 8 tyres) are selected as complementary samples that differ from the original training set for subsequent training. The mini-batch gradient descent (MBGD) method is employed to optimize the objective problem. The batch size is set to 4 and the learning rate of the fine-tuned parameters is set to 0.0002. The momentum is set to 0.5. It takes 4152 s for the DCGAN method to generate extra training samples for the second transferring stage.

The AutoEncoder is implemented by the methods in [28]. The labeled anomalies introduced in Figure 6 are used to train the AutoEncoder. We consider ResNet as the feature extraction method. The extracted features of the training samples are considered as the input data matrix of the encoder, and the parameters of the projection are trained by the reconstruction error. After training, the features of the training set and the features from the unlabeled set are all input into the encoder to get the semantic features. We use the semantic features of each category to cluster and generate the center of each category in the semantic space. The unlabeled features from ResNet are then projected to the semantic space to get the annotation by $\arg\min_{i \in \{\mathrm{car},\,\mathrm{truck},\,\mathrm{cone},\,\mathrm{tyre}\}} \| s_{\mathrm{unlabel}} - c_i \|_2$, where $s_{\mathrm{unlabel}}$ is the semantic feature of the unlabeled data and $c_i$ is the clustering center. The size of the raw features output from ResNet is 2048 and the size of the semantic space is 200. It takes 134 s for the AutoEncoder method to extract deep features, train the encoder and decoder, and perform the clustering.

In the second transferring stage, the batch size is set to 16 while the learning rate of the fine-tuned parameters is set to 0.0002 for all three methods. The result of the proposed method is shown by the blue curve, and the highest accuracy rate reaches 0.953. The curve of the method using DCGAN is shown in orange; the maximum accuracy achieved was 0.929. The method performed with AutoEncoder is shown by the green curve; its testing accuracy runs up to 0.898. Compared on the final stable accuracy, the proposed method is 3 percentage points higher than DCGAN and 5 percentage points higher than the AE method. The comparison of all the performances of the three methods is shown in Table 2.
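For clarity, the nearest-center annotation used by the AutoEncoder baseline can be sketched as follows; the function name, the projection matrix `W`, and the class centers are illustrative placeholders, assuming the encoder projection and the clustering described above have already been obtained.

```python
# Minimal sketch of the nearest-center annotation used by the AutoEncoder
# baseline [28]: 2048-d ResNet features of unlabeled samples are projected into
# the 200-d semantic space and labeled with the closest class center.
import numpy as np

classes = ["car", "truck", "cone", "tyre"]

def annotate(unlabeled_feats, W, centers):
    """
    unlabeled_feats: (N, 2048) raw ResNet features
    W:               (2048, 200) encoder projection into the semantic space
    centers:         (4, 200) per-class cluster centers
    Returns argmin_i ||s_unlabel - c_i||_2 for every sample.
    """
    s = unlabeled_feats @ W                                                # (N, 200) semantic features
    dists = np.linalg.norm(s[:, None, :] - centers[None, :, :], axis=-1)   # (N, 4) distances to centers
    return dists.argmin(axis=1)

# Shape-only example with random placeholders:
rng = np.random.default_rng(0)
idx = annotate(rng.normal(size=(5, 2048)), rng.normal(size=(2048, 200)), rng.normal(size=(4, 200)))
print([classes[i] for i in idx])
```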


Table 2. Performance comparison of the three methods.

Methods                                               Proposed   DCGAN [24]   AutoEncoder [28]
Accuracy of transfer learning                         0.953      0.929        0.898
Extra processing time for incomplete data (seconds)   0          4152         134

Figure 10 shows the application of this model in road traffic scenarios. Our background modeling method can extract abnormal objects that change from moving to stationary in the road. There are usually two classes of anomalies: one class is falling objects, the other class is illegally stopped vehicles. Because the original frame is too large to display, the first column only shows part of the original frame. The image blocks in the middle column show the anomalies extracted by the background modeling method. The values in the last column indicate the similarities between each image block and each class. As can be seen from the figure, the anomaly areas can be extracted by the proposed background modeling method, and most extracted candidates can be classified correctly through the proposed deep similarity metric neural network. The appearance of the white minivan in sequence 2 lies between a car and a truck, so it is difficult to discriminate whether it is a car or a truck. The training set has only four classes, so a person appearing in the scene is classified as the closest of these types.
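The paper's background modeling method, which combines spatial and temporal information, is defined earlier in the paper; as a hedged illustration of the moving-to-stationary extraction described above, the sketch below uses two OpenCV background subtractors updated at different rates. The video path, learning rates, and area threshold are placeholders, not values from the paper.

```python
# Hedged illustration of extracting objects that change from moving to
# stationary with two background models updated at different rates. The paper's
# own spatio-temporal background modeling may differ; "traffic.mp4", the
# learning rates, and the area threshold are placeholders.
# Requires opencv-python (OpenCV >= 4).
import cv2

long_bg = cv2.createBackgroundSubtractorMOG2(history=2000, detectShadows=False)
short_bg = cv2.createBackgroundSubtractorMOG2(history=100, detectShadows=False)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))

cap = cv2.VideoCapture("traffic.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    long_fg = long_bg.apply(frame, learningRate=0.0005)   # slow model: still flags stopped objects
    short_fg = short_bg.apply(frame, learningRate=0.02)   # fast model: quickly absorbs stopped objects
    # Stationary anomaly candidates: foreground for the slow model, background for the fast one.
    stationary = cv2.bitwise_and(long_fg, cv2.bitwise_not(short_fg))
    stationary = cv2.morphologyEx(stationary, cv2.MORPH_OPEN, kernel)
    contours, _ = cv2.findContours(stationary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    candidates = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 400]
    # `candidates` are the image blocks handed to the similarity metric network.
cap.release()
```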

[Figure 10 panels; each row shows part of the original frame, the anomaly block extracted by background modelling, and the similarities to each class:]
(a) Sequence 1, falling objects in road: Cars 0.0248, Trucks 0.0206, Cones 0.7533, Tyres 0.2013
(b) Sequence 2, illegally stopped vehicles: Cars 0.0896, Trucks 0.6610, Cones 0.1564, Tyres 0.1564; Cars 0.3980, Trucks 0.2306, Cones 0.1791, Tyres 0.1923
(c) Sequence 3, illegally stopped vehicles: Cars 0.2027, Trucks 0.7713, Cones 0.0124, Tyres 0.0135
(d) Sequence 4, illegally stopped vehicles: Cars 0.1942, Trucks 0.0137, Cones 0.6791, Tyres 0.1131; Cars 0.7481, Trucks 0.2478, Cones 0.0024, Tyres 0.0017

Figure 10. Road anomalies detected by background modeling: (a) Sequence 1: falling objects in road; (b) Sequence 2: illegally stopped vehicles; (c) Sequence 3: illegally stopped vehicles; (d) Sequence 4: illegally stopped vehicles. The first column shows the Regions of Interest (ROIs) from the original frame. The image blocks in the middle column show the anomalies extracted by the background modeling method. The values in the last column indicate the similarities between each image block and each class.
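As a toy illustration of the "similarities to each class" column, the snippet below compares a candidate embedding with one template embedding per class and normalizes the scores so they sum to one. Cosine similarity with a softmax is only an assumed stand-in for the Siamese network's learned similarity head, and the temperature value is hypothetical.

```python
# Toy sketch of the per-class similarity scores shown in Figure 10: a candidate
# embedding is compared with one template embedding per class; cosine
# similarity + softmax is an assumed stand-in for the learned similarity head.
import torch
import torch.nn.functional as F

classes = ["Cars", "Trucks", "Cones", "Tyres"]

def class_similarities(candidate, templates, temperature=0.1):
    """candidate: (D,) embedding; templates: (4, D), one template embedding per class."""
    sims = F.cosine_similarity(candidate.expand_as(templates), templates, dim=1)  # (4,)
    return F.softmax(sims / temperature, dim=0)

scores = class_similarities(torch.randn(2048), torch.randn(4, 2048))
print({c: round(float(s), 4) for c, s in zip(classes, scores)})
```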


5. Conclusions

In this paper, we propose a similarity metric CNN based on a channel attention model for a traffic anomaly detection task. By extending ResNet to two branches, a Siamese structure can measure the similarity between two anomalies. By combining an attention model that generates a soft attentional mask from an embedded word vector, another modality of information complements the final decision. In order to address the shortage of training samples and the overfitting problem, we employ a deep transfer learning method with two stages. In the first stage, the model is transferred from the original ResNet to the five-class classification model, while the soft labels are estimated adaptively. In the second stage, the model is transferred to two branches with the auxiliary attention model. Finally, a background modeling method is employed for anomaly extraction. Experimental results show that the proposed method achieves excellent results on our traffic data-set.

Author Contributions: X.K. and F.S. conceived and designed the experiments; X.K. and F.S. performed the experiments; X.K. and B.S. analyzed the data; B.S. contributed experimental data, tools, and devices; X.K. wrote the paper; B.S. reviewed and edited the paper.

Funding: This research was supported by the National Natural Science Foundation of China under Grants No. 61772387 and No. 61802296, the Fundamental Research Funds of Ministry of Education and China Mobile (MCM20170202), and the Fundamental Research Funds for the Central Universities, and was also supported by the ISN State Key Laboratory.

Acknowledgments: The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Rakesh, N. Performance analysis of anomaly detection of different IoT datasets using cloud micro services. In Proceedings of the 2016 International Conference on Inventive Computation Technologies (ICICT), Coimbatore, India, 26–27 August 2016; pp. 1–5.
2. Fuller, T.R.; Deane, G.E. IoT applications in an adaptive intelligent system with responsive anomaly detection. In Proceedings of the 2016 Future Technologies Conference (FTC), San Francisco, CA, USA, 6–7 December 2016; pp. 754–762.
3. Vijayalakshmi, A.V.; Arockiam, L. A Study on Security Issues and Challenges in IoT. Eng. Sci. Manag. Res. 2016, 3, 34–43.
4. Ahmed, M.; Mahmood, A.N.; Hu, J. A survey of network anomaly detection techniques. J. Netw. Comput. Appl. 2016, 60, 19–31. [CrossRef]
5. Zhang, Y.; Tian, F.; Song, B.; Du, X. Social vehicle swarms: A novel perspective on socially aware vehicular communication architecture. IEEE Wirel. Commun. 2016, 23, 82–89. [CrossRef]
6. Wang, D.; Chen, D.; Song, B.; Guizani, N.; Yu, X.; Du, X. From IoT to 5G I-IoT: The Next Generation IoT-Based Intelligent Algorithms and 5G Technologies. IEEE Commun. Mag. 2018, 56, 114–120. [CrossRef]
7. Riveiro, M.; Lebram, M.; Elmer, M. Anomaly detection for road traffic: A visual analytics framework. IEEE Trans. Intell. Transp. Syst. 2017, 18, 2260–2270. [CrossRef]
8. Hoang, D.H.; Nguyen, H.D. A PCA-based method for IoT network traffic anomaly detection. In Proceedings of the 2018 20th International Conference on Advanced Communication Technology (ICACT), Chuncheon-si, Gangwon-do, Korea, 11–14 February 2018; pp. 381–386.
9. Yuan, Y.; Wang, D.; Wang, Q. Anomaly detection in traffic scenes via spatial-aware motion reconstruction. IEEE Trans. Intell. Transp. Syst. 2017, 18, 1198–1209. [CrossRef]
10. Sabokrou, M.; Fayyaz, M.; Fathy, M.; Klette, R. Deep-cascade: Cascading 3D deep neural networks for fast anomaly detection and localization in crowded scenes. IEEE Trans. Image Process. 2017, 26, 1992–2004. [CrossRef] [PubMed]
11. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
12. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
13. Tzeng, E.; Hoffman, J.; Zhang, N.; Saenko, K.; Darrell, T. Deep domain confusion: Maximizing for domain invariance. arXiv 2014, arXiv:1412.3474.
14. Xu, D.; Ricci, E.; Yan, Y.; Song, J.; Sebe, N. Learning deep representations of appearance and motion for anomalous event detection. arXiv 2015, arXiv:1510.01553.
15. Carrera, D.; Boracchi, G.; Foi, A.; Wohlberg, B. Detecting anomalous structures by convolutional sparse models. In Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, 12–17 July 2015; pp. 1–8.
16. Erfani, S.M.; Rajasegarar, S.; Karunasekera, S.; Leckie, C. High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recognit. 2016, 58, 121–134. [CrossRef]
17. Hasan, M.; Choi, J.; Neumann, J.; Roy-Chowdhury, A.K.; Davis, L.S. Learning temporal regularity in video sequences. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 733–742.
18. Li, Y.; Song, B.; Kang, X.; Du, X.; Guizani, M. Vehicle-Type Detection Based on Compressed Sensing and Deep Learning in Vehicular Networks. Sensors 2018, 18, 4500. [CrossRef] [PubMed]
19. Zagoruyko, S.; Komodakis, N. Learning to compare image patches via convolutional neural networks. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4353–4361.
20. Sankaranarayanan, S.; Alavi, A.; Chellappa, R. Triplet similarity embedding for face verification. arXiv 2016, arXiv:1602.03418.
21. Tao, R.; Gavves, E.; Smeulders, A.W. Siamese instance search for tracking. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1420–1429.
22. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the 2016 European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 850–865.
23. Chung, D.; Tahboub, K.; Delp, E.J. A two stream siamese convolutional neural network for person re-identification. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1992–2000.
24. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434.
25. Odena, A. Semi-supervised learning with generative adversarial networks. arXiv 2016, arXiv:1606.01583.
26. Bousmalis, K.; Silberman, N.; Dohan, D.; Erhan, D.; Krishnan, D. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 95–104.
27. Kim, T.; Cha, M.; Kim, H.; Lee, J.K.; Kim, J. Learning to discover cross-domain relations with generative adversarial networks. arXiv 2017, arXiv:1703.05192.
28. Kodirov, E.; Xiang, T.; Gong, S. Semantic AutoEncoder for zero-shot learning. arXiv 2017, arXiv:1704.08345.
29. Potapov, A.; Rodionov, S.; Latapie, H.; Fenoglio, E. Metric Embedding AutoEncoders for Unsupervised Cross-Dataset Transfer Learning. In Proceedings of the International Conference on Artificial Neural Networks (ICANN), Island of Rhodes, Greece, 4–7 October 2018; pp. 289–299.
30. Fan, Y.; Wen, G.; Li, D.; Qiu, S.; Levine, M.D. Video anomaly detection and localization via gaussian mixture fully convolutional variational AutoEncoder. arXiv 2018, arXiv:1805.11223.
31. Zhao, Y.; Deng, B.; Shen, C.; Liu, Y.; Lu, H.; Hua, X.S. Spatio-temporal AutoEncoder for video anomaly detection. In Proceedings of the 2017 ACM on Multimedia Conference, Mountain View, CA, USA, 23–27 October 2017; pp. 1933–1941.
32. Narasimhan, M.G.; Kamath, S. Dynamic video anomaly detection and localization using sparse denoising AutoEncoders. Multimedia Tools Appl. 2017, 77, 1–23. [CrossRef]
33. Chong, Y.S.; Tay, Y.H. Abnormal event detection in videos using spatiotemporal AutoEncoder. In Proceedings of the International Symposium on Neural Networks, Sapporo, Japan, 21–26 June 2017; pp. 189–196.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).