Towards Optimality Conditions for Non-Linear Networks

arXiv:1605.07145v1 [stat.ML] 23 May 2016

Devansh Arpit, Department of Computer Science, SUNY Buffalo, USA, [email protected]
Yingbo Zhou, Department of Computer Science, SUNY Buffalo, USA, [email protected]
Hung Q. Ngo, LogicBlox, San Francisco Bay Area, USA, [email protected]
Nils Napp, Department of Computer Science, SUNY Buffalo, USA, [email protected]
Venu Govindaraju, Department of Computer Science, SUNY Buffalo, USA, [email protected]

Abstract

Training non-linear neural networks is a challenging task, but over the years, various approaches coming from different perspectives have been proposed to improve performance. However, insights into what fundamentally constitutes optimal network parameters remain obscure. Similarly, what properties of data we can hope a non-linear network to learn is also not well studied. In order to address these challenges, we take a novel approach by analysing neural networks from a data-generating perspective, where we assume hidden layers generate the observed data. This perspective allows us to connect seemingly disparate approaches explored independently in the machine learning community, such as batch normalization, Independent Component Analysis and orthogonal weight initialization, as parts of a bigger picture, and to provide insights into non-linear networks in terms of the properties of parameters and data that lead to better performance.

1 Introduction

Deep networks, when trained under different optimization conditions, may lead to drastically different results, both in terms of performance and the properties of the network parameters learned. While many heuristic methods for optimizing parameters exist, it is not well understood what parameter properties lead to good performance. Our goal is therefore to discover conditions on parameter properties that ensure near-optimal performance. In practice, seemingly unrelated components (e.g. Rectified Linear units [17], orthogonal weight initialization [19], Batch Normalization [12], etc.) lead to performance improvements while resulting in different learned parameters. This raises the question of whether there exist, in general, optimal properties of network parameters. In order to answer this question, we look at neural networks from the perspective of a data generating model where hidden layer representations give rise to the observed data (input layer). Then, we ask: what attributes of network parameters ensure accurate recovery of the hidden representation given the data when we forward propagate? We find that the answer to our question lies with a surprisingly unrelated entity: the Auto-Encoder.

Auto-Encoders (AE) [4] are commonly used for unsupervised representation learning. AEs focus on learning a mapping $x \xrightarrow{\text{encoder}} h \xrightarrow{\text{decoder}} \hat{x}$, where the reconstructed vector $\hat{x}$ is desired to be as close to x as possible for the entire data distribution. What we show in this paper is that if we consider x to actually be generated from h by some process (discussed later), then switching our perspective to analyze $h \xrightarrow{\text{generation}} x \xrightarrow{\text{recovery}} \hat{h}$ yields unexpectedly useful insights into the optimality of model parameters of non-linear networks in terms of signal recovery. In other words, this perspective lets us look at a neural network layer from a signal recovery point of view, where forward propagating x recovers the true signal h. In order to do so, we analyze the conditions under which the encoder part of an AE recovers the true h from x, while the decoder part acts as the data generation process. In this paper, our main result shows that the true signal h can be approximately recovered by the encoder part of an AE with high probability under certain conditions on the weight matrix, bias vectors and the distribution of the hidden signal. We discover that these required conditions, and the properties resulting from them, connect a wide variety of ideas that have been independently explored so far in the machine learning community as pieces of a bigger picture. These ideas include Batch Normalization [12], Normalization Propagation [3], De-noising Auto-Encoders [22], Independent Component Analysis [11] (ICA), Sparse Coding, orthogonal weight initialization [19], k-sparse auto-encoders [15], data whitening, Rectified Linear activation [17] and Sigmoid activation (see Section 4.3 for details). While we provide insights for single-layer non-linear network parameters, we believe our approach of analyzing neural networks from a data generation perspective can reveal insights into the workings of deep non-linear networks as well.

2 Sparse Signal Recovery Point of View

While it is known, both empirically and theoretically, that useful features learned by AEs are usually sparse [2, 16, 17], an important question that has not been answered yet is whether AEs in general are capable of recovering sparse signals (generated as described in the next section) in the first place. This is an important question from the Sparse Coding point of view, because it entails recovering the sparsest h that approximately satisfies $x = W^T h$, for any given data vector x and overcomplete weight matrix W. However, since this problem is NP-complete [1], it is usually relaxed to solving an expensive optimization problem [5, 6],

$$\arg\min_{h} \|x - W^T h\|^2 + \lambda \|h\|_1 \tag{1}$$

where $W \in \mathbb{R}^{m \times n}$ is a fixed overcomplete ($m > n$) dictionary, $\lambda$ is the regularization coefficient, $x \in \mathbb{R}^n$ is the observed data and $h \in \mathbb{R}^m$ is the signal we want to recover. For this special case, [15] analyzes the condition under which linear AEs can recover the support of the hidden signal. The general AE objective, on the other hand, minimizes the expected reconstruction cost

$$J_{AE} = \min_{W, b_e, b_d} \mathbb{E}_x \left[ \big\| x - s_d\big(W^T s_e(Wx + b_e) + b_d\big) \big\|^2 \right] \tag{2}$$

for some encoding and decoding activation functions $s_e(.)$ and $s_d(.)$, and bias vectors $b_e$ and $b_d$. In this paper we consider a linear activation $s_d$ because we are interested in sparse signal recovery analysis. Notice, however, that in the case of auto-encoders the activation functions can be non-linear in general, in contrast to the sparse coding objective. In addition, in the case of AEs we do not have a separate parameter h for the hidden representation corresponding to every data sample x individually. Instead, the hidden representation for every sample is a parametric function of the sample itself. This is an important distinction between the optimization in equation 1 and our problem: the identity of h in equation 1 is only well defined in the presence of $\ell_1$ regularization due to the overcompleteness of the dictionary; in our problem, however, we assume a true signal h generates the observed data x as $x = W^T h + b_d$, where the dictionary W and bias vector $b_d$ are fixed. Hence, what we mean by recovery of sparse signals in an AE framework is the following: if we generate data using the above generation process, can the estimate $\hat{h} = s_e(Wx + b_e)$ indeed recover the true h for some activation function $s_e(.)$ and bias vector $b_e$? And if so, what properties of W, $b_e$, $s_e(.)$ and h in general lead to good recovery? However, when given an x and the true overcomplete W, the solution h to $x = W^T h$ is not unique. The question then arises about the possibility of recovering such an h. As we show, recovery using the AE mechanism is strongest when the signal h is the sparsest possible one, which, from compressed sensing theory, guarantees uniqueness of h if W is sufficiently incoherent¹.

¹ Coherence is defined as $\max_{W_i, W_j, i \neq j} \frac{|W_i^T W_j|}{\|W_i\|\,\|W_j\|}$.

We would like to point out that while we analyze the AE recovery mechanism for recovering h, our goal is not data reconstruction and our results hold for layers in a general neural network.
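To make the contrast between equation 1 and the AE forward pass concrete, the following minimal sketch (our own, not from the paper; the toy dictionary, the regularization value and the zero encoder bias are all illustrative assumptions) recovers a sparse h both by iterative soft-thresholding (ISTA) applied to equation 1 and by a single ReLU encoder pass $\hat{h} = s_e(Wx + b_e)$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 180                      # overcomplete: m hidden units, n observed dims

# Toy dictionary with unit-length, roughly incoherent rows.
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sparse non-negative ground-truth signal and observation x = W^T h.
h_true = np.maximum(0, rng.standard_normal(m)) * (rng.random(m) < 0.05)
x = W.T @ h_true

def ista(x, W, lam=0.05, iters=500):
    """Iterative soft-thresholding for eq. (1): argmin_h ||x - W^T h||^2 + lam*||h||_1."""
    step = 1.0 / np.linalg.norm(W @ W.T, 2)       # 1/L, L = largest eigenvalue of WW^T
    h = np.zeros(W.shape[0])
    for _ in range(iters):
        grad = 2 * W @ (W.T @ h - x)              # gradient of the quadratic term
        h = h - step * grad
        h = np.sign(h) * np.maximum(np.abs(h) - step * lam, 0.0)   # soft threshold
    return h

h_ista = ista(x, W)
h_ae = np.maximum(0, W @ x)          # one ReLU encoder pass with zero bias (b_e = 0)

print("ISTA  mean abs error:", np.mean(np.abs(h_ista - h_true)))
print("ReLU  mean abs error:", np.mean(np.abs(h_ae - h_true)))
```

The point of the comparison is not which estimate is numerically better here, but that the AE estimate is a single parametric forward pass rather than a per-sample optimization.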

3 Data Generation Process

We consider the following data generation process:

$$x = W^T h + b_d + e \tag{3}$$
where $x \in \mathbb{R}^n$ is the observed data, $b_d \in \mathbb{R}^n$ is a bias vector, $e \in \mathbb{R}^n$ is a noise vector, $W \in \mathbb{R}^{m \times n}$ is the weight matrix and $h \in \mathbb{R}^m$ is the hidden representation (signal) that we want to recover given the observed data. Throughout our analysis, we assume that the signal h belongs to the following class of distributions.

Assumption 1. Bounded Independent Non-negative Sparse (BINS): every hidden unit $h_j$ is an independent random variable with the density function

$$f(h_j) = \begin{cases} (1-p_j)\,\delta_0(h_j) & \text{if } h_j = 0 \\ p_j\, f_c(h_j) & \text{if } h_j \in (0, l_{\max_j}] \end{cases} \tag{4}$$

where $f_c(.)$ can be any arbitrary normalized distribution bounded in the interval $(0, l_{\max_j}]$ with mean $\mu_{h_j}$, and $\delta_0(.)$ is the Dirac delta function at zero. As shorthand, we say that $h_j$ follows the distribution BINS($p, f_c, \mu_h, l_{\max}$). Notice $\mathbb{E}_{h_j}[h_j] = p_j \mu_{h_j}$.

The above continuous distribution assumption is justified by the following intuition: in deep networks with ReLU activations, hidden unit pre-activations have a Gaussian-like symmetric distribution [11, 12]. If we assume these distributions are mean centered², then after applying ReLU to the pre-activation values, the hidden units' distribution has a large mass at 0 while the rest of the mass concentrates in $(0, l_{\max})$ for some finite positive $l_{\max}$, since the pre-activation concentrates symmetrically around zero. As we show in the next section, ReLU is indeed capable of recovering such signals. On a side note, the distribution in the above assumption can take shapes similar to the Exponential or Rectified Gaussian distribution³ (which are generally used for modelling biological neurons) but is simpler to analyse. This is because our definition is more general in the sense that we allow $f_c(.)$ to be any arbitrary normalized distribution. The only restriction Assumption 1 imposes is boundedness; this does not change the representative power of the distribution significantly because: a) the distributions used for modelling neurons have very small tail mass; b) in practice, we are generally interested in signals with upper bounded values.

The above data generation process (equation 3 and the assumptions above) is justified as a whole for the following reasons:
1. The above data generation model finds applications in a number of areas [24, 13, 23]. Notice that while x is the measurement vector (observed data), which can in general be noisy, h denotes the actual signal (internal representation) because it reflects the combination of dictionary ($W^T$) atoms involved in generating the observed samples, and hence serves as the true identity of the data.
2. Sparse distributed representations [10] are both observed and desired in hidden representations. It has been empirically shown that representations that are truly sparse and distributed (large number of hard zeros) usually yield better linear separability and performance [8, 23, 24].

Decoding bias ($b_d$): Consider the data generation process (excluding noise for now) $x = W^T h + b_d$. Here $b_d$ is a bias vector which can take any arbitrary value but, like W, is fixed for any particular data generation process. However, the following proposition shows that if an AE can recover the sparse code h from a data sample generated as $x = W^T h$, then it is also capable of recovering the sparse code from data generated as $x = W^T h + b_d$, and vice versa.

Proposition 1. Let $x_1 = W^T h$ where $x_1 \in \mathbb{R}^n$, $W \in \mathbb{R}^{m \times n}$ and $h \in \mathbb{R}^m$. Let $x_2 = W^T h + b_d$ where $b_d \in \mathbb{R}^n$ is a fixed vector. Let $\hat{h}_1 = s_e(Wx_1 + b)$ and $\hat{h}_2 = s_e(Wx_2 + b - Wb_d)$. Then $\hat{h}_1 = h$ iff $\hat{h}_2 = h$.

Thus, without any loss of generality, we will assume our data is generated by the process $x = W^T h + e$.

² This happens for instance as a result of the Batch Normalization [12] technique, which leads to significantly faster convergence. It is thus good practice to have a mean-centered pre-activation distribution.
³ Depending on the distribution $f_c(.)$.
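As a concrete illustration of Assumption 1 and the generation process of equation 3, here is a small sketch (our own; the particular values of p, $l_{\max}$, the uniform choice of $f_c$, and the noise scale are illustrative assumptions) that samples h from BINS and produces observations x.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 200, 180, 1000      # hidden dim, observed dim, number of samples

def sample_bins(N, m, p=0.05, l_max=1.0):
    """Sample N signals from BINS(p, f_c=Uniform(0, l_max], mu_h=l_max/2, l_max).

    Each unit is 0 with probability 1-p and drawn from f_c with probability p.
    """
    mask = rng.random((N, m)) < p                  # active units (probability p each)
    values = rng.uniform(0.0, l_max, size=(N, m))  # f_c: uniform on (0, l_max]
    return mask * values

# Fixed dictionary with unit-length, roughly incoherent rows.
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)

b_d = rng.standard_normal(n)                       # arbitrary fixed decoding bias
E = 0.01 * rng.standard_normal((N, n))             # small noise e

H = sample_bins(N, m)                              # true hidden signals
X = H @ W + b_d + E                                # x = W^T h + b_d + e, row-wise

print("average fraction of active units:", (H > 0).mean())   # approximately p
```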

4 Signal Recovery Analysis and Optimal Properties of Network Parameters

Even though auto-encoders themselves are no longer heavily used for parameter initialization, since many supervised training methods [12, 20, 17] directly lead to state-of-the-art results, our analysis of the recovery mechanism $\hat{h} = s_e(Wx + b_e)$ involved in auto-encoders nonetheless leads to useful insights about the optimization involved in training supervised networks. This is because we focus on analysing the recovery bound for the hidden representation h given the corresponding data sample x, and the properties of the weight matrix and encoding bias that lead to good recovery. Thus, instead of serving simply to analyse data reconstruction bounds, our analysis says more about what happens when data is forward propagated through a non-linear neural network, in terms of the representation achieved at the hidden layer being the true data generator. Hence, we define the notion of the Auto-Encoder signal recovery mechanism that we will analyse throughout this paper.

Definition 1. Let a data sample $x \in \mathbb{R}^n$ be generated by the process $x = W^T h + e$ where $W \in \mathbb{R}^{m \times n}$ is a fixed matrix, e is noise and $h \in \mathbb{R}^m$. Then we define the Auto-Encoder signal recovery mechanism as $\hat{h}_{s_e}(x; W, b_e)$ that recovers the estimate $\hat{h} = s_e(Wx + b_e)$, where $s_e(.)$ is an activation function.

4.1 Recovery Analysis

We analyse two separate classes of signals in this category: binary sparse, and continuous sparse signals that follow BINS. For notational convenience, we will drop the subscript of $b_e$ and simply refer to this parameter as b, as it is the only bias vector (we are not considering the other bias $b_d$ due to Proposition 1). Due to space limitations, we have moved the analysis for binary sparse signals to the appendix. In general, we found that similar conclusions hold for both the binary and the continuous signal case.

Theorem 1. (Noiseless Continuous Signal Recovery): Let each element of $h \in \mathbb{R}^m$ follow the BINS($p, f_c, \mu_h, l_{\max}$) distribution and let $\hat{h}_{ReLU}(x; W, b)$ be an auto-encoder signal recovery mechanism with Rectified Linear activation function (ReLU) and bias b for a measurement vector $x \in \mathbb{R}^n$ such that $x = W^T h$. If we set $b_i := -\sum_j a_{ij} p_j \mu_{h_j}\ \forall i \in [m]$, then $\forall\, \delta \geq 0$,

$$\Pr\left(\frac{1}{m}\|\hat{h}-h\|_1 \leq \delta\right) \geq 1 - \sum_{i=1}^{m}\left( e^{-2\frac{\left(\delta + \sum_j (1-p_j)(l_{\max_j}-2p_j\mu_{h_j})\max(0,a_{ij})\right)^2}{\sum_j a_{ij}^2 l_{\max_j}^2}} + e^{-2\frac{\left(\delta + \sum_j (1-p_j)(l_{\max_j}-2p_j\mu_{h_j})\max(0,-a_{ij})\right)^2}{\sum_j a_{ij}^2 l_{\max_j}^2}} \right) \tag{5}$$

where the $a_i$'s are vectors such that

$$a_{ij} = \begin{cases} W_i^T W_j & \text{if } i \neq j \\ W_i^T W_i - 1 & \text{if } i = j \end{cases} \tag{6}$$

and $W_i$ is the $i$-th row of the matrix W cast as a column vector.

Analysis: We first analyze the properties of the weight matrix that result in a strong recovery bound. For strong recovery, the terms $(\delta + \sum_j (1-p_j)(l_{\max_j} - 2p_j\mu_{h_j})\max(0, a_{ij}))^2$ and $(\delta + \sum_j (1-p_j)(l_{\max_j} - 2p_j\mu_{h_j})\max(0, -a_{ij}))^2$ should be as large as possible, while simultaneously the term $\sum_j a_{ij}^2 l_{\max_j}^2$ needs to be as close to zero as possible. First notice the term $(1-p_j)(l_{\max_j} - 2p_j\mu_{h_j})$. Since $\mu_{h_j} < l_{\max_j}$ by definition, both terms containing $(1-p_j)(l_{\max_j} - 2p_j\mu_{h_j})$ are always positive and contribute towards stronger recovery if $p_j$ is less than 50% (sparse), and the bound becomes stronger as the signal becomes sparser (smaller $p_j$). Now, if we assume the rows of the weight matrix W are highly incoherent and that each row of W has unit $\ell_2$ length, then it is safe to assume each $a_{ij}$ ($\forall i, j \in [m]$) is close to 0, from the definition of $a_{ij}$ and the properties of W we have assumed. Then for any small positive value of $\delta$, we can approximately say

$$\frac{\left(\delta + \sum_j (1-p_j)(l_{\max_j}-2p_j\mu_{h_j})\max(0,a_{ij})\right)^2}{\sum_j a_{ij}^2 l_{\max_j}^2} \approx \frac{\delta^2}{\sum_j a_{ij}^2 l_{\max_j}^2}$$

where each $a_{ij}$ is very close to zero. The same argument holds similarly for the term $\frac{\left(\delta + \sum_j (1-p_j)(l_{\max_j}-2p_j\mu_{h_j})\max(0,-a_{ij})\right)^2}{\sum_j a_{ij}^2 l_{\max_j}^2}$. Thus we find that we get a strong signal recovery bound if the weight matrix is highly incoherent and all hidden weight lengths are set to 1.

In the case of the bias, we have set each element to $b_i := -\sum_j a_{ij} p_j \mu_{h_j}\ \forall i \in [m]$. Notice from the definition of BINS, $\mathbb{E}_{h_j}[h_j] = p_j \mu_{h_j}$. Thus in essence, $b_i = -\sum_j a_{ij} \mathbb{E}_{h_j}[h_j]$. Expanding $a_{ij}$, we get $b_i = -W_i^T W^T \mathbb{E}_h[h] + \mathbb{E}_{h_i}[h_i] = -W_i^T \mathbb{E}_x[x] + \mathbb{E}_{h_i}[h_i]$. The recovery bound is strong for continuous signals when the recovery mechanism is set to

$$\hat{h}_i := \mathrm{ReLU}\left(W_i^T (x - \mathbb{E}_x[x]) + \mathbb{E}_{h_i}[h_i]\right) \tag{7}$$

and the rows of W are highly incoherent and each hidden weight has unit length ($\|W_i\|_2 = 1$). We now state the recovery bound for the noisy data generation scenario.

Proposition 2. (Noisy Continuous Signal Recovery): Let each element of $h \in \mathbb{R}^m$ follow the BINS($p, f_c, \mu_h, l_{\max}$) distribution and let $\hat{h}_{ReLU}(x; W, b)$ be an auto-encoder signal recovery mechanism with Rectified Linear activation function (ReLU) and bias b for a measurement vector $x \in \mathbb{R}^n$ such that $x = W^T h + e$, where e is any noise random vector independent of h. If we set $b_i := -\sum_j a_{ij} p_j \mu_{h_j} - W_i^T \mathbb{E}_e[e]\ \forall i \in [m]$, then $\forall\, \delta \geq 0$,

$$\Pr\left(\frac{1}{m}\|\hat{h}-h\|_1 \leq \delta\right) \geq 1 - \sum_{i=1}^{m}\left( e^{-2\frac{\left(\delta - W_i^T(e - \mathbb{E}_e[e]) + \sum_j (1-p_j)(l_{\max_j}-2p_j\mu_{h_j})\max(0,a_{ij})\right)^2}{\sum_j a_{ij}^2 l_{\max_j}^2}} + e^{-2\frac{\left(\delta - W_i^T(e - \mathbb{E}_e[e]) + \sum_j (1-p_j)(l_{\max_j}-2p_j\mu_{h_j})\max(0,-a_{ij})\right)^2}{\sum_j a_{ij}^2 l_{\max_j}^2}} \right) \tag{8}$$

where the $a_i$'s are vectors such that

$$a_{ij} = \begin{cases} W_i^T W_j & \text{if } i \neq j \\ W_i^T W_i - 1 & \text{if } i = j \end{cases} \tag{9}$$

and $W_i$ is the $i$-th row of the matrix W cast as a column vector.

Notice that we have not assumed any distribution on the noise random variable e. Moreover, this term has no effect on recovery (compared to the noiseless case) if the noise distribution is orthogonal to the hidden weight vectors. On the other hand, the same properties of W lead to better recovery as in the noiseless case. In the case of the bias, however, we have set each element to $b_i := -\sum_j a_{ij} p_j \mu_{h_j} - W_i^T \mathbb{E}_e[e]\ \forall i \in [m]$. Notice from the definition of BINS, $\mathbb{E}_{h_j}[h_j] = p_j \mu_{h_j}$. Thus in essence, $b_i = -\sum_j a_{ij} \mathbb{E}_{h_j}[h_j] - W_i^T \mathbb{E}_e[e]$. Expanding $a_{ij}$, we get $b_i = -W_i^T W^T \mathbb{E}_h[h] + \mathbb{E}_{h_i}[h_i] - W_i^T \mathbb{E}_e[e] = -W_i^T \mathbb{E}_x[x] + \mathbb{E}_{h_i}[h_i]$. Thus the expression for the bias is unaffected by the error statistics as long as we can compute the data mean.
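The parameterization in equation 7 is easy to exercise numerically. The following sketch (our own; empirical means stand in for the expectations, and the dictionary construction mirrors the recipe used later in Section 5) applies the ReLU recovery rule with the prescribed bias to noiselessly generated BINS data.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, N = 200, 180, 5000
p, l_max = 0.02, 1.0

# Incoherent dictionary: orthogonalize the columns, then rescale rows to unit length.
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))     # Q has orthonormal columns (m x n)
W = Q / np.linalg.norm(Q, axis=1, keepdims=True)

# BINS(p, Uniform(0, 1], mu_h = 0.5, l_max = 1) signals and noiseless data x = W^T h.
H = (rng.random((N, m)) < p) * rng.uniform(0.0, l_max, (N, m))
X = H @ W

# Recovery rule of equation (7): h_i = ReLU(W_i^T (x - E[x]) + E[h_i]).
x_mean = X.mean(axis=0)                      # empirical stand-in for E_x[x]
h_mean = H.mean(axis=0)                      # empirical stand-in for E_{h_i}[h_i]
H_hat = np.maximum(0.0, (X - x_mean) @ W.T + h_mean)

print("mean |h_hat - h| per unit:", np.mean(np.abs(H_hat - H)))
```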

4.2 Properties of Generated Data

Since the data we observe results from the hidden signal via $x = W^T h$, it is interesting to analyze the distribution of the generated data. This helps us answer what kind of pre-processing would ensure stronger signal recovery.

Theorem 2. (Uncorrelated Distribution Bound): If data is generated as $x = W^T h$, where $h \in \mathbb{R}^m$ has covariance matrix $\mathrm{diag}(\zeta)$ ($\zeta \in \mathbb{R}_+^m$), and $W \in \mathbb{R}^{m \times n}$ ($m > n$) is such that each row of W has unit length and the rows of W are maximally incoherent, then the covariance matrix of the generated data is approximately spherical (uncorrelated), satisfying

$$\min_{\alpha} \|\Sigma - \alpha I\|_F \leq \sqrt{\frac{1}{n}\left(m\|\zeta\|_2^2 - \|\zeta\|_1^2\right)} \tag{10}$$

where $\Sigma = \mathbb{E}_x[(x - \mathbb{E}_x[x])(x - \mathbb{E}_x[x])^T]$ is the covariance matrix of the generated data.

Analysis: Notice that for any vector $v \in \mathbb{R}^m$, $m\|v\|_2^2 \geq \|v\|_1^2$, and equality holds when every element of v is identical. Data x generated using a maximally incoherent dictionary W (with unit $\ell_2$ row length) as $x = W^T h$ is therefore guaranteed to be highly uncorrelated if h is uncorrelated with near-identity covariance. This ensures that the hidden units at the following layer are also uncorrelated during training. Further, the covariance matrix of x is spherical (a scaled identity) if all hidden units have equal variance. This analysis acts as a justification for data whitening, where data is processed to have zero mean and identity covariance matrix. Notice that although the generated data does not have zero mean, the recovery process (equation 7) subtracts the data mean and hence this does not affect recovery.
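As a quick numerical sanity check of Theorem 2 (our own sketch; the "maximally incoherent" W is only approximated by column-orthogonalizing a Gaussian matrix and normalizing its rows, and the choice of $\zeta$ is arbitrary), one can compare $\min_\alpha \|\Sigma - \alpha I\|_F$ against the right-hand side of equation 10.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 200, 100

# Approximately maximally incoherent W with unit-length rows.
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))
W = Q / np.linalg.norm(Q, axis=1, keepdims=True)

zeta = rng.uniform(0.5, 1.5, m)          # diagonal covariance of h
Sigma = W.T @ np.diag(zeta) @ W          # covariance of x = W^T h (Theorem 2)

# min_alpha ||Sigma - alpha*I||_F is attained at alpha* = trace(Sigma)/n.
alpha = np.trace(Sigma) / n
lhs = np.linalg.norm(Sigma - alpha * np.eye(n), "fro")
rhs = np.sqrt((m * np.sum(zeta**2) - np.sum(zeta)**2) / n)

# For a maximally incoherent W the left side should not exceed the bound of eq. (10).
print(f"min_alpha ||Sigma - alpha I||_F = {lhs:.4f}")
print(f"bound of eq. (10)              = {rhs:.4f}")
```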

4.3 Connections with existing work

Auto-Encoders (AE): Our analysis reveals the conditions on weights and bias (Section 4) in an AE that lead to strong signal recovery (for both continuous and binary signals), which ultimately implies low data reconstruction error. However, the above arguments view AEs purely from a recovery point of view. Training a basic AE on data may lead to learning of the identity function. Thus AEs are usually trained with a bottleneck to make the learned representation useful. One such bottleneck is the de-noising criterion,

$$J_{DAE} = \min_{W, b} \|x - W^T s_e(W\tilde{x} + b)\|^2 \tag{11}$$

where $s_e(.)$ is the activation function and $\tilde{x}$ is a corrupted version of x. It has been shown that the Taylor expansion of the DAE objective (Theorem 3 of [2]) contains the term $\sum_{j,k=1,\, j \neq k}^{m} \frac{\partial h_j}{\partial a_j} \frac{\partial h_k}{\partial a_k} (W_j^T W_k)^2$. If we constrain the weight vectors to have fixed length, then this regularization term minimizes a weighted sum of the cosines of the angles between every pair of weight vectors. As a result, the weight vectors become increasingly incoherent. Hence we achieve both our goals by adding one additional constraint to the DAE: constraining the weight vectors to have unit length. Even if we do not apply an explicit constraint, we can expect the weight lengths to be upper bounded by the basic AE objective itself, which would explain the learning of incoherent weights due to the DAE regularization (we experimentally confirmed this to be true). On a side note, our analysis also justifies the use of tied weights in auto-encoders.

Sparse Coding (SC): SC involves minimizing $\|x - W^T h\|^2$ using the sparsest possible h. The analysis after Theorem 1 shows that signal recovery using the AE mechanism becomes stronger for sparser signals (as also confirmed experimentally in the next section). In other words, for any given data sample and weight matrix, the AE recovery mechanism recovers the sparsest possible signal; this justifies using auto-encoders for recovering sparse codes (see [9, 15, 18] for work along this line) as long as the conditions on the weight matrix and bias are met.

Batch Normalization (BN) and Normalization Propagation (NormProp): BN [12] identifies the problem of Internal Covariate Shift in deep networks. This refers to the shifting distribution of hidden layer inputs during training, since network parameters get updated after each iteration. In order to address this problem, they suggest normalizing every hidden layer's pre-activation to have a Normal distribution. While BN achieves this by computing a running average of the input over training mini-batches for each hidden layer and subtracting it, NormProp parametrically computes this mean and subtracts it at every layer. As shown by our analysis (equation 7), setting the bias vector to the negative of the expected pre-activation⁴ leads to strong signal recovery. This is an interesting coincidence, since we arrive at this parameterization of the bias vector in order to achieve a strong signal recovery bound. On the other hand, if the generating signal has identity covariance, the generated data is also approximately uncorrelated with equal variance (as shown in Theorem 2). This is achieved by BN/NormProp by dividing each pre-activation by the standard deviation of its input. Our analysis only holds for the first layer of a deep network, but since BN/NormProp are used for higher layers as well, this suggests a similar signal recovery analysis can be carried out for modeling higher hidden layers too.

⁴ The term $\mathbb{E}_{h_j}[h_j]$ in equation 7 is small since ReLU hidden unit outputs are generally sparse.
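For concreteness, here is a minimal sketch (our own; plain NumPy, a single random batch and no training loop, with Gaussian corruption chosen arbitrarily) of the de-noising objective of equation 11 together with our reading of the incoherence-inducing penalty discussed in the AE paragraph above, with the weight rows kept at unit length.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, batch = 64, 32, 16

W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)     # unit-length rows (the extra constraint)
b = np.zeros(m)

X = rng.standard_normal((batch, n))               # stand-in data batch
X_tilde = X + 0.3 * rng.standard_normal(X.shape)  # corrupted input (Gaussian corruption)

# De-noising reconstruction cost of eq. (11) with ReLU encoder and linear (tied) decoder.
H = np.maximum(0.0, X_tilde @ W.T + b)            # h = s_e(W x_tilde + b)
recon = H @ W                                     # W^T h, row-wise
J_dae = np.mean(np.sum((X - recon) ** 2, axis=1))

# The incoherence-related term from the DAE Taylor expansion (Theorem 3 of [2]):
# sum_{j != k} s'(a_j) s'(a_k) (W_j^T W_k)^2, with s'(a_j) = 1[h_j > 0] for ReLU.
s_prime = (H > 0).astype(float)                   # per-sample activation derivatives
G2 = (W @ W.T) ** 2                               # (W_j^T W_k)^2 for all pairs
np.fill_diagonal(G2, 0.0)                         # exclude j == k
reg = np.mean(np.einsum("bj,jk,bk->b", s_prime, G2, s_prime))

print(f"J_DAE = {J_dae:.3f}, incoherence penalty = {reg:.3f}")
```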


[Figure 1: Error heatmaps showing the optimal values of c and ∆b for recovering the signal using an Incoherent (left) and a Coherent (right) weight matrix; axes are ∆b (horizontal) and c (vertical), color shows Avg. % Recovery Error.]

[Figure 2: Effect of signal sparseness on its recovery: Avg. % Recovery Error vs. unit activation probability (p) for orthogonal, Xavier and Gaussian weight matrices, with and without noise. Sparser signals are recovered better.]
Independent Component Analysis [11] (ICA): ICA assumes we observe data generated by the process $x = W^T h$ where all elements of the signal h are independent and W is a mixing matrix. The task of ICA is to recover both W and h given the data. This data generating process is precisely what we assume in Section 3. Based on this assumption, our results show: 1) the properties of W that can recover such independent signals h; 2) that auto-encoders can be used for recovering such signals and weight matrices W.

Orthogonal weight initialization [19]: The authors of [19] show that orthogonal weight initialization is optimal for training deep linear networks. We, on the other hand, conclude from our analysis that signal recovery is strong when the filters in a weight matrix are highly incoherent. This condition is achieved trivially by orthogonal weight matrices when the feature dimension is larger than the number of filters (undercomplete). However, even in the overcomplete weight matrix scenario, we empirically show (next section) that if the transpose of the weight matrix is orthogonalized, it results in highly incoherent filters compared with other methods for generating random weight matrices. Thus our findings support the orthogonal weight initialization of [19]. However, we do not make any claims for the weight matrices of higher layers of a deep non-linear network.

k-Sparse AEs [15]: The authors of [15] propose to zero out all values of the hidden units smaller than the top-k values for each sample during training. This is done to achieve sparsity in the learned hidden representation. This strategy is justified from the perspective of our analysis as well, because the PAC bound (Theorem 1) derived for signal recovery using the AE recovery mechanism shows that we recover a noisy version of the true sparse signal. Since the noise in each recovered signal unit is roughly proportional to the original value, de-noising such recovered signals can be achieved by thresholding the hidden unit values (exploiting the fact that the signal is sparse). This can be done either by using a fixed threshold or by picking the top k values.

Data Whitening: Theorem 2 shows that data generated from BINS and incoherent weight matrices is roughly uncorrelated. Thus recovering such signals is easier, and the properties of weights and bias predicted by our analysis are applicable if we pre-process the sampled data to have uncorrelated dimensions (also suggested by LeCun et al. [14]); a condition achieved by whitening.
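As a reference for the whitening connection above, a standard ZCA-style whitening transform (our own minimal sketch, not code from the paper) looks as follows; applying it to sampled data yields zero-mean, approximately identity-covariance inputs of the kind the analysis favors.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """Return a zero-mean, approximately identity-covariance version of X (rows = samples)."""
    Xc = X - X.mean(axis=0)                       # zero mean
    cov = Xc.T @ Xc / Xc.shape[0]                 # empirical covariance
    eigval, eigvec = np.linalg.eigh(cov)          # symmetric eigendecomposition
    scale = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T
    return Xc @ scale                             # ZCA: rotate back to the original axes

# Example: whiten correlated Gaussian data.
rng = np.random.default_rng(4)
A = rng.standard_normal((10, 10))
X = rng.standard_normal((5000, 10)) @ A           # correlated samples
Xw = zca_whiten(X)
print(np.round(np.cov(Xw, rowvar=False), 2))      # close to the identity matrix
```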

5 Empirical Verification

We empirically verify the fundamental predictions made in Section 4, which both serves to justify the assumptions we have made and confirms our results. We verify the following: a) the optimality of the rows of a weight matrix W having unit length and being highly incoherent, for the single-hidden-layer case; b) the effect of sparsity on signal recovery for the single-hidden-layer case.

5.1 Optimal Properties of Weight and Bias

Our analysis of signal recovery in Section 4 (equation 7) shows that the signal recovery bound is strong when: a) the data-generating weight matrix W has rows of unit $\ell_2$ length; b) the rows of W are highly incoherent; c) each bias vector element is set to the negative expectation of the pre-activation; d) the signal h has independent dimensions. In order to verify this, we generate N = 5,000 signals $h \in \mathbb{R}^{m=200}$ from BINS($p = 0.02$, $f_c$ = uniform, $\mu_h = 0.5$, $l_{\max} = 1$), with $f_c(.)$ set to the uniform distribution for simplicity. We then generate the corresponding 5,000 data samples $x = cW^T h$ in $\mathbb{R}^{180}$ using an incoherent weight matrix $W \in \mathbb{R}^{200 \times 180}$ (each element sampled from a zero-mean Gaussian, the columns orthogonalized, and the $\ell_2$ length of each row rescaled to 1; notice the rows cannot be orthogonal). We then recover each signal using

$$\hat{h}_i := \mathrm{ReLU}\left(cW_i^T(x - \mathbb{E}_h[x]) + \mathbb{E}_{h_i}[h_i] + \Delta b\right) \tag{12}$$

where c and $\Delta b$ are scalars that we vary in $[0.1, 2]$ and $[-1, +1]$ respectively. For the recovered signals, we calculate the Average Percentage Recovery Error (APRE) as

$$\mathrm{APRE} = \frac{100}{Nm} \sum_{i=1,j=1}^{N,m} w_{h_j^i}\, \mathbb{1}\left(|\hat{h}_j^i - h_j^i| > \epsilon\right) \tag{13}$$

where we set $\epsilon = 0.1$, $\mathbb{1}(.)$ is the indicator operator, $\hat{h}_j^i$ denotes the $j$-th dimension of the recovered signal corresponding to the $i$-th true signal, and

$$w_{h_j^i} = \begin{cases} \frac{0.5}{p} & \text{if } h_j^i > 0 \\ \frac{0.5}{1-p} & \text{if } h_j^i = 0 \end{cases} \tag{14}$$

The error is weighted with $w_{h_j^i}$ so that the recovery error for both zero and non-zero $h_j^i$'s is penalized equally. This is especially needed here because $h_j^i$ is sparse, and a low error could also be achieved trivially by setting all the recovered $\hat{h}_j^i$'s to zero. Along with the incoherent weight matrix, we also generate data separately using a highly coherent weight matrix, which we get by sampling each element randomly from a uniform distribution on [0, 1] and scaling each row to unit length. According to our analysis, we should get the least error at c = 1 and $\Delta b = 0$ for the incoherent matrix, while the coherent matrix should yield both higher recovery error and a different optimal choice of c and $\Delta b$. The error heat maps are shown in Figure 1. For the incoherent weight matrix, the empirical optimum is precisely c = 1 and $\Delta b = 0$ (exactly as predicted), with Avg. % Recovery Error = 0.21, even though the weight matrix is only approximately (not maximally) incoherent. For the coherent weight matrix, on the other hand, we get the optimal values at c = 0.1 and $\Delta b = -0.1$ with Avg. % Recovery Error = 45.75. This clearly shows that our predictions hold in practice even under approximate conditions.

[Figure 3: Coherence of orthogonal and Gaussian weight matrices with varying dimensions n (m = 200).]
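The experiment above is straightforward to reproduce approximately; the sketch below (our own; a smaller N and a coarse grid so that it runs quickly) sweeps (c, ∆b) and reports the weighted error of equations 13-14.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, N, p, eps = 200, 180, 2000, 0.02, 0.1

# Incoherent W: Gaussian entries, columns orthogonalized, rows rescaled to unit length.
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))
W = Q / np.linalg.norm(Q, axis=1, keepdims=True)

H = (rng.random((N, m)) < p) * rng.uniform(0.0, 1.0, (N, m))   # BINS, f_c = uniform
X = H @ W                                                       # x = W^T h (generator with c = 1)
x_mean, h_mean = X.mean(axis=0), H.mean(axis=0)

def apre(H_hat, H, p=p, eps=eps):
    """Average Percentage Recovery Error of eqs. (13)-(14)."""
    wrong = np.abs(H_hat - H) > eps
    w = np.where(H > 0, 0.5 / p, 0.5 / (1 - p))
    return 100.0 * np.mean(w * wrong)

P = (X - x_mean) @ W.T                 # W_i^T (x - E[x]) for every sample and unit
best = min(
    (apre(np.maximum(0.0, c * P + h_mean + db), H), c, db)
    for c in np.linspace(0.1, 2.0, 20)
    for db in np.linspace(-1.0, 1.0, 21)
)
print("lowest APRE %.2f at c=%.2f, delta_b=%.2f" % best)
```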

5.2 Effect of Sparsity on Signal Recovery

We analyze the effect of the sparsity of signals on their recovery using the mechanism shown in Section 4. In order to do so, we generate incoherent matrices using three different methods: Gaussian, Xavier [7] and orthogonal [19]. All the generated weight matrices are normalized to have unit $\ell_2$ row length. Additionally, we sample signals and generate data using the same configuration as in Section 5.1; only this time, we fix c = 1 and $\Delta b = 0$, vary the hidden unit activation probability p in [0.02, 1], and duplicate the generated data while adding noise to the copy, sampled from a Gaussian distribution with mean 100 and standard deviation 0.05. According to our analysis, the noise mean should have no effect on recovery, so the mean value of 100 should not matter; only the standard deviation affects recovery. The plot of Avg. % Recovery Error vs. unit activation probability is shown in Figure 2⁵. We find that for all weight matrices, recovery error reduces with increasing sparsity (decreasing p). Additionally, we find that the recovery error is significantly lower for orthogonal weight matrices⁶, while it is identical⁷ for Gaussian and Xavier weights. Recall that Theorem 1 suggests stronger recovery for more incoherent matrices. So we plot the row coherence of $W \in \mathbb{R}^{m \times n}$ sampled from the Gaussian and orthogonal methods with m = 200 and varying $n \in [100, 300]$. The plots are shown in Figure 3. Clearly, orthogonal matrices have significantly lower coherence even though the orthogonalization is done column-wise. This explains the significantly lower recovery error for orthogonal matrices in Figure 2.

⁵ "(noise)" in brackets indicates that the generated data was corrupted with the Gaussian noise.
⁶ Notice the rows of W are not orthogonal for overcomplete filters; rather, the columns are orthogonalized, unless W is undercomplete.
⁷ The latter is because, upon weight length rescaling, Gaussian and Xavier become identical.
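The coherence comparison of Figure 3 can be sketched in a few lines (our own code; "orthogonal" means column-orthogonalized as described above, all rows rescaled to unit length, and the coherence reported is the maximum of footnote 1, so the absolute values need not match the figure's scale).

```python
import numpy as np

rng = np.random.default_rng(6)

def row_coherence(W):
    """Maximum absolute cosine similarity between distinct rows of W (footnote 1)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    G = np.abs(Wn @ Wn.T)
    np.fill_diagonal(G, 0.0)
    return G.max()

m = 200
for n in (100, 150, 200, 250, 300):
    G = rng.standard_normal((m, n))
    gauss = G / np.linalg.norm(G, axis=1, keepdims=True)
    # "Orthogonal" init: orthogonalize W^T (i.e. the columns of W), then normalize rows.
    Q, _ = np.linalg.qr(rng.standard_normal((max(m, n), min(m, n))))
    orth = Q if Q.shape == (m, n) else Q.T
    orth = orth / np.linalg.norm(orth, axis=1, keepdims=True)
    print(f"n={n}: coherence gaussian={row_coherence(gauss):.3f}, "
          f"orthogonal={row_coherence(orth):.3f}")
```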

6 Conclusion and Future Work

Our main contribution is to provide a novel perspective that views non-linear neural networks as a generative process. Specifically, if we assume that the observed data is generated by hidden layer signals, then the true hidden representation can be accurately recovered if the weight matrices are highly incoherent with unit $\ell_2$ length filters and bias vectors as described in equation 7 (Theorem 1). Additionally, recovery becomes increasingly accurate with increasing sparsity in the hidden signals. Finally, data generated from such signals (Assumption 1) has the property of being roughly uncorrelated (Theorem 2). As a result of these insights, our analysis brings together a number of independently explored approaches in the machine learning community, such as Batch Normalization, Sparse Coding and data whitening, showing them as parts of a bigger picture. However, our analysis only models the first layer of a non-linear neural network, where the input data is Gaussian-like while the hidden representation is non-negative sparse (due to the activation function). It would be interesting to analyze the conditions needed for higher layers of such networks in terms of signal recovery, where the input and output are both non-negative sparse (for most activation functions, e.g., ReLU, Sigmoid). Our study serves as a necessary step towards this goal.

References

[1] Edoardo Amaldi and Viggo Kann. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209(1):237–260, 1998.
[2] Devansh Arpit, Yingbo Zhou, Hung Ngo, and Venu Govindaraju. Why regularized auto-encoders learn sparse representation? In ICML, 2016.
[3] Devansh Arpit, Yingbo Zhou, Bhargava U. Kota, and Venu Govindaraju. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. In ICML, 2016.
[4] H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4-5):291–294, 1988.
[5] Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.
[6] Emmanuel J Candes and Terence Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? Information Theory, IEEE Transactions on, 52(12):5406–5425, 2006.
[7] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[8] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In AISTATS, pages 315–323, 2011.
[9] Mikael Henaff, Kevin Jarrett, Koray Kavukcuoglu, and Yann LeCun. Unsupervised learning of sparse features for scalable audio classification. ISMIR, 11:445, 2011.
[10] Geoffrey E Hinton. Distributed representations. 1984.
[11] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4):411–430, 2000.
[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis R. Bach and David M. Blei, editors, ICML, volume 37 of JMLR Proceedings, pages 448–456. JMLR.org, 2015.
[13] Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint arXiv:1010.3467, 2010.
[14] Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.
[15] Alireza Makhzani and Brendan Frey. k-sparse autoencoders. CoRR, abs/1312.5663, 2013.
[16] Roland Memisevic, Kishore Reddy Konda, and David Krueger. Zero-bias autoencoders and the benefits of co-adapting features. In ICLR, 2014.
[17] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[18] Andrew Ng. Sparse autoencoder. CSE294 Lecture Notes, 2011.
[19] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
[20] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.


[21] Kevin Swersky, David Buchman, Nando D Freitas, Benjamin M Marlin, et al. On autoencoders and score matching for energy based models. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1201–1208, 2011.
[22] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, pages 1096–1103, 2008.
[23] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE TPAMI, 31(2):210–227, Feb. 2009.
[24] Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, pages 1794–1801, 2009.


Appendices

A Binary Sparse Signal Analysis

We consider data generated from binary sparse signals h, i.e., h that follows BINS($p, f_c, \mu_h, l_{\max}$) where $f_c = \delta_1$ (the Dirac delta function with value $\infty$ at 1 and 0 elsewhere). Restricted Boltzmann Machines (RBM) are parametric models widely used for modelling the distribution of binary valued data. The model essentially consists of a bipartite set of nodes, one for modeling the visible units (data) and the other for modeling the hidden units (unseen part of data). These units are interconnected by weights, and the goal of an RBM is to maximize the marginal probability of the data with respect to the model parameters. The hidden units in this model are usually modeled as binary valued stochastic variables. Recently, [21] showed that the free energy of an RBM can be used to derive an auto-encoder objective. In other words, the hidden units of an RBM correspond to the hidden units of AEs. This connection motivates us to investigate whether the AE recovery mechanism is capable of recovering binary sparse signals. First we consider the noiseless case of data generation.

Theorem 3. (Noiseless Binary Signal Recovery): Let each element of h follow BINS($p, \delta_1, \mu_h, l_{\max}$) and let $\hat{h} \in \mathbb{R}^m$ be an auto-encoder signal recovery mechanism with Sigmoid activation function and bias b for a measurement vector $x \in \mathbb{R}^n$ such that $x = W^T h$. If we set $b_i = -\sum_j a_{ij}p_j\ \forall i \in [m]$, then $\forall\, \delta \in (0, 1)$,

$$\Pr\left(\frac{1}{m}\|\hat{h}-h\|_1 \leq \delta\right) \geq 1 - \sum_{i=1}^{m}\left((1-p_i)\,e^{-2\frac{(\delta'+p_i a_{ii})^2}{\sum_{j=1,j\neq i}^{m} a_{ij}^2}} + p_i\,e^{-2\frac{(\delta'+(1-p_i)a_{ii})^2}{\sum_{j=1,j\neq i}^{m} a_{ij}^2}}\right) \tag{15}$$

where $a_{ij} = W_i^TW_j$, $\delta' = \ln(\frac{\delta}{1-\delta})$ and $W_i$ is the $i$-th row of the matrix W cast as a column vector.

Analysis: We first analyse the properties of the weight matrix W that result in a strong recovery bound. Notice that the terms $(\delta' + p_i a_{ii})^2$ and $(\delta' + (1 - p_i)a_{ii})^2$ need to be as large as possible, while simultaneously the term $\sum_{j=1, j\neq i}^{m} a_{ij}^2$ needs to be as close to zero as possible. For the sake of analysis, let us set⁸ $\delta' = 0$ (achieved when $\delta = 0.5$). Then our problem reduces to maximizing the ratio
$$\frac{(a_{ii})^2}{\sum_{j=1,j\neq i}^{m} a_{ij}^2} = \frac{\|W_i\|^4}{\sum_{j=1,j\neq i}^{m}(W_i^TW_j)^2} = \frac{\|W_i\|^4}{\sum_{j=1,j\neq i}^{m}\|W_i\|^2\|W_j\|^2\cos^2\theta_{ij}}$$
where $\theta_{ij}$ is the angle between $W_i$ and $W_j$. From the property of coherence, if the rows of the weight matrix are highly incoherent, then $\cos\theta_{ij}$ is close to 0. Again, for ease of analysis, let us replace each $\cos\theta_{ij}$ with a small positive number $\epsilon$. Then
$$\frac{(a_{ii})^2}{\sum_{j\neq i} a_{ij}^2} \approx \frac{1}{\epsilon^2}\,\frac{\|W_i\|^4}{\sum_{j\neq i}\|W_i\|^2\|W_j\|^2} = \frac{1}{\epsilon^2}\,\frac{1}{\sum_{j\neq i}\|W_j\|^2/\|W_i\|^2}.$$
Finally, since we would want this term to be maximized for each hidden unit $h_i$ equally, the obvious choice for each weight length $\|W_i\|$ ($i \in [m]$) is to set it to 1.

⁸ Setting $\delta = 0.5$ is not such a bad choice after all, because for binary signals we can recover the exact true signal with high probability by simply thresholding the signal recovered by the Sigmoid.

Finally, let us analyse the bias vector. Notice that we have instantiated each element of the encoding bias $b_i$ to take the value $-\sum_j a_{ij}p_j$. Since $p_j$ is essentially the mean of each binary hidden unit $h_j$, we can say that $b_i = -\sum_j a_{ij}\mathbb{E}_{h_j}[h_j] = -W_i^TW^T\mathbb{E}_h[h] = -W_i^T\mathbb{E}_h[x]$. Signal recovery is strong for binary signals when the recovery mechanism is given by
$$\hat{h}_i := \mathrm{Sigmoid}\big(W_i^T(x - \mathbb{E}_h[x])\big) \tag{16}$$
where the rows of W are highly incoherent, each hidden weight has unit length ($\|W_i\|_2 = 1$), and each dimension of the data x is approximately uncorrelated (see Theorem 2). We now state the recovery bound for the noisy data generation scenario.

Proposition 3. (Noisy Binary Signal Recovery): Let each element of h follow BINS($p, \delta_1, \mu_h, l_{\max}$) and let $\hat{h} \in \mathbb{R}^m$ be an auto-encoder signal recovery mechanism with Sigmoid activation function and bias b for a measurement vector $x = W^T h + e$, where $e \in \mathbb{R}^n$ is any noise vector independent of h.

If we set $b_i = -\sum_j a_{ij}p_j - W_i^T\mathbb{E}_e[e]\ \forall i \in [m]$, then $\forall\, \delta \in (0, 1)$,

$$\Pr\left(\frac{1}{m}\|\hat{h}-h\|_1 \leq \delta\right) \geq 1 - \sum_{i=1}^{m}\left((1-p_i)\,e^{-2\frac{(\delta' - W_i^T(e-\mathbb{E}_e[e]) + p_i a_{ii})^2}{\sum_{j=1,j\neq i}^{m} a_{ij}^2}} + p_i\,e^{-2\frac{(\delta' - W_i^T(e-\mathbb{E}_e[e]) + (1-p_i)a_{ii})^2}{\sum_{j=1,j\neq i}^{m} a_{ij}^2}}\right) \tag{17-18}$$

where $a_{ij} = W_i^TW_j$, $\delta' = \ln(\frac{\delta}{1-\delta})$ and $W_i$ is the $i$-th row of the matrix W cast as a column vector.

Similar to the continuous signal recovery case, we have not assumed any distribution on the noise random variable e, and this term has no effect on recovery (compared to the noiseless case) if the noise distribution is orthogonal to the hidden weight vectors. Again, the same properties of W lead to better recovery as in the noiseless case. In the case of the bias, we have set each element to $b_i := -\sum_j a_{ij}p_j - W_i^T\mathbb{E}_e[e]\ \forall i \in [m]$. Notice from the definition of BINS, $\mathbb{E}_{h_j}[h_j] = p_j$. Thus in essence, $b_i = -\sum_j a_{ij}\mathbb{E}_{h_j}[h_j] - W_i^T\mathbb{E}_e[e]$. Expanding $a_{ij}$, we get $b_i = -W_i^TW^T\mathbb{E}_h[h] - W_i^T\mathbb{E}_e[e] = -W_i^T\mathbb{E}_x[x]$. Thus the expression for the bias is unaffected by the error statistics as long as we can compute the data mean.
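A quick numerical illustration of the binary recovery rule (equation 16) with the thresholding trick from footnote 8 (our own sketch; footnote 8 only says "thresholding", so the particular threshold used here, the midpoint between the nominal responses for $h_i = 0$ and $h_i = 1$, is our choice, as are the dimensions and p):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, N, p = 200, 180, 2000, 0.05

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Incoherent, unit-row-length W and binary sparse signals (BINS with f_c = delta_1).
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))
W = Q / np.linalg.norm(Q, axis=1, keepdims=True)
H = (rng.random((N, m)) < p).astype(float)
X = H @ W

# Recovery rule of eq. (16): h_i = Sigmoid(W_i^T (x - E[x])), then threshold.
H_hat = sigmoid((X - X.mean(axis=0)) @ W.T)

# Threshold halfway between the nominal "off" response Sigmoid(-p)
# and the nominal "on" response Sigmoid(1 - p)  (our choice of threshold).
thr = 0.5 * (sigmoid(-p) + sigmoid(1.0 - p))
H_bin = (H_hat > thr).astype(float)

print("bit error rate:", np.mean(H_bin != H))
```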

B Proofs

Proposition 1. Let $x_1 = W^T h$ where $x_1 \in \mathbb{R}^n$, $W \in \mathbb{R}^{m \times n}$ and $h \in \mathbb{R}^m$. Let $x_2 = W^T h + b_d$ where $b_d \in \mathbb{R}^n$ is a fixed vector. Let $\hat{h}_1 = s_e(Wx_1 + b)$ and $\hat{h}_2 = s_e(Wx_2 + b - Wb_d)$. Then $\hat{h}_1 = h$ iff $\hat{h}_2 = h$.

Proof: Let $\hat{h}_1 = h$. Thus $h = s_e(Wx_1 + b)$. On the other hand, $\hat{h}_2 = s_e(Wx_2 + b - Wb_d) = s_e(WW^T h + Wb_d + b - Wb_d) = s_e(WW^T h + b) = s_e(Wx_1 + b) = h$. The other direction can be proved similarly.

Theorem 1. Let each element of $h \in \mathbb{R}^m$ follow the BINS($p, f_c, \mu_h, l_{\max}$) distribution and let $\hat{h}_{ReLU}(x; W, b)$ be an auto-encoder signal recovery mechanism with Rectified Linear activation function (ReLU) and bias b for a measurement vector $x \in \mathbb{R}^n$ such that $x = W^T h$. If we set $b_i := -\sum_j a_{ij}p_j\mu_{h_j}\ \forall i \in [m]$, then $\forall\, \delta \geq 0$,

$$\Pr\left(\frac{1}{m}\|\hat{h}-h\|_1 \leq \delta\right) \geq 1 - \sum_{i=1}^{m}\left( e^{-2\frac{\left(\delta + \sum_j (1-p_j)(l_{\max_j}-2p_j\mu_{h_j})\max(0,a_{ij})\right)^2}{\sum_j a_{ij}^2 l_{\max_j}^2}} + e^{-2\frac{\left(\delta + \sum_j (1-p_j)(l_{\max_j}-2p_j\mu_{h_j})\max(0,-a_{ij})\right)^2}{\sum_j a_{ij}^2 l_{\max_j}^2}} \right) \tag{19}$$

where the $a_i$'s are vectors such that

$$a_{ij} = \begin{cases} W_i^T W_j & \text{if } i \neq j \\ W_i^T W_i - 1 & \text{if } i = j \end{cases} \tag{20}$$

and $W_i$ is the $i$-th row of the matrix W cast as a column vector.

Proof. From Definition 1 and the definition of $a_{ij}$ above,
$$\hat{h}_i = \max\Big\{0, \sum_j a_{ij}h_j + h_i + b_i\Big\}, \qquad \hat{h}_i - h_i = \max\Big\{-h_i, \sum_j a_{ij}h_j + b_i\Big\} \tag{21}$$
Let $z_i = \sum_j a_{ij}h_j + b_i$. Thus $\hat{h}_i - h_i = \max\{-h_i, z_i\}$. Then, conditioning upon $z_i$,
$$\Pr(|\hat{h}_i - h_i| \leq \delta) = \Pr\big(|\hat{h}_i - h_i| \leq \delta \,\big|\, h_i > 0, |z_i| \leq \delta\big)\Pr(|z_i| \leq \delta, h_i > 0) \tag{22}$$
$$\quad + \Pr\big(|\hat{h}_i - h_i| \leq \delta \,\big|\, h_i > 0, |z_i| > \delta\big)\Pr(|z_i| > \delta, h_i > 0) \tag{23}$$
$$\quad + \Pr\big(|\hat{h}_i - h_i| \leq \delta \,\big|\, h_i = 0, |z_i| \leq \delta\big)\Pr(|z_i| \leq \delta, h_i = 0) \tag{24}$$
$$\quad + \Pr\big(|\hat{h}_i - h_i| \leq \delta \,\big|\, h_i = 0, |z_i| > \delta\big)\Pr(|z_i| > \delta, h_i = 0) \tag{25}$$
Since $\Pr\big(|\hat{h}_i - h_i| \leq \delta \,\big|\, |z_i| \leq \delta\big) = 1$, we have
$$\Pr(|\hat{h}_i - h_i| \leq \delta) \geq \Pr(|z_i| \leq \delta) \tag{26}$$
The above inequality is obtained by ignoring the positive terms that depend on the condition $|z_i| > \delta$ and marginalizing over $h_i$. For any $t > 0$, using Chernoff's inequality,
$$\Pr(z_i \geq \delta) \leq \frac{\mathbb{E}_h[e^{tz_i}]}{e^{t\delta}} \tag{27}$$
Setting $b_i = -\sum_j a_{ij}\mu_j$, where $\mu_j = \mathbb{E}_{h_j}[h_j] = p_j\mu_{h_j}$,
$$\Pr(z_i \geq \delta) \leq \frac{\mathbb{E}_h\big[e^{t\sum_j a_{ij}(h_j - \mu_j)}\big]}{e^{t\delta}} = \frac{\mathbb{E}_h\big[\prod_j e^{ta_{ij}(h_j - \mu_j)}\big]}{e^{t\delta}} = \frac{\prod_j \mathbb{E}_{h_j}\big[e^{ta_{ij}(h_j - \mu_j)}\big]}{e^{t\delta}} \tag{28}$$
Let $T_j = \mathbb{E}_{h_j}\big[e^{ta_{ij}(h_j - \mu_j)}\big]$. Then,
$$T_j = (1 - p_j)e^{-ta_{ij}\mu_j} + p_j\,\mathbb{E}_{v\sim f_c(0^+, l_{\max_j}, \mu_{h_j})}\big[e^{ta_{ij}(v - \mu_j)}\big] \tag{29}$$
where $f_c(a, b, \mu_h)$ denotes any arbitrary distribution in the interval $(a, b]$ with mean $\mu_h$. If $a_{ij} \geq 0$, let $\alpha = -\mu_j$ and $\beta = l_{\max_j} - \mu_j$, which are the lower and upper bounds of $h_j - \mu_j$. Then,
$$T_j = (1 - p_j)e^{ta_{ij}\alpha} + p_j\,\mathbb{E}_{v\sim f_c(0^+, l_{\max_j}, \mu_{h_j})}\big[e^{ta_{ij}(v - \mu_j)}\big] \tag{30}$$
$$\leq (1 - p_j)e^{ta_{ij}\alpha} + p_j\,\mathbb{E}_v\left[\frac{\beta - (v - \mu_j)}{\beta - \alpha}e^{ta_{ij}\alpha} + \frac{(v - \mu_j) - \alpha}{\beta - \alpha}e^{ta_{ij}\beta}\right] \tag{31}$$
$$= (1 - p_j)e^{ta_{ij}\alpha} + p_j\frac{\beta - (1 - p_j)\mu_{h_j}}{\beta - \alpha}e^{ta_{ij}\alpha} + p_j\frac{(1 - p_j)\mu_{h_j} - \alpha}{\beta - \alpha}e^{ta_{ij}\beta} \tag{32}$$
$$= (1 - p_j)e^{ta_{ij}\alpha} + \frac{p_j\beta e^{ta_{ij}\alpha}}{\beta - \alpha} - \frac{p_j(1 - p_j)\mu_{h_j}}{\beta - \alpha}\big(e^{ta_{ij}\alpha} - e^{ta_{ij}\beta}\big) - \frac{p_j\alpha}{\beta - \alpha}e^{ta_{ij}\beta} \tag{33}$$
where the first inequality above follows from the convexity of the exponential function. Define $u = ta_{ij}(\beta - \alpha)$ and $\gamma = -\frac{\alpha}{\beta - \alpha}$. Then,
$$T_j \leq e^{-u\gamma}\left(1 - p_j + \frac{p_j\beta}{\beta - \alpha} - \frac{p_j(1 - p_j)\mu_{h_j}}{\beta - \alpha}(1 - e^u) - \frac{p_j\alpha}{\beta - \alpha}e^u\right) \tag{34}$$
$$= e^{-u\gamma}\left(1 + \frac{p_j\alpha}{\beta - \alpha} - \frac{p_j(1 - p_j)\mu_{h_j}}{\beta - \alpha} - \left(\frac{p_j\alpha}{\beta - \alpha} - \frac{p_j(1 - p_j)\mu_{h_j}}{\beta - \alpha}\right)e^u\right) \tag{35}$$
$$= e^{-u\gamma}\left(1 - \left(p_j\gamma + \frac{p_j(1 - p_j)\mu_{h_j}}{\beta - \alpha}\right) + \left(p_j\gamma + \frac{p_j(1 - p_j)\mu_{h_j}}{\beta - \alpha}\right)e^u\right) \tag{36}$$
Define $\phi = p_j\gamma + \frac{p_j(1 - p_j)\mu_{h_j}}{\beta - \alpha}$ and let $e^{g(u)} := e^{-u\gamma}(1 - \phi + \phi e^u)$ denote the above upper bound on $T_j$. Then $g'(u) = -\gamma + \frac{\phi e^u}{1 - \phi + \phi e^u}$, and
$$g(u) = -u\gamma + \ln(1 - \phi + \phi e^u) \implies g(0) = 0 \tag{38}$$
$$g'(0) = -\gamma + \phi = -\gamma(1 - p_j) + \frac{p_j(1 - p_j)\mu_{h_j}}{\beta - \alpha} \tag{39}$$
$$g''(u) = \frac{\phi(1 - \phi)e^u}{(1 - \phi + \phi e^u)^2} \tag{40}$$
$$g'''(u) = \frac{\phi(1 - \phi)(1 - \phi + \phi e^u)e^u(1 - \phi - \phi e^u)}{(1 - \phi + \phi e^u)^4} \tag{41}$$
Thus, to find the maximum of $g''(u)$, we set $g'''(u) = 0$, which implies $1 - \phi - \phi e^u = 0$, i.e., $e^u = \frac{1 - \phi}{\phi}$. Substituting this u into $g''(u)$ gives $g''(u) \leq 1/4$. By Taylor's theorem, $\exists c \in [0, u]\ \forall u > 0$ such that
$$g(u) = g(0) + ug'(0) + \frac{u^2}{2}g''(c) \leq 0 - u\gamma(1 - p_j) + \frac{u\,p_j(1 - p_j)\mu_{h_j}}{\beta - \alpha} + u^2/8 \tag{42}$$
Thus we can upper bound $T_j$ as
$$T_j \leq e^{u^2/8 - u\left(\gamma(1 - p_j) - \frac{p_j(1 - p_j)\mu_{h_j}}{\beta - \alpha}\right)} = e^{t^2a_{ij}^2(\beta - \alpha)^2/8 + ta_{ij}(\beta - \alpha)\left(\frac{\alpha(1 - p_j)}{\beta - \alpha} + \frac{p_j(1 - p_j)\mu_{h_j}}{\beta - \alpha}\right)} \tag{43}$$
Substituting for $\alpha$ and $\beta$, we get
$$T_j \leq e^{t^2a_{ij}^2 l_{\max_j}^2/8 + ta_{ij}(1 - p_j)(-\mu_j + p_j\mu_{h_j})} = e^{\frac{t^2a_{ij}^2 l_{\max_j}^2}{8}} \tag{44}$$
On the other hand, if $a_{ij} < 0$, we can set $\alpha = \mu_j - l_{\max_j}$ and $\beta = \mu_j$, and proceeding as in equation 30 we get
$$T_j \leq e^{t^2a_{ij}^2 l_{\max_j}^2/8 + t|a_{ij}|(1 - p_j)(\mu_j - l_{\max_j} + p_j\mu_{h_j})} = e^{\frac{t^2a_{ij}^2 l_{\max_j}^2}{8} - t|a_{ij}|(1 - p_j)(l_{\max_j} - 2p_j\mu_{h_j})} \tag{45}$$
Collectively, we can then write $\Pr(z_i \geq \delta)$ as
$$\Pr(z_i \geq \delta) \leq \frac{\prod_j T_j}{e^{t\delta}} = e^{t^2\sum_j a_{ij}^2 l_{\max_j}^2/8 - t\left(\delta + \sum_j(1 - p_j)(l_{\max_j} - 2p_j\mu_{h_j})\max(0, -a_{ij})\right)} \tag{46}$$
We similarly bound $\Pr(-z_i \geq \delta)$ by effectively flipping the signs of the $a_{ij}$'s,
$$\Pr(-z_i \geq \delta) \leq e^{t^2\sum_j a_{ij}^2 l_{\max_j}^2/8 - t\left(\delta + \sum_j(1 - p_j)(l_{\max_j} - 2p_j\mu_{h_j})\max(0, a_{ij})\right)} \tag{47}$$
Minimizing both 46 and 47 with respect to t and applying the union bound, we get, for all $i \in [m]$,
$$\Pr(|\hat{h}_i - h_i| \geq \delta) \leq e^{-2\frac{\left(\delta + \sum_j(1 - p_j)(l_{\max_j} - 2p_j\mu_{h_j})\max(0, a_{ij})\right)^2}{\sum_j a_{ij}^2 l_{\max_j}^2}} + e^{-2\frac{\left(\delta + \sum_j(1 - p_j)(l_{\max_j} - 2p_j\mu_{h_j})\max(0, -a_{ij})\right)^2}{\sum_j a_{ij}^2 l_{\max_j}^2}} \tag{48-49}$$
Since the above bound holds for all $i \in [m]$, applying the union bound over all the units yields the desired result.

Proposition 2. Let each element of h follow the BINS($p, f_c, \mu_h, l_{\max}$) distribution and let $\hat{h} \in \mathbb{R}^m$ be an auto-encoder signal recovery mechanism with Rectified Linear activation function and bias b for a measurement vector $x \in \mathbb{R}^n$ such that $x = W^T h + e$, where e is any noise random vector independent of h. If we set $b_i := -\sum_j a_{ij}p_j\mu_{h_j} - W_i^T\mathbb{E}_e[e]\ \forall i \in [m]$, then $\forall\, \delta \geq 0$,

$$\Pr\left(\frac{1}{m}\|\hat{h}-h\|_1 \leq \delta\right) \geq 1 - \sum_{i=1}^{m}\left( e^{-2\frac{\left(\delta - W_i^T(e - \mathbb{E}_e[e]) + \sum_j (1-p_j)(l_{\max_j}-2p_j\mu_{h_j})\max(0,a_{ij})\right)^2}{\sum_j a_{ij}^2 l_{\max_j}^2}} + e^{-2\frac{\left(\delta - W_i^T(e - \mathbb{E}_e[e]) + \sum_j (1-p_j)(l_{\max_j}-2p_j\mu_{h_j})\max(0,-a_{ij})\right)^2}{\sum_j a_{ij}^2 l_{\max_j}^2}} \right) \tag{50}$$

where the $a_i$'s are vectors such that

$$a_{ij} = \begin{cases} W_i^T W_j & \text{if } i \neq j \\ W_i^T W_i - 1 & \text{if } i = j \end{cases} \tag{51}$$

and $W_i$ is the $i$-th row of the matrix W cast as a column vector.

Proof. Recall that
$$\hat{h}_i = \max\Big\{0, \sum_j a_{ij}h_j + h_i + W_i^Te + b_i\Big\} \tag{52}$$
$$\hat{h}_i - h_i = \max\Big\{-h_i, \sum_j a_{ij}h_j + W_i^Te + b_i\Big\} \tag{53}$$
Let $z_i = \sum_j a_{ij}h_j + b_i + W_i^Te$. Then, similarly to Theorem 1, conditioning upon $z_i$,
$$\Pr(|\hat{h}_i - h_i| \leq \delta) = \Pr\big(|\hat{h}_i - h_i| \leq \delta \,\big|\, h_i > 0, |z_i| \leq \delta\big)\Pr(|z_i| \leq \delta, h_i > 0) \tag{54}$$
$$\quad + \Pr\big(|\hat{h}_i - h_i| \leq \delta \,\big|\, h_i > 0, |z_i| > \delta\big)\Pr(|z_i| > \delta, h_i > 0) \tag{55}$$
$$\quad + \Pr\big(|\hat{h}_i - h_i| \leq \delta \,\big|\, h_i = 0, |z_i| \leq \delta\big)\Pr(|z_i| \leq \delta, h_i = 0) \tag{56}$$
$$\quad + \Pr\big(|\hat{h}_i - h_i| \leq \delta \,\big|\, h_i = 0, |z_i| > \delta\big)\Pr(|z_i| > \delta, h_i = 0) \tag{57}$$
Since $\Pr\big(|\hat{h}_i - h_i| \leq \delta \,\big|\, |z_i| \leq \delta\big) = 1$, we have
$$\Pr(|\hat{h}_i - h_i| \leq \delta) \geq \Pr(|z_i| \leq \delta) \tag{58}$$
For any $t > 0$, using Chernoff's inequality for the random variable h,
$$\Pr(z_i \geq \delta) \leq \frac{\mathbb{E}_h[e^{tz_i}]}{e^{t\delta}} \tag{59}$$
Setting $b_i = -\sum_j a_{ij}\mu_j - W_i^T\mathbb{E}_e[e]$, where $\mu_j = \mathbb{E}_{h_j}[h_j] = p_j\mu_{h_j}$,
$$\Pr(z_i \geq \delta) \leq \frac{\mathbb{E}_h\big[e^{t\sum_j a_{ij}(h_j - \mu_j)}\big]}{e^{t\delta - tW_i^T(e - \mathbb{E}_e[e])}} = \frac{\prod_j \mathbb{E}_{h_j}\big[e^{ta_{ij}(h_j - \mu_j)}\big]}{e^{t\delta - tW_i^T(e - \mathbb{E}_e[e])}} \tag{60}$$
Setting $\bar{\delta} := \delta - W_i^T(e - \mathbb{E}_e[e])$, we can rewrite the above inequality as
$$\Pr(z_i \geq \delta) \leq \frac{\prod_j \mathbb{E}_{h_j}\big[e^{ta_{ij}(h_j - \mu_j)}\big]}{e^{t\bar{\delta}}} \tag{61}$$
Since the above inequality becomes identical to equation 28, the rest of the proof is similar to Theorem 1.

Theorem 2. (Uncorrelated Distribution Bound): If data is generated as $x = W^T h$, where $h \in \mathbb{R}^m$ has covariance matrix $\mathrm{diag}(\zeta)$ ($\zeta \in \mathbb{R}_+^m$), and $W \in \mathbb{R}^{m \times n}$ ($m > n$) is such that each row of W has unit length and the rows of W are maximally incoherent, then the covariance matrix of the generated data is approximately spherical (uncorrelated), satisfying
$$\min_\alpha\|\Sigma - \alpha I\|_F \leq \sqrt{\frac{1}{n}\big(m\|\zeta\|_2^2 - \|\zeta\|_1^2\big)} \tag{62}$$
where $\Sigma = \mathbb{E}_x[(x - \mathbb{E}_x[x])(x - \mathbb{E}_x[x])^T]$ is the covariance matrix of the generated data.

Proof. Notice that
$$\mathbb{E}_x[x] = W^T\mathbb{E}_h[h] \tag{63}$$
Thus,
$$\mathbb{E}_x[(x - \mathbb{E}_x[x])(x - \mathbb{E}_x[x])^T] = \mathbb{E}_h[(W^Th - W^T\mathbb{E}_h[h])(W^Th - W^T\mathbb{E}_h[h])^T] \tag{64}$$
$$= \mathbb{E}_h[W^T(h - \mathbb{E}_h[h])(h - \mathbb{E}_h[h])^TW] \tag{65}$$
$$= W^T\mathbb{E}_h[(h - \mathbb{E}_h[h])(h - \mathbb{E}_h[h])^T]W \tag{66}$$
Substituting the covariance of h as $\mathrm{diag}(\zeta)$,
$$\Sigma = \mathbb{E}_x[(x - \mathbb{E}_x[x])(x - \mathbb{E}_x[x])^T] = W^T\mathrm{diag}(\zeta)W \tag{67}$$
Thus,
$$\|\Sigma - \alpha I\|_F^2 = \mathrm{tr}\big((W^T\mathrm{diag}(\zeta)W - \alpha I)(W^T\mathrm{diag}(\zeta)W - \alpha I)^T\big) \tag{68}$$
$$= \mathrm{tr}\big(W^T\mathrm{diag}(\zeta)WW^T\mathrm{diag}(\zeta)W + \alpha^2I - 2\alpha W^T\mathrm{diag}(\zeta)W\big) \tag{69}$$
Using the cyclic property of the trace,
$$\|\Sigma - \alpha I\|_F^2 = \mathrm{tr}\big(WW^T\mathrm{diag}(\zeta)WW^T\mathrm{diag}(\zeta) + \alpha^2I - 2\alpha WW^T\mathrm{diag}(\zeta)\big) \tag{70}$$
$$= \|WW^T\mathrm{diag}(\zeta)\|_F^2 + \alpha^2 n - 2\alpha\sum_{i=1}^{m}\zeta_i \tag{71}$$
$$\leq \Big(\sum_{i=1}^{m}\zeta_i^2\Big)\big(1 + \mu^2(m - 1)\big) + \alpha^2 n - 2\alpha\sum_{i=1}^{m}\zeta_i \tag{72}$$
Finally, minimizing with respect to $\alpha$, we get $\alpha^* = \frac{1}{n}\sum_{i=1}^{m}\zeta_i$. Substituting this into the above inequality, we get
$$\min_\alpha\|\Sigma - \alpha I\|_F^2 \leq \Big(\sum_{i=1}^{m}\zeta_i^2\Big)\big(1 + \mu^2(m - 1)\big) + \frac{1}{n}\Big(\sum_{i=1}^{m}\zeta_i\Big)^2 - \frac{2}{n}\Big(\sum_{i=1}^{m}\zeta_i\Big)^2 \tag{73}$$
$$= \Big(\sum_{i=1}^{m}\zeta_i^2\Big)\big(1 + \mu^2(m - 1)\big) - \frac{1}{n}\Big(\sum_{i=1}^{m}\zeta_i\Big)^2 \tag{74}$$
Since the weight matrix is maximally incoherent, by the Welch bound we have $\mu \in \left[\sqrt{\frac{m-n}{n(m-1)}},\, 1\right]$. Plugging the lower bound on $\mu$ (maximal incoherence) for any fixed m and n into the above bound yields
$$\min_\alpha\|\Sigma - \alpha I\|_F^2 \leq \Big(\sum_{i=1}^{m}\zeta_i^2\Big)\Big(1 + \frac{m-n}{n(m-1)}(m - 1)\Big) - \frac{1}{n}\Big(\sum_{i=1}^{m}\zeta_i\Big)^2 \tag{76}$$
$$= \Big(\sum_{i=1}^{m}\zeta_i^2\Big)\Big(1 + \frac{m-n}{n}\Big) - \frac{1}{n}\Big(\sum_{i=1}^{m}\zeta_i\Big)^2 \tag{77}$$
$$= \frac{1}{n}\big(m\|\zeta\|_2^2 - \|\zeta\|_1^2\big) \tag{78}$$

Theorem 3. Let each element of h follow BINS($p, \delta_1, \mu_h, l_{\max}$) and let $\hat{h} \in \mathbb{R}^m$ be an auto-encoder signal recovery mechanism with Sigmoid activation function and bias b for a measurement vector $x \in \mathbb{R}^n$ such that $x = W^T h$. If we set $b_i = -\sum_j a_{ij}p_j\ \forall i \in [m]$, then $\forall\, \delta \in (0, 1)$,
$$\Pr\left(\frac{1}{m}\|\hat{h}-h\|_1 \leq \delta\right) \geq 1 - \sum_{i=1}^{m}\left((1-p_i)e^{-2\frac{(\delta'+p_i a_{ii})^2}{\sum_{j=1,j\neq i}^{m} a_{ij}^2}} + p_i e^{-2\frac{(\delta'+(1-p_i)a_{ii})^2}{\sum_{j=1,j\neq i}^{m} a_{ij}^2}}\right) \tag{79}$$
where $a_{ij} = W_i^TW_j$, $\delta' = \ln(\frac{\delta}{1-\delta})$ and $W_i$ is the $i$-th row of the matrix W cast as a column vector.

Proof. Notice that
$$\Pr(|\hat{h}_i - h_i| \geq \delta) = \Pr(|\hat{h}_i - h_i| \geq \delta \mid h_i = 0)\Pr(h_i = 0) + \Pr(|\hat{h}_i - h_i| \geq \delta \mid h_i = 1)\Pr(h_i = 1) \tag{80}$$
and from Definition 1,
$$\hat{h}_i = \sigma\Big(\sum_j a_{ij}h_j + b_i\Big) \tag{81}$$
Thus,
$$\Pr(|\hat{h}_i - h_i| \geq \delta) = (1 - p_i)\Pr\Big(\sigma\Big(\sum_j a_{ij}h_j + b_i\Big) \geq \delta \,\Big|\, h_i = 0\Big) + p_i\Pr\Big(\sigma\Big(-\sum_j a_{ij}h_j - b_i\Big) \geq \delta \,\Big|\, h_i = 1\Big) \tag{82}$$
Notice that $\Pr(\sigma(\sum_j a_{ij}h_j + b_i) \geq \delta \mid h_i = 0) = \Pr(\sum_j a_{ij}h_j + b_i \geq \ln(\frac{\delta}{1-\delta}) \mid h_i = 0)$. Let $z_i = \sum_j a_{ij}h_j + b_i$ and $\delta' = \ln(\frac{\delta}{1-\delta})$. Then, setting $b_i = -\mathbb{E}_h[\sum_j a_{ij}h_j] = -\sum_j a_{ij}p_j$ and using Chernoff's inequality, for any $t > 0$,
$$\Pr(z_i \geq \delta' \mid h_i = 0) \leq \frac{\mathbb{E}_h[e^{tz_i}]}{e^{t\delta'}} = \frac{\mathbb{E}_h\big[e^{t\sum_{j\neq i}a_{ij}(h_j - p_j) - tp_i a_{ii}}\big]}{e^{t\delta'}} = \frac{\prod_{j\neq i}\mathbb{E}_{h_j}\big[e^{ta_{ij}(h_j - p_j)}\big]}{e^{t(\delta' + p_i a_{ii})}} \tag{83}$$
Let $T_j = \mathbb{E}_{h_j}\big[e^{ta_{ij}(h_j - p_j)}\big]$. Then,
$$T_j = (1 - p_j)e^{-tp_ja_{ij}} + p_je^{t(1-p_j)a_{ij}} = e^{-tp_ja_{ij}}(1 - p_j + p_je^{ta_{ij}}) \tag{84}$$
Let $e^{g(t)} := T_j$. Thus,
$$g(t) = -tp_ja_{ij} + \ln(1 - p_j + p_je^{ta_{ij}}) \implies g(0) = 0 \tag{85}$$
$$g'(t) = -p_ja_{ij} + \frac{p_ja_{ij}e^{ta_{ij}}}{1 - p_j + p_je^{ta_{ij}}} \implies g'(0) = 0 \tag{86}$$
$$g''(t) = \frac{p_j(1 - p_j)a_{ij}^2e^{ta_{ij}}}{(1 - p_j + p_je^{ta_{ij}})^2} \tag{87}$$
$$g'''(t) = \frac{p_j(1 - p_j)a_{ij}^3e^{ta_{ij}}(1 - p_j + p_je^{ta_{ij}})(1 - p_j - p_je^{ta_{ij}})}{(1 - p_j + p_je^{ta_{ij}})^4} \tag{88}$$
Setting $g'''(t) = 0$, we get $t^* = \frac{1}{a_{ij}}\ln(\frac{1-p_j}{p_j})$. Thus $g''(t) \leq g''(t^*) = \frac{a_{ij}^2}{4}$. By Taylor's theorem, $\exists c \in [0, t]\ \forall t > 0$ such that
$$g(t) = g(0) + tg'(0) + \frac{t^2}{2}g''(c) \leq \frac{t^2a_{ij}^2}{8} \tag{90}$$
Thus we can upper bound $T_j$ as
$$T_j \leq e^{\frac{t^2a_{ij}^2}{8}} \tag{91}$$
Hence we can write $\Pr(z_i \geq \delta')$ as
$$\Pr(z_i \geq \delta') \leq \frac{\prod_{j\neq i}T_j}{e^{t(\delta' + a_{ii}p_i)}} = \frac{\prod_{j\neq i}e^{\frac{t^2a_{ij}^2}{8}}}{e^{t(\delta' + a_{ii}p_i)}} = e^{\frac{t^2}{8}\sum_{j\neq i}a_{ij}^2 - t(a_{ii}p_i + \delta')} \tag{92}$$
On the other hand, notice $\Pr(\sigma(-\sum_j a_{ij}h_j - b_i) \geq \delta \mid h_i = 1) = \Pr(-\sum_j a_{ij}h_j - b_i \geq \ln(\frac{\delta}{1-\delta}) \mid h_i = 1) = \Pr(-z_i \geq \delta' \mid h_i = 1)$.
$$\Pr(-z_i \geq \delta' \mid h_i = 1) \leq \frac{\mathbb{E}_h[e^{-tz_i}]}{e^{t\delta'}} = \frac{\mathbb{E}_h\big[e^{-t\sum_{j\neq i}a_{ij}(h_j - p_j) - t(1-p_i)a_{ii}}\big]}{e^{t\delta'}} = \frac{\prod_{j\neq i}\mathbb{E}_{h_j}\big[e^{-ta_{ij}(h_j - p_j)}\big]}{e^{t(\delta' + (1-p_i)a_{ii})}} \tag{93-95}$$
Let $T_j = \mathbb{E}_{h_j}\big[e^{-ta_{ij}(h_j - p_j)}\big]$. Then we can similarly bound $\Pr(-z_i \geq \delta')$ by effectively flipping the signs of the $a_{ij}$'s in the previous derivation,
$$\Pr(-z_i \geq \delta') \leq \frac{\prod_{j\neq i}T_j}{e^{t(\delta' + a_{ii}(1-p_i))}} = \frac{\prod_{j\neq i}e^{\frac{t^2a_{ij}^2}{8}}}{e^{t(\delta' + a_{ii}(1-p_i))}} = e^{\frac{t^2}{8}\sum_{j\neq i}a_{ij}^2 - t(a_{ii}(1-p_i) + \delta')} \tag{96}$$
Minimizing both 92 and 96 with respect to t and applying the union bound, we get
$$\Pr(|\hat{h}_i - h_i| \geq \delta) \leq (1 - p_i)e^{\frac{-2(a_{ii}p_i + \delta')^2}{\sum_{j\neq i}a_{ij}^2}} + p_ie^{\frac{-2(a_{ii}(1-p_i) + \delta')^2}{\sum_{j\neq i}a_{ij}^2}} \tag{97}$$
Since the above bound holds for all $i \in [m]$, applying the union bound over all the units yields the desired result.

Proposition 3. Let each element of h follow BINS($p, \delta_1, \mu_h, l_{\max}$) and let $\hat{h} \in \mathbb{R}^m$ be an auto-encoder signal recovery mechanism with Sigmoid activation function and bias b for a measurement vector $x = W^Th + e$, where $e \in \mathbb{R}^n$ is any noise vector independent of h. If we set $b_i = -\sum_j a_{ij}p_j - W_i^T\mathbb{E}_e[e]\ \forall i \in [m]$, then $\forall\, \delta \in (0, 1)$,
$$\Pr\left(\frac{1}{m}\|\hat{h}-h\|_1 \leq \delta\right) \geq 1 - \sum_{i=1}^{m}\left((1-p_i)e^{-2\frac{(\delta' - W_i^T(e-\mathbb{E}_e[e]) + p_i a_{ii})^2}{\sum_{j=1,j\neq i}^{m} a_{ij}^2}} + p_ie^{-2\frac{(\delta' - W_i^T(e-\mathbb{E}_e[e]) + (1-p_i)a_{ii})^2}{\sum_{j=1,j\neq i}^{m} a_{ij}^2}}\right) \tag{98-99}$$
where $a_{ij} = W_i^TW_j$, $\delta' = \ln(\frac{\delta}{1-\delta})$ and $W_i$ is the $i$-th row of the matrix W cast as a column vector.

Proof. Notice that
$$\Pr(|\hat{h}_i - h_i| \geq \delta) = \Pr(|\hat{h}_i - h_i| \geq \delta \mid h_i = 0)\Pr(h_i = 0) + \Pr(|\hat{h}_i - h_i| \geq \delta \mid h_i = 1)\Pr(h_i = 1) \tag{100-101}$$
and from Definition 1,
$$\hat{h}_i = \sigma\Big(\sum_j a_{ij}h_j + b_i + W_i^Te\Big) \tag{102}$$
Thus,
$$\Pr(|\hat{h}_i - h_i| \geq \delta) = (1 - p_i)\Pr\Big(\sigma\Big(\sum_j a_{ij}h_j + b_i + W_i^Te\Big) \geq \delta \,\Big|\, h_i = 0\Big) + p_i\Pr\Big(\sigma\Big(-\sum_j a_{ij}h_j - b_i - W_i^Te\Big) \geq \delta \,\Big|\, h_i = 1\Big) \tag{103}$$
Notice that $\Pr(\sigma(\sum_j a_{ij}h_j + b_i + W_i^Te) \geq \delta \mid h_i = 0) = \Pr(\sum_j a_{ij}h_j + b_i + W_i^Te \geq \ln(\frac{\delta}{1-\delta}) \mid h_i = 0)$. Let $z_i = \sum_j a_{ij}h_j + b_i + W_i^Te$ and $\delta' = \ln(\frac{\delta}{1-\delta})$. Then, setting $b_i = -\mathbb{E}_h[\sum_j a_{ij}h_j] - W_i^T\mathbb{E}_e[e] = -\sum_j a_{ij}p_j - W_i^T\mathbb{E}_e[e]$ and using Chernoff's inequality on the random variable h, for any $t > 0$,
$$\Pr(z_i \geq \delta' \mid h_i = 0) \leq \frac{\mathbb{E}_h[e^{tz_i}]}{e^{t\delta' - tW_i^T(e - \mathbb{E}_e[e])}} = \frac{\mathbb{E}_h\big[e^{t\sum_{j\neq i}a_{ij}(h_j - p_j) - tp_i a_{ii}}\big]}{e^{t\delta' - tW_i^T(e - \mathbb{E}_e[e])}} = \frac{\prod_{j\neq i}\mathbb{E}_{h_j}\big[e^{ta_{ij}(h_j - p_j)}\big]}{e^{t(\delta' - W_i^T(e - \mathbb{E}_e[e]) + p_i a_{ii})}} \tag{104}$$
Setting $\bar{\delta} := \delta' - W_i^T(e - \mathbb{E}_e[e])$, we can rewrite the above inequality as
$$\Pr(z_i \geq \delta' \mid h_i = 0) \leq \frac{\prod_{j\neq i}\mathbb{E}_{h_j}\big[e^{ta_{ij}(h_j - p_j)}\big]}{e^{t(\bar{\delta} + p_i a_{ii})}} \tag{105}$$
Since the above inequality becomes identical to equation 83, the rest of the proof is similar to Theorem 3.