Final report (Elaborato finale) for the MASTER IN TECNOLOGIE BIOINFORMATICHE APPLICATE ALLA MEDICINA PERSONALIZZATA

Enrico Pieroni, CRS4 Bioinformatics. Tutor: Alberto de la Fuente, RAGNO Group (Reverse-engineering and Analysis of Genome-scale NetwOrks), CRS4 Bioinformatics.

Academic year 2005/2006

Digging in a directed biological network
Characterization of the structural properties of the yeast gene network

Introduction

We analysed a Gene Network1 of S. cerevisiae (baker's yeast) inferred by integration of expression and genotype data. While some basic features are similar to those derived for the yeast Transcription Regulatory Network2, other local and global features are quite different, ranging from the bow-tie structure to the overrepresented patterns. These features can highlight hidden and yet unknown functional paths through genes, and are potentially useful to infer functional importance. Many researchers have also tried to set up rules capable of growing networks that mimic existing biological networks (e.g. food webs, metabolic, protein and gene networks), thus providing insight into their biological development. In the second part of this work we propose and analyse a set of rules capable of growing networks with properties close to the yeast gene network under analysis.

Results

The yeast network

The yeast gene network can be represented as a simple data structure: a two-column table, with the source node (regulator gene) in the first column and the target node (regulated gene) in the second one:

yeast : ,1  yeast : , 2 . Every

link is equally weighted, and we can assume the weight is simply 1. The table yeast can be seen as the standard way to represent the sparse matrix called adjacency matrix, that is simply a matrix with 1 at row t (as target) and column s (as source) if there exists a link connecting node s to node t, and 0 otherwise:

{

Ats = 1 0 1 2

if ∃ the entry t , s in yeast table otherwise

}
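The two-column table and its sparse adjacency matrix can be sketched in a few lines of Python (the original analysis was done in Matlab); the gene names and edges below are hypothetical toy data:

```python
# Minimal sketch: build the sparse adjacency structure A_ts from a
# two-column (source, target) edge table, as described above.
from collections import defaultdict

edges = [("g1", "g2"), ("g1", "g3"), ("g2", "g3")]  # (source, target) rows

A = defaultdict(set)          # A[t] = set of sources: A_ts = 1 iff s in A[t]
for s, t in edges:
    A[t].add(s)

def a(t, s):
    """Adjacency matrix entry A_ts."""
    return 1 if s in A[t] else 0

print(a("g3", "g1"), a("g1", "g3"))  # 1 0
```

Storing only the nonzero entries is what makes this representation practical for a matrix as sparse as the yeast one.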

GNs are directed, and every edge represents a causal influence between gene activities, including both regulation of transcription (by transcription factors) and influences through signal-transduction or metabolic activities. TRNs have a directed edge if a product of the source gene experimentally binds to the promoter region of the target gene.

Since the network is directed, the adjacency matrix is not symmetric, and it will generally not be possible to rearrange it as a lower triangular matrix, due to the presence of cycles. It is then immediate to span the whole range of nodes and build the three simplest but very important functions (1-3):

● the joint in-out degree distribution, P_ij, that is the probability to have a node with i incoming edges and j outgoing edges;
● the in degree distribution, P_i^in, that is the probability that a node has i incoming edges;
● the out degree distribution, P_j^out, that is the probability that a node has j outgoing edges.

The mean in and out degree may then be evaluated as usual:

⟨i⟩ = Σ_ij P_ij i = Σ_i P_i^in i
⟨j⟩ = Σ_ij P_ij j = Σ_j P_j^out j

Every edge contributes to the network with 1 in and 1 out degree, so the net degree of the network should be zero:

Σ_ij P_ij (i − j) = 0  ⇒  ⟨i⟩ = ⟨j⟩.
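These distributions are straightforward to compute from the edge table; a Python sketch on a hypothetical toy edge list (the real analysis was done in Matlab), checking the identity ⟨i⟩ = ⟨j⟩:

```python
# Sketch: empirical in and out degree distributions P_i_in, P_j_out from
# a toy edge list, verifying that the mean in and out degrees coincide.
from collections import Counter

edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
nodes = {n for e in edges for n in e}

indeg = Counter(t for _, t in edges)
outdeg = Counter(s for s, _ in edges)

N = len(nodes)
P_in = {i: c / N for i, c in Counter(indeg.get(n, 0) for n in nodes).items()}
P_out = {j: c / N for j, c in Counter(outdeg.get(n, 0) for n in nodes).items()}

mean_i = sum(i * p for i, p in P_in.items())
mean_j = sum(j * p for j, p in P_out.items())
assert abs(mean_i - mean_j) < 1e-12   # every edge adds one in and one out degree
print(mean_i)  # 1.25 (= 5 edges / 4 nodes)
```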

Another important network index is the cluster coefficient (1-3). It is a measure of the transitivity of the relationships between nodes, that is, how likely it is that if a is connected to b and b to c (or vice versa), a direct link between a and c also exists (the friend of my friend is my friend). As represented in Figure 1, the downstream cluster coefficient can be defined locally for a node a as the number N_∇^dns(a) of oriented triangles linking a to its nearest neighbours,

(a → b) ∧ ((b → c) ∨ (c → b)) ∧ (a → c),

over the number of all possible oriented triangles surrounding the node a with total degree k_a = i + j:

C^dns(a) = N_∇^dns(a) / [k_a (k_a − 1)]3

Just reverting the arrow directions we can define the upstream cluster coefficient.

Figure 1: Transitivity property used to define the cluster coefficient.

Spanning over the nodes, we can then evaluate the downstream cluster coefficient distribution, that is the probability for a node with out degree j to have a cluster coefficient C: D_j^dns(C). The same can be done for the upstream case.

3 Notice that a factor two is missing with respect to the undirected networks.

If the network represents an order due to some hidden biological reason, what is the reference we have to use to decide if something is happening by chance? The reference is the simple standard Erdös-Renyi random network (1), in which the probability to connect any two nodes is constant, and does not depend on the selected nodes. As shown in Figure 2, typically for undirected networks (1):

Figure 2: Typical degree distribution for random, scale free and hierarchical networks. The picture is taken from (1).



random networks (Aa) show a Poissonian degree distribution

(Ab) and a flat cluster coefficient distribution (Ac); ●

biologically interesting networks (Ba) show a power behaviour (Bb) and a flat cluster coefficient (Bc);



hierarchical networks (Ca) show a power behaviour for both degree distribution (Cb) and cluster coefficient (Cc).

Let us now see the behaviour of the yeast network version 00. Table I reports the simplest parameters of the network. Reciprocally regulated couples are simply couples of nodes a, b with a regulating b and vice versa4. Notice the difference in the range of the in degree (0…32) and of the out degree (0…789). A biological explanation of this observation is that the number of input controls cannot increase too much, or the target gene could have trouble dealing with the amount of information, while for the number of outputs (information transfer and controls) there is no such limit. The number of regulators (1465) is smaller than the number of regulated genes (4109); the number of regulators not themselves regulated (138) is quite small, while the number of genes with no target is very high (2782). While the in degree range is fully covered, the out degree range has 626 values that are not represented. On average the in and out degree are equal to 7.22. The sparseness of the adjacency matrix can be measured as:

number of edges / [number of nodes² − number of nodes] = 0.00157

4 They can be found by inspection of the adjacency matrix as the nodes for which there is a local symmetry: A_ab = A_ba = 1.

Table I: Basic yeast network characteristics

Characteristic                               Value
Number of edges                              33127
Number of nodes                              4589
max number of nodes with outgoing edges      4589
max number of nodes with incoming edges      4586
max number of in-edges                       32
max number of out-edges                      789
mean number of in-edges                      7.219
mean number of out-edges                     7.219
number of regulators                         1465
number of targets                            4109
number of disconnected nodes                 342
number of connected nodes                    4247
number of regulators not regulated           138
number of nodes with no target               2782
number of nodes both regulator and target    1327
number of reciprocally regulated couples     100
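The Table I style quantities all follow from simple set operations on the edge table. A Python sketch on a hypothetical toy edge list (the version 00 data themselves are not reproduced here):

```python
# Minimal sketch of the summary statistics of Table I on toy data.
edges = [(0, 1), (0, 2), (1, 2), (2, 1)]
edge_set = set(edges)
sources = {s for s, _ in edges}
targets = {t for _, t in edges}
nodes = sources | targets

stats = {
    "edges": len(edges),
    "nodes": len(nodes),
    "regulators": len(sources),
    "targets": len(targets),
    "regulators not regulated": len(sources - targets),
    "nodes with no target": len(nodes - sources),
    "both regulator and target": len(sources & targets),
    "reciprocal couples": sum(1 for s, t in edge_set if (t, s) in edge_set and s < t),
    "sparseness": len(edges) / (len(nodes) * (len(nodes) - 1)),
}
print(stats["regulators"], stats["reciprocal couples"])  # 3 1
```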

As shown in Figure 3, the out degree distribution follows a good scale-free behaviour, P_j^out ~ j^(−γ), with power index γ = 1.3571. The tail oscillations are simply due to fewer nodes being available at higher out degree.

Figure 3: Out degree distribution in log-log scale.

Figure 4: In degree distribution in log-lin scale.

The in degree distribution, as can be seen in Figure 4, shows a less clear behaviour. If we apply an LSQ linear regression up to i = 20, we can identify an exponential behaviour P_i^in ~ λ0^i with base λ0 = 0.9098. The reason for this behaviour is that the number of inputs a gene is capable of managing has a physical threshold, and the distribution thus tends to decrease rapidly, exponentially or even faster. The joint distribution cannot be expressed as the product of the two in and out ones, and the correlation between the in and out degrees is

(⟨ij⟩ − ⟨i⟩⟨j⟩) / (⟨i⟩⟨j⟩) = 0.125.

In Figure 5, we plot the downstream cluster coefficient distribution as a function of the outgoing degree. The result is almost flat, with large oscillations. In Figure 6 we show its upstream counterpart: it tends to increase slightly, an opposite behaviour with respect to hierarchical networks. We can then evaluate5 the number of triangular cycles, 296, and the number of oriented transitivity triangles6, 42449. Finally, the total average downstream clustering coefficient is 0.00208, while the upstream counterpart is 0.07. The latter in particular is definitely higher than for a Poisson network, while we have to proceed to a stricter comparison with specific randomized or Poisson networks to fully evaluate the meaning of the downstream cluster coefficient.

Figure 5: Downstream cluster coefficient distribution in log-log plot.
Figure 6: Upstream cluster coefficient distribution in log-log plot.

5 By looking for the nodes of the adjacency matrix for which A_ab = 1 ∧ A_bc = 1 ∧ A_ca = 1 holds, and the same changing all arrow directions.
6 By looking for nodes for which A_ab = 1 ∧ A_ac = 1 ∧ (A_bc = 1 ∨ A_cb = 1).
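The two counts of footnotes 5 and 6 can be sketched directly from their definitions; the toy graph below is hypothetical:

```python
# Sketch: triangular cycles (A_ab = A_bc = A_ca = 1) and oriented
# transitivity triangles (A_ab = A_ac = 1 with A_bc = 1 or A_cb = 1).
from itertools import permutations, combinations

succ = {0: {1}, 1: {2}, 2: {0, 1}}          # the 3-cycle 0->1->2->0, plus 2->1
nodes = list(succ)

# each 3-cycle is found once per rotation, hence the division by 3
cycles = sum(1 for a, b, c in permutations(nodes, 3)
             if b in succ[a] and c in succ[b] and a in succ[c]) // 3

transitive = 0
for a in nodes:
    for b, c in combinations(sorted(succ[a]), 2):
        if c in succ[b] or b in succ[c]:
            transitive += 1

print(cycles, transitive)   # 1 1
```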

Another interesting issue is the distribution of the shortest path length (s.p.l.). The s.p.l. l_ts between two nodes (t, s) is defined (1-3) as the minimum of all the possible path lengths connecting node s to node t. Spanning over all nodes, we can build the shortest path matrix, which is again a sparse matrix where 85.2% of the elements are infinite (no connection between the nodes). As shown in Figure 7, the average path length is 4.75, and the network diameter (the maximum s.p.l.) is 16. The distribution is strongly peaked around its mean value and, of course, asymmetric. The way in which the average path length grows with the number of nodes usually gives an indication of the network structure (small, ultra-small world). In particular, the fact that the upstream cluster coefficient is higher than in a random network, together with the small average path length, strongly suggests that we are in the presence of a small world, but further analysis must be done in this direction. As shown in Table II, by counting the number of elements of the path matrix equal to 1, 2, 3, ..., we can estimate the effective mean number of nearest neighbours, next-nearest neighbours, and so on.

Table II: i-th average nearest neighbours numbers.

Path Matrix Value    average # elements
1                    7.218784
2                    38.23491
3                    109.5217

Given that the network is equally weighted, the shortest path algorithm we used is quite simple (essentially Dijkstra's algorithm7):

● loop over all the nodes;
● for a specific node, find out all the paths departing from it: expand it, all its nearest neighbours and all its next-nearest neighbours, up to the point that there are no more neighbours to reach;
● assign a path length 1 to all first nearest neighbours, then 2 to second nearest neighbours, and so on.

Figure 7: Shortest path length distribution. Log-lin scale.

7 http://en.wikipedia.org/wiki/Dijkstra%27s_algorithm
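Because all edges weigh 1, Dijkstra's algorithm reduces to a breadth-first search from every node. A Python sketch on a hypothetical toy graph (the original code was Matlab and C++):

```python
# Sketch of the equal-weight shortest-path computation described above:
# a breadth-first search from each node gives all finite entries l_ts.
from collections import deque

succ = {0: [1, 2], 1: [3], 2: [3], 3: []}

def spl_from(s):
    """Shortest path lengths from s to every reachable node."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in succ[u]:
            if v not in dist:           # first visit = shortest, edges all weigh 1
                dist[v] = dist[u] + 1
                q.append(v)
    del dist[s]                          # drop the trivial path of length 0
    return dist

all_l = [l for s in succ for l in spl_from(s).values()]
print(sum(all_l) / len(all_l), max(all_l))   # average path length, diameter
```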

We also evaluated, by recursive depth expansion8 without taking into account already visited nodes, the average number of second and third neighbours: z2 = 58.62 ∧ z3 = 509.67. Taking into account already visited nodes, they reduce to z2 = 42 ∧ z3 = 146.7. Removing the over-counting due to triangles, we recover exactly the values of Table II.

As represented in Figure 8, a simple but powerful representation of the network is the so-called bow-tie (4), allowing us to categorise every node into one of the following six mutually exclusive components9:

● SCC (690): every node can reach and can be reached by every other node in this subset. The SCC component corresponds to what other authors (4,5) defined as the giant cluster or component;
● IN (92): every node can reach the SCC subset;
● OUT (3293): every node is reached by a node of the SCC subset;
● TUBES (17): nodes through which nodes in the IN subset can reach the OUT subset without passing through the SCC;
● TENDRILS (67+63): nodes reached by a node in the IN subset or nodes reaching the OUT subset;
● DISC (342+11): disconnected components.

Figure 8: Network components.
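The core of the bow-tie dissection can be sketched with plain reachability searches; the toy graph is hypothetical, and the TUBES/TENDRILS refinements are omitted for brevity (this per-node intersection method is quadratic and only meant as an illustration, not as the C++ code actually used):

```python
# Sketch: find the largest strongly connected component, then classify
# the remaining nodes as IN (reaching it) or OUT (reached by it).
succ = {0: {1}, 1: {2}, 2: {1, 3}, 3: set(), 4: {0}, 5: set()}
pred = {n: set() for n in succ}
for s, ts in succ.items():
    for t in ts:
        pred[t].add(s)

def reach(start, adj):
    seen, stack = set(start), list(start)
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

# SCC containing n = (nodes reachable from n) ∩ (nodes reaching n)
sccs = {frozenset(reach({n}, succ) & reach({n}, pred)) for n in succ}
scc = max(sccs, key=len)                        # the strongly connected core
out_part = reach(set(scc), succ) - set(scc)     # OUT component
in_part = reach(set(scc), pred) - set(scc)      # IN component
print(sorted(scc), sorted(in_part), sorted(out_part))  # [1, 2] [0, 4] [3]
```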

In Figure 9, we show a picture of the whole network bow tie. The SCC is a strongly cyclic connected cluster mediating communication between the IN and OUT components, and is omitted in Figure 9 to emphasise the presence of many 'short-cut edges' between IN and OUT. The structure between IN and OUT could thus be thought of as a big 'feed forward motif' (see below), in which there are paths directly from IN to OUT and a path through the complex SCC. The biological sense is that we have a small number of peer super-controllers (IN: 92), a big number of 'actuators' (OUT: 3293), a highly connected core to process and distribute information (SCC: 690) and another way to escape the central processing and reach the actuators more quickly (TUBES: 17). The OUT component itself has a quite interesting structure that, unrolling 2-node loops, can be hierarchically clustered (Figure 10), indicating a modular structure of this component.

Figure 9: Bow tie categorisation of the yeast network. GREEN: IN component, RED: OUT, BLUE: TUBES, PINK: TENDRILS.

8 Take a node and expand all its nearest neighbours, then expand each neighbour, and so on. Doing this you can check, or not, whether a nearest neighbour that you find was already found at a previous expansion level.
9 In parentheses we report the number of nodes in each subset, for our yeast network version 00.

Generating functions

The generating functions formalism (4) is a quite powerful tool to analyse both real and growing networks. Given the joint degree distribution P_ij, the generating function is defined as

G(x, y) = Σ_ij P_ij x^i y^j.

See Appendix A for further elucidation about it.

Figure 10: The OUT component hierarchically organised.

In our yeast network, we evaluated the generating functions numerically, and then obtain:

Σ_ij (2ij − i − j) P_ij = 102.808

showing that we are in the giant component phase, with an average path length of l = 4.08, close to our finding in the yeast gene network (4.75). The relative size of the SCC plus IN component is 0.1783, to be compared with the exact relative size of 17.04% found in the gene network, and the size of the SCC plus the OUT is 0.8849, to be compared with 86.79%. Moreover, we recover the 1st, 2nd, 3rd nearest neighbours average numbers10:

z = 7.2190 (7.2188) ∧ z2 = 58.8029 (58.62) ∧ z3 = 478.2623 (509.67)
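The giant-component criterion above is just a sum over the empirical joint degree distribution. A Python sketch on hypothetical toy degrees:

```python
# Sketch of the generating-function criterion: the giant component
# exists when sum_ij (2ij - i - j) P_ij > 0, evaluated here from an
# empirical joint (in, out) degree count on toy data.
from collections import Counter

degrees = [(2, 2), (1, 2), (2, 1), (1, 1)]   # (in, out) degree of each node
N = len(degrees)
P = {ij: c / N for ij, c in Counter(degrees).items()}

criterion = sum((2 * i * j - i - j) * p for (i, j), p in P.items())
print(criterion, criterion > 0)
```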

10 In parentheses, the values obtained with recursive expansion without checking already visited nodes.

Motifs: local network properties

Motifs are network interconnection patterns occurring at a frequency significantly higher than in randomized networks (6). Different kinds of networks (i.e. social, technological, ecological, transcriptional) show different motifs (6). The underlying idea is that motifs (at least partly) determine function and therefore occur frequently in certain kinds of networks. We here investigate the 2-node feedback loop motif and all 3-node motifs. As shown in

Table III, some motifs are over-represented, particularly the feedback loop id1, the feed-forward loop id38 and the feedback loops id102 and id110. Biological webs always report this fact for the feed-forward loop (6), likely activating the output only if the input signal is persistent, allowing a rapid deactivation when the signal goes off (6), or increasing flexibility and modulation by allowing two paths for the same output. Feedback loops (both 2- and 3-node) can be crucial to system dynamics, likely contributing to multi-stationarity or homeostasis.

ID     NREAL     NRANDOM      SD      ZSCORE
1      100       34.4         4.3     15.26
6      3271391   3291918.6    462.5   -44.39
12     200121    220637.3     468.4   -43.8
14     24080     27592.6      203     -17.3
36     150987    173089.3     430.8   -51.3
38     37055     16660.3      461     44.24
46     2386      678.4        102.2   16.71
74     1363      1726.2       16.6    -21.89
78     67        102.8        3.5     -10.35
98     130       113.9        16.1    1
102    125       51.7         8.1     9.1
108    187       54.1         5.9     22.61
110    33        8.9          2.8     8.71
238    4         0.1          0.3     12.33

Table III: Frequencies and significance of sub-graphs. NREAL is the number of motifs we found in our network, NRANDOM the number in the randomised network and SD the standard deviation.

Figure 11: 2- and 3-node subgraphs.

Growing networks

Defining a set of probabilistic rules to create new nodes and edges, it is possible to grow a set of networks with common statistical properties. These shared features can be extrapolated by analysing a big number of networks grown with the same rules, or by using the rate equation or master equation11 approach (7-14). The former is a simple tool of statistical physics that describes the system evolution by counting the creation and deletion of system states. We will see in our concrete experiment how it works.

We started exploring rules to mimic the properties of our target yeast network. We know that a scale free behaviour is a result of both preferential attachment (of the kind "the rich get richer") and a growing mechanism (15), while there is less evidence on how to jointly create a distribution close to what we observed for the outgoing degree. Our first idea was to try the following rules (from now on referred to as the FPM rules12):

1) add a new node with probability p:
● the new node is a source with probability pS, and in this case the probability to attach to a node with in-out degree (i,j) is constant (as in a random Erdös-Renyi network (15));
● or a target with probability qS = 1 − pS, and in this case the attachment kernel depends only on the source node outgoing degree (as for preferential attachment (15)): A_ij = A_j = j + λ;

2) add an edge with probability q = 1 − p, where the probability to link (i,j) with (i',j') depends only on the source outgoing degree: C_{ij,i'j'} = C_j = j + μ.

The linear attachment kernel in rule 1) has been chosen because it is the only one avoiding generic behaviour (like freezing or stretched exponential distributions) and showing a power behaviour (9,12). The shifts λ and μ have been introduced to modulate the scale free power index.

11 The master equation (8) is another widely used approach to search for structural properties of growing networks, and is perfectly equivalent to the rate equation.

We will first analyse the process as a whole: if (N(t), I(t), J(t)) are the total numbers of nodes, incoming and outgoing edges at time t, then at time t+1, with probability p one may have a new node (and then an associated edge) or, with probability 1 − p, a new edge alone:

(N, I, J) → (N+1, I+1, J+1) with prob. p
(N, I, J) → (N, I+1, J+1) with prob. 1 − p

Then, in any case the numbers of in and out edges increase by 1 at every time step, while the number of nodes increases by 1 with probability p:

N(t) = pt ∧ I(t) = J(t) = t

The average in and out degrees per node are then simply:

⟨i⟩ = I(t)/N(t) = 1/p ∧ ⟨j⟩ = J(t)/N(t) = 1/p = ⟨i⟩

The growing process of the network can be statistically defined by a rate equation describing all the possible changes happening when adding a node or 12 Where F stands for de la Fuente, P for Pieroni and M for Mancosu.

an edge (as shown pictorially in Figure 12). Defining

N i , j t as the number of

nodes having i incoming degree and j outgoing degree (similar to the joint idegree distribution), we can write: dN i , j pp S N i , j ppS N i−1, j = si , j −  dt A A p 1− pS  N i , j A j p1− p S  N i , j −1 A j−1  B B 1− p N i , j C j 1− p N i , j −1 C j −1  C C 1− p N i , j 1− p N i −1, j −  A A s i , j= p [ p S  i , 0  j ,11− p S  i , 1  j ,0 ]

For instance, the first line in the above equation represents the source term ((a) and (b) processes of Figure 12) and the attachment of the new node (c) to a node with in degree i, thus decreasing the number of nodes with in degree i (- sign), or to a node with in degree i-1, thus increasing the number of nodes with in degree i (+ sign). The constants ppS in the nominator are the relative frequencies of these events, while the constant A in the denominator is simply a term ensuring each single term normalise correctly: A=∑ ij N i , j= N =tp

B=∑ ij N i , j A j =J  N =t 1 p

C=∑ij N i , j C j= J  N =t 1 p 

Being these three factors proportional to the time, we can make the ansatz N i , j t=tn i , j , thus transforming the master differential equation into a simpler

finite difference one: 1L 2 j  ni , j= R1 ni−1, j  R3 R4 j ni , j −1 s i , j
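The finite difference equation can be iterated numerically in a few lines. A Python sketch (the original solver was Matlab), using the L2 and R1 quoted below and taking R3 = −L2, i.e. the R3 + L2 = 0 case of the first experiment; truncation sizes are assumptions:

```python
# Numerical sketch of the finite-difference equation above:
# (1 + L2*j) n_ij = R1 n_{i-1,j} + (R3 + L2*j) n_{i,j-1} + s_ij,
# iterated on a truncated (i, j) grid, in the R3 + L2 = 0 case.
p, pS = 0.1385, 0.3192
L2, R1 = 0.1268, 0.8673
R3 = -L2                       # first-experiment case, so lambda_0 = R1

imax, jmax = 60, 200           # truncation (assumed; the text used 1000)
n = [[0.0] * (jmax + 1) for _ in range(imax + 1)]
for i in range(imax + 1):
    for j in range(jmax + 1):
        s = p * (pS * (i == 0) * (j == 1) + (1 - pS) * (i == 1) * (j == 0))
        prev_i = n[i - 1][j] if i > 0 else 0.0
        prev_j = n[i][j - 1] if j > 0 else 0.0
        n[i][j] = (R1 * prev_i + (R3 + L2 * j) * prev_j + s) / (1 + L2 * j)

I = [sum(row) for row in n]    # in degree distribution I_i = sum_j n_ij
print(I[11] / I[10])           # the ratio tends to lambda_0 = R1 = 0.8673
```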

The coefficients Li and Ri are complicated but algebraic functions of the four parameters used in the FPM rules (p, pS, λ, μ), and can be found in Appendix B. To be more concrete, with our network version 00 we can estimate the probability parameters as follows:

p = number of nodes / number of edges = 4589 / 33127 = 0.1385
pS = number of source nodes / number of nodes = 1465 / 4589 = 0.3192

With these parameters, choosing λ = μ = 0 (which allows an easier implementation of the Monte Carlo simulation), we have:

L2 = 0.1268 ∧ R1 = 0.8673 ∧ R3 = −0.1268

(so that R3 + L2 = 0, the case of our first experiment, consistently with the exponents derived below).


Figure 12: State changes in the network evolution due to the FPM growing rules. (a) Source term adding a node as a source (i=0, j=1). (b) Source term adding a node as a target (i=1, j=0). (c) The new source node attaches to an existing node with in degree i, increasing its in degree by 1. (d) The new target node attaches to a node with out degree j, increasing its out degree by 1. (e) A new edge is introduced, increasing by 1 both the out degree of the old source node and the in degree of the old target node.

The difference equation can be quickly solved numerically, but we will first try to infer analytical properties of the solution. By solving the difference equation for j = 0 we immediately get:

n_{i,0} = R1 n_{i−1,0}  ⇒  n_{i,0} ∝ R1^i

then for j = 1:

n_{i,1} = [R1 / (1 + L2)] n_{i−1,1} + [(R3 + L2) / (1 + L2)] n_{i,0}

So, if R3 + L2 = 0 (that is our first experiment), we then obtain a similar result:

n_{i,1} ∝ [R1 / (1 + L2)]^i

while in general (that is, if R3 + L2 ≠ 0) we have13:

n_{i,1} = linear combination of R1^i and [R1 / (1 + L2)]^i

In any case, since we are interested in the asymptotic behaviour for i → ∞, these results greatly simplify to:

n_{i,1} ~ α^i  with  α = R1 if R3 + L2 ≠ 0,  α = R1 / (1 + L2) if R3 + L2 = 0

Continuing to solve the difference equation for j = 2, we can easily confirm this general trend:

n_{i,2} = [R1 / (1 + 2L2)] n_{i−1,2} + [(R3 + 2L2) / (1 + 2L2)] n_{i,1}

⇒ if R3 + L2 ≠ 0 then

n_{i,2} = linear combination of R1^i, [R1/(1+L2)]^i and [R1/(1+2L2)]^i ~ R1^i

otherwise

n_{i,2} = linear combination of [R1/(1+L2)]^i and [R1/(1+2L2)]^i ~ [R1/(1+L2)]^i

We can then conclude that the asymptotic behaviour of the joint in-out degree distribution for large in degree is given by an exponential law14:

n_{i,j} ~ α^i for i → ∞, with α = R1 if R3 + L2 ≠ 0, α = R1/(1+L2) if R3 + L2 = 0

In the same way, we can extrapolate the behaviour for j → ∞, observing first the difference equation for i = 0 (see Appendix C):

(1 + L2 j) n_{0,j} = (R3 + L2 j) n_{0,j−1}  ⇒  n_{0,j} ~ j^(−ε)  with  ε = (1 − R3)/L2

Then, being interested in the asymptotic behaviour, we can build for i = 1 the associated differential equation:

(1 + L2 j) n_{1,j} = (R3 + L2 j) n_{0,j−1} + C j^(−ε)  ⇒  (1 + L2 j) dn_{1,j}/dj − (1 − R3) n_{1,j} = C j^(−ε)

and find its solution n_{1,j} ~ j^(−ε) log(j). We can then make the ansatz n_{i,j} ~ j^(−ε) log^i(j), and it is quite easy to prove it by induction:

n_{i−1,j} ~ j^(−ε) log^(i−1)(j)  ⇒  (1 + L2 j) n_{i,j} = (R3 + L2 j) n_{i,j−1} + C j^(−ε) log^(i−1)(j)  ⇒  n_{i,j} ~ j^(−ε) log^i(j)

13 n_{i,0} is known and acts as a source term; the solution is then the linear combination of the homogeneous equation solution (source = 0) and a particular solution of the non-homogeneous equation (obtainable e.g. by trying a power solution).
14 Writing α = e^(−η), we have n_{i,j} ~ e^(−ηi), that is, an exponential decay.

The joint distribution then shows an asymptotic (large out degree) power behaviour corrected by a log term. This result is new and needs to be further analysed to extrapolate an interesting biological meaning, if any. With the values λ = μ = 0 ∧ p = 0.1385 ∧ pS = 0.3192 we can evaluate:

α = 0.7697 ∧ ε = 8.644

We now proceed to extrapolate the asymptotic behaviour of the incoming and outgoing degree distributions, obtained as the sums of the joint distribution with respect to one of the two indexes:

I_i = Σ_j n_ij ∧ J_j = Σ_i n_ij

Summing the finite difference equation over the index i or j, we easily get two simpler finite difference equations for these two distributions:

(1 − R3 − L2) I_i = R1 I_{i−1} + S_i  where  S_i = Σ_j s_ij

Then

I_i ~ λ0^i  with  λ0 = R1 / (1 − R3 − L2)

and

(1 − R1 + L2 j) J_j = (R3 + L2 j) J_{j−1} + S_j  where  S_j = Σ_i s_ij

then (see Appendix C)

J_j ~ j^(−γ)  with  γ = (1 − R1 − R3) / L2

With the values λ = μ = 0 ∧ p = 0.1385 ∧ pS = 0.3192 we can evaluate λ0 = 0.8673 ∧ γ = 2.0463.

We can then conclude that the outgoing degree distribution shows a scale free behaviour and the in degree distribution has a net exponential behaviour, with a different base than the joint distribution. Both the in and out distributions resemble the ones observed in the yeast gene network. The joint distribution is not merely a product of the in and out distributions, showing that a correlation exists between the in and out degrees, as we found in the gene network.

We can now see the results of the numerical solution of the difference equation for the joint and for the in and out distributions (we used i_max = j_max = 1000), and compare them to these analytical results. Figure 15 represents the out degree distribution in log-log scale, where the scale free behaviour is evident. The power extrapolated from the numerical solution is γ = 2.0452, in perfect agreement with the theoretical value. Figure 16 represents the in degree distribution in log-lin scale, showing a beautiful exponential behaviour. The base value extrapolated from the numerical solution is λ0 = 0.8673, again in perfect agreement with the theoretical value.

Figure 13: Joint degree distribution for j=2.
Figure 14: Joint degree distribution for j=∞.
Figure 15: Out degree distribution.
Figure 16: In degree distribution.

Figures 13 and 14 represent the joint degree distribution in log-lin scale as a function of the in degree alone, for fixed values j=2 and j=∞ respectively. An asymptotic power trend is evident, with a base extrapolated from the data equal to α = 0.7697, in perfect agreement with the analytical prediction.

Finally, Figures 17-19 show the plot of n_ij / log^i(j) in log-log scale for small fixed in degree values. An asymptotic scale-free behaviour is evident from the pictures, with power in the range ε = 8.8519…8.6151, slightly depending on the value of i. This result can then be considered valid for i ≪ j → ∞; for large i the exponent is not unique, nor in any case so close to the expected value as it was for the previous results, showing that we are missing some light mechanism, and that there should be a tiny modification to the analytical solution we found.

Figure 17: Joint degree distribution over the log term for fixed i=2, 3.
Figure 18: Joint degree distribution over the log term for fixed i=0, 1.
Figure 19: Joint degree distribution over the log term for fixed i=4, 5.

Discussion

Some discussion and conclusions were already spread through the previous sections; I will give here a brief summary. The yeast network...

● has an in degree distribution showing an exponential behaviour with base 0.9098, up to in degree 20;
● has an out degree distribution showing a good scale-free behaviour with exponent 1.3571;
● has a downstream clustering coefficient of 0.00208 and an upstream clustering coefficient of 0.07;
● is quite a small world, having an average path length of 4.75 and a diameter of 16;
● can be dissected into a bow tie structure, with a strongly connected component representing about 16% of the whole set of nodes;
● has a predicted (by generating functions) giant component, with relative sizes SCC+IN of about 18% and SCC+OUT of about 88%;
● shows some motifs that are over-represented, like the feedback and the feed-forward loop;
● has a behaviour that can be reasonably mimicked by the FPM rules to build a growing network.

More in general, we can say that...

● generating functions and direct network inspection agreed about the bow tie structure and average path length;
● the shortest path analysis, recursive depth expansion and generating functions agreed about the structure of the 1st, 2nd, 3rd, ... nearest neighbours distribution;
● the joint degree distribution for the FPM growing network has an unusual asymptotic behaviour for j → ∞, a scale-free behaviour corrected by a log term. This fact is new and, although the comparison with the numerical solution shows we are missing some lighter correction, it seems to be a promising field for further digging.

Finally, the difference between the results we found with the LSQ interpolation of the real network, λ0 = 0.9098 ∧ γ = 1.3571, and with the FPM growing network, λ0 = 0.8673 ∧ γ = 2.0452, can easily be reduced by choosing appropriate (non-zero) values for the two parameters λ ∧ μ.

Materials and Methods

The yeast directed network version 00 was inferred from publicly available yeast expression and genotype data (17) by Alberto de la Fuente and colleagues (16); their work, and hence their data, will soon be published.

All the yeast network analysis (degree distributions, cluster coefficient distributions, shortest path lengths, nearest neighbours) and the growing network experiment were done by me, coding in Matlab. Modules are available with no guarantee upon request ([email protected]), exclusively for non-commercial use. The motifs were analysed by me using mfinder, developed at the Uri Alon Lab (http://www.weizmann.ac.il/mcb/UriAlon/groupNetworkMotifSW.html). Gianmaria Mancosu developed C++ codes for shortest path and nearest neighbours finding and for bow-tie partitioning. The network visualisation was done by Gianmaria Mancosu using Pajek, developed by Vlado (Vladimir Batagelj). Pajek also allowed us to check again the bow tie and the shortest paths (http://vlado.fmf.uni-lj.si/pub/networks/pajek/) ... and much more.

Acknowledgements

First of all I would like to thank Alberto de la Fuente and Gianmaria Mancosu, with whom I shared the work of these three months. Alberto provided the basis for any serious scientific work, namely new and original data; he was the guide to the vast existing literature on networks and, above all, was able to lead us through the subtle biological implications of the mathematical formalism. A special thanks, moreover, for his careful and repeated reading of this little work. Gianmaria, by now second only to Vlado in the use of Pajek, dissected the network in every possible way and generated the beautiful figures I report here, which, besides their deep scientific content, are also necessary to publish in Nature. Gianmaria provided the basis of comparison for some results and support through his knowledge of the world of algorithms. I also thank all the colleagues of the Bioinformatics group, who during the "stage" created a pleasant and stimulating working atmosphere, proving indulgent towards this old student: Betta, Patricia, Sergio, Giuliana, Joel, Frederick, Lisa, Matteo, Massimiliano, Mimmo, Alejandro, Anna, Michele, Alphonse, Enrico, Simone, Riccardo, Maria, Fabio and Giorgio. May nobody resent the order of appearance: it is not random, but simply the order in which the rooms are arranged starting from the entrance.

Bibliography

1) A.L. Barabási and Z. Oltvai, Network biology: understanding the cell's functional organisation, Nature Rev. Gen., 5:101-114 (Feb. 2004).
2) R. Albert, Scale-free networks in cell biology, J. Cell Sci., 118, 21:4947-4957 (2005).
3) S.H. Strogatz, Exploring complex networks, Nature, 410:268-276 (Mar. 2001).
4) M.E.J. Newman, S.H. Strogatz and D.J. Watts, Random graphs with arbitrary degree distributions and their applications, arXiv, cond-mat/0007235 v2 (7 May 2001).
5) M.E.J. Newman, The structure and function of complex networks, SIAM Review, 45, 2:167-256 (2003).
6) R. Milo et al., Network motifs: simple building blocks of complex networks, Science, 298:824-827 (25 Oct. 2002).
7) P.L. Krapivsky, G.J. Rodgers and S. Redner, Degree distributions of growing networks, arXiv, cond-mat/0012181 v2 (4 Jun. 2001).
8) P.L. Krapivsky and S. Redner, Network growth by copying, Phys. Rev. E, 71, 036118 (2005).
9) P.L. Krapivsky and S. Redner, Rate equation approach for growing networks, in Statistical Mechanics of Complex Networks, Ed. R. Pastor-Satorras, M. Rubi and A. Diaz-Guilera, Springer (2003). Also Lect. Not. Phys., 625:3-22 (2003).
10) P.L. Krapivsky, S. Redner and F. Leyvraz, Connectivity of growing random networks, Phys. Rev. Lett., 85, 21:4629-4632 (2000).
11) P.L. Krapivsky and S. Redner, Organization of growing random networks, Phys. Rev. E, 63, 066123:1-14 (2001).
12) C. Moore, G. Ghoshal and M.E.J. Newman, Exact solutions for models of evolving networks with addition and deletion of nodes, arXiv, cond-mat/0604069 v1 (4 Apr. 2006).
13) A. Barrat and R. Pastor-Satorras, Rate equation approach for correlations in growing networks, arXiv, cond-mat/0410646 v1 (25 Oct. 2004).
14) S.N. Dorogovtsev and J.F.F. Mendes, Evolution of networks, arXiv, cond-mat/0106144 v2 (7 Sept. 2001).
15) R. Albert and A.L. Barabási, Statistical mechanics of complex networks, Rev. Mod. Phys., 74, 1:47-97 (2002).
16) B. Liu, A. de la Fuente and I. Hoeschele, From genetics to gene networks: evaluating approaches for integrative analysis of genetic marker and gene expression data for the purpose of gene network inference, BMC Genomics (2007), under review.
17) R.B. Brem and L. Kruglyak, The landscape of genetic complexity across 5,700 gene expression traits in yeast, Proc. Natl. Acad. Sci. USA, 102(5):1572-1577 (2005).

Appendix A – Generating Function Formalism

From the generating function $G(x,y)$ it is immediate to derive the in/out mean degree:

$z = \langle i \rangle = \langle j \rangle = \left[ \frac{\partial G}{\partial x} \right]_{x=y=1} = \left[ \frac{\partial G}{\partial y} \right]_{x=y=1}$

Defining the generating functions for the number of out-edges leaving a randomly chosen vertex, and for the number leaving the vertex reached by following a randomly chosen edge,

$G_0(y) = G(1,y) \quad \wedge \quad G_1(y) = \frac{1}{z} \left[ \frac{\partial G}{\partial x} \right]_{x=1}$

respectively, we can also evaluate the average number of first and second neighbours reachable from a randomly chosen vertex:

$z_2 = G_2'(1) = G_0'(1)\, G_1'(1)$

where $G_2(y) = G_0(G_1(y))$ is the generating function of the second neighbours.

G m  y=G m−1  G1  y 

Extending to the generating function of m-th neighbours: we

can

evaluate

the

average

number

of

m-th

nearest

neighbours:

z m=z m−1 G1 ' 1 . Assuming every vertex is reachable from a randomly chosen

starting vertex (that is typically the case when a giant component appear), it is l=1

possible to approximate the average path length as:

log  N / z  . log z 2 /z 

Moreover, it is possible to write down the mean component size, the phase-transition (to giant component) condition and the size of the giant component (4). The giant component exists if

$\sum_{i,j} (2ij - i - j)\, P_{ij} > 0$

and its size plus the size of the IN component is given by $S_{SCC+IN} = 1 - G_0(u)$, where $u$ is a real solution of $u = G_1(u)$. Defining the counterparts of $G_0$ and $G_1$ for the links arriving at a randomly selected vertex,

$F_0(x) = G(x,1) \quad \wedge \quad F_1(x) = \frac{1}{z} \left[ \frac{\partial G}{\partial y} \right]_{y=1}$

we can derive in an analogous manner the size of the SCC plus the OUT component: $S_{SCC+OUT} = 1 - F_0(v)$, where $v$ is a solution of $v = F_1(v)$.
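The fixed-point equation $u = G_1(u)$ can be solved by simple iteration. The sketch below does so for the same illustrative case of independent Poisson(z) in/out degrees, where $G_0(y) = G_1(y) = e^{z(y-1)}$; the mean degree is an assumed example value.

```python
from math import exp

z = 2.0   # mean degree (illustrative)

def G0(y):
    # For independent Poisson(z) in/out degrees, G0(y) = G1(y) = exp(z*(y-1))
    return exp(z * (y - 1))

G1 = G0   # the two generating functions coincide for this distribution

# Solve u = G1(u) by fixed-point iteration (converges to the stable root u < 1)
u = 0.5
for _ in range(200):
    u = G1(u)

S_scc_in = 1 - G0(u)   # size of the SCC plus the IN component
print(u, S_scc_in)
```

For z = 2 the iteration converges to u ≈ 0.20, giving a giant SCC+IN fraction of about 0.80.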

Appendix B – Mapping GN parameters to FDE

Here we will provide the mapping between the FPM growing-network parameters $(\lambda, \mu, p, p_S)$, where $\lambda$ and $\mu$ denote the shifts of the two attachment kernels, and the coefficients $(L_1, L_2, R_1, \ldots, R_4)$ of the finite difference equation for the joint degree distribution:

$(1 + L_1 i + L_2 j)\, n_{i,j} = (R_1 + R_2 i)\, n_{i-1,j} + (R_3 + R_4 j)\, n_{i,j-1} + \tilde{s}_{i,j}$

The mapping reads:

$L_1 = 0 \ \wedge\ L_2 = g_1/g_0$
$R_1 = g_2/g_0 \ \wedge\ R_2 = L_1$
$R_3 = g_3/g_0 \ \wedge\ R_4 = L_2$
$\tilde{s}_{i,j} = s_{i,j}/g_0$

where

$g_0 = 1 + f_0 + f_1 + f_2 \quad g_1 = \lambda f_1 + \mu f_2 \quad g_2 = (\lambda-1) f_1 + (\mu-1) f_2 \quad g_3 = g_1$

with

$f_0 = p_S - 1 + 1/p \quad f_1 = p(1-p_S)/(1+p) \quad f_2 = (1-p)/(1+p)$

Notice that if the shift of the two attachment kernels is zero, $\lambda = \mu = 0$, then $R_3 = L_2 = 0$.

Appendix C – A simple finite difference equation

Here we would like to evaluate the asymptotic behaviour of the solution of the following difference equation:

$(a+i)\, x_i = (b+i)\, x_{i-1}, \qquad x_0 = c$

By using the following representation15, $x = \frac{\Gamma(x+1)}{\Gamma(x)}$, we can rewrite the equation as:

$\frac{\Gamma(a+i+1)}{\Gamma(a+i)}\, x_i = \frac{\Gamma(b+i+1)}{\Gamma(b+i)}\, x_{i-1} \;\Rightarrow\; \frac{\Gamma(a+i+1)}{\Gamma(b+i+1)}\, x_i = \frac{\Gamma(a+i)}{\Gamma(b+i)}\, x_{i-1}$

Now, by the simple variable change $y_i = \frac{\Gamma(a+i+1)}{\Gamma(b+i+1)}\, x_i$, and particularly $y_0 = \frac{\Gamma(a+1)}{\Gamma(b+1)}\, c$, it is immediate to get the new difference equation and the analytical solution:

$y_i = y_{i-1} \;\Rightarrow\; y_i = y_0 \;\Rightarrow\; x_i = \frac{\Gamma(b+i+1)}{\Gamma(a+i+1)}\, y_0 = \frac{\Gamma(b+i+1)}{\Gamma(a+i+1)}\, \frac{\Gamma(a+1)}{\Gamma(b+1)}\, c$

The same result can be recovered by simple iterated substitutions in the original equation. We could then derive the asymptotic behaviour from this result, but it is easier to proceed by manipulating the equation directly. Let us try a power solution and see what happens for large $i$:

$x_i \propto i^{\gamma} \;\Rightarrow\; (a+i)\, i^{\gamma} = (b+i)\, (i-1)^{\gamma} \;\Rightarrow\; \frac{a+i}{b+i} = \left(1 - \frac{1}{i}\right)^{\gamma} \simeq e^{-\gamma/i}$

The left-hand side of the last result can again be approximated, for large $i$, as

$\frac{a+i}{b+i} = \frac{1 + a/i}{1 + b/i} \simeq \frac{e^{a/i}}{e^{b/i}} = e^{(a-b)/i}$

Then it is immediate to conclude that the asymptotic behaviour exists and is given by a power law with exponent $\gamma = b - a$.

15 The Gamma function is defined as $\Gamma(z) = (z-1)!$ if $z$ is an integer, and in general as $\Gamma(z) = \int_0^{\infty} e^{-u}\, u^{z-1}\, du$.
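The closed-form solution and the asymptotic exponent can both be checked numerically. The sketch below compares the Gamma-function formula (evaluated through log-Gamma to avoid overflow) with direct iteration of the recurrence, for arbitrary illustrative values of $a$, $b$ and $c$:

```python
from math import lgamma, exp

a, b, c = 3.0, 1.5, 1.0   # arbitrary illustrative parameters

def x_closed(i):
    # x_i = Gamma(b+i+1) Gamma(a+1) / (Gamma(a+i+1) Gamma(b+1)) * c,
    # evaluated through log-Gamma for numerical stability
    return c * exp(lgamma(b + i + 1) + lgamma(a + 1)
                   - lgamma(a + i + 1) - lgamma(b + 1))

# Direct iteration of (a + i) x_i = (b + i) x_{i-1}, starting from x_0 = c
x, steps = c, 200
for i in range(1, steps + 1):
    x = (b + i) / (a + i) * x

# x_i / i**(b-a) should approach the constant c * Gamma(a+1) / Gamma(b+1)
ratio = x_closed(10 ** 6) / (10 ** 6) ** (b - a)
print(x, x_closed(steps), ratio)
```

The ratio $x_i / i^{b-a}$ tends to the constant $c\,\Gamma(a+1)/\Gamma(b+1)$, confirming the power-law exponent $\gamma = b - a$.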