
NETWORK PROPAGATION MODELS FOR GENE SELECTION

Wei Zhang*1, Baryun Hwang*1, Baolin Wu2 and Rui Kuang1

1 Department of Computer Science and Engineering, University of Minnesota Twin Cities
2 Division of Biostatistics, School of Public Health, University of Minnesota Twin Cities

ABSTRACT In this paper, we explore several network propagation methods for gene selection from microarray gene expression datasets. The network propagation methods capture gene co-expression and differential expression with unified machine learning frameworks. Large scale experiments on five breast cancer datasets validated that the network propagation methods are capable of selecting genes that are more biologically interpretable and more consistent across multiple datasets, compared with the existing approaches.

Index Terms— Biomarkers, Gene Expression, Network Propagation, Breast Cancer Metastasis

*The first two authors equally contributed. Thanks to NIH for partial funding. Corresponding: [email protected].

1. INTRODUCTION

High-throughput array technologies generate genome-scale experimental data such as gene expressions, copy number variations and single nucleotide polymorphisms for cancer studies. The high-throughput datasets typically report quantitative measures of tens of thousands of genomic features for a cohort of hundreds of patients. Identifying disease signatures, called biomarkers, from this very large number of genomic features has become a central problem in utilizing the data for clinical decision-making. However, one challenging and unsolved problem in biomarker discovery is that genomic features that are highly descriptive/discriminative between patient groups are not consistent within subgroups of patients in the same dataset, or across multiple datasets generated for the same research purpose, because of unavoidable patient heterogeneity, statistical randomness and experimental noise in the data. One well-known example is the discrepancy between the first FDA-approved breast cancer diagnostic 70-gene biomarker set, MammaPrint, reported by a lab in the Netherlands [1], and a 76-gene biomarker set reported in a very similar study by another lab in San Diego [2]: only 3 genes are shared between the 70 genes and the 76 genes. In this study, we propose several network propagation methods that fully explore modular co-expression structures along with gene differential expression to provide more reliable biomarker discovery from high-dimensional gene expression datasets.

2. METHODS

Three general learning models for network propagation are introduced for gene selection. In section 2.1, we describe the network propagation model on the Gene Correlation Graph and its extension combining network propagation with hierarchical clustering. In section 2.2, we describe another network propagation model on the Patient-Gene Expression Bipartite Graph.

2.1. Learning on Gene Correlation Graph

Fig. 1. Network propagation on Gene Correlation Graph. (A) Each vertex is initialized with the Pearson's correlation coefficient between the corresponding gene expressions and the case-control labeling. (B) After the initial scores are propagated in the graph, a new score is assigned to each gene vertex. The genes are re-ranked by their final scores.

A gene correlation graph is defined by $G = (V, E, W)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the vertex set, each element of which represents a gene, and $E \subseteq V \times V$ is the set of weighted edges. In the weighted adjacency matrix $W$, each $W_{ij}$ is the reliability score [3] based on the absolute value of the Pearson's correlation coefficient between genes $v_i$ and $v_j$. The reliability score between gene $v_i$ and gene $v_j$ is calculated by $RS_{ij} = \frac{1}{RO_{ij} \times RO_{ji}}$, where $RO_{ij}$ is gene $v_i$'s rank among all the genes with respect to the correlation with gene $v_j$ and $RO_{ji}$ is gene $v_j$'s rank with respect to the correlation with gene $v_i$. The vertices are initialized by a vector $y$, which holds the Pearson's correlation coefficients between gene expressions and the case/control labeling. The initial label vector $y$ provides the prior knowledge of differential expression of each individual gene in the case/control study. Based on the regularization framework proposed by [4], we can learn a label assignment function $f : V \to \mathbb{R}$ that assigns to each vertex a score indicating its association with the case-control study. In this regularization framework, the cost function is defined as follows,

$$\Omega(f) = \sum_{(i,j) \in E} W_{ij} \left( \frac{f_i}{\sqrt{D_{ii}}} - \frac{f_j}{\sqrt{D_{jj}}} \right)^2 + \mu \, \| f - y \|^2, \tag{1}$$

978-1-61284-792-4/10/$26.00 ©2010 IEEE

where $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$ and $\mu \geq 0$ is a parameter to weight the two terms in the cost function. The first term in Eqn. (1) is the smoothness constraint, enforcing that a good $f$ function should assign similar importance scores to genes that are highly correlated (strongly connected vertices). The second term is the fitting constraint, enforcing that the new score should also be consistent with the initial label (differential expression). In this framework, we can combine both the discriminative power of each individual gene and the correlation among the genes to identify gene markers. The optimization problem can be solved with an iterative label propagation algorithm,

$$f^{t} = (1 - \alpha)\, y + \alpha S f^{t-1}, \tag{2}$$

where $S = D^{-1/2} W D^{-1/2}$, $t$ denotes the propagation step and $\alpha = 1/(1 + \mu)$. This algorithm simply propagates labels among the neighbors in the graph. The algorithm will converge to the closed-form solution of minimizing Eqn. (1), which is

$$f^{*} = (1 - \alpha)(I - \alpha S)^{-1} y. \tag{3}$$
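The construction above can be sketched in a few lines of NumPy: reliability-score weights, the normalized matrix S, the iteration of Eqn. (2) and the closed form of Eqn. (3). This is only an illustrative sketch on synthetic data, not the authors' implementation. The function names, the tolerance, the handling of self-correlations (a gene is ranked last in its own list and self-loops are removed) and the arbitrary tie-breaking in the ranks are assumptions that the text does not specify.

```python
import numpy as np


def label_correlation(expr, labels):
    """Pearson correlation between each gene (row of expr) and the +1/-1 labels."""
    xc = expr - expr.mean(axis=1, keepdims=True)
    lc = labels - labels.mean()
    return (xc @ lc) / (np.linalg.norm(xc, axis=1) * np.linalg.norm(lc))


def reliability_weights(expr):
    """W_ij = 1 / (RO_ij * RO_ji), with ranks taken over |Pearson correlation|."""
    corr = np.abs(np.corrcoef(expr))
    np.fill_diagonal(corr, -np.inf)        # assumption: a gene ranks last in its own list
    order = np.argsort(-corr, axis=1)      # order[j] = genes sorted by correlation with gene j
    n = corr.shape[0]
    rank = np.empty((n, n))
    rank[np.arange(n)[:, None], order] = np.arange(1, n + 1)  # rank[j, i] = RO_ij
    w = 1.0 / (rank * rank.T)
    np.fill_diagonal(w, 0.0)               # assumption: no self-loops
    return w


def normalized_adjacency(w):
    """S = D^{-1/2} W D^{-1/2} with D_ii = sum_j W_ij."""
    inv_sqrt_d = 1.0 / np.sqrt(w.sum(axis=1))
    return w * inv_sqrt_d[:, None] * inv_sqrt_d[None, :]


def propagate(s, y, alpha=0.5, tol=1e-10, max_iter=10000):
    """Iterate f_t = (1 - alpha) * y + alpha * S f_{t-1} until the update is tiny."""
    f = y.copy()
    for _ in range(max_iter):
        f_next = (1 - alpha) * y + alpha * (s @ f)
        if np.max(np.abs(f_next - f)) < tol:
            return f_next
        f = f_next
    return f


def closed_form(s, y, alpha=0.5):
    """f* = (1 - alpha) (I - alpha S)^{-1} y, computed with a linear solve."""
    return (1 - alpha) * np.linalg.solve(np.eye(s.shape[0]) - alpha * s, y)


# Synthetic data only: 30 genes x 40 samples, 20 cases and 20 controls.
rng = np.random.default_rng(0)
expr = rng.normal(size=(30, 40))
labels = np.repeat([1.0, -1.0], 20)

y = label_correlation(expr, labels)
s = normalized_adjacency(reliability_weights(expr))
f_iter = propagate(s, y, alpha=0.5)
f_closed = closed_form(s, y, alpha=0.5)
```

For a graph of this size the linear solve is the simpler route. The iteration only needs matrix-vector products, which is why it is the more natural choice when S is large and sparse.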

After convergence, $f^{*}$ gives a new ranking of the genes. The highly ranked genes are considered as candidate marker genes. Figure 1 shows how label propagation can capture the hidden clusters to recover false negatives and eliminate false positives. In the example, we assume two hidden clusters in the network, one of which contains a false negative and the other a false positive. After running label propagation, final scores are assigned by balancing coherence and discrimination, so that genes in the same cluster are assigned similar scores.

One drawback of the model introduced above is that the pairwise correlation between gene expression features is calculated with the feature values across all the patient samples. Since some co-expression patterns might only exist in a subgroup of patients due to patient heterogeneity, this kind of bi-cluster structure might be lost in the graph representation. To address the problem, we propose an extension of the model that introduces an additional step of unsupervised hierarchical clustering of the samples by their correlation in gene expressions. After the clustering step, a gene correlation graph is built for each cluster at all levels of the hierarchical tree, and network propagation is performed on each of the graphs sequentially. The closed-form solution of this extended model is

$$f^{*} = \left( \prod_{c=1}^{C} (1 - \alpha)(I - \alpha S_c)^{-1} \right) y, \tag{4}$$

where $C$ is the total number of clusters and $S_c$ is the normalized adjacency matrix of the graph built for cluster $c$. Note that, since network propagation is applied sequentially on each graph, the closed-form solution smooths the initialization scores $y$ with the graph structures one at a time. This is equivalent to running network propagation on each gene correlation graph based on the output of the previous run. Shuffling the order of the graphs will not change the result.

2.2. Learning on Patient-Gene Expression Bipartite Graph

We can also formulate a semi-supervised learning problem and apply a bipartite graph-based learning algorithm for gene selection [5]. In order to distinguish gene vertices and sample vertices, we use functional notation in this subsection. Gene expression data is modeled by a bipartite graph $G = (V, U, E, W)$, where $V$ represents the set of sample vertices, $U$ represents the set of gene vertices, and $E \subseteq V \times U$ represents the set of weighted edges. Each edge $(v, u) \in E$ connects sample vertex $v \in V$ and gene vertex $u \in U$ with weight $w(v, u)$ in $W$, where $w(v, u)$ is the expression level of gene $u$ in sample $v$. The sample vertices in $V$ are labeled with +1/-1 (case/control) and the gene vertices are initialized with zeros. The initialization functions for sample vertices and gene vertices are denoted by $y(v)$ and $y(u)$.

Fig. 2. Network propagation on a patient-gene expression bipartite graph. (A) Each sample vertex is assigned the initial label of case (+1) or control (-1). All gene vertices are initialized to 0. (B) After the label information is propagated through the bipartite graph, genes are re-ranked by their final scores.

In this context, the cost function over $G = (V, U, E, W)$ is defined as

$$\Omega(f) = \sum_{(v,u) \in E} w(v,u) \left( \frac{f(v)}{\sqrt{D_{vv}}} - \frac{f(u)}{\sqrt{D_{uu}}} \right)^2 + \mu \, \| f(v) - y(v) \|^2 + \mu \, \| f(u) - y(u) \|^2, \tag{5}$$

where the diagonal degree matrices of the sample vertices and of the gene vertices have entries $D_{vv} = \sum_{u \in U} w(v, u)$ and $D_{uu} = \sum_{v \in V} w(v, u)$, respectively, and $\mu \geq 0$ is a parameter for balancing the terms on the right side of this cost function. The first term constrains the new scores to be consistent between strongly connected vertex pairs $(v, u) \in V \times U$. The second and third terms constrain the new label assignment to be consistent with the initial labeling. To solve the optimization problem in Eqn. (5), we can use a similar network propagation algorithm that converges to the closed-form solution. The propagation algorithm iteratively performs propagation between the two vertex sets in both directions as follows,

$$f(v)^{t} = (1 - \alpha)\, y(v) + \alpha S f(u)^{t-1}$$

$$f(u)^{t} = (1 - \alpha)\, y(u) + \alpha S^{\top} f(v)^{t-1}$$

where $\alpha = 1/(1 + \mu)$, $S = D^{-1/2} W D^{-1/2}$ (the left factor built from the sample-vertex degrees $D_{vv}$ and the right factor from the gene-vertex degrees $D_{uu}$), and $t$ denotes the current propagation step. Because $S$ has one row per sample and one column per gene, the update of the gene scores uses its transpose. Label information is propagated through neighbors in the bipartite graph. The algorithm will converge to a closed-form solution of the same form as Eqn. (3), and the final scores obtained after convergence are used to rank the genes as well as to classify additional test samples.

Figure 2 shows that the bipartite graph modeling can capture bi-clusters. In the example graph, the first and the second genes are connected to positive samples whereas the fourth and the fifth genes are connected to negative samples. After running our algorithm, the genes in the same bi-cluster receive similar values. Since the discovered genes are strongly connected to either the case or the control group, we can consider the genes with significant scores within the bi-clusters as biomarkers.
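The two-way propagation can be illustrated on a toy graph shaped like the example of Figure 2. This is a minimal sketch under stated assumptions: the 0/1 edge weights stand in for expression levels, the function name and stopping rule are hypothetical, and the separation of each gene into up-regulation and down-regulation used by NP-Bipartite is left out.

```python
import numpy as np


def bipartite_propagate(w, y_v, alpha=0.5, tol=1e-10, max_iter=10000):
    """Propagate sample labels to genes on a samples x genes weight matrix w.

    Sample vertices start from their +1/-1 labels, gene vertices from 0.
    Both score vectors are updated from the values of the previous step.
    """
    d_v = w.sum(axis=1)                                  # sample-vertex degrees
    d_u = w.sum(axis=0)                                  # gene-vertex degrees
    s = w / np.sqrt(d_v)[:, None] / np.sqrt(d_u)[None, :]
    y_v = np.asarray(y_v, dtype=float)
    y_u = np.zeros(w.shape[1])
    f_v, f_u = y_v.copy(), y_u.copy()
    for _ in range(max_iter):
        new_v = (1 - alpha) * y_v + alpha * (s @ f_u)
        new_u = (1 - alpha) * y_u + alpha * (s.T @ f_v)
        delta = max(np.abs(new_v - f_v).max(), np.abs(new_u - f_u).max())
        f_v, f_u = new_v, new_u
        if delta < tol:
            break
    return f_v, f_u


# Toy graph (an assumption, not data from the paper): 4 samples x 5 genes.
# Genes 1-2 connect to the case samples, genes 4-5 to the control samples,
# and gene 3 connects to one sample of each group.
w = np.array([
    [1.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
])
y_v = np.array([1.0, 1.0, -1.0, -1.0])

f_v, f_u = bipartite_propagate(w, y_v, alpha=0.5)
```

In this toy graph the genes tied to the case samples end with positive scores, those tied to the control samples with negative scores, and the gene shared by both groups stays near zero.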

Study      # of Metastasis   # of Metastasis-free
Pawitan    35                35
Wang       95                114
Miller     37                150
Loi        51                96
Desmedt    35                136

Table 1. Patient samples in the breast cancer datasets.

3. DATA PREPARATION

We collected five independent microarray gene expression datasets generated for studying breast cancer metastasis. All five datasets were generated on the Affymetrix HG-U133A platform. The raw .CEL files were downloaded from the GEO website: Pawitan (GSE1456), Wang (GSE2034), Miller (GSE3494), Loi (GSE6532), and Desmedt (GSE7390) [6, 2, 7, 8, 9], and normalized by RMA [10]. After merging the probes by gene symbol and removing the probes with no gene symbol, a total of 13,261 unique genes derived from the 22,283 probes were included in our study. The patients are classified as cases and controls in the five datasets based on the time of developing distant metastasis: patients who were free of metastasis for longer than eight years of survival and follow-up time were classified as metastasis-free, and patients who developed metastases within five years were classified as metastasis cases. The numbers of selected samples are reported in Table 1.

4. RESULTS

We evaluated the genes selected by the network propagation models with two criteria: consistency across the five datasets and enrichment by known gene sets. We tested the network propagation model on the Gene Correlation Graph (NP), the extension with hierarchical clustering (Hierarchical-NP) and the model on the Patient-Gene Expression Bipartite Graph (NP-Bipartite). Two baseline methods, Rank Product (RP) [11] and Correlation Coefficients (CC), are included. The Rank Product method is derived from biological reasoning about fold-change criteria and the combination of multiple gene rankings, which leads to high reproducibility among independent studies. The Correlation Coefficients are the correlations between gene expressions and the +1/-1 labeling of the samples. Genes were selected by their ranking scores.
Since NP-Bipartite scores each gene separately for up-regulation and down-regulation, the larger absolute value was used as the ranking score for each gene [5].

4.1. Consistency across Five Datasets

To measure the consistency of the selected marker genes on the five independent datasets, we report the percentage of common genes identified by a method in the top gene lists from the datasets. This measurement assumes that the true marker genes are more likely to be selected in each dataset than other genes. Thus, higher consistency across the datasets might indicate higher quality in gene selection. We plot the percentage of common genes among the first k genes in the gene ranking lists from at least four of the datasets in Figure 3. In the plot, the results of NP (α = 0.5, 0.95), Hierarchical-NP (α = 0.5) and NP-Bipartite (α = 0.5, 0.95) are compared with the results of Correlation Coefficients and Rank Product. The network propagation methods with all parameters clearly identify significantly more reproducible marker genes than Correlation Coefficients. This is because Correlation Coefficients consider each feature independently. Rank Product also identified highly consistent genes across the five datasets. In fact, only NP-Bipartite (α = 0.95) generates higher consistency than Rank Product. It is likely that subgroup-based approaches such as Rank Product and NP-Bipartite tend to rank the genes with higher variability at the top of the rank list because the gene significances are only measured on a subset of patient samples. Rank Product actually amplifies the significance by a product of the rankings of fold changes for each case-control pair. Thus, the subgroup-based approaches might need to be compared separately from the other full-sample-based methods.

Fig. 3. Marker gene consistency. Common marker genes identified by network propagation and the baseline methods on the five breast cancer datasets are reported. The x-axis is the number of selected marker genes ranked by each method. The y-axis is the percentage of the overlapped genes between the selected markers in at least four of the breast cancer datasets.

4.2. Gene Set Enrichment Analysis

Presumably, true marker genes of breast cancer metastasis should perform similar biological functions or be involved in common cancer gene groups. We evaluated the selected marker genes with enrichment analysis on the computational gene sets and Gene Ontology (GO) gene sets in the Molecular Signature Database (MSigDB) [12]. Specifically, we considered 427 cancer-gene-neighborhood gene sets, 456 cancer-module gene sets, 825 gene sets derived from the biological process ontology and 396 gene sets derived from the molecular function ontology. For each compared method, we selected the top 100 marker genes from each breast cancer dataset, and computed the overlap between the marker genes and the gene sets. The average number of overlapped marker genes from the five datasets is used to compute the p-value for enriching a gene set with the hypergeometric distribution. We report the number of cancer gene sets enriched by NP (α = 0.5), Hierarchical-NP (α = 0.1), NP-Bipartite (α = 0.5), Correlation Coefficients and Rank Product in Figure 4.
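The hypergeometric enrichment p-value described above can be written with the standard library alone. The sketch below computes the upper tail, i.e. the probability of seeing at least the observed overlap when the marker genes are drawn at random. The function name, the use of the upper tail, and the gene-set size and overlap in the usage line are illustrative assumptions. The text also does not say how a non-integer average overlap is handled, so the function simply expects an integer.

```python
from math import comb


def hypergeom_upper_tail(n_genes, set_size, n_selected, overlap):
    """P(X >= overlap) when n_selected genes are drawn without replacement
    from n_genes genes, of which set_size belong to the gene set."""
    upper = min(set_size, n_selected)
    favorable = sum(
        comb(set_size, i) * comb(n_genes - set_size, n_selected - i)
        for i in range(overlap, upper + 1)
    )
    # Exact integer arithmetic up to the final division avoids overflow
    # in the intermediate binomial coefficients.
    return favorable / comb(n_genes, n_selected)


# Hypothetical usage: 13,261 genes, top 100 markers, a gene set of 200 genes
# (made-up size) and an overlap of 10 genes (made-up count).
p_value = hypergeom_upper_tail(13261, 200, 100, 10)
```

A gene set would then be counted as enriched at a given threshold when this p-value falls below it, which corresponds to the thresholds on the x-axis of Figure 4.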
Clearly, the marker genes identified by full-sample-based approaches enriched more gene sets on cancer gene neighborhood and biological process than subgroup-based approaches. In particular, NP and Hierarchical-NP generated many more enriched gene sets with an enrichment p-value less than 1e-10. In contrast, the marker genes discovered by subgroup-based approaches enriched more gene sets on cancer modules and biological process. To further validate the statistical significance, we also randomly selected 100 genes from the total of 13,261 genes, and computed the overlaps between the selected genes and the gene sets. The random test was repeated 100 times, and in none of the repetitions did any gene set reach an enrichment p-value less than 0.001 (1e-3).

[Figure 4: four bar-chart panels (Cancer Gene Neighborhood, Cancer Modules, Biological Process, Molecular Function). x-axis: −log10(P-Value) threshold; y-axis: frequency. Legend: Correlation Coefficient, NP, Hierarchical-NP, Rank Product, NP-Bipartite.]

Fig. 4. Gene Set Enrichment. The number of significantly enriched gene sets in cancer gene neighborhood, cancer modules, biological process and molecular function at different p-value cut-offs.

5. DISCUSSIONS

Our study verified that network propagation is a promising approach to integrate gene co-expression relations with case/control differential expression for gene selection. Network propagation approaches are also general models that can work on any high-dimensional data with modular structures among features and samples. There are open problems that require more investigation. We evaluated the discriminative power of the selected genes and found that the genes selected by the network propagation methods and the baseline methods provide similar classification power on hold-out data (results not shown). Network propagation methods are in general more computationally demanding than most other gene selection methods. Development of approximations based on sparse matrices will improve the computational efficiency. Moreover, a good understanding of the optimal α parameter to balance the contribution of co-expression and differential expression is also lacking. We plan to investigate additional network propagation models that can provide more interpretability to the gene rankings and the parameter selection.

6. REFERENCES

[1] LJ Van’t Veer, H. Dai, MJ Van de Vijver, YD He, AA Hart, M. Mao, HL Peterse, K. Van der Kooy, MJ Marton, AT Witteveen, et al., “Gene expression profiling predicts clinical outcome of breast cancer,” Nature, vol. 415, no. 6871, pp. 530–536, 2001.

[2] Y. Wang, J.G.M. Klijn, Y. Zhang, A.M. Sieuwerts, M.P. Look, F. Yang, D. Talantov, M. Timmermans, M.E. Meijer-van Gelder, J. Yu, et al., “Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer,” The Lancet, vol. 365, no. 9460, pp. 671–679, 2005.

[3] D. Ucar, I. Neuhaus, P. Ross-MacDonald, C. Tilford, S. Parthasarathy, N. Siemers, and R.R. Ji, “Construction of a reference gene association network from multiple profiling data: application to data analysis,” Bioinformatics, vol. 23, no. 20, pp. 2716, 2007.

[4] D. Zhou, O. Bousquet, TN Lal, J. Weston, and B. Scholkopf, “Learning with local and global consistency,” Advances in Neural Information Processing Systems, vol. 16, pp. 321–328, 2004.

[5] T.H. Hwang, H. Sicotte, Z. Tian, B. Wu, J.P. Kocher, D.A. Wigle, V. Kumar, and R. Kuang, “Robust and efficient identification of biomarkers by classifying features on graphs,” Bioinformatics, vol. 24, no. 18, pp. 2023–2029, 2008.

[6] Y. Pawitan, J. Bjohle, L. Amler, A.L. Borg, S. Egyhazi, P. Hall, X. Han, L. Holmberg, F. Huang, S. Klaar, et al., “Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts,” Breast Cancer Res, vol. 7, no. 6, pp. R953–R964, 2005.

[7] L.D. Miller, J. Smeds, J. George, V.B. Vega, L. Vergara, A. Ploner, Y. Pawitan, P. Hall, S. Klaar, E.T. Liu, et al., “An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival,” Proceedings of the National Academy of Sciences, vol. 102, no. 38, pp. 13550–13555, 2005.

[8] S. Loi, B. Haibe-Kains, C. Desmedt, F. Lallemand, A.M. Tutt, C. Gillet, P. Ellis, A. Harris, J. Bergh, J.A. Foekens, et al., “Definition of clinically distinct molecular subtypes in estrogen receptor-positive breast carcinomas through genomic grade,” Journal of Clinical Oncology, vol. 25, no. 10, pp. 1239–1246, 2007.

[9] C. Desmedt, F. Piette, S. Loi, Y. Wang, F. Lallemand, B. Haibe-Kains, G. Viale, M. Delorenzi, Y. Zhang, M.S. d’Assignies, et al., “Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series,” Clinical Cancer Research, vol. 13, no. 11, pp. 3207, 2007.

[10] R.A. Irizarry, B. Hobbs, F. Collin, Y.D. Beazer-Barclay, K.J. Antonellis, U. Scherf, and T.P. Speed, “Exploration, normalization, and summaries of high density oligonucleotide array probe level data,” Biostatistics, vol. 4, no. 2, pp. 249, 2003.

[11] R. Breitling, P. Armengaud, A. Amtmann, and P. Herzyk, “Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments,” FEBS Lett, vol. 573, pp. 83–92, 2004.

[12] A. Subramanian, P. Tamayo, V.K. Mootha, S. Mukherjee, and E.L. Ebert, “Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles,” Proc Natl Acad Sci USA, vol. 102, no. 43, pp. 15545–15550, 2005.