TERPENE SYNTHASES IN GINGER AND TURMERIC by Hyun Jo Koo

_____________________

A Dissertation Submitted to the Faculty of the DEPARTMENT OF PLANT SCIENCES In Partial Fulfillment of the Requirements For the Degree of DOCTOR OF PHILOSOPHY

In the Graduate College THE UNIVERSITY OF ARIZONA

2009

2

THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE

As members of the Dissertation Committee, we certify that we have read the dissertation prepared by Hyun Jo Koo entitled Terpene Synthases in Ginger and Turmeric and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy.

David R. Gang (Date: 08/21/09)
Hans D. VanEtten (Date: 08/21/09)
David W. Galbraith (Date: 08/21/09)
Elizabeth Vierling (Date: 08/21/09)
Vahe Bandarian (Date: 08/21/09)

Final approval and acceptance of this dissertation is contingent upon the candidate’s submission of the final copies of the dissertation to the Graduate College.

I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement.

Dissertation Director: David R. Gang (Date: 08/21/09)

3

STATEMENT BY AUTHOR This dissertation has been submitted in partial fulfillment of requirements for an advanced degree at The University of Arizona and is deposited in the University Library to be made available to borrowers under rules of the Library. Brief quotations from this dissertation are allowable without special permission, provided that accurate acknowledgment of source is made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the head of the major department or the Dean of the Graduate College when in his or her judgment the proposed use of the material is in the interests of scholarship. In all other instances, however, permission must be obtained from the author.

SIGNED: Hyun Jo Koo

4

ACKNOWLEDGEMENTS

I would like to thank Dr. David R. Gang for his support over several years and for the enormous help and influence he has had on my research and scientific attitude. I also thank him for allowing me to visit my hometown and meet my family. I would also like to thank Drs. Hans D. VanEtten, David W. Galbraith, Elizabeth Vierling and Vahe Bandarian for their teaching, advice, consultations, corrections, and generous help, which have been critical to this work. My thanks go out to Susan J. Miller and Dr. Nirav C. Merchant for their teaching and generous consultations on the Perl language and the MySQL database, and for maintenance of the Amadeus server. My appreciation also goes out to Drs. Jay D. Keasling, Joseph Chappell, and Joseph P. Noel for providing plasmids and E. coli and yeast cell lines.

I would also like to thank all the past and present members of Dr. Gang’s laboratory. My special thanks go to Eric McDowell and Dr. Jeremy Kapteyn for being life forces in the lab for years, and to Drs. Yuying Sang and Anna Berim for their help.

Finally, I want to thank my parents for their love, support and encouragement throughout my life. I also appreciate the help I received from my brother in many respects. I want to thank my wife for her trust, and her family as well. I also thank all my friends for their care, help, and confidence in me.

This project is supported by the National Science Foundation Plant Genome Research Program (Grant DBI-0227618 to D.R.G.).

Thank you all: my parents, who have always loved and worried about me, my older brother, Sanghee, and my father-in-law and mother-in-law.

5

TABLE OF CONTENTS

LIST OF FIGURES .... 7
ABSTRACT .... 8
CHAPTER I - INTRODUCTION .... 10
1.1 Terpene synthases in plants .... 10
1.1.1 Introduction: Plant terpenoids .... 10
1.1.2 Terpenoid biosynthetic pathway .... 11
1.1.3 Research on terpene synthases in plants .... 15
1.2 Specialized metabolites in ginger and turmeric .... 16
1.2.1 Ginger and turmeric .... 16
1.2.2 Pharmaceutical properties of ginger .... 18
1.2.3 Pharmaceutical properties of turmeric .... 18
1.2.4 Terpene synthases in ginger and turmeric .... 19
1.2.5 Analysis of terpene synthases using EST data, metabolite profiles and microarray .... 20
CHAPTER II - PRESENT STUDY .... 23
2.1 Summary/Perspective .... 23
2.2 Description of appendices .... 24
REFERENCES .... 27
APPENDIX A - GINGER AND TURMERIC EXPRESSED SEQUENCE TAGS IDENTIFY SIGNATURE GENES FOR RHIZOME IDENTITY AND DEVELOPMENT AND THE BIOSYNTHESIS OF CURCUMINOIDS AND GINGEROL .... 31
APPENDIX B - CLONING AND CHARACTERIZATION OF SEVERAL TERPENE SYNTHASES EXPLAIN TERPENOID PRODUCTION IN GINGER AND TURMERIC TISSUES .... 90
APPENDIX C - AN APPROACH FOR PEAK DETERMINATION AND QUANTIFICATION IN EI-GC/MS ANALYSIS OF COMPLEX BIOLOGICAL DATASETS AND ITS APPLICATION TO METABOLOMIC INVESTIGATIONS .... 189
APPENDIX D - METABOLIC PROFILING AND PHYLOGENETIC ANALYSIS OF MEDICINAL ZINGIBER SPECIES: TOOLS FOR AUTHENTICATION OF GINGER (ZINGIBER OFFICINALE ROSC.) .... 268

7

LIST OF FIGURES

Figure 1. The MVA pathway .... 12
Figure 2. The MEP pathway .... 13
Figure 3. The general terpenoid/isoprenoid biosynthetic pathway .... 14
Figure 4. Terpenoids in ginger and turmeric .... 15
Figure 5. Interwoven loop design for ginger and turmeric microarray experiments .... 22

8

ABSTRACT

Ginger (Zingiber officinale Rosc.) and turmeric (Curcuma longa L.) produce important pharmacologically active metabolites at high levels, which include terpenoids and polyketides such as curcumin and gingerols. This dissertation describes the terpenoids produced by ginger and turmeric, candidate ESTs for terpene synthases, and the cloning and expression of several terpene synthases. A comparison of metabolite profiles, microarray results and EST data enables us to predict which terpene synthases are associated with the production of specific terpenoids. Analysis of EST data further suggests several genes important for the growth and development of rhizomes.

Ginger and turmeric accumulate important pharmacologically active metabolites at high levels in their rhizomes. Comparisons of ginger and turmeric EST data to publicly available sorghum rhizome ESTs revealed a total of 777 contigs common to ginger, turmeric and sorghum rhizomes but absent from other tissues. The list of rhizome-specific contigs was enriched for genes associated with regulation of tissue growth, development, and regulation of transcription. The analysis suggests that ethylene response factors, AUX/IAA proteins, and rhizome-enriched MADS box transcription factors may play important roles in defining rhizome growth and development.

From ginger and turmeric, 25 mono- and 16 sesquiterpene synthase sequences were cloned, and the functions of 13 mono- and 11 sesquiterpene synthases were determined. There are many paralogs in the ginger and turmeric terpene synthase family, some of which have the same or similar functions. However, some paralogs have diverse functions, which points to the evolutionary diversification of terpene synthases in ginger and turmeric. Importantly, α-zingiberene/β-sesquiphellandrene synthase was identified, which makes the substrates for α-turmerone and β-turmerone production in turmeric. P450 candidates for α-zingiberene/β-sesquiphellandrene oxidase are also proposed.

Research involving analysis of metabolite profiles requires the manipulation of large datasets, such as those produced by GC/MS. We developed an approach to identify compounds that involves deconvolution of peaks obtained using SICs as well as selection of peaks common between samples, even when the peaks are very small and represent unknown compounds. This approach is limited when samples contain very large peaks, which distort the SICs of small embedded peaks and sometimes their own SICs.

10

CHAPTER I INTRODUCTION

1.1 Terpene synthases in plants

1.1.1 Introduction: Plant terpenoids

More than 100,000 plant specialized metabolites have already been identified and over 20,000 different terpenoid metabolites are known [1], although some reports list that number at more than 50,000 [2], which exceeds the number of alkaloids and phenylpropanoids combined. Terpenoids extracted from plants have been used for many different purposes, such as fragrances, flavors, pharmaceutical agents, and insect repellents [3]. Aside from their importance to humans, their biological roles in plants are crucial for the entire plant kingdom. Many phytohormones, such as cytokinins (which contain a hemiterpene moiety), abscisic acid (a sesquiterpene), gibberellins (diterpenes) and brassinosteroids (triterpenes), are terpenoids or contain terpenoid moieties. Many terpenoids with conjugated double bonds act as antioxidants to cope with a variety of abiotic stresses [4]. Nonvolatile terpenoids such as tocopherols and tocotrienols are lipid-phase antioxidants that scavenge lipid peroxy radicals and react with and physically quench singlet oxygen. Carotenoids provide photoprotection through antioxidant activity in addition to light energy absorption. Other volatile terpenoids have also been shown to have antioxidant activity. Antibacterial and antifungal activities of terpenoids protect plants from invasive pathogens. Toxicity or repellent activities of terpenoids toward vertebrates or insects also protect plants, and allelopathic activities help the plant compete with other plant species [5]. Many terpenoids are volatile. Even some diterpenes (C20), as well as low molecular weight hemiterpenes (C5), monoterpenes (C10) and sesquiterpenes (C15), are released into the air under normal atmospheric conditions [6].

1.1.2 Terpenoid biosynthetic pathway

The building blocks of terpenes are isopentenyl diphosphate (IPP) and dimethylallyl diphosphate (DMAPP), which are synthesized by the mevalonate (MVA) pathway [7] in the cytosol (Fig. 1) and the 2C-methyl-D-erythritol 4-phosphate (MEP) pathway [8] in the plastid (Fig. 2). The MVA pathway is shared by animals, fungi and the plant cytosol, whereas the MEP pathway occurs in plant plastids and is also found in protozoans, most bacteria, and algae [9]. Hemiterpenes (C5) are synthesized from IPP; monoterpenes (C10) are synthesized from geranyl diphosphate (GPP, IPP + DMAPP); sesquiterpenes (C15) are synthesized from farnesyl diphosphate (FPP, GPP + IPP); diterpenes (C20) are synthesized from geranylgeranyl diphosphate (GGPP, FPP + IPP); triterpenes (C30) are modified from squalene (FPP + FPP); and tetraterpenes (C40) are modified from phytoene (GGPP + GGPP) (Fig. 3).
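The C5-unit bookkeeping in the paragraph above can be checked with a few lines of code. The following is a minimal Python sketch for illustration only; the dictionary and function names (`PRECURSORS`, `carbon_count`) are assumptions introduced here and are not part of this work. It encodes only the condensations stated in the text and counts carbons recursively.

```python
# Condensations as stated in the text: each prenyl precursor is built
# from two smaller units. IPP and DMAPP are the C5 building blocks.
C5_UNITS = {"IPP", "DMAPP"}

PRECURSORS = {
    "GPP": ("IPP", "DMAPP"),      # monoterpene precursor
    "FPP": ("GPP", "IPP"),        # sesquiterpene precursor
    "GGPP": ("FPP", "IPP"),       # diterpene precursor
    "squalene": ("FPP", "FPP"),   # triterpene precursor
    "phytoene": ("GGPP", "GGPP"), # tetraterpene precursor
}


def carbon_count(name: str) -> int:
    """Return the number of carbons in the terpenoid skeleton of `name`."""
    if name in C5_UNITS:
        return 5
    # Sum the carbons of the two units that are condensed.
    return sum(carbon_count(part) for part in PRECURSORS[name])


if __name__ == "__main__":
    for compound in PRECURSORS:
        print(f"{compound}: C{carbon_count(compound)}")
```

Running the sketch reproduces the class sizes given above: C10, C15, C20, C30 and C40.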

12

[Figure 1: reaction scheme; chemical structures are not recoverable from the extracted text. Labeled steps: acetyl-CoA → acetoacetyl-CoA → 3-hydroxy-3-methylglutaryl-CoA (HMG-CoA; HMG-CoA synthase) → mevalonate (HMG-CoA reductase; 2 NADPH) → 5-phosphomevalonate (mevalonate kinase; ATP) → 5-pyrophosphomevalonate (phosphomevalonate kinase; ATP) → isopentenyl diphosphate (IPP; mevalonate-5-pyrophosphate decarboxylase, releasing Pi and CO2), which is interconverted with dimethylallyl diphosphate (DMAPP) by IPP isomerase (IDI).]

Fig. 1 The MVA pathway.

13

[Figure 2: reaction scheme; chemical structures are not recoverable from the extracted text. Labeled steps: pyruvate + glyceraldehyde 3-phosphate → 1-deoxy-D-xylulose 5-phosphate (DXP synthase, DXS; releasing CO2) → 2-C-methyl-D-erythritol 4-phosphate (MEP; DXP reductoisomerase, DXR; NADPH) → 4-(cytidine 5'-diphospho)-2-C-methyl-D-erythritol (CDP-ME; CDP-ME synthase, CMS; CTP) → 4-(cytidine 5'-diphospho)-2-C-methyl-D-erythritol 2-phosphate (CDP-MEP; CDP-ME kinase, CMK; ATP) → 2-C-methyl-D-erythritol-2,4-cyclodiphosphate (ME-cPP; ME-cPP synthase, MCS; releasing CMP) → 1-hydroxy-2-methyl-2-(E)-butenyl 4-diphosphate (HMBPP; HMBPP synthase, HDS; ferredoxin as electron donor) → isopentenyl diphosphate (IPP) and dimethylallyl diphosphate (DMAPP) (IPP/DMAPP synthase, IDS), which are interconverted by IPP isomerase (IDI).]

Fig. 2 The MEP pathway.

14

[Figure 3: scheme; chemical structures are not recoverable from the extracted text. Labels: MVA pathway (cytosol) and MEP pathway (plastid) supplying IPP and DMAPP; GPP, FPP (GPP + IPP) and GGPP (FPP + IPP); hemiterpene (C5), monoterpene (C10), sesquiterpene (C15), diterpene (C20), squalene and triterpene (C30, FPP + FPP), phytoene and tetraterpene (C40, GGPP + GGPP).]

Fig. 3 The general terpenoid/isoprenoid biosynthetic pathway. MVA, mevalonate; MEP, 2C-methyl-D-erythritol 4-phosphate; IPP, isopentenyl diphosphate; DMAPP, dimethylallyl diphosphate; GPP, geranyl diphosphate; FPP, farnesyl diphosphate; GGPP, geranylgeranyl diphosphate.

15

1.1.3 Research on terpene synthases in plants

Analysis of terpene synthases from conifers demonstrated that the RRX8W and DDXXD motifs are conserved in mono-, sesqui-, and diterpene synthases [10]. However, diterpene synthases have a unique 210-amino acid conserved sequence and a DXDD motif. Sesquiterpene synthases do not have a transit peptide whereas mono- and diterpene synthases do [10], suggesting that mono- and diterpene synthases are targeted to a plastidial compartment whereas sesquiterpene synthases remain in the cytosol. Monoterpenoids, diterpenoids, and tetraterpenoids are synthesized in the plastid from substrates derived from the MEP pathway, whereas sesquiterpenoids and triterpenoids are produced in the cytosol using substrates typically derived from the MVA pathway [3]. However, in some cases, it has been shown that sesquiterpenes are derived from IPP and DMAPP of plastidic/MEP pathway origin, suggesting that transport of these compounds from the plastid to the cytosol can occur [11, 12].

Recently, a [Z,Z]-FPP synthase with a transit peptide was reported from the wild tomato Solanum habrochaites [13]. Terpene synthases are typically known to use substrates with trans configurations, such as [E]-GPP, [E,E]-FPP, and [E,E,E]-GGPP. This [Z,Z]-FPP synthase is targeted to the plastid. Santalene/bergamotene synthase from Solanum habrochaites also has a transit peptide and uses [Z,Z]-FPP as its substrate.

Arabidopsis has 32 predicted terpene synthase genes and 8 pseudogenes [14]. Arabidopsis flowers emit over 20 sesquiterpenes, and only two terpene synthases are responsible for the formation of virtually all known sesquiterpenes in that plant [15]. As shown in this example, terpene synthases can have broad product profiles, which may also contribute to metabolome diversity.

Several terpene synthase structures have been determined, including 5-epi-aristolochene synthase from tobacco (Nicotiana tabacum) [16], bornyl diphosphate synthase from sage (Salvia officinalis) [17], limonene synthase from spearmint (Mentha spicata) [18] and (+)-delta-cadinene synthase from tree cotton (Gossypium arboreum) [19]. These structures revealed that the conserved DDXXD motif is important for interaction with the diphosphate group of the substrate by binding divalent cations such as Mg2+ or Mn2+. All the terpene synthases cloned from ginger and turmeric have conserved DDXXD motifs.
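To illustrate how the motifs named in this section can be located in a candidate protein sequence, the following is a minimal Python sketch using regular expressions. It is not the procedure used in this work. It assumes that X stands for any residue and that RRX8W means two arginines, any eight residues, then tryptophan. The function name `find_motifs` and the short toy sequence are hypothetical and do not come from any real enzyme.

```python
import re

# Motif definitions, with X read as "any residue" (an assumption of this sketch).
MOTIF_PATTERNS = {
    "RRX8W": r"RR.{8}W",
    "DDXXD": r"DD..D",
    "DXDD": r"D.DD",
}


def find_motifs(sequence: str) -> dict[str, list[int]]:
    """Return 1-based start positions of each motif in a protein sequence."""
    seq = sequence.upper()
    hits = {}
    for name, pattern in MOTIF_PATTERNS.items():
        # A lookahead lets overlapping occurrences be reported as well.
        hits[name] = [m.start() + 1 for m in re.finditer(f"(?={pattern})", seq)]
    return hits


if __name__ == "__main__":
    # Hypothetical toy sequence, written only to exercise the patterns.
    toy_sequence = "MAARRPSGNYSPSWGGLIDDIYDVYGTT"
    for motif, positions in find_motifs(toy_sequence).items():
        print(motif, positions)
```

A scan like this only flags candidate sequences; it says nothing about whether the encoded protein is soluble or active, which has to be tested experimentally.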

1.2 Specialized metabolites in ginger and turmeric

1.2.1 Ginger and turmeric

Ginger (Zingiber officinale Rosc.) and turmeric (Curcuma longa L.) are tropical/sub-tropical perennial plants whose rhizomes have been used for both culinary and medicinal purposes in different societies for thousands of years. Both ginger and turmeric belong to the Zingiberaceae, the ginger family. They are normally propagated by rhizomes and produce a variety of terpenoids (Fig. 4).

17

[Figure 4: chemical structures are not recoverable from the extracted text. Labeled compounds: GPP, FPP, linalool, nerolidol, neral, geraniol, p-mentha-1,4(8)-diene, camphene, 1R-(+)-α-pinene, (Z)-cinerone, cineole, γ-terpinene, β-elemene, p-menth-8-en-2-one, o-cymene, α-phellandrene, α-terpinene, 7-epi-sesquithujene, ar-curcumene, (E)-caryophyllene, (Z)-β-farnesene, ar-turmerone, (E,E)-α-farnesene, γ-curcumene, (E)-γ-bisabolene, α-zingiberene, curcuphenol, β-bisabolene, α-oxobisabolene, α-turmerone, β-sesquiphellandrene, β-turmerone.]

Fig. 4 Major terpenoids identified from ginger and turmeric

18

1.2.2 Pharmaceutical properties of ginger

Ginger has been used for centuries for the treatment of a variety of human ailments including the common cold, fevers, rheumatic disorders, gastrointestinal complications, motion sickness, diabetes, cancer, etc. [20]. Ginger also has anti-bacterial [21] and anti-fungal [22] activities. Many of these medicinal activities, including anti-cancer and anti-inflammatory activities [23], are believed to be due to the presence of active phenolic compounds such as the gingerols, paradols and shogaols [24-26]. However, terpenoids in ginger are also reported to have important roles. β-Elemene arrests the cell cycle and induces apoptotic cell death in lung cancer cells [27], and elemene has been used to treat patients with chylothorax [28]. Zingiberene as well as 6-gingerol significantly inhibited gastric lesions [29], and later research revealed β-sesquiphellandrene, β-bisabolene, ar-curcumene and 6-shogaol as anti-ulcer active principles in ginger [30].

1.2.3 Pharmaceutical properties of turmeric

Turmeric has anti-inflammatory [31] and anti-cancer [32] properties, which have been mainly attributed to curcumin, a diarylheptanoid compound. However, turmeric oil containing ar-turmerone, turmerone and curlone showed antioxidant effects, which may provide an explanation for its antimutagenic action [33]. This turmeric oil also has antibacterial activity [34]. Both curcuminoids and sesquiterpenoids in turmeric exhibit hypoglycemic effects, with peroxisome proliferator-activated receptor-γ (PPAR-γ) activation as one of the mechanisms, and suppress an increase in blood glucose levels in type 2 diabetic KK-Ay mice. This effect was additive or synergistic when curcuminoids and sesquiterpenoids from turmeric were applied together [35]. Ar-turmerone from turmeric oil displays anti-tumorigenesis activity, inhibiting cell proliferation and activating apoptotic protein in human lymphoma U937 cells [36]. It was also found that apoptosis was selectively induced by ar-turmerone in human leukemia Molt 4B and HL-60 cells, but was not observed in human stomach cancer KATO III cells [37]. Ar-turmerone also shows antiplatelet activities that can prevent and treat arterial thrombosis [38].

1.2.4 Terpene synthases in ginger and turmeric

Ginger and turmeric contain a variety of specialized compounds, including polyketides and terpenoids, some of which have been described above. Despite their important roles, as far as we know, only two sesquiterpene synthases have been verified in the Zingiberaceae family: (+)-germacrene D synthase from ginger (Zingiber officinale Rosc.) rhizome [39] and α-humulene synthase from shampoo ginger (Zingiber zerumbet Smith) rhizome [40]. Based on ginger and turmeric EST data from white ginger (rhizome, root and leaf), yellow ginger (rhizome, root and leaf) and turmeric (rhizome and leaf) (released to the NCBI database), we selected putative mono- and sesquiterpene synthases, cloned them, and expressed them in E. coli or yeast with GPP and FPP as substrates. Although many of these enzymes were found to be insoluble when expressed in these systems, we were able to identify the functions of some of them. Using protein structural modeling, we also analyzed why some paralogs produce different products even though their sequences are very similar. Both ginger and turmeric produce α-zingiberene and β-sesquiphellandrene. However, only turmeric synthesizes α-turmerone and β-turmerone, which are also described as turmerone and curlone, respectively, in some papers [41]. Here we suggest that the major terpenoids produced by turmeric are α-turmerone and β-turmerone instead of turmerone and curlone.

1.2.5 Analysis of terpene synthases using EST data, metabolite profiles, and microarray

Over 50,000 expressed sequence tags (ESTs) from rhizomes, leaves and roots of two ginger lines (white ginger, GW; and yellow ginger, GY) and rhizomes and leaves of one turmeric line (orange turmeric, T3C) were analyzed. Using these ESTs, we identified ginger and turmeric genes potentially involved in rhizome biology and specialized metabolism, particularly in the production of curcuminoids, gingerols and terpenoids. Based on the EST database, 15,208 probes for microarrays were designed and printed on Agilent 8x15K format slides. Yellow ginger and two varieties of turmeric (Fat Mild Orange, FMO and Thin, Yellow Aromatic, TYA) were used for microarray experiments to compare transcripts in different tissues (rhizome, root and leaf) and developmental stages, which was done using an interwoven loop design (Fig. 5). For turmeric, FMO and TYA rhizomes (3, 5 and 7 months old), FMO and TYA roots (7 months old), and FMO and TYA leaves (7 months old) were compared. For ginger, GY rhizomes (2, 3, 4, 6 and 7 months old), GY roots (2 and 7 months old) and GY leaves (2 and 7 months old) were compared. Using metabolite profiles generated with gas chromatography/mass spectrometry (GC/MS), we were able to extract peaks of terpenoids and compare them across samples to see which samples have common peaks. Comparison of the metabolite profiles, the microarray results and EST data enables us to predict which terpene synthases might be involved in the production of specific terpenoids.
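The idea of finding peaks shared across samples can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions, not the peak-determination approach described in Appendix C: each peak is reduced to a retention time and a base-peak m/z, two peaks are treated as the same when their retention times agree within a tolerance and their base-peak m/z values are equal, and the function name `common_peaks`, the tolerance, and all numbers are hypothetical.

```python
Peak = tuple[float, int]  # (retention time in minutes, base-peak m/z)


def common_peaks(samples: dict[str, list[Peak]], rt_tol: float = 0.05) -> list[Peak]:
    """Return peaks of the first sample that have a match in every other sample.

    Two peaks match when their retention times differ by at most `rt_tol`
    minutes and their base-peak m/z values are identical (assumed criterion).
    """
    names = list(samples)
    reference, others = samples[names[0]], names[1:]
    shared = []
    for rt, mz in reference:
        found_everywhere = all(
            any(abs(rt - other_rt) <= rt_tol and mz == other_mz
                for other_rt, other_mz in samples[name])
            for name in others
        )
        if found_everywhere:
            shared.append((rt, mz))
    return shared


if __name__ == "__main__":
    # Hypothetical peak lists, invented only to exercise the function.
    toy_samples = {
        "sample_a": [(10.12, 93), (14.80, 119), (15.31, 69)],
        "sample_b": [(10.14, 93), (14.79, 119), (16.02, 83)],
        "sample_c": [(10.10, 93), (14.83, 119)],
    }
    print(common_peaks(toy_samples))
```

Real GC/MS data would also need deconvolution of co-eluting peaks and comparison of full spectra, which this sketch leaves out.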

22

[Figure 5: diagram not recoverable from the extracted text. Two panels, "Turmeric 24 arrays" and "Ginger 24 arrays". Turmeric samples: TYA-Rh 3M, TYA-Rh 5M, TYA-Rh 7M, TYA-R 7M, TYA-L 7M, FMO-Rh 3M, FMO-Rh 5M, FMO-Rh 7M, FMO-R 7M, FMO-L 7M. Ginger samples: GY-Rh 2M, GY-Rh 3M, GY-Rh 4M, GY-Rh 6M, GY-Rh 7M, GY-R 2M, GY-R 7M, GY-L 2M, GY-L 7M.]

Fig. 5 Interwoven loop design for ginger (GY) and turmeric (FMO and TYA) microarray experiments. FMO and TYA rhizomes (3, 5 and 7 months old), FMO and TYA roots (7 months old), FMO and TYA leaves (7 months old), GY rhizomes (2, 3, 4, 6 and 7 months old), GY roots (2 and 7 months old) and GY leaves (2 and 7 months old) were compared. Arrows indicate the labeling dyes, Cy3 to Cy5. The red arrow represents a comparison that was supposed to have been performed. The comparison that was actually performed is indicated by the black rounded arrow: the TYA-R 7M sample labeled with Cy3, instead of the TYA-Rh 7M sample labeled with Cy3, was compared to the TYA-R 7M sample labeled with Cy5.

23

CHAPTER II PRESENT STUDY

2.1 Summary/perspective

Of the various specialized metabolites, terpenoids are my favorites. I was attracted by their scents and further fascinated by their functions for the plants that produce them and for humans. I also like ginger and turmeric: I use ginger a lot as a condiment and enjoy curry dishes. For several years, I studied terpene synthases in ginger and turmeric. We analyzed ginger and turmeric EST data and proposed metabolic pathways in these species, used the results to clone genes, and used them to design microarray probes. We extracted a large number of metabolite samples for both GC/MS and LC/MS from different tissues of ginger and turmeric; considered together with the microarray data, these profiles were used to predict genes involved in the production of specific metabolites. To handle the large amount of data, I used Perl and MySQL, which made the analysis possible. The Amadeus server and the supercomputer on campus were very helpful.

I tried to clone and express as many terpene synthases as possible. Except for one rare transcript, I could find either the 5’ or 3’ end(s) for most of the terpene synthases present in the EST data. I have cloned 25 mono- and 16 sesquiterpene synthase sequences and revealed the functions of 13 mono- and 11 sesquiterpene synthases from ginger and turmeric. Many proteins expressed in E. coli were insoluble, and some of these were then tried in yeast. However, many of them were still insoluble even in yeast. For those genes that did not yield soluble proteins in E. coli or yeast, I hope that we will be able to express soluble proteins in plants so that we can determine their functions. Transformation of Nicotiana benthamiana by Agrobacterium infiltration might be the solution for terpene synthase expression with plants as the host.

P450 monooxygenase candidates for α-zingiberene/β-sesquiphellandrene oxidase were cloned and expressed with α-zingiberene/β-sesquiphellandrene synthase and basil P450 reductase. The products look like the hydrated forms of α-zingiberene, β-bisabolene, and β-sesquiphellandrene. However, that still needs to be verified.

After several years of work, I was able to find terpene synthases that correspond to the production of the major mono- and sesquiterpenes in ginger and turmeric. I also suggest that turmerone and curlone are α-turmerone and β-turmerone, and provide the candidate genes required for their syntheses.

2.2 Description of appendices

Appendix A represents the first manuscript included in this dissertation. Entitled “Ginger and turmeric expressed sequence tags identify signature genes for rhizome identity and development and the biosynthesis of curcuminoids and gingerols”, this manuscript is currently being edited for submission. For this paper, Eric T. McDowell and I are co-first authors and we are responsible for writing the manuscript. Regarding the actual research included in this paper, I am responsible for the analysis of the EST database by gene family (terpene synthases, P450 monooxygenases, reductases, acyltransferases, SAM related, small molecule O-methyl transferases, SABATH family), the identification and analysis of rhizome-specific genes that are commonly found in both ginger/turmeric and sorghum rhizome EST databases, the analysis of the EST databases by gene ontology (GO), the design of oligos for microarray experiments, part of the preparation of samples for microarrays, part of the microarray data analysis, and the cloning and expression of terpene synthases used in this manuscript. We plan to submit this manuscript to: Plant Physiology.

Appendix B represents the second manuscript included in this dissertation. Entitled “Cloning and characterization of several terpene synthases explain terpenoid production in ginger and turmeric tissues”, this manuscript is currently being edited for submission. For this paper, I am the first author and responsible for experiments and writing.

Appendix C represents the third manuscript included in this dissertation. Entitled “An approach for peak determination and quantification in EI-GC/MS analysis of complex biological datasets and its application to metabolomic investigations”, this manuscript is currently being edited for submission. For this paper, I am the first author and responsible for experiments and writing. We plan to submit this manuscript to: Metabolomics.

Appendix D represents the first published manuscript included in this dissertation. Entitled “Metabolic profiling and phylogenetic analysis of medicinal Zingiber species: tools for authentication of ginger (Zingiber officinale Rosc.)”, this manuscript was published in Phytochemistry. I am the third author of this paper. Together with the first author, I harvested all the samples in a greenhouse and extracted metabolites for both GC/MS and LC/MS analysis from all the samples used in this paper. I was also responsible for acquiring the raw GC/MS data.

27

REFERENCES 1.

Schwab, W., Metabolome diversity: too few genes, too many metabolites? Phytochemistry, 2003. 62(6): p. 837-49.

2.

McCaskill, D. and R. Croteau, Prospects for the bioengineering of isoprenoid biosynthesis. Adv Biochem Eng Biotechnol, 1997. 55: p. 107-46.

3.

Tholl, D., Terpene synthases and the regulation, diversity and biological roles of terpene metabolism. Curr Opin Plant Biol, 2006. 9(3): p. 297-304.

4.

Vickers, C.E., et al., A unified mechanism of action for volatile isoprenoids in plant abiotic stress. Nat Chem Biol, 2009. 5(5): p. 283-91.

5.

Wink, M., PLANT-BREEDING - IMPORTANCE OF PLANT SECONDARY METABOLITES FOR PROTECTION AGAINST PATHOGENS AND HERBIVORES. Theoretical and Applied Genetics, 1988. 75(2): p. 225-233.

6.

Dudareva, N., E. Pichersky, and J. Gershenzon, Biochemistry of plant volatiles. Plant Physiol, 2004. 135(4): p. 1893-902.

7.

Graebe, J.E., Isoprenoid biosynthesis in a cell-free system from pea shoots. Science, 1967. 157(784): p. 73-5.

8.

Charon, L., et al., Deuterium-labelled isotopomers of 2-C-methyl-D-erythritol as tools for the elucidation of the 2-C-methyl-D-erythritol 4-phosphate pathway for isoprenoid biosynthesis. Biochem J, 2000. 346 Pt 3: p. 737-42.

9.

Estevez, J.M., et al., 1-Deoxy-D-xylulose-5-phosphate synthase, a limiting enzyme for plastidic isoprenoid biosynthesis in plants. J Biol Chem, 2001. 276(25): p. 22901-9.

10.

Keeling, C.I. and J. Bohlmann, Genes, enzymes and chemicals of terpenoid diversity in the constitutive and induced defence of conifers against insects and pathogens. New Phytol, 2006. 170(4): p. 657-75.

11.

Adam, K.P., R. Thiel, and J. Zapp, Incorporation of 1-[1-(13)C]Deoxy-D-xylulose in chamomile sesquiterpenes. Arch Biochem Biophys, 1999. 369(1): p. 127-32.

28

12.

Dudareva, N., et al., The nonmevalonate pathway supports both monoterpene and sesquiterpene formation in snapdragon flowers. Proc Natl Acad Sci U S A, 2005. 102(3): p. 933-8.

13.

Sallaud, C., et al., A novel pathway for sesquiterpene biosynthesis from Z,Zfarnesyl pyrophosphate in the wild tomato Solanum habrochaites. Plant Cell, 2009. 21(1): p. 301-17.

14.

Aubourg, S., A. Lecharny, and J. Bohlmann, Genomic analysis of the terpenoid synthase ( AtTPS) gene family of Arabidopsis thaliana. Mol Genet Genomics, 2002. 267(6): p. 730-45.

15.

Tholl, D., et al., Two sesquiterpene synthases are responsible for the complex mixture of sesquiterpenes emitted from Arabidopsis flowers. Plant J, 2005. 42(5): p. 757-71.

16.

Starks, C.M., et al., Structural basis for cyclic terpene biosynthesis by tobacco 5epi-aristolochene synthase. Science, 1997. 277(5333): p. 1815-20.

17.

Whittington, D.A., et al., Bornyl diphosphate synthase: structure and strategy for carbocation manipulation by a terpenoid cyclase. Proc Natl Acad Sci U S A, 2002. 99(24): p. 15375-80.

18.

Hyatt, D.C., et al., Structure of limonene synthase, a simple model for terpenoid cyclase catalysis. Proc Natl Acad Sci U S A, 2007. 104(13): p. 5360-5.

19.

Gennadios, H.A., et al., Crystal structure of (+)-delta-cadinene synthase from Gossypium arboreum and evolutionary divergence of metal binding motifs for catalysis. Biochemistry, 2009. 48(26): p. 6175-83.

20.

Kundu, J.K., H.K. Na, and Y.J. Surh, Ginger-derived phenolic substances with cancer preventive and therapeutic potential. Forum Nutr, 2009. 61: p. 182-92.

21.

Lopez, P., et al., Solid- and vapor-phase antimicrobial activities of six essential oils: susceptibility of selected foodborne bacterial and fungal strains. J Agric Food Chem, 2005. 53(17): p. 6939-46.

22.

Ficker, C.E., et al., Inhibition of human pathogenic fungi by ethnobotanically selected plant extracts. Mycoses, 2003. 46(1-2): p. 29-37.

29

23.

Habib, S.H., et al., Ginger extract (Zingiber officinale) has anti-cancer and antiinflammatory effects on ethionine-induced hepatoma rats. Clinics (Sao Paulo), 2008. 63(6): p. 807-13.

24.

Nigam, N., et al., Induction of apoptosis by [6]-gingerol associated with the modulation of p53 and involvement of mitochondrial signaling pathway in B[a]Pinduced mouse skin tumorigenesis. Cancer Chemother Pharmacol, 2009. 24: p. 24.

25.

Jeong, C.H., et al., [6]-Gingerol suppresses colon cancer growth by targeting leukotriene A4 hydrolase. Cancer Res, 2009. 69(13): p. 5584-91.

26.

Shukla, Y. and M. Singh, Cancer preventive properties of ginger: a brief review. Food Chem Toxicol, 2007. 45(5): p. 683-90.

27.

Wang, G., et al., Antitumor effect of beta-elemene in non-small-cell lung cancer cells is mediated via induction of cell cycle arrest and apoptotic cell death. Cell Mol Life Sci, 2005. 62(7-8): p. 881-93.

28.

Jianjun, Q., et al., Treatment of chylothorax with elemene. Thorac Cardiovasc Surg, 2008. 56(2): p. 103-5.

29.

Yamahara, J., et al., The anti-ulcer effect in rats of ginger constituents. J Ethnopharmacol, 1988. 23(2-3): p. 299-304.

30.

Yamahara, J., et al., [Stomachic principles in ginger. II. Pungent and anti-ulcer effects of low polar constituents isolated from ginger, the dried rhizoma of Zingiber officinale Roscoe cultivated in Taiwan. The absolute stereostructure of a new diarylheptanoid]. Yakugaku Zasshi, 1992. 112(9): p. 645-55.

31.

Jurenka, J.S., Anti-inflammatory properties of curcumin, a major constituent of Curcuma longa: a review of preclinical and clinical research. Altern Med Rev, 2009. 14(2): p. 141-153.

32.

Ravindran, J., S. Prasad, and B.B. Aggarwal, Curcumin and Cancer Cells: How Many Ways Can Curry Kill Tumor Cells Selectively? Aaps J, 2009. 10: p. 10.

33.

Jayaprakasha, G.K., et al., Evaluation of antioxidant activities and antimutagenicity of turmeric oil: a byproduct from curcumin production. Z Naturforsch [C], 2002. 57(9-10): p. 828-35.

34.

Negi, P.S., et al., Antibacterial activity of turmeric oil: a byproduct from curcumin manufacture. J Agric Food Chem, 1999. 47(10): p. 4297-300.

35.

Nishiyama, T., et al., Curcuminoids and sesquiterpenoids in turmeric (Curcuma longa L.) suppress an increase in blood glucose level in type 2 diabetic KK-Ay mice. J Agric Food Chem, 2005. 53(4): p. 959-63.

36.

Lee, Y., Activation of apoptotic protein in U937 cells by a component of turmeric oil. BMB Rep, 2009. 42(2): p. 96-100.

37.

Aratanechemuge, Y., et al., Selective induction of apoptosis by ar-turmerone isolated from turmeric (Curcuma longa L) in two human leukemia cell lines, but not in human stomach cancer cell line. Int J Mol Med, 2002. 9(5): p. 481-4.

38.

Lee, H.S., Antiplatelet property of Curcuma longa L. rhizome-derived ar-turmerone. Bioresource Technology, 2006. 97(12): p. 1372-1376.

39.

Picaud, S., et al., Cloning, expression, purification and characterization of recombinant (+)-germacrene D synthase from Zingiber officinale. Arch Biochem Biophys, 2006. 452(1): p. 17-28.

40.

Yu, F., et al., Molecular cloning and functional characterization of alpha-humulene synthase, a possible key enzyme of zerumbone biosynthesis in shampoo ginger (Zingiber zerumbet Smith). Planta, 2008. 227(6): p. 1291-9.

41.

Hiserodt, R., et al., Characterization of powdered turmeric by liquid chromatography mass spectrometry and gas chromatography mass spectrometry. Journal of Chromatography A, 1996. 740(1): p. 51-63.

APPENDIX A - GINGER AND TURMERIC EXPRESSED SEQUENCE TAGS IDENTIFY SIGNATURE GENES FOR RHIZOME IDENTITY AND DEVELOPMENT AND THE BIOSYNTHESIS OF CURCUMINOIDS AND GINGEROLS

The manuscript “Ginger and turmeric expressed sequence tags identify signature genes for rhizome identity and development and the biosynthesis of curcuminoids and gingerols” is currently being edited for submission. We plan to submit it to Plant Physiology.

Running head: Functional genomics of ginger and turmeric rhizome biology

Corresponding Author: David R. Gang, Department of Plant Sciences and BIO5 Institute, The University of Arizona, Thomas W. Keating Bioresearch Building, 1657 E. Helen Street, Tucson, AZ 85719, USA.

Tel: (520) 621-7154

Fax: (520) 626-4824

E-mail: [email protected]

Journal research area: Systems Biology, Molecular Biology, and Gene Regulation

Ginger and turmeric expressed sequence tags identify signature genes for rhizome identity and development and the biosynthesis of curcuminoids and gingerols1

Hyun Jo Koo2, Eric T. McDowell2, Xiaoqiang Ma, Kevin Greer, Jeremy Kapteyn, Zhengzhi Xie3, HyeRan Kim4, Yeisoo Yu, David Kudrna, Carol A. Soderlund, Rod A. Wing, and David R. Gang* Department of Plant Sciences and BIO5 Institute (X.M., H.J.K., E.T.M., J.K., Z.X., D.R.G.), Department of Pharmaceutical Sciences (Z.X.), Arizona Genomics Computational Laboratory and Department of Plant Sciences and BIO5 Institute (K.G., K.P.H., C.A.S.), Arizona Genomics Institute and Department of Plant Sciences and BIO5 Institute (H.R.K., Y.Y, D.K, R.A.W), The University of Arizona, Tucson, AZ 85721

1 The authors acknowledge financial assistance from the National Science Foundation Plant Genome Research Program (Grant DBI-0227618 to D.R.G.).

2 These authors contributed equally to the paper.

3 Present address: Department of Chemistry, the University of Louisville, Louisville, KY 40208.

4 Present address: Plant Genomics Institute, Chungnam National University, 220 Gung-dong, Yuseong-gu, Daejeon, 305-764 Korea

*Corresponding author; David R. Gang, [email protected].

Abstract

Ginger (Zingiber officinale) and turmeric (Curcuma longa) accumulate important pharmacologically active metabolites at high levels in their rhizomes. In order to identify rhizome-enriched genes and genes encoding specialized metabolism enzymes and pathway regulators, we evaluated an assembled collection of expressed sequence tags (ESTs) from eight different ginger and turmeric tissues. Comparisons to publicly available sorghum rhizome ESTs revealed a total of 777 contigs common to ginger/turmeric and sorghum rhizomes but absent from other tissues. The list of rhizome-specific contigs was enriched for genes associated with regulation of tissue growth, development, and regulation of transcription. In particular, transcripts for ethylene response factors and AUX/IAA proteins appeared to accumulate in patterns mirroring results from previous studies regarding rhizome growth responses to exogenous applications of auxin and ethylene. Thus, these genes may play important roles in defining rhizome growth and development. Additional associations were made for ginger and turmeric rhizome-enriched MADS box transcription factors, their putative rhizome-enriched homologs in sorghum, and rhizomatous QTLs in rice. Additionally, analysis of both primary and specialized metabolism genes indicates that ginger and turmeric rhizomes are primarily devoted to the utilization of leaf-supplied sucrose for the production and/or storage of specialized metabolites associated with the phenylpropanoid pathway and putative type III polyketide synthase gene products. This finding reinforces earlier hypotheses predicting roles of this enzyme class in the production of curcuminoids and gingerols.

INTRODUCTION

Ginger (Zingiber officinale Rosc.) and turmeric (Curcuma longa L.) are important not only as spices but also as traditional Eastern medicines for arthritis, rheumatism, fever, nausea, asthma and other ailments [1]. Terpenoids (e.g., turmerone and curlone) and phenylpropanoid-polyketides (diarylheptanoids, including the curcuminoids, and the gingerol-related compounds) are believed to be responsible for most of these medicinal properties. Curcumin in particular is used in treatment of cancer, arthritis, diabetes, Crohn's disease, cardiovascular diseases, osteoporosis, Alzheimer's disease, and psoriasis, among others [2, 3]. [6]-Gingerol also has potential in treating chronic inflammation, such as in asthma and rheumatoid arthritis [4]. This interest in the ginger and turmeric rhizome-associated diarylheptanoids and gingerols has prompted both enzyme assay and metabolic profiling-based inquiries into the biosynthesis of these compounds [2, 5-7]. Nevertheless, many of the enzymes involved in production of these compounds in ginger and turmeric have not been identified.

Rhizomes are also of broader biological interest. The rhizome was the original stem of the vascular plant lineage [8] and is still the only type of stem found in primitive plant groups such as ferns and fern allies. In order to understand the evolution of the upright stem from its rhizomatous origins, we must understand how it differs from the rhizome. Furthermore, we do not understand why and how many advanced plants have “reverted” to rhizomatous growth. Such reversions have huge economic implications, being responsible for the invasiveness and hardiness of many of the world’s most significant weeds, such as purple nutsedge (Cyperus rotundus L.), Johnson grass (Sorghum halepense), and cogon grass (Imperata cylindrica (L.) Beauv.). Thus, increasing our understanding of rhizome biology may have significant impacts not only on our understanding of how important medicinal compounds are produced, but also on our ability to control important weedy species.

Despite the importance of ginger and turmeric and of rhizomes in general, very few genes have been identified from ginger rhizomes [9-11] while none have been characterized from turmeric. Moreover, very little is known about the genes involved in rhizome identity, growth and development in general [10-14]. Paterson and coworkers recently published work on several Sorghum species that identified many genes that are expressed in the rhizomes of S. halepense and S. propinquum [12]. Several of these mapped to QTLs for “rhizomatousness” on the Sorghum genetic map. However, the exact role that any of these genes may play in rhizome development remains unclear.

Here we describe the analysis of over 50,000 expressed sequence tags (ESTs) from rhizomes, leaves and roots of two ginger lines (white ginger, GW and yellow ginger, GY) and rhizomes and leaves of one turmeric line (orange turmeric, T3C). Using these ESTs, we identified ginger and turmeric genes potentially involved in rhizome biology and specialized metabolism, particularly in the production of curcuminoids, gingerols and terpenoids. Moreover, we provide an explanation for previously observed growth responses of rhizomes to the phytohormones auxin and ethylene [15-18].

RESULTS AND DISCUSSION

Production and Analysis of a Database of Ginger and Turmeric ESTs

Random clones from eight cDNA libraries representing rhizome, leaf and root of two ginger lines and rhizome and leaf of one turmeric line (Supplemental Table S1) were 5´ and 3´ end-sequenced to produce ESTs, which were then assembled into contiguous transcriptional units (contigs) in the Program for Assembling and Viewing ESTs (PAVE, see Methods section). The resulting database (http://ag.arizona.edu/research/ganglab/ArREST.htm) contains a total of 50,509 ESTs (37,874 from ginger and 12,535 from turmeric) that assembled into 20,599 unigenes (13,717 contigs and 6,882 singletons). The average EST sequence length was 817 bp, with contig lengths ranging from 151 to 4021 bp, with the greatest number of contigs having between 701 and 800 bp, and with 95% exceeding 300 bp (Supplemental Fig. S1). Average EST number per contig was approximately 3.2, and only fifteen contigs contained 40 or more ESTs (Supplemental Tables S2 and S3), whereas a very large number of the contigs contained fewer than 10 ESTs. Many contigs contained ESTs from both species, suggesting significant homology between these two members of the Zingiberaceae.

87.6% of the ArREST contigs possessed Gene Ontology (GO) annotations. Eight GO categories (Supplemental Fig. S2) had EST abundances greater than 5%, including protein modification (10.5%), transport (9.2%), metabolism (9.0%), transcription (8.6%), cellular process (8.0%), protein biosynthesis (6.9%), electron transport (5.6%), and biological process unknown (5.3%). In contrast, the GO category secondary metabolism contained relatively few ESTs (0.3%). Early steps in the pathways to the curcuminoids and gingerols are covered by other metabolism categories. Other interesting findings from the GO categorization are as follows: 1) transport and metabolism genes appeared to be more highly expressed (based on EST counts) in root than in leaf or rhizome of ginger; 2) genes related to protein modification appeared to be expressed at higher levels in the rhizome than in the leaf or root for both turmeric and ginger; 3) protein biosynthesis genes appeared to be expressed at higher levels in GW roots than other tissues or other plant accessions; and 4) biological process unknown genes appeared to be expressed at higher levels in turmeric than in ginger.

Based on the GO categorization described above, we were able to outline a metabolic network in ginger and turmeric rhizomes that connects the metabolism of sucrose to the phenylpropanoids and terpenoids (see Supplemental Fig. S3 and Table S4). We were also able to analyze the apparent relative expression levels (based on EST abundance) of genes governing the commitment of carbon flux into several primary and specialized metabolic pathways in different tissues (Table 1), as we have previously done for glandular trichomes [19]. These results, which were validated by additional expression studies, suggest that metabolism is regulated differently in ginger and turmeric rhizomes than in leaves or roots, and support the hypothesis that this tissue type is highly specialized for the production of high levels of specialized metabolites.

We also investigated the apparent expression levels for members of eight specific gene families (see Supplemental Table S5) that play important roles in the biosynthesis of large numbers of specialized metabolites in plants: polyketide synthases (PKSs), terpene synthases (TPSs), cytochrome P450 monooxygenases (P450s), 2-oxoglutarate-dependent dioxygenases (ODDs), NAD(P)H-dependent dehydrogenases/reductases, BAHD acyltransferases, SABATH carboxyl methyltransferases, and small molecule O-methyltransferases. Five of these gene families were particularly well represented in the ArREST database, with normalized total EST numbers (see Supplemental Table S5) of more than 100 for specific family sub-categories. In particular, the P450 gene family was very well represented in the database (see Supplemental Table S6, Supplemental Fig. S4), suggesting that reactions carried out by members of this family are very important for metabolism in these plants.
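The comparisons above rely on EST counts that have been normalized across libraries of different sizes. The normalization actually used is described in Supplemental Table S5 and is not reproduced here; the following is only a minimal sketch of one common approach, assuming counts are scaled to a fixed number of ESTs per library. The function name, the scale of 10,000, and all library sizes and counts are hypothetical.

```python
def normalized_counts(est_counts, library_sizes, scale=10_000):
    """Scale raw EST counts to counts per `scale` ESTs in each library.

    est_counts:    {library: {gene_family: raw EST count}}
    library_sizes: {library: total ESTs sequenced from that library}
    The scaling scheme is an assumption, not the method of Table S5.
    """
    return {
        lib: {fam: n * scale / library_sizes[lib] for fam, n in fams.items()}
        for lib, fams in est_counts.items()
    }


# Hypothetical library sizes and raw counts, for illustration only.
library_sizes = {"rhizome": 8000, "leaf": 4000}
est_counts = {
    "rhizome": {"P450": 40, "PKS": 16},
    "leaf": {"P450": 10, "PKS": 1},
}

norm = normalized_counts(est_counts, library_sizes)
for lib, fams in norm.items():
    print(lib, fams)
```

Scaling by library size keeps a deeply sequenced library from appearing to express every gene family more highly simply because more clones were sequenced from it.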

Biosynthesis of Diarylheptanoids and Gingerols in Ginger and Turmeric

Of the more than 2,000 nonvolatile compounds detected so far by LC-MS in fresh ginger or turmeric rhizome, fewer than 100 have been isolated or structurally identified, let alone biosynthetically evaluated [4, 20-28]. What is known is that both the diarylheptanoid and gingerol-related classes of compounds are polyketides with origins in the phenylpropanoid pathway [29]. ESTs for phenylpropanoid pathway enzymes were abundant in almost all of the tissues examined. Cinnamate 4-hydroxylase ESTs were the most abundant among the six phenylpropanoid pathway enzymes. ESTs for phenylalanine ammonia lyase, the entry point into the pathway, were abundant in most of the ginger/turmeric cDNA libraries except leaves. Caffeoyl-CoA O-methyltransferase (CCOMT) was apparently expressed at higher levels (about 2-fold higher EST counts) in ginger rhizomes than in ginger leaves (Supplemental Tables S4 and S5) and was not detectable in turmeric leaf. These results paralleled previous work that showed that CCOMT specific activity was significantly higher in extracts from shoots when compared to leaves and rhizomes for both ginger and turmeric, consistent with a role for CCOMT in xylem development [29]. Moreover, the abundance of phenylpropanoid pathway-associated ESTs in rhizomes, which do not accumulate high levels of lignin, flavonoids, lignans, or other phenylpropanoid pathway-derived compounds, supports the hypothesis that the primary biochemical function of ginger and turmeric rhizomes is the conversion of sucrose into the curcuminoids and gingerols [6, 7].

Recent metabolic profiling work has suggested that at least two (and probably more) polyketide synthases are involved in production of diarylheptanoids in turmeric [6]. One such enzyme can apparently utilize p-coumaroyl-CoA and feruloyl-CoA but not caffeoyl-CoA as substrate and produces the major curcuminoids. Additional PKSs can utilize caffeoyl-CoA to produce compounds with an ortho-diol on one of the two aromatic rings, such as 3´-hydroxy-bisdemethoxycurcumin (A) and 3´-hydroxydemethoxycurcumin (B) (see [6] and Supplemental Fig. S5).

To further investigate the role of PKSs in ginger/turmeric specialized metabolism, we identified >40 contigs as putative PKSs by comparing representative type III PKS genes [30] against the ArREST database using BLASTX. These potential PKS genes belong to three major groups: chalcone/naringenin-chalcone synthases (CHSs), relatives of a recently characterized polyketide synthase from Wachendorfia thyrsiflora (Haemodoraceae) (WtPKS1), and a diverse group of putative polyketide synthases (Fig. 1, Supplemental Table S5). The ginger and turmeric enzymes tentatively identified as CHS or naringenin-CHS showed much lower expression levels compared to the other two classes in turmeric, based on EST number. CHS ESTs were not detected in GW and were only found in the GY rhizome. Ginger is not known to produce large amounts of flavonoids, which would be products of these CHS enzymes, and instead is known for gingerols, gingerol-related compounds and a diversity of diarylheptanoids. Indeed, our metabolic profiling work with these ginger lines [22] suggests that flavonoids are only minor constituents of these plants, even though the phenylpropanoid pathway appears to be very active (see above). Thus, a combination of low expression of CHS and competition by other PKS-like enzymes may deplete substrate pools for CHS, thus preventing ginger from accumulating flavonoids to appreciable levels.

In contrast, it appears that the ginger/turmeric WtPKS1-like subclass is ubiquitously expressed in the rhizome, leaf and root of both ginger and turmeric. WtPKS1 was the first type III PKS reported to be involved in the biosynthesis of diarylheptanoids [31], although it is not a curcuminoid or gingerol synthase and its exact role in the biosynthesis of the diversity of diarylheptanoids has not been determined. A gene (os07g17010) recently described from rice as a “curcuminoid synthase” [32] and expressed only at very low levels in the developing anther (and nowhere else in the plant) also likely plays a different role in vivo than in vitro assays suggested: rice does not produce curcuminoids [32]. The rice gene is not closely related to WtPKS1 or to any of the type III PKS genes identified from ginger and turmeric (see Fig. 1), plants that do produce very high levels of curcuminoids and related compounds. The exact roles of all of the members of this class of PKS in ginger and turmeric are still unknown. Nevertheless, our results so far suggest that the large array of PKS-derived compounds in ginger and turmeric may be the result of multiple PKS enzymes that catalyze slightly different reactions with different substrate specificities and product outcomes. Genes from this PKS subclass are excellent candidates to be involved in these processes. Furthermore, the β-ketoacyl-CoA synthase-like subclass (Fig. 1), which is also noticeably expanded in ginger and turmeric relative to other species, may also play a role in production of these compounds.

Other enzymes required to decorate or modify the diarylheptanoid and gingerol-related backbone structures could well belong to the other major gene families evaluated for expression. For example, specific reductases and hydroxylases are likely to be involved in elimination of double bonds and in forming hydroxyl groups on the compounds found in these plants, and these classes of enzymes were well represented in the ArREST database. Many of these potential genes show relatively high expression in all or most tissue types of ginger/turmeric (see Supplemental Table S5). These results provide important clues for further research to elucidate the pathways/networks involved in producing diarylheptanoids and gingerol-related compounds in these plants.

Terpenoid Biosynthesis in Ginger and Turmeric

Terpenoids are another major class of bioactive compounds found in ginger and turmeric [22, 25, 33-37]. Isopentenyl diphosphate and dimethylallyl diphosphate (IPP and DMAPP), the common building blocks for mono-, sesqui- and other terpenoids, appear to be produced mainly by the plastidic MEP pathway in these species because ESTs for all enzymes in this pathway were readily identified in the ArREST database at high levels (Supplemental Table S4), especially the potential regulatory enzyme DOXP synthase. In contrast, two important enzymes of the cytosolic mevalonate (MVA) pathway, phosphomevalonate kinase and pyrophosphomevalonate decarboxylase, were not detected in the database, and other genes in the MVA pathway were represented by very low EST levels for all tissues, even those producing high levels of sesquiterpenoids, which are derived from farnesyl diphosphate, believed to be synthesized in the cytosol of most plants. These results suggest that the MEP pathway is the essential pathway for production of precursor IPP/DMAPP involved in the biosynthesis of terpenoids found at high levels in ginger and turmeric, including sesquiterpenoids. Furthermore, the transport of IPP/DMAPP out of the plastid to the cytosol is likely to also occur in these plants, as has been shown for other plants such as snapdragon and sweet basil [19, 38].

Only one terpene synthase (TPS), a germacrene D synthase from ginger, has been reported from either of these species [11]. However, the ArREST database contains 45 contigs identified as putative TPSs including: 19 monoterpene synthases, 11 sesquiterpene synthases, 2 diterpene synthases, 3 triterpene synthases and 10 tetraterpene synthases (Supplemental Table S5). Most of the TPS contigs in the ArREST database possess few ESTs (average 2.47), several may represent different regions of the same gene (such as 5´ and 3´ regions), and all putative triterpene synthases appear to be exclusive to the rhizomes.

Two of the contigs appeared to represent full-length monoterpene synthase (MTS) cDNAs, one from ginger rhizome and the other from turmeric leaf. The corresponding recombinant protein from ginger rhizome was expressed in E. coli and assayed for enzymatic activity. The ginger MTS catalyzed the formation of 1,8-cineole and small amounts of p-menth-1-en-8-ol, which is an intermediate product during GPP conversion to 1,8-cineole. Although all ginger tissues produce 1,8-cineole, the rhizome contains much more than root or leaf tissues. Gene expression profiling by microarray also verified that 1,8-cineole synthase is dominantly expressed in the rhizome (Supplemental Fig. S6). Although turmeric rhizome also produces large amounts of 1,8-cineole, this ginger MTS was not expressed in turmeric rhizome, which suggests that one or more other 1,8-cineole synthases are present in the turmeric rhizome.

In addition to the TPSs discussed above, other gene families possibly involved in ginger and turmeric terpenoid biosynthesis are easily identified in the database, including P450s. Of these, a limonene hydroxylase-like enzyme (CYP71D class) is one of the most highly expressed P450s (see Supplemental Table S6), based on EST counts. This enzyme class is associated with the biosynthesis of oxygenated monoterpenoids (e.g., carvone, menthone, menthol, and pulegone). Because these specific compounds have not been detected in ginger or turmeric, CYP71D in ginger and turmeric plants is likely to be involved in producing highly accumulated compounds such as curlone and turmerones, suggesting that this enzyme class may have evolved unique functions in these two plants.

Conservation of Rhizome-enriched Genes

In an attempt to discover genes involved in defining rhizome tissue identity and rhizome development, we compared 1,223 rhizome-specific ESTs from Sorghum (see Supplemental Information) [12] to our ArREST database using TBLASTX. As a result, 2,383 ginger/turmeric contigs containing 8,017 ESTs were identified as having significant homology (E ≤ 1×10^-10) to the Sorghum rhizome ESTs. Of these, 1,606 ginger/turmeric contigs (6,425 ESTs) were expressed in tissues besides the rhizome, leaving 777 contigs (1,592 ESTs) that appeared to be exclusively expressed in ginger or turmeric rhizome (according to EST data). Within this group of 777 rhizome-enriched contigs, 70.6% or 547 contigs (1,124 ESTs) had GO annotations in “biological process”, compared to 87.6% of the entire ArREST database. The remaining unknown contigs, while lacking any known biological function, appear to represent actual genes and not random or “junk” sequence data. This result corresponds to earlier findings for Johnson grass rhizomes [12, 39].

The rhizome-enriched ESTs were enriched (2-fold more compared to leaf or root) for genes involved with “protein modification” (GO:0006464; Supplemental Fig. S7 and Table S7), with 50% of these possessing homology to genes associated with kinase-mediated signal transduction, and the remainder having homology to other serine-threonine kinases or ubiquitin-associated activities. Such post-translational protein modifications are suggestive of possible roles in biotic/abiotic stress response and phytohormone signal transduction [40-43]. In contrast, a number of GO categories were noticeably deficient. For example, few contigs were found with GOs associated with transport and cell organization and biogenesis (Fig. 2), which were primarily devoted to cell wall biosynthesis and lignification [44, 45], while 37% of the ESTs in the “catabolism” gene ontology are actually involved with the early stages of the phenylpropanoid pathway [46, 47]. Other GO categories underrepresented in the rhizome include nucleotide/nucleic acid metabolism and generation of precursor metabolites/energy (Supplemental Fig. S7). This apparent lack of a diversity of metabolic processes displays a bias in the rhizome toward processes associated with cell wall biosynthesis and remodeling as well as specific specialized metabolic pathways.
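The selection of rhizome-exclusive contigs described in this section amounts to two filters: keep contigs with a Sorghum rhizome hit at or below the E-value cutoff, then keep only those whose ESTs all come from rhizome libraries. A minimal sketch of that logic follows; it is not the pipeline used for ArREST, and the function name, contig identifiers, E-values and tissue assignments are hypothetical.

```python
def rhizome_exclusive(hits, contig_tissues, max_evalue=1e-10):
    """Return contigs with a significant hit whose ESTs derive only from rhizome.

    hits:           iterable of (contig_id, e_value) pairs from a homology search
    contig_tissues: {contig_id: set of tissues that contributed ESTs}
    """
    matched = {contig for contig, evalue in hits if evalue <= max_evalue}
    return {c for c in matched if contig_tissues.get(c) == {"rhizome"}}


# Hypothetical search results and tissue sources, for illustration only.
hits = [("c1", 1e-30), ("c2", 1e-12), ("c3", 1e-5), ("c4", 1e-10)]
contig_tissues = {
    "c1": {"rhizome"},
    "c2": {"rhizome", "leaf"},   # significant hit, but also found in leaf
    "c3": {"rhizome"},           # rhizome only, but hit is too weak
    "c4": {"rhizome"},
}

selected = rhizome_exclusive(hits, contig_tissues)
print(sorted(selected))

# The counts reported in the text follow the same subtraction.
assert 2383 - 1606 == 777      # contigs
assert 8017 - 6425 == 1592     # ESTs
```

Because the second filter depends on which libraries happened to yield ESTs for a contig, "exclusive" here means exclusive according to EST data, as the text notes.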

Identification of Transcriptional Regulators in Ginger and Turmeric Rhizomes

MYB factors easily dominated other classes of transcription factors in both contig and EST numbers in the total ArREST database (Fig. 2, Supplemental Table S8). The other major groups of transcription factors are the NAC, WRKY, homeobox, bZIP and CONSTANS classes. This contrasts with what has been found in either Arabidopsis thaliana or Oryza sativa, where the basic helix-loop-helix (bHLH) family is one of the largest families of transcription factors, closely followed in number by the MYB proteins [48-51]. In the case of ginger and turmeric, this trend is reversed: the MYB class is the most highly expressed class of transcription factors and the bHLH class is one of the lowest (see Fig. 2, Supplemental Table S8), based on EST counts, and additional expression profiling (see below) confirmed these observations. This trend may merely be a reflection of the genes being transcribed rather than the actual genomic content. These two classes of transcription factors have been shown to complex together and regulate a variety of processes, most notably the specification of hairy trichomes, root hairs and conical cells in Arabidopsis and Antirrhinum majus [52-55]. In addition, both ginger and turmeric possess noticeably expanded numbers of the WRKY and NAC types of transcription factors compared to Arabidopsis and rice. Both WRKY and NAC transcription factors have also been shown to play integral roles in plant defense, stress response and development [56-59].

In the case of ginger, turmeric and Sorghum rhizomes, it will be interesting to see which genes are regulated by these classes of transcription factors, because rhizomes lack trichomes and root hairs and do not typically accumulate appreciable levels of anthocyanins. Anthocyanin production and trichome development in plants are processes known to be regulated by MYB proteins [60, 61]. A plausible role for MYB proteins in ginger and turmeric rhizomes might lie in the regulation of rhizome-specialized metabolism (particularly the phenylpropanoid-derived diarylheptanoids or gingerols in ginger and turmeric) or general rhizome structure and development.

To verify the rhizome-enriched expression for specific transcriptional regulators, we analyzed the expression patterns of 745 of the 777 contigs in the rhizome-enriched data set using a custom oligonucleotide-based microarray (the other 32 genes did not yield good quality oligos for inclusion in the array). Ten contigs putatively encoding transcriptional regulators were expressed at higher levels (expression coefficients >2 and p-values ≤0.05) in rhizome versus other tissues in various tissue- and/or age-specific comparisons in both ginger and turmeric: a MYB, an ERF, 2 MADS, 4 ARFs and 2 AUX/IAA transcriptional regulators (Fig. 3 and Supplemental Fig. S4). It is notable that 7 of the 10 transcriptional regulators identified in this manner (the ERF, 4 ARFs and 2 AUX/IAAs, Fig. 3A-C, Supplemental Fig. S4A-D) appear to be phytohormone-related proteins that are significantly up-regulated in the rhizome tissues at several time points in both ginger and turmeric. Two other genes shown by the microarray experiments to be up-regulated within the rhizome of ginger or turmeric, respectively, were the MADS box genes 07406_01 and 14247_01 (Fig. 3D&E). Whether these two genes play complementary or different roles in these two species remains unclear. Nevertheless, these results suggest roles for auxin and ethylene in the establishment or maintenance of rhizome cell fate or rhizome apical dominance [62-64].

A Model for Rhizome Development

AUX/IAA proteins have been implicated in the development of auxin-dependent vascular tissues [65]. The presence of AUX/IAA proteins with rhizome-enriched expression is notable because auxin has been proposed to repress the initiation of shoots from rhizomes in other rhizomatous species [15, 16]. The simplest explanation for the role of these transient proteins is that the AUX/IAA ESTs observed represent the basal transcripts produced in the rhizomes. The corresponding translated AUX/IAA proteins would bind to and inhibit their ARF counterparts that otherwise directly control gene expression via DNA binding [66, 67]. However, since auxin from the shoot is readily available in the rhizome, the AUX/IAA proteins would be quickly degraded by a complex analogous to the auxin-responsive SCF complex [68], allowing for ARF-DNA binding. As a result, the ARF would not play the role of a transcriptional activator, but rather of a transcriptional repressor. As repressors, these ARF proteins would bind to their respective promoter regions and repress shoot development, as well as possible transcription of relevant ARF genes. This would help explain the lack of putative ARF genes in the 777 ESTs common to rhizomes from ginger/turmeric and Sorghum.

Although apical dominance is pronounced in rhizomes of S. halepense [69], it appears to be reduced in ginger and turmeric, possibly due to the differential presence/absence of specific NAC transcription factors. Such a hypothesis is plausible because the rhizome is a stem and mutations in NAC transcription factors have been associated with loss of apical dominance in stems [70]. NAC proteins, which may also regulate various aspects of meristematic development like rhizome bud dormancy, are known to be expressed in monocot meristems [71] and to be regulated by auxin via a similar auxin-responsive ubiquitination process [72, 73].

Ethylene was also implicated in the maintenance of the rhizome as a distinct tissue, which has been suggested for other species [17, 18], by the abundance of ESTs in the rhizomes of ginger, turmeric, and Sorghum for genes associated with ethylene signaling (ERF proteins). Ethylene may play a role in both the promotion of rhizome elongation and the suppression of shoot development. Could shoot-derived auxin be stimulating ethylene evolution in the rhizome, thereby repressing shoot formation? This idea has been hinted at in earlier experiments where the addition of auxin resulted in increased production of ethylene from exposed plant tissues [74, 75], although rhizomes were not tested. However, this hypothesis does not completely explain the previously observed roles of gibberellins in rhizome growth and development [18, 76]. A possible explanation is that gibberellins act as agents in the crosstalk between auxin stimulus and the ethylene response pathways. Such a relationship has been suggested for other tissues such as stem, root, and tuber [77-79], but has not been established for rhizomes.

MADS box transcription factors may also play an important role in rhizome initiation and development. Three (GT1_00188_02, GT1_07406_01, and GT1_14247_01) of the eight MADS box contigs that were identified in rhizomes appeared to be expressed exclusively in the rhizome of ginger, turmeric and Sorghum, based on EST data. Microarray analysis confirmed rhizome-specific expression for one of these genes, GT1_07406_01 (Fig. 3D). This gene appears to be homologous to MADS box genes whose positions are close to quantitative trait loci (QTLs) (Fig. 4) associated with rhizomatousness in Oryza and Sorghum [12, 39]. In addition, related rice MADS box proteins have been implicated as possibly having roles in flower development; flower tissue was not included in our analysis due to the difficulty in obtaining this tissue from these plants. Possible functional overlap between flower and rhizome development is not unreasonable and some MADS box transcription factors have been implicated in the development of both floral and vegetative tissues such as tubers and rhizomes [13, 14]. It will be very interesting to determine if the proteins encoded by these three contigs do in fact play some role in controlling or directing rhizomatous growth and development.

MATERIALS AND METHODS

cDNA Library Construction and Sequencing

Using the white ginger (GW), yellow ginger (GY), and red/orange turmeric (T3C) lines described previously [2, 5], total RNA was extracted from rhizomes, young leaves, and roots using the method of Dong and Dunstan [80]. Poly(A)+ RNA was purified from 1000 µg of total RNA using the PolyATract® mRNA isolation kit (Promega, USA), and cDNA was synthesized from 1 µg of poly(A)+ RNA using a Uni-ZAP® XR cDNA synthesis kit (Stratagene, USA) according to the manufacturers’ instructions. The directionally cloned (EcoRI/XhoI) cDNA libraries were then mass-excised in vivo and the resulting phagemids (pBluescript SK(-)) were propagated in the E. coli strain TJC-121 [81]. Individual cDNA clones containing inserts were sequenced from the 5´ and 3´ ends using the T7 and T3 promoter sequencing primers, respectively. The resulting ESTs have been submitted to NCBI dbEST under accession nos. DY344695-DY395309.

Production of the EST Database in PAVE

ESTs were assembled with the Program for Assembly and Viewing of ESTs (PAVE, Soderlund et al. unpublished). Most approaches to assembling ESTs first cluster them, then assemble the clusters with Phrap (www.phrap.org) or CAP3 [82]. PAVE also clusters using Megablast [83] and assembles with CAP3, but it feeds sets of ESTs to CAP3 incrementally, based on mate-pair and overlap information. This reduces the chance of incorporating an incorrect EST, which in turn could prevent correct ESTs from being placed in the contig. Instead of assembling each cluster with CAP3 in a single step, PAVE analyzes each cluster and, when possible, inputs only sets of ESTs that are supported by mate-pair information. When mate-pairs are split across contigs, the contigs retain the same cluster number (e.g. 0001_02 is the second contig in the first cluster).
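The contig naming convention described above can be illustrated with a short Python sketch. Only the cluster_contig pattern (e.g. 0001_02) comes from the text; the helper name parse_contig_id and the handling of a library prefix such as GT1 are assumptions made for illustration, and PAVE itself is not invoked.

```python
def parse_contig_id(contig_id):
    """Split a PAVE-style contig ID into (cluster number, contig number).

    Hypothetical helper for illustration only. The last two
    underscore-separated fields are taken as cluster and contig, so an
    optional prefix (e.g. 'GT1_00188_02') is ignored.
    """
    parts = contig_id.split("_")
    if len(parts) < 2:
        raise ValueError(f"not a cluster_contig identifier: {contig_id!r}")
    return int(parts[-2]), int(parts[-1])


# '0001_02' is the second contig in the first cluster.
print(parse_contig_id("0001_02"))       # (1, 2)
print(parse_contig_id("GT1_00188_02"))  # (188, 2)
```

Two contigs that share a cluster number under this scheme were derived from the same Megablast cluster, even though they were assembled separately.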

Classifying Contigs by Gene Ontology

All contigs within the ArREST database were assigned UniProt IDs based on BLASTX results. Microsoft Access was used to assign GO terms from the Gene Ontology Annotation (GOA) Database file, gene_association.goa_uniprot.gz (http://www.ebi.ac.uk/GOA/index.html), to 15,012 of 20,599 ginger and turmeric EST contigs based on their corresponding UniProt IDs. Of the remaining 5,587 contigs lacking GO terms, 1,179 were assigned GO terms using the GO annotation search tool (http://www.arabidopsis.org/tools/bulk/go/index.jsp). GO annotations were assigned to an additional 327 contigs via the ArREST-PAVE website. The remaining 4,081 contigs that completely lacked GO annotation were compared again against the Swiss-Prot and TrEMBL databases separately using BLASTX (E-value ≤ 1E-10). We examined the 5 best hits from both BLASTX results for each contig and, if necessary, considered all remaining hits until we found a hit possessing a GO annotation. This approach allowed us to annotate 87.6 % (18,049) of our ginger/turmeric contigs, leaving 12.4 % (2,550) of the contigs unannotated. The contigs were assigned to their appropriate gene ontology categories using the map2slim program and the goslim_plant.obo file downloaded from the Gene Ontology website (http://www.geneontology.org/GO.slims.shtml).
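The contig counts reported above can be checked with simple arithmetic. The following Python sketch uses only the numbers stated in the text; the variable names are illustrative, and the number of contigs annotated in the final BLASTX pass (1,531) is not stated in the text but derived here from the other figures.

```python
# Contig counts taken from the text above.
total_contigs = 20599
from_goa_file = 15012      # GO terms assigned from the GOA UniProt file
from_search_tool = 1179    # assigned with the GO annotation search tool
from_arrest_pave = 327     # assigned via the ArREST-PAVE website
annotated_final = 18049    # annotated after the second BLASTX pass

lacking_after_goa = total_contigs - from_goa_file
lacking_before_blastx = lacking_after_goa - from_search_tool - from_arrest_pave
# Derived value (not stated in the text): contigs annotated by the final pass.
from_second_blastx = annotated_final - (
    from_goa_file + from_search_tool + from_arrest_pave
)
unannotated = total_contigs - annotated_final

print(lacking_after_goa)       # 5587
print(lacking_before_blastx)   # 4081
print(from_second_blastx)      # 1531
print(unannotated)             # 2550
print(round(100 * annotated_final / total_contigs, 1))  # 87.6
print(round(100 * unannotated / total_contigs, 1))      # 12.4
```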

Rhizome-enriched Transcripts from Ginger, Turmeric and Sorghum

To determine rhizome-enriched transcripts common to ginger/turmeric and sorghum, a subtractive reciprocal best BLAST hit approach was used [84], followed by direct TBLASTX (E-value ≤ 1E-10) comparisons of the species- and tissue-specific libraries (turmeric leaf vs. turmeric rhizome; combined ginger leaf vs. ginger rhizome; and combined turmeric leaf/rhizome vs. combined ginger leaf/root/rhizome). Nonredundant contig sequences were sorted using Microsoft Access into three categories: unique to rhizome, other tissue, and shared in all tissues. Contigs exclusive to ginger or turmeric rhizome were used in a TBLASTX (E-value ≤ 1E-10) comparison with 1,223 publicly available Sorghum rhizome EST sequences (see Supplementary Information, http://www.ncbi.nlm.nih.gov/) from either S. propinquum or S. halepense [12]. The reciprocal best BLAST hit approach was used here as well but, in contrast to the comparisons within and between ginger and turmeric, the reciprocal best hits produced in this assessment were considered to contain possible orthologs required for rhizomatous tissue identity and/or function.
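The reciprocal best hit criterion used above can be sketched in Python. This is a minimal illustration rather than the pipeline actually used: it assumes the BLAST results have already been reduced to (query, subject, E-value) tuples, ranks hits by E-value alone (bit score is a common alternative), and uses made-up sequence identifiers.

```python
def best_hits(hits, max_evalue=1e-10):
    """Return {query: subject}, keeping the lowest-E-value hit per query."""
    best = {}
    for query, subject, evalue in hits:
        if evalue > max_evalue:
            continue  # hit fails the E-value cutoff
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse, max_evalue=1e-10):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    fwd = best_hits(forward, max_evalue)
    rev = best_hits(reverse, max_evalue)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)


# Toy data with hypothetical identifiers (not real contigs or ESTs).
forward = [("GT_a", "SB_1", 1e-50), ("GT_a", "SB_2", 1e-20),
           ("GT_b", "SB_2", 1e-30), ("GT_c", "SB_3", 1e-5)]
reverse = [("SB_1", "GT_a", 1e-48), ("SB_2", "GT_a", 1e-25),
           ("SB_2", "GT_b", 1e-22), ("SB_3", "GT_c", 1e-60)]

print(reciprocal_best_hits(forward, reverse))  # [('GT_a', 'SB_1')]
```

In the toy data, GT_b is excluded because its best hit SB_2 points back to GT_a, and GT_c is excluded because its forward hit does not pass the E-value cutoff.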

Identification and Evaluation of Probable Transcriptional Regulators in Ginger/turmeric

To identify possible trans-acting transcriptional regulators within the ArREST database, we queried the contig library for sequences with the gene ontology identifier for DNA binding, GO:0003677. These queries produced 1,372 nonredundant contigs with this gene ontology identifier, which were then analyzed using the protein motif identification program INTERPROSCAN [85] to identify any possible non-generalized DNA binding domains. Following analysis with INTERPROSCAN, the 1,372 contigs were manually curated to remove contigs possibly associated with generalized transcriptional machinery, yielding a total of 818 contigs, which were then tallied to determine the number of contigs or ESTs belonging to each of the DNA binding domain categories.

Mapping of Putative Ginger/turmeric MADS-box Transcription Factors to Rice

To determine whether the ginger/turmeric MADS-box transcription factors corresponded to known QTLs associated with rhizomatousness [39], three rhizome-enriched ginger/turmeric contigs identified as having significant homology to ESTs found in Sorghum rhizomes [12] were compared to the IRGSP build 4.0 pseudomolecules/annotations (International Rice Genome Sequencing Project 2005). A number of rice genes were identified as possible orthologs of the ginger/turmeric contigs. The annotations of these genes were retrieved using the various search tools available on Gramene [86]. Annotations for all predicted rice MADS-box proteins, QTLs, and their associated simple sequence repeat (SSR) primer pairs were also retrieved using Gramene [12, 39, 87, 88]. These annotations were converted manually into a general feature format (GFF) file and loaded into the Apollo genome editor [89], along with the appropriate IRGSP build 4.0 pseudomolecule (International Rice Genome Sequencing Project 2005). This produced a number of virtual maps of rhizomatousness QTLs and their probable spatial relationships to the positions of ginger/turmeric/rice MADS box transcription factors on the IRGSP pseudomolecule.
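The conversion of annotations into GFF was done manually in this work, but the record layout can be sketched in Python. GFF records have nine tab-separated columns (sequence ID, source, feature type, start, end, score, strand, phase, attributes). The helper name gff_line, the key=value attribute style (as in GFF3), and all values in the example are assumptions for illustration; the coordinates are invented and do not describe any real rice gene or QTL.

```python
def gff_line(seqid, source, feature_type, start, end, strand=".", attributes=None):
    """Format one nine-column, tab-separated GFF record.

    Hypothetical helper; score and phase are left undefined ('.').
    """
    if start < 1 or end < start:
        raise ValueError("GFF coordinates are 1-based with start <= end")
    attr_text = ";".join(f"{key}={value}" for key, value in (attributes or {}).items())
    columns = [seqid, source, feature_type, str(start), str(end),
               ".", strand, ".", attr_text or "."]
    return "\t".join(columns)


# Invented example record; coordinates and names are placeholders only.
record = gff_line("Chr6", "manual", "gene", 1000, 2500, "+",
                  {"ID": "example_MADS_box_gene"})
print(record)
```

One record per MADS-box gene, QTL interval, or SSR marker, written to a single file, is the kind of input a genome annotation viewer can display alongside the pseudomolecule sequence.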

Microarray analysis of ginger and turmeric genes

Spot intensities were extracted from the scanned microarray images using Agilent Feature Extraction software, and data analysis was performed using R [90], Bioconductor (PUBMED: 16939789), and limma [91, 92]. Normalization within and between arrays was carried out using the limma normalizeWithinArrays and normalizeBetweenArrays functions, with the loess method [93] for within-array normalization and the quantile method for between-array normalization. A linear model containing each of the sample types (as defined by the combination of turmeric type, time, and tissue), plus a term to account for differences in intensity due to the labelling fluorochrome (Cy3 vs. Cy5), was then fitted to the data using the limma lmFit function. The contrasts of interest were calculated using the contrasts.fit function, and their significance was determined using the eBayes function. The resulting p-values were adjusted for multiple comparisons using the write.fit function with the Benjamini-Hochberg false-discovery rate adjustment [94].
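For readers unfamiliar with the Benjamini-Hochberg adjustment [94], the calculation can be sketched in plain Python. This is an illustration of the adjustment itself, not the limma code used in the analysis; the function name and the example p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    For the p-value of rank r (1 = smallest) among m tests, the adjusted
    value is the minimum of p * m / k over all ranks k >= r, capped at 1.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum
    # enforces monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up p-values for illustration.
example = [0.01, 0.04, 0.03, 0.005]
print([round(p, 4) for p in benjamini_hochberg(example)])  # [0.02, 0.04, 0.04, 0.02]
```

A contrast would then be called significant at a given false-discovery rate if its adjusted p-value falls at or below that rate.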

Cloning, expression and enzyme assay of 1,8-cineole synthase

The PCR product amplified with 5’-ATGAGGAGGTCGGGAAATTACCA-3’ and 5’-GAGCTGGACAGGCTCGATCA-3’ using Pfu polymerase was inserted into the pCRT7/CT-TOPO vector (Invitrogen), which was transformed into BL21 (DE3) CodonPlus RIL (Stratagene) and expressed for 18 h at 18 °C with 0.005-0.4 mM IPTG. After induction, the E. coli pellet was vortexed with Washing Buffer (20 mM Tris-HCl, pH 7.0, 50 mM KCl) and then centrifuged. Protein Extraction Buffer (50 mM 3-(N-morpholino)-2-hydroxypropanesulfonic acid, pH 7.0, 10% [v/v] glycerol, 5 mM MgCl2, 5 mM DTT, 5 mM sodium ascorbate, 0.5 mM phenylmethylsulfonyl fluoride) was added to the washed E. coli pellet, which was then vortexed, sonicated, and centrifuged. The supernatant was recovered and the buffer was exchanged for Enzyme Assay Buffer (10 mM 3-(N-morpholino)-2-hydroxypropanesulfonic acid, pH 7.0, 10% [v/v] glycerol, 1 mM DTT) using a PD-10 column (GE Healthcare Life Sciences). Divalent cations (20 mM MgCl2, 0.5 mM MnCl2 at final concentration), protease inhibitors (0.2 mM Na2WO4, 0.1 mM NaF at final concentration), and either geranyl diphosphate (GPP, 10 µg) or farnesyl diphosphate (FPP, 10 µg) were added to a total of 500 µl of Enzyme Assay Buffer containing soluble proteins, and the mixture was incubated for 3 h at 30 °C under a 200 µl pentane overlay. The pentane overlay, either taken directly or after vortexing and centrifugation, was used for metabolite analysis on a Thermo Finnigan Trace GC 2000 coupled with a DSQ mass spectrometer and fitted with an Rtx-5MS column with a 5 m Integra-Guard column (Restek, 0.25 mm ID, 0.25 µm df, 30 m).

ACKNOWLEDGEMENTS

We would like to thank James Hatfield and Karl Haller for their initial work on PAVE.

LITERATURE CITED

1.

College, J.N.M., The Dictionary of Traditional Chinese Medicine. 1985, Shanghai: Shanghai Sci-Tech Press.

2.

Ma, X.-Q. and D.R. Gang, Metabolic profiling of turmeric (Curcuma longa L.) plants derived from in vitro micropropagation and conventional greenhouse cultivation. J. Agric. Food Chem., 2006. doi:10.1021/jf061658k.

3.

Jiang, H.L., B.N. Timmermann, and D.R. Gang, Use of liquid chromatography-electrospray ionization tandem mass spectrometry to identify diarylheptanoids in turmeric (Curcuma longa L.) rhizome. Journal of Chromatography A, 2006. 1111(1): p. 21-31. doi:10.1016/j.chroma.2006.01.103

4.

Jolad, S.D., et al., Fresh organically grown ginger (Zingiber officinale): composition and effects on LPS-induced PGE2 production. Phytochemistry, 2004. 65(13): p. 1937-1954.

5.

Ma, X.-Q. and D.R. Gang, Metabolic profiling of in vitro micropropagated and conventionally greenhouse grown ginger (Zingiber officinale). Phytochemistry, 2006. 67(24): p. 2239-2255. doi:10.1016/j.phytochem.2006.07.012.

6.

Xie, Z., X.-Q. Ma, and D.R. Gang, Metabolite modules predict the existence of biosynthetic modules in plant specialized metabolism: an example from curcuminoid biosynthesis in turmeric. Journal Of Experimental Botany, 2008. doi:10.1093/jxb/ern263

7.

Ramirez-Ahumada, M.C., B.N. Timmermann, and D.R. Gang, Biosynthesis of curcuminoids and gingerols in turmeric (Curcuma longa) and ginger (Zingiber officinale): Identification of curcuminoid synthase and hydroxycinnamoyl-CoA thioesterases. Phytochemistry, 2006. 67(18): p. 2017-2029. doi:10.1016/j.phytochem.2006.06.028

8.

Mauseth, J.D., Plant Anatomy. 1988, Menlo Park, CA: The Benjamin/Cummings Publishing Company, Inc. 560.

9.

Choi, K.H., R.A. Laursen, and K.N. Allen, The 2.1 A structure of a cysteine protease with proline specificity from ginger rhizome, Zingiber officinale. Biochemistry, 1999. 38(36): p. 11624-33.

10.

Chen, Z., et al., cDNA cloning and characterization of a mannose-binding lectin from Zingiber officinale Roscoe (ginger) rhizomes. J Biosci, 2005. 30(2): p. 213-20.

11.

Picaud, S., et al., Cloning, expression, purification and characterization of recombinant (+)-germacrene D synthase from Zingiber officinale. Arch Biochem Biophys, 2006. 452(1): p. 17-28.

12.

Jang, C.S., et al., Functional classification, genomic organization, putatively cis-acting regulatory elements, and relationship to quantitative trait loci, of sorghum genes with rhizome-enriched expression. Plant Physiol, 2006. 142(3): p. 1148-59.

13.

Kim, S.H., K. Mizuno, and T. Fujimura, Isolation of MADS-box genes from sweet potato (Ipomoea batatas (L.) Lam.) expressed specifically in vegetative tissues. Plant Cell Physiol, 2002. 43(3): p. 314-22.

14.

Skipper, M., Genes from the APETALA3 and PISTILLATA lineages are expressed in developing vascular bundles of the tuberous rhizome, flowering stem and flower Primordia of Eranthis hyemalis. Ann Bot (Lond), 2002. 89(1): p. 83-8.

15.

Fisher, J.B., Control of shoot-rhizome dimorphism in the woody monocotyledon, Cordyline (Agavaceae). Am J Bot, 1972. 59: p. 1000-1010.

16.

Leakey, R.R.B. and J. Chancellor, Parental factors in dominance of lateral buds on rhizomes of Agropyron repens (L.) Beauv. Planta, 1975. 123: p. 267-274.

17.

Ogura-Tsujita, Y. and H. Okubo, Effects of low nitrogen medium on endogenous changes in ethylene, auxins, and cytokinins in in vitro shoot formation from rhizomes of Cymbidium kanran. In Vitro Cell Dev Biol Plant, 2006. 42: p. 614-616.

18.

Zheng, C., et al., Involvement of ethylene and gibberellin in the development of rhizomes and rhizome-like shoots in oriental cymbidium hybrids. J Japan Soc Hort Sci, 2005. 74: p. 306-310.

19.

Xie, Z., J. Kapteyn, and D.R. Gang, A systems biology investigation of the MEP/terpenoid and shikimate/phenylpropanoid pathways points to multiple levels of metabolic control in sweet basil glandular trichomes. The Plant Journal, 2008. 54: p. 349-361.

20.

Jiang, H., B.N. Timmermann, and D.R. Gang, Characterization and identification of diarylheptanoids in ginger (Zingiber officinale Rosc.) using high-performance liquid chromatography/electrospray ionization mass spectrometry. Rapid Commun Mass Spectrom, 2007. 21(4): p. 509-18.

21.

Jiang, H., et al., Analysis of curcuminoids by positive and negative electrospray ionization and tandem mass spectrometry. Rapid Commun Mass Spectrom, 2006. 20(6): p. 1001-12.

22.

Ma, X. and D.R. Gang, Metabolic profiling of in vitro micropropagated and conventionally greenhouse grown ginger (Zingiber officinale). Phytochemistry, 2006. 67(20): p. 2239-55.

23.

Jiang, H., B.N. Timmermann, and D.R. Gang, Use of liquid chromatography-electrospray ionization tandem mass spectrometry to identify diarylheptanoids in turmeric (Curcuma longa L.) rhizome. J Chromatogr A, 2006. 1111(1): p. 21-31.

24.

Ma, J., et al., Diarylheptanoids from the rhizomes of Zingiber officinale. Phytochemistry, 2004. 65(8): p. 1137-43.

25.

Jiang, H., et al., Metabolic profiling and phylogenetic analysis of medicinal Zingiber species: Tools for authentication of ginger (Zingiber officinale Rosc). Phytochemistry, 2006. 67(15): p. 1673-85.

26.

Jiang, H., et al., Instrument dependence of electrospray ionization and tandem mass spectrometric fragmentation of the gingerols. Rapid Communications in Mass Spectrometry, 2006. 20(20): p. 3089-3100.

27.

Jiang, H., et al., Characterization of gingerol-related compounds in ginger rhizome (Zingiber officinale Rosc.) by high-performance liquid chromatography/electrospray ionization mass spectrometry. Rapid Commun Mass Spectrom, 2005. 19(20): p. 2957-64.

28.

Li, J., et al., Comparison of chemical components between dry and fresh Zingiber officinale. Zhongguo Zhongyao Zazhi, 2001. 26(11): p. 748-751.

29.

Ramirez-Ahumada, M.C., B.N. Timmermann, and D.R. Gang, Biosynthesis of curcuminoids and gingerols in turmeric (Curcuma longa) and ginger (Zingiber officinale): identification of curcuminoid synthase and hydroxycinnamoyl-CoA thioesterases. Phytochemistry, 2006. 67(18): p. 2017-29.

30.

Austin, M.B. and J.P. Noel, The chalcone synthase superfamily of type III polyketide synthases. Nat Prod Rep, 2003. 20(1): p. 79-110.

31.

Brand, S., et al., A type III polyketide synthase from Wachendorfia thyrsiflora and its role in diarylheptanoid and phenylphenalenone biosynthesis. Planta, 2006. 224(2): p. 413-28.

32.

Katsuyama, Y., et al., In vitro synthesis of curcuminoids by type III polyketide synthase from Oryza sativa. Journal of Biological Chemistry, 2007. 282(52): p. 37702-37709.

33.

Ma, X. and D.R. Gang, Metabolic profiling of turmeric (Curcuma longa L.) plants derived from in vitro micropropagation and conventional greenhouse cultivation. J Agric Food Chem, 2006. 54(25): p. 9573-83.

34.

Negi, P.S., et al., Antibacterial activity of turmeric oil: a byproduct from curcumin manufacture. J Agric Food Chem, 1999. 47(10): p. 4297-300.

35.

Nishiyama, T., et al., Curcuminoids and sesquiterpenoids in turmeric (Curcuma longa L.) suppress an increase in blood glucose level in type 2 diabetic KK-Ay mice. J Agric Food Chem, 2005. 53(4): p. 959-63.

36.

Ji, M.J., et al., Induction of apoptosis by Ar-turmerone on various cell lines. Int. J. Mol. Med., 2004. 14(2): p. 253-256.

37.

Aratanechemuge, Y., et al., Selective induction of apoptosis by ar-turmerone isolated from turmeric (Curcuma longa L) in two human leukemia cell lines, but not in human stomach cancer cell line. Int J Mol Med, 2002. 9(5): p. 481-4.

38.

Dudareva, N., et al., The nonmevalonate pathway supports both monoterpene and sesquiterpene formation in snapdragon flowers. Proceedings of the National Academy of Sciences of the United States of America, 2005. 102(3): p. 933-938.

39.

Hu, F.Y., et al., Convergent evolution of perenniality in rice and sorghum. Proc Natl Acad Sci U S A, 2003. 100(7): p. 4050-4.

40.

Sano, H. and S. Youssefian, Light and nutritional regulation of transcripts encoding a wheat protein kinase homolog is mediated by cytokinins. Proc Natl Acad Sci USA, 1994. 91: p. 2582-2586.

41.

Lange, J., et al., A gene encoding a receptor-like protein kinase in the roots of common bean is differentially regulated in response to pathogens, symbionts and nodulation factors. Plant Sci, 1999. 142: p. 133-145.

42.

Chono, M., et al., A semidwarf phenotype of barley uzu results from a nucleotide substitution in the gene encoding a putative brassinosteroid receptor. Plant Physiol, 2003. 133: p. 1209-1219.

43.

Vinagre, F., et al., SHR5: a novel plant receptor kinase involved in plant-N2-fixing endophytic bacteria association. J Exp Bot, 2006. 57(3): p. 559-69.

44.

O'Donoghue, E.M., et al., Xyloglucan endotransglycosylase: a role after growth cessation in harvested asparagus. Aust J Plant Physiol, 2001. 28: p. 349-361.

45.

Kalluri, U.C. and C.P. Joshi, Differential expression patterns of two cellulose synthase genes are associated with primary and secondary cell wall development in aspen trees. Planta, 2004. 220(1): p. 47-55.

46.

Meyermans, H., et al., Modifications in lignin and accumulation of phenolic glucosides in poplar xylem upon down-regulation of caffeoyl-coenzyme A O-methyltransferase, an enzyme involved in lignin biosynthesis. J Biol Chem, 2000. 275(47): p. 36899-909.

47.

Liu, B., et al., Benzophenone synthase and chalcone synthase from Hypericum androsaemum cell cultures: cDNA cloning, functional expression, and site-directed mutagenesis of two polyketide synthases. Plant J, 2003. 34(6): p. 847-55.

48.

Davuluri, R.V., et al., AGRIS: Arabidopsis gene regulatory information server, an information resource of Arabidopsis cis-regulatory elements and transcription factors. BMC Bioinformatics, 2003. 4: p. 25.

49.

Guo, A., et al., DATF: a database of Arabidopsis transcription factors. Bioinformatics, 2005. 21(10): p. 2568-9.

50.

Gao, G., et al., DRTF: a database of rice transcription factors. Bioinformatics, 2006. 22(10): p. 1286-7.

51.

Palaniswamy, S.K., et al., AGRIS and AtRegNet: A platform to link cis-regulatory elements and transcription factors into regulatory networks. Plant Physiol, 2006. 140: p. 818-829.

52.

Larkin, J.C., et al., Arabidopsis GLABROUS1 gene requires downstream sequences for function. Plant Cell, 1993. 5(12): p. 1739-1748.

53.

Glover, B.J., M. Perez-Rodriguez, and C. Martin, Development of several epidermal cell types can be specified by the same MYB-related plant transcription factor. Development, 1998. 125(17): p. 3497-508.

54.

Lee, M.M. and J. Schiefelbein, WEREWOLF, a MYB-related protein in Arabidopsis, is a position-dependent regulator of epidermal cell patterning. Cell, 1999. 99(5): p. 473-83.

55.

Kirik, V., et al., The ENHANCER OF TRY AND CPC1 gene acts redundantly with TRIPTYCHON and CAPRICE in trichome and root hair cell patterning in Arabidopsis. Dev Biol, 2004. 268(2): p. 506-13.

56.

Xie, Q., et al., GRAB proteins, novel members of the NAC domain family, isolated by their interaction with a geminivirus protein. Plant Mol Biol, 1999. 39(4): p. 647-56.

57.

Johnson, C.S., B. Kolevski, and D.R. Smyth, TRANSPARENT TESTA GLABRA2, a trichome and seed coat development gene of Arabidopsis, encodes a WRKY transcription factor. Plant Cell, 2002. 14(6): p. 1359-75.

58.

Liu, Y., M. Schiff, and S.P. Dinesh-Kumar, Involvement of MEK1 MAPKK, NTF6 MAPK, WRKY/MYB transcription factors, COI1 and CTR1 in N-mediated resistance to tobacco mosaic virus. Plant J, 2004. 38(5): p. 800-9.

59.

Takada, S., et al., The CUP-SHAPED COTYLEDON1 gene of Arabidopsis regulates shoot apical meristem formation. Development, 2001. 128(7): p. 1127-35.

60.

Lloyd, A.M., V. Walbot, and R.W. Davis, Arabidopsis and Nicotiana anthocyanin production activated by maize regulators R and C1. Science, 1992. 258(5089): p. 1773-5.

61.

Deluc, L., et al., Characterization of a grapevine R2R3-MYB transcription factor that regulates the phenylpropanoid pathway. Plant Physiol, 2006. 140(2): p. 499-511.

62.

Skoog, F. and K.V. Thimann, Further experiments on the inhibition of the development of lateral buds by growth hormone. Proceedings Of The National Academy Of Sciences Of The United States Of America, 1934. 20: p. 480-485.

63.

Thimann, K.V. and F. Skoog, Studies on the growth hormone of plants III The inhibiting action of the growth substance on bud development. Proceedings Of The National Academy Of Sciences Of The United States Of America, 1933. 19: p. 714-716.

64.

Haver, D.L., U.K. Schuch, and C.J. Lovatt, Exposure of petunia seedlings to ethylene decreased apical dominance by reducing the ratio of auxin to cytokinin. Journal of Plant Growth Regulation, 2002. 21(4): p. 459-468.

65.

Groover, A.T., A. Pattishall, and A.M. Jones, IAA8 expression during vascular cell differentiation. Plant Mol Biol, 2003. 51(3): p. 427-35.

66.

Ulmasov, T., G. Hagen, and T.J. Guilfoyle, ARF1, a transcription factor that binds auxin response elements. Science, 1997. 276: p. 1865-1868.

67.

Tiwari, S.B., et al., Aux/IAA proteins are active repressors and their stability and activity are modulated by auxin. Plant Cell, 2001. 13: p. 2809-2822.

68.

Gray, W.M., et al., Auxin regulates SCF(TIR1)-dependent degradation of AUX/IAA proteins. Nature, 2001. 414(6861): p. 271-6.

69.

Anderson, J.V., W.S. Chao, and D.P. Horvath, Review: A current review on the regulation of dormancy in vegetative buds. Weed Science, 2001. 49(5): p. 581-589.

70.

Kim, Y.S., et al., A membrane-bound NAC transcription factor regulates cell division in Arabidopsis. Plant Cell, 2006. 18(11): p. 3132-44.

71.

Zimmermann, R. and W. Werr, Pattern formation in the monocot embryo as revealed by NAM and CUC3 orthologues from Zea mays L. Plant Mol Biol, 2005. 58: p. 669-685.

72.

Xie, Q., et al., Arabidopsis NAC1 transduces auxin signal downstream of TIR1 to promote lateral root development. Genes Dev, 2000. 14(23): p. 3024-36.

73.

Xie, Q., et al., SINAT5 promotes ubiquitin-related degradation of NAC1 to attenuate auxin signals. Nature, 2002. 419(6903): p. 167-70.

74.

Burg, S.P. and E.A. Burg, The interaction between auxin and ethylene and its role in plant growth. Proc Natl Acad Sci USA, 1965. 55: p. 262-269

75.

Franklin, D. and P.W. Morgan, Rapid production of auxin-induced ethylene. Plant Physiol, 1978. 62: p. 161-162.

76.

Jacobs, W.P. and W. Davis, Effects of gibberellic acid on the rhizome and rhizoids of the algal coenocyte, Caulerpa prolifera, in culture. Ann Bot, 1983. 52: p. 39-41.

77.

Poapst, P.A., et al., Identification of ethylene in gibberellic-acid-treated potatoes. J Sci Fd Agric, 1968. 19: p. 325-327.

78.

O'Neill, D.P. and J.J. Ross, Auxin regulation of the gibberellin pathway in pea. Plant Physiol, 2002. 130: p. 1974-1982

79.

Frigerio, M., et al., Transcriptional regulation of gibberellin metabolism genes by auxin signaling in Arabidopsis. Plant Physiol, 2006. 142: p. 553-563.

80.

Dong, J. and D.I. Dunstan, A reliable method for extraction of RNA from various conifer tissues. Plant Cell Reports, 1996. 15(7): p. 516-521.

81.

Zhang, D., et al., Construction and evaluation of cDNA libraries for large-scale expressed sequence tag sequencing in wheat (Triticum aestivum L.). Genetics, 2004. 168(2): p. 595-608.

82.

Huang, X.Q. and A. Madan, CAP3: A DNA sequence assembly program. Genome Res., 1999. 9: p. 868-877.

83.

Zhang, Z., et al., A greedy algorithm for aligning DNA sequences. J. Comput. Biol., 2000. 7: p. 203-214.

84.

Mbéguié-A-Mbéguiéa, D., et al., Use of suppression subtractive hybridization approach to identify genes differentially expressed during early banana fruit development undergoing changes in ethylene responsiveness. Plant Sci, 2007. 172(5): p. 1025-1036.

85.

Mulder, N.J., et al., New developments in the InterPro database. Nucleic Acids Res, 2007. 35(Database issue): p. D224-8.

86.

Jaiswal, P., et al., Gramene: a bird's eye view of cereal genomes. Nucleic Acids Res, 2006. 34(Database issue): p. D717-23.

87.

McCouch, S.R., et al., Development and mapping of 2240 new SSR markers for rice (Oryza sativa L.). DNA Res, 2002. 9(6): p. 199-207.

88.

Temnykh, S., et al., Computational and experimental analysis of microsatellites in rice (Oryza sativa L.): frequency, length variation, transposon associations, and genetic marker potential. Genome Res, 2001. 11(8): p. 1441-52.

89.

Lewis, S.E., et al., Apollo: a sequence annotation editor. Genome Biol, 2002. 3(12): p. RESEARCH0082.

90.

R_Development_Core_Team, R: A Language and Environment for Statistical Computing. 2008.

91.

Smyth, G.K., Limma: linear models for microarray data. Bioinformatics and Computational Biology Solutions using R and Bioconductor, R., 2005: p. 397-420.

92.

Smyth, G.K., Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol, 2004. 3(12): p. Article3.

93.

Smyth, G.K. and T. Speed, Normalization of cDNA microarray data. Methods, 2003. 31(4): p. 265-73.

94.

Benjamini, Y. and Y. Hochberg, CONTROLLING THE FALSE DISCOVERY RATE - A PRACTICAL AND POWERFUL APPROACH TO MULTIPLE TESTING. Journal of the Royal Statistical Society Series B-Methodological, 1995. 57(1): p. 289-300.

[Figure 1 image: neighbor-joining tree of ginger and turmeric PKS contigs (GT1 identifiers) with reference sequences. Labeled groups: beta-ketoacyl-CoA synthase-like; rice small clade; A.t. and varied putative PKSs; rice PKS clade and more rice PKSs; rice “CUS” (os07g17010); grass PKSs; MARPA PKSs; WtPKS2-like, including WtPKS2 (Q3L5A7 WACTH, uncharacterized); WtPKS1-like, including WtPKS1 (Q3ZMG6 WACTH); other chalcone synthases (CHSs, including ginger and turmeric CHSs). * indicates a full-length cDNA sequence in the EST database.]

Figure 1 Ginger and turmeric PKSs that are not bona fide chalcone synthases cluster into three major groups, WtPKS1-like, WtPKS2-like and beta-ketoacyl synthase-like, in a neighbor-joining similarity tree (parsimony and maximum likelihood produce similar trees). Contigs in the database that contain a full-length cDNA are indicated by bold and *. WtPKS1 and rice CUS are indicated by bold and underlining.

[Figure 2 image: bar chart. Vertical axis: rhizome-specific ESTs/total ESTs × 10^4 (0 to 400). Horizontal axis: rhizome library. Regulator classes: Znf_SBP, Znf_RING, Znf_PHD, Znf_GATA, Znf_DOF, Znf_CW, Znf_CCCH, Znf_C5HC2, Znf_C2H2, Znf_BED, Znf_AN1, WRKY, WD40, SAND, NAC, MYB, MADS, HSF, HOMEOBOX, GRAS, HD_ZIP, ERF, CONSTANS, bZIP, bHLH, B3, ARF. Libraries: ArRESTs (21928, 50537; 819, 1909); ZO_CL_Rh_ESTs_ArREST (8499, 19595; 426, 732); CL_Eb_ESTs_ArREST (2728, 6832; 131, 253); ZO_ESTs_ArREST (5771, 12763; 295, 479); ZO_Ec_ESTs_ArREST (2945, 6253; 160, 250); ZO_Ed_ESTs_ArREST (2817, 6510; 135, 229).]

Figure 2 Fraction of ESTs (standardized per 10,000 ESTs) of different classes of probable transcriptional regulators (GO category 0003677) in ginger and turmeric libraries. Each library description is followed in parentheses by additional information for that library: (number of contigs, number of ESTs; number of contigs with GO:0003677, number of ESTs with GO:0003677). ArRESTs: EST collection of all ginger and turmeric libraries within ArREST. ZO_CL_Rh_ESTs_ArREST: combined ESTs of all ginger and turmeric rhizome libraries within ArREST. CL_Eb_ESTs_ArREST: turmeric rhizome library. ZO_ESTs_ArREST: combined EST collection of two ginger rhizome libraries. ZO_Ec_ESTs_ArREST: white ginger rhizome EST library. ZO_Ed_ESTs_ArREST: yellow ginger rhizome library. Values per category are shown in Supplemental Table S8.

[Figure 3 image: bar charts of microarray signal intensity for each contig, with panel titles prefixed G: and T:. Panels: A, 00544_01 (ERF); B, 11067_01 (Auxin Regulated Protein); C, 07788_01 (Aux/IAA); D, 07406_01 (MADS); E, 14247_01 (MADS); F, 09719_01 (MYB). Sample labels on the horizontal axes: GL2M, GL7M, GR2M, GR7M, GRh2M, GRh3M, GRh4M, GRh6M, GRh7M; FL7M, FRh3M, FRh5M, FRh7M, TL7M, TR7M, TRh3M, TRh5M, TRh7M.]

Figure 3 Comparisons of microarray signal intensities of several rhizome-enriched contigs with rhizome-specific expression in ginger and turmeric. Each of these selected genes possesses signal intensities with rhizome–enriched expression in at least one of the species compared to other tissues and/or time-points with coefficients ≥ 2 and p-values ≤ 0.05.

[Figure 4: map of rice chromosomes 2, 3, 6, 7, and 10 (labeled Rice 02, Rice 03, Rice 06, Rice 07, Rice 10); vertical axis: length (Mb). Legend entries: Rhizome QTL; Rice MADS box; Ginger/Turmeric BlastX Hit. Rice gene labels: Os02g52340, Os02g45770, Os02g36924, Os03g54160, Os03g08754, Os06g49840, Os06g45650, Os06g11330, Os06g06750, Os06g01890, Os07g41370, Os07g01820, Os10g39130. QTL labels: QRn6, QRl6, QRdb2, QRl7, QRn10, QRz3, QRn7, Rhz2. Ginger/turmeric contig labels: GT1_00188_02, GT1_07406_01, GT1_14247_01.]

Figure 4 Associations between MADS box transcription factors from ginger/turmeric, sorghum rhizome-enriched ESTs, and rice rhizomatous QTLs ESTs on rice chromosomes 2, 3, 6, 7, and 10. White boxes filled with black stars represent rhizome QTLs, while triangles indicate the spatial locations of MADS box genes on the rice chromosomes. Rice MADS box genes are illustrated using white triangles; black triangles are ginger/turmeric Blastx hits (e-values < e20). * denotes the best Blastx hits for the ginger and turmeric contig/rice gene comparison. † denotes a MADS box gene shown by microarray analysis to be preferentially differentially expressed in the rhizome when compared to leaf or root tissues (Fig. 3D&E).

Table 1. Entry point enzymes that regulate carbon partitioning into specific metabolic pathways

| Descriptive name | EC # | Abbr. | T3C Rh | T3C L | GW Rh | GW L | GW R | GY Rh | GY L | GY R |
|---|---|---|---|---|---|---|---|---|---|---|
| Primary/core metabolism | | | | | | | | | | |
| sucrose synthase | 2.4.1.13 | SUS | 25 | 0 | 39 | 7.1 | 38 | 23 | 14 | 14 |
| pyruvate kinase | 2.7.1.40 | KPY | 8.8 | 4.4 | 16 | 2.8 | 6.3 | 12 | 0 | 7.1 |
| pyruvate dehydrogenase | 1.2.1.51 | PNO | 1.8 | 3.0 | 6.4 | 5.7 | 6.3 | 3.1 | 10.1 | 3.5 |
| Shikimate pathway | | | | | | | | | | |
| DAHP synthetase | 2.5.1.54 | DAHPS | 3.5 | 0 | 6.4 | 0 | 1.6 | 14 | 3.4 | 7.1 |
| Phenylpropanoid pathway | | | | | | | | | | |
| phenylalanine ammonia lyase | 4.3.1.5 | PAL | 11 | 0 | 4.8 | 16 | 13 | 3.1 | 0 | 11 |
| Terpenoid pathway: MEP pathway | | | | | | | | | | |
| DOXP reductoisomerase | 1.1.1.267 | DXR | 1.8 | 0 | 6.4 | 0 | 0 | 0 | 0 | 20 |
| Terpenoid pathway: MVA pathway | | | | | | | | | | |
| HMG-CoA reductase | 1.1.1.34 | HMGR | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.5 |
| Terpenoid pathway: common steps | | | | | | | | | | |
| isopentenyl diphosphate Δ-isomerase | 5.3.3.2 | IDI | 3.5 | 0 | 4.8 | 1.4 | 6.3 | 7.7 | 0 | 0 |
| farnesyl diphosphate synthase | 2.5.1.10 | FPPS | 12 | 8.9 | 6.4 | 0 | 11 | 34 | 6.7 | 11 |
| terpene synthases | N/A | CS | 30 | 7.4 | 16 | 0 | 14 | 36 | 0 | 20 |
| One carbon metabolism | | | | | | | | | | |
| methionine synthase (cobalamin-independent) | 2.1.1.14 | METE | 7 | 22 | 18 | 4.3 | 17 | 17 | 3.4 | 21 |

Values given are normalized total EST number (TEN) (× 10⁴). TEN, an indicator of EST expression levels, was calculated as the sum of all ESTs that are associated with gene products carrying out a specific enzymatic activity, divided by the total EST number within a particular library.
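The TEN normalization described in the table note can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used in this study: the function name `normalized_ten` and the example counts are hypothetical.

```python
def normalized_ten(est_counts, library_total, scale=10_000):
    """Sum the ESTs assigned to one enzymatic activity, divide by the
    total EST number of the library, and scale (here by 10^4)."""
    if library_total <= 0:
        raise ValueError("library_total must be positive")
    return scale * sum(est_counts) / library_total


# Hypothetical example: three contigs annotated with the same activity
# contribute 8, 4 and 2 ESTs in a library of 5,703 ESTs.
ten = normalized_ten([8, 4, 2], 5703)
print(round(ten, 1))
```

Because the value is a ratio scaled to 10,000 ESTs, libraries of different sizes can be compared directly, but very small counts (one or two ESTs) still translate into coarse steps.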


Supplemental Table S1. cDNA library sources for sequences and unigene sets described in this study Species lines Tissue LibID Contigs Singletons ESTs Rhizome CL__Ea 2275 772 5703 GY Leaf ZO__Ea 2046 564 6006 Root ZO__Ee 2292 551 5672 Ginger Rhizome ZO__Ec 2374 1226 6239 Leaf ZO__Eg 2520 944 7141 GW Root ZO__Ef 2480 908 6333 Rhizome ZO__Ed 2261 1038 6483 Turmeric T3C Leaf CL__Eb 2371 879 6832 Total 13717 6882 50409

Supplemental Table S2. The most abundantly represented transcripts in the ArREST database (EST number ≥ 40)*

In total cDNA libraries

| Contig ID | EST count | Annotation | E-value | Libraries (EST number) |
|---|---|---|---|---|
| 04881_01 | 119 | Ribulose bisphosphate carboxylase small chain, chloroplast [Precursor] (RBS_MUSAC, Musa acuminata). | 2e-81 | ZO_Ea (82), ZO_Eg (37) |
| 08346_01 | 91 | Ribulose bisphosphate carboxylase small chain, chloroplast [Precursor] (RBS_MUSAC, Musa acuminata). | 4e-81 | ZO_Ea (48), ZO_Eg (43) |
| 06922_01 | 79 | Metallothionein-like protein type 3 (MT3_MUSAC, Musa acuminata). | 9e-20 | ZO_Ea (49), ZO_Eg (29), CL_Ea (1) |
| 00040_04 | 51 | Ribulose bisphosphate carboxylase/oxygenase activase, chloroplast [Precursor] (RCA_ARATH, Arabidopsis thaliana). | 1e-92 | ZO_Ea (19), ZO_Eg (31), ZO_Ec (1) |
| 00003_19 | 51 | S-adenosylmethionine synthetase 1 (METK_ORYSA, Oryza sativa). | 0 | ZO_Ea (5), ZO_Eg (4), ZO_Ec (8), ZO_Ee (6), ZO_Ef (18), CL_Ea (10) |
| 00018_07 | 50 | Catalase isozyme A (CATA1_ORYSJ, Oryza sativa subsp. japonica). | e-129 | ZO_Ea (12), ZO_Eg (11), ZO_Ec (12), ZO_Ee (3), ZO_Ef (6), CL_Ea (6) |
| 10891_01 | 49 | Ribulose bisphosphate carboxylase/oxygenase activase, chloroplast [Precursor] (RCA_MALDO, Malus domestica). | 0 | ZO_Ea (21), ZO_Eg (28) |
| 00001_086 | 49 | 4-coumarate--CoA ligase 2 (4CL2_ARATH, Arabidopsis thaliana). | 5e-38 | ZO_Ea (9), ZO_Eg (8), ZO_Ec (13), ZO_Ee (5), ZO_Ef (4), CL_Ea (8), CL_Eb (2) |
| 00018_04 | 47 | Catalase isozyme A (CATA1_ORYSJ, Oryza sativa subsp. japonica). | 0 | ZO_Ea (7), ZO_Eg (13), ZO_Ec (10), ZO_Ee (2), ZO_Ef (3), CL_Ea (11), CL_Eb (1) |
| 10130_01 | 46 | S-adenosylmethionine synthetase 1 (METK_ORYSA, Oryza sativa). | 0 | ZO_Ea (8), ZO_Eg (8), ZO_Ec (8), ZO_Ee (5), ZO_Ef (13), CL_Ea (4) |
| 09591_01 | 46 | Metallothionein-like protein type 3 (MT3_MUSAC, Musa acuminata). | 4e-21 | ZO_Ea (20), ZO_Eg (20), ZO_Ec (1), ZO_Ef (3), CL_Ea (2) |
| 00017_04 | 46 | Fructose-bisphosphate aldolase, chloroplast [Precursor] (ALFC_ORYSJ, Oryza sativa subsp. japonica). | 0 | ZO_Ea (26), ZO_Eg (19), CL_Ea (1) |
| 00007_13 | 44 | Heat shock cognate 70 kDa protein 2 (HSP72_SOLLC, Solanum lycopersicum). | 0 | ZO_Ea (2), ZO_Eg (13), ZO_Ec (5), ZO_Ee (6), ZO_Ef (6), CL_Ea (12) |
| 11167_01 | 42 | Ribulose bisphosphate carboxylase/oxygenase activase 2, chloroplast [Precursor] (RCA2_LARTR, Larrea tridentata). | 0 | ZO_Ea (21), ZO_Eg (20), CL_Eb (1) |
| 00144_04 | 40 | Glycine dehydrogenase [decarboxylating], mitochondrial [Precursor] (GCSP_SOLTU, Solanum tuberosum). | 0 | ZO_Ea (29), ZO_Eg (8), ZO_Ec (2), CL_Ea (1) |

* Protein entry names and E-values were obtained from the best BLASTX hit (nucleotide to protein) against the Swiss-Prot database. Annotations and original plant names were copied from http://au.expasy.org/sprot/ using the protein entry name search of the Swiss-Prot/TrEMBL database.

Supplemental Table S3. The most abundantly represented transcripts in specific ginger and turmeric rhizome libraries (EST number ≥ 10 per library)

The last three columns give the EST count of each rhizome library*.

| Contig ID | EST count (total) | EST count (rhizome) | Annotation | E-value | ZO_Ec | ZO_Ed | CL_Ea |
|---|---|---|---|---|---|---|---|
| Sorted by GW rhizome library (ZO_Ec) | | | | | | | |
| 00001_086 | 49 | 21 | 4-coumarate-CoA ligase 2 (4CL2_ARATH, Arabidopsis thaliana). | 5e-38 | 13 | 8 | 0 |
| 00018_07 | 50 | 18 | Catalase isozyme A (CATA1_ORYSJ, Oryza sativa subsp. japonica). | e-129 | 12 | 6 | 0 |
| 00264_03 | 27 | 12 | Mitochondrial carnitine/acylcarnitine carrier-like protein (MCAT_ARATH, Arabidopsis thaliana). | 2e-18 | 12 | 0 | 0 |
| 00004_31 | 38 | 18 | Metallothionein-like protein 2C (MT2C_ORYSI, Oryza sativa subsp. indica). | 3e-21 | 11 | 7 | 0 |
| 00018_04 | 47 | 21 | Catalase isozyme A (CATA1_ORYSJ, Oryza sativa subsp. japonica). | 0 | 10 | 11 | 0 |
| Sorted by GY rhizome library (ZO_Ed) | | | | | | | |
| 00007_13 | 44 | 17 | Heat shock cognate 70 kDa protein 2 (HSP72_SOLLC, Solanum lycopersicum). | 0 | 5 | 12 | 0 |
| 00078_05 | 16 | 12 | No hit. | | 0 | 12 | 0 |
| 00018_04 | 47 | 21 | Catalase isozyme A (CATA1_ORYSJ, Oryza sativa subsp. japonica). | 0 | 10 | 11 | 0 |
| 00003_19 | 51 | 18 | S-adenosylmethionine synthetase 1 (METK_ORYSA, Oryza sativa). | 0 | 8 | 10 | 0 |
| 00009_25 | 21 | 12 | No hit. | | 2 | 10 | 0 |
| Sorted by T3C rhizome library (CL_Ea) | | | | | | | |
| 00031_02 | 29 | 18 | 4-coumarate-CoA ligase 1 (4CL1_TOBAC, Nicotiana tabacum). | 2e-50 | 0 | 0 | 18 |
| 01767_01 | 22 | 16 | Ubiquitin (UBIQ_WHEAT, Triticum aestivum). | 9e-36 | 0 | 0 | 16 |
| 00271_02 | 16 | 16 | Dehydration-responsive protein RD22 [Precursor] (D22_ARATH, Arabidopsis thaliana). | 2e-46 | 0 | 0 | 16 |
| 00035_07 | 15 | 15 | Farnesyl pyrophosphate synthetase (FPPS_MAIZE, Zea mays). | e-151 | 0 | 0 | 15 |
| 09251_01 | 19 | 14 | Ubiquitin (UBIQ_WHEAT, Triticum aestivum). | 1e-35 | 0 | 0 | 14 |
| 00006_10 | 14 | 12 | Cytochrome P450 71D7 (C71D7_SOLCH, Solanum chacoense). | e-129 | 0 | 0 | 12 |
| 00237_02 | 17 | 11 | S-adenosylmethionine synthetase 1 (METK_ORYSA, Oryza sativa). | 0 | 0 | 0 | 11 |
| 00284_03 | 13 | 11 | Tubulin alpha-1 chain (TBA1_ORYSJ, Oryza sativa subsp. japonica). | e-150 | 0 | 0 | 11 |
| 00090_03 | 11 | 11 | S-adenosylmethionine synthetase 1 (METK_ORYSA, Oryza sativa). | 0 | 0 | 0 | 11 |
| 06990_01 | 11 | 10 | Probable aquaporin PIP1-5 (PIP15_ARATH, Arabidopsis thaliana). | e-172 | 0 | 0 | 10 |
| 11059_01 | 11 | 10 | No hit. | | 0 | 0 | 10 |

*Values given are normalized total EST number (TEN) (× 10⁴). TEN, an indicator of EST expression levels, was calculated as the sum of all ESTs that are associated with gene products carrying out a specific enzymatic activity, divided by the total EST number within a particular library.

Supplemental Table S4: Normalized EST expression levels for selected enzymes in ginger and turmeric metabolic pathways*

| Descriptive name | EC number | Abbr. | T3C Rh | T3C L | GW Rh | GW L | GW R | GY Rh | GY L | GY R |
|---|---|---|---|---|---|---|---|---|---|---|
| Primary/core metabolism | | | | | | | | | | |
| sucrose synthase | 2.4.1.13 | SUS | 24.6 | 0.0 | 38.5 | 7.1 | 38.0 | 23.2 | 13.5 | 14.1 |
| glucose-6-phosphate dehydrogenase | 1.1.1.49 | GPD | 1.8 | 3.0 | 1.6 | 0.0 | 0.0 | 3.1 | 1.7 | 8.8 |
| 6-phosphogluconate dehydrogenase | 1.1.1.44 | 6PGD | 0.0 | 8.9 | 0.0 | 1.4 | 3.2 | 3.1 | 0.0 | 3.5 |
| ribulose-5-phosphate isomerase | 5.1.3.4 | RPI | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ribulose-5-phosphate epimerase | 5.1.3.18 | RPE | 1.8 | 8.9 | 3.2 | 10.0 | 3.2 | 15.5 | 13.5 | 8.8 |
| transketolase | 2.2.1.1 | TKT | 3.5 | 4.4 | 1.6 | 7.1 | 0.0 | 13.9 | 1.7 | 3.5 |
| transaldolase | 2.2.1.2 | TAL | 3.5 | 0.0 | 6.4 | 2.8 | 3.2 | 7.7 | 6.7 | 5.3 |
| pyrophosphate:fructose-6-phosphate 1-phosphotransferase | 2.7.1.90 | PFPP | 12.3 | 3.0 | 6.4 | 0.0 | 1.6 | 0.0 | 3.4 | 0.0 |
| aldolase | 4.1.2.13 | ADL | 17.6 | 91.8 | 24.1 | 66.8 | 14.3 | 4.6 | 87.5 | 21.2 |
| glyceraldehyde-3-phosphate dehydrogenase | 1.2.1.12 | G3P | 31.7 | 41.5 | 17.7 | 39.8 | 15.8 | 23.2 | 85.8 | 21.2 |
| phosphoglycerate kinase | 2.7.2.3 | PGK | 0.0 | 16.3 | 4.8 | 17.1 | 4.8 | 0.0 | 18.5 | 3.5 |
| PG mutase (phosphoglucomutase) | 5.4.2.2 | PGM | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 3.1 | 0.0 | 0.0 |
| enolase | 4.2.1.11 | ENO | 1.8 | 4.4 | 0.0 | 0.0 | 1.6 | 3.1 | 3.5 | 0.0 |
| 3-phosphoglycerate dehydrogenase | 1.1.1.95 | 3PGD | 0.0 | 4.4 | 9.6 | 8.5 | 7.9 | 1.5 | 18.5 | 10.6 |
| pyruvate kinase | 2.7.1.40 | KPY | 8.8 | 4.4 | 16.1 | 2.8 | 6.3 | 12.4 | 0.0 | 7.1 |
| pyruvate dehydrogenase | 1.2.1.51 | PNO | 1.8 | 3.0 | 6.4 | 5.7 | 6.3 | 3.1 | 10.1 | 3.5 |
| acetoacetyl-CoA thiolase | 2.3.1.9 | AACT | 54.5 | 22.2 | 22.5 | 46.9 | 14.3 | 40.2 | 16.8 | 17.7 |
| Shikimate pathway | | | | | | | | | | |
| DAHP synthetase | 2.5.1.54 | DAHPS | 3.5 | 0.0 | 6.4 | 0.0 | 1.6 | 13.9 | 3.4 | 7.1 |
| 3-dehydroquinate synthase | 4.2.3.4 | DHQS | 1.8 | 0.0 | 0.0 | 2.8 | 1.6 | 3.1 | 0.0 | 0.0 |
| 3-dehydroquinate dehydratase/shikimate dehydrogenase | 1.1.1.25 | DHQSD | 5.3 | 10.4 | 14.4 | 8.5 | 33.3 | 13.9 | 6.7 | 30.1 |
| shikimate kinase | 2.7.1.71 | AROK | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.1 | 1.7 | 3.5 |
| EPSP synthase | 2.5.1.19 | AROA | 3.5 | 0.0 | 0.0 | 0.0 | 1.6 | 3.1 | 0.0 | 0.0 |
| chorismate synthase | 4.2.3.5 | AROC | 3.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| glutamine-2-oxoglutarate aminotransferase | 1.4.1.13 | GOGAT | 7.0 | 1.5 | 0.0 | 5.7 | 0.0 | 0.0 | 16.8 | 0.0 |
| glutamine synthetase | 6.3.1.2 | GS | 1.8 | 22.2 | 1.6 | 19.9 | 9.5 | 18.6 | 30.3 | 14.1 |
| aminotransferase | 2.6.1.52 | AMT | 3.5 | 0.0 | 0.0 | 0.0 | 0.0 | 9.3 | 0.0 | 0.0 |
| Phenylpropanoid pathway | | | | | | | | | | |
| phenylalanine ammonia lyase | 4.3.1.5 | PAL | 10.6 | 0.0 | 4.8 | 15.6 | 12.7 | 3.1 | 0.0 | 10.6 |
| cinnamate 4-hydroxylase | 1.14.13.11 | C4H | 28.1 | 45.9 | 35.3 | 39.8 | 36.4 | 63.4 | 23.6 | 15.9 |
| 4-coumarate-CoA ligase | 6.2.1.12 | 4CL | 5.3 | 35.6 | 8.0 | 11.4 | 15.8 | 32.5 | 11.8 | 17.7 |
| p-coumaroyl-CoA:shikimate p-coumaroyl transferase | N/A | CST | 3.5 | 17.8 | 9.6 | 10.0 | 6.3 | 34.0 | 8.4 | 1.8 |
| p-coumaroylshikimate-3'-hydroxylase | N/A | CS3'H | 14.1 | 17.8 | 14.4 | 25.6 | 30.1 | 23.2 | 10.1 | 8.8 |
| caffeoyl-CoA O-methyltransferase | 2.1.1.104 | CCOMT | 14.1 | 0.0 | 14.4 | 7.1 | 4.8 | 1.5 | 3.4 | 7.1 |
| Terpenoid pathway: MEP pathway | | | | | | | | | | |
| 1-deoxy-D-xylulose-5-phosphate (DOXP) synthase | 2.2.1.7 | DXS | 3.5 | 10.4 | 1.6 | 14.2 | 11.1 | 7.7 | 16.8 | 10.6 |
| DOXP reductoisomerase | 1.1.1.267 | DXR | 1.8 | 0.0 | 6.4 | 0.0 | 0.0 | 0.0 | 0.0 | 19.5 |
| CDP-ME kinase | 2.7.1.148 | CMK | 0.0 | 0.0 | 0.0 | 0.0 | 4.8 | 0.0 | 0.0 | 0.0 |
| 2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase | 4.6.1.12 | MCS | 0.0 | 3.0 | 0.0 | 2.8 | 0.0 | 3.1 | 0.0 | 0.0 |
| 4-hydroxy-3-methylbut-2-en-1-yl diphosphate synthase | 1.17.4.3 | GCPE | 1.8 | 11.9 | 0.0 | 0.0 | 3.2 | 4.6 | 6.7 | 3.5 |
| 1-hydroxy-2-methyl-butenyl 4-diphosphate reductase | 1.17.1.2 | HDR | 0.0 | 0.0 | 3.2 | 2.8 | 0.0 | 1.5 | 3.4 | 0.0 |
| 4-hydroxy-3-methylbut-2-en-1-yl diphosphate reductase | 1.17.1.2 | IDS | 0.0 | 3.0 | 3.2 | 7.1 | 0.0 | 1.5 | 10.1 | 0.0 |
| Terpenoid pathway: MVA pathway | | | | | | | | | | |
| HMG-CoA synthase | 2.3.3.10 | HMGS | 0.0 | 0.0 | 0.0 | 0.0 | 6.3 | 0.0 | 0.0 | 0.0 |
| HMG-CoA reductase | 1.1.1.34 | HMGR | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.5 |
| MVA kinase | 2.7.1.36 | MVK | 0.0 | 0.0 | 1.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Terpenoid pathway: common steps | | | | | | | | | | |
| isopentenyl-diphosphate delta-isomerase | 5.3.3.2 | IDI | 3.5 | 0.0 | 4.8 | 1.4 | 6.3 | 7.7 | 0.0 | 0.0 |
| geranyl pyrophosphate synthase | 2.5.1.10 | FPPS | 12.3 | 8.9 | 6.4 | 0.0 | 11.1 | 34.0 | 6.7 | 10.6 |
| geranyl geranyl diphosphate synthase | 2.5.1.29 | GGPPS | 6.7 | 0.0 | 3.2 | 1.5 | 7.1 | 4.8 | 3.5 | 8.9 |
| terpene synthases | N/A | CS | 29.9 | 7.4 | 16.1 | 0.0 | 14.3 | 35.6 | 0.0 | 19.5 |
| One carbon metabolism | | | | | | | | | | |
| phosphoserine phosphatase | 3.1.3.3 | PSP | 0.0 | 0.0 | 3.2 | 0.0 | 0.0 | 3.1 | 3.4 | 0.0 |
| serine hydroxymethyltransferase | 2.1.2.1 | GLYM | 10.6 | 22.2 | 0.0 | 7.1 | 11.1 | 9.3 | 20.2 | 3.5 |
| glycine dehydrogenase | 1.4.4.2 | GCSP | 8.8 | 5.9 | 4.8 | 21.3 | 0.0 | 0.0 | 85.8 | 0.0 |
| 5,10-methylene-THF reductase | 1.5.1.20 | MTHR | 3.5 | 0.0 | 3.2 | 4.3 | 0.0 | 0.0 | 0.0 | 3.5 |
| 5,10-methenyl-THF cyclohydrolase | 3.5.4.9 | FOLD | 0.0 | 1.5 | 1.6 | 4.3 | 0.0 | 7.7 | 0.0 | 0.0 |
| 5,10-methylene-THF dehydrogenase | 1.5.1.5 | FOLDD | 0.0 | 1.5 | 1.6 | 1.4 | 0.0 | 7.7 | 0.0 | 0.0 |
| 10-formyl-THF synthetase | 6.3.4.3 | FTHS | 0.0 | 0.0 | 0.0 | 1.4 | 0.0 | 0.0 | 0.0 | 0.0 |
| methionine synthase (cobalamin-independent) | 2.1.1.14 | METE | 7.0 | 22.2 | 17.7 | 4.3 | 17.4 | 17.0 | 3.4 | 21.2 |
| ATP synthase | 3.6.3.14 | ATPS | 19.3 | 20.7 | 24.1 | 22.7 | 9.5 | 9.3 | 25.2 | 31.8 |
| adenylate kinase | 2.7.4.3 | KAD | 14.1 | 3.0 | 8.0 | 2.8 | 1.6 | 18.6 | 0.0 | 7.1 |
| adenosine kinase | 2.7.1.20 | ADK | 5.3 | 3.0 | 6.4 | 0.0 | 11.1 | 9.3 | 0.0 | 7.1 |
| S-adenosylmethionine synthetase | 2.5.1.6 | SAM | 45.7 | 44.4 | 62.6 | 35.5 | 148.9 | 74.2 | 45.4 | 46.0 |
| S-adenosylhomocysteine hydrolase | 3.3.1.1 | SAHC | 10.6 | 14.8 | 9.6 | 11.4 | 11.1 | 13.9 | 3.4 | 23.0 |

* Total normalized EST number (TEN) was determined as described for Supplemental Table S3.

76 Supplemental Table S5: Normalized EST expression levels for selected gene families* T3C GW Potential function Contigs ESTs (based on best hit in UniProt) Rh L Rh L Polyketide synthases (PKSs) type III polyketide synthase 33.7 10.0 12.8 30.9 20 132.7 isoforms 1 and 2 chalcone/naregenin-chalcone 4 21.6 6.8 8.5 0.0 0.0 synthase 0.0 5.7 6.4 15.5 putative PKSs 18 61.3 Total 42 206.3 33.5 32.7 19.2 39.8 Terpene Synthases (TPSs) Monoterpene 3.5 0.0 0.0 0.0 isoprene synthase 7 22.1 limonene synthase 4 16.4 0.0 0.0 3.2 0.0 10.5 0.0 4.8 0.0 γ-terpinene synthase 3 15.3 0.0 0.0 0.0 0.0 2 13.5 (–)-β-pinene synthase 3.5 0.0 0.0 0.0 (+)-sabinene synthase 1 3.5 0.0 0.0 4.8 0.0 geraniol synthase 1 4.8 linalool synthase 1 2.9 0.0 2.9 0.0 0.0 Sesquiterpene germacrene d synthase 5 23.9 3.5 0.0 3.2 0.0 3.5 2.9 0.0 0.0 (+)-∆-cadinene synthase 2 6.4 1 3.2 0.0 0.0 3.2 0.0 (E)-β-farnesene synthase 0.0 0.0 0.0 0.0 cascarilladiene synthase 1 1.8 patchoulol synthase 1 3.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 3.5 (E)-α-bergamotene synthase Diterpene levopimaradiene synthase 2 6.6 3.5 1.5 1.6 0.0 Triterpene cycloartenol synthase 1 3.1 0.0 0.0 0.0 0.0 3.5 0.0 0.0 0.0 oxidosqualene cyclase 1 3.5 squalene synthase 1 1.5 0.0 0.0 0.0 0.0 Tetraterpene phytoene synthase 10 41.7 0.0 23.4 3.2 8.4 31.6 30.7 24.0 8.4 Total 45 177.0 NAD(P)H-dependent Reductases/Dehydrogenases 10-hydroxygeraniol 0.0 0.0 0.0 2.8 3 16.1 oxidoreductase 1.8 0.0 0.0 2.8 2'-hydroxyisoflavone reductase 8 27.4 3-hydroxy-3-methylglutaryl 0.0 0.0 0.0 0.0 1 3.5 coenzyme A reductase 3-oxo-5-alpha-steroid 44 9.6 1.8 2.9 0.0 0.0 dehydrogenase 5.3 0.0 3.2 2.8 alcohol dehydrogenase 3 11.3 alcohol dehydrogenase class III 2 9.1 0.0 0.0 3.2 2.8 21.0 5.9 14.4 11.2 allyl alcohol dehydrogenase 14 78.7 anthocyanidin reductase. 
2 12.3 0.0 0.0 0.0 0.0 7.0 0.0 3.2 1.4 cinnamoyl-CoA reductase 7 18.1 cinnamyl alcohol dehydrogenase 5 19.0 0.0 0.0 0.0 0.0 hydroxyphenylpyruvate 1 3.3 0.0 0.0 0.0 0.0 reductase 0.0 0.0 0.0 5.6 hydroxypyruvate reductase 5 23.8 mannitol dehydrogenase 6 19.5 3.5 2.9 1.6 0.0 NAD-dependent formate 2 22.8 0.0 0.0 8.0 2.8 dehydrogenase NADPH-dependent 11 39.4 0.0 8.8 3.2 12.6 oxidoreductase (aldo/keto) phenylcoumaran benzylic ether 0.0 2.9 0.0 0.0 1 2.9 reductase 0.0 4.4 3.2 0.0 6 22.2 progesterone 5-β-reductase 0.0 0.0 0.0 0.0 secoisolariciresinol 1 3.1

GY L

R

Rh

R

10.6

20.6

5.3

8.9

0.0

6.3

0.0

0.0

8.8 14.3

6.3 32.5

14.1 18.5

4.4 15.9

0.0 7.1 0.0 8.8 0.0 0.0 0.0

17.0 6.2 0.0 1.5 0.0 0.0 0.0

0.0 0.0 0.0 0.0 0.0 0.0 0.0

1.6 0.0 0.0 3.2 0.0 0.0 0.0

0.0 0.0 0.0 1.8 0.0 3.5

7.7 0.0 0.0 0.0 0.0 0.0

0.0 0.0 0.0 0.0 0.0 0.0

9.5 0.0 0.0 0.0 3.2 0.0

0.0

0.0

0.0

0.0

0.0 0.0 0.0

3.1 0.0 1.5

0.0 0.0 0.0

0.0 0.0 0.0

0.0 21.2

0.0 37.0

6.7 6.7

0.0 17.4

7.1

3.1

0.0

3.2

8.8

6.2

0.0

7.9

3.5

0.0

0.0

0.0

0.0

1.5

3.3

0.0

0.0 0.0 10.6 0.0 0.0 0.0

0.0 3.1 13.9 12.3 0.0 6.2

0.0 0.0 1.7 0.0 3.3 3.3

0.0 0.0 0.0 0.0 3.2 9.5

0.0

0.0

3.3

0.0

0.0 3.5

1.5 0.0

16.7 1.7

0.0 6.3

7.1

0.0

3.3

1.6

7.1

6.2

0.0

1.6

0.0

0.0

0.0

0.0

3.5 0.0

0.0 3.1

0.0 0.0

11.1 0.0

77 dehydrogenase short-chain alcohol dehydrogenase short-chain dehydrogenase sinapyl alcohol dehydrogenase tropinone reductase Total BAHD Acyltransferases p-coumaroyl-CoA:shikimate pcoumaroyl transferase anthocyanin 5-aromatic acyltransferase 3'-N-debenzoyltaxol Nbenzoyltransferase taxadienol acetyl transferase benzoyl CoA benzyl alcohol benzoyl transferase anthranilate Nbenzoyltransferase protein EIG-I24 protein Nhydroxycinnamoyl/benzoyltransf erase putative alcohol acetyltransferase acetylglucosamine acyltransferase lecithine cholesterol acyltransferase glyoxysomal beta-ketoacylthiolase similarity to lysophosphatidic acid acyltransferase putative mono-or diacylglycerol acyltransferase glycerol-3-phosphate acyltransferase diacylglycerol acyltransferase glucose acyltransferase 10-deacetylbaccatin III-10-Oacetyl transferase fatty acid elongase taxadien-5-alpha-ol Oacetyltransferase Total 2-Oxoglutarate-Dependent Dioxygenases 4,5-DOPA dioxygenase extradiol lipoxygenase acireductone dioxygenase putative lipoxygenase 9-cis-epoxy-carotenoid dioxygenase 4-hydroxyphenylpyruvate dioxygenas (4HPPD) β-carotene dioxygenase leucoanthocyanidin dioxygenase ethylene-forming-enzyme-like dioxygenase 1-aminocyclopropane-1carboxylate oxidase 2-oxoglutarate-dependent dioxygenase (2ODD)

36

208.5

28.1

14.6

17.6

19.6

24.7

54.0

18.3

31.6

4 5 2 129

13.3 32.5 4.5 600.9

8.8 0.0 0.0 77.2

0.0 4.4 2.9 49.8

0.0 0.0 0.0 57.7

1.4 0.0 0.0 65.8

0.0 12.3 0.0 88.2

0.0 3.1 0.0 114.1

0.0 1.7 0.0 56.6

3.2 11.1 1.6 91.6

28

91.4

3.5

17.8

9.6

10.0

6.3

34.0

8.4

1.8

7

27.9

0.0

11.4

0.0

0.0

3.5

9.5

3.5

0.0

18

61.5

6.7

8.5

14.5

3.1

3.5

14.3

3.5

7.4

8

30.6

3.4

0.0

9.6

0.0

0.0

15.8

1.8

0.0

4

14.5

0

0

0

4.6

0

6.3

0

0

1

3.4

3.4

0.0

0.0

0.0

0.0

0.0

0.0

0.0

1

3.5

0.0

0.0

0.0

0.0

0.0

0.0

3.5

0.0

9

32.2

3.4

5.7

6.4

7.7

0.0

1.6

0.0

7.4

2

8.6

0.0

0.0

0.0

1.5

0.0

0.0

7.0

0.0

6

21.5

1.7

0.0

3.2

0.0

7.1

9.5

0.0

0.0

2

8.6

0.0

0.0

0.0

1.5

0.0

0.0

7.0

0.0

8

35.3

3.4

0

0.0

13.9

3.5

1.6

7.0

5.9

3

6.3

0.0

0.0

0.0

3.1

0.0

3.2

0.0

0.0

2

8.8

0.0

5.7

0.0

3.1

0.0

0.0

0.0

0.0

7

20.4

1.7

2.8

3.2

6.2

1.8

4.8

0.0

0.0

1 14

3.2 38.6

0.0 5.0

0.0 5.7

0.0 8.0

0.0 3.1

0.0 5.3

3.2 3.2

0.0 5.3

0.0 3.0

4

19.2

0.0

0.0

8.0

3.1

3.5

1.6

0.0

3.0

2

6.5

0.0

0.0

0.0

0.0

3.5

0.0

0.0

3.0

1

3.5

0.0

0.0

0.0

0.0

3.5

0.0

0.0

0.0

128

445.5

32.2

57.6

62.5

60.9

41.5

108.6

46.6

31.5

7 30 8 9

41.5 115.5 36.6 22.2

3.4 11.8 5.0 1.7

2.8 5.7 1.4 5.7

9.6 14.4 8.0 4.8

3.1 26.3 18.6 1.5

10.6 12.4 0.0 1.8

1.6 9.5 0.0 3.2

0.0 8.8 3.5 3.5

10.4 26.7 0.0 0.0

4

18.7

0.0

4.3

1.6

0.0

0.0

11.1

1.8

0.0

2

4.4

0.0

2.8

0.0

0.0

0.0

1.6

0.0

0.0

4 6

13.5 28.0

6.7 0.0

1.4 5.7

0.0 0.0

0.0 3.1

5.3 0.0

0.0 14.3

0.0 3.5

0.0 1.5

5

17.6

0.0

8.5

0.0

0.0

0.0

3.2

0.0

5.9

21

135.6

6.7

21.3

6.4

1.5

10.6

76.0

7.0

5.9

3

9.8

3.4

0.0

6.4

0.0

0.0

0.0

0.0

0.0

78 putative dioxygenase 20 71.2 5.0 5.7 8.0 23.2 5.3 43.7 65.3 59.2 77.3 46 Total 119 514.6 SABATH Carboxylmethyltransferases salicylic acid 6.7 2.8 0.0 0.0 0.0 4 17.1 carboxylmethyltransferase carboxyl methyltransferase 2 6.2 0.0 0.0 0.0 6.2 0.0 Total 6 23.3 6.7 2.8 0 6.2 0 Small Molecule Omethyltransferases SMOMTs orcinol O-methyltransferase 8 31.1 5.0 4.3 8.0 0.0 0.0 3.4 0.0 8.0 3.1 0.0 caffeic acid O-methyltransferase 6 20.5 1.7 0.0 0.0 0.0 0.0 flavonoid O-methyltransferase 2 3.3 caffeoyl-CoA O14.1 0.0 14.4 7.1 4.8 6 52.4 methyltransferase 24.2 4.3 30.4 10.2 4.8 Total 22 107.3 * Total normalized EST number (TEN) was determined as described for Supplemental Table S3.

9.5 130

7.0 35.1

7.4 57.8

1.6

0.0

5.9

0.0 1.6

0.0 0

0.0 5.9

6.3 1.6 1.6

0.0 0.0 0.0

7.4 4.4 0.0

1.5

3.4

7.1

11.0

3.4

18.9

Supplemental Table S6: Normalized EST expression levels for cytochrome P450 monooxygenases*

| Potential function (based on best hit in UniProt) | From tree | Contigs | ESTs | T3C Rh | T3C L | GW Rh | GW L | GW R | GY Rh | GY L | GY R |
|---|---|---|---|---|---|---|---|---|---|---|---|
| allene oxide synthase | CYP74A group1 | 10 | 41 | 0 | 2.9 | 3.2 | 9.8 | 0 | 20 | 5 | 0 |
| | CYP74A group2 | 5 | 17 | 0 | 0 | 1.6 | 1.4 | 0 | 12 | 1.7 | 0 |
| fatty acid hydroperoxide lyase | CYP74F | 1 | 3.1 | 0 | 0 | 0 | 0 | 0 | 3.1 | 0 | 0 |
| ω-hydroxylase for fatty acids# | CYP86A | 1 | 3.5 | 0 | 0 | 0 | 0 | 3.5 | 0 | 0 | 0 |
| obtusifoliol-14-demethylase | CYP51 | 1 | 1.6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.6 |
| | CYP51G | 7 | 24.2 | 1.8 | 0 | 4.8 | 0 | 0 | 3.1 | 6.7 | 7.9 |
| berbamunine synthase | CYP80 | 1 | 1.4 | 0 | 0 | 0 | 1.4 | 0 | 0 | 0 | 0 |
| carotenoid e-ring hydroxylation | CYP97A | 1 | 1.6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.6 |
| | CYP97B | 1 | 1.5 | 0 | 1.5 | 0 | 0 | 0 | 0 | 0 | 0 |
| N-demethylase or ring-methyl hydroxylase | CYP71A group1 | 2 | 4.5 | 0 | 2.9 | 1.6 | 0 | 0 | 0 | 0 | 0 |
| flavonoid 3'-hydroxylase | CYP75B | 3 | 9.4 | 0 | 1.5 | 6.4 | 0 | 0 | 1.5 | 0 | 0 |
| flavonoid 3' 5'-hydroxylase | CYP706C | 1 | 2.8 | 0 | 0 | 0 | 2.8 | 0 | 0 | 0 | 0 |
| | CYP75A | 2 | 6.7 | 3.5 | 0 | 3.2 | 0 | 0 | 0 | 0 | 0 |
| cinnamate-4-hydroxylase | CYP73A | 5 | 29.2 | 0 | 2.9 | 0 | 5.6 | 3.5 | 9.3 | 0 | 7.9 |
| p-coumaroyl shikimate 3'-hydroxylase | CYP98A | 6 | 28.6 | 0 | 1.5 | 0 | 5.6 | 7.1 | 1.5 | 6.7 | 6.3 |
| hydroxylation of indole to benzoxazinones (Bxs) | CYP71C | 8 | 28.9 | 0 | 7.3 | 3.2 | 8.4 | 3.5 | 0 | 3.3 | 3.2 |
| 8'-hydroxylase for ABA | CYP707A | 3 | 9.4 | 0 | 5.9 | 0 | 0 | 3.5 | 0 | 0 | 0 |
| limonene hydroxylase | CYP71D | 13 | 80.7 | 7 | 10 | 6.4 | 2.8 | 8.8 | 28 | 5 | 13 |
| ent-kaurene oxidase | CYP701A | 2 | 5 | 1.8 | 0 | 3.2 | 0 | 0 | 0 | 0 | 0 |
| ent-kaurenoic acid oxidase | CYP88A | 1 | 2.8 | 0 | 0 | 0 | 2.8 | 0 | 0 | 0 | 0 |
| unknown | CYP703A | 5 | 22.9 | 11 | 2.9 | 6.4 | 0 | 0 | 3.1 | 0 | 0 |
| | CYP704A | 4 | 9.9 | 3.5 | 0 | 4.8 | 0 | 0 | 0 | 0 | 1.6 |
| | CYP71 group1† | 7 | 25.9 | 7 | 0 | 4.8 | 0 | 0 | 7.7 | 0 | 6.3 |
| | CYP71 group2 | 1 | 3.5 | 0 | 0 | 0 | 0 | 3.5 | 0 | 0 | 0 |
| | CYP714B | 1 | 3.2 | 0 | 0 | 3.2 | 0 | 0 | 0 | 0 | 0 |
| | CYP714C | 1 | 3.1 | 0 | 0 | 0 | 0 | 0 | 3.1 | 0 | 0 |
| | CYP71A group2 | 1 | 3.2 | 0 | 0 | 3.2 | 0 | 0 | 0 | 0 | 0 |
| | CYP71G | 3 | 6.3 | 0 | 1.5 | 0 | 0 | 0 | 1.5 | 3.3 | 0 |
| | CYP71J | 3 | 9.2 | 0 | 0 | 0 | 1.4 | 0 | 6.2 | 1.7 | 0 |
| | CYP71P | 2 | 3.2 | 1.8 | 0 | 0 | 1.4 | 0 | 0 | 0 | 0 |
| | CYP71R | 3 | 4.5 | 0 | 1.5 | 0 | 1.4 | 0 | 0 | 1.7 | 0 |
| | CYP71T | 4 | 8 | 0 | 2.9 | 0 | 0 | 3.5 | 1.5 | 0 | 0 |
| | CYP722A | 1 | 1.8 | 1.8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | CYP72A | 32 | 119 | 8.8 | 13 | 1.6 | 32 | 8.8 | 6.2 | 20 | 28 |
| | CYP734A | 1 | 1.6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.6 |
| | CYP77A | 1 | 2.8 | 0 | 0 | 0 | 2.8 | 0 | 0 | 0 | 0 |
| | CYP78A | 1 | 1.6 | 0 | 0 | 1.6 | 0 | 0 | 0 | 0 | 0 |
| | CYP89B group1 | 3 | 7.9 | 0 | 1.5 | 0 | 0 | 0 | 3.1 | 3.3 | 0 |
| | CYP89B group2 | 4 | 12.2 | 0 | 2.9 | 0 | 2.8 | 0 | 0 | 3.3 | 3.2 |
| | CYP92A | 7 | 23.7 | 0 | 0 | 0 | 13 | 0 | 0 | 0 | 11 |
| | CYP94B | 3 | 11.7 | 0 | 0 | 1.6 | 0 | 7.1 | 3.1 | 0 | 0 |
| | CYP94C | 2 | 9.3 | 0 | 0 | 0 | 0 | 0 | 9.3 | 0 | 0 |
| | CYP94D | 2 | 5.7 | 0 | 2.9 | 0 | 2.8 | 0 | 0 | 0 | 0 |
| | CYP96B group1 | 4 | 15 | 8.8 | 2.9 | 0 | 0 | 0 | 0 | 3.3 | 0 |
| Total | | 170 | 618 | 56 | 69 | 61 | 98 | 53 | 123 | 65 | 93 |

* Total EST number (TEN) was determined as described for Supplemental Table S3.
# ω-hydroxylase for fatty acids: ω-hydroxylase for saturated and unsaturated C12 to C18 fatty acids.
† The CYP71 group1 clade is adjacent to the CYP99A clade, while the CYP71 group2 clade is adjacent to the CYP71K, X, and Y clades (as illustrated in Supplemental Figure S4).

Supplemental Table S7: Normalized percentage of ArREST ESTs with GO associations

Normalized EST abundance (%)

| GO | Description | ArREST ESTs | Rhizome ESTs | Rhizome enriched |
|---|---|---|---|---|
| 0006464 | protein modification | 10.42 | 11.97 | 24.72 |
| 0006350 | transcription | 8.51 | 9.16 | 12.19 |
| 0006412 | protein biosynthesis | 6.72 | 7.06 | 9.34 |
| 0006810 | transport | 9.15 | 9.10 | 5.83 |
| 0008152 | metabolism | 9.02 | 8.29 | 7.40 |
| 0009987 | cellular process | 7.96 | 8.41 | 7.08 |
| 0006118 | electron transport | 5.64 | 5.47 | 6.95 |
| 0000004 | biological process unknown | 5.31 | 5.40 | 1.51 |
| 0009058 | biosynthesis | 4.16 | 4.32 | 4.13 |
| 0019538 | protein metabolism | 3.93 | 4.29 | 3.71 |
| 0005975 | carbohydrate metabolism | 3.55 | 3.66 | 2.57 |
| 0015979 | photosynthesis | 3.46 | 0.71 | 0.18 |
| 0016043 | cell organization and biogenesis | 2.91 | 3.32 | 2.90 |
| 0006139 | nucleotide and nucleic acid metabolism | 2.50 | 2.77 | 1.08 |
| 0006091 | generation of precursor metabolites and energy | 2.72 | 1.27 | 0.48 |
| 0006519 | amino acid and derivative metabolism | 2.09 | 1.98 | 0.76 |
| 0007165 | signal transduction | 1.60 | 2.09 | 1.74 |
| 0006950 | response to stress | 1.86 | 1.93 | 1.15 |
| 0006629 | lipid metabolism | 1.43 | 1.57 | 1.61 |
| 0009056 | catabolism | 1.54 | 1.49 | 0.80 |
| 0009628 | response to abiotic stimulus | 1.45 | 1.44 | 1.21 |
| 0006259 | DNA metabolism | 0.99 | 1.05 | 0.80 |
| 0007582 | physiological process | 0.55 | 0.51 | 0.17 |
| 0007049 | cell cycle | 0.32 | 0.44 | 0.05 |
| 0019748 | secondary metabolism | 0.28 | 0.26 | 0.42 |
| 0009607 | response to biotic stimulus | 0.36 | 0.39 | 0.28 |
| 0008219 | cell death | 0.24 | 0.32 | 0.15 |
| 0009719 | response to endogenous stimulus | 0.26 | 0.31 | 0.12 |
| 0008150 | biological_process | 0.22 | 0.25 | 0.30 |
| 0007275 | development | 0.25 | 0.24 | 0.00 |
| 0019725 | cell homeostasis | 0.23 | 0.22 | 0.06 |
| 0030154 | cell differentiation | 0.05 | 0.05 | 0.18 |
| 0009605 | response to external stimulus | 0.09 | 0.08 | 0.00 |
| 0000003 | reproduction | 0.04 | 0.04 | 0.04 |
| 0009790 | embryonic development | 0.03 | 0.04 | 0.04 |
| 0009835 | ripening | 0.04 | 0.02 | 0.04 |
| 0009653 | morphogenesis | 0.02 | 0.01 | 0.00 |
| 0009991 | response to extracellular stimulus | 0.01 | 0.02 | 0.00 |
| 0009606 | tropism | 0.02 | 0.01 | 0.00 |
| 0009791 | post-embryonic development | 0.01 | 0.02 | 0.00 |
| 0040029 | regulation of gene expression, epigenetic | 0.01 | 0.01 | 0.00 |
| 0009908 | flower development | 0.01 | 0.01 | 0.00 |
| 0016049 | cell growth | 0.01 | 0.01 | 0.00 |
| 0009856 | pollination | 0.00 | 0.01 | 0.00 |
| 0007267 | cell-cell signaling | 0.00 | 0.00 | 0.00 |

Normalized EST abundance = (Σ GO contribution)/(total # GO classified ESTs in dataset). GO contribution = (# ESTs comprising a particular contig)/(# GO identifiers for a particular contig).
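The two formulas in the note to Supplemental Table S7 can be sketched in code as follows. This is an illustration under stated assumptions rather than the pipeline used here: the function name, the input layout (one `(est_count, go_ids)` pair per contig), and the example GO labels are hypothetical, and contigs without any GO identifier are assumed to be excluded from the denominator.

```python
from collections import defaultdict


def normalized_go_abundance(contigs):
    """contigs: iterable of (est_count, go_ids) pairs.
    Returns {go_id: normalized EST abundance in percent}."""
    contribution = defaultdict(float)
    classified_ests = 0
    for est_count, go_ids in contigs:
        unique_ids = set(go_ids)
        if not unique_ids:
            continue  # assumption: unannotated contigs are not "GO classified"
        classified_ests += est_count
        # GO contribution: the contig's ESTs split evenly over its GO identifiers
        share = est_count / len(unique_ids)
        for go_id in unique_ids:
            contribution[go_id] += share
    if classified_ests == 0:
        return {}
    return {go: 100.0 * c / classified_ests for go, c in contribution.items()}


# Hypothetical input: a 10-EST contig with two GO terms, a 5-EST contig
# with one term, and a 5-EST contig with no annotation.
example = [(10, ["GO:A", "GO:B"]), (5, ["GO:A"]), (5, [])]
abundance = normalized_go_abundance(example)
```

Splitting each contig's ESTs across its GO identifiers keeps the category percentages summing to 100 even when contigs carry several annotations.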

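Supplemental Table S8: Probable transcriptional regulator classes within ArREST associated with GO:0003677

| TYPE | ArRESTs Contigs | ArRESTs ESTs | ZO_CL_Rh Contigs | ZO_CL_Rh ESTs | CL_Eb Contigs | CL_Eb ESTs | ZO_Ec_Ed Contigs | ZO_Ec_Ed ESTs | ZO_Ec Contigs | ZO_Ec ESTs | ZO_Ed Contigs | ZO_Ed ESTs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARF | 21 | 42 | 8 | 15 | 5 | 10 | 3 | 5 | 2 | 3 | 1 | 2 |
| B3 | 13 | 23 | 6 | 9 | 4 | 5 | 2 | 4 | 0 | 0 | 2 | 4 |
| bHLH | 12 | 33 | 8 | 17 | 3 | 9 | 5 | 8 | 4 | 6 | 1 | 2 |
| bZIP | 84 | 150 | 41 | 58 | 12 | 18 | 29 | 40 | 19 | 25 | 10 | 15 |
| CONSTANS | 46 | 128 | 19 | 32 | 1 | 2 | 18 | 30 | 7 | 10 | 11 | 20 |
| ERF | 21 | 61 | 14 | 24 | 3 | 6 | 11 | 18 | 7 | 10 | 4 | 8 |
| HD_ZIP | 40 | 95 | 20 | 37 | 5 | 12 | 15 | 25 | 5 | 8 | 10 | 17 |
| GRAS | 2 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| HOMEOBOX | 74 | 117 | 31 | 39 | 7 | 9 | 24 | 30 | 12 | 13 | 12 | 17 |
| HSF | 24 | 41 | 11 | 15 | 4 | 5 | 7 | 10 | 4 | 5 | 3 | 5 |
| MADS | 8 | 16 | 6 | 10 | 1 | 2 | 5 | 8 | 3 | 5 | 2 | 3 |
| MYB | 151 | 346 | 80 | 139 | 33 | 61 | 47 | 78 | 23 | 36 | 24 | 42 |
| NAC | 86 | 262 | 45 | 78 | 9 | 21 | 36 | 57 | 14 | 20 | 22 | 37 |
| SAND | 1 | 2 | 1 | 2 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| WD40 | 3 | 7 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |
| WRKY | 96 | 241 | 55 | 110 | 23 | 52 | 32 | 58 | 16 | 29 | 16 | 29 |
| Znf_AN1 | 29 | 116 | 22 | 50 | 6 | 13 | 16 | 37 | 14 | 31 | 2 | 6 |
| Znf_BED | 2 | 11 | 1 | 2 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| Znf_C2H2 | 8 | 23 | 4 | 9 | 0 | 0 | 4 | 9 | 3 | 7 | 1 | 2 |
| Znf_C5HC2 | 3 | 3 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Znf_CCCH | 18 | 30 | 10 | 14 | 4 | 7 | 6 | 7 | 5 | 6 | 1 | 1 |
| Znf_CW | 3 | 6 | 1 | 2 | 0 | 0 | 1 | 2 | 1 | 2 | 0 | 0 |
| Znf_DOF | 12 | 27 | 7 | 11 | 3 | 5 | 4 | 6 | 4 | 6 | 0 | 0 |
| Znf_GATA | 34 | 74 | 20 | 39 | 4 | 10 | 16 | 29 | 8 | 16 | 8 | 13 |
| Znf_PHD | 7 | 14 | 4 | 5 | 0 | 0 | 4 | 5 | 2 | 2 | 2 | 3 |
| Znf_RING | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Znf_SBP | 19 | 36 | 10 | 13 | 1 | 1 | 9 | 12 | 6 | 9 | 3 | 3 |
| Total | 819 | 1909 | 426 | 732 | 131 | 253 | 295 | 479 | 160 | 250 | 135 | 229 |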

82 GenBank Accessions for Rhizome-specific Transcripts from Sorghum GI61115604 - GI61115077, GI61099298 - GI61099027, GI56932432 GI54413984, GI54413982, GI54413980, GI54413965, GI54413963, GI54413961, GI54413959, GI54413957, GI54413955, GI54413953, GI54413951, GI54413949, GI54413946, GI54413944, GI54413942, GI54413940, GI54413938, GI54413936, GI54413934, GI54413932, GI54413930, GI54413928, GI54413926, GI54413924, GI54413922, GI54413920, GI54413918, GI54413916, GI54413914, GI54413912, GI54413910, GI54413908, GI54413906, GI54413904, GI54413902, GI54413900, GI54413898, GI54413896, GI54413894, GI54413892, GI54413890, GI54413888, GI54413886, GI54413884, GI54413882, GI54413880, GI54413878, GI54413876, GI54413874, GI54413872, GI54413870, GI54413868, GI54413865, GI54413863, GI54413861, GI54413859, GI54413857, GI54413855, GI54413853, GI54413851, GI54413849, GI54413847, GI54413845, GI54413844, GI54413842, GI54413840, GI54413839, GI54413837, GI54413835, GI54413833, GI54413831, GI54413829, GI54413827, GI54413825, GI54413823, GI54413821, GI54413818, GI54413815, GI54413813, GI54413811, GI54413809, GI54413807, GI54413805, GI54413803, GI54413801, GI54413795, GI31072290 – GI31072275, GI21998059 – GI21788249.

[Supplemental Figure S1: histogram. Y-axis: contig number (ticks 0 to 4000). X-axis: sequence length (nt), in bins 151-200, 201-300, 301-400, 401-500, 501-600, 601-700, 701-800, 801-900, 901-1000, 1001-1100, 1101-1200, 1201-1300, 1301-1400, 1401-1500, and >1500.]

Supplemental Figure S1. The length of ESTs. The average EST sequence length was 817 bp. Contig lengths ranged from 151 to 4021 bp; the greatest number of contigs had between 701 and 800 bp, and 95% exceeded 300 bp.
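The length bins on the x-axis of Supplemental Figure S1 can be reproduced with a small lookup. The sketch below is illustrative only: the helper name `bin_label` is hypothetical, and it assumes that bin boundaries are inclusive on the upper end, as the axis labels suggest.

```python
import bisect

# Inclusive upper bounds of the plotted bins; anything longer falls in ">1500".
UPPER_BOUNDS = [200, 300, 400, 500, 600, 700, 800, 900, 1000,
                1100, 1200, 1300, 1400, 1500]
LABELS = (["151-200"]
          + [f"{lo + 1}-{lo + 100}" for lo in range(200, 1500, 100)]
          + [">1500"])


def bin_label(length_nt):
    """Return the histogram bin label for a sequence length in nucleotides."""
    if length_nt < 151:
        raise ValueError("lengths below 151 nt are outside the plotted range")
    # bisect_left finds the first upper bound that is >= length_nt
    return LABELS[bisect.bisect_left(UPPER_BOUNDS, length_nt)]
```

A tally such as `collections.Counter(bin_label(n) for n in lengths)` over a list of contig lengths then gives the per-bin contig numbers that the histogram plots.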

[Supplemental Figure S2: horizontal bar chart. X-axis: EST abundance (%), ticks 0 to 15. Legend: CL_Eb (T3CL), CL_Ea (T3CRh), ZO_Ef (GYR), ZO_Ee (GWR), ZO_Ed (GYRh), ZO_Ec (GWRh), ZO_Ea (GYL), ZO_Eg (GWL). GO categories shown: development; cell death; secondary metabolism; DNA metabolism; response to endogenous stimulus; signal transduction; cell organization and biogenesis; cell homeostasis; physiological process; cell cycle; nucleobase, nucleoside, nucleotide and nucleic acid metabolism; transport; protein modification; biological_process; biosynthesis; amino acid and derivative metabolism; biological process unknown; electron transport; protein metabolism; lipid metabolism; response to biotic stimulus; catabolism; carbohydrate metabolism; photosynthesis; generation of precursor metabolites and energy; protein biosynthesis; cellular process; response to abiotic stimulus; metabolism; response to stress; transcription.]

Supplemental Figure S2. Gene Ontology (GO) annotations of ArREST contigs. 87.6% of the ArREST contigs possessed GO annotations. Eight GO categories had EST abundances greater than 5%, including protein modification (10.5%), transport (9.2%), metabolism (9.0%), transcription (8.6%), cellular process (8.0%), protein biosynthesis (6.9%), electron transport (5.6%), and biological process unknown (5.3%). In contrast, the GO category secondary metabolism contained relatively few ESTs (0.3%).

85

[Figure S3 image not reproduced: metabolic map leading from sucrose through central carbon metabolism, the shikimate and phenylpropanoid pathways, one-carbon and SAM metabolism, and the MVA and DXP isoprenoid pathways to 6-gingerol, bisdemethoxycurcumin, demethoxycurcumin, curcumin, 1,8-cineole, α-terpineol and germacrene D. Enzyme labels recoverable from the figure, with the percentage of ArREST contigs where given: Sucrose Synthase (0.32%); Glucose-6-Phosphate Dehydrogenase (0.04%); 6-Phosphogluconate Dehydrogenase (0.04%); Transketolase (0.07%); Transaldolase (0.07%); Ribulose-5-Phosphate Epimerase (0.13%); Ribulose-5-Phosphate Isomerase; Aldolase (0.65%); Pyrophosphate Fructose-6-Phosphate 1-Phosphorylase (0.05%); TPI; Glyceraldehyde 3-Phosphate Dehydrogenase (0.55%); Phosphoglycerate Kinase (0.13%); PG Mutase (0.01%); Enolase (0.03%); Pyruvate Kinase (0.12%); Pyruvate Dehydrogenase (0.08%); Adenylate Kinase (0.11%); Adenosine Kinase (0.08%); IMP Kinase; 3-Phosphoglycerate Dehydrogenase (0.12%); Aminotransferase (0.03%); Aminotransferase (0.07%); Phosphoserine Phosphatase (0.15%); Hydroxymethyl Transferase (0.17%); N5,N10-Methylene THF Reductase (0.03%); Methionine Synthase (0.22%); SAM Synthase (1.00%); S-Adenosylhomocysteine Hydrolase (0.20%); GS (0.24%); GOGAT (0.06%); DAHP Synthase (0.07%); Dehydroquinate Synthase (0.02%); 3-Dehydroquinate Dehydratase (0.24%); 3-Dehydroshikimate Reductase (0.24%); Shikimate Kinase (0.02%); EPSP Synthase (0.02%); Chorismate Synthase (0.01%); Chorismate Mutase; Arogenate Dehydratase; PAL (0.11%); C4H (0.58%); 4CL (0.28%); CST (0.18%); C3’H (0.29%); CCOMT (0.10%); Thiolase and HMG-CoA Synthase (0.48%); HMG-CoA Reductase (0.01%); Mevalonate 5-Phosphotransferase (0.003%); Phosphomevalonate Kinase; Pyrophosphomevalonate Decarboxylase; DXP Synthase (0.15%); DXP Reductase (0.06%); IPP Isomerase (0.05%); Geranyl Pyrophosphate Synthase (0.18%); Farnesyl Pyrophosphate Synthase (0.18%); 1,8-Cineole Synthase (0.24%); Germacrene D Synthase (0.27%). Steps shown in brackets in the figure: [OMT], [Hydroxylase], [Polyketide Synthase(s)], [Reductase(s)].]

Supplemental Figure S3. Proposed metabolic map showing how the curcuminoids, gingerols and terpenoids are produced from sucrose in a large interconnected network. Names of enzymes identified in the ArREST database are colored blue; % values indicate fraction of contigs in the database that are represented by genes encoding each protein.

[Figure S4 image not reproduced: neighbor-joining tree of plant P450 monooxygenases. Clade labels recoverable from this part of the figure: CYP71 (subclades CYP71A groups 1 and 2, CYP71R, CYP71T, CYP71AD, CYP71C (CYP71AE), CYP71J, CYP71P, CYP71D, CYP71AF, CYP71B and CYP71G, plus CYP71 groups 1 and 2), CYP83A, CYP75 (CYP75A, CYP75B), CYP92 (CYP92A, CYP92B, CYP92C), CYP703A, CYP72A (CYP72C), CYP714 (CYP714B, CYP714C, CYP714D), CYP94 (CYP94B, CYP94C, CYP94D), CYP704 (CYP704A), CYP86 (CYP86A), CYP96B groups 1 and 2, CYP97 (CYP97A, CYP97B, CYP97C), CYP74 (CYP74A groups 1 and 2, CYP74F), CYP51 (CYP51G), CYP88A, CYP707A, CYP722 (CYP722A), CYP718, CYP727A, CYP77 (CYP77A), CYP89 (CYP89B groups 1 and 2, CYP89F; CYP723A) and CYP701A. Individual leaf labels (contig identifiers and reference P450 names) are not reproduced.]

Supplemental Figure S4. Phylogenetic tree of P450 monooxygenases. A neighbor-joining tree was generated with 1136 P450s, including 170 ginger and turmeric P450s from EST data, 247 Arabidopsis P450s, 350 rice P450s and 369 other plant P450s, and classified according to clades.

[Figure S4, continued (image not reproduced). Clade labels recoverable from this part of the figure: CYP73A (cinnamate 4-hydroxylases), CYP78, CYP98A (p-coumaroyl shikimate 3'-hydroxylases), CYP76, CYP80, CYP706 (CYP706A, CYP706B, CYP706C), CYP81, CYP82, CYP93 and CYP705. Individual leaf labels are not reproduced.]


[Figure S5 image not reproduced: pathway scheme from L-Phe via cinnamic acid, p-coumaric acid, p-coumaroyl-CoA, caffeoyl 5-O-shikimate, caffeoyl-CoA and feruloyl-CoA to bisdemethoxycurcumin, demethoxycurcumin, curcumin and 6-gingerol, with malonyl-CoA and hexanyl-CoA as co-substrates and shikimic acid and SAM/SAHC as cofactors. Step labels recoverable from the figure: PAL, C4H, 4CL, CST, CST+CS3'H, CCOMT, Polyketide Synthase, Reductase(s), Hydroxylase and OMT. The scheme carries the route labels A, B and C and the carbon position numbers 1 to 9.]

Supplemental Figure S5. Proposed biosynthetic pathway from L-Phe to diarylheptanoids and gingerol-related compounds in ginger and turmeric. Enzymes are as follows: PAL = phenylalanine ammonia lyase; C4H = cinnamate 4-hydroxylase; 4CL = 4-coumarate:CoA ligase; CST = p-coumaroyl shikimate transferase; C3´H = p-coumaroyl 5-O-shikimate 3´-hydroxylase; OMT = O-methyltransferase; CCOMT = caffeoyl-CoA O-methyltransferase; SAMS = S-adenosylmethionine synthetase; SAHC = S-adenosylhomocysteine. All conversions have been demonstrated in other species, except for those catalyzed by the polyketide synthases, the reductases, and the hydroxylases and OMTs that would convert bisdemethoxycurcumin via demethoxycurcumin to curcumin (indicated by dashed arrows).


[Figure S6 image not reproduced: combined plot of cineole GC/MS peak area (axis from 0.0E+00 to 3.0E+08) and cineole synthase microarray spot intensity (axis from 0 to 5000).]

Supplemental Figure S6. Comparison of metabolite profiling by GC/MS with gene expression profiling by microarray. 1,8-Cineole contents in either turmeric or ginger are shown as a bar graph with error bars (standard error). Gene expression levels from normalized microarray spot intensities are shown as a scatter plot with error bars (standard error). The ESTs for 1,8-cineole synthase originate from yellow and white ginger rhizomes. The metabolite profile of 1,8-cineole and the gene expression profile of 1,8-cineole synthase match very well. In this case, all three data sets (EST data, metabolite data and microarray data) strongly suggest that 1,8-cineole synthase is responsible for the major production of 1,8-cineole in ginger rhizome.


[Figure S7 image not reproduced: bar chart of the percentage (%) of ESTs in each GO category for three sets (ArREST ESTs, rhizome ESTs, and rhizome-enriched ESTs). Categories shown: protein modification; transcription; protein biosynthesis; transport; metabolism; cellular process; electron transport; biological process unknown; biosynthesis; protein metabolism; carbohydrate metabolism; photosynthesis; cell organization and biogenesis; nucleotide and nucleic acid metabolism; generation of precursor metabolites and energy; amino acid and derivative metabolism; signal transduction; response to stress; lipid metabolism; catabolism; response to abiotic stimulus; DNA metabolism; physiological process; cell cycle; secondary metabolism; response to biotic stimulus.]

Supplemental Figure S7. Overall gene expression in rhizomes is similar to but distinct from that observed for other plant tissues. ESTs found exclusively in ginger or turmeric rhizomes, ESTs with shared expression in ginger or turmeric rhizomes and at least one other ginger or turmeric tissue, and total ArREST ESTs are represented as black, white, and grey bars, respectively. Values used to generate this graph are presented in Supplemental Table S7.


APPENDIX B - CLONING AND CHARACTERIZATION OF SEVERAL TERPENE SYNTHASES EXPLAIN TERPENOID PRODUCTION IN GINGER AND TURMERIC TISSUES

The manuscript “Cloning and characterization of several terpene synthases explain terpenoid production in ginger and turmeric tissues” is currently being edited for submission.


Cloning and characterization of several terpene synthases explain terpenoid production in ginger and turmeric tissues

Hyun Jo Koo (a, b), David R. Gang (a, b, *)

(a) Department of Plant Sciences, College of Agriculture and Life Sciences, University of Arizona, Tucson, AZ 85721, USA
(b) BIO5 Institute, University of Arizona, Tucson, AZ 85721, USA

(*) Corresponding author: David R. Gang, Department of Plant Sciences and BIO5 Institute, University of Arizona, Tucson, AZ 85721-0036, USA. Tel: 520-621-7154; Fax: 520-621-7186; email: [email protected]


Abstract

The essential oils of ginger (Zingiber officinale, Rosc.) and turmeric (Curcuma longa L.) rhizomes, which contain a large variety of terpenoids, have been tested for medically related effects, and some of these terpenoids are known to act as anticancer, antiulcer, and antioxidant agents. Despite their importance, as far as we know, only two terpene synthases have been verified from the Zingiberaceae family: (+)-germacrene D synthase from ginger (Zingiber officinale, Rosc.) rhizome [1] and α-humulene synthase from shampoo ginger (Zingiber zerumbet Smith) rhizome [2]. Here, we report 25 mono- and 16 sesquiterpene synthase sequences cloned from ginger and turmeric, and identify the functions of 13 mono- and 11 sesquiterpene synthases, including paralogs. Studies on the modeled structures of these paralogs suggest an evolutionary pattern of terpene synthases in ginger and turmeric. We also suggest that turmerone and curlone are in fact α-turmerone and β-turmerone, respectively, based on ginger and turmeric metabolite data, the function of α-zingiberene/β-sesquiphellandrene synthase, and reanalysis of earlier data regarding the identification of these compounds. Finally, we identify a putative α-zingiberene/β-sesquiphellandrene oxidase, which we propose to be involved in production of the most important and abundant sesquiterpenoids in turmeric, α-turmerone and β-turmerone.


Introduction

Ginger (Zingiber officinale, Rosc.) and turmeric (Curcuma longa L.) are tropical and sub-tropical perennial plants whose rhizomes have been used for both culinary and medicinal purposes in different societies for thousands of years. Both belong to the Zingiberaceae, the ginger family, and are normally propagated by rhizomes. Ginger has long been used to treat a variety of human ailments, including common colds, fever, rheumatic disorders, gastrointestinal complications, motion sickness, diabetes, and cancer [3], and it also has anti-bacterial [4] and anti-fungal [5] activities. Many of these medicinal activities, including the anti-cancer and anti-inflammatory activities [6], are believed to be due to the presence of active phenolic compounds such as gingerol, paradol and shogaol [7-9]. However, terpenoids in ginger are also reported to have important roles. β-Elemene arrests the cell cycle and induces apoptotic cell death in lung cancer cells [10], and elemene has been reported to be effective in the treatment of patients with chylothorax [11]. Zingiberene as well as 6-gingerol significantly inhibited gastric lesions [12], and further research revealed β-sesquiphellandrene, β-bisabolene, ar-curcumene and 6-shogaol as anti-ulcer active principles in ginger [13].

Turmeric has anti-inflammatory [14] and anti-cancer [15] properties, which are mainly due to curcumin, a diarylheptanoid compound. Turmeric oil, reported to contain ar-turmerone, turmerone and curlone, showed antioxidant effects, which may explain its antimutagenic action [16]. This turmeric oil also has anti-bacterial activity [17]. Both curcuminoids and sesquiterpenoids in turmeric exhibit hypoglycemic effects, with peroxisome proliferator-activated receptor-γ (PPAR-γ) activation as one of the mechanisms, and suppress an increase in blood glucose level in type 2 diabetic KK-Ay mice; the effect was additive or synergistic when curcuminoids and sesquiterpenoids were applied together [18]. Ar-turmerone from turmeric oil displays anti-tumorigenesis activity, inhibiting cell proliferation and activating ar-turmerone-mediated apoptotic protein in human lymphoma U937 cells [19]. It was also found that apoptosis was selectively induced by ar-turmerone in human leukemia Molt 4B and HL-60 cells, but not in human stomach cancer KATO III cells. Ar-turmerone also shows antiplatelet activities that can prevent and treat arterial thrombosis [21].

Ginger and turmeric contain a variety of specialized compounds, including polyketides and terpenoids, some of which are described above. Despite the important roles of these compounds, only two sesquiterpene synthases, (+)-germacrene D synthase from ginger (Zingiber officinale, Rosc.) rhizome [1] and α-humulene synthase from shampoo ginger (Zingiber zerumbet Smith) rhizome [2], have been verified in the Zingiberaceae family. Based on EST data from white ginger (rhizome, root and leaf), yellow ginger (rhizome, root and leaf) and turmeric (rhizome and leaf), released in the NCBI database, putative mono- and sesquiterpene synthases were selected, cloned, and expressed in E. coli or yeast, with GPP and FPP as substrates. Although many of these enzymes were insoluble, we could identify the functions of some of them and report those results here. Some of these are paralogs, and we also analyze why they produce different products even though their sequences and structures appear to be very similar according to protein structure modeling.

Both ginger and turmeric produce α-zingiberene and β-sesquiphellandrene. However, only turmeric synthesizes α-turmerone and β-turmerone (Fig. 1), which are also described as turmerone and curlone, respectively, in some papers [22]. Here we suggest that the major terpenoids produced by turmeric are indeed α-turmerone and β-turmerone, rather than turmerone and curlone.

Materials and methods

Plant material

Ginger (Zingiber officinale, Rosc.) and turmeric (Curcuma longa L.) plants grown in a greenhouse for 5 to 7 months were used. For ginger, two varieties (white ginger and yellow ginger) were used; these are distinct from the white and yellow gingers named for their white and yellow flowers, respectively. The white and yellow gingers used in this study are culinary ginger varieties that have green inflorescences and are morphologically very similar, except for a slight difference in rhizome color. The Hawaiian red turmeric (HRT) variety was used for cloning genes, and the fat mild orange (FMO) variety was used for the GC/MS total ion chromatogram shown in Fig. 1. It is hard to identify differences between the HRT and FMO GC/MS total ion chromatograms.


Cloning of full length cDNA Some contigs identified in the ginger and turmeric EST databases are homologous to TPS and appear to be full length, although others were incomplete. For those genes missing either/both 3’ or/and 5’ end sequences, the SMART RACE (Rapid Amplification of cDNA End) method (Clontech) was used to find the missing 3’ or/and 5’ end(s) except for the gene ST00. 5’ RACE ready cDNAs and 3’ RACE ready cDNAs were synthesized from 8 different total RNA (GW-Rh, GW-R, GW-L, GY-Rh, GY-R, GYL, T-Rh, T-L) extracted by RNeasy Plant Mini kit (Qiagen) using superscript III reverse transcriptase (Invitrogen) for 3’ RACE ready cDNAs and superscript II reverse transcriptase (Invitrogen) for 5’ RACE ready cDNAs respectively. 3’ RACE CDS (AAGCAGTGGTATCAACGCAGAGTAC(T)30VN), 5’ RACE CDS ((T)25VN), and SMART II™ A Oligonucleotide (AAGCAGTGGTATCAACGCAGAGTACGCGGG) were used for RACE ready cDNA synthesis according to manufacturer’s protocol. Amplifications of either the 3’ or 5’ end were carried out using Advantage 2 PCR kit (Clontech) with Universal Primer A Mix (UPM, Long: CTAATACGACTCACTATAGGGCAAGCAGTGGTATCAACGCAGAGT, Short: CTAATACGACTCACTATAGGGC) and gene specific primers for RACE (Supplementary Table S1). Using RACE product, a second round of PCR was done with gene specific nested primers for RACE (Supplementary Table S1) and either N-UP (5’AAGCAGTGGTATCAACGCAGAGT-3’), UP-M (5’CACTATAGGGCAAGCAGTGGT-3’), or UP-S (5’-CTAATACGACTCACTATAGGGC3’). PCR products of the expected estimated size were eluted using the MinElute PCR

97

purification kit (Qiagen) and inserted into pCR2.1-TOPO vector (Invitrogen). For ST00, we found 5’ end by a different method using ZO__Ed (T-Rh) cDNA library stock. Using Zd01L13RR, a ST00 specific primer, Zd01L13RR (Supplementary Table S1) and UPMT7 (5’CTAATACGACTCACTATAGGGCGTAATACGACTCACTATAGGGCGAATTG-3’), cloning vector specific primer, we amplified a ST00 specific fragment, which was subcloned as described above. After sequencing and confirmation of either 3’ and/or 5’ end(s) of sequence, full length cDNAs were amplified from either 3’ RACE ready cDNAs or 5’ RACE ready cDNAs using Pfu thermostable polymerase with two gene-specific primer sets (Supplementary Table S2) and were inserted into pCR2.1-TOPO vector and sub-cloned into the pCRT7/CT-TOPO vector (Invitrogen), the pEXP5/CT-TOPO vector (Invitrogen), the pET101/D-TOPO vector (Invitrogen) or the pH9GW vector for expression in E. coli. In this manner, full length mono- and sesquiterpene synthases were cloned into pCR2.1TOPO vector. Truncated monoterpene synthases (with plastidial peptides removed) were sub-cloned into expression vectors where full length sesquiterpene synthases were subcloned into expression vectors. For pH9GW-MT00, the PCR product produced using the 10N21GtwyAttF and 10N21G-AttR primers (Supplementary Table S2) was first cloned using the Gateway BP reaction by Gateway BP clonase II enzyme mix (Invitrogen) with pDONR207 vector (Invitrogen) to yield pENTR207-10N21 and then the pENTR20710N21-based Gateway LR reaction was performed using Gateway LR clonase II enzyme mix (Invitrogen) with pH9GW. For pH9GW-Zc05I02tt, PCR product with Zc05I02tFt

and Zc05I02tR primers (Supplementary Table S2) was first inserted into the pENTR/D-TOPO vector (Invitrogen) to yield pENTR-Zc05I02tt, which was then recombined with pH9GW in a Gateway LR reaction using Gateway LR Clonase II enzyme mix. There are two versions of pCRT7CT-MT11, with or without a His-tag at the C-terminus, owing to the absence or presence of a stop codon in the gene-specific reverse primers Zc07C01CT-R and Zc07C01tR, respectively (Supplementary Table S2). Several genes were also cloned into the pESC-URA vector (Stratagene). PCR products amplified with the appropriate primer pairs (Supplementary Table S2) were sub-cloned into pCR8/GW-TOPO (Invitrogen), and the fragments released by digestion of the resulting constructs with BamHI (NEB) and XmaI (NEB) were sub-cloned into pESC-URA digested with the same enzymes.
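
The primer sequences listed for the RACE steps suggest that the nested universal primers (N-UP, UP-M and UP-S) all lie within the long UPM primer, and that N-UP matches the start of the SMART II A oligonucleotide. The following Python sketch checks this by simple substring search. The sequences are copied from the text above; the helper name `locate` is hypothetical and is used only for illustration.

```python
# Sequences are copied from the Methods text; names of variables and the
# helper function are illustrative assumptions.
UPM_LONG = "CTAATACGACTCACTATAGGGCAAGCAGTGGTATCAACGCAGAGT"
UPM_SHORT = "CTAATACGACTCACTATAGGGC"
SMART_II_A = "AAGCAGTGGTATCAACGCAGAGTACGCGGG"
NESTED = {
    "N-UP": "AAGCAGTGGTATCAACGCAGAGT",
    "UP-M": "CACTATAGGGCAAGCAGTGGT",
    "UP-S": "CTAATACGACTCACTATAGGGC",
}


def locate(primer, template):
    """Return the 0-based start of primer within template, or -1 if absent."""
    return template.find(primer)


# Position of each nested universal primer within the long UPM primer.
positions = {name: locate(seq, UPM_LONG) for name, seq in NESTED.items()}

for name, start in positions.items():
    print(f"{name}: starts at base {start + 1} of the long UPM primer")
```

Under this check, UP-S is identical to the short UPM primer, N-UP corresponds to the last 23 bases of the long UPM primer, and UP-M spans the junction between the two.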

Expression of terpene synthases in E. coli and yeast

We used several E. coli strains for expression of terpene synthases: BL21 CodonPlus (DE3) RIL (Stratagene), BL21 CodonPlus (DE3) RILP (Stratagene), Rosetta2 (DE3) pLysS (Novagen), BL21-AI RIL and BL21 Star (DE3) pMevT pMBI RIL. The E. coli strain BL21-AI RIL is BL21-AI (Invitrogen) carrying the RIL plasmid from BL21 CodonPlus (DE3) RIL. The E. coli strain BL21 Star (DE3) pMevT pMBI RIL is BL21 Star (DE3) (Invitrogen) carrying three additional plasmids, pMevT, pMBI and RIL, where pMevT and pMBI are from the Keasling lab [23] and RIL is from BL21 CodonPlus (DE3) RILP. The plasmids pMevT and pMBI enrich the IPP and DMAPP pools inside the E. coli cells. E. coli was grown at 37 °C to an OD600 of 0.6 and induced for 18 h at 18 °C

with IPTG (0.05 mM to 1 mM) and, if required, 0.2% arabinose. We used several yeast strains for terpene synthase expression: INVSc1 (Invitrogen), and EPY219 and EPY224, which are from the Keasling lab [24]. Yeast cells grown for 2 d at 30 °C in SD-URA medium were transferred to NB-URA medium and grown at 18 °C for 2-8 d.

Enzyme assays of terpene synthases expressed in E. coli

E. coli cultures that had been induced overnight to express recombinant proteins were centrifuged to collect cell pellets. The pellets were vortexed with Washing Buffer (20 mM Tris-HCl, pH 7.0, 50 mM KCl) and then centrifuged. Protein Extraction Buffer (50 mM 3-(N-morpholino)-2-hydroxypropanesulfonic acid, pH 7.0, 10% [v/v] glycerol, 5 mM MgCl2, 5 mM DTT, 5 mM sodium ascorbate, 0.5 mM phenylmethylsulfonyl fluoride) was added to the washed pellets, which were then vortexed, sonicated and centrifuged. The supernatant was recovered and the buffer was exchanged for Enzyme Assay Buffer (10 mM 3-(N-morpholino)-2-hydroxypropanesulfonic acid, pH 7.0, 10% [v/v] glycerol, 1 mM DTT) using PD-10 columns (GE Healthcare Life Sciences). Divalent cations (20 mM MgCl2 and/or 0.5 mM MnCl2, final concentration), protease inhibitors (0.2 mM Na2WO4, 0.1 mM NaF, final concentration) and either geranyl diphosphate (GPP, 10 µg) or farnesyl diphosphate (FPP, 10 µg) were added to a total of 500 µl of Enzyme Assay Buffer containing soluble proteins, and the mixture was overlaid with 200 µl of pentane and incubated for 3 h at 30 °C. The pentane used for metabolite analysis was either the top layer removed directly at the end of the assay and/or the pentane recovered after it had been vortexed with the assay mixture and then centrifuged.

Terpenoid analysis

A Thermo Finnigan Trace GC 2000 with an Rtx-5MS column with a 5 m Integra-Guard (Restek; 30 m, 0.25 mm ID, 0.25 µm df) coupled to a DSQ mass spectrometer was used for gas chromatography/mass spectrometry (GC/MS) analysis, using methods previously described [25]. Eluted compounds were identified using the NIST/EPA/NIH Mass Spectral Library (NIST 02) and the essential oil GC/MS mass spectral library of Dr. Robert P. Adams [26].

Western blot for terpene synthases expressed in yeast

Proteins from yeast expressing terpene synthases were extracted using acid-washed glass beads (425-600 µm, 30-40 U.S. sieve) (Sigma). Yeast cell pellets from 10 to 15 ml of culture were vortexed 15 times for 30 seconds on ice with 0.5 g of acid-washed glass beads and Protein Extraction Buffer as described above. Protein concentration was determined by the Bradford Protein Assay (Bio-Rad). After 10 µg of total and soluble proteins were run on SDS-PAGE, the gel was blotted onto PVDF Transfer Membrane (0.45 µm, Thermo Pierce) with transfer buffer (12 mM Tris, 96 mM glycine, 20% (v/v) MeOH, pH 8.3) in a western blot transfer apparatus according to the PVDF Transfer Membrane manual. After treatment with blocking solution (TBS containing 5% [w/v] non-fat milk powder), membranes were incubated with blocking solution containing Mouse Anti-c-Myc-tag Monoclonal Antibody (Genscript) for 2 h, washed

twice with TBS-T (TBS containing 0.1% (v/v) Tween-20) for 30 min, incubated with blocking solution containing Goat Anti-Mouse IgG (H&L) [HRP] Polyclonal Antibody (Genscript) for 2 h and washed twice with TBS-T for 30 min. The ECL system (SuperSignal West Pico Chemiluminescent Substrate, Thermo Pierce) was used to detect the terpene synthases expressed in yeast. As a positive control we used Multiple Tag (Purified) (Genscript).

Protein structural modeling

SWISS-MODEL [27] was used to model the putative protein structures and UCSF Chimera [27] was used to visualize the models.

Results and Discussion

Cloning of terpene synthases

In the ginger and turmeric cDNA libraries, homology searches against known terpene synthases identified 20 monoterpene synthase contigs, 10 sesquiterpene synthase contigs, 2 diterpene synthase contigs, 3 triterpene synthase contigs and 10 tetraterpene synthase contigs; we focused on the mono- and sesquiterpene synthases. Four monoterpene synthase contigs and 3 sesquiterpene synthase contigs were removed from the list, either because they were very similar to other contigs and had very few polymorphisms, or because they contained non-specific extra sequence that had caused them to be classified as separate contigs. Of the remaining 16 monoterpene synthases and 7 sesquiterpene synthases, 2 monoterpene synthases were full

length, 2 monoterpene synthases were full length but had a frameshift in each contig, and the others were partial at the 5’ region and/or 3’ region. Nine monoterpene synthase contigs were missing only the 5’ region sequence and 3 monoterpene synthase contigs were missing both the 5’ and 3’ region sequences. For sesquiterpene synthases, 6 contigs lacked the 5’ region sequence and 1 contig lacked both the 5’ and 3’ region sequences. Using 5’ and 3’ RACE (Rapid Amplification of cDNA Ends), we identified the unknown regions of most contigs, except for one monoterpene synthase, MT15, which is missing the 5’ region sequence and has high homology with MT06, as shown in a phylogenetic tree (Supplementary Fig. S1). The tree was generated with 45 ginger and turmeric terpene synthase contigs and 181 terpene synthases from GenBank, but only the ginger and turmeric terpene synthases are shown. Even though most contigs are partial, similar terpene synthases group together. The sesquiterpene synthase clade is separated from the other terpene synthase groups, which are in turn separated into mono-, di-, tri-, and tetraterpene synthase clades. Although MT00 is considered to be a monoterpene synthase, it falls outside the monoterpene synthase clade. Two contigs, MT00 and MT11, have full length ESTs without a frameshift and were subcloned from the EST plasmid directly into the cloning vector. For the partial contigs, we cloned full length genes after RACE from the ginger and turmeric tissues (rhizome, root or leaf) from which the ESTs of each contig originated, although some contigs were cloned from tissues other than those in which their ESTs were reported. Several contigs, as shown in Table 1, were cloned from different tissues. Initially, we tried to clone each gene from cDNA of the sample that had ESTs for the corresponding contig. Some genes,

were hard to clone, but we could clone them from cDNA prepared from tissues other than those in which their ESTs were found. After that, before cloning each gene we first used PCR with two gene-specific primers to check whether the cDNA preparation from each tissue contained the gene. White ginger and yellow ginger are very similar, and their rhizome and root tissues in particular show similar metabolite profiles. For example, MT09 has ESTs in the GY-Rh sample, and MT18, which is considered to be the same as MT09, has ESTs in the GW-R sample. It is therefore not surprising that ST05 was cloned from GW-R cDNA although its ESTs were found in the GY-R sample, and that ST07 was cloned from GW-R cDNA although its ESTs were found in GW-Rh. MT17 was cloned from the T-L sample although its ESTs were found in T-Rh. Moreover, ST03 was cloned from the GY-Rh sample although its ESTs were found in the T-L sample. There could be errors either in preparing the cDNA libraries or in sequencing the ESTs, and it is important to check whether a cDNA preparation contains the target gene before attempting to clone it. During the cloning process, we sequenced cloned genes from several colonies and found that some contigs have paralogs. Sometimes we could not find the original sequence from the EST databases and found only paralog(s) instead. A phylogenetic tree generated with full length ginger and turmeric mono- and sesquiterpene synthases has one sesquiterpene synthase clade and two monoterpene synthase clades, one close to the sesquiterpene clade and the other forming an outgroup (Fig. 2). As in the phylogenetic tree generated with partial terpene synthase contigs before cloning, MT00 does not belong to a specific clade. In Fig. 2, the main product(s) of each terpene synthase is/are shown next to the gene name and the tissue from which the gene was cloned is

also shown. All ginger and turmeric terpene synthases possess the conserved DDXXD motif required for interaction with the diphosphate group, and the monoterpene synthases that have a transit peptide have the conserved RRX8W motif in the N-terminal region, except MT00, the linalool/nerolidol synthase (Fig. 3). Most of the sesquiterpene synthases have an RX9W motif in the N-terminal region, except ST01, the β-selinene synthase. MT00 lacks even the conserved tryptophan of the RRX8W motif, whereas ST01 retains the tryptophan.
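
Motifs of this kind can be located in a protein sequence with regular expressions. The following Python sketch is one possible way to do this, assuming that X stands for any residue; the function name `find_motifs` and the toy peptide are hypothetical and do not correspond to any ginger or turmeric sequence.

```python
import re

# Patterns transcribed from the motif names in the text (X = any residue).
MOTIFS = {
    "DDXXD": re.compile(r"DD..D"),
    "RRX8W": re.compile(r"RR.{8}W"),
    "RX9W": re.compile(r"R.{9}W"),
}


def find_motifs(protein):
    """Map each motif name to the 0-based start positions of its matches.

    re.finditer reports non-overlapping matches only, which is sufficient
    for a quick presence/absence check.
    """
    return {
        name: [m.start() for m in pattern.finditer(protein)]
        for name, pattern in MOTIFS.items()
    }


# Made-up toy peptide for illustration; not a real terpene synthase.
toy = "MSRRSANYHPSIWGAAAADDIYDVE"
hits = find_motifs(toy)
for name, starts in hits.items():
    print(name, starts)
```

Note that any sequence matching RRX8W also matches RX9W as written here, because the second arginine counts as one of the nine variable residues; the two patterns therefore have to be interpreted together when comparing mono- and sesquiterpene synthases.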

Expression of terpene synthases

The products of the ginger and turmeric terpene synthases are summarized in Tables 2 and 3. Some monoterpene synthases produce sesquiterpenes, and some sesquiterpene synthases produce monoterpenes, when provided with the alternative substrate. The major product(s) of each terpene synthase is/are marked with grey shading in these tables. To identify the products of the terpene synthases, the recombinant proteins were expressed in E. coli or yeast, and either enzyme assays using crude E. coli extract with GPP or FPP as a substrate, or pentane extracts of E. coli or yeast cultures, were used (Table 4). All chromatograms in the figures are shown on an absolute scale. During identification of products using the GC/MS libraries, some peaks could not be assigned to a compound. They are called “unknown” because the score from the library comparison is not high enough, or because there is no retention time information with which to compare them accurately with other compounds; in such cases the most similar compound is given with the suffix “-like”.

α-Zingiberene/β-sesquiphellandrene synthase (ST00A and ST00B)

The most abundant terpenes in ginger and turmeric are α-zingiberene and β-sesquiphellandrene (Fig. 1), and these compounds are produced by the terpene synthases ST00A and ST00B. ST00A and ST00B are very similar, with 98.8% identity at the DNA sequence level and 98.4% in amino acid sequence, and they synthesize the same products. ST00A and ST00B produced very small amounts of monoterpenes and large amounts of sesquiterpenes, which suggests that these enzymes are sesquiterpene synthases. This is supported by their sequence similarity to other sesquiterpene synthases and the absence of a transit peptide. Enzyme assays using E. coli crude extracts from cells expressing either ST00A or ST00B with GPP as a substrate showed β-phellandrene (49.2%) and α-pinene (33.7%) as major products and sabinene (4(10)-thujene) (6.2%), β-pinene (5.9%) and α-phellandrene (5.0%) as minor products, although their amounts are very small (Fig. 4). With FPP as a substrate, ST00A and ST00B produced α-zingiberene (49.3%), β-sesquiphellandrene (40.7%), β-bisabolene (6.3%) and an unknown (trans-sesquisabinene hydrate-like2) compound (3.7%) (Fig. 4). ST00A and ST00B expressed in the yeast strain EPY219 produced the same products as when they were expressed in E. coli. The ratios of products differ according to expression temperature and induction time. Expression at the lower temperature, 18 °C, yielded more product than expression at 30 °C, and longer induction normally

yielded more (8 days versus 2 days). After 4 days of induction at 18 °C, pentane extracts of EPY219 expressing ST00A were analyzed and contained α-zingiberene (67%), β-sesquiphellandrene (22.7%), β-bisabolene (6.2%), ar-curcumene (0.9%), unknown (trans-sesquisabinene hydrate-like2) (0.6%), [E]-γ-bisabolene (0.4%), unknown (trans-sesquisabinene hydrate-like3) (0.4%), γ-eudesmol (0.4%), γ-curcumene (0.3%), unknown (7-epi-sesquithujene-like) (0.3%), unknown (α-eudesmol-like) (0.3%), trans-α-bergamotene (0.2%) and α-acorenol (0.2%). When ST00A was expressed in INVSc1 yeast cells, α-zingiberene and β-sesquiphellandrene peaks were also detected; however, their amounts were only 2.56% and 2.34%, respectively, of those from EPY224 expressing ST00A. When the peak area of [E]-nerolidol, which is produced by yeast itself in both INVSc1 and EPY224, is used as a standard, the α-zingiberene/β-sesquiphellandrene produced by INVSc1 expressing ST00A is only 3.33% of that from EPY224 expressing ST00A; the yield difference is mainly due to the overall capacity for IPP production in these yeast strains. There is a large difference in the ratio of α-zingiberene to β-sesquiphellandrene between E. coli and yeast expression. The amount of α-zingiberene in yeast is about 3 times that of β-sesquiphellandrene, but only 1.2 times in E. coli. For expression in E. coli, we did not include a stop codon when inserting ST00A or ST00B into the pH9GW vector, which caused extra amino acids to be added at the C-terminus. In addition, there is a His-tag at the N-terminus of the protein expressed from the pH9GW vector. In yeast, translation starts at the ATG of ST00A or ST00B and there is a short Myc tag at the C-terminus. Also, EPY219 produces the products of ST00A or ST00B in vivo, whereas we used in vitro

enzyme assays to measure the activities of these enzymes expressed in E. coli. Therefore, the product ratio from yeast is more reliable than that from E. coli for ST00A and ST00B.
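
The difference between the two expression systems can be made explicit by computing the α-zingiberene to β-sesquiphellandrene ratio from the percentages reported above. The following Python sketch is only an arithmetic illustration; the percentages are taken from the text, and the function and variable names are hypothetical.

```python
def product_ratio(percentages, numerator, denominator):
    """Ratio of two products given their percentage of total peak area."""
    return percentages[numerator] / percentages[denominator]


# Percentages reported in the text for ST00A/ST00B with FPP (E. coli assay)
# and for EPY219 expressing ST00A after 4 days at 18 °C (yeast).
e_coli = {"alpha-zingiberene": 49.3, "beta-sesquiphellandrene": 40.7}
yeast = {"alpha-zingiberene": 67.0, "beta-sesquiphellandrene": 22.7}

ratio_e_coli = product_ratio(e_coli, "alpha-zingiberene", "beta-sesquiphellandrene")
ratio_yeast = product_ratio(yeast, "alpha-zingiberene", "beta-sesquiphellandrene")

print(f"E. coli: {ratio_e_coli:.2f}, yeast: {ratio_yeast:.2f}")
```

These values correspond to the ratios of about 1.2 in E. coli and about 3 in yeast stated in the text.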

Camphene/α-pinene synthases (MT06B, MT09A2 and MT12A-M2)

Camphene is the most abundant monoterpene in ginger, and three genes, MT06B, MT09A2 and MT12A-M2, produce camphene as a major product. MT12A-M2 has one point mutation at the 140th amino acid, D to G. At this position, other ginger and turmeric monoterpene synthases have either an E, in the clade that includes MT02A, MT04, MT05, MT11 and MT17 and is closer to the sesquiterpene synthase clade, or a D, in the other monoterpene synthase clades (Fig. 2, Fig. 3). Sesquiterpene synthase sequences are not conserved at this amino acid. MT12A-M2 is in the same clade as MT06B and MT09A2 (Fig. 2) and produces the same products. With GPP as a substrate, MT06B synthesized camphene (45.4%) and α-pinene (40.4%) as major products and limonene (6.6%), borneol (endo-borneol) (3.3%), p-mentha-1,4(8)-diene (terpinolene) (1.5%), β-pinene (1.0%), tricyclene (0.7%), cis-sabinene hydrate (0.4%), p-menth-1-en-8-ol (α-terpineol) (0.3%) and trans-pinene hydrate (trans-pinan-2-ol) (0.3%) as minor products (Supplementary Fig. S2). MT09A2 utilized GPP and produced camphene (60.1%) and α-pinene (20.4%) as major products and limonene (7.7%), borneol (endo-borneol) (6.1%), p-mentha-1,4(8)-diene (terpinolene) (1.6%), tricyclene (1.5%), β-citronellal (0.9%), β-pinene (0.6%), cis-sabinene hydrate (0.4%), p-menth-1-en-8-ol (α-terpineol) (0.4%), trans-pinene hydrate (trans-pinan-2-ol) (0.3%) and γ-terpinene (0.1%) as minor products (Supplementary Fig. S3). With GPP as a

substrate, MT12A-M2 synthesized camphene (55.8%) and α-pinene (28.4%) as major products and limonene (7.9%), borneol (endo-borneol) (3.3%), p-mentha-1,4(8)-diene (terpinolene) (1.9%), tricyclene (1.2%), β-pinene (0.9%), p-menth-1-en-8-ol (α-terpineol) (0.4%), trans-pinene hydrate (trans-pinan-2-ol) (0.3%) and cis-sabinene hydrate (0.1%) as minor products (Supplementary Fig. S4). None of the three proteins produced sesquiterpenes with FPP as a substrate. The camphene/α-pinene ratio differs among these proteins: 1.12 for MT06B, 1.96 for MT12A-M2 and 2.95 for MT09A2 (Table 2). In 7-month-old ginger rhizome samples, the camphene:α-pinene ratio is 2.99:1, which suggests that MT09A2 or related genes (MT09, MT09B, etc.) are expressed at a higher level than other TPS genes, although it is also possible that there are other genes that make camphene more specifically. Microarray data support this hypothesis; the average intensities of the MT06, MT09 and MT12 contigs are 7194, 18343 and 4427, respectively.
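
The comparison between the enzyme product ratios and the rhizome ratio can be reproduced from the percentages given above. The following Python sketch recomputes the camphene/α-pinene ratio for each enzyme and picks the one closest to the tissue value; the data are taken from the text, whereas the variable names and the "closest ratio" criterion are illustrative assumptions rather than the analysis actually performed.

```python
# (camphene %, alpha-pinene %) with GPP as substrate, as reported in the text.
products = {
    "MT06B": (45.4, 40.4),
    "MT12A-M2": (55.8, 28.4),
    "MT09A2": (60.1, 20.4),
}

# Camphene:alpha-pinene ratio reported for 7-month-old ginger rhizome.
TISSUE_RATIO = 2.99

ratios = {name: camphene / pinene for name, (camphene, pinene) in products.items()}

# Enzyme whose in vitro ratio lies closest to the ratio seen in the tissue.
closest = min(ratios, key=lambda name: abs(ratios[name] - TISSUE_RATIO))

for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
print("closest to rhizome ratio:", closest)
```

A match in ratio alone does not establish which gene dominates in planta, which is why the microarray intensities are cited as supporting evidence.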

α-Pinene/β-pinene synthase (MT04)

MT06B, MT09A2 and MT12A-M2 produce more camphene than α-pinene, whereas MT04 mainly produces α-pinene. Enzyme assays yielded α-pinene (60.1%) and β-pinene (30.7%) as major products and limonene (5.6%), sabinene (4(10)-thujene) (2.7%) and 1,8-cineole (eucalyptol) (0.9%) as minor products with GPP as a substrate, and no product with FPP as a substrate (Supplementary Fig. S5). The limonene peak in the GC/MS chromatograms actually consists of two co-eluting peaks. The one that is smaller and elutes earlier looks like limonene and the other one which is bigger and

comes later looks like (R)-(+)-m-mentha-6,8-diene (sylvestrene) when both retention time and mass spectra are considered; however, the two are not easy to discriminate because of their similarity. Although MT04 mainly produces α-pinene, most α-pinene may be produced by the camphene/α-pinene synthases (MT06B, MT09A2 and MT12A-M2). The β-pinene peak from ginger rhizome samples is very small (Fig. 1) and the amount of α-pinene produced by MT04 is about twice that of β-pinene, which suggests that MT04 contributes at most 20.7% of α-pinene production in ginger rhizome. The main monoterpene of turmeric leaf is β-pinene. Given that MT04 was cloned from yellow ginger root and produces more α-pinene than β-pinene, there should be a turmeric leaf-specific β-pinene synthase.

β-Phellandrene synthase (MT08)

β-Phellandrene is the second most abundant monoterpene in ginger rhizome, and the major product of MT08 is β-phellandrene. According to the EST data, the sequence of MT08 is complete but has one frameshift, so MT08 without the frameshift was cloned from white ginger rhizome. However, MT08 proteins expressed in E. coli with or without the transit peptide were insoluble when the pCRT7/CT-TOPO vector was used for expression, and did not yield any terpene product in enzyme assays with GPP or FPP. MT08 without the transit peptide and with a thrombin cleavage site at the N-terminus was cloned into the pH9GW vector and expressed in BL21 Star (DE3) pMevT pMBI RIL. MT08 expressed in BL21 Star (DE3) pMevT pMBI RIL produced β-phellandrene and trace amounts of α-pinene

and [Z]-β-farnesene; however, the peaks are small because of low solubility. Enzyme assays of E. coli crude extract with GPP yielded more terpene products, and the α-pinene peak was also clearly visible: β-phellandrene (88.3%) and α-pinene (11.7%) (Supplementary Fig. S6). β-Phellandrene is also produced by the α-zingiberene/β-sesquiphellandrene synthases (ST00A and ST00B), for which β-phellandrene makes up 49.2% of their monoterpene production. β-Phellandrene and β-sesquiphellandrene are structurally very similar. Although α-zingiberene/β-sesquiphellandrene synthase is very active and produces large amounts of α-zingiberene and β-sesquiphellandrene, its production of monoterpenes is not as efficient as that of sesquiterpenes, as shown in Fig. 4. The β-phellandrene peak from α-zingiberene/β-sesquiphellandrene synthase is even smaller than the E. coli background peak. Considering also that monoterpenes are mainly produced in plastids, MT08 is thought to be the main enzyme for β-phellandrene production in ginger rhizome.

1,8-Cineole synthase (MT11)

Both ginger and turmeric produce 1,8-cineole in large amounts (Fig. 1), and MT11 makes 1,8-cineole. According to the EST data, MT11 is represented by a full length EST, and the gene was cloned directly from the cDNA library plasmid into the pCRT7/CT-TOPO vector. Enzyme assays with GPP using E. coli crude extract showed 1,8-cineole (78.3%), p-menth-1-en-8-ol (α-terpineol) (17.4%) and α-pinene (4.3%) (Supplementary Fig. S7). Expression of MT11 in BL21 Star (DE3) pMevT pMBI RIL also produced 1,8-cineole and p-menth-1-en-8-ol (α-terpineol); however, the α-pinene peak was not detected due to

the overall low concentration of extracted terpenes. No visible sesquiterpene peaks attributable to MT11 were observed in the pentane extract of BL21 Star (DE3) pMevT pMBI expressing MT11.

α-Phellandrene synthase (MT03)

α-Phellandrene is the most abundant monoterpene in turmeric rhizome (Fig. 1). With GPP as a substrate, MT03 synthesized α-phellandrene (92.2%) as the major product and β-phellandrene (contains limonene) (3.8%), γ-terpinene (2.5%), α-terpinene (1.0%) and p-mentha-1,4(8)-diene (terpinolene) (0.5%) as minor products; with FPP as a substrate, however, MT03 did not synthesize a specific sesquiterpene (Supplementary Fig. S8). The β-phellandrene peak also contains a small amount of limonene. Limonene and β-phellandrene differ in retention time by only 0.01 min in our GC/MS chromatograms and co-elute. The mass spectrum of the β-phellandrene peak shows overall similarity to the β-phellandrene mass spectrum in the library; however, the m/z 67 and 68 ions are from limonene (Supplementary Fig. S8).

p-Mentha-1,4(8)-diene (terpinolene) synthase (MT07)

Turmeric rhizome produces p-mentha-1,4(8)-diene (terpinolene) as its second most abundant monoterpene (Fig. 1). With GPP as a substrate, MT07 synthesized p-mentha-1,4(8)-diene (terpinolene) (89.9%) as the major product and α-terpinene (4.3%), α-phellandrene (2.7%), limonene (1.5%), 3-carene; 4 (1.2%) and γ-terpinene (0.4%) as minor products (Supplementary Fig. S9). However, MT07 did not produce any

sesquiterpene with FPP as a substrate.

(S)-(+)-Linalool/nerolidol synthase (MT00) and (R)-(-)-linalool/[Z]-α-bisabolene/trans-α-bergamotene synthase (MT17A2)

Ginger rhizome and leaf produce linalool, and turmeric leaf also produces smaller amounts of linalool. Enzyme assays with MT00 produced linalool (100%) with GPP as a substrate and [E]-nerolidol (100%) with FPP as a substrate (Supplementary Fig. S10). Acetate forms of linalool and [E]-nerolidol were also produced during enzyme assays using crude protein extracts, and some E. coli proteins may be involved in adding the acetate. No acetylated form of linalool or [E]-nerolidol was observed in the enzyme assay using purified MT00, from which most E. coli proteins had been eliminated. Expression of MT00 in BL21 Star (DE3) pMevT pMBI RIL also produced linalool and [E]-nerolidol. Linalool and [E]-nerolidol were synthesized even without IPTG induction, and IPTG induction elevated terpene production. For expression in BL21 Star (DE3) pMevT pMBI RIL cells, the MT00 gene was cloned into the pCRT7/CT-TOPO vector, and the T7 promoter in this vector is not tightly regulated. We could see leaky expression in protein gels and the resulting terpene production. The cloning of MT17 yielded diverse paralogs: MT17A, MT17A2, MT17B, MT17B2, MT17C and MT17D. Enzyme assays with MT17A2 yielded linalool (100%) with GPP and a variety of sesquiterpenes with FPP (Supplementary Fig. S11). MT17A2 produced small amounts of different sesquiterpenes such as cis-α-bisabolene

(22.7%), trans-α-bergamotene (20.6%), β-bisabolene (13.6%), epi-α-bisabolol (12.6%), cis-α-bergamotene (7.6%), α-bisabolol (7.3%), [E]-nerolidol (5.8%), β-sesquiphellandrene (3.7%), unknown (cis-α-bisabolene-like) (2.1%), γ-curcumene (1.6%), unknown (β-sesquiphellandrene-like) (1.5%) and unknown (7-epi-sesquithujene-like) (0.8%). The mass spectra of the three unknown compounds are very similar to those of cis-α-bisabolene, β-sesquiphellandrene and 7-epi-sesquithujene, respectively. However, the retention times of these compounds do not match the retention times of those compounds in the library. Nevertheless, the mass spectrum of the unknown peak (7-epi-sesquithujene-like) is very similar to that of 7-epi-sesquithujene, and its retention time is also close to that of 7-epi-sesquithujene. The β-bisabolene peak also contains the farnesene peak; farnesene elutes earlier and β-bisabolene later within this peak. The average peak areas of the two peaks between peak 6 and peak 7 are 2115009 and 1375920 in Supplementary Fig. S11, panels D2 and E2, respectively, which means that peak areas in panel E2 are approximately 65% of those in D2. The peak area of β-bisabolene was calculated by subtracting the correspondingly adjusted farnesene peak area in D2 from the combined farnesene and β-bisabolene peak in E2. The control enzyme assay without pEXP5CT-MT17A2 also produced [E]-nerolidol, and the [E]-nerolidol peak in E2 was calculated in the same way, by subtracting the [E]-nerolidol in D2. The best hit from a BLAST search with MT00 is another linalool synthase, whereas MT17A2 is more similar to other monoterpene synthases. MT00 produces both linalool and nerolidol in large amounts, whereas MT17A2 synthesizes sesquiterpenes in

very small amounts, although it synthesizes a large amount of linalool. When a chiral column was used, we found that MT00 produces (S)-(+)-linalool while MT17A2 synthesizes (R)-(-)-linalool (Supplementary Fig. S12). These linalool synthases have different origins and produce different enantiomers, (S)-(+)-linalool and (R)-(-)-linalool. MT17A2 has several paralogs, and it will be interesting to check whether the other paralogs have the same enzyme activity as MT17A2.
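
The background correction described above for the β-bisabolene and [E]-nerolidol peaks of MT17A2 can be sketched as a scaled subtraction. In the following Python sketch the two reference peak areas are the values reported for panels D2 and E2; the sample and control areas passed to `background_corrected` are hypothetical placeholders, the function names are illustrative, and clamping negative results to zero is an added assumption.

```python
def scale_factor(reference_sample, reference_control):
    """Factor that adjusts control peak areas to the scale of the sample run."""
    return reference_sample / reference_control


def background_corrected(sample_area, control_area, factor):
    """Sample peak area minus the scaled control area (never below zero)."""
    return max(sample_area - factor * control_area, 0.0)


# Average areas of the two reference peaks: E2 (sample) and D2 (control).
factor = scale_factor(1_375_920, 2_115_009)
print(f"scale factor: {factor:.2f}")

# Hypothetical areas, for illustration only: a combined farnesene plus
# beta-bisabolene peak in the sample and the farnesene peak in the control.
corrected = background_corrected(900_000.0, 400_000.0, factor)
print(f"corrected area: {corrected:.0f}")
```

Scaling the control by reference peaks shared between the two runs compensates for differences in overall signal between chromatograms before the background is removed.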

Sabinene/nerolidol synthase (MT06 and MT06A)

In ginger and turmeric there are many small terpene peaks, and many of them are considered to be by-products formed alongside other terpenes that give large peaks. However, the evolution of terpene synthases sometimes leads to different products and/or incomplete enzyme activity. MT06 and MT06A may belong to this category. Although they are soluble to a low or medium degree (Table 4), the amount of product in enzyme assays is low; the peak heights are similar to those of myrcene and [E]-β-farnesene, which are synthesized by E. coli. However, MT06/MT06A produce many different compounds. MT06 and MT06A are 98.4% identical in nucleotide sequence and 98.0% identical in amino acid sequence, and the differences, including one gap, are found only in the transit peptide sequences. When they were cloned into the expression vector pEXP5/CT-TOPO, the transit peptides were removed; therefore, truncated MT06 and MT06A are the same. MT06 and MT06A produce a variety of monoterpenes, namely sabinene (4(10)-thujene) (29.0%), linalool (24.8%), γ-terpinene (9.0%), α-terpinene (8.2%), p-menth-1-en-4-ol (terpinen-4-ol) (8.0%), cis-sabinene hydrate (6.2%), p-menth-1-en-8-ol (α-terpineol)

(6.2%), p-mentha-1,4(8)-diene (terpinolene) (5.2%) and α-thujene (3-thujene) (3.4%), and a variety of sesquiterpenes, namely [E]-nerolidol (41.0%), epi-β-bisabolol (22.3%), β-curcumene (14.5%), unknown (7-epi-sesquithujene-like) (8.1%), unknown (cis-sesquisabinene hydrate-like) (4.5%), γ-curcumene (4.4%), β-sesquiphellandrene (1.9%), epi-α-bisabolol (1.6%), unknown (trans-sesquisabinene hydrate-like1) (1.2%) and [E]-γ-bisabolene (0.6%) (Supplementary Fig. S13, Supplementary Fig. S14). Although control E. coli without MT06 or MT06A also produces linalool and [E]-nerolidol, MT06 and MT06A are thought to produce linalool and [E]-nerolidol, based on comparison of their amounts, relative to the myrcene and [E]-β-farnesene peaks, with the corresponding peaks of the control. The peaks referred to as unknown (7-epi-sesquithujene-like), unknown (cis-sesquisabinene hydrate-like) and unknown (trans-sesquisabinene hydrate-like1) have mass spectra very similar to those of 7-epi-sesquithujene, cis-sesquisabinene hydrate and trans-sesquisabinene hydrate, respectively; however, the retention times of these peaks deviate slightly from the library retention times (Supplementary Fig. S13, Supplementary Fig. S14, K). Among the genes that we cloned and characterized, some MT06/MT06A products, such as α-thujene (3-thujene), p-menth-1-en-4-ol (terpinen-4-ol), β-curcumene, epi-β-bisabolol and some unknown compounds, are produced only by MT06/MT06A, although it is possible that there are other as yet unidentified enzymes that specifically synthesize these products. These minor products are present in small amounts in ginger and turmeric tissues and may be supplied by terpene synthases like MT06/MT06A.

epi-α-Bisabolol/α-bisabolol synthase (MT02A)

There are small amounts of epi-α-bisabolol and α-bisabolol in ginger and turmeric. MT02 has a transit peptide and is similar to monoterpene synthases. However, it is insoluble when expressed in E. coli, and it produced MT02A-specific terpene products when expressed in the yeast strains EPY219 and EPY224. EPY224 expressing MT02 produced epi-α-bisabolol (58.3%) and α-bisabolol (38.7%) as major products, although the EPY224 control also produced very small amounts of these compounds (Supplementary Fig. S15). MT02A also synthesized trace amounts of [Z]-α-bisabolene (1.7%), β-bisabolene (1.1%) and trans-α-bergamotene (0.1%). On the basis of sequence homology, MT02A is classified as a monoterpene synthase; however, it did not produce monoterpenes in yeast, and we think that production of monoterpenes is limited, at least in the yeast strains EPY219 and EPY224.

(-)-Neointermedeol synthase (ST02A4), α-elemol synthase (ST02B) and β-elemene (germacrene A) synthase (ST02C)

During the cloning of ST02, many paralogs were found: ST02A, ST02A2, ST02A3, ST02A4, ST02B, ST02B2-FS, ST02C and ST02C2. Many of them are insoluble in both E. coli and yeast; however, ST02B and ST02C are soluble in E. coli. ST02A4 is insoluble in both E. coli and yeast when checked by stained protein gel (for E. coli) and Western blot (for yeast); nevertheless, yeast expressing ST02A4 produced very small amounts of products. The pentane extract of the yeast strain EPY224 expressing ST02A4 contained (-)-neointermedeol (48.7%) as the major product and β-elemene (12.6%), α-cadinol (7.5%),

unknown (cubenol-like) (6.5%), germacrene D (6.1%), epi-α-muurolol (ô-muurolene) (6.0%), α-muurolol (δ-cadinol) (2.5%), unknown (selina-6-en-4-ol-like) (2.4%), α-muurolene (1.6%), δ-cadinene (cadina-1(10),4-diene) (1.2%), δ-elemene (1.2%), γ-elemene (1.1%), γ-eudesmol (0.8%), (+)-intermedeol (0.8%), [E]-caryophyllene (0.5%) and unknown (spathulenol-like) (0.4%) as minor products (Supplementary Fig. S16). Enzyme assays using crude extract of E. coli expressing ST02B with GPP as a substrate produced several monoterpenes: linalool (25.1%), myrcene (25.0%), limonene (15.5%), (Z)-β-ocimene (14.3%), (E)-β-ocimene (9.8%), p-mentha-1,4(8)-diene (terpinolene) (7.7%) and p-menth-1-en-8-ol (α-terpineol) (2.7%) (Supplementary Fig. S17). Because E. coli crude extract without the pET101/D-ST02B plasmid also produced myrcene with GPP as a substrate, the area of the myrcene peak in the control was subtracted from the area of the myrcene peak in the ST02B experiment. With FPP as a substrate, enzyme assays using crude extract of E. coli expressing ST02B also produced several sesquiterpenes. Enzyme assays were performed with overlaid pentane. After 3 hours of incubation at 30 °C, either the top pentane layer was injected directly into the GC/MS, or the whole enzyme assay including the remaining top pentane was vortexed and centrifuged and the collected pentane was injected into the GC/MS. Normally, total ion chromatograms obtained from top pentane and vortexed pentane show very similar profiles; however, ST02B showed different product ratios. The major difference is the amount of α-elemol, which represented 60.5% of the synthesized sesquiterpenes in the top overlaid pentane but only 25.3% in the vortexed samples. When the peak areas from both top and vortexed pentane are summed, α-elemol is 44.3% of all products. During enzyme assays, synthesized terpenes are captured in the top


pentane layer, and terpenes remaining in the enzyme assay buffer are collected by vortexing with the pentane. The product ratio was therefore calculated from the sum of the peak areas from the top pentane and the vortexed pentane, which suggested that ST02B utilizes FPP and synthesizes α-elemol (44.3%), β-elemene (18.3%), α-copaene (11.2%), unknown (cyclosativene-like) (7.8%), γ-elemene (6.1%), germacrene B (4.9%), δ-cadinene (cadina-1(10),4-diene) (2.8%), α-muurolene (3.1%) and germacrene D (1.6%) (Supplementary Fig. S18). The retention time of the unknown ((+)-cyclosativene-like) peak is 18.14 min, and there is a trace amount of (+)-cyclosativene at 18.21 min, although it is not clearly visible in the figure because the peak is small.

Enzyme assays using E. coli crude extract expressing ST02C with GPP as a substrate produced several monoterpenes: myrcene (30.4%), limonene (17.9%), linalool (15.6%), [Z]-β-ocimene (13.5%), p-mentha-1,4(8)-diene (terpinolene) (9.9%), (E)-β-ocimene (9.3%) and p-menth-1-en-8-ol (α-terpineol) (3.3%) (Supplementary Fig. S19). Because E. coli crude extract without the pET101/D-ST02C plasmid also produces myrcene with GPP as a substrate, the area of the myrcene peak in the control was subtracted from the area of the myrcene peak in the ST02C experiment. Enzyme assays using E. coli crude extract expressing ST02C with FPP as a substrate produced β-elemene (49.3%) as the major product and germacrene D (12.4%), α-muurolene (9.5%), γ-muurolene (4.7%), δ-cadinene (cadina-1(10),4-diene) (4.0%), (+)-cyclosativene (3.2%), germacrene B (2.9%), δ-elemene (2.9%), γ-elemene (2.7%), [E]-caryophyllene (2.4%), unknown (β-elemene-like) (2.1%), unknown ((+)-cyclosativene-like) (2.1%) and α-copaene (1.7%) as minor products (Supplementary Fig. S20).
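The pooling of top-layer and vortexed peak areas, together with subtraction of a control peak (as was done for myrcene in the GPP assays), can be illustrated with a minimal Python sketch. The function name, the dictionary layout and the peak areas below are hypothetical and are not taken from the chromatograms; combining both corrections in one helper is also an assumption made for convenience.

```python
def pooled_percentages(top_areas, vortexed_areas, control_areas=None):
    """Return product percentages from summed top-layer and vortexed peak areas.

    Each argument maps a compound name to an integrated peak area
    (arbitrary units). Peaks listed in control_areas (background from a
    no-plasmid control) are subtracted, floored at zero, before the
    percentages are computed.
    """
    control_areas = control_areas or {}
    pooled = {}
    for name in set(top_areas) | set(vortexed_areas):
        area = top_areas.get(name, 0.0) + vortexed_areas.get(name, 0.0)
        pooled[name] = max(area - control_areas.get(name, 0.0), 0.0)
    total = sum(pooled.values())
    if total <= 0:
        raise ValueError("no product signal left after background subtraction")
    return {name: 100.0 * area / total for name, area in pooled.items()}


# Hypothetical peak areas, chosen only to illustrate the arithmetic.
top = {"alpha-elemol": 600.0, "beta-elemene": 400.0}
vortexed = {"alpha-elemol": 200.0, "beta-elemene": 600.0}
shares = pooled_percentages(top, vortexed)
```

Because the two injections are summed before normalizing, the pooled percentage of a compound is a weighted average of its two single-injection percentages, weighted by the total signal in each injection. This is why a pooled value such as 44.3% lies between the top-layer (60.5%) and vortexed (25.3%) values.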


Monoterpene products are very similar between ST02B and ST02C; however, their sesquiterpene product profiles differ. ST02C makes β-elemene as the major product, whereas ST02B synthesizes α-elemol as the major product, although ST02B also produces β-elemene. ST02B and ST02C are 96.0% identical in DNA sequence and 95.7% identical in amino acid sequence. ST02A4 is 95.3% identical to ST02B and 96.0% identical to ST02C in amino acid sequence.

Germacrene A is known to undergo a thermal Cope rearrangement to β-elemene [28] (Fig. 9). Germacrene A was detected when the GC injection port was set to 250 °C, with a small amount of conversion to β-elemene. However, germacrene A was completely converted to β-elemene when the sample was injected at 280 °C, using an Agilent 6890N gas chromatography system coupled to an Agilent 5975B mass spectrometer [29]. β-Elemene was also reported to no longer be observed, and to be replaced by a broad peak corresponding to germacrene A, when the injection port was set between 150 °C and 180 °C instead of the usual 280 °C (the specific GC/MS instrument was not mentioned in the paper) [30]. In our assays with ST02B and ST02C, β-elemene was detected with a 220 °C GC inlet temperature. With the inlet temperature set to 150 °C, the amount of β-elemene decreased and germacrene A was detected in our system. Although the amount of germacrene A is small and the reduction of β-elemene is not dramatic, we think that the β-elemene that we detect may indeed arise from thermal rearrangement of germacrene A. However, at 150 °C we could not see a peak for (+)-hedycaryol, which is considered to give rise to α-elemol at 220 °C, although the α-elemol peak is reduced at 150 °C (Fig. 9). With an inlet temperature of 150 °C, all peaks are smaller than at 220 °C and some minor peaks are no longer seen, so the relative product amounts in Table 3 are based on injection at 220 °C.

γ-Amorphene synthase (ST03)

ST03 is very similar to the ST02-derived genes. For example, ST03 is 95.2%, 92.5%, 90.0% and 89.5% identical to ST02A, ST02A4, ST02B and ST02C, respectively, at the amino acid sequence level. However, ST03 synthesizes products different from those of the ST02-derived genes. Enzyme assays using crude E. coli extract with GPP yielded linalool (36.8%), myrcene (27.3%), [Z]-β-ocimene (17.6%), p-mentha-1,4(8)-diene (terpinolene) (5.2%), cis-p-menth-2-en-1-ol (3.8%), [E]-β-ocimene (3.5%) and p-menth-1-en-8-ol (α-terpineol) (3.3%) (Supplementary Fig. S21). With FPP as a substrate, ST03 synthesized γ-amorphene (65.4%) as the major product and allo-aromadendrene (11.8%), germacrene D-4-ol (9.6%), γ-cadinene (8.7%) and germacrene D (4.4%) as minor products (Supplementary Fig. S22).

In turmeric, γ-amorphene is hardly observed. However, both ginger root and leaf contain γ-amorphene. In ginger rhizome, it is hard to tell whether γ-amorphene is present because of the huge α-zingiberene peak that elutes 0.1 minute earlier than the γ-amorphene peak in the GC chromatogram. Given that ST03 was cloned from yellow ginger rhizome cDNA, ginger rhizome is likely to produce γ-amorphene, although this could not be confirmed because the α-zingiberene peak obscures γ-amorphene in the total ion chromatogram of ginger rhizome samples.


α-Humulene (α-caryophyllene) synthase (ST05 and ST05A)

α-Humulene (α-caryophyllene) is observed in all tissues (rhizome, leaf and root) of both ginger and turmeric. However, the amount of α-humulene in all turmeric tissues and in ginger rhizome is very small; in Fig. 1, the very small peak to the left of peak 29 ([E]-β-farnesene) is the α-humulene peak. In ginger root, the most abundant sesquiterpene is α-humulene, and the α-humulene synthases (ST05 and ST05A) were cloned from white ginger root. Enzyme assays with ST05 and ST05A did not produce noticeable monoterpenes with GPP as a substrate. With FPP as a substrate, ST05 synthesized α-humulene (α-caryophyllene) (83.4%) as the major product and [E]-caryophyllene (β-caryophyllene) (14.2%), β-elemene (1.5%) and 1,5,9-trimethyl-1,5,9-cyclododecatriene (0.8%) as minor products (Supplementary Fig. S23). ST05A showed a profile similar to that of ST05: α-humulene (α-caryophyllene) (88.4%) as the major product with [E]-caryophyllene (β-caryophyllene) (10.5%) and β-elemene (1.0%). However, the 1,5,9-trimethyl-1,5,9-cyclododecatriene peak was not observed, owing to the overall low concentration (Supplementary Fig. S23).

Both ST05 and ST05A produce α-humulene and [E]-caryophyllene. Interestingly, [E]-caryophyllene is the most abundant sesquiterpene in ginger leaf, which contains only small amounts of α-humulene. This suggests that ginger leaves have a terpene synthase very similar to ST05 or ST05A that produces [E]-caryophyllene as the major product and α-humulene as a minor product.

ST05 and ST05A are very similar paralogs: 99.1% homology in DNA sequence (11 bases different) and 98.1% homology in amino acid sequence (3 amino acids


different). Despite their high homology, their solubility differs: ST05A is barely soluble whereas ST05 is quite soluble (Table 4). We therefore tried to purify ST05A using a His-tag. However, the amount of soluble ST05A was small and some E. coli proteins co-eluted, so only partial purification was achieved. In the enzyme assays above, we used crude extract for ST05 and partially purified protein for ST05A, and the results are very similar.

Shampoo ginger (Zingiber zerumbet Smith) α-humulene synthase has 91.5% homology with ST05 and 91.3% homology with ST05A in amino acid sequence, and it also makes α-humulene (α-caryophyllene) as the major product and a small amount of [E]-caryophyllene (β-caryophyllene) [2]. However, shampoo ginger α-humulene synthase was not reported to synthesize β-elemene or 1,5,9-trimethyl-1,5,9-cyclododecatriene. It is possible that these compounds were not detected because of the limited sensitivity of GC/MS, just as we could not see the 1,5,9-trimethyl-1,5,9-cyclododecatriene peak in ST05A assays.
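Pairwise percentages such as those quoted for ST05 and ST05A depend on how the alignment is scored. The text does not state which tool or convention was used, so the following Python sketch is only one plausible convention: it counts identical residues over the alignment columns in which neither sequence has a gap. The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap.

    Both inputs must already be aligned (equal length, gaps marked with
    the gap character). Gapped columns are skipped, which is one of
    several conventions in use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy five-residue fragments with one substitution, for illustration only.
identity = percent_identity("MKTLL", "MKSLL")
```

Other conventions (for example, dividing by the length of the shorter sequence or counting gapped columns as mismatches) give slightly different values, which is worth keeping in mind when comparing percentages reported by different tools.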

β-Selinene (eudesma-4(14),11-diene) synthase (ST01)

The β-selinene (eudesma-4(14),11-diene) peak has a retention time similar to that of α-zingiberene. Unlike α-humulene, which elutes 0.1 minute later than the huge α-zingiberene peak, the β-selinene peak elutes 0.15 minutes earlier, so it is not covered by the α-zingiberene peak. As shown in Fig. 1, β-selinene is produced in noticeable amounts in ginger rhizome, but only in trace amounts in ginger root and leaf, and is not detectable in turmeric. Enzyme assays with ST01 yielded β-selinene (eudesma-4(14),11-diene) (51.9%) as the major product and 7-epi-α-selinene (14.2%), unknown (eremophila-1(10),11-diene-like) (11.3%), β-elemene (9.7%), unknown (β-chamigrene-like) (7.4%), unknown (guaia-1(5),7(11)-diene-like) (3.4%) and (+)-intermedeol (2.1%) as minor products (Supplementary Fig. S24). With GPP as a substrate, ST01 did not produce any detectable monoterpenes.

Caryophyllenyl alcohol synthase (ST07 and ST07A)

ST07 and ST07A are 97.4% and 96.6% identical in nucleotide sequence and amino acid sequence, respectively, and both are insoluble when expressed in E. coli. Despite their high homology, ST07A expressed in yeast is highly soluble, whereas ST07 expressed in yeast is insoluble. EPY224 expressing ST07A mainly produced caryophyllenyl alcohol (99.8%) and trace amounts of [E]-caryophyllene (0.2%) (Supplementary Fig. S25). Yeast strain EPY224 is designed to produce FPP; therefore, the yeast itself produces large amounts of [E]-nerolidol and farnesol. When an exogenous terpene synthase is expressed, it competes for the FPP pool, and yeast-derived [E]-nerolidol and farnesol production is limited when the exogenous terpene synthase utilizes most of the FPP. This explains why [E]-nerolidol and farnesol do not appear in the sample expressing ST07A in Supplementary Fig. S25.

After closer comparison with extracts from yeast expressing ST07A, we found very small peaks in chromatograms of extracts from yeast expressing ST07 that represent the same terpenes produced by ST07A. Like ST07A, ST07 also produced


caryophyllenyl alcohol (98.7%) and [E]-caryophyllene (1.3%) (Supplementary Fig. S26). There is a small bump at the same retention time as the caryophyllenyl alcohol peak in the EPY224 control; however, this small peak does not represent caryophyllenyl alcohol. The major ion in the peak from the EPY224 control is m/z 109, whereas the major ion in the caryophyllenyl alcohol peak is m/z 111. We also checked the background peaks to see whether the ion at m/z 109 is a contaminant from the background and found that an ion with MW m/z is not a dominant ion in the background peak in the EPY224 control, which suggests that EPY224 itself does not produce caryophyllenyl alcohol and that the caryophyllenyl alcohol in EPY224 expressing ST07 came from ST07. Because ST07A also produces [E]-caryophyllene, we checked whether ST07 does as well. There is a very small peak in the ST07 sample at the same retention time as the [E]-caryophyllene peak in the ST07A sample, and its mass spectrum and the single ion chromatogram at m/z 91, the most abundant ion in [E]-caryophyllene, suggest that this very small peak in ST07 is also [E]-caryophyllene.

ST07 and ST07A were cloned from white ginger root cDNA, and we could not find a caryophyllenyl alcohol peak in ginger or turmeric. Therefore, it is possible that ST07 and ST07A produce [E]-caryophyllene, which is observed in all tissues of both ginger and turmeric, and that yeast oxidizes [E]-caryophyllene to caryophyllenyl alcohol. Feeding [E]-caryophyllene to yeast to see whether yeast can convert it to caryophyllenyl alcohol may clarify the function of ST07 and ST07A.
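The comparison of major ions used above to distinguish the control bump from caryophyllenyl alcohol can be sketched in Python as follows. Only the m/z values 109 and 111 come from the text; the helper name, the dictionary representation of a spectrum and all intensities are hypothetical.

```python
def base_peak(spectrum):
    """Return the m/z value with the highest intensity.

    spectrum maps m/z (int) to relative intensity (float).
    """
    if not spectrum:
        raise ValueError("empty spectrum")
    return max(spectrum, key=spectrum.get)


# Hypothetical relative intensities for two co-eluting peaks.
control_peak = {91: 35.0, 109: 100.0, 111: 20.0}
sample_peak = {91: 40.0, 109: 30.0, 111: 100.0}

# Two peaks at the same retention time with different base peaks
# are unlikely to be the same compound.
same_compound_plausible = base_peak(control_peak) == base_peak(sample_peak)
```

A matching base peak alone would not establish identity; in practice the full spectrum and the retention time are compared, as was done in the text for [E]-caryophyllene.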


Evolution of terpene synthases in ginger and turmeric investigated through protein structural modeling

We identified 25 mono- and 16 sesquiterpene synthases from ginger and turmeric and determined the function of 13 mono- and 11 sesquiterpene synthases. Some are very similar and are considered to be paralogs; some of these have conserved functions whereas others have diverged. For example, MT06 and MT06A are 98.0% identical at the amino acid sequence level and differ only in their transit peptide sequences. Likewise, ST00A and ST00B, with 98.4% homology at the amino acid sequence level, synthesized the same products. However, although MT06B is 92.9% identical to MT06 and 90.8% identical to MT06A at the amino acid sequence level, the products of these enzymes are quite different: MT06 and MT06A produce sesquiterpenes whereas MT06B cannot. When the MT06 and MT06B protein structures were modeled with (4S)-limonene synthase from Mentha spicata [31] as the template and compared, their backbone structures were very similar but the side chains near the substrate differed (Fig. 6). F327 of MT06B appears to prevent FPP binding, whereas MT06 has a leucine at that position and can accommodate FPP, leading to production of sesquiterpenes. Although MT06B has one extra amino acid in the loop (arrow in Fig. 6B) compared with MT06 and (4S)-limonene synthase, the tyrosines after the loop (Y576 in MT06, Y577 in MT06B) align very well in the modeled structures. These results suggest that replacement of F327 with Leu allows for expanded substrate versatility and product range.

ST02B and ST02C are 95.7% identical at the amino acid sequence level. However, ST02C produces β-elemene (germacrene A) as a major product and ST02B


produces α-elemol as a major product and β-elemene (germacrene A) as a minor product. When their protein structures were modeled based on (+)-δ-cadinene synthase from Gossypium arboreum [32], we could not see noticeable side chain differences around the substrate between the ST02B and ST02C modeled structures (Fig. 7). According to the modeled structures, the main differences lie in the N-terminal loop: the N-terminal end loop of ST02B is closer to the C-terminal end loop than that of ST02C, and this difference in contact between the N- and C-terminal end loops may affect protein breathing and give water molecules easier access to quench the reaction.

ST02A4 is also very similar to ST02B and ST02C, with 95.5% and 96.6% homology at the amino acid sequence level, respectively. ST02A4 produces (-)-neointermedeol as a major product and β-elemene (germacrene A) as a minor product. When the ST02A4 structure was modeled against the (+)-δ-cadinene synthase structure and compared with the modeled ST02B and ST02C structures, there was no difference in side chains near the active site. Again, ST02A4 has a different loop structure at the N-terminal end, and the predicted structure is more similar to ST02B than to ST02C because ST02C is predicted to have an extended α-helix and a shorter loop (Fig. 8). The loop around the tryptophan at the N-terminal end of the RRX8W motif is in contact with the C-terminal end of the helix (Fig. 8, A), and this interaction may stabilize the C-terminal end structure near the active site in all three modeled structures. Although ST02A4 and ST02B synthesize different main products, both products are formed by quenching with a water molecule, which can be explained by their similar loop structures at the N-terminal ends. ST02C and the template, (+)-δ-cadinene synthase, have similar loop structures at the N-terminal ends, with prolonged α-helices and


shorter loops, and both produce terpenes that are not quenched by water molecules.

ST02A4, ST02B and ST02C are considered to have diverged recently, and their modeled structures are very similar, with the same side chains near the active site. They also produce similar products: ST02A4 produces (-)-neointermedeol, ST02B produces α-elemol and ST02C produces β-elemene (germacrene A) as major products. These compounds are synthesized via the same mechanistic pathway through (+)-germacrene A (Fig. 9).

Suggestion of α- and β-turmerones instead of tumerone and curlone

During our efforts to identify ginger and turmeric compounds eluted in GC/MS analyses using the NIST GC/MS library, the best hits for two major sesquiterpene peaks in turmeric were called “tumerone” and “curlone”. A search for tumerone (CAS# 180315-67-7) in SciFinder Scholar gave only two references: one was published in 1934 and the other in 1996 [22]; the latter relied on analysis of fragment masses in GC/MS and did not use NMR. In this study we investigated ginger and turmeric together and observed that both produce α-zingiberene and β-sesquiphellandrene, but only turmeric synthesizes two oxygenated forms of these products, which have their best hits with tumerone and curlone in the NIST library. Ginger and turmeric produce more α-zingiberene than β-sesquiphellandrene, and one oxygenated compound is also more abundant than the other (Fig. 1, Fig. 10). Additionally, we could see a different product pattern for α-zingiberene/β-sesquiphellandrene and the two oxygenated compounds. In


Fig. 1 panel B, the amount of α-zingiberene and β-sesquiphellandrene is 2.7 times that of the two oxygenated compounds, whereas in Fig. 1 panel C the amount of the two oxygenated compounds is 14.6 times that of α-zingiberene/β-sesquiphellandrene. When we look only at the α-zingiberene to β-sesquiphellandrene ratio, α-zingiberene is 2.3 times as abundant as β-sesquiphellandrene in Fig. 1 panel B and 1.7 times as abundant in Fig. 1 panel C. However, when we treat α-zingiberene (green) and the oxygenated compound with the earlier retention time (orange) as a first group, and β-sesquiphellandrene (blue) and the oxygenated compound with the later retention time (red) as a second group, the ratios of the first group to the second group are 2.7 and 2.9 in Fig. 1 panels B and C, respectively, which is also very similar to the ratio of α-zingiberene to β-sesquiphellandrene in Fig. 1 panel A, 3.2. We therefore conclude that the two oxygenated compounds are possibly derived from α-zingiberene and β-sesquiphellandrene and are most likely α-turmerone and β-turmerone.

We need to be careful in proposing that the curlone match in our turmeric samples is indeed β-turmerone, because the stereochemistry of curlone (CAS# 87440-60-6) was determined by NMR [33]. It is possible that the precursor of curlone, β-sesquiphellandrene, is incorrectly annotated. A non-oxygenated compound related in structure to curlone, the putative substrate of curlone biosynthesis, would be cyclohexene, 3-[(1S)-1,5-dimethyl-4-hexen-1-yl]-6-methylene-, (3S)- (CAS# 251318-35-1), which has been reported only twice in the literature, once in an investigation of the stereochemistry of (+)-β-sesquiphellandrene that mentions this compound as a synthetic enantiomer of the natural product, (-)-β-sesquiphellandrene [34].
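The grouped-ratio argument above (each hydrocarbon paired with its putative oxygenated derivative) can be made concrete with a short Python sketch. The dictionary keys, the function name and the peak areas are hypothetical; the areas are arbitrary round numbers and are not read from Fig. 1.

```python
def grouped_ratios(areas):
    """Compute the three peak-area ratios compared in the text.

    areas maps four hypothetical keys to integrated peak areas:
    'zingiberene', 'sesquiphellandrene', 'oxy_early' (oxygenated
    compound eluting earlier) and 'oxy_late' (eluting later).
    """
    z = areas["zingiberene"]
    s = areas["sesquiphellandrene"]
    early = areas["oxy_early"]
    late = areas["oxy_late"]
    if s <= 0 or (early + late) <= 0 or (s + late) <= 0:
        raise ValueError("denominator peak areas must be positive")
    return {
        "hydrocarbons_over_oxygenated": (z + s) / (early + late),
        "zingiberene_over_sesquiphellandrene": z / s,
        "group1_over_group2": (z + early) / (s + late),
    }


# Arbitrary illustrative areas: the oxygenated pair keeps the same 3:1
# proportion as its proposed precursors.
ratios = grouped_ratios(
    {"zingiberene": 300.0, "sesquiphellandrene": 100.0,
     "oxy_early": 150.0, "oxy_late": 50.0}
)
```

If each oxygenated compound is formed from its own precursor, the group 1 to group 2 ratio reflects the synthase product ratio regardless of how much of each hydrocarbon has been oxidized, whereas the hydrocarbon to oxygenated ratio can vary widely between samples. This is the pattern the text reports across panels A to C.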


We cloned α-zingiberene/β-sesquiphellandrene synthase, which produces α-zingiberene and β-sesquiphellandrene at a ratio similar to that seen in ginger and turmeric rhizome samples. Identification of this gene supports the proposal that the blue peak in the metabolite profiling experiments (Fig. 1) is β-sesquiphellandrene, leading to the conclusion that the red peak should be β-turmerone instead of curlone.

α-Zingiberene/β-sesquiphellandrene oxidase

α-Turmerone and β-turmerone are thought to be produced by the oxidation of α-zingiberene and β-sesquiphellandrene, respectively. Using EST data, microarray data and metabolite data together, four P450 monooxygenase candidates were selected from 170 ginger and turmeric P450 monooxygenases and named P1, P2, P3 and P4. P2 and P3 were partial clones in the EST database, missing their 5’ ends, and genome walking revealed the complete sequences. P1, P2 and P4 are very similar to each other and belong to the clade (CYP71D) that contains limonene hydroxylase. P3 differs from the other three P450s and belongs to the clade (CYP71AV1) that contains amorphadiene oxidase. Amorphadiene oxidase catalyzes a three-step oxidation (primary alcohol → aldehyde → acid) at the end of a hydrocarbon chain [24], and limonene hydroxylase hydroxylates a six-carbon ring [35]. Hydroxylation of α-zingiberene and β-sesquiphellandrene (Fig. 11) more closely resembles the reaction catalyzed by limonene hydroxylase than that catalyzed by amorphadiene oxidase; therefore P1, P2 and P4 were tested for activity. Because all three are very similar, P1 and P4 were cloned from turmeric rhizome cDNA and five cloned


genes, P1A, P1A2, P4, P4A and P4A2, were expressed in yeast. Co-expression of these P450s with ST00A in EPY224 that was not expressing a P450 reductase gene did not produce hydroxylated forms of α-zingiberene or β-sesquiphellandrene. Co-expression of these P450s with ST00A and sweet basil P450 reductase (Ob_CPR) yielded some products that look like oxidized forms of α-zingiberene and β-sesquiphellandrene, according to comparison of the mass spectra of these peaks with those of α-zingiberene, β-bisabolene and β-sesquiphellandrene (Fig. 12). However, these oxidized products are not registered in the NIST GC/MS library. Peaks 2 and 3 have masses of 220, which is the mass of hydroxylated forms of α-zingiberene, β-bisabolene or β-sesquiphellandrene, and peak 1 has a mass of 222. Although mass 222 is not correct for a hydroxylated form of α-zingiberene, α-bisabolol (MW 220) also shows m/z 222 in its mass spectrum. Other ions, such as m/z 93 and 119, also support the proposal that these peaks are likely to be hydroxylated forms of α-zingiberene, β-bisabolene or β-sesquiphellandrene. Thus, the biosynthesis of α-turmerone and β-turmerone likely begins with the action of α-zingiberene/β-sesquiphellandrene synthase, is followed by the action of α-zingiberene/β-sesquiphellandrene hydroxylase (P1, P2 or P4) and is completed by the action of a dehydrogenase that converts the secondary alcohols to the ketone forms.

References

1. Picaud, S., et al., Cloning, expression, purification and characterization of recombinant (+)-germacrene D synthase from Zingiber officinale. Arch Biochem Biophys, 2006. 452(1): p. 17-28.
2. Yu, F., et al., Molecular cloning and functional characterization of alpha-humulene synthase, a possible key enzyme of zerumbone biosynthesis in shampoo ginger (Zingiber zerumbet Smith). Planta, 2008. 227(6): p. 1291-9.
3. Kundu, J.K., H.K. Na, and Y.J. Surh, Ginger-derived phenolic substances with cancer preventive and therapeutic potential. Forum Nutr, 2009. 61: p. 182-92.
4. Lopez, P., et al., Solid- and vapor-phase antimicrobial activities of six essential oils: susceptibility of selected foodborne bacterial and fungal strains. J Agric Food Chem, 2005. 53(17): p. 6939-46.
5. Ficker, C.E., et al., Inhibition of human pathogenic fungi by ethnobotanically selected plant extracts. Mycoses, 2003. 46(1-2): p. 29-37.
6. Habib, S.H., et al., Ginger extract (Zingiber officinale) has anti-cancer and anti-inflammatory effects on ethionine-induced hepatoma rats. Clinics (Sao Paulo), 2008. 63(6): p. 807-13.
7. Nigam, N., et al., Induction of apoptosis by [6]-gingerol associated with the modulation of p53 and involvement of mitochondrial signaling pathway in B[a]P-induced mouse skin tumorigenesis. Cancer Chemother Pharmacol, 2009. 24: p. 24.
8. Jeong, C.H., et al., [6]-Gingerol suppresses colon cancer growth by targeting leukotriene A4 hydrolase. Cancer Res, 2009. 69(13): p. 5584-91.
9. Shukla, Y. and M. Singh, Cancer preventive properties of ginger: a brief review. Food Chem Toxicol, 2007. 45(5): p. 683-90.
10. Wang, G., et al., Antitumor effect of beta-elemene in non-small-cell lung cancer cells is mediated via induction of cell cycle arrest and apoptotic cell death. Cell Mol Life Sci, 2005. 62(7-8): p. 881-93.
11. Jianjun, Q., et al., Treatment of chylothorax with elemene. Thorac Cardiovasc Surg, 2008. 56(2): p. 103-5.
12. Yamahara, J., et al., The anti-ulcer effect in rats of ginger constituents. J Ethnopharmacol, 1988. 23(2-3): p. 299-304.
13. Yamahara, J., et al., [Stomachic principles in ginger. II. Pungent and anti-ulcer effects of low polar constituents isolated from ginger, the dried rhizoma of Zingiber officinale Roscoe cultivated in Taiwan. The absolute stereostructure of a new diarylheptanoid]. Yakugaku Zasshi, 1992. 112(9): p. 645-55.
14. Jurenka, J.S., Anti-inflammatory properties of curcumin, a major constituent of Curcuma longa: a review of preclinical and clinical research. Altern Med Rev, 2009. 14(2): p. 141-153.
15. Ravindran, J., S. Prasad, and B.B. Aggarwal, Curcumin and Cancer Cells: How Many Ways Can Curry Kill Tumor Cells Selectively? Aaps J, 2009. 10: p. 10.
16. Jayaprakasha, G.K., et al., Evaluation of antioxidant activities and antimutagenicity of turmeric oil: a byproduct from curcumin production. Z Naturforsch [C], 2002. 57(9-10): p. 828-35.
17. Negi, P.S., et al., Antibacterial activity of turmeric oil: a byproduct from curcumin manufacture. J Agric Food Chem, 1999. 47(10): p. 4297-300.
18. Nishiyama, T., et al., Curcuminoids and sesquiterpenoids in turmeric (Curcuma longa L.) suppress an increase in blood glucose level in type 2 diabetic KK-Ay mice. J Agric Food Chem, 2005. 53(4): p. 959-63.
19. Lee, Y., Activation of apoptotic protein in U937 cells by a component of turmeric oil. BMB Rep, 2009. 42(2): p. 96-100.
20. Aratanechemuge, Y., et al., Selective induction of apoptosis by ar-turmerone isolated from turmeric (Curcuma longa L) in two human leukemia cell lines, but not in human stomach cancer cell line. Int J Mol Med, 2002. 9(5): p. 481-4.
21. Lee, H.S., Antiplatelet property of Curcuma longa L. rhizome-derived ar-turmerone. Bioresource Technology, 2006. 97(12): p. 1372-1376.
22. Hiserodt, R., et al., Characterization of powdered turmeric by liquid chromatography mass spectrometry and gas chromatography mass spectrometry. Journal of Chromatography A, 1996. 740(1): p. 51-63.
23. Martin, V.J., et al., Engineering a mevalonate pathway in Escherichia coli for production of terpenoids. Nat Biotechnol, 2003. 21(7): p. 796-802.
24. Ro, D.K., et al., Production of the antimalarial drug precursor artemisinic acid in engineered yeast. Nature, 2006. 440(7086): p. 940-3.
25. Jiang, H., et al., Metabolic profiling and phylogenetic analysis of medicinal Zingiber species: Tools for authentication of ginger (Zingiber officinale Rosc). Phytochemistry, 2006. 67(15): p. 1673-85.
26. Adams, R.P., Identification of Essential Oil Components by Gas Chromatography/Mass Spectroscopy. 1995, Illinois, USA: Allured Publishing Co.
27. Arnold, K., et al., The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics, 2006. 22(2): p. 195-201.
28. de Kraker, J.W., et al., (+)-Germacrene A biosynthesis. The committed step in the biosynthesis of bitter sesquiterpene lactones in chicory. Plant Physiol, 1998. 117(4): p. 1381-92.
29. Gopfert, J.C., et al., Identification, functional characterization and developmental regulation of sesquiterpene synthases from sunflower capitate glandular trichomes. BMC Plant Biol, 2009. 9(86): p. 86.
30. Wang, G., et al., Terpene biosynthesis in glandular trichomes of hop. Plant Physiol, 2008. 148(3): p. 1254-66.
31. Hyatt, D.C., et al., Structure of limonene synthase, a simple model for terpenoid cyclase catalysis. Proc Natl Acad Sci U S A, 2007. 104(13): p. 5360-5.
32. Gennadios, H.A., et al., Crystal structure of (+)-delta-cadinene synthase from Gossypium arboreum and evolutionary divergence of metal binding motifs for catalysis. Biochemistry, 2009. 48(26): p. 6175-83.
33. Kiso, Y., et al., Sesquiterpenoids. 59. Stereostructure of curlone, a sesquiterpenoid of Curcuma longa rhizomes. Phytochemistry, 1983. 22(2): p. 596-597.
34. Kreiser, W. and F. Korner, Stereospecific synthesis of (+)-beta-sesquiphellandrene. Helvetica Chimica Acta, 1999. 82(9): p. 1427-1433.
35. Ponnamperuma, K. and R. Croteau, Purification and characterization of an NADPH-cytochrome P450 (cytochrome c) reductase from spearmint (Mentha spicata) glandular trichomes. Arch Biochem Biophys, 1996. 329(1): p. 9-16.


Contig MT00 MT01 MT02 MT03 MT04 MT05 MT06 MT07 MT08 MT09 MT10 (= MT03) MT11 MT12 MT13 (= MT08) MT14 (= MT02) MT15 MT16 MT17 MT18 (= MT09) MT19 ST00 ST01 ST02 ST03 ST04 ST05 ST06 ST07 ST08 ST09

RACE 5 5 5 5 3 5 5 3

Turmeric Rh L 2

White ginger Rh R L 3

2 3 5 4 1 2

2 2

5 3 5 5 3 5 5 5 5 5

2 1

2 1

5 5 3

Yellow ginger Rh R L

1 2

2 2 4 2 1 2 2 5

5 5 5 5 5 3 5 5 5 5

2 2 2 2 4 2 2 2 1

Table 1 Mono- and sesquiterpene synthases identified in the ginger and turmeric EST database created from cDNA libraries from different tissues: Rh = rhizome, R = root, L = leaf. Numbers are EST counts per cDNA library in the database for each contig. After close investigation, four contigs (MT10, MT13, MT14 and MT18) were considered to belong to other contigs, as indicated. In the RACE column, 5 or 3 means that 5’ or 3’ RACE was required to obtain full-length clones, and bold, underlined entries mean that RACE was finished. Each contig was cloned for further characterization from the grey-boxed sample. MT00 and MT11 were subcloned directly from the original cDNA clones without requiring RT-PCR.

MT00

MT03

MT04

MT06/MT06A

tricyclene

MT06B

MT07

MT08

0.7

α-thujene (3-thujene)

MT09A2

MT11

1.5

MT12A-M2

MT17A2

ST00A/ST00B

ST02B

ST02C

ST03

1.2

3.4

α-pinene

60.1

40.4

camphene

11.7

45.4

sabinene (4(10)-thujene)

2.7

β-pinene

20.4

4.3

60.1

28.4

33.7

55.8

29.0

30.7

6.2 1.0

0.6

0.7

5.9

myrcene

25.0

30.4

27.3

15.5

17.9

2.7

(Z)-β-ocimene

14.3

13.5

17.6

(E)-β-ocimene

9.8

9.3

3.5

α-phellandrene

92.2

2.7

3-carene

1.2

α-terpinene

1.0

limonene

8.2 5.6

β-phellandrene

4.3 6.6

1.5

3.8

1,8-cineole (eucalyptol)

7.7

2.5

cis-sabinene hydrate p-mentha-1,4(8)-diene (terpinolene)

0.5 100.0

7.9

88.3

49.2

0.9

γ-terpinene

linalool

5.0

78.3

9.0 6.2

0.4

5.2

1.5

0.4

0.1 0.4

0.1

89.9

1.6

1.9

24.8

100.0

7.7

9.9

5.2

25.1

15.6

36.8

cis-p-menth-2en-1-ol

3.8

trans-pinene hydrate (trans-pinan-2-ol)

0.3

0.3

3.3

6.1

0.3

0.4

β-citronellal

0.3

0.9

borneol (endo-borneol) p-menth-1-en-4-ol (terpinen-4-ol)

8.0

p-menth-1-en-8-ol (α-terpineol)

6.2

3.3 17.4

0.4

2.7

3.3

Table 2 Monoterpenes produced by ginger and turmeric terpene synthases. The number represents percentage from total products in each terpene synthase. Grey boxes indicate the most abundant compounds produced by specific TPS proteins.

3.3


MT00

δ-elemene unknown ((+)-cyclosativene-like) (+)-cyclosativene α-copaene unknown (β-elemene-like) β-elemene unknown (7-epi-sesquithujene-like) cis-α-bergamotene [E]-caryophyllene (β-caryophyllene) trans-α-bergamotene γ-elemene

MT02A

MT06/MT06A

ST00A/ST00B

ST01

ST02A4

ST02B

ST02C

ST03

ST05

ST05A

ST07A

ST07

0.2

1.3

2.9 7.8

2.1 3.2

11.2

1.7

18.3

49.3

1.5

1

2.4

14.2

10.5

83.4

88.4

2.1 9.7 8.1

0.8

12.6

0.3

7.6 0.5 0.1

20.6

0.2 1.1

6.1

2.7

3.4 100 11.8 11.3 4.7

unknown (β-chamigrene-like) ar-curcumene γ-curcumene

7.4 0.9 4.4

unknown (zonarene-like) unknown (β-sesquiphellandrene-like) germacrene D unknown (allo-aromadendrene-like) β-selinene (eudesma-4(14),11-diene) α-zingiberene ((-)-Zingiberene) γ-amorphene

β-curcumene

MT17A2

1.2

unknown (guaia-1(5),7(11)-diene-like) [Z]-β-farnesene α-humulene (α-caryophyllene) allo-aromadendrene unknown (eremophila-1(10),11-diene-like) γ-muurolene

α-muurolene β-bisabolene [Z]-α-bisabolene γ-cadinene

MT08

1.6

0.3

1.5 6.1

1.6

12.4

4.4

51.9 67 65.4 1.6 1.1

13.6

1.7

22.7

3.1

9.5

6.2 8.7

14.5

7-epi-α-selinene

14.2

β-Sesquiphellandrene δ-cadinene (cadina-1(10),4-diene) [E]-γ-bisabolene unknown (cis-α-bisabolene-like) 1,5,9-trimethyl-1,5,9-cyclododecatriene unknown (cis-sesquisabinene hydrate-like) α-elemol germacrene B [E]-nerolidol Caryophyllenyl alcohol germacrene D-4-ol unknown (β-eudesmol-like) unknown (trans-sesquisabinene hydrate-like1) unknown (trans-sesquisabinene hydrate-like2) unknown (selina-6-en-4-ol-like) unknown (cubenol-like) unknown (spathulenol-like) unknown (α-eudesmol-like) γ-eudesmol unknown (trans-sesquisabinene hydrate-like3) α-acorenol cadin-4-en-7-ol epi-α-muurolol (τ-muurolol) α-muurolol (δ-cadinol) α-cadinol (-)-neointermedeol (+)-intermedeol epi-β-bisabolol epi-α-bisabolol α-bisabolol

1.9

3.7

22.7 1.2

0.6

2.8

4

0.4 2.1 0.8

4.5 44.3 4.9 100

41

2.9

5.8 99.8

98.7

9.6 1.2 0.6 2.4 6.5 0.4 0.3 0.4

0.8

0.4 0.2 6 2.5 7.5 48.7 2.1

0.8

22.3 58.3 38.7

1.6

12.6 7.3

Table 3 Sesquiterpenes produced by ginger and turmeric terpene synthases. Each number is the percentage of the total products formed by the given terpene synthase. Grey boxes indicate the most abundant compounds produced by specific TPS proteins.
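The percentages in Tables 2 and 3 express each product as a share of the total products detected for one enzyme. As a minimal sketch of that calculation, assuming integrated GC/MS peak areas are available for each product, the following Python computes the percentages and picks the most abundant product. The function names and the peak-area values are hypothetical illustrations, not data or code from this work.

```python
def product_percentages(peak_areas):
    """Convert peak areas (arbitrary units) into percent of total products."""
    total = sum(peak_areas.values())
    if total <= 0:
        raise ValueError("total peak area must be positive")
    # Round to one decimal place, matching the precision used in the tables.
    return {name: round(100.0 * area / total, 1)
            for name, area in peak_areas.items()}


def most_abundant(percentages):
    """Return the product name with the largest percentage."""
    return max(percentages, key=percentages.get)


# Hypothetical peak areas for a single enzyme assay (not measured data).
example_areas = {
    "product_A": 670.0,
    "product_B": 227.0,
    "product_C": 103.0,
}

percentages = product_percentages(example_areas)
print(percentages)
print(most_abundant(percentages))
```

Because each value is rounded independently, the percentages for one enzyme may not sum to exactly 100.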


Gene name MT00

vectors pH9GW

pCRT7CT

MT01 MT02A

pH9GW pEXP5CT

MT03

pET101D pEXP5CT pEXP5CT pEXP5CT pEXP5CT pEXP5CT pEXP5CT pEXP5CT pH9GW

MT04 MT05 MT06 MT06A MT06B MT07 MT08

pCRT7CT

MT09A MT09A2 MT09B MT11

pEXP5CT pEXP5CT pEXP5CT pEXP5CT pH9GW

pCRT7CT

MT12A-M2 MT16

pEXP5CT pEXP5CT

Expression in E. coli cells BL21 (DE3) pLysS Rosetta (DE3) Rosetta2 (DE3) pLysS* BL21 (DE3) pLysS BL21 CodonPlus (DE3) RIL BL21 CodonPlus (DE3) RP BL21 Star (DE3) pMevT pMBI RIL* BL21 CodonPlus (DE3) RILP BL21 CodonPlus (DE3) RILP Rosetta2 (DE3) pLysS BL21 CodonPlus (DE3) RIL* BL21 CodonPlus (DE3) RILP BL21 CodonPlus (DE3) RILP* BL21 CodonPlus (DE3) RILP BL21 CodonPlus (DE3) RILP* BL21 CodonPlus (DE3) RILP* BL21 CodonPlus (DE3) RILP* BL21 CodonPlus (DE3) RILP* BL21 (DE3) pLysS BL21 CodonPlus (DE3) RIL BL21 Star (DE3) pMevT pMBI RIL* Rosetta (DE3) Rosetta2 (DE3) pLysS BL21 (DE3) pLysS BL21 CodonPlus (DE3) RIL BL21 CodonPlus (DE3) RP BL21 Star (DE3) BL21 Star (DE3) RIL BL21 Star (DE3) pMevT pMBI RIL Rosetta2 (DE3) pLysS BL21-AI BL21-AI RIL ArcticExpress (DE3) RIL BL21 CodonPlus (DE3) RILP BL21 CodonPlus (DE3) RILP BL21 CodonPlus (DE3) RILP* BL21 CodonPlus (DE3) RILP BL21 (DE3) pLysS BL21 CodonPlus (DE3) RIL BL21 CodonPlus (DE3) RP Rosetta (DE3) Rosetta2 (DE3) pLysS BL21 (DE3) pLysS BL21 CodonPlus (DE3) RIL* BL21 CodonPlus (DE3) RP BL21 Star (DE3) pMevT pMBI RIL* Rosetta (DE3) BL21 CodonPlus (DE3) RILP* BL21 CodonPlus (DE3) RILP

Solubility +++ ++++ +++ +++ ++++ +++ ++++ +++++ +++++ +++++ ++++ +++ ++ +++++ + + + + + ++ ++ ++ + ++++ -

Expression in yeast vectors cells Solubility

pESC-URA

EPY219 EPY224*

n/a +++++

pESC-URA

EPY219

-

MT17A2 MT17C MT17D MT19

pEXP5CT pEXP5CT pEXP5CT pEXP5CT

ST00A

pH9GW

ST00B

pH9GW

ST01 ST02A

pET101D pEXP5CT pEXP5CT

ST02A2 ST02A3 ST02A4

pEXP5CT pEXP5CT pEXP5CT

ST02B ST02C ST02C2

ST05 ST05A ST07

pET101D pET101D pET101D pEXP5CT pET101D pEXP5CT pEXP5CT pEXP5CT pEXP5CT

ST07A

pEXP5CT

ST03

Rosetta2 (DE3) pLysS BL21 CodonPlus (DE3) RILP* BL21 CodonPlus (DE3) RILP BL21 CodonPlus (DE3) RILP BL21 CodonPlus (DE3) RILP Rosetta2 (DE3) pLysS BL21 CodonPlus (DE3) RILP BL21-AI RIL* Rosetta2 (DE3) pLysS BL21 CodonPlus (DE3) RILP BL21-AI RIL* Rosetta2 (DE3) pLysS BL21 CodonPlus (DE3) RIL* BL21 CodonPlus (DE3) RILP* BL21 CodonPlus (DE3) RILP Rosetta2 (DE3) pLysS ArcticExpress (DE3) RIL BL21 CodonPlus (DE3) RILP BL21 CodonPlus (DE3) RILP BL21 CodonPlus (DE3) RILP

+++++ +++ +++++ +++++ +++ +++ +++ +++ +++ +++ ++ + -

BL21 CodonPlus (DE3) RIL* BL21 CodonPlus (DE3) RIL* BL21 CodonPlus (DE3) RIL BL21 CodonPlus (DE3) RILP BL21 CodonPlus (DE3) RIL* BL21 CodonPlus (DE3) RILP BL21 CodonPlus (DE3) RILP* BL21 CodonPlus (DE3) RILP* BL21 CodonPlus (DE3) RILP Rosetta2 (DE3) pLysS ArcticExpress (DE3) RIL BL21 CodonPlus (DE3) RILP

+++++ +++++ +++++ +++ +++++ + -

pESC-URA

EPY219*

+++++

pESC-URA

EPY219*

n/a

pESC-URA

EPY219

-

pESC-URA pESC-URA pESC-URA

EPY219 EPY219 EPY219 EPY224*

n/a

pESC-URA

EPY219 EPY224*

n/a

pESC-URA

EPY224*

+++++

Table 4 Vectors and host cells used to express ginger and turmeric TPS proteins in E. coli or yeast. Solubility of proteins expressed in E. coli was assessed on Coomassie-stained gels; solubility of proteins expressed in yeast was assessed by Western blot. Solubility reflects the ratio between the soluble and total fractions. “n/a” indicates that solubility was not checked. Vector and cell combinations marked with “*” were used for further analysis to identify the functions of specific proteins, as outlined in the text.

[Figure 1: GC/MS total ion chromatograms. Panel A: ginger rhizome; panels B and C: turmeric rhizome. x-axis: Time (min), 4 to 38; y-axis: scale 0 to 100. Peaks are numbered as in the caption below. Peaks labeled by name in the panels include 1,8-cineole, geranial, geraniol acetate, α-pinene, camphene, α-phellandrene, β-phellandrene, p-mentha-1,4(8)-diene (terpinolene), [E,E]-α-farnesene, α-zingiberene, β-sesquiphellandrene, α-turmerone and β-turmerone.]

Fig. 1 Major volatile compounds from ginger and turmeric rhizomes analyzed by GC/MS. Total ion chromatograms of 7-month-old ginger (A) and turmeric (B, C) rhizomes are shown. Both ginger and turmeric produce α-zingiberene (green) and β-sesquiphellandrene (blue), but only turmeric produces α-turmerone (orange) and β-turmerone (red). The total amounts of α-turmerone and β-turmerone vary between samples (B, C), but their ratio remains relatively constant. The inverse relationship between the amounts of α-zingiberene/β-sesquiphellandrene and α-turmerone/β-turmerone suggests that α-zingiberene and β-sesquiphellandrene are the precursors of α-turmerone and β-turmerone, respectively. Compounds identified include: 1, tricyclene; 2, α-pinene; 3, camphene; 4, sabinene (4(10)-thujene); 5, β-pinene; 6, myrcene; 7, α-phellandrene; 8, 3-carene; 9, α-terpinene; 10, limonene; 11, β-phellandrene; 12, 1,8-cineole; 13, γ-terpinene; 14, p-mentha-1,4(8)-diene (terpinolene); 15, linalool; 16, borneol (endo-borneol); 17, p-menth-1-en-8-ol (α-terpineol); 18, neral (β-citral); 19, [E]-geraniol; 20, geranial (α-citral); 21, δ-elemene; 22, citronellyl acetate; 23, α-copaene; 24, geraniol acetate; 25, β-elemene; 26, unknown (7-epi-sesquithujene-like); 27, [E]-caryophyllene (β-caryophyllene); 28, trans-α-bergamotene; 29, [E]-β-farnesene; 30, allo-aromadendrene; 31, germacrene D; 32, ar-curcumene; 33, β-selinene (eudesma-4(14),11-diene); 34, α-zingiberene; 35, [E,E]-α-farnesene; 36, β-bisabolene; 37, β-sesquiphellandrene; 38, [E]-γ-bisabolene; 39, α-elemol; 40, unknown (cis-sesquisabinene hydrate-like); 41, germacrene B; 42, [E]-nerolidol; 43, unknown (trans-sesquisabinene hydrate-like1); 44, unknown (trans-sesquisabinene hydrate-like2); 45, α-turmerone; 46, epi-α-bisabolol/α-bisabolol; 47, β-turmerone; 48, α-oxobisabolene.


[Figure 2: phylogenetic tree; scale bar 0.05. Clade labels: sesquiterpene synthase, monoterpene synthase.]
Genes shown in the tree, with the major product(s) where labeled: MT00 (linalool/E-nerolidol); ST00; ST00A (α-zingiberene/β-sesquiphellandrene); ST00B (α-zingiberene/β-sesquiphellandrene); ST01 (β-selinene); ST02A; ST02A2; ST02A3; ST02A4 ((-)-neointermedeol); ST02C (β-elemene); ST02C2; ST02B (α-elemol); ST02B2-FS; ST03 (γ-amorphene); ST05 (α-humulene); ST05A (α-humulene); ST05B; ST07 (caryophyllenyl alcohol); ST07A (caryophyllenyl alcohol); MT02A (epi-α-bisabolol/α-bisabolol); MT17; MT17B; MT17B2; MT17D; MT17A; MT17A2 (linalool/cis-α-bisabolene/trans-α-bergamotene); MT17C; MT05; MT04 (α-pinene/β-pinene); MT11 (1,8-cineole); MT03 (α-phellandrene); MT07 (terpinolene); MT01; MT06 (sabinene/linalool/E-nerolidol/epi-β-bisabolol); MT06A (sabinene/linalool/E-nerolidol/epi-β-bisabolol); MT06B (camphene/α-pinene); MT09; MT09A2 (camphene/α-pinene); MT09B; MT12A-M2 (camphene/α-pinene); MT08 (β-phellandrene); MT16; MT19.
Tissue abbreviations: GW-Rh, white ginger rhizome; GW-R, white ginger root; GW-L, white ginger leaf; GY-Rh, yellow ginger rhizome; GY-R, yellow ginger root; GY-L, yellow ginger leaf; T-Rh, turmeric rhizome; T-L, turmeric leaf.

Fig. 2 Phylogenetic tree of cloned full-length ginger and turmeric mono- and sesquiterpene synthases. There are three clades (one sesquiterpene synthase clade and two monoterpene synthase clades), and the linalool/nerolidol synthase (MT00) falls between the monoterpene and sesquiterpene clades. The major product(s) of each gene are shown next to the gene name, together with the tissue used for cloning.

MT16 MT19 MT08 MT06 MT06A MT09 MT09A2 MT09B MT12A-M2 MT06B MT01 MT03 MT07 MT17B MT17B2 MT17 MT17D MT17A MT17A2 MT17C MT02A MT05 MT04 MT11 ST00 ST00A ST00B ST02A ST02A2 ST02A3 ST02C ST02C2 ST02A4 ST02B ST02B2-FS ST03 ST05 ST05A ST05B ST07 ST07A ST01 MT00

. . * . : --- ---- ---- ---- --MATR- QAMSICAPMISVFPHRPMIVADVEQC- DRRSFGRTLQVRSCS- -ATSH-VAPLRRSGNYQPSLWTDERLQSLTNIST- -VQQEEKRER --- ---- ---- ---- -MSLYY- TSTTVSAPMISILPRRPMIVAAVEHR- GLQMFRRTLQARSCS- -ATSH-VAPLRRSGNYQPSLWTDERVQSLTNTST- -VQQEEKRER --- ---- ---- ---- --MATR- QAMSICAPMISVFPRRPMIVADVEQC- DRRSFGRTLQVRSCS- -ATSH-VAPLRRSGNYQPSIWTDERLQSLTSTST- -VQQEEKRER --- ---- ---- ---- --MATC- QAKSICAPMISVLPRRPMMVAVVKQYYGRQMFRRTLQVRSCS- -ATSH-VAALRRSGNYHPNIWTDEHVQSLSGTST- -VQQEEKRER --- ---- ---- ---- --MATC- QAMSICAPMISVFPHRPMIVADVEQCD-RRSFGRTLQVRSCS- -ATSH-VAPLRRSGNYHPNIWTDEHVQSLSGTST- -VQQEEKRER --- ---- ---- ---- --MATR- QAMSFYAPMISVSPRRPMIVAAVEPS- SRRTFRQIMQVRSCS- -GTSH-VAPLRRSANYHPNIWTDEHVQSLTSTSM- -VQQEENRER --- ---- ---- ---- --MATR- QAMSFYAPMISVSPRRPMIVAAVEPS- SRRTFRQIMQVRSCS- -GTSH-VAPLRRSANYHPNIWTDEHVQSLTSTSM- -VQQEENRER --- ---- ---- ---- --MATR- QAMSFYAPMISVSPRRPMIVAAVEPS- SRRTFRQIMQVRSCS- -GTSH-VAPLRRSANYHPNIWTDEHVQSLTSTSM- -VQQEENRER --- ---- ---- ---- --MSIYYTSTTVSAPMISVLPRRPMIVAAVEHR- GLQMFRRTLQVRSCS- -GTSH-VAPLRRSANYHPNIWTDEHVQSLSGTST- -AQQEENRER --- ---- ---- ---- --MATC- QAKSICAPMISVLPRRPMMVAVVKQYYGRQMFRRTLQVRSCS- -ATSH-VAALRRSGNYHPNIWTDEHVQSLSGTST- -VQQEEKRER --- ---- ---- ---- --MATG- QVMSICAPMISILPHRPMIVAAVEQR- GRRTFRRTLHVRSCS- -GTSH-VAPLRRSGNYHPSLWTDKSVQSLTSTST- -LQKEEERER --- ---- ---- ---- --- ---- -MSSICAPMISVLPRRPMIVAAFQQYCSRRTFRRTLQVRSCS- SATSH-VAPLRRLANYHPNIWTDEHVQSLTSTSK- -VQREEERER --- ---- ---- ---- --- ---MATMSICAPMISVLPRRPMIAAAFQQYCGRRTFRRTLQVRSCSTSATSHVVAPLRRSGNYPPNIWTEERVQSLTNTST- -VQQEEKRER MPPGNPVQCIDRRIIWKSEQNFGTHPVNMSLFLAPPSYFPLRDVRRSTAAKLQP-CLRLVQCTADRQSPE--- AARRSGNYQPNMWSNDYIQSLTVESPLKVEEKDQPKR MPPGNPVQCIDRRIIWKSEQNFGTHPVNMSLFLAPPSYFPLRDVRRSTAAKLQP-CLRLVQCTADRQSPE--- AARRSGNYQPNMWSNDYIQSLTVESPLKVEEKDQPKR MPPGNPVQCIDRRIIWKSEQNFGTHPVNMSLFLAPPSYFPLRDVRRSTAAKLQP-CLRLVQCTADRQSPE--- AARRSGNYQPNMWSNDYIQSLTVESPLKVEEKDQPKR MPPGNPVQCIDRRIIWKSKQNFGTHPVNMSLFLAPPSYFPLRDVRRSTAAKLQP-CLRLVQCTADRQSPE--- AARRSGNYQPNMWSNDYIQSLTVESPLKVEEKDQPKR 
MPPGNPVQCIDRRIIGKSKQNFGTHPVNMSLFLAPPSYFPLRDVRRSTAAMLQP-CLRLVQCTADRQSPE--- AARRSGNYQPNMWSNDYIQSLTVESPLKVEEKDQPKR MPPGNPVQCIDRRIIGKSKQNFGTHPVNMSLFLAPPSYFPLRDVRRSTAAMLQP-CLRLVQCTADRQSPE--- AARRSGNYQPNMWSNDYIQSLTVESPLKVEEKDQPKR MPPGNPVQCIDRRIIWKSEQNFGTHPVNMSLFLAPPSYFPLRDVRRSTAAKLQP-CLRLVQCTADRQSPE--- AARRSGNYQPNMWSNDYIQSLTVESPLKVEEKDQPKR --- ---- ---- ---- --- MSSFLPAPLNLPFDQNLS-- -ALR--RRSTPAVEQP-RLSPIRCSAAGQSS---- ASRRSANFQPNLWSNDYIQSLAVSSP- -VEEKDRTER --- ---- ---- ---- --- ---- ---- --MALFQPAASLAPLNLPFDRRPFFLRR-CATTVIRCAAEKTP---- ASRRSANYQPNLWGDDRIRSLTVEEE- ---DRTATAR --- ---- ---- ---- --- ---- MSISLSFAASATFGSRGDLGGFSRPAAAIKQWRCLPRIQCHAAEQSQSPSTTLRRSGNYQPSIWTPDRIQSLTLCHT- -ADEDDQAER --- ---- ---- ---- --- ---- MSVSHSFAASATFGG- --LGGFSRPAAAIKQWRCLPRIQCHAAEQSQSPSTTLRRSGNYQPSIWTHDRIQSLTFSHT- -ADEDDHAER --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- ---- --- ---- ---- MDLDETPSVEVSEDVVVDRQLAGFDPSFWGDYFITNKKSQSEAW--- --MKER --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- ---- --- ---- ---- MDLDETPSVEVSEDVVVDRQLAGFDPSFWGDYFITNKKSQSEAW--- --MNER --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- ---- --- ---- ---- MDLDETPSVEVSEGVVVDRQLAGFDPSFWGDYFIKNKKSQFEAW--- --MNER --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- ---- --- ---- ---- MEKQSTTPVKSNED-IVIRKTAKYHPSIWGDYFIHHTTSPALTE--- --VWIR --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- ---- --- ---- ---- MEKQSTTPVKSNED-IVIRKTAKYHPSIWGDYFIHHTTSPALTE--- --VWIR --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- ---- --- ---- ---- MEKQSTTPVKSNED-IVIRKTAKYHPSIWGDYFIHHTTSPALTE--- --VWIR --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- ---- --- ---- ---- MEKQSTTPVSSNED-IVIRKTSKYHPSIWGDYFIHRATSPDLTE--- --VSVR --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- ---- --- ---- ---- MEKQSTTPVSSNED-IVIRKTSKYHPSIWGDYFIHRATSPDLTE--- --VSVR --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- ---- --- ---- ---- MEKQSTTPVKSNED-IVIRKTAKYHPSIWGDYFIHHTTSPALTE--- --VWIR --- ---- ---- ---- --- ---- ---- ---- ---- 
---- ---- ---- --- ---- ---- MEKQSTTPVSSNED-IVIRKTSKYHPSIWGDYFIHHTTSPPLTE--- --VLIR --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- ---- --- ---- ---- MEKQSTTPVSSNED-IVIRKTSKYHPSIWGDYFIHHTTSPPLTE--- --VLIR --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- ---- --- ---- ---- MEKQSTTPVKSNED-IVIRKTAKYHPSIWGDYFIHHTTSPALTE--- --VWIR --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- ---- --- ---- ---- MERQSMALVGDKEE-- IIRKSAEYHPSVWGDYFIRNSSSALEKEPTHRILMKR --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- ---- --- ---- ---- MERQSMALVGDKEE-- IIRKSAEYHPSVWGDYFIRNSSSALEKEPTHRILMKR --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- ---- --- ---- ---- MERQSMALAGDKEE-- IIRKSAEYHPSVWGDYFIRNSSSALEKESTQRILMKR --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- ---- --- ---- ---- MERQSMVLVGDDKEEIIIRKSAEYHPTVWGDYFIRN-YSCLPIEEEKEYMIKR --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- ---- --- ---- ---- MERQSMVLVGDDKEEIIIRKSAEYHPTVWGDYFIRN-YSCLPLEEEKEYMIKR --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- ---- --- ---- ---- MEKQSLTFVGDEAK-- -VHKSSKYHPSVWGDYFIRNSLSHVETQR-- --MIKR --- ---- ---- ---- --- ---- ---- ---- ---- ---- ---- --MHSCMVLTQCSFGLRPKFVASLRNNS--- -TNPSSNTSPPLPDFQKQQKITKPYRYHDQESLCSRH

RR

MT16 MT19 MT08 MT06 MT06A MT09 MT09A2 MT09B MT12A-M2 MT06B MT01 MT03 MT07 MT17B MT17B2 MT17 MT17D MT17A MT17A2 MT17C MT02A MT05 MT04 MT11 ST00 ST00A ST00B ST02A ST02A2 ST02A3 ST02C ST02C2 ST02A4 ST02B ST02B2 -FS ST03 ST05 ST05A ST05B ST07 ST07A ST01 MT00

86 87 86 87 86 86 86 86 87 87 86 83 87 106 106 106 106 106 106 106 80 73 86 83 48 48 48 47 47 47 47 47 47 47 47 47 51 51 51 52 52 46 62

W

: : : *: : *** . * : . : :: * * ** . . : : : . : .*: .* IKVLKEQTRN- LIREKQRVAEQLQLI---- --GVAYHFKDEIADVLSRLHASLDHVSSQLKDDLHATALLFRLLRANGFSVSQDLLETFRDEKGNFEARCENQIRGLLSL INVLKEQTRN- LIREKQQVAEQLQLIDQLQQLGVAYHFKDEIADVLSRLHASLDHVSSQLKDDLHATALLFRLLRANGFSVSQDLFETFRDENGNFEARCENQIRGLLSL MKVLKEQTRN- LIREKQRVAEQLQLIDHLQQLGVAYHFKDEIADVLSCLHASLDHVSWQLKDDLHATALLFRLLRANGFSVSQDLLETFRDEKGNLEARCENQIRGLLSL INVLKEQTRN- LIREKQRVAEQLQLIDHLQQLGVAYHFKDEIADVLSRLHASLDHVSSELKDDLHATSLLFRLLRANGLSVSQDLFERFRDEKENFEARCENQIRGLLSL INVLKEQTRN- LIREKQRVAEQLQLIDHLQQLGVAYHFKDEIADVLSRLHASLDHVSSELKDDLHATSLLFRLLRANGLSVSQDLFERFRDEKENFEARCENQIRGLLSL INVLKEQTRN- LIREKQQVAEQLQLIDQLQQLGVAYHFKDEIADALSHLHASLDHVSSQLKDDLHATSLLFRLLRANGFSVSQDLFETFRDENGNFEARCENQIRGLLSL INVLKEQTRN- LIREKQQVAEQLQLIDQLQQLGVAYHFKDEIADALSHLHAFLDHVSSQLKDDLHATSLLFRLLRANGFSVSQDLFETFRDENGNFEARCENQIRGLLSL INVLKEQTRN- LIREKQQVAEQLQLIDQLQQLGVAYHFKDEIADALSHLHAFLDHVSSQLKDDLHATSLLFRLLRANGFSVSQDLFETFRDENGNFEARCENQIRGLLSL INVLKEQTRN- LIREKQQVAEQLQLIDQLQQLGVAYHFKDEIADVLSHLHASLGHVSSQLKDDLHATSLLFRLLRANGFSVSQGLFETFRDEKGNFEVRCENQIRGLLSL INVLKEQTRN- LIREKQRVAEQLQLIDHLQQLGVAYHFKDEIADVLSRLHASLDHVSSELKDDLHATSLLFRLLRANGLSVSQDLFERFRDEKENFEARCENQIRGLLSL INVLKEQTRK- LIREKQQVAEQLQLIDHLQQLGVAYHFKDEIANVLSRLHASLDHVSSELKDDLHATSLLFRLLRTNGFSVSQDLFERFRDEKGNFEARCENQIRGLLSL INVLKEQTRN- LIREKQQVAEQLQLIDHLQQLGVAYHFKDEIADVLSRLHASLDHVRWELKDDLHATALLFRLLRAKGFSVSQDLFETFRDEKGNFEARCENPIRGLLSL INVLKEQTRNLMIQEKQRVAEQLQLIDHLQQLGVAYHFKDEIADVLSRLHASLDHVRWELKDDLHATALLFRLLRAKGFSVSQDLFETFRDEKGNFEVRCENPIRGLLSL LVLLKERIAEVICEKK-EVEEQLRLIDHLQQLGVAYHFKDDIKDSLRNLHSSLEEISLTFKDNLHASALLFRLLRENGFSISEDIFEGFRDEKGHFRDGLKNHTEGMLSL LVLLKERIAEVICEKK-EVEEQLRLIDHLQQLGVAYHFKDDIKDSLRNLHSSLEEISLIFKDNLHASALLFRLLRENGFSISEDIFEGFRDEKGHFRDGLKNHTEGMLSL LVLLKERIAEVICEKK-EVEEQLRLIDHLQQLGVAYHFKDDIKDSLRNLHSSLEEISLTFKDNLHASALLFRLLRENGFSISEDIFEGFRDEKGHFRDGLKNHTEGMRSL LVLPKERIAEVICEKK-EVEEQLRLIDHLQQLGVAYHFKDDIKDSLRNLHSSLEEISLTFKDNLHASALLFRLLRENGFSISEDIFEGFRDEKGHFRDGLKNHTEGMLSL 
LVLLKERIAEVICEKK-EVEEQLRLIDHLQQLGVAYHFKDDIKDSLRNLHSSLEEISLIFKDNLHASALLFRLLRENGFSISEDIFEGFRDEKGHFRDGLKNHTEGMLSL LVLLKERIAEVICEKK-EVEEQLRLIDHLQQLGVAYHFKDDIKDSLRNLHSSLEEISLIFKDNLHASALLFRLLRENGFSISEDIFEGFRDEKGHFRDGLKNHTEGMLSL LVLLKERIAEVICEKK-EVEEPLRLIDHLQQLGVAYHFKDDIKDSLRNLHSSLEEISLIFKDNLHASALLFRLLRENGFSISEDIFEGFRDEKGHFRDGLKNHTEGMLSL IKFLEEKTRDVIGEKK-EVEEQLELIDHLQQLGVAYHFKDGIKDCLTRLHASIEDLSLKFNDNLHATALLFWLLRENGFSISEDMFKKFRDEKGEFRDGGENHTEGMLSL IKLLKEKVRKVIHDDK-EVEEQLQLIDQLQQLGVAYHFKDDIKDSLSSLHASLEDISLKFKDNLHASAVLFRLLRENGFSVSEDIFYKFRDEKGQLRDCLGKNTQGMLSL IKLLKNHTSKLMEEKKGQLEEQLQLIDHLQQLGVAYHFKDEIKDTLRGFHASFEDIGSQLEDNLHASALLFRLLRENGFSVSEDIFKKFKDEKGQFEDRLRSQTQGLLSL IKLLKNQTSKVMEEKKGQLEEQLQLIDHLQQLGVAYHFKDDIKDTLRGFHASFEDIGSQLKDNLHASALLFRLLRENGFSVSEDIFKKFKDEKGQFEDRLHSQAQGLLSL VEELKNEVRSMFQNV-TGVLQTMNLIDTIQLLGLDYHFMEEIDRALDHLKDVDMSK- ---- YGLYEVALHFRLLRQNGFNISSDVFKKYKNKEGKFMEELKDDAKGLLSL VEELKNEVRSMFQNV-TGVLQTMNLIDTIQLLGLDYHFMEEIDRALDHLKDVDMSK- ---- YGLYEVALHFRLLRQKGFNISSDVFKKYKNKEGKFMEELKDDAKGLLSL VEELKNEVRSMFQNV-TGVLQTMNLIDTIQLLGLDYHFMEEIDRALDHLKDVDMSK- ---- YGLYEVALHFRLLRQKGFNISSDVFKKYKNKEGKFMEELKDDAKGLLSL AEELKEQIKNFFRET-SDILQIMNLIDAIQLLGLDYHFEKEIDAALSLISKHDAKN- ---- YELYETSLHFRLLRQHDFYVPADVFNKFKDEEGNFMSTLNEDVKGLLSL AEELKEQIKNFFRET-SDILQIMNLIDAIQLLGLDYHFEKEIDAALSLISKHDAKN- ---- YELYETSLHFRLLRQHDFYVPADVFNKFKDEEGNFMSTLNEDVKGLLSL AEELKEQIKNFFRET-SDILQIMNLIDAIQLLGLDYHFEKEIDAALSLISKHDAKN- ---- YELYETSLHFRLLRQHDFYVPADVFNKFKDEEGNFMSTLNEDVKGLLSL VEELKKQIKNLFRET-SEILQIMNLIDAIQLLGLDYHFEKEIDGALSLISKHDAKN- ---- YELYETSLWFRLLRQHGFYVPPDVFNKFKDEEGNFMSTLNEDVKGLLSL VEELKKQIKNLFRET-SEILQIMNLIDAIQLLGLDYHFEKEIDGALSLISKHDAKN- ---- YELYETSLWFRLLRQHGFYVPPDVFNKFKDEEGNFMSTLNEDVKGLLSL AEELKEQIKNFFRET-SDILQIMNLIDAIQLLGLDYHFEKEIDAALSLISKHDAKN- ---- YELYETSLHFRLLRQHDFYVPADVFNKFKDEEGNFMSTLNEDVKGLLSL AEELKEQIKNLFRET-SDILQIMNLIDAIQLLGLDYHFEKEIDAALSLISKHDAKN- ---- YELYETSLWFRLLRQHGFYVPPDVFNKFKNEEGNFMSTLNEDVKGLLSL AEELKEQIKNLFRET-SDILQIMNLIDAIQLLGLDYHFEKEIDAALSLISKHDAKN- ---- 
YELYETSLWFRLLRQHGFYVPPDVFNKFKNEEGNFMSTLNEDVKGLLSL AEELKEQIKNFFRET-SDILQIMNLIDAIQLLGLDYHFEKEIDAALSLISKHDAKN- ---- YELYETSLHFRLLRQHDFYVPADVFNKFKDEEGNFMSTLNEDVKGLLSL VEELKERVRNLFKETSDDVLQIMNLVDSIQLLGLDYHFEKEIVTALRLVYGADAEN- ---- YGLYEFSLRFRLLRQHGYYLSADVFNKFKDEKGRFLSTLNGDAKGLLSL VEELKERVRNLFKETSDDVLQIMNLVDSIQLLGLDYHFEKEIVTALRLVYGADAEN- ---- YGLYEFSLRFRLLRQHGYYLSADVFNKFKDEKGRFLSTLNGDAKGLLSL VEELKERVRNLFKETSDDVLQIMNLVDSIQLLGLDYHFEKEITAVLRLIYEADVKN- ---- YGLYEVSLRFRLLRQHGYYLSADVFNKFKDEKGRFLSTLNGDPKGLLSL VEELKDGVRNLFEET-HDGLQIMILVDSIQLLGLDYHFDKEITAALRLIYEADVEN- ---- YGLYEVSLRFRLLRQHGYTSSPDVFNKFKDDKGRFLSALNGDAKGLLGL VEELKDRVRNLFEET-HDVLQIMILVDSIQLLGLDYHFEKEITAALRLIYEADVKN- ---- YGLYEVSLRFRLLRQHGYTLSPDVFNKFKDAKGRFLSALNGDAKGLLSL VEELKVQVKSMFKGT-NDILQIMNLIDSIQLLRLDYHFENEIDDALRLIFEVDDKN- ---- YELYETSLRFRLLRQHGYNVSTDTFNKFRDDNGSFISTLKRDAKGLLSL TDKVEEVRRIVHETK--GEKETLLLIDSLQKLSVDYHFEEEIEAKMHSLYEKRLKIINGEANNIMEVSLLFRLLRQAQYPISTDVFDRFLDRRGEFMASLTKEIEGLISL

189 196 195 196 195 195 195 195 196 196 195 192 197 215 215 215 215 215 215 215 189 182 196 193 152 152 152 151 151 151 151 151 151 151 151 151 156 156 156 156 156 150 170

MT16 MT19 MT08 MT06 MT06A MT09 MT09A2 MT09B MT12A-M2 MT06B MT01 MT03 MT07 MT17B MT17B2 MT17 MT17D MT17A MT17A2 MT17C MT02A MT05 MT04 MT11 ST00 ST00A ST00B ST02A ST02A2 ST02A3 ST02C ST02C2 ST02A4 ST02B ST02B2-FS ST03 ST05 ST05A ST05B ST07 ST07A ST01 MT00

::. : * * .* *: .* : *. * . . : : : : : .: * :. * :: : :: : **. YEASYLEKEGETLLQEAMDFTTEQLKGFMEEGSVPEAGGLREQVAHALQLPLNWRLERVQHRWFIEACSSGDNNIN- -PLLLEFAKLDFNLVQDMYKSELRELSRWWSKL YEASYLEKEGETLLKEAMDFATEQLKGFMEGGSVAEAGGLREQVARALQLPLNWRLERVQHRWFIEACSSGDDTVN- -PLLLEFAKLDFNLVQDMYKSDLRELSRWWSEL YEASYFEKEGETLLQEAMDFATEQLKGFMEEESVPEAGGLREQVAHALQIPLNWRLERVQHRWFIEACRR-DDTIN- -PILLEFAKLDFNLVQDMYKSELRELSRWWSGL YEASYLEKEGETLLKEAMDFATEQLKGFMEEGSVPEAGGLREQVAHALQLPLNWRLERVQHRWFIEACSSGDDTVN- -PLLLEFAKLDFNLVQDMYKSELKELSTWWSGL YEASYLEKEGETLLKEAMDFATEQLKGFMEEGSVPEAGGLREQVAHALQLPLNWRLERVQHRWFIEACSSGDDTVN- -PLLLEFAKLDFNLVQDMYKSELKELSTWWSGL YEATYLEKEGETSLKEAMDFATEQLKGFMEEGSVAEAGGLREQVAHALQLPLNWRLERVQHRWFIEACR-GDDTIN- -PLLLEFAKLDYNLVQDMYKSELRELSRWWSGL YEATYLEKEGETSLKEAMDFATEQLKGFMEEGSVAEAGGLREQVAHALQLPLNWRLERVQHRWFIEACR-GDDTIN- -PLLLEFAKLDYNLIQDMYKSELRELSRWWSGL YEATYLEKEGETSLKEAMDFATEQLKGFMEEGSVAEAGGLREQVAHALQLPLNWRLERVQHRWFIEACR-GDDTIN- -PLLLEFAKLDYNLIQDMYKSELRELSRWWSGL YEASYLEKEGETLLKEAMDFATEQLKGFMEEGSVPETGGLREQVAHALQLPLNWRLERVQHRWFIEACR-GDDTIN- -PLLLEFARLDYNLVQDMYKSELRELSRWWSGL YEASYLEKEGETLLKEAMDFATEQLKGFMEEGSVPEAGGLREQVAHALQLPLNWRLERVQHRWFIEACSSGDDTVN- -PLLLEFAKLDFNLVQDMYKSELRELSRWWSGL YEASYLEKEGETLLKEAMDFATKQLKGFLEEGSVPETGGLREQVAHALQLPLNWRLERVQHRWFIEACR-GDDTINP-PFLLEFAKLDFNLVQDMYKSELRELSRWWSGL YEASYYEKEGESVLKEAMDFATEQLKALMEEGSVPEAGGLREQVAHALQIPLNWRLERVHHRWFIEACS-GDNTID- -PLLLEFAKHDFNLVQDMYKSELKELSKWWSEL YEASFYEKEGESVLKEAMDFATEQLKALMEEGSVPEAGGLREQVAHALEIPLNWQLERVQHRWFIEACSSGDDTIN- -PLLLEFAKHDFNLVQDMYKSELRELSRWWSGL YEASYYEKDGEMVLLEAMEFTTEHLKNLLEE-GSD--LKLKEQTAHALELPLNWRMERLHARWFIEACQREVMVIDN-PLLLEFAKLDFNAVQSIYKKELSALSRWWTKL YEASYYEKDGEVVLLEAMEFTTEHLKNLLEE-GSD--LKLKEQTAHALELPLNWRMERLHARWFIEACQREVMVIDN-PLLLEFAKLDFNAVQSIYKKELSALSRWWTKL YEASYYEKDGEMVLLEAMEFTTEHLKNLLEE-GSD--LKLKEQTAHALELPLNWRMERLHARWFIEACQREVMVIDN-PLLLEFAKLDFNAVQSIYKKELSALSRWWTKL YEASYYEKDGEMVLLEAMEFTTEHLKNLLEE-GSD--LKLKEQTAHALELPLNWRMERLHARWFIEACQRKVLAINN-PLLLEFAKLDFNAVQSIYKKELSALSRWWTKL 
YEASYYEKDGEMVLHEAMEFTTEHLKNLLEE-GSD--LKLKEQTAHALELPLNWRMERLHARWFIEACQREVLVINN-PLLLEFAKLDFNAVQSIYKKELSALSRWWTKL YEASYYEKDGEMVLHEAMEFTTEHLKNLLEE-GSD--LKLKEQTAHALELPLNWRMERLHARWLIEACQREVLVINN-PLLLEFAKLDFNAVQSIYKKELSALSRWWTKL YEASYYEKDGEMVLHEAMEFTTEHLKNLLEE-GSD--LKLKEQTAHALELPLNWRMERLHARWFIEACQREVLVINN-PLLLEFAKLDFNAVQSIYKKELSALSRWWTKL YEASYYGTEGEMVLQEAMEFTTEHLKNLLEE-GSEI-MKLRKKVADALELPLNWRMERVHTRRFIEACQRDASILNNNPLLLEFAKLDFNAVQSIYKKELSALSRWWTKL YEASYYEKDEEMVLHEAMEFATEHLENVLEE-GMESSDLTREKVAYALELPLNWRMERLHTRWFIESCQREAAANVN-RALLEFAKLDFNATQSVHKKELRQVSRWWTEL YEASYLEKDGEELLHEAREFTTKHLKNLLEEEGSLKPGLIREQVAHALELPLNWRFQRLHAKWFIGAWQRDPA---MDPALLLLAKLDFNALQNIYKRELNELSRWWTDL YEASYLEKDGEELLHEAREFTTKHLKNLLEEEGALKPGLIREQVAHALELPLNWRFQRLHTKWFIGAWQRD-- ---- -PALLLLAKLDFNALQNIYKRELRDVSRWWTDL YNAAYLGTKKETILDEAISFTKDNLTSLLKDLNPT--- -VAKLVSLTLKTPIQRSMKRLFTRCYISIYQDEPTRNE- --TILELAKLDFNILQCLHQEELKKVCMWWKKL YNAAYLGTKKETILDEAISFTKDNLTSLLKDLNPT--- -VAKLVSLTLKTPIQRSMKRLFTRCYISIYQDEPTRNE- --TILELAKLDFNILQCLHQEELKKVCMWWKKL YNAAYLGTKKETILDEAISFTKDNLTSLLKDQNPT--- -VAKLVSLTLKTPIQRSMKRLFTRCYISIYQDEPTRNE- --TILELAKLDFNILQCLHQEELKKVCMWWKKL YNAAYLRIHGEYILDEAILFTKNRLASLLDELKQP--- -LVILVSHFLETPLCRGNKRLLARKYIPIYQEEERRNE- --AILEFAKLDFNLLQSLHQEELKKISIWWNDL YNAAYLRIHGEYILDEAILFTKNRLASLLDELKQP--- -LVILVSHFLETPLCRGNKRLLARKYIPIYQEEERRNE- --AILEFAKLDFNLLQSLHQEELKKISIWWNDL YNAAYLRIHGEYILDEAILFTKNRLASLLDELKQP--- -LVILVSHFLETPLCRGNKRLLARKYIPIYQEEERRND- --AILEFAKLDFNLLQSLHQGELKKISIWWNDL YNAAYLRIHGEYILDEAILFTKNRLALSLDKLKQP--- -LVILVSLFLETPLCQRNKRLLARKYIPIYQEEERRNE- --AVLEFAKLDFNLLQSIHQEELKKISIWWNDL YNAAYLRIHGEYILDEAILFTKNRLALSLDKLKQP--- -LVILVSLFLETPLCQRNKRLLARKYIPIYQEGERRNE- --AVLEFAKLDFNLLQSIHQEELKKISIWWNDL YNAAYLRIHGEYILDEAILFTKNRLASLLDELKQP--- -LVILVSLFLETPLCQRNKRLLARKYIPIYQEEERRNE- --AVLEFAKLDFNLLQSIHQEELKKISIWWNDL YNAAYLRIHEEYILDEAILFTKNRLTLLLDKLKQP--- -LVILVSLFLETPLCQRNKRLLARKYIPIYQEEERRND- --AILEFAKLDFNLLQSLHQGELKKISIWWNDL YNAAYLRIHEEYILDEAILFTKNRLTLLLDKLKQP--- -LVILVSLFLETPLCQRNKRLLARKYIPIYQEEERRND- 
--AILEFAKLDFNLLQSLHQGELKKISIWWNDL YNAAYLRIHGEYILDEAILFTKNRLASLLDELKQP--- -LVILVSHFLETPLCRGNKRLLARKYIPIYQEEERRNE- --AILEFAKLDFNLLQSLHQEELKKISIWWNDL YNAAYLGTHEETILDEAISFTKCQLESLLGELEQP--- -LATEVSFFLETPLCRRTKRLMVRKYIPIYQENVMRND- --TILELAKLDFNLLQSLHQEEVKKISMWWNDL YNAAYLGTHEETILDEAISFTKCQLESLLGELEQP--- -LATEVSFFLETPLCRRTKRLMVRKYIPIYQENVMRND- --TILELAKLDFNLLQSLHQEEVKKISMWWNDL YNAAYLGTHEETILDEAISFTKCQLESLLGELVQP--- -LATEVSFFLETPLCRRTKRLMVRKYIPIYQENVMRND- --TILELAKLDFNLLQSLHQEEVKKISMWWNDL YNAAYLGTHEEMILDEAISFTKCQLESMLDELEPP--- -LATEVSLFLETPLYPRTRRFLVRKYIPIYQEKVMRND- --TILELAKLDFNLLQSLHQEEVKKITIWWNDL YNAAYLGTHEEMILDEAISFTKCQLESMLGELEPP--- -LATEVSLFLETPLYRRTRRFLVRKYIPIYQEKVMRND- --TILELAKLDFNLLQSLHQEEVKKITIWWNDL YNVSYLATHGETILDEANYFTKSQLVSLLSELEQP--- -LETQVSLFLEVPLCRRIKSLLARIYIPIYQKDAMRDD- --VILELAKLDFNLLQTLHQEELKKVSIWWNDL FEASNLNFGGELILYRANEFSRIHLKPYMTSLDAD--- -LAVHIKQILDNPYHLTLERFKARQILDNNTSKFYYNFN---IVELGKMDFSMIQSLHQKELKEVSRWWKES

297 304 302 304 303 302 302 302 303 304 303 299 305 321 321 321 321 321 321 321 297 290 303 296 255 255 255 254 254 254 254 254 254 254 254 254 259 259 259 259 259 253 273

MT16 MT19 MT08 MT06 MT06A MT09 MT09A2 MT09B MT12A-M2 MT06B MT01 MT03 MT07 MT17B MT17B2 MT17 MT17D MT17A MT17A2 MT17C MT02A MT05 MT04 MT11 ST00 ST00A ST00B ST02A ST02A2 ST02A3 ST02C ST02C2 ST02A4 ST02B ST02B2 -FS ST03 ST05 ST05A ST05B ST07 ST07A ST01 MT00

: * * * : : : * :. * :: : ** ::* .:* :* :* : *. :: :*:* : : : .: : : : GLPEK-LPFFRDRLLENYTWALGFSYEPDSRRCRMIEAKAVCLIALYDDIYDVYGTLDELQLFTDAVNRWDLTAMDKLPEYMKLCFFAIFNLVHEEGYRVMKEKGLDIVP GLPEK-LPFFRDRLLENYTWALGFSYEPDSRRCRMIEAKTVSLIALYDDIYDVYGTLDELQLFTDAVNRWDLTAMDKLPEYMKLCFFAIFNLVHEEGYRVMKEKGLDIVP GLSEK-LPFCRDRLTESYLWTVGFTYEPDSWRCRMIETKIINFITLIDDIYDVYGTLDELQLFTDVVDRWDLTAMDKLPEYMKLSFFALFNMVHEEAYRVMKEKSLDIVP GLSEK-LPFFRDRLTENYLWALGLTYEPDNLRCRIIETKAICFITLIDDIYDVYGTLDELQLFTDAVDRWDLTAMGKLPEYMKLSFFALFNMVHEEGYRVMKETGLDVVP GLSEK-LPFFRDRLTENYLWALGLTYEPDNLRCRIIETKAICFITLIDDIYDVYGTLDELQLFTDAVDRWDLTAMGKLPEYMKLSFFALFNMVHEEGYRVMKETGLDVVP GLAEK-LSFFRDRLPENYLWALGFTYEPESWRCRMIQTKVICFLTLIDDIYDVYGTLDELQLFTDAVDRWDLTAMDKLPEYMKLCFFAIFNLVHEEGYRVMKEKGLDIVP GLAEK-LSFFRDRLPENYLWALGFTYEPESWRCRMIQTKVICFLTLIDDIYDVYGTLDELQLFTDAVDRWDLTAMDKLPEYMKLCFFAIFNLVHEEGYRVMKEKGLDIVP GLAEK-LSFFRDRLPENYLWALGFTYEPESWRCRMIQTKVICFLTLIDDIYDVYGTLDELQLFTDAVDRWDLTAMDKLPEYMKLCFFAIFNLVHEEGYRVMKEKGLDIVP GLAEK-LSFFRDRLPENYLWALGFTYEPESWRCRMIQTKIICFLTLIDDIYDVYGTLDELQLFTDAVDRWDLTAMDKLPEYMKLCFFAIFNLVHEEGYRVMKEKGLDIVP GLAEK-LSFFRDRLPENYLWALGFTYEPESWRCRMIQTKVICFLTLIDDIYDVYGTLDELQLFTDAVDRWDLTAMDKLPEYMKLCFFAIFNLVHEEGYRVMKEKGLDIVP GLVEK-LPFFRDRLLENYLWTLGFTYEPESWRCRMIQAKSICLIALIDDIYDVYGALDELQPFTDAVYRWDLTAMDKLPEYMKLSFFALFNMVHEEGYRVMKETGLDVVP GLSEK-LPFFRDRLFENYLWAVGFTYEPDNWRCRLNETKAICFITLIDDIYDVYGTLDELQIFTDVVNRWDLTAMDKLPEYMRLCFFALFNFVHDEGYRVMKEKGLDIVP GLSEK-LPFGRDRLVENYLWAVGFTYEPDDWRCRMNETKSICFITLIDDIYDVYGTLDELQLFTDVVNRWDLTAMDKLPEYMRLCFFALFNFVHDEGYRVMKEKGLDIVP GVVEK-LPFARDRLTENYLWTVGWAFEPEHWSFREAQTKGNCFVTMIDDVYDVYGTLDELELFTNVVDRWDINAIDQLPDYMKILFLALFNTINDDGYKVMKEKGLDIIP GVVEK-LPFARDRLTENYLWTVGWAFEPEHWSFREAQTKGNCFVTMIDDVYDVYGTLDELELFTNVVDRWDINAIDQLPDYMKILFLALFNTINDDGYKVMKEKGLDIIP GVVEK-LPFARDRLTENYLWTVGWAFEPEHWSFREAQTKGNCLVTMIDDVYDVYGTLDELELFTNVVDRWDINAIDQLPDYMKILFLALFNTINDDGYKVMKEKGLDIIP GVVEK-LPFARDRLTENYLWTVGWAFEPEHWSFREAQTKGNCFVTMIDDVYDVYGTLDELELFTRVVDRWDINAIDQLPDYMKILFLALFNTINDDGYKVMKEKGLDIIP 
GVVEK-LPFARDRLTENYLWAVGWAFEPEHWSFREAQTKGNCFVTMIDDVYDVYGTLDELELFTNVVDRWDINAIDQLPDYMKILFLALFNTINDDGYKVMKEKGLDIIP GVVEK-LPFARDRLTENYLWTVGWAFEPEHWSFREAQTKGNCFVTMIDDVYDVYGTLDELELFTNVVDRWDINAIDQLPDYMKILFLALFNTINDDGYKVMKEKGLDIIP GVVEK-LPFARDRLTENYLWTVGWAFEPEHWSFREAQTKGNCFVTMIDDVYDVYGTLDELELFTNVVDRWDINAIDQLPDYMKILFLALFNTINDDGYKVMKEKGLDIIP GVVEK-LPFARDRLTENYLWTVGWAFEPEHWRFREALTKGNCFVTMIDDVYDVYGTLDELELFTRVVDRWDINAIDKLPDYMKILFLAIFNTTNEEAYKVMKEKGLDIIR GIARE-LPFSRDRLTENYLWTVGWASEPEHWRFREEQTKANCFVTMIDDVYDVYGTLDELELFTDTIDRWDINAIDGLPDYMKLLFLAVFNTTNEATLKVMKEKGLNTMP GLPEK-LPFFRDRLTENYLWTVGFAFEPDSWAFRELQTKFNSFITMIDDVYDVYGTLDELELFTDIMERWDVNAIDKLPEYMKLCFLAVFNTVNDTGYEVMRNKGVDIIP GLPQK-LPFFRDRLTENYLWTVGFAFEPDSWAFRELQTKTNCFVTMIDDVYDVYGTLDELELFTDIMERWDVNAIDKLPEYMKLCFLAVFNTVNDAGYEVMRNKGVDIIP NVDIMHLNFIRDRVVECYCWSMVIRHEPSCSRARLISTKLLMLITILDDIYDSYSTLEESLLLTDAIQRWSPDAVDQLPQYLRDFFLKMLSIFQEFENELAPEE-KFRIF NVDIMHLNFIRDRVVECYCWSMVIRHEPSCSRARLISTKLLMLITILDDIYDSYSTLEESLLLTDAIQRWSPDAVDQLPQYLRDFFLKMLSIFQEFENELAPEE-KFRIF NLDIMHLNFIRDRVVECYCWSMVIRHEPSCSRARLISTKLLMLITILDDIYDSYSTLEESLLLTDAIQRWSPDAVDQLPQYLRGFFLKMLSIFQEFENELAPEE-KFRIF ALAKS-LNFARDRIVECYYWIHNVHFEPHYSRARLICTKVIALLSVLDDIYDNYSTLQESQLLTEAIQRWEPQAIDEVPEYLKDFYLKLLRTFKEFENELENDE-KYRIS ALAKS-LNFARDRIVECYYWIHNVHFEPHYSRARLICTKVIALLSVLDDIYDNYSTLQESQLLTEAIQRWEPQAIDEVPEYLKDFYLKLLRTFKEFENELESDE-TYRIP ALAKS-LNFARDRIVECYYWILIVHFEPQYSRARLICTKVVSLMSLMDDIYDNYSTLQESQLLTEAIQRWEPQAIDEVPEYLKDFYLKLLRTFKEFENELESDE-TYRIP ALAKS-LNFARDRIVECYYWILIVHFESQYSRARLICSKVVSLMSLMDDIYDNYSTLQESQLLTEAIQRWEPQAIDEVPEYLKDFYLKLLRTFKEFENELESDE-KYRIP ALAKS-LNFARDRIVECYYWILIVHFESQYSRARLICSKVVSLMSLMDDIYDNYSTLQESQLLTEAIQWWEPQAIDEVPEYLKDFYLKLLRTFKEFENELESDE-KYRIP ALAKS-LNFARDRIVECYYWILIVHFESQYSRARLICSKVVSLMSLMDDIYDNYSTLQESQLLTEAIQRWEPQAIDEVPEYLKDFYLKLLRTFKEFENELESDE-KYRIP ALAKS-LNFARDRIVECYYWILIVHFEPQYSRARLICTKVVSLMSLMDDIYDNYSTLQESQLLTEAIQRWEPQAIDEVPEYLKDFYLKLLRTFKEFENELESDE-TYRIP ALAKS-LNFARDRIVECYYWILIVHFEPQYSRARLICTKVVSLMSLMDDIYDNYSTLQESQLLTEAIQRWEPQAIDEVPEYLKDFYLKLLRTFKEFENELESDE-TYRIP 
ALAKS-LNFARDRIVECYYWIHNVHFEPHYSRARLICTKVIALLSVLDDIYDNYSTLQESQLLTEAIQRWEPQAIDEVPEYLKDFYLKLLRTFKEFENELENDE-KYRIS ALTKS-LKFARDRVVECYYWIVAVYFEPQYSRARVITSKAISLMSIMDDIYDNYSTLEESQLLTEAIERWEPQAVDCVPEYLKDFYLKLLKTYKDFEDELEPNE-KYRIP ALTKS-LKFARDRVVECYYWIVAVYFEPQYSRARVITSKAISLMSIMDDIYDNYSTLEESQLLTEAIERWEPQAVDCVPEYLKDFYLKLLKTYRDFEDELEPNE-KYRIP ALTKS-LKFARDRVVECYYWIVAVYFEPQYSRA- VITSKAISLMSIMDDIYDNYSTLEESQLLTEAIERWEPQAVDCVPEYLKDFYLKLLKTYRDFEDELEPNE-KYRIP ALTES-LKFARDRVVECYYWIVGVYFEPQYSHPRIITCKVISLMSIMDDIYDNYSTLEESRLLTEAIERWEPQAVEHVPEYLKDFYLKLLKTYKDFEDELEPNK-KYRIP ALIES-LKFARDRVVECYYWIVGVYFESQYSYPRIITCKVISLMSIMDDIYDNYSTLEESRLLTEAIERWEPQAVEHVPEYLKDFYLKLLKTYKDFEDELEPNK-KYRIP ALAKS-LKFVRDRIVEAYYWVLGMYYEPQYSRARVMCTKAFGLLSIMDDIYDNYSTLEERKLLTEAIKRWNRQAVDSLPEYTKDFYLKLLKTFEEFEAVLELNE-KYRVQ GLDQE-LGFARVQPLKWFTWPMTCLPNPKFSKYRIVLSKVVAFVYLLDDIFDVKGSLDELYLFTQAIERWDHSSMNSLPDYMKACFKVLEDTINEIARIVFEEHGWNSIE

406 413 411 413 412 411 411 411 412 413 412 408 414 430 430 430 430 430 430 430 406 399 412 405 364 364 364 362 362 362 362 362 362 362 362 362 367 367 366 367 367 361 382

DDXXD

MT16 MT19 MT08 MT06 MT06A MT09 MT09A2 MT09B MT12A-M2 MT06B MT01 MT03 MT07 MT17B MT17B2 MT17 MT17D MT17A MT17A2 MT17C MT02A MT05 MT04 MT11 ST00 ST00A ST00B ST02A ST02A2 ST02A3 ST02C ST02C2 ST02A4 ST02B ST02B2-FS ST03 ST05 ST05A ST05B ST07 ST07A ST01 MT00

:: : * :* .* . :: : . . : *: :* : * . . : . * : DLKIAWGNLCKAYFEEAKWFHHGQTPELQEYLENGWMSISNQLLLFTAYCVGKD--LNGEALKNFSSFYAITRSSGILFRLYDDMGTSTHEMERG- DVATCIQCYMHEKDFKIAWGNLCKAYFEEAKWFHHGQTPELQVYLENGWMSISNQLLLFTAYCAGKD--LNGEALKNFSSYHAITRSSGILFRLYDDMGTSTHEMERG- DVATCIQCYMLEKDLKRAWGNQCKSYLEEAKWFHHGQTPNLQEYLKNGWVSISNPILLFNAYCAGRD--LTGEALKSYPSYYAITRSSGTLFRLYDDMGTSTDEIERG- DVAKCIQCYMHDKGLKRRWGDQCKSYFEEAKWFHHGQTPILKEYLENAWVTVSGPVLLFNAYCVGKD--LTEEALKSFPSYHEITRSASTLARLYDDMGTSTDELERG- DVPKYVQCYMHEKGLKRRWGDQCKSYFEEAKWFHHGQTPILKEYLENAWVTVSGPVLLFNAYCVGKD--LTEEALKSFPSYHEITRSASTLARLYDDMGTSTDELERG- DVPKYVQCYMHEKDLKKAWGDQCKSYFEEAKWFHHGRTPKLEEYMENGLVSIAGPIIVSHAYCVAKD--LTGEALKIFPNYHEITRSSSILFRLYDDMGTSTDELERG- DVPKYVQCYMHEKDLKKAWGDQCKSYFEEAKWFHHGRTPKLEEYMENGLVSIAGPIILSHAYCVAKD--LTGEALKIFPNYHEITRSSSILFRLYDDMGTSTDELERG- DVPKYVQCYMHEKDLKKAWGDQCKSYFEEAKWFHHGRTPKLEEYMENGLVSIAGPIILSHAYCVAKD--LTGEALKIFPNYHEITRSSSILFRLYDDMGTSTDELERG- DVPKYVQCYMHEKDLKKAWGDQCKSYFEEAKWFHHGRTPKLEEYMENGLVSIAGPIILSHAYCVAKD--LTGEALKIFPNYHEITRSSSILFRLYDDMGTSTDELERG- DVPKYVQCYMHEKDLKKAWGDQCKSYFEEAKWFHHGRTPKLEEYMENGLVSIAGPIIVSHAYCVAKD--LTGEALKIFPNYHEITRSSSILFRLYDDMGTSTDELERG- DVPKYVQCYMHEKGLKREWRDQCKAFFEEAKWFHHGQTPKLNEYLENAWVSIGGPNILFNAYCMSKD--LTGEALKFYPNYHEITRSSNILFRLYDDMGTFTDEIERG- DVAKSIQCYMHEKDLKRTWGDLWTADFEEAKWFHHDQTPKLEEYLENAWVSISGPTILVNAYCMGKD--LTGEALKSFPSYYQITRFSSRLFRLIDDMGTSTDELERG- DVAKCIQCYMHEKDLKRTWGNLWKASFQEAKWFHHDQTPKLEEYLGNGWVSISGPTILFNAYCTGKD--LTEEALKSYPSYDQIIRFSSRLFRLLDDYGTSTDEVEKG- DVAKCIQCYMHEKYLKRSWSDLCKAYLVEAKWFHGGYTPTLNEYLDNTWISISGPAIFTNAYCMANN--LTKQALERFSEYPAIAKPSSMLGRLYNDLATSTAEIERG- DVPKSIQCCMHERYLKRSWSDLCKAYLVEAKWFHGGYTPTLNEYLDNTWISMSGPAIFTNAYCMANN--LTKQALERFSEYPAIAKPSSMLGRLYNDLATSTAEIERG- DVPKSIQCCMHERYLKRSWSDLCKAYLVEAKWFHGGYTPTLNEYLDNTWISISGPAIFTNAYCMANN--LTKQALERFSEYPAIAKPSSMLGRLYNDLATSTAEIERG- DVPKSIQCCMHERYLKRSWSDLCKAYLVEAKWFHGGYTPTLNEYLDNTWISISGPAIFTNAYCMANN--LTKQALERFSEYPAIAKPSSMLGRLYNDLATSTAEIERG- 
DVSKSIQCCMHERYLKRSWSDLCKAYLVEAKWFHGGYTPTLNEYLDNTWISISGPAIFTNAYCMANN--LTKQALERFSEYPAIAKPSSMLGRLYNDLATSTAEIERG- DVPKSIQCCMHERYLKRSWSDLCKAYLVEAKWFHGGYTPTLNEYLDNTWISISGPAIFTNAYCMANN--LTKQALERFSEYPAIAKPSSMLGRLYNDLATSTAEIERG- DVPKSIQCCMHERYLKRSWSDLCKAYLVEAKWFHGGYTPTLNEYLDNTWISISGPAIFTNAYCMANN--LTKQASERFSEYPAIAKPSSMLGRLYNDLATSTAEIERG- DVPKSIQCCMHERYLRGAWADLCKAYLVEAKWYHQGCTPPLAEYLENAQVTITGPLVLINAYCLFND--LSEQDLARFSGHLATIKPPSILARLYNDLATSTAESKRG- DVAKAIACCMHETYLKRAWADLCKAYLVEAKWYHKGHTPKFDEYLENGRMSISSNVMVTYGYCMAQE--LTKHDLERFSDYPAIMLPKSRLARLYDDLATSKDELKRG- DVQKCIQCCMLERYLKRAWAELCKMYLREARWYHAGYTPTLDEYLDGAWITISGALILSAAYCTGKD--LTKEDLDKFSTYPYIVQPSCVLFRLHDDFGTSTDELARG- DVQKAVQCCMHERYLKRAWAELCKMYMREARWYGAGYTPTLDEYLDGAWISISGALVLSAAYCTGKD--LTKEDLDKFSTYPSIVQPSCVLFRLHDDLGTSTDELARG- DVQKAVQCCMHERYLKEEWKILSQLYIKECKWRDDNYVPKLEEHMRVSIKSVGFVWFYCSFLSGMEEAVATKDAFEWFATFPKIIEACAMIVRITNDITSTEREQKRV- HVASTVDCYMKEYYLKEEWKILSQLYIKECKWRDDNYVPKLEEHMRVSIKSVGFVWFYCSFLSGMEEAVATKDAFEWFATFPKIIEACAMIVRITNDITSTEREQKRV- HVASTVDCYMKEYYLKEEWKILSQLYIKECKWRDDNYVPKLEEHMRVSIKSVGFVWFYCSFLSGMEEAVATKDAFEWFATFPKIIEACAMIVRITNDITSTEREQKRV- 
HVASTVDCYMKEYFLQDEIKAISRSFFIEAKWGIEKYVPTLEEHLSNSIVSTVYPVLICASYVGMDQ-VASKEVFEWVASFPKILKASSMIGRLMNDLTSHKRERQRDQHAASTIECYMKEFFLQDEIKALSRSYFIEAKWGIEKYVPTLEEHLSNSIVSTVYPVLICASYVGMDQ-VASKEVFEWVASFPKILKASSMIGRLMNDLTSHKRERQRDQHAASTIECYMKEFFLQDEIKALSRSYFIEAKWGIEKYVPTLEEHLSNSIVSTVYPVLICASYVGMDQ-VASKEVFEWVASFPKILKASSMIGRLMNDLTSHKRERQRDQHAASTIECYMKEFFLQDEIKALSRSYFIEAKWGIEKYVPTLEEHLSHSLVSTVYPVLICASYVGMDQ-VASKEVFEWVASFPKILKASTMIARLMNDLTSHKRERQRDQHAASTIECYMKEFFLQDEIKALSRSYFIEAKWGIEKYVPTLEEHLSHSLVSTVYPVLICASYVGMDQ-VASKEVFEWVASFPKILKASTMIARLMNDLTSHKRERQRDQHAASTIECYMKEFFLQDEIKALSRSYFIEAKWGIEKYVPTLEEHLSHSLVSTVYPVLICASYVGMDQ-VASKEVFEWVASFPKILKASTMIARLMNDLTSHKRERQRDQHAASTIECYMKEFFLQDEIKALSRSYFIEAKWGIEKYVPTLEEHLSNSIVSTVYPVLICASYVGMDQ-VASKEVFEWVASFPKILKASSMIGRLMNDLTSHKRERQRDQHAASTIECYMKEFFLQDEIKALSRSYFIEAKWGIEKYVPTLEEHLSNSIVSTVYPVLICASYVGMDQ-VASKEVFEWVASFPKILKASSMIGRLMNDLTSHKRERQRDQHAASTIECYMKEFFLQDEIKAISRSFFIEAKWGIEKYVPTLEEHLSNSLDTTGYRLLICASYVGMDQ-VASKEVFEWVASFPKIIKASCMICRLMDDVTSHELEQQRE- HTASTVECYMKEFYLQEEIKVLSRSYFQEAKWGVERYVPSLEEHLLVSLISAGYYAVACAAYVGLGE-DATKETFEWVASSPKILKSCSILCRLMDDITSHEREQERD- HVASTVESYMKEHYLQEEIKVLSRSYFQEAKWGVERYVPSLEEHLLVSLISAGYYAVACASYVGLGE-DATKETFEWVASSPKILKSCSILCRLMDDITSHEREQERD- HVASTVESYMKEHYLQEEIKVLSRSYFQEAKWGVERYVPSLEEHLLVSLISAGYYAVACASYVGLGE-DATKETFEWVASSPKILKSCSILCRLMDDITSHEREQERD- HVASTVESYMKEHYLHKEIKDLSRSYFQEAKWCAEGYVPTLEEHLRVSLKSTGYPAITCVSFVGLGE-DATKEAFEWVTSFPKILKSCTIICRLMDDIASHEREQERD- HVASTVESYMKEYYLHKEIKDLSRSYFQEVKWCAEGYVPTLEEHLRVSLKSTGYPAITCVSFVGLGE-DATKEAFEWVTSFPKILKSCTIITRLMDDIASHEREQERD- HVASTVESYMKEYYLKNEFKVVAIAYFEESKWGVERYVPSLEEHLRVSLISAACSLVICSMYLGMGE-VATKEVFEWYSSFPKPVEACSVIGRLLNDIRSHETEQERD- HVASTVECYMKEHYLKKSWIQLSKAFLVEAKWLIEDEVPTVDDYMKTSTVSCGVPLILVHLYFLLGH--- -GVSENLVENLRNLISCPARILRLWDDLGSAKDEEQNG- RDGSLLACLMKENP

MT16 MT19 MT08 MT06 MT06A MT09 MT09A2 MT09B MT12A-M2 MT06B MT01 MT03 MT07 MT17B MT17B2 MT17 MT17D MT17A MT17A2 MT17C MT02A MT05 MT04 MT11 ST00 ST00A ST00B ST02A ST02A2 ST02A3 ST02C ST02C2 ST02A4 ST02B ST02B2 -FS ST03 ST05 ST05A ST05B ST07 ST07A ST01 MT00

* .: *. * . : . .* : : ::. . GVSEKVARRKIRELIRKYWKELNG-SLN-WDSPLEEYFKNIAANIPRAAQFFYQD-GDGYGKE-DGETKAQIISLLLEPIQI-GVSEKVARRKMRELIRKYWKEVNG-SLN-WDSPLEEYFKNIAANIPRAAQFFYQD-GDGYGKE-DGETKTQIISLLLEPIQI-GVSEEVARRKIRELMRKYWRELNG-SLS-WNSPLEEYFKNIAVNIPRSANFFYGN-GDGFGKELEGETKSQITLLLLEPIQI-GVSEEVARREIRELMRKYWRELNA-SLS-WDSPLEEYVKNVIINSSRISQFFYQD-GDGYGKS-DGETKSQIISLLFEPIQI-GVSEEVARREIRELMRKYWRELNA-SLS-WDSPLEEYVKNVIINSSRISQFFYQD-GDGYGKS-DGETKSQIISLLFEPIQI-GVSEEVARREIRELMRKYWRELNA-SLS-WDSPLEEYFMNIQVNIPRTAQFFYDHEGDGYGKA-YGETKSQIILLLFEPIQI-GVSEEVARREIRELMRKYWRELNA-SLS-WDSPLEEYFMNIQVNIPRTAQFFYDHEGDGYGKA-YGETKSQIILLLFEPIQI-GVSEEVARREIRELMRKYWRELNA-SLS-WDSPLEEYVKNVIINSSRISQFFYQD-GDGYGKS-DGETKSQIISLLFEPIQI-GVSEEVARREIRELMRKYWRELNA-SLS-WDSPLEEYFMNIQVNIPRTAQFFYDHEGDGYGKA-YGETKSQIILLLFEPIQI-GVSEEVARREIRELMRKYWRELNA-SLS-WDSPLEEYFMNIQVNIPRTAQFFYDHEGDGYGKA-YGETKSQIILLLFEPIQI-GVSEEVARAKIRELMRKYWRELNG-SLT-WDSPLEGYFKNIAANIPRTAQFFYDHEGDGYGKA-DGETKSQIILLLFEPIQI-GVSEEVARSEIRELIKKHWRALNA-SLS-WDSPFQEYFKNIVVNVPRTAYFFYQD-GDGYGKP-DGETKSQIISIVFEPIQI-GVSEEVARGEIRELIKKYWRALNA-SLS-WDSPVEEYFKNVAINIPRTAHFFYRD-GDGYAKS-DGETKSQIITMLFEPIQI-GVSEGVAREQVKELIRGNWRCMNG-DRA-ATSSFEEMLKRVALDIARSSQFFYQN-GDGYGQG-GGETMNQVMSLLINPII--GVSEGVAREQVKELIRGNWRCMNG-DRA-ATSSFEEMLKRVALDIARSSQFFYQN-GDGYGQG-GGETMNQVMSLLINPII--GVSEGVAREQVKELIRGNWRCMNG-DRA-AESSFEEMLKTVALDIARSSQFFYQN-GDGYGQG-DGETMNQVMSLLINPII--GVSEGVAREQVKELIRGNWRCMNG-DRA-AESSFEEMLKTVALDIARSSQFFYQN-GDGYGQG-DGETMNQVMSLLINPII--GVSEGVAREQVKELIRGNWRSMNG-DRA-AASSFEEMLKRVALDIARSSQFFYQN-GDGYGQG-GGETMNQVMSLLINPII--GVSEGVAREQVKELIRGNWRSMNG-DRA-AASSFEEMLKRVALDIARSSQFFYQN-GDGYGQG-DGETMNQVMSLLINPII--GVSEGVAREQVKELIRGNWRSMNG-DRA-AASSFEEMLKRVALDIARSSQFFYQN-GDGYGQG-DGETMNQVMSLLINPII--GVSEAVARRRIKELIKVNWRRMNG-DRG-GASSLEEHSKRVAIGMARSTHFFYGQ-GDGFGQR-DGDIKSQLVALLVNPIVVEK 
GVSEDVARGQMKEAIKANWRSVNG-DRGSVSSSFEEYMKRLVVNMIRTFQFFYQD-EDRYGKA-DGETENQVFSLLIIPILL-KVPEAVAREHIKQVMEAKWRLLNG-NRV-ATSSFEEYFQNVAINIPRAAQFFYGK-GDGYAHS-DGETQKQVMSLLIEPVQL-KVPEAVASEHIKQVMEAKWRLLNG-NRV-ATSSFEEYFQNVAINIPRAAQFFYGK-GDGYAKS-DGETQKQVMSLLIEPVQL-GTSKDVACEKLLGFVEDAWKTINE-ELLTK-TGLSREVIELSLHSSRSTELIYKH-VDAFTEP-NTSMKENIFFLLVHPIPI-GTSKDVACEKLLGFVEDAWKTINE-ELLTK-TGLSREVIELSLHSSRSTELIYKH-VDAFTEP-NTSMKENIFFLLVHPIPI-GTSKDVACEKLLGFAEDAWKTINE-ELLTK-TRLSREVIELSLHSSRSTELIYKH-VDAFTEP-NTSMKENIFFLLVHPIPI-ATDEKEAYKNLMEMVEDAWKDHNK-ECLNP-TQVPRLIIEKIVNFSRVLEEVYKY-TDIYTNS-NTTMKDNIYMLLVESVLI-ATDEKEAYKNLMEMVEDAWKDHNK-ECLNP-TQVPRLIIEKIVNFSRVLEEVYKY-TDIYTNS-NTTMKDNIYMLLVESVLI-ATDEKEAYKNLMEMVEDAWKDHNK-ECLNP-TQVPRLIIEKIVNFSRVLEEVYKY-TDIYTNS-NTTMKDNIYMLLVESVLI-ATDEKEAYKNLMEMVEDAWKDHNK-ECLNP-TQVPRLIIEKIVNFSRVLEEVYKY-TDIYTNS-NTTMKDNIYMLLVESVLI-ATDEKEAYKNLMEMVEDAWKDHNK-ECLNP-TQVPRLIIEKIVNFSRVLEEVYKY-TDIYTNS-NTTMKDNIYMLLVESVLI-ATDEKEAYKNLMEMVEDAWKDHNK-ECLNP-TQVPRLIIEKIVNFSRVLEEVYKY-TDIYTNS-NTTMKDNIYMLLVESVLI-ATDEKEAYKNLMEMVEDAWKDHNK-ECLNP-TQVPRLIIEKIVNFSRVLEEVYKY-TDIYTNS-NTTMKDNIYMLLVESVLI-ATDEKEAYKNLMEMVEDAWKDHNK-ECLNP-TQVPRPLIENIVNFSRVVEEFYKY-IDAYTVS-NTTMKDNVNMLLVESVLI-ATDEKEAYKNLMEMVEDAWKDHNK-ECLDQ-TQVPRLLIENIVNFSRVVEEFYKY-IDAYTVS-NTTMKDNVNMLLVESVLI-GTSAKVACEKLQVMVEQKWKDLNE-ECLRP-TQVARPLIEIILNLSRAMEDIYKY-KDTYTNS-NTRMKDNVSLILVESFPI-GTSAKVACEKLQVMVDQKWKDLNE-ECLRP-TQVARPLIEIIMNLSRAMEDIYKH-NDTYTNS-NTRVKDNVSLILVESFPI-GTSAKVACEKLQVMVDQKWKDLNE-ECLRP-TQVARPLIEIIMNLSRAMEDIYKH-NDTYTNS-NTRMKDNVSLILVESFPI-GTSTKVAHEKLQVVVEQAWKDLNK-ECLRPTTQVARSLIEIILNLSRTMEDIYKY-NDTYTNS-NTRMKDNISLILVESFPI-GTSKKVAHEKLQVVVEQAWKDINK-ECLHPTTQVARTLIEIILNLSRTMEDIYKY-NDTYTNS-NTRMKDNISLILVESFPI-GTDVKVACEKLREMLEKAWKDLNK-ERLNP-TLVARPIIERILSLSISMEDVYRD-TDEYTHS-DKKMKDNVSLVLVEPVPI-HCSSQVARGKVMQMIDEAWEELNKESFSLSTSMFSRDFVVVCLNTARMVRVIYNYDEEHKLP--- -MLKEYINLLLFEQI- ---

590 597 596 597 596 596 596 595 597 598 597 592 598 613 613 613 613 613 613 613 592 584 596 589 550 550 550 548 548 548 548 548 548 548 548 547 552 552 551 553 553 546 563

Fig. 3 Alignment of ginger and turmeric terpene synthases by ClustalX2 [34]. RRX8W and DDXXD motifs are marked in the quality curve.
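The two motifs named in the caption can be located in an ungapped protein sequence with simple pattern matching. The sketch below is an illustration only: it assumes that RRX8W means two arginines, any eight residues, then tryptophan, and that DDXXD means two aspartates, any two residues, then aspartate. The function name and the toy sequence are hypothetical and are not taken from the dissertation.

```python
import re

# Conserved terpene synthase motifs named in the caption, written as regular
# expressions; "X" in the motif names is read as "any residue" (assumption).
MOTIFS = {
    "RRX8W": re.compile(r"RR.{8}W"),
    "DDXXD": re.compile(r"DD..D"),
}

def find_motifs(protein):
    """Return {motif name: [(1-based start, matched text), ...]}.

    Alignment gaps ("-") are removed first, so positions refer to the
    ungapped sequence.
    """
    seq = protein.replace("-", "").upper()
    hits = {}
    for name, pattern in MOTIFS.items():
        hits[name] = [(m.start() + 1, m.group()) for m in pattern.finditer(seq)]
    return hits

# Hypothetical toy sequence, not one of the ginger or turmeric enzymes.
toy = "MAARRSGNYHPSLWGDAAADDIYDAAA"
print(find_motifs(toy))
```

Because gaps are stripped before matching, the reported positions are sequence coordinates rather than alignment columns.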

512 519 517 519 518 517 517 517 518 519 518 514 520 536 536 536 536 536 536 536 512 505 518 511 472 472 472 470 470 470 470 470 470 470 470 469 474 474 473 474 474 468 487

[Figure: total ion chromatograms in eight panels (A to H). Panels A to D cover 7.6 to 10.0 min and panels E to H cover 20 to 28 min on the Time (min) axis; the vertical axis runs from 0 to 100. Peaks are numbered 1 to 10, and one peak is marked *.]

Fig. 4 Analysis of ST00A and ST00B functions when the proteins were expressed in E. coli strain BL21-AI RIL. Total ion chromatograms are displayed: pentane blank (A, E); enzyme assay using E. coli crude extract without the pH9GW-ST00A or pH9GW-ST00B plasmid, with GPP (B) or FPP (F) as a substrate; enzyme assay using E. coli crude extract expressing ST00A, with GPP (C) or FPP (G) as a substrate; enzyme assay using E. coli crude extract expressing ST00B, with GPP (D) or FPP (H) as a substrate. Products/compounds identified include: 1, β-phellandrene; 2, α-pinene; 3, sabinene (4(10)-thujene); 4, β-pinene; 5, α-phellandrene; 6, limonene; 7, α-zingiberene; 8, β-sesquiphellandrene; 9, β-bisabolene; 10, unknown (trans-sesquisabinene hydrate-like2); *, unknown, corresponding to the unknown (7-epi-sesquithujene-like) product of ST00A expression in the yeast strain EPY219 (Fig. 5, peak 5).

[Figure: total ion chromatograms, panels A to F (22 to 30 min), with enlarged regions E2 and F2 shown for 21.5 to 24.5 min and for 26.5 to 30.5 min; x-axis, Time (min); the vertical axis runs from 0 to 100. Peaks are numbered 1 to 15, and several peaks are marked *.]

Fig. 5 Analysis of ST00A and ST00B functions when the proteins were expressed in the yeast strain EPY219. Total ion chromatograms are displayed: pentane blank (A); EPY219 without pESC-URA-ST00A, 18 °C, 2 days (B); EPY219 expressing ST00A, 18 °C, 2 days (C); EPY219 expressing ST00B, 18 °C, 2 days (D); EPY219 without pESC-URA-ST00A, 18 °C, 8 days (E); and EPY219 expressing ST00A, 18 °C, 4 days (F). E2 and F2 are boxed regions of the E and F panels, enlarged to show very small peaks. Products/compounds identified include: 1, α-zingiberene; 2, β-sesquiphellandrene; 3, β-bisabolene; 4, unknown (trans-sesquisabinene hydrate-like2); 5, unknown (7-epi-sesquithujene-like); 6, trans-α-bergamotene; 7, γ-curcumene; 8, ar-curcumene; 9, [E]-γ-bisabolene; 10, unknown (α-eudesmol-like); 11, γ-eudesmol; 12, unknown (trans-sesquisabinene hydrate-like3); 13, α-acorenol; 14, [E]-β-farnesene; 15, [E]-nerolidol

[Figure: modeled protein structures, panels A to C.]

Fig. 6 Comparison of MT06 and MT06B modeled structures. The MT06 and MT06B structures were modeled using (4S)-limonene synthase from Mentha spicata [29] as a template. Shown are the alignment of all three structures in ribbon representation (A), the alignment of MT06 and MT06B in ribbon representation (B), and the side chains of MT06 and MT06B near the bound ligand (C). The one additional amino acid in MT06B lies in the loop indicated with an arrow in B and does not appear to affect the structure; Y576 in MT06 and Y577 in MT06B, which follow the loop, align very well in C. Green: (4S)-limonene synthase from Mentha spicata, Blue: MT06, Orange: MT06B, Ligand with hashed surface: 2-fluorogeranyl diphosphate.

[Figure: modeled protein structures, panels A to C; the N-terminal (N-ter) and C-terminal (C-ter) regions are labeled.]

Fig. 7 Comparison of ST02B and ST02C modeled structures. The ST02B and ST02C structures were modeled using (+)-δ-cadinene synthase from Gossypium arboreum [30] as a template. Shown are the alignment of all three structures in ribbon representation (A), the alignment of ST02B and ST02C in ribbon representation (B), and the atoms/bonds of ST02B and ST02C (C). The C-terminal and N-terminal regions of ST02B and ST02C are indicated with arrows in B. Green: (+)-δ-cadinene synthase from Gossypium arboreum, Blue: ST02B, Orange: ST02C, Ligand with hashed surface: 2-fluorofarnesyl diphosphate.

[Figure: modeled protein structures, panels A to D.]

Fig. 8 Comparison of ST02A4, ST02B and ST02C modeled structures. The ST02A4, ST02B and ST02C structures were modeled using (+)-δ-cadinene synthase from Gossypium arboreum [30] as a template. Only the C-terminal and N-terminal regions are shown in this figure. The loop and helix structures are shown with side chains, and W28 from the conserved RRX8W motif is labeled (A). Surfaces of ST02A4 (B), ST02B (C) and ST02C (D) are shown. ST02C has a longer helix and a shorter loop, like (+)-δ-cadinene synthase from Gossypium arboreum, whereas ST02A4 and ST02B have longer loops, which can affect protein breathing and allow water molecules to quench terpene synthase reactions in these proteins. Yellow: ST02A4, Blue: ST02B, Orange: ST02C, Ligand with hashed surface: 2-fluorofarnesyl diphosphate from the aligned (+)-δ-cadinene synthase from Gossypium arboreum.

[Figure: reaction scheme. Labeled structures: FPP, (+)-germacrene A, (+)-hedycaryol, (-)-neointermedeol, β-elemene and α-elemol.]

Fig. 9 Proposed synthesis of β-elemene, α-elemol and (-)-neointermedeol from FPP. The products of ST02A4, ST02B and ST02C share the same mechanistic pathway.

[Figure: chemical structures labeled α-zingiberene, β-sesquiphellandrene, α-turmerone, β-turmerone, tumerone and curlone.]

Fig. 10 Comparison of α-turmerone and β-turmerone structures to related compounds. The stereochemistry of α-zingiberene is very similar to that of α-turmerone, and the stereochemistry of β-sesquiphellandrene is very similar to that of β-turmerone. Together with the metabolite profiling shown in Fig. 1, this suggests that the “tumerone” and “curlone” previously proposed to be present in turmeric may in fact be α-turmerone and β-turmerone. The colors used in this figure are consistent with those used in Fig. 1.

[Figure: reaction scheme. Labeled structures: α-zingiberene, α-turmerone, β-sesquiphellandrene and β-turmerone.]

Fig. 11 Proposed reaction of α-zingiberene/β-sesquiphellandrene oxidase. According to the data in Fig. 12, α-zingiberene/β-sesquiphellandrene hydroxylase converts α-zingiberene and β-sesquiphellandrene to hydroxylated forms, so a dehydrogenase would be required to produce the ketone forms, α- and β-turmerone.

[Figure: total ion chromatograms, panels A to H (36.0 to 37.5 min; x-axis, Time (min)), with peaks numbered 1 to 3, and mass spectra (x-axis, m/z, 50 to 250) labeled Peak 1, Peak 2, Peak 3, α-zingiberene, β-bisabolene, β-sesquiphellandrene, α-turmerone and β-turmerone.]

Fig. 12 Analysis of the products of putative α-zingiberene/β-sesquiphellandrene hydroxylase. Total ion chromatograms are shown for: pentane blank (A), yeast cell line INVSc1 (B), and INVSc1 with pESC-TRP-Ob_CPR and the pESC-URA-ST00A (C), pESC-URA-ST00A-P1A (D), pESC-URA-ST00A-P1A2 (E), pESC-URA-ST00A-P4 (F), pESC-URA-ST00A-P4A (G) and pESC-URA-ST00A-P4A2 (H) plasmids. The mass spectra of peaks 1, 2 and 3 are similar to the mass spectra of α-zingiberene, β-bisabolene and β-sesquiphellandrene, which suggests that peaks 1 and 3 are the hydroxylated forms of α-zingiberene and β-sesquiphellandrene shown in Fig. 11. P1A, P1A2, P4, P4A and P4A2 are P450 monooxygenase candidates for α-zingiberene/β-sesquiphellandrene hydroxylase.
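The mass relationship behind the hydroxylation argument can be checked with nominal masses. The sketch below is an illustration under stated assumptions: it takes C15H24 for a sesquiterpene hydrocarbon such as α-zingiberene, C15H24O for a singly hydroxylated form, and C15H22O for a ketone such as a turmerone. These formulas, and the helper name `nominal_mass`, are supplied for the example and are not quoted from the caption.

```python
# Nominal (integer) atomic masses for the elements used here.
NOMINAL = {"C": 12, "H": 1, "O": 16}

def nominal_mass(formula):
    """Nominal mass of a formula given as an {element: count} dict."""
    return sum(NOMINAL[element] * count for element, count in formula.items())

# Assumed formulas (illustrative, not taken from the caption).
sesquiterpene = {"C": 15, "H": 24}          # e.g. α-zingiberene
hydroxylated = {"C": 15, "H": 24, "O": 1}   # one oxygen atom inserted
ketone = {"C": 15, "H": 22, "O": 1}         # alcohol oxidized, two H lost

for label, formula in [("hydrocarbon", sesquiterpene),
                       ("hydroxylated", hydroxylated),
                       ("ketone", ketone)]:
    print(label, nominal_mass(formula))
```

Under these assumptions hydroxylation adds 16 mass units to the hydrocarbon, and the subsequent dehydrogenation removes 2.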

Contig name   Category    Primer name    Primer sequence (5’ -> 3’)
MT01          5' GSP      MT12R          CAGATCAGAGGGCTTCTGAGTCTGTACG
MT01          5' N-GSP    MT01RR         CAGAGAAGCCATTCGTTCTC
MT02          5' GSP      MT02R2         TGCAACACCATCTCTCCCTCGGTTCCAT
MT02          5' N-GSP    MT02RR3        TACAAACTCAGCATTCCTTC
MT04          5' GSP      MT04R          ATCCGTCCACCATCTTGAGAGTTCAT
MT04          5' N-GSP    MT04RR         TCTCGCTTGTAGATGTTCTG
MT05          3' GSP      MT05-3F        AAGGTGGCTTATGCCCTGGAACTGCCAT
MT05          3' N-GSP    MT05-3FF       ACAGAACTTGGAATCGCT
MT05          5' GSP      MT05R          CTCGGGAAGAACACCCAAGGAATGCTC
MT05          5' N-GSP    MT05RR         CATCTCTAAACTTGTAGAAAAT
MT06          5' GSP      MT15R          GCCACGTCGCCTTTTTCTACCTCATCCG
MT06          5' N-GSP    MT06RR         ATCGTAAAGACGAGCAAGTG
MT07          3' GSP      MT07-3F        GCATGCAGTAGTGGCGACGACACAATCA
MT07          3' N-GSP    MT07-3FF       ACTGGAGATGCAGAATGAAT
MT07          5' GSP      MT07R3         GGTGGAACCATTTTGCTTCCTGAAAGCT
MT07          5' N-GSP    MT07RR2        GTGCTGCACTCTCTCTAGCT
MT09          5' GSP      MT09R          AGCAGCTTCAACTGATCGACCAACTGC
MT09          5' N-GSP    MT09RR2        AGCGAGCTTCAAAGTTTCCA
MT12          3' GSP      MT12-3F        AGCAGAGCAGCTTCAACTGATCGACCAA
MT12          3' N-GSP    MT12-3FF       GGCTTCTCTGTTTCACAAGGAAG
MT12          5' GSP      MT12R          CAGATCAGAGGGCTTCTGAGTCTGTACG
MT12          5' N-GSP    MT12RR         GTCTCTAAATGTCTCGAACCTT
MT15          5' GSP      MT15R          GCCACGTCGCCTTTTTCTACCTCATCCG
MT15          5' N-GSP    MT15RR         GGAAGTGCCATAATCATCGAG
MT16          5' GSP      MT16R          CGCCACTTTCTCTGAGACACCTTTTTCG
MT16          5' N-GSP    MT16RR         GAGCGTGTGATCGCATAGA
MT17          5' GSP      MT17R          TACACCTCCAGTTGCCTCG
MT17          5' N-GSP    MT17RR         CTGCTCACGAGCCACCC
MT19          5' GSP      MT19R          GTATGCCTGAGGAGCGTGTGATCGCATG
MT19          5' N-GSP    MT19RR         GCGTGTGATCGCATGGT
ST00          5' GSP      Zd01L13RR      CGAGAACAACTAGGTTCATGAC
ST01          5' GSP      ST01R2         GATGGCACATATCGCTCCTCACCCCAC
ST01          5' N-GSP    ST01RR         TACCAAGAACCCAATAATAAGC
ST02          5' GSP      Ca02L14GSP1    TCTTTCCATGCATCCTCCACCATCTCC
ST02          5' N-GSP    Ca02L14RRR     TATTGTTGAAGCTGCATGCT
ST03          5' GSP      Cb01O23GSP1    TCCTTCCATGCGTCCTCCACCATCTCC
ST03          5' N-GSP    Cb01O23R4      CTCTTTCATATAACATTCTACA
ST05          3' GSP      ST05-3F        CGGCGCTAAGATTGGTTTATGGGGCTGA
ST05          3' N-GSP    ST05F          GAATGTGATGCGAAATGATACT
ST05          5' GSP      ST05R          CATTCCACTACTCGGTCACGAGCAAAC
ST05          5' N-GSP    ST05RR2        CCACCACATTGAAATTTTCTTCACT
ST07          5' GSP      ST07R          CCATGCTTGTTCCACCACCACTTGCAA
ST07          5' N-GSP    ST07RR         TTTCTCATGTGCCACTTTTTTG
ST09          5' GSP      ST07R          CCATGCTTGTTCCACCACCACTTGCAA
ST09          5' N-GSP    ST09RR         TTTCTCATGTGCCACTTTTGTA

Supplementary Table S1 Primers for RACE. 5’ GSP, gene-specific primers for 5’ RACE; 5’ N-GSP, nested gene-specific primers for 5’ RACE; 3’ GSP, gene-specific primers for 3’ RACE; 3’ N-GSP, nested gene-specific primers for 3’ RACE
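Basic properties of the RACE primers can be summarized with a few lines of code. The sketch below is a minimal illustration and not part of the original methods: it computes length, GC fraction and a rule-of-thumb melting temperature using the Wallace rule, Tm = 2(A+T) + 4(G+C), which is only a rough estimate and is usually applied to short oligonucleotides. The function names are hypothetical; the example sequence is the one listed for primer MT01RR.

```python
def gc_fraction(primer):
    """Fraction of G and C bases in the primer."""
    seq = primer.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def wallace_tm(primer):
    """Rule-of-thumb melting temperature in °C: 2*(A+T) + 4*(G+C)."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

# Sequence listed for primer MT01RR in Supplementary Table S1.
primer = "CAGAGAAGCCATTCGTTCTC"
print(len(primer), gc_fraction(primer), wallace_tm(primer))
```

For the longer gene-specific primers in the table, a nearest-neighbor method would be a more appropriate estimate; the Wallace rule is used here only because it is easy to verify by hand.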


Contig name

Category

Primer name

Primer sequence (5’ -> 3’)

MT00

pDONR207-F pCRT7CT-F pDONR207-R pCRT7CT-R

10N21GtwyAttF Cb10N21CT-F 10N21G-AttR Cb10N21CT-R

ATGAACCCTAGCAGTAATACCTC GGGGACCACTTTGTACAAGAAAGCTGGGTCTGCCTATGATCGAGGCGTTA AATTTGTTCAAACAACAACA

MT01

MT02

MT03

MT04

MT05

MT06

GGGGACAAGTTTGTACAAAAAAGCAGGCTTACTGGTTCCGCGTGGATCCATGCATTCATGCATGGTTCTC

pCR2.1-F

MT01--F

ATGGCAACTGGTCAAGTTAT

pEXP5CT-F pET101D-F pCR2.1-R, pEXP5CT-R, pET101D-R

MT01--Ft MT01--F-CACC MT01MT09--R

ATGCTGCGTCGCTCGGGGAATTATCACCCAAG CACCATGCGTCGCTCGGGGAATTATCACCCAAG AATTTGGATAGGTTCGAATAAT

pCR2.1-F

MT02--F

ATGTCTAGCTTCCTTCCTGCTCCACTAAATC

pEXP5CT-F pET101D-F pESC-URA-F pCR2.1-R, pEXP5CT-R, pET101D-R pESC-URA-R

MT02--Ft MT02--F-CACC MT02--Ft-AAAACA-BamHI MT02--R MT02--R-XmaI

ATGAGCCGTCGCTCTGCCAACTTCCGGCCCAAC CACCATGCGTCGC TCTGCCAACTTCCGGCCCAACTTA CGCGGATCCAAAACAATGCGTCGCTCTGCCAACTTCCGGCCCAACTTA TTTTTCCACGACAATAGGGTTGACCAATAAT TCCGCCCGGGCCTTTTTCCACGACAATAGGGTTGACCAATAAT

pCR2.1-F

MT03--F

ATGTCGTCTATTTGCGCTCC

pEXP5CT-F pCR2.1-R, pEXP5CT-R

MT03--Ft MT03--R2

ATGCTGCGTCGCTTGGCGAATTATCATCCAAA GATTTGGATAGGTTCAAATACT

pCR2.1-F

MT04--F

ATGTCTATCTCCCTTTCTTT

pEXP5CT-F pET101D-F pCR2.1-R, pEXP5CT-R, pET101D-R

MT04--Ft MT04--F-CACC Zc07C01CT-R

ATGCTGCGTCGCTCGGGGAATTACCAGCCCA CACCATGCGTCGCTCGGGGAATTACCAGCCCA GAGCTGGACAGGCTCGATCA

pCR2.1-F

MT05--F

ATGGCGCTCTTCCAACCTGC

pEXP5CT-F pET101D-F pCR2.1-R, pEXP5CT-R, pET101D-R

MT05--Ft MT05--F-CACC MT05--R

ATGAGCCGTCGCTCGGCCAATTACCAGCCCA CACCATGCGTCGCTCGGCCAATTACCAGCCCA CAATAGGATAGGGATGATCA

pCR2.1-F pEXP5CT-F pET101D-F pCR2.1-R, pEXP5CT-R, pET101D-R

MT06--F MT06--Ft MT06--F-CACC MT06--R

ATGGCTACTTGTCAAGCAAA ATGCTGCGTCGCTCGGGGAATTATCATCCAAA CACCATGCGTCGCTCGGGGAATTATCATCCAAA GATTTGGATAGGTTCGAATAAC


MT07

MT08

MT09

MT11

MT12

MT16

MT17

MT19

pCR2.1-F pEXP5CT-F pET101D-F pCR2.1-R, pEXP5CT-R, pET101D-R

MT07--F MT07--Ft MT07--F-CACC MT15--R

ATGGCTACAATGTCTATTTGCGCTCCC ATGCTGCGTCGC TCAGGGAATTATCCGCCAAA CACCATGCGTCGC TCAGGGAATTATCCGCCAAACATATGG GATTTGGATAGGTTCAAATAAC

pENTRD-F

Zc05I02tFt

CTGGTTCCGCGTGGATCCATGGCTGACGTCGAGCAGTGTGA

pCRT7CT-F pET101D-F

MT08MT16--Ft MT08MT16--F-CACC

ATGCTGCGTCGCTCGGGGAATTATCAGCCAAG CACCATGCGTCGC TCGGGGAATTATCAGCCAAG

pENTRD-R pCR2.1-R, pCRT7CT-R, pET101D-R

Zc05I02tR Zc05I02CT-R

GTCATTCTAAATTTGGATAGGTTCG AATTTGGATAGGTTCGAGTA

pCR2.1-F

MT08MT09MT16--F

ATGGCTACTCGTCAAGCAAT

pEXP5CT-F

MT09MT12--Ft

ATGCTGCGTCGC TCGGCGAATTATCATCCAAA

pET101D-F pCR2.1-R, pEXP5CT-R, pET101D-R

MT09MT12--F-CACC MT01MT09--R

CACCATGCGTCGC TCGGCGAATTATCATCCAAA AATTTGGATAGGTTCGAATAAT

pCRT7CT-F

Zc07C01CT-F

ATGAGGAGGTCGGGAAATTACCA

pCRT7CT-R (w/ stop, w/o His-tag) pCRT7CT-R (w/o stop, w/ His-tag)

Zc07C01tR Zc07C01CT-R

CGAGCCTGTCCAGCTCTGAT GAGCTGGACAGGCTCGATCA

pCR2.1-F

MT12--F2

ATGTCTATCTACTATACCAGCACTAC

pEXP5CT-F pET101D-F pCR2.1-R, pEXP5CT-R, pET101D-R

MT09MT12--Ft MT09MT12--F-CACC MT01MT09--R

ATGCTGCGTCGC TCGGCGAATTATCATCCAAA CACCATGCGTCGC TCGGCGAATTATCATCCAAA AATTTGGATAGGTTCGAATAAT

pCR2.1-F

MT08MT09MT16--F

ATGGCTACTCGTCAAGCAAT

pEXP5CT-F pET101D-F pESC-URA-F pCR2.1-R, pEXP5CT-R, pET101D-R pESC-URA-R

MT08MT16--Ft MT08MT16--F-CACC MT08MT16--Ft-AAAACA-BamHI Zc05I02CT-R MT08MT16--R-XmaI

ATGCTGCGTCGCTCGGGGAATTATCAGCCAAG CACCATGCGTCGC TCGGGGAATTATCAGCCAAG CGCGGATCCAAAACAATGCTGCGTCGCTCGGGGAATTATCAGCCAAG AATTTGGATAGGTTCGAGTA TCCGCCCGGGCCAATTTGGATAGGTTCGAGTAAC

pCR2.1-F

MT17--F

ATGCCCCCAGGCAATCCAGTTC

pEXP5CT-F pET101D-F pCR2.1-R, pEXP5CT-R, pET101D-R

MT17--Ft MT17--F-CACC MT17--R

ATGGCGCGTCGCTCGGGCAATTACCAGCCCAAC CACCATGCGTCGCTCGGGCAATTACCAGCCCAAC AATGATAGGGTTGATCAGCAGT

pCR2.1-F

MT19--F

ATGTCTCTCTACTATACCAGCACTACTGTC


ST00

ST01 ST02, ST03

ST05

ST07, ST09

pEXP5CT-F pET101D-F pCR2.1-R, pEXP5CT-R, pET101D-R

MT19--Ft MT19--F-CACC MT19--R

ATGCTGCGTCGCTCGGGGAATCATCAGCCAAG CACCATGCGTCGCTCGGGGAATCATCAGCC GATTTGGATAGGTTCGAGTAAC

pCR2.1-F, pEXP5CT-F

Zd01L13F

ATGGATCTTGACGAGACTCC

pET101D-F pESC-URA-F

Zd01L13F-CACC ST00--F-AAAACA-BamHI

CACCATGGATCTTGACGAGACTCC CGCGGATCCAAAACAATGGATCTTGACGAGACTCC

pCR2.1-R, pEXP5CT-R, pET101D-R pESC-URA-R

Zd01L13R ST00--R-XmaI

AATAGGGATAGGATGAACAAGTAGA TCCGCCCGGGCCAATAGGGATAGGATGAACAAGTAGA

pCR2.1-F, pEXP5CT-F, pET101D-F

ST01--F-CACC

CACCATGGAGAAGCAATCACTAAC

pCR2.1-R, pEXP5CT-R, pET101D-R

ST01--R

TATAGGAACAGGTTCAACTAGCACCA

pCR2.1-F, pEXP5CT-F

ST02--F

ATGGAGAAGCAATCAACCAC

pET101D-F pESC-URA-F pCR2.1-R, pEXP5CT-R, pET101D-R pESC-URA-R

ST02ST03--F-CACC ST02--F-AAAACA-BamHI ST02ST03--R ST02--R-XmaI

CACCATGGAGAAGCAATCAACCAC CGCGGATCCAAAACAATGGAGAAGCAATCAACCAC AATAAGAACAGATTCAACCAACAACATAT TCCGCCCGGGCCAATAAGAACAGATTCAACCAACAACATAT

pCR2.1-F, pEXP5CT-F

ST05--F

ATGGAGAGGCAGTCGATGGC

pET101D-F pCR2.1-R, pEXP5CT-R, pET101D-R

ST05--F-CACC ST05--R

CACCATGGAGAGGCAGTCGATGGC AATAGGAAAGGATTCAACCAATATAA

pCR2.1-F, pEXP5CT-F

ST07--F

ATGGAGAGGCAGTCGATGGT

pET101D-F pESC-URA-F pCR2.1-R, pEXP5CT-R, pET101D-R pESC-URA-R

ST07ST09--F-CACC ST07--F-AAAACA-BamHI ST07ST09--R ST07--R-XmaI

CACCATGGAGAGGCAGTCGATGGT CGCGGATCCAAAACAATGGAGAGGCAGTCGATGGT AATAGGGAAGGATTCAACCAATATGA TCCGCCCGGGCCAATAGGGAAGGATTCAACCAATATGA

Supplementary Table S2 Primers for cloning full-length genes and sub-cloning for expression. In the Category column, pCR2.1, pCRT7CT, pEXP5CT, pET101D and pENTRD denote the pCR2.1-TOPO, pCRT7CT-TOPO, pEXP5CT-TOPO, pET101/D-TOPO and pENTR/D-TOPO vectors, respectively; primers with the F (forward) or R (reverse) suffix were used to amplify the PCR fragment that was then inserted into these vectors. PCR fragments amplified with primers in the pDONR207 category were recombined with the pDONR207 vector in a Gateway BP reaction. PCR fragments amplified with primers in the pESC-URA category were sub-cloned into the pESC-URA vector.
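The primer names in this table encode cloning features that can be checked against the sequences. The sketch below is illustrative: it tests for a 5' CACC overhang (the sequence required on forward primers for directional TOPO cloning), the BamHI recognition site GGATCC and the XmaI recognition site CCCGGG. The function name `cloning_features` is hypothetical; the three primers are the ST00 entries as they appear in the table.

```python
BAMHI_SITE = "GGATCC"  # BamHI recognition sequence
XMAI_SITE = "CCCGGG"   # XmaI recognition sequence

def cloning_features(primer):
    """Report which cloning features are present in a primer sequence."""
    seq = primer.upper()
    return {
        "cacc_overhang": seq.startswith("CACC"),
        "bamhi_site": BAMHI_SITE in seq,
        "xmai_site": XMAI_SITE in seq,
    }

# ST00 primers as listed in Supplementary Table S2.
primers = {
    "Zd01L13F-CACC": "CACCATGGATCTTGACGAGACTCC",
    "ST00--F-AAAACA-BamHI": "CGCGGATCCAAAACAATGGATCTTGACGAGACTCC",
    "ST00--R-XmaI": "TCCGCCCGGGCCAATAGGGATAGGATGAACAAGTAGA",
}

for name, seq in primers.items():
    print(name, cloning_features(seq))
```

A check of this kind can catch transcription errors in long primer tables, because a primer whose name promises a site should contain it.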

[Figure: tree of ginger and turmeric terpene synthase contigs, grouped by class. Scale bar: 0.1.]

Monoterpene: GT1__04301_01 MT01; GT1__02635_01 MT09; GT1__00003_04 MT18; GT1__00771_01 MT06; GT1__02294_01 MT15; GT1__00012_23 MT08; GT1__06567_01 MT13; GT1__09462_01 MT19; GT1__06665_01 MT16; GT1__08969_01 MT02; GT1__11849_02 MT17; GT1__00683_02 MT04; GT1__01002_01 MT11; GT1__02316_01 MT14; GT1__10909_01 MT07; GT1__10261_01 MT10; GT1__00589_02 MT03; GT1__08018_01 MT12; GT1__03268_02 MT05; GT1__11455_01 MT00

Diterpene: GT1__03649_02 (Diterpene synthase); GT1__00262_01 (Levopimaradiene synthase)

Triterpene: GT1__12279_02 (Squalene synthase); GT1__10025_01 (Cycloartenol synthase); GT1__01220_01 (Oxidosqualene cyclase)

Tetraterpene: GT1__01275_01 (Phytoene synthase); GT1__08103_01 (Phytoene synthase); GT1__07772_01 (Phytoene synthase); GT1__10066_01 (Phytoene synthase); GT1__07033_01 (Phytoene synthase); GT1__10103_01 (Phytoene synthase); GT1__11512_01 (Phytoene synthase); GT1__06639_01 (Phytoene synthase); GT1__11595_01 (Phytoene synthase); GT1__07655_01 (Phytoene synthase)

Sesquiterpene: GT1__03132_01 ST01; GT1__01911_01 ST02; GT1__08983_01 ST05; GT1__03156_01 ST00; GT1__00005_18 ST06; GT1__01729_01 ST08; GT1__09742_01 ST07; GT1__13793_01 ST09; GT1__06930_01 ST03; GT1__01157_01 ST04

Supplementary Fig. S1 Ginger and turmeric terpene synthase tree with contigs from cDNA libraries. The tree was generated by ClustalW with 45 GT1 contigs (ginger and turmeric contigs) and 181 additional terpene synthases from GenBank. The non-ginger/turmeric terpene synthases were removed from the tree, and only ginger and turmeric terpene synthases are shown in the figure. For mono- and sesquiterpene synthases, the gene name used in this paper is given next to the contig number. For other terpene synthases, the description next to each contig number is the best hit from BLAST results against the Swiss-Prot/UniProt database.

[Figure: total ion chromatograms, panels A to E (6 to 34 min), with enlarged panels A2, B2 and C2 (7.5 to 14.0 min); x-axis, Time (min). Peaks are numbered 1 to 10, and FPP is labeled in panel E.]

Supplementary Fig. S2 Analysis of MT06B functions when the protein was expressed in E. coli strain BL21 CodonPlus (DE3) RILP. Total ion chromatograms are displayed: pentane blank (A); enzyme assay using E. coli crude extract without the pEXP5CT-MT06B plasmid, with GPP (B) or FPP (D) as a substrate; enzyme assay using E. coli crude extract expressing MT06B, with GPP (C) or FPP (E) as a substrate. A2, B2 and C2 are boxed regions from the A, B and C panels, enlarged to show very small peaks. D and E are shown for comparison with MT06/MT06A. Products/compounds identified include: 1, camphene; 2, α-pinene; 3, limonene; 4, borneol (endo-borneol); 5, tricyclene; 6, β-pinene; 7, cis-sabinene hydrate; 8, p-mentha-1,4(8)-diene (terpinolene); 9, trans-pinene hydrate (trans-pinan-2-ol); 10, p-menth-1-en-8-ol (α-terpineol)

[Figure: total ion chromatograms, panels A to C (7.5 to 13.5 min), with enlarged panels A2, B2 and C2 (8.5 to 13.5 min); x-axis, Time (min); the vertical axis runs from 0 to 100. Peaks are numbered 1 to 12.]

Supplementary Fig. S3 Analysis of MT09A2 functions when the protein was expressed in E. coli strain BL21 CodonPlus (DE3) RILP. Total ion chromatograms are displayed: pentane blank (A); enzyme assay with GPP using E. coli crude extract without the pEXP5CT-MT09A2 plasmid (B) or expressing MT09A2 (C). A2, B2 and C2 are boxed regions from the A, B and C panels, enlarged to show very small peaks. Products/compounds identified include: 1, camphene; 2, α-pinene; 3, limonene; 4, borneol (endo-borneol); 5, tricyclene; 6, β-pinene; 7, γ-terpinene; 8, cis-sabinene hydrate; 9, p-mentha-1,4(8)-diene (terpinolene); 10, trans-pinene hydrate (trans-pinan-2-ol); 11, β-citronellal; 12, p-menth-1-en-8-ol (α-terpineol)

[Figure: total ion chromatograms, panels A to C (6.5 to 14.0 min), with enlarged panels A2, B2 and C2 (8.5 to 13.5 min); x-axis, Time (min); the vertical axis runs from 0 to 100. Peaks are numbered 1 to 10.]

Supplementary Fig. S4 Analysis of MT12A-M2 functions when the protein was expressed in E. coli strain BL21 CodonPlus (DE3) RILP. Total ion chromatograms are displayed: pentane blank (A); enzyme assay with GPP using E. coli crude extract without the pEXP5CT-MT12A-M2 plasmid (B) or expressing MT12A-M2 (C). A2, B2 and C2 are boxed regions from the A, B and C panels, enlarged to show very small peaks. Products/compounds identified include: 1, camphene; 2, α-pinene; 3, limonene; 4, borneol (endo-borneol); 5, tricyclene; 6, β-pinene; 7, cis-sabinene hydrate; 8, p-mentha-1,4(8)-diene (terpinolene); 9, trans-pinene hydrate (trans-pinan-2-ol); 10, p-menth-1-en-8-ol (α-terpineol)

[Figure: total ion chromatograms, panels A to E (6 to 34 min), with enlarged panels A2 to E2 (7.4 to 9.8 min); x-axis, Time (min); the vertical axis runs from 0 to 100. Peaks are numbered 1 to 5.]

Supplementary Fig. S5 Analysis of MT04 functions when the protein was expressed in E. coli strain BL21 CodonPlus (DE3) RILP. Total ion chromatograms are displayed: pentane blank (A); enzyme assay using E. coli crude extract without the pEXP5CT-MT04 plasmid, with GPP (B) or FPP (D) as a substrate; enzyme assay using E. coli crude extract expressing MT04, with GPP (C) or FPP (E) as a substrate. A2, B2, C2, D2 and E2 are boxed regions from the A, B, C, D and E panels, enlarged to show very small peaks. Products/compounds identified include: 1, α-pinene; 2, β-pinene; 3, (R)-(+)-m-mentha-6,8-diene (sylvestrene); 4, sabinene (4(10)-thujene); 5, 1,8-cineole (eucalyptol)

[Figure: total ion chromatograms, panels A to D (6.5 to 10.0 min) and panels E to G (19.5 to 24.0 min); x-axis, Time (min); the vertical axis runs from 0 to 100. Peaks are numbered 1 to 3.]

Supplementary Fig. S6 Analysis of MT08 functions when the protein was expressed in E. coli strain BL21 Star (DE3) pMevT pMBI RIL. Total ion chromatograms are displayed: pentane blank (A, E); enzyme assay with GPP using E. coli crude extract from BL21 Star (DE3) pMevT pMBI RIL without pH9GW-Zc05I02tt (B); pentane extract from BL21 Star (DE3) pMevT pMBI RIL expressing MT08 (C, G), which represents the in vivo activity of MT08; enzyme assay with GPP using E. coli crude extract from BL21 Star (DE3) pMevT pMBI RIL expressing MT08 (D); pentane extract from BL21 Star (DE3) pMevT pMBI RIL without pH9GW-Zc05I02tt (F). Here, Zc05I02 represents MT08, and “tt” in pH9GW-Zc05I02tt stands for “truncated, thrombin”, meaning that the transit peptide was truncated and a thrombin cleavage site was introduced at the N-terminus of the MT08 gene. Products/compounds identified include: 1, β-phellandrene; 2, α-pinene; 3, [Z]-β-farnesene

[Figure: total ion chromatograms, panels A to D (7.5 to 13.5 min) and panels E to G (7.0 to 13.5 min); x-axis, Time (min); the vertical axis runs from 0 to 100. Peaks are numbered 1 to 3.]

Supplementary Fig. S7 Analysis of MT11 functions when the protein was expressed in E. coli strains BL21 CodonPlus (DE3) RIL and BL21 Star (DE3) pMevT pMBI RIL. Total ion chromatograms are displayed: pentane blank (A, E); enzyme assay using E. coli (BL21 CodonPlus (DE3) RIL) crude extract expressing MT11 without GPP (B) or with GPP (D); enzyme assay with GPP using E. coli crude extract without the pCRT7CT-MT11 plasmid (C); pentane extract from BL21 Star (DE3) pMevT pMBI RIL not transformed with pCRT7CT-MT11 (F) or expressing MT11 (G), which represents the in vivo activity of MT11. Products/compounds identified include: 1, 1,8-cineole; 2, p-menth-1-en-8-ol (α-terpineol); 3, α-pinene

[Figure: total ion chromatograms, panels A to D (6 to 21 min), with enlarged panels A2 to D2 (8.8 to 10.4 min); x-axis, Time (min). Mass spectra, panels E to G; x-axis, m/z. Peaks are numbered 1 to 6.]

Supplementary Fig. S8 Analysis of MT03 functions when the protein was expressed in E. coli strain BL21 CodonPlus (DE3) RIL. Total ion chromatograms are displayed: pentane blank (A); enzyme assay using E. coli crude extract expressing MT03 without substrate (B) or with GPP as a substrate (D); enzyme assay with GPP using E. coli crude extract without the pET101/D-MT03 plasmid (C). A2, B2, C2 and D2 are boxed regions from the A, B, C and D panels, enlarged to show very small peaks. The mass spectra of peak 3 (E), β-phellandrene (F) and limonene (G) show that β-phellandrene and limonene co-elute in peak 3. Products/compounds identified include: 1, α-phellandrene; 2, α-terpinene; 3, β-phellandrene (contains limonene); 4, γ-terpinene; 5, p-mentha-1,4(8)-diene (terpinolene); 6, geraniol acetate

[Figure: total ion chromatograms, panels A to C (8.6 to 11.2 min); x-axis, Time (min); the vertical axis runs from 0 to 100. Peaks are numbered 1 to 6.]

Supplementary Fig. S9 Analysis of MT07 functions when proteins were expressed in E. coli strain BL21 CodonPlus (DE3) RILP. Total ion chromatograms are displayed: pentane blank (A); enzyme assay with GPP using E. coli crude extract without pEXP5CT-MT07 plasmid (B) or expressing MT07 (C). Products/compounds identified include: 1, p-mentha-1,4(8)-diene (terpinolene); 2, α-phellandrene; 3, 3-carene; 4, α-terpinene; 5, limonene; 6, γ-terpinene

[Figure: total ion chromatograms, panels A to E (8 to 28 min), panels F to H (6 to 28 min) and panels I to K (8 to 28 min); x-axis, Time (min); the vertical axis runs from 0 to 100. Peaks are numbered 1 to 5, and some peaks are marked *.]

Supplementary Fig. S10 Analysis of MT00 functions when proteins were expressed in E. coli. Total ion chromatograms are displayed: pentane blank (A, F and I); E. coli (Rosetta2 (DE3) pLysS) crude extract without the pH9GW-MT00 plasmid with GPP (B); E. coli crude extract expressing MT00 with GPP (C) or FPP (E) as a substrate, respectively; E. coli crude extract without the pH9GW-MT00 plasmid with FPP (D); enzyme assay with purified MT00 from Rosetta2 (DE3) pLysS containing pH9GW-MT00 with GPP (G) or FPP (H) as a substrate, respectively; pentane extraction from BL21 Star (DE3) pMevT pMBI RIL cells with the pCRT7CT-MT00 plasmid without induction (J) or with induction (K), which represent in vivo activities of MT00. Products/compounds identified include: 1, linalool; 2, [E]-nerolidol; 3, trans-geraniol; 4, geranial (α-citral); 5, neral (β-citral); *, acetated form of linalool and [E]-nerolidol

[Figure S11: total ion chromatograms, panels A-E and enlarged panels A2, D2 and E2; x-axis: Time (min)]

Supplementary Fig. S11 Analysis of MT17A2 functions when proteins were expressed in E. coli strain BL21 CodonPlus (DE3) RILP. Total ion chromatograms are displayed: pentane blank (A); enzyme assay using E. coli crude extract without the pEXP5CT-MT17A2 plasmid with GPP (B) or FPP (D) as a substrate, respectively, or expressing MT17A2 with GPP (C) or FPP (E) as a substrate, respectively. A2, D2 and E2 are the boxed regions of panels A, D and E, enlarged to show very small peaks. Products/compounds identified include: 1, linalool; 2, unknown (7-epi-sesquithujene-like); 3, cis-α-bergamotene; 4, trans-α-bergamotene; 5, γ-curcumene; 6, unknown (β-sesquiphellandrene-like); 7, cis-α-bisabolene; 8, β-bisabolene; 9, β-sesquiphellandrene; 10, unknown (cis-α-bisabolene-like); 11, [E]-nerolidol; 12, epi-α-bisabolol; 13, α-bisabolol; 14, [E]-β-farnesene.

[Figure S12: chiral-column chromatograms of (±)-linalool, (R)-(-)-linalool and the MT00 product (upper set) and of (±)-linalool, (R)-(-)-linalool and the MT17A2 product (lower set); x-axis: Time (min)]

Supplementary Fig. S12 Analysis of linalools produced by MT00 and MT17A2 on a chiral column. MT00 produces (S)-(+)-linalool and MT17A2 synthesizes (R)-(-)-linalool.

[Figure S13: total ion chromatograms, panels A-D; x-axis: Time (min)]

Supplementary Fig. S13 Analysis of MT06/MT06A functions when proteins were expressed in E. coli strain BL21 CodonPlus (DE3) RILP with GPP as a substrate. Total ion chromatograms are displayed: pentane blank (A); enzyme assay using E. coli crude extract without the pEXP5CT-MT06 or pEXP5CT-MT06A plasmid with GPP as a substrate (B); enzyme assay using E. coli crude extract expressing either MT06 (C) or MT06A (D) with GPP as a substrate. Products/compounds identified include: 1, sabinene (4(10)-thujene); 2, α-terpinene; 3, γ-terpinene; 4, cis-sabinene hydrate; 5, p-mentha-1,4(8)-diene (terpinolene); 6, linalool; 7, p-menth-1-en-4-ol (terpinen-4-ol); 8, p-menth-1-en-8-ol (α-terpineol); 9, α-thujene (3-thujene); 10, myrcene.

[Figure S14: total ion chromatograms (panels A-D; x-axis: Time (min)); mass spectra (panels E-J; x-axis: m/z); retention time alignment (panel K; x-axis: Adams library (RT), y-axis: Sample (RT))]

Supplementary Fig. S14 Analysis of MT06/MT06A functions when proteins were expressed in E. coli strain BL21 CodonPlus (DE3) RILP with FPP as a substrate. Total ion chromatograms are displayed: pentane blank (A); enzyme assay using E. coli crude extract without the pEXP5CT-MT06 or pEXP5CT-MT06A plasmid with FPP as a substrate (B); enzyme assay using E. coli crude extract expressing either MT06 (C) or MT06A (D) with FPP as a substrate. Mass spectra are shown for peak 1 (E), peak 6 (G) and peak 8 (I), and for 7-epi-sesquithujene (F), cis-sesquisabinene hydrate (H) and trans-sesquisabinene hydrate (J) from the library. Panel K shows the alignment of two sets of retention times, one from the Adams library (x-axis) and the other from our sample (y-axis), where the yellow dot represents unknown (7-epi-sesquithujene-like) (peak 1), the red dot represents unknown (cis-sesquisabinene hydrate-like) (peak 6) and the green dot represents unknown (trans-sesquisabinene hydrate-like1) (peak 8). Products/compounds identified include: 1, unknown (7-epi-sesquithujene-like); 2, γ-curcumene; 3, β-curcumene; 4, β-sesquiphellandrene; 5, [E]-γ-bisabolene; 6, unknown (cis-sesquisabinene hydrate-like); 7, [E]-nerolidol; 8, unknown (trans-sesquisabinene hydrate-like1); 9, epi-β-bisabolol; 10, epi-α-bisabolol; 11, [E]-β-farnesene.

[Figure S15: total ion chromatograms, panels A-C and enlarged panels A2-C2; x-axis: Time (min)]

Supplementary Fig. S15 Analysis of MT02A functions when proteins were expressed in yeast strain EPY224. Total ion chromatograms are displayed: pentane blank (A); EPY224 with MT02A expression (B); EPY224 without the pESC-URA-MT02A plasmid (C). The boxed regions of A, B and C are enlarged in panels A2, B2 and C2 to show very small peaks. Products/compounds identified include: 1, epi-α-bisabolol; 2, α-bisabolol; 3, [E]-nerolidol; 4, farnesol; 5, [Z]-α-bisabolene; 6, β-bisabolene; 7, trans-α-bergamotene.

[Figure S16: total ion chromatograms, panels A-C; x-axis: Time (min)]

Supplementary Fig. S16 Analysis of ST02A4 functions when proteins were expressed in yeast strain EPY224. Total ion chromatograms are displayed: pentane blank (A); EPY224 without the pESC-URA-ST02A4 plasmid (B); EPY224 expressing ST02A4 (C). Products/compounds identified include: 1, (-)-neointermedeol; 2, β-elemene; 3, δ-elemene; 4, [E]-caryophyllene; 5, γ-elemene; 6, germacrene D; 7, α-muurolene; 8, δ-cadinene (cadina-1(10),4-diene); 9, unknown (selina-6-en-4-ol-like); 10, unknown (cubenol-like); 11, unknown (spathulenol-like); 12, γ-eudesmol; 13, epi-α-muurolol (ô-muurolene); 14, α-muurolol (δ-cadinol); 15, α-cadinol; 16, (+)-intermedeol; 17, [E]-β-farnesene; 18, [E]-nerolidol; *, unknown.

[Figure S17: total ion chromatograms, panels A-D; x-axis: Time (min)]

Supplementary Fig. S17 Analysis of ST02B functions when proteins were expressed in E. coli strain BL21 CodonPlus (DE3) RIL with GPP as a substrate. Total ion chromatograms are displayed: pentane blank (A); enzyme assay using E. coli crude extract expressing ST02B without GPP (B) or with GPP as a substrate (D); enzyme assay using E. coli crude extract without the pET101/D-ST02B plasmid with GPP as a substrate (C). Products/compounds identified include: 1, myrcene; 2, limonene; 3, (Z)-β-ocimene; 4, (E)-β-ocimene; 5, p-mentha-1,4(8)-diene (terpinolene); 6, linalool; 7, p-menth-1-en-8-ol (α-terpineol).

[Figure S18: total ion chromatograms, panels A-E; x-axis: Time (min)]

Supplementary Fig. S18 Analysis of ST02B functions when proteins were expressed in E. coli strain BL21 CodonPlus (DE3) RIL with FPP as a substrate. Total ion chromatograms are displayed: pentane blank (A); enzyme assay using E. coli crude extract expressing ST02B without FPP (B) or with FPP (D, E); enzyme assay using E. coli crude extract without the pET101/D-ST02B plasmid with FPP (C). The enzyme assay was overlaid with pentane. After 3 hours of incubation at 30 °C, either the top pentane layer was injected directly into the GC/MS (D), or the whole assay, including the remaining pentane layer, was vortexed and centrifuged and the collected pentane was injected into the GC/MS (E). Products/compounds identified include: 1, α-elemol; 2, unknown ((+)-cyclosativene-like); 3, α-copaene; 4, β-elemene; 5, γ-elemene; 6, α-muurolene; 7, δ-cadinene (cadina-1(10),4-diene); 8, germacrene B; 9, germacrene D; *, unknown.

[Figure S19: total ion chromatograms, panels A-D; x-axis: Time (min)]

Supplementary Fig. S19 Analysis of ST02C functions when proteins were expressed in E. coli strain BL21 CodonPlus (DE3) RIL with GPP as a substrate. Total ion chromatograms are displayed: pentane blank (A); enzyme assay using E. coli crude extract expressing ST02C without GPP (B) or with GPP as a substrate (D); enzyme assay using E. coli crude extract without the pET101/D-ST02C plasmid with GPP as a substrate (C). Products/compounds identified include: 1, myrcene; 2, limonene; 3, (Z)-β-ocimene; 4, (E)-β-ocimene; 5, p-mentha-1,4(8)-diene (terpinolene); 6, linalool; 7, p-menth-1-en-8-ol (α-terpineol).

[Figure S20: total ion chromatograms, panels A-C; x-axis: Time (min)]

Supplementary Fig. S20 Analysis of ST02C functions when proteins were expressed in E. coli strain BL21 CodonPlus (DE3) RIL with FPP as a substrate. Total ion chromatograms are displayed: pentane blank (A); enzyme assay using E. coli crude extract without the pET101/D-ST02C plasmid with FPP as a substrate (B); enzyme assay using E. coli crude extract expressing ST02C with FPP as a substrate (C). Products/compounds identified include: 1, β-elemene; 2, δ-elemene; 3, unknown ((+)-cyclosativene-like); 4, (+)-cyclosativene; 5, α-copaene; 6, unknown (β-elemene-like); 7, [E]-caryophyllene; 8, γ-elemene; 9, γ-muurolene; 10, germacrene D; 11, α-muurolene; 12, δ-cadinene (cadina-1(10),4-diene); 13, germacrene B.

[Figure S21: total ion chromatograms, panels A-D; x-axis: Time (min)]

Supplementary Fig. S21 Analysis of ST03 functions when proteins were expressed in E. coli strain BL21 CodonPlus (DE3) RIL with GPP as a substrate. Total ion chromatograms are displayed: pentane blank (A); enzyme assay using E. coli crude extract expressing ST03 without GPP (B) or with GPP as a substrate (D); enzyme assay with GPP as a substrate using E. coli crude extract containing the pET101/D-ST03 plasmid but without induction of ST03 expression (C). Products/compounds identified include: 1, myrcene; 2, (Z)-β-ocimene; 3, p-mentha-1,4(8)-diene (terpinolene); 4, linalool; 5, cis-p-menth-2-en-1-ol; 6, p-menth-1-en-8-ol (α-terpineol); 7, limonene; 8, (E)-β-ocimene.

[Figure S22: total ion chromatograms, panels A-C; x-axis: Time (min)]

Supplementary Fig. S22 Analysis of ST03 functions when proteins were expressed in E. coli strain BL21 CodonPlus (DE3) RIL with FPP as a substrate. Total ion chromatograms are displayed: pentane blank (A); enzyme assay using E. coli crude extract expressing ST03 without FPP (B) or with FPP as a substrate (C). Products/compounds identified include: 1, γ-amorphene; 2, allo-aromadendrene; 3, γ-cadinene; 4, germacrene D-4-ol; 5, germacrene D.

[Figure S23: total ion chromatograms, panels A-D; x-axis: Time (min)]

Supplementary Fig. S23 Analysis of ST05/ST05A functions when proteins were expressed in E. coli strain BL21 CodonPlus (DE3) RILP. Total ion chromatograms are displayed: pentane blank (A); enzyme assay with FPP as a substrate using E. coli crude extract without the pEXP5CT-ST05 or pEXP5CT-ST05A plasmid (B); enzyme assay with FPP as a substrate using E. coli crude extract expressing ST05 (C); enzyme assay with FPP as a substrate using partially purified ST05A (D). Products/compounds identified include: 1, α-humulene (α-caryophyllene); 2, [E]-caryophyllene (β-caryophyllene); 3, β-elemene; 4, 1,5,9-trimethyl-1,5,9-cyclododecatriene; 5, [E]-β-farnesene.

[Figure S24: total ion chromatograms, panels A-E; x-axis: Time (min)]

Supplementary Fig. S24 Analysis of ST01 functions when proteins were expressed in E. coli strain BL21 CodonPlus (DE3) RIL and BL21 CodonPlus (DE3) RILP. Total ion chromatograms are displayed: pentane blank (A); enzyme assay using E. coli (BL21 CodonPlus (DE3) RIL) crude extract expressing ST01 from the pET101/D-ST01 plasmid without FPP (B) or with FPP as a substrate (D); enzyme assay using E. coli (BL21 CodonPlus (DE3) RILP) crude extract without either the pET101/D-ST01 or the pEXP5CT-ST01 plasmid with FPP as a substrate (C); enzyme assay using E. coli (BL21 CodonPlus (DE3) RILP) crude extract expressing ST01 from the pEXP5CT-ST01 plasmid with FPP as a substrate (E). Products/compounds identified include: 1, β-selinene (eudesma-4(14),11-diene); 2, 7-epi-α-selinene; 3, β-elemene; 4, unknown (eremophila-1(10),11-diene-like); 5, unknown (β-chamigrene-like); 6, unknown (guaia-1(5),7(11)-diene-like); 7, (+)-intermedeol; *, unknown.

[Figure S25: total ion chromatograms, panels A-C and enlarged panels A2-C2; x-axis: Time (min)]

Supplementary Fig. S25 Analysis of ST07A functions when proteins were expressed in yeast strain EPY224. Total ion chromatograms are displayed: pentane blank (A); pentane extract of EPY224 without the pESC-URA-ST07A plasmid (B); pentane extract of EPY224 expressing ST07A (C). The boxed regions of A, B and C are enlarged in panels A2, B2 and C2 to show very small peaks. Products/compounds identified include: 1, caryophyllenyl alcohol; 2, [E]-nerolidol; 3, farnesol; 4, [E]-caryophyllene.

[Figure S26: total ion chromatograms (panels A-D) and single ion chromatograms (panels A2-D2), x-axis: Time (min); mass spectra of peaks 1-7, x-axis: m/z]

Supplementary Fig. S26 Analysis of ST07 functions when proteins were expressed in yeast strain EPY224. Total ion chromatograms are displayed: pentane blank (A); pentane extract of EPY224 without the pESC-URA-ST07 plasmid (B); pentane extract of EPY224 expressing ST07 (C); pentane extract of EPY224 expressing ST07A (D). A2, B2, C2 and D2 are single ion chromatograms (m/z 91) of A, B, C and D. The most abundant ion in [E]-caryophyllene is m/z 91. Products/compounds identified include: 1, 2 and 5, unknown; 3 and 4, caryophyllenyl alcohol; 6 and 7, [E]-caryophyllene.

APPENDIX C - AN APPROACH FOR PEAK DETERMINATION AND QUANTIFICATION IN EI-GC/MS ANALYSIS OF COMPLEX BIOLOGICAL DATASETS AND ITS APPLICATION TO METABOLOMIC INVESTIGATIONS

The manuscript “An approach for peak determination and quantification in EI-GC/MS analysis of complex biological datasets and its application to metabolomic investigations” is currently being edited for submission. We plan to submit this manuscript to Metabolomics.

An approach for peak determination and quantification in EI-GC/MS analysis of complex biological datasets and its application to metabolomic investigations

Hyun Jo Koo a,b, David R. Gang a,b,*

a Department of Plant Sciences, College of Agriculture and Life Sciences, University of Arizona, Tucson, AZ 85721, USA
b Bio5 Institute, University of Arizona, Tucson, AZ 85721, USA

* Corresponding author: David R. Gang, Department of Plant Sciences and BIO5 Institute, University of Arizona, Tucson, AZ 85721-0036, USA. Tel: 520-621-7154; Fax: 520-621-7186; email: [email protected]

Abstract

Quadrupole-based electron impact GC/MS (EI-Q-GC/MS) analysis has long been used for metabolite analysis, with spectral libraries used for metabolite identification. However, identification is sometimes hard, and comparing compounds between samples is often even more difficult. Compared with data analysis based on the total ion chromatogram (TIC), data manipulation with single ion chromatograms (SICs) for each nominal mass-to-charge ratio from 50 to 650 m/z theoretically gives 601 times more peak information and resolving power. Using SICs, we demonstrate how peak deconvolution, together with the selection of common peaks between samples, even when the peaks are very small and represent unknown compounds, can be used to increase identification capacity and compound characterization. The approach reaches its limit when samples contain very large peaks, caused by high concentrations of specific metabolites, which can distort the SICs of small embedded peaks and even their own SICs.

1 Introduction

Quadrupole-based electron impact GC/MS (EI-Q-GC/MS) analysis has been the workhorse and benchmark tool for metabolite analysis, for both volatile and derivatized (e.g., TMS) non-volatile compounds, for several decades. Newer technologies, such as GC-TOF-MS, are gaining ground and have proven superior to Q-GC/MS in many respects, especially sensitivity, scan speed and the ability to produce data suitable for rapid and relatively accurate deconvolution, which allows better identification and quantification of more compounds in complex sample mixtures and matrices. Nevertheless, Q-GC/MS instruments remain the most affordable, and therefore the most accessible, instruments for most researchers around the world. However, their slower scan speeds, and the resulting longer run times and/or poorer resolving power, present serious problems for the analysis of complex biological samples such as those used in metabolomics-based experiments.

Ideally, each peak detected in a GC/MS chromatogram would represent a single compound. In practice, this is usually not the case. When the mass spectra underlying total ion chromatograms (TICs) of complex biological samples are examined in detail, apparent TIC peaks often, if not usually, turn out to be a conglomeration of overlapping, partially coeluting compounds. Several software packages have been developed to deconvolute such overlapping peaks. XCMS [1], which works well for electrospray ionization (ESI) or atmospheric pressure chemical/photo ionization (APCI/APPI) LC/MS data, does not handle EI-(Q or TOF)-GC/MS data very well. AMDIS, which was developed to deconvolute EI-Q-GC/MS data, is a sample-based program (it analyzes one sample at a time) that relies on a database, such as the NIST or Wiley EI-based libraries, to aid in compound detection and identification. Compounds not in the database are difficult or impossible to detect and quantify accurately with that program. Leco developed ChromaTOF, their own in-house deconvolution software, which works well with data from their very fast scanning TOF instruments but does not work well for detecting and quantifying coeluting compounds in typical EI-Q-GC/MS data.

In this manuscript we describe an approach that is able to detect and quantify thousands of compounds from EI-Q-GC/MS analyses of complex biological samples. This approach analyzes data from multiple samples to find the common peaks and the unique peaks in those samples. It takes advantage of the fact that, although TIC peaks may overlap and have very similar retention times (RT), analysis of selected ion chromatograms (SICs) often reveals “clearer” peaks that do not completely co-elute. The major hypothesis underlying this approach is that the ions produced from any given compound should share a common retention time in EI-GC/MS analysis. Identifying which ions, represented by specific SICs, coelute allows these “ions” to be grouped into a new “peak”, which can then be used to generate mass spectra that can be searched against spectral libraries for compound identification, or to generate relative quantitation values for sample comparisons. A key component of our approach is that such peaks from different samples (technical replicates, biological samples, etc.) are compared using Pearson correlation coefficients calculated from the ion peak areas of the “ions” common to each pair of compared peaks.

2 Materials and Methods

2.1 Plant samples

One variety of ginger (Zingiber officinale Rosc.), called yellow ginger (GY), and two varieties of turmeric (Curcuma longa L.), called Fat Mild Orange (FMO) and Thin Yellow Aromatic (TYA), were grown in a greenhouse for varying lengths of time (2, 3, 4, 6 and 7 months for GY; 3, 5 and 7 months for FMO and TYA), and metabolites were then extracted from tissues (rhizome, root and leaf) as previously described [2] [3]. Most samples have 4 biological replicates, and some have 2 or 3.

2.2 Metabolite analysis

A Thermo Finnigan Trace GC 2000 with an Rtx-5MS column with a 5 m Integra-Guard (Restek; 0.25 mm ID, 0.25 µm df, 30 m), coupled to a DSQ mass spectrometer, was used for gas chromatography/mass spectrometry (GC/MS) analysis. Eluted compounds were identified with the Thermo Finnigan Xcalibur program (v. 1.3) using the NIST/EPA/NIH Mass Spectral Library (NIST 02) and the essential oil GC/MS mass spectral library of Dr. Robert P. Adams [4].

2.3 Data processing

Individual selected ion chromatograms (SICs) for each nominal mass-to-charge ratio from 50 to 650 m/z were extracted from Thermo Finnigan Xcalibur v. 1.3 RAW files using a custom program operated by Xcalibur executable files. The output file format for this analysis is “MW, RT, peak area” (example in Supplementary Table 1). Three Perl scripts (mass1N2-loader_v3.pl, mass3_RT_adj_v2.pl, and mass4_match_peaks_v2.pl or mass4_match_peaks_v2_no_log.pl), which use a MySQL database for storage, were applied to the file for each sample (66 files in total). There are two versions of the script that performs the last step of the analysis, one of which generates log files. The log files are very large, but they show in detail how peaks are selected. The Perl scripts and input/output file formats are available in the Supplementary data.
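As an illustration of the “MW, RT, peak area” output format, the following minimal Python sketch reads such records into memory. It is not one of the Perl scripts named above. It assumes one comma-separated record per line; the names `IonPeak` and `parse_sic_records` and the sample values are hypothetical.

```python
from typing import List, NamedTuple


class IonPeak(NamedTuple):
    mz: int       # nominal m/z (the "MW" column)
    rt: float     # retention time in minutes
    area: float   # SIC peak area (arbitrary units)


def parse_sic_records(text: str) -> List[IonPeak]:
    """Parse 'MW, RT, peak area' lines; blank lines and '#' comments are skipped."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        mz, rt, area = (field.strip() for field in line.split(","))
        records.append(IonPeak(int(mz), float(rt), float(area)))
    return records


# Illustrative values only; they are not taken from Supplementary Table 1.
sample = """\
# MW, RT, peak area
68, 9.1200, 250000
93, 9.1210, 180000
136, 9.1195, 40000
"""
records = parse_sic_records(sample)
for record in records:
    print(record)
```

Keeping each ion as an (m/z, RT, area) record is what allows the later steps to group ions by retention time and to compare ion areas between samples.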

3 Results and Discussion

As mentioned briefly in the Introduction, the goal of this investigation was to develop a means to identify and quantify compounds present in very complex biological samples that could be applied to traditional EI-Q-GC/MS data as well as to newer GC/MS-based approaches. Figure 1 outlines the process used to achieve this aim. For example, a peak at RT 9.12 in a TIC consists of ions with m/z of 68, 93, 97, 79, 94, 136, 107, 53, etc., as shown in its mass spectrum. The SICs of these ions (m/z 68, 93, 97, 79, 94, 136, 107 and 53) are shown. To identify a peak using SICs, the most abundant ion, m/z 68, is selected first. Based on this ion, other ions with peaks around RT 9.12 in their SICs, such as m/z 93, 97, 79, 94, 136, 107 and 53, are collected (Fig. 1 shows only the 8 largest ions, but the actual process collects small ions too). Finally, a new peak at RT 9.12 is created from the collected SIC ions; it is very similar to the peak shown in the TIC at RT 9.12. The peak area of the newly created peak is the sum of the peak areas of the SIC ions. When the peak is compared with peaks in other samples, the areas of the ions in the SICs are compared to yield Pearson correlation coefficients. A brief data processing flow chart is shown in Fig. 2, with details in sections 3.6-3.8.
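The peak-building step described above can be sketched in Python as a greedy grouping: take the largest remaining ion as a seed, collect every ion whose SIC peak lies within an RT window around it, and sum the collected areas. This is a simplified sketch and not the authors' Perl implementation. The function name, the data layout, the example values and the assumption that each m/z occurs at most once per window are ours; the 0.005 min default half-window is the value the text settles on in section 3.6.

```python
from typing import List, Tuple

Ion = Tuple[int, float, float]  # (m/z, retention time in min, SIC peak area)


def group_ions(ions: List[Ion], half_window: float = 0.005) -> List[dict]:
    """Greedily group ions into peaks around the largest remaining ion."""
    remaining = sorted(ions, key=lambda ion: ion[2], reverse=True)
    peaks = []
    while remaining:
        seed_rt = remaining[0][1]  # RT of the largest ion not yet assigned
        members = [ion for ion in remaining if abs(ion[1] - seed_rt) <= half_window]
        remaining = [ion for ion in remaining if abs(ion[1] - seed_rt) > half_window]
        peaks.append({
            "rt": seed_rt,
            "ions": {mz: area for mz, _, area in members},
            "area": sum(area for _, _, area in members),  # peak area = sum of ion areas
        })
    return peaks


# Illustrative ions only: one compound near RT 9.12 and another near RT 9.30.
ions = [
    (68, 9.120, 900), (93, 9.121, 700), (136, 9.119, 200),
    (91, 9.300, 500), (119, 9.302, 300),
]
peaks = group_ions(ions)
for peak in peaks:
    print(peak["rt"], sorted(peak["ions"]), peak["area"])
```

The sketch omits the intensity, area and ion-count thresholds and the peak combining step, which are separate parts of the procedure.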

3.1 Peak areas from TIC and from sum of SIC are similar

Theoretically, the TIC peak area should equal the sum of the SIC peak areas of all ions. We found this to be the case, more or less, in all of our samples. For example, in the F60Rh (FMO-rhizome, 3 month old) sample, ion 119 has the largest SIC peak area at RT 22.8591, and 95 more ions were detected at RT 22.8591 ± 0.01. The peak area from the TIC at RT 22.86 is 1641158014 and the sum of the 96 SIC peak areas is 1604363826, which is 97.758 % of the TIC peak area. In addition, a medium-sized peak with a TIC peak area of 42000544 has a summed SIC peak area of 42076490, which is 100.181 % of the TIC peak area. Another, small peak with a TIC peak area of 5053559 has a summed SIC peak area of 4230362, which is 83.711 % of the TIC peak area. Other examples, with peak areas calculated from ions collected within RT ± 0.005, gave 106.3%, 96.0%, 104.9% and 101.1%. Because there were no large differences between the TIC peak areas and the summed SIC peak areas, the summed SIC peak areas still represent the original data.
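The percentages quoted in this section can be rechecked with a few lines of Python. The areas are those given in the text; the helper name `percent_of_tic` is ours.

```python
def percent_of_tic(sic_sum: float, tic_area: float) -> float:
    """Summed SIC peak area expressed as a percentage of the TIC peak area."""
    return 100.0 * sic_sum / tic_area


# (label, TIC peak area, summed SIC peak area), as reported in section 3.1
examples = [
    ("large peak", 1641158014, 1604363826),
    ("medium peak", 42000544, 42076490),
    ("small peak", 5053559, 4230362),
]
percentages = {label: percent_of_tic(sic, tic) for label, tic, sic in examples}
for label, value in percentages.items():
    print(f"{label}: {value:.3f} %")
```

The three values agree with the 97.758 %, 100.181 % and 83.711 % reported above to within rounding.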

3.2 Deconvolution of similar sized peaks at very similar RTs using SICs

In the TICs of several FMO-rhizome samples, two peaks elute together around RT 26.8 (Fig. 3). The figure shows 5 FMO-rhizome samples, which appear to have either one peak or two peaks according to the TIC. However, all five samples have two peaks when their mass spectra are examined. The compound at RT 26.78 is 2,4,5,6-tetrachloro-m-xylene, the standard3 for GC/MS, and the compound at RT 26.84 is tentatively identified as sesquisabinene hydrate according to the GC/MS library; we will refer to it as sesquisabinene hydrate in this paper. Peaks of oxidized compounds are normally skewed to the right, and their retention times shift with the amount of compound. Here the peaks of sesquisabinene hydrate in the first and second chromatograms are small and those in the third to fifth chromatograms are larger (Fig. 3). When we look at the SIC of m/z 69, which is the most abundant ion of sesquisabinene hydrate and negligible in 2,4,5,6-tetrachloro-m-xylene (Fig. 4), we can see that sesquisabinene hydrate is present in the first and second panels, although in low amounts. The most abundant ion of 2,4,5,6-tetrachloro-m-xylene is m/z 207. However, sesquisabinene hydrate also produces a small amount of ion 207. So instead of m/z 207, Fig. 5 shows the SIC at m/z 244, which is the second most abundant ion of 2,4,5,6-tetrachloro-m-xylene and negligible in sesquisabinene hydrate. Compared with 2,4,5,6-tetrachloro-m-xylene, the amount of sesquisabinene hydrate from the plant is more variable, because 2,4,5,6-tetrachloro-m-xylene is one of our GC/MS standards and the same amount was used in each sample. These two peaks can be deconvoluted by our process, and the successfully deconvoluted GC/MS standard3 peaks from FMO-rhizome samples are shown in Table 1.
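The choice of m/z 69 and m/z 244 in this section follows a simple rule: pick ions that are abundant in one compound and negligible in the other. A minimal Python sketch of that rule is given below. The function name, the 1% cut-off and all relative abundances are hypothetical; the spectra are invented so that the outcome mirrors the choice described in the text, and they are not measured data.

```python
from typing import Dict, List


def diagnostic_ions(target: Dict[int, float], other: Dict[int, float],
                    max_other_fraction: float = 0.01) -> List[int]:
    """Ions of `target`, most abundant first, that are negligible in `other`.

    An ion counts as negligible when its abundance in `other` is at most
    `max_other_fraction` of the base peak of `other` (an assumed cut-off).
    """
    other_base = max(other.values())
    by_abundance = sorted(target, key=target.get, reverse=True)
    return [mz for mz in by_abundance
            if other.get(mz, 0.0) / other_base <= max_other_fraction]


# Invented relative abundances (m/z -> % of base peak), for illustration only.
standard = {207: 100.0, 244: 85.0, 209: 60.0}   # stands in for the GC/MS standard
hydrate = {69: 100.0, 93: 60.0, 207: 5.0}       # stands in for sesquisabinene hydrate

print(diagnostic_ions(standard, hydrate))  # m/z 207 is excluded: it also occurs in `hydrate`
print(diagnostic_ions(hydrate, standard))
```

With these invented spectra, m/z 207 is rejected for the standard because the other compound also produces it, which leaves m/z 244 as the first diagnostic ion.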

3.3 Deconvolution of embedded peaks using SICs

The previous section dealt with the deconvolution of similar-sized peaks. Fig. 6 is an example of peak deconvolution in which one peak is embedded in another because of a large difference in peak sizes. The peak at RT 23.11 is hidden under the peak at RT 23.16, and only the larger peak is visible in the TIC (first chromatogram of Fig. 6). Ions of m/z 62, 85, 124, 164, 172, 201, and 203 for the peak at RT 23.11 become visible when we look at the SICs. The peak at RT 23.16 is β-bisabolene, according to the NIST library. However, we do not know what the compound at RT 23.11 is, because of the small amount of this compound in this chromatogram. After normal processing using the instrument’s data processing software, only the peak at RT 23.16 was collected; the peak at RT 23.12 was not collected because it contains only four ions with intensities larger than 5000 arbitrary units (203, 201, 124, 61). Although its peak area (167731) exceeded the limit set for peak selection, the small number of ions prevented it from being selected as a peak. Regardless of peak selection, these two peaks, RT 23.12 and RT 23.16, were successfully deconvoluted.

3.4 Limitation of deconvolution using SICs

Deconvolution using SICs is limited if two peaks come from very similar compounds that share many of the same ions, because the SICs of those shared ions are not easily separated. If the two peaks are of similar size, the shared ions may appear midway between the two peaks and eventually be assigned to just one of them. The other peak is then missing some ions, which in turn can lead to erroneous identification when searching against spectral databases, or when searching for and comparing the same compound among different samples. However, this problem may not be a significant issue in our approach, because we compare only the ions shared between peaks when calculating similarity. The number of matching ions is nevertheless important, because we also consider the ratio of common ions based on the smaller ion number of the two compared peaks.

When the peak sizes differ greatly, as shown in Fig. 6, deconvolution is worse. Ion detection in the smaller peak is often inefficient because of possible masking by the covering peak. In Fig. 6, the height of the peak at RT 23.16 in the TIC is 3.18E7 and the height of the most abundant ion in this peak (m/z 69) is 3.67E6 (data not shown). However, the height of the most abundant ion (m/z 203) of the embedded peak at RT 23.11 is only 3.75E4, which is only about 1% of the height of the most abundant ion of the larger peak, and insignificant ions of the larger peak are sometimes higher than the main ions of the embedded peak. In this case, the major ions of embedded peaks are “absorbed” into the ion lists of covering peaks, which can cause embedded peaks to be missing some ions and covering peaks to have larger areas for some ions. In the example of Fig. 6, the RT difference is about 0.05 and the ions from the two peaks still appear separately. However, if the embedded peak is closer to the covering peak, it is not easy to detect those ions separately.
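The comparison of two peaks on their shared ions can be sketched in Python as follows. This is a minimal sketch, not the authors' script: it computes a Pearson correlation coefficient over the areas of the ions common to both peaks, plus the fraction of common ions relative to the peak with fewer ions, which is our reading of the ratio mentioned above. The function names and the example spectra are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; NaN if it is undefined."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def compare_peaks(peak_a: Dict[int, float],
                  peak_b: Dict[int, float]) -> Tuple[float, float]:
    """Return (Pearson r over shared ions, shared ions / ions in smaller peak)."""
    if not peak_a or not peak_b:
        return float("nan"), 0.0
    shared = sorted(set(peak_a) & set(peak_b))
    r = pearson([peak_a[mz] for mz in shared], [peak_b[mz] for mz in shared])
    ratio = len(shared) / min(len(peak_a), len(peak_b))
    return r, ratio


# Illustrative peaks (m/z -> ion area): the same compound at twice the amount,
# with one extra ion detected in the larger peak.
peak_a = {68: 100.0, 93: 80.0, 136: 20.0}
peak_b = {68: 200.0, 93: 160.0, 136: 40.0, 121: 10.0}
r, ratio = compare_peaks(peak_a, peak_b)
print(r, ratio)
```

Because only shared ions enter the correlation, an ion missing from one peak does not lower r by itself; the ratio of common ions is what penalizes peaks that share few ions.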

3.5 Peak combining algorithm

A single peak can be split into two or more peaks when neighboring peaks are rich in specific ions that also exist in this peak. This situation can cause an apparent RT shift of some specific ions. For example, caryophyllene oxide at RT 25.63 in the FMO-leaf 7M sample (F-L-7M-v435_F275-F272L7M) has 62 ions with ion intensity higher than 5000 in its SICs (Fig. 7). Here, the ion at m/z 69 was selected first and other ions were collected within the RT range between 25.6019 and 25.6319 (25.6169, the RT of ion m/z 69, ± 0.015; Group A); 18 other ions were collected (TEXT 1). However, 17 of these 19 ions are also within the search range of another ion, m/z 79, with an RT range between 25.6191 and 25.6491 (25.6341, the RT of ion m/z 79, ± 0.015; Group B). The two ions that belong to Group A but not to Group B are m/z 69 and 103, shown in the second and third chromatograms in Fig. 7. Although m/z 69 has the largest area among the ions in this peak, that ion is more abundant in a neighboring peak at RT 25.48, causing an apparent RT shift. The ion with m/z 103 has a peak area of 5869, which is small considering that the area limit for ion peak detection is 5000, and an RT shift occurs more readily when the ion peak is very small because of background noise. In this case, Group A and Group B are combined into one peak. More details about the peak combining algorithm, including its parameters, are given in section 3.6.
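The two search windows in the caryophyllene oxide example can be reproduced as follows. This is a minimal sketch: `rt_window` is a hypothetical helper name, and only the RT values, the ± 0.015 half-width, and the 17-of-19 count come from the text.

```python
def rt_window(center, half_width=0.015):
    """Return the (low, high) RT search window around an ion's RT, in minutes."""
    return (round(center - half_width, 4), round(center + half_width, 4))


group_a = rt_window(25.6169)   # window seeded by ion m/z 69
group_b = rt_window(25.6341)   # window seeded by ion m/z 79

# An ion whose RT lies in this interval is inside both windows.
overlap = (max(group_a[0], group_b[0]), min(group_a[1], group_b[1]))

# 17 of the 19 Group A ions also lie inside the Group B window.
shared_fraction = 17 / 19

print(group_a, group_b, overlap, round(shared_fraction, 3))
```

With about 89% of the Group A ions shared, the two groups clearly describe the same chromatographic peak, which is why they are merged.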

3.6 Parameters for ion and peak detection and deconvolution

We limited ion intensity to 5000 and peak area to 1E5 for inclusion in our analysis. Ions with an intensity below 5000 are sometimes hard to distinguish from the background. If the peak area, the sum of the collected ion intensities, is less than 1E5, it is hard to say whether the peak represents a real metabolite, so peaks with small areas are not selected. Also, if the number of ions constituting a peak is equal to or less than 6, the peak is not selected because there is insufficient information to compare it with other peaks.

To determine the retention time window to use around a selected larger ion when deciding which neighboring ions constitute a peak, we tested ±0.01, ±0.015, and ±0.02 min under the conditions of our chromatographic runs. As shown in Fig. 3 to Fig. 5, peaks that are close but distinguishable have RT differences of 0.04-0.06, which means the peak search time frame for proper peak resolution is about 0.04 min (±0.02 min, i.e. ±1.2 sec). In each sample, all ions from the single ion chromatograms are sorted by intensity, and the ion with the largest intensity is used first to search for nearby ions within the RT window. Once ions are collected, they are assigned to a peak, and the sum of the intensities of the collected ions is the area of the peak.

We used the peak combining algorithm because some ions in one peak are observed to spread out around RT ±0.02 (Fig. 7, TEXT 1). The algorithm prevents the formation of secondary peaks from a main peak: the ions in a secondary peak are absorbed into the main peak if more than 30% of the secondary peak's ions are already collected in the main peak, or if the secondary peak has fewer than 10 ions and at least one of them is already collected in the main peak. Once a peak absorbs a secondary peak, the combined peak is not absorbed by other peaks, which prevents the peak from expanding. If one peak shares ions with two neighboring peaks, the neighboring peak with the shorter RT distance to the main peak is selected.

The ion search time frame was set to ±0.005 after we found that the RT distance between sesquisabinene hydrate and 2,4,5,6-tetrachloro-m-xylene is 0.0131. Without the peak combining algorithm, these peaks are separated without overlap with an ion search time frame of ±0.01; however, they are combined when the peak combining algorithm is used. With an ion search time frame of ±0.005 and the peak combining algorithm, they remain separate peaks. With an ion search time frame of ±0.005, the resolution for peak detection is 0.01 min (0.6 sec), and peaks eluting within 0.6 sec of each other are not distinguished. However, the chance that one peak is split into two or more peaks also increases. These parameters work well for traditional GC chromatographic separations. Higher resolution GC separations would require appropriate adjustments to the parameters.
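The grouping and combining rules of this section can be sketched in Python as below. This is an illustration under stated assumptions, not the software used in this work: the names `IonPeak`, `group_ions` and `should_absorb` are hypothetical, each ion is assigned to at most one group, and "already collected in the main peak" is interpreted as a match on m/z value. The thresholds (5000, 1E5, more than 6 ions, ±0.005, 30%, fewer than 10 ions) are the ones given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IonPeak:
    mz: int            # nominal m/z of the single ion chromatogram
    rt: float          # retention time of the ion peak (min)
    intensity: float   # ion peak intensity


def group_ions(ions, half_window=0.005, min_intensity=5000.0,
               min_area=1e5, max_rejected_ion_count=6):
    """Greedy grouping: the most intense unassigned ion seeds a peak and
    collects every unassigned ion within +/- half_window of its RT."""
    pool = sorted((i for i in ions if i.intensity >= min_intensity),
                  key=lambda i: i.intensity, reverse=True)
    used = [False] * len(pool)
    peaks = []
    for s, seed in enumerate(pool):
        if used[s]:
            continue
        members = []
        for k, ion in enumerate(pool):
            if not used[k] and abs(ion.rt - seed.rt) <= half_window:
                used[k] = True
                members.append(ion)
        area = sum(m.intensity for m in members)
        # Keep the peak only if it is large enough and has more than 6 ions.
        if area >= min_area and len(members) > max_rejected_ion_count:
            peaks.append({"rt": sum(m.rt for m in members) / len(members),
                          "area": area, "ions": members})
    return peaks


def should_absorb(main_mz, secondary_mz,
                  min_shared_fraction=0.30, small_peak_size=10):
    """Combining rule: absorb the secondary peak if more than 30% of its ions
    are in the main peak, or if it has fewer than 10 ions and shares at least one."""
    secondary = set(secondary_mz)
    if not secondary:
        return False
    shared = secondary & set(main_mz)
    if len(shared) / len(secondary) > min_shared_fraction:
        return True
    return len(secondary) < small_peak_size and len(shared) >= 1
```

For instance, `should_absorb(range(50, 70), [60, 200, 201])` returns `True` because one of the three secondary ions (33%) is already in the main peak. The rule that a combined peak cannot be absorbed again, and the tie-break by RT distance, are not covered by this sketch.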

3.7 Exogenous GC/MS standard use for RT adjustment

We used three standards: one internal standard (p-chloro toluene) and two surrogate standards (1,2,4-trimethyl benzene and 2,4,5,6-tetrachloro-m-xylene). p-Chloro toluene elutes first at an RT of around 7.74, 1,2,4-trimethyl benzene elutes later at around 8.44, and 2,4,5,6-tetrachloro-m-xylene elutes last at around 26.75. We picked one sample (G-L-2M-v986) that shows all three standards clearly and used these data to search for the standards in the other samples within a ±0.5 time frame. The default time frame for the standard search is ±0.5, and the user can supply a different value. Out of a total of 66 samples, 57 samples show very clear matches with all three standards. However, in some of the TYA-rhizome samples, standard2 (1,2,4-trimethyl benzene) is masked by a very large myrcene peak and is either not detected or detected with a shifted RT, whereas p-chloro toluene and 2,4,5,6-tetrachloro-m-xylene are well detected. Out of 9 TYA-rhizome samples, 5 samples still show 1,2,4-trimethyl benzene. However, its RT is delayed by about 0.16 compared to other samples because of the myrcene peak. There is one sample that still shows the 1,2,4-trimethyl benzene peak but with a delayed RT.

In Fig. 8, the first chromatogram, from the FMO-leaf sample (harvested at 7 M), shows myrcene at RT 8.36, whereas the second chromatogram, from the TYA-rhizome sample (5M), has a large myrcene peak at RT 8.68. Standard2, 1,2,4-trimethyl benzene, appears at RT 8.43 in the first chromatogram but is not seen in the second chromatogram because of the myrcene peak. The standard searching process found 1,2,4-trimethyl benzene at RT 8.6359, with a Pearson correlation coefficient of 0.995423 against the clear 1,2,4-trimethyl benzene peak from the G-L-2M-v986 sample (mentioned above). When we look at the SICs of ions m/z 105 and 120, which are the most abundant ions in the 1,2,4-trimethyl benzene mass spectrum (Fig. 8, B bottom), there are just small peaks at RT 8.64 (Fig. 8, A, third and fourth chromatograms). Although the major ions of myrcene are m/z 69 and 93, and ions 105 and 120 are almost negligible in the myrcene mass spectrum (Fig. 8, B top), the myrcene peak is so large that there are still large amounts of ions 105 and 120, as shown.

The following log file example (T-Rh-5M-v498_T141Rh5M) shows the candidate peaks for the standards and the selection of the standards (TEXT 2). In TEXT 2, "Matched MW" means the number of ions that are shared between two peaks: each detected peak in this sample and the compared standard peak from the G-L-2M-v986 sample. A higher matched MW gives more reliability. If a peak has a matched MW equal to or less than 5, it is marked as (e.g. ) and is not considered. In standard1, there are two peaks, at RT 7.6248 and 7.7316, that have high Pearson correlation coefficients. Note that the highest Pearson correlation coefficient for standard1 in this sample is 0.947, whereas other samples have coefficients equal to or higher than 0.998. The compound at RT 7.7316 is p-chloro toluene and the compound at RT 7.6248 is o-chloro toluene. As shown in Fig. 9, o-chloro toluene (RT 7.64) gives a very small peak and is an impurity in the p-chloro toluene (RT 7.74) used as the standard compound in the extractions. Ginger and turmeric do not contain these chloro toluenes. Many other samples also have detectable o-chloro toluene.

We use all three standards for RT adjustment in non-TYA-rhizome samples and only standard1 and standard3 in TYA-rhizome samples. Before a peak RT is adjusted by the standards, the RT of each peak is set to the average RT of the ions in that peak. Then the average RT difference of the two or three standards is applied to all peaks in each sample.

The sample G-Rh-3M-v595_G79Rh3M failed to deconvolute standard3, 2,4,5,6-tetrachloro-m-xylene, from sesquisabinene hydrate (Table 1). The RT for ion m/z 69, the major ion of sesquisabinene hydrate, is 26.7945, and the RT for ions m/z 207 and 244, the major ions of 2,4,5,6-tetrachloro-m-xylene, is 26.7901. The difference between 26.7945 and 26.7901 is 0.0044, which is less than 0.005, the ion search time frame used for peak deconvolution. In this case, it is hard to determine the appropriate RT time frame for resolution. If we set it to less than 0.0044 to deconvolute these peaks, some other peaks may be split into two peaks.

The GC/MS data of the samples used in this experiment are from runs performed on the same column, installed once, with identical instrument settings. As a result, there are no large differences in RTs for the standards across samples. However, if samples are run on columns that are removed and reinstalled or exposed to prolonged column conditioning, the RT can shift significantly and RT adjustment from GC standards will be more helpful for aligning peaks in different samples.
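The RT adjustment step can be sketched as follows. The function names `rt_shift` and `adjust` are hypothetical, and the sign convention is an assumption: ΔRT is taken as reference RT minus observed RT, so adding the averaged value moves a peak toward the reference. The ΔRT values in the example are those listed for sample F-L-7M-v435 in the table of standards at the end of this appendix.

```python
from statistics import mean


def rt_shift(delta_rts, use=(0, 1, 2)):
    """Average the RT differences of the selected standards.
    Index 0 is standard1, index 1 is standard2, index 2 is standard3."""
    return mean(delta_rts[i] for i in use)


def adjust(peak_rts, shift):
    """Apply the same shift to every peak RT of one sample."""
    return [round(rt + shift, 4) for rt in peak_rts]


# Delta RT of standard1, standard2 and standard3 for sample F-L-7M-v435.
deltas = [0.0055, -0.0042, 0.0100]

shift_all = rt_shift(deltas)              # non-TYA-rhizome samples: all three standards
shift_tya = rt_shift(deltas, use=(0, 2))  # TYA-rhizome samples: standard1 and standard3 only

print(round(shift_all, 5), round(shift_tya, 5))
print(adjust([7.7406, 8.4497, 26.7353], shift_all))
```

Using a single averaged shift per sample keeps the relative spacing of peaks within that sample unchanged, which matters because the later peak comparison works on RT differences between samples.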

3.8 Peak comparison in different samples

Peaks are detected, and the RT of each peak is typically adjusted according to the GC standards. Peaks that have high Pearson correlation coefficients between samples can be grouped and considered as a unique peak representing a common compound among samples. To do this, we need to analyze peaks from all samples at the same time. Our approach is to enter the peaks into a MySQL table and pick them one by one in order of descending peak area. Once a peak is picked, peaks in other samples within a certain RT range are examined to see whether they are similar. In this paper, the peak picked initially is called the “query peak” and a peak selected within the searched RT range in another sample is called the “subject peak”.

Query peaks with different peak areas need different RT ranges to search for subject peaks in other samples. For our system, peaks with peak areas equal to or larger than 5E8 have a search RT range from -0.65 to 0.1 min, peaks with peak areas equal to or larger than 1E8 have a search RT range from -0.3 to 0.1, peaks with peak areas equal to or larger than 5E5 have a search RT range from -0.2 to 0.1, and peaks with peak areas smaller than 5E5 have a search RT range from -0.15 to 0.1.

The subject peaks within the search RT range are first examined to see whether their matched ion number is equal to or more than 6. If the matched ion number is less than 6, the subject peak is not considered a candidate for similarity to the query peak, because there is insufficient information for calculating a Pearson correlation coefficient. The ion ratio is then considered. The ion ratio is the ratio of the matched ion number to the total ion number of whichever peak, query or subject, has the smaller number of ions. This prevents the unwanted situation in which only a small number of ions match and yet give a high Pearson correlation coefficient. Subject peaks with a matched-ion ratio equal to or higher than 0.6 are considered candidates for similarity to the query peak. If a subject peak has passed both criteria, for matched ion number and for ratio of matched ions, a Pearson’s correlation coefficient is calculated based on the peak areas of the matched ions. Subject peaks with Pearson’s coefficients equal to or higher than 0.7 are considered candidates for similarity to the query peak. The one subject peak with the highest Pearson’s coefficient in each sample is selected, and this peak becomes a new query peak to search for any similar peak, other than the original query peak, in the sample to which the original query peak belonged. If there is a peak with a higher Pearson’s correlation coefficient, the subject peak is not grouped with the original query peak.
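The three filters applied to a subject peak can be sketched as below. The names `search_rt_range`, `pearson` and `similarity`, and the representation of a peak as a dictionary mapping m/z to ion peak area, are assumptions made for illustration; the thresholds are those stated in the text. The reciprocal check from the subject peak back to the query peak's sample is not included.

```python
from math import sqrt


def search_rt_range(query_area):
    """RT offsets (min) searched around a query peak, chosen by its area."""
    if query_area >= 5e8:
        return (-0.65, 0.1)
    if query_area >= 1e8:
        return (-0.3, 0.1)
    if query_area >= 5e5:
        return (-0.2, 0.1)
    return (-0.15, 0.1)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def similarity(query, subject, min_matched=6, min_ratio=0.6, min_r=0.7):
    """Return Pearson r over the matched ions if the subject peak passes all
    three filters, otherwise None. Peaks are dicts of m/z -> ion peak area."""
    matched = sorted(set(query) & set(subject))
    if len(matched) < min_matched:
        return None
    # The ratio is taken against the peak with the smaller number of ions.
    ratio = len(matched) / min(len(query), len(subject))
    if ratio < min_ratio:
        return None
    r = pearson([query[m] for m in matched], [subject[m] for m in matched])
    return r if r >= min_r else None
```

Checking the matched ion count and ratio before computing the correlation avoids spending time on, and being misled by, pairs of peaks that share only a handful of ions.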

3.9 Analysis of result table

The number of samples used in this study is 66. After computation, 23349 unique peaks were obtained from a total of 35425 peaks in these samples. Only 5 peaks are common to all 66 samples, and 21836 peaks, which is 93.5% of the total unique peaks, were found in only one sample; their peak areas were relatively small (Table 2, Fig. 10, Fig. 11), suggesting that they represent either very low abundance compounds or instrument noise. For the unique peaks from only one sample, the average peak area is 1277507 and the median peak area is 192481. A total of 89.07% (19449 peaks) of these peaks had areas less than 1E6, 9.61% (2098 peaks) were between 1E6 and 1E7, 1.10% (239 peaks) were between 1E7 and 1E8, and 0.22% (49 peaks) were higher than 1E8. Considering that the lower limit of peak area allowed for peak detection was 1E5, the peak areas of most unique peaks found in only one sample are small, and these peaks are likely to be background or instrument noise.

In ginger and turmeric samples, there are several very large peaks that represent some of the most abundant compounds in these plants: myrcene (RT 8.69) from TYA-rhizome; α-phellandrene (RT 8.72) from FMO-rhizome; α-zingiberene (RT 22.92) and β-sesquiphellandrene (RT 24.10) from GY-rhizome and FMO-rhizome; α-turmerone (RT 29.14), β-turmerone (RT 30.43), and an unknown (manool-like) compound (RT 50.31) from TYA-rhizome; and an unknown (gingerol-like) compound (RT 51.04) from GY-rhizome. Ions from the SIC peaks of these very large peaks are often divided into several peaks when embedded peaks exist, and the SICs of these small embedded peaks appear as steps in the slope of the SIC peaks of the abundant compounds. As shown in Fig. 8 and Table 2, a standard2 peak is sometimes hard to select as the same peak as other standard2 peaks when it is embedded in the very large myrcene peak. These large peaks cause some smaller peaks to have incomplete ion information and can lead to orphan peaks, that is, unique peaks from only 1 sample. As shown in Fig. 11, there are many more orphan peaks around the RTs of the very large peaks.

Peaks shared by all 66 samples are n-butyl acetate (RT 5.07), α-pinene (RT 7.26), camphene (RT 7.58), myrcene (RT 8.69) and (E)-caryophyllene (RT 20.02). Peaks shared by 64 samples are β-pinene (RT 8.18) and α-humulene (RT 21.09). The standard1 (p-chloro toluene) and standard3 (2,4,5,6-tetrachloro-m-xylene) peaks are shared by 65 samples. The standard1 peak was not present in the T-Rh-5M-v498 sample, which had the lowest Pearson’s correlation coefficient for standard1 in Table 2. The G-L-2M-v986 sample was selected as the source sample for the standard search for standard1, and sample G-L-7M-v738, which has the largest standard1 peak, was automatically selected in the later stage of analysis. The standard3 peak was not selected in sample G-Rh-3M-v595, which showed a Pearson’s correlation coefficient of 0.581 with other standard3 peaks in Table 2.

The standard2 (1,2,4-trimethyl benzene) peak was divided into groups of peaks after processing. As mentioned before, standard2 peaks in TYA-rhizome samples are embedded in very large myrcene peaks and have larger peak areas than the non-embedded standard2 peaks in samples other than TYA-rhizome. Standard2 peak selection was done automatically using one of these embedded peaks because these peaks are larger. Their ion patterns are slightly altered by the huge myrcene peaks (Fig. 8, Table 2), so the standard2 peaks in several samples were not selected, and in the end there are two groups of standard2 peaks.
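The headline proportions of this section follow directly from the three peak counts. The short check below uses only those counts; the variable names are illustrative.

```python
# Peak counts quoted in section 3.9; variable names are illustrative only.
total_peaks = 35425          # all peaks detected across the 66 samples
unique_peaks = 23349         # unique peaks after grouping across samples
single_sample_peaks = 21836  # unique peaks found in only one sample

# Share of unique peaks that occur in a single sample.
orphan_share = single_sample_peaks / unique_peaks
# Unique peaks that are shared by at least two samples.
shared_peaks = unique_peaks - single_sample_peaks

print(f"{orphan_share:.1%} of unique peaks occur in one sample only")
print(f"{shared_peaks} unique peaks are shared by at least two samples")
```

The second number is the same count of peaks shared by at least two samples that is quoted in the concluding remarks.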

4 Concluding remarks

The approach we used works very well for mid- or small-sized peaks. For example, n-butyl acetate, which is thought to be a contaminant in the extraction solvent MTBE (methyl tert-butyl ether), was identified in all 66 samples. Very large peaks cause problems for small coeluting peaks and also delay their retention times (Fig. 8, Table 2). However, this approach is still useful for deconvoluting embedded peaks (the standard2 peak from the huge myrcene peak) as well as two similarly sized peaks with very close RTs (the standard3 peak from the similarly sized sesquisabinene hydrate peak, Fig. 6, Table 2). We used 1E5 as the lower limit of peak area in order to compare and collect as many peaks as possible. However, most small peaks were present in only one sample. The parameters of this process are adjustable, and increasing the peak area limit gives a smaller number of background peaks. Excluding unique peaks that are not shared with any other sample, there are 1513 peaks shared by at least two samples. If the peak area limit is increased to 5E5 or 1E6, there are 1135 or 1078 peaks shared by at least two samples, respectively. These numbers are still large compared to a sample-based approach. When we look at individual samples and search the library to identify peaks, if a peak represents an unknown compound, it is very hard, if not impossible, to compare it with other samples and decide whether they are the same peak. The approach outlined in this paper enables very efficient comparison of unknown peaks across a large dataset.



[Figure 1 image: TIC panel and SIC panel (x-axis: Time (min)) and mass spectra panel (x-axis: m/z) for the peak at RT 9.12.]

Fig. 1 Concept of the process used in this paper. The peak at RT 9.12 in the TIC consists of ions 68, 93, 97, 79, 94, 136, 107, 53, etc., as shown in the mass spectra. The SICs of these ions (MW 68, 93, 97, 79, 94, 136, 107 and 53) are shown. To define the peak from the SICs, the biggest ion, MW 68, is selected first. Based on this ion, other ions with SIC peaks around RT 9.12, such as 93, 97, 79, 94, 136, 107, 53, etc., are collected (only the 8 biggest ions are shown here; the actual process collects small ions too). Finally, a new peak at RT 9.12 is created from the collected SIC ions, which is very similar to the peak shown in the TIC at RT 9.12. The peak area of the newly created peak is the sum of the peak areas of the SIC ions. When a peak is compared with peaks in other samples, the areas of the ions in the SICs are compared to yield a Pearson correlation coefficient. The height of the SICs is fixed to 760138.

[Figure 2 flow chart, transcribed as steps:]

1. Ion peak collection by SIC from MW 50-650 in each sample.
2. Ion peak selection by area > 5E3.
3. Peak constitution with ion peaks at RT ± 0.005.
4. Peak combining algorithm: combine divided peaks when common MW > 30%, or when there is a common MW in case common MW < 10. RT of combined peak: average of the ions' RTs.
5. Peak selection by area > 1E5 and by ion # ≥ 6.
6. Search of 3 GC standards, search range: RT ± 0.5. RT is shifted by the average of the RT differences of the standards.
7. Selection of a Query peak: largest peak first across all samples.
8. Other Subject peaks selection for the Query peak near the RT of the Query peak. The RT range depends on the peak area of the Query peak:
   area > 5E8: -0.65 ~ 0.1
   5E8 ≥ area > 1E8: -0.3 ~ 0.1
   1E8 ≥ area > 5E5: -0.3 ~ 0.1
   5E5 ≥ area: -0.15 ~ 0.1
9. Checking Subject peak: if matched ion number ≥ 6 and common ion ratio, based on the sample with the smaller number of ions, ≥ 0.6 (Yes), and if Pearson correlation coefficient ≥ 0.7 (Yes), the peak becomes the Selected Subject peak.
10. Feedback checking: if the sample that the Query peak belongs to has a more similar peak to the Subject peak near the Subject peak RT (No), the Subject peak is selected as the same as the Query peak.

Fig. 2 Data processing flow chart (see details in sections 3.6 - 3.8).

[Figure 3 image: five stacked TIC chromatograms of FMO-rhizome samples, RT 26.11 - 27.34, NL 2.88E7; x-axis: Time (min); y-axis: Relative Abundance.]

Fig. 3 TIC of 5 FMO-rhizome samples. The scale of the y-axis is fixed to 28800000. The compound shown at RT 26.78 is 2,4,5,6-tetrachloro-m-xylene, standard3 for GC/MS, and the compound shown at RT 26.84 is thought to be sesquisabinene hydrate. In the first and second chromatograms the two peaks elute at very similar RTs and cannot be separated in the TIC.

[Figure 4 image: five stacked SICs (m/z 68.50-69.50) of FMO-rhizome samples, RT 26.11 - 27.34, NL 3.67E6; x-axis: Time (min); y-axis: Relative Abundance.]

Fig. 4 SIC of 5 FMO-rhizome samples at MW 69, which is the most abundant ion of sesquisabinene hydrate and negligible in 2,4,5,6-tetrachloro-m-xylene. All five samples have sesquisabinene hydrate. The scale of the y-axis is fixed to 3670000.

[Figure 5 image: five stacked SICs (m/z 243.50-244.50) of FMO-rhizome samples, RT 26.11 - 27.34, NL 5.39E5; x-axis: Time (min); y-axis: Relative Abundance.]

Fig. 5 SIC of 5 FMO-rhizome samples at MW 244, which is the second most abundant ion of 2,4,5,6-tetrachloro-m-xylene and negligible in sesquisabinene hydrate. All five samples have standard3. Together with Fig. 4, this shows that all five samples have both sesquisabinene hydrate and standard3, although they are hard to discriminate in the TIC (Fig. 3). Standard3 peaks are successfully deconvoluted as shown in Table 1. The scale of the y-axis is fixed to 539000.

[Figure 6 image: stacked chromatograms of sample F-Rh-3M-v1102_F60Rh3M, RT 22.91 - 23.38: TIC (NL 3.18E7) and SICs at m/z 61.50-62.50, 84.50-85.50, 123.50-124.50, 163.50-164.50, 171.50-172.50, 200.50-201.50 and 202.50-203.50; x-axis: Time (min); y-axis: Relative Abundance.]

Fig. 6 An example of deconvolution of an embedded peak using SICs. The first chromatogram is the TIC and the rest are SICs of the ions at MW 62, 85, 124, 164, 172, 201 and 203. The peak at RT 23.16 is thought to be β-bisabolene and the compound at RT 23.11 is unknown. These two peaks are successfully deconvoluted. The scales are relative.

[Figure 7 image: chromatograms (panel A) and mass spectrum of sample f-l-7m-v435_f275-f272l7m at RT 25.60-25.67, NL 5.52E4 (panel B; x-axis: m/z; y-axis: Relative Abundance).]

Fig. 7 Example of RT differences among some ions in one peak at RT 25.63. A: The first chromatogram shows the TIC. The second to seventh chromatograms are SICs. The second and third chromatograms show two ions, MW 69 and MW 103, that belong to Group A in TEXT 1. The fourth and fifth chromatograms show two ions, MW 83 and MW 119, that belong to both Group A and Group B. The sixth and seventh chromatograms show two ions, MW 79 and MW 93, that belong to Group B. Group A and Group B are specified in the text. Group A and Group B are combined by the peak combining algorithm. The scales are relative. B: Mass spectrum of caryophyllene oxide at RT 25.63 in panel A. The mass spectrum of caryophyllene oxide has ions from both Group A and Group B, which indicates that Group A and Group B should be combined into one peak. There are 62 ions with ion intensity higher than 5000.

[Figure 8 image: panel A, four stacked chromatograms, RT 8.23 - 8.82: TIC of F-L-7M-v740 (NL 7.51E7), TIC of T-Rh-5M-v498 (NL 4.86E8), and SICs of T-Rh-5M-v498 at m/z 104.50-105.50 (NL 5.64E6) and m/z 119.50-120.50 (NL 1.68E6); x-axis: Time (min). Panel B, two mass spectra of F-L-7M-v740 at RT 8.35-8.38 and RT 8.42-8.45; x-axis: m/z. y-axis: Relative Abundance.]

Fig. 8 A: The first chromatogram is from the FMO-leaf 7 M sample (F-L-7M-v740) and shows myrcene at RT 8.36 and 1,2,4-trimethyl benzene (standard2) at RT 8.43. The second chromatogram is from the TYA-rhizome 5 M sample (T-Rh-5M-v498), with a huge myrcene peak at RT 8.68. The third and fourth chromatograms show the SICs of ions 105 and 120, which are the most abundant ions in standard2. The huge myrcene peak masks the small standard2 peak; however, the most abundant ions of standard2 form small bumps. Deconvolution of the standard2 peak in the T-Rh-5M-v498 sample is successful (Table 1). The peak at RT 8.65 in the first chromatogram is α-phellandrene. B: top: MS of myrcene; bottom: MS of 1,2,4-trimethyl benzene. MW 105 and 120, which are the most abundant ions in the standard2 mass spectrum, exist at very low amounts in the myrcene mass spectrum. The scales are relative.

[Figure 9 image: TIC, RT 7.30 - 8.03, NL 1.11E7, file F-L-7M-v435_F275-F272L7M; x-axis: Time (min); y-axis: Relative Abundance.]

Fig. 9 Standard 1 peak in the T-Rh-5M-v498_T141Rh5M sample. RT 7.74 represents p-chloro toluene and RT 7.64 represents o-chloro toluene. The process that searches for standards also finds o-chloro toluene, possibly present as an impurity of p-chloro toluene. p-Chloro toluene and o-chloro toluene have very similar mass spectra.

[Figure 10 image: chart titled "Peak number and average area"; x-axis: Sample number shared by unique peak; y-axes: Peak number (logarithmic, 1 to 100000) and Average peak area (0.00E+00 to 1.60E+08).]

Fig. 10 Chart from Table 2. Unique peaks shared by more samples tend to have larger average peak areas (red). Most peaks are unique peaks from only one or two samples (blue).

[Figure 11 image: three panels titled "Max peak area of unique peaks", "Peak area of unique peaks from only 1 sample" and "Max peak area of unique peaks from > 1 sample"; x-axis: RT; y-axis: peak area (logarithmic, 1.0E+05 to 1.0E+09).]

Fig. 11 Max peak area of unique peaks. Ions from the SIC peaks of the huge peaks at RT 8.69, RT 8.72, RT 22.92, RT 24.10, RT 29.14, RT 30.43, RT 50.31 and RT 51.04 are often divided into several peaks when embedded peaks exist, and the SICs of these small embedded peaks appear as steps in the slope of the SIC peaks of the huge peaks. Embedded peaks are affected by these huge peaks and their ion information is often distorted, especially when the embedded peaks are compounds similar to those of the huge peaks, which causes many unique peaks from only one sample near these huge peaks.

222

sample F-L-7M-v435 F-L-7M-v740 F-L-7M-v743 F-L-7M-v747 F-R-7M-v444 F-R-7M-v763 F-R-7M-v765 F-Rh-3M-v1102 F-Rh-3M-v551 F-Rh-3M-v579 F-Rh-5M-v857 F-Rh-5M-v863 F-Rh-5M-v865 F-Rh-5M-v867 F-Rh-7M-v443 F-Rh-7M-v764 F-Rh-7M-v767 G-L-2M-v986 G-L-2M-v989 G-L-2M-v993 G-L-2M-v997 G-L-7M-v434 G-L-7M-v716 G-L-7M-v728 G-L-7M-v738 G-R-2M-v985 G-R-2M-v999 G-R-7M-v426 G-R-7M-v725 G-R-7M-v727 G-R-7M-v737 G-Rh-2M-v990 G-Rh-2M-v998 G-Rh-3M-v595 G-Rh-3M-v599 G-Rh-3M-v628 G-Rh-3M-v640 G-Rh-4M-v1055 G-Rh-4M-v1063 G-Rh-4M-v1071 G-Rh-4M-v1079 G-Rh-6M-v1035 G-Rh-6M-v1040 G-Rh-6M-v1160 G-Rh-6M-v796 G-Rh-7M-v425 G-Rh-7M-v724 G-Rh-7M-v726 G-Rh-7M-v736 T-L-7M-v436 T-L-7M-v741

RT 7.7406 7.7212 7.7240 7.7230 7.7315 7.7282 7.7239 7.7366 7.7231 7.7229 7.7280 7.7281 7.7328 7.7281 7.7246 7.7277 7.7311 7.7402 7.7335 7.7282 7.7322 7.7240 7.7312 7.7243 7.7231 7.7413 7.7339 7.7246 7.7297 7.7190 7.7234 7.7370 7.7327 7.7409 7.7276 7.7354 7.7329 7.7369 7.7391 7.7357 7.7402 7.7446 7.7441 7.7312 7.7426 7.7289 7.7362 7.7318 7.7245 7.7281 7.7240

Standard 1 R ÄRT 1.000 0.0055 1.000 0.0249 0.999 0.0221 0.999 0.0231 0.999 0.0146 1.000 0.0179 0.999 0.0222 1.000 0.0095 0.999 0.0230 0.998 0.0232 0.999 0.0181 1.000 0.0180 0.999 0.0133 0.999 0.0180 0.998 0.0215 0.999 0.0184 0.999 0.0150 1.000 0.0059 0.999 0.0126 1.000 0.0179 1.000 0.0139 1.000 0.0221 0.998 0.0149 0.999 0.0218 1.000 0.0230 1.000 0.0048 1.000 0.0122 1.000 0.0215 0.999 0.0164 0.999 0.0271 0.999 0.0227 1.000 0.0091 1.000 0.0134 0.999 0.0052 0.998 0.0185 0.999 0.0107 0.999 0.0132 1.000 0.0092 1.000 0.0070 1.000 0.0104 1.000 0.0059 1.000 0.0015 1.000 0.0020 1.000 0.0149 0.999 0.0035 1.000 0.0172 0.998 0.0099 0.999 0.0143 0.999 0.0216 0.999 0.0180 0.999 0.0221

RT 8.4497 8.4331 8.4369 8.4367 8.4388 8.4373 8.4332 8.4505 8.4391 8.4368 8.4416 8.4461 8.4454 8.4434 8.4358 8.4390 8.4414 8.4444 8.4383 8.4350 8.4384 8.4322 8.4345 8.4322 8.4328 8.4451 8.4396 8.4307 8.4328 8.4266 8.4288 8.4439 8.4375 8.4454 8.4352 8.4380 8.4367 8.4425 8.4449 8.4450 8.4456 8.4462 8.4468 8.4385 8.4443 8.4363 8.4377 8.4334 8.4328 8.4375 8.4309

Standard 2 R ÄRT 1.000 -0.0042 1.000 0.0124 1.000 0.0086 0.999 0.0088 1.000 0.0067 0.999 0.0082 1.000 0.0123 1.000 -0.0050 0.999 0.0064 0.999 0.0087 1.000 0.0039 0.999 -0.0006 1.000 0.0001 0.999 0.0021 0.999 0.0097 0.999 0.0065 1.000 0.0041 1.000 0.0011 1.000 0.0072 0.999 0.0105 1.000 0.0071 0.999 0.0133 1.000 0.0110 0.999 0.0133 0.999 0.0127 1.000 0.0004 1.000 0.0059 1.000 0.0148 0.999 0.0127 0.999 0.0189 0.999 0.0167 1.000 0.0016 1.000 0.0080 1.000 0.0001 1.000 0.0103 0.999 0.0075 1.000 0.0088 1.000 0.0030 1.000 0.0006 1.000 0.0005 0.999 -0.0001 1.000 -0.0007 1.000 -0.0013 1.000 0.0070 1.000 0.0012 1.000 0.0092 0.999 0.0078 0.999 0.0121 0.999 0.0127 0.999 0.0080 1.000 0.0146

Standard 3 RT R 26.7353 0.999 26.7047 0.992 26.7028 0.993 26.7095 0.994 26.7147 0.955 26.7445 0.959 26.7270 0.990 26.7751 0.987 26.8095 0.971 26.7916 0.932 26.7793 0.960 26.7978 0.971 26.7366 0.992 26.7953 0.961 26.7287 0.959 26.7365 0.972 26.8190 0.939 26.7480 1.000 26.7337 0.998 26.7290 0.999 26.7397 0.999 26.7096 0.996 26.7038 0.988 26.7135 0.987 26.7437 0.999 26.7415 0.997 26.7367 0.997 26.7106 0.998 26.7021 0.990 26.7049 0.989 26.7061 0.990 26.7621 0.996 26.7612 0.995 26.7918 0.581 26.7521 0.951 26.7310 0.949 26.7205 0.978 26.7670 0.996 26.7758 0.995 26.7697 0.998 26.7692 0.996 26.7758 0.996 26.7883 0.992 26.7643 0.994 26.7750 0.989 26.7628 0.957 26.7492 0.975 26.7529 0.974 26.7591 0.979 26.7123 0.993 26.7049 0.994

ΔRT 0.0100 0.0406 0.0425 0.0358 0.0306 0.0008 0.0183 -0.0298 -0.0642 -0.0463 -0.0340 -0.0525 0.0087 -0.0500 0.0166 0.0088 -0.0737 -0.0027 0.0116 0.0163 0.0056 0.0357 0.0415 0.0318 0.0016 0.0038 0.0086 0.0347 0.0432 0.0404 0.0392 -0.0168 -0.0159 -0.0465 -0.0068 0.0143 0.0248 -0.0217 -0.0305 -0.0244 -0.0239 -0.0305 -0.0430 -0.0190 -0.0297 -0.0175 -0.0039 -0.0076 -0.0138 0.0330 0.0404

peak# 516 611 399 499 508 357 385 250 460 480 420 593 410 424 761 647 446 434 281 556 348 859 494 541 2161 243 275 520 219 196 309 632 402 845 606 448 539 565 702 405 493 599 674 610 696 1083 702 735 571 948 363

ion#/p 22.2 20.5 22.7 21.6 21.6 21.5 21.9 22.5 23.5 22.3 21.3 20.6 21.7 21.7 21.8 21.1 20.7 18.1 17.4 18.5 16.8 18.3 17.4 17.1 22.0 19.7 19.8 21.2 22.1 22.1 20.1 23.1 20.2 24.9 24.0 24.3 24.2 23.7 22.5 22.9 21.8 23.5 23.3 21.6 23.0 23.3 25.2 24.0 24.5 20.6 20.9

ion# 11438 12530 9054 10795 10987 7680 8423 5623 10795 10710 8929 12232 8882 9221 16575 13652 9213 7864 4878 10283 5837 15738 8571 9277 47587 4796 5445 11022 4841 4328 6226 14623 8104 21026 14527 10893 13039 13368 15801 9266 10752 14093 15701 13168 16006 25274 17656 17652 13996 19499 7569

                 Standard 1                Standard 2                Standard 3
Sample           RT      R      ΔRT        RT      R      ΔRT        RT       R      ΔRT        peak#    ion#/p   ion#
T-L-7M-v745      7.7240  0.999  0.0221     8.4322  1.000  0.0133     26.7036  0.994  0.0417     530      20.1     10662
T-L-7M-v756      7.7279  0.999  0.0182     8.4335  1.000  0.0120     26.7176  0.996  0.0277     411      20.3     8362
T-R-7M-v428      7.7236  0.999  0.0225     8.4323  1.000  0.0132     26.7033  0.998  0.0420     302      21.4     6458
T-R-7M-v753      7.7364  1.000  0.0097     8.4411  1.000  0.0044     26.7855  0.975  -0.0402    454      21.1     9570
T-R-7M-v755      7.7280  0.999  0.0181     8.4338  1.000  0.0117     26.7202  0.997  0.0251     288      20.1     5783
T-R-7M-v761      7.7286  1.000  0.0175     8.4354  0.999  0.0101     26.7255  0.996  0.0198     263      20.3     5343
T-Rh-3M-v1106    7.7396  1.000  0.0065     8.5369  0.015  -0.0914    26.7569  0.997  -0.0116    235      20.0     4695
T-Rh-3M-v603     7.7320  0.999  0.0141     8.0383  0.224  0.4072     26.6901  0.982  0.0552     173      20.1     3481
T-Rh-3M-v616     7.7326  0.999  0.0135     8.5354  0.976  -0.0899    26.6912  0.986  0.0541     307      21.0     6461
T-Rh-5M-v498     7.7316  0.948  0.0145     8.6481  0.993  -0.2026    26.7081  0.982  0.0372     847      20.9     17672
T-Rh-5M-v870     7.7326  1.000  0.0135     8.8022  0.836  -0.3567    26.7195  0.998  0.0258     493      21.0     10370
T-Rh-5M-v872     7.7373  1.000  0.0088     8.5473  0.307  -0.1018    26.7251  0.999  0.0202     801      19.8     15873
T-Rh-5M-v874     7.7347  1.000  0.0114     8.4928  0.607  -0.0473    26.7340  0.998  0.0113     490      19.8     9695
T-Rh-7M-v427     7.7285  1.000  0.0176     8.5786  0.974  -0.1331    26.6981  0.999  0.0472     922      21.3     19632
T-Rh-7M-v754     7.7296  0.999  0.0165     8.6169  0.927  -0.1714    26.7131  0.996  0.0322     689      21.0     14466
Average                                                                                         536.74   21.354   11575

Table 1 Standards collected from each sample, using the standard peaks of the GY leaf 2M sample (G-L-2M-v986) as references. To constitute a peak, ions were collected starting from the ion with the highest intensity, together with all ions within ± 0.005 RT of it. A peak-combining algorithm was then applied to reduce duplicated peaks. The red boxed region for standard 2 comes from the TYA-rhizome samples, in which standard 2 was either not found or found with a delayed RT. The Pearson correlation coefficient in the red boxed region of the standard 3 search is very low because deconvolution failed: another peak near standard 3 is too close, within 0.05, the ion search range. Standard 1: p-chlorotoluene; Standard 2: 1,2,4-trimethylbenzene; Standard 3: 2,4,5,6-tetrachloro-m-xylene. RT: retention time at which each standard was detected; R: Pearson correlation coefficient with the corresponding GY leaf 2M standard; ΔRT: RT of the corresponding GY leaf 2M standard minus the sample RT; peak#: total number of peaks; ion#/p: average number of ions per peak; ion#: total number of ions.
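The peak-building step described in the caption can be illustrated with a minimal sketch. It is written in Python rather than the Perl of the original scripts, and it is a simplification: it assumes each ion is assigned to only one group, whereas the original program also tracks ions that fall into more than one group. The names `Ion` and `group_ions` and the example values are hypothetical.

```python
from typing import NamedTuple


class Ion(NamedTuple):
    number: int   # row identifier of the ion peak
    mw: int       # MW of the selected-ion chromatogram (SIC)
    rt: float     # retention time of the ion peak
    area: float   # integrated peak area


def group_ions(ions, window=0.005):
    """Greedy grouping: seed each group with the most intense unassigned ion,
    then collect every unassigned ion whose RT lies within +/- window of it."""
    groups = []
    assigned = set()
    for seed in sorted(ions, key=lambda ion: ion.area, reverse=True):
        if seed.number in assigned:
            continue  # already part of an earlier, more intense group
        members = [ion for ion in ions
                   if ion.number not in assigned
                   and abs(ion.rt - seed.rt) <= window]
        assigned.update(ion.number for ion in members)
        groups.append(members)
    return groups


# Made-up ions: three near RT 10.000 and two near RT 10.020.
example = [
    Ion(1, 69, 10.000, 300.0),
    Ion(2, 41, 10.003, 100.0),
    Ion(3, 93, 10.004, 50.0),
    Ion(4, 79, 10.020, 200.0),
    Ion(5, 91, 10.022, 80.0),
]
for group in group_ions(example):
    print([ion.number for ion in group])
```

Seeding from the most intense ion means the dominant fragment of a compound fixes the RT window, so weaker ions attach to it rather than starting groups of their own.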

A. Ions collected together using ion with MW 69 (Group A)
         MW    RT        Area
#2674    69    25.6169   309441
#122     52    25.6299   11741
#474     54    25.6213   14272
#2015    66    25.6299   19067
#2636    68    25.6255   61540
#5714    83    25.6299   54086
#8500    103   25.6125   5869
#10728   119   25.6299   58978
#10990   120   25.6255   48418
#11819   125   25.6299   6021
#12342   131   25.6299   32340
#12792   134   25.6255   20154
#13566   138   25.6299   7930
#14152   147   25.6255   27711
#14204   148   25.6299   13057
#14924   159   25.6299   5207
#15078   161   25.6299   9158
#15309   164   25.6213   9965
#16205   187   25.6299   10813

B. Ions collected together using ion with MW 79 (Group B)
         MW    RT        Area
#4685    79    25.6341   275640
#85      51    25.6474   11114
#122     52    25.6299   11741    - already picked up
#153     53    25.6385   48261
#474     54    25.6213   14272    - already picked up
#541     55    25.6385   101867
#1046    56    25.6429   25699
#1268    57    25.6429   14971
#1907    65    25.6341   32347
#2015    66    25.6299   19067    - already picked up
#2333    67    25.6341   119096
#2636    68    25.6255   61540    - already picked up
#3048    70    25.6385   20402
#3358    71    25.6341   30127
#4156    77    25.6341   113688
#4372    78    25.6341   23070
#4938    80    25.6341   44436
#5240    81    25.6341   161716
#5507    82    25.6341   96739
#5714    83    25.6299   54086    - already picked up
#5845    84    25.6385   11503
#6588    91    25.6341   172435
#6876    93    25.6385   200592
#7338    94    25.6429   74238
#7419    95    25.6385   189490
#7738    96    25.6385   100530
#8130    97    25.6341   22259
#8779    105   25.6341   110985
#8852    106   25.6385   110333
#8994    107   25.6385   120140
#9276    108   25.6385   50624
#9509    109   25.6385   151037
#9884    110   25.6341   68268
#10002   111   25.6429   20833
#10379   117   25.6385   14132
#10728   119   25.6299   58978    - already picked up
#10990   120   25.6255   48418    - already picked up
#11175   121   25.6341   71930
#11348   122   25.6341   31015
#11577   123   25.6341   68785
#11732   124   25.6341   7023
#11819   125   25.6299   6021     - already picked up
#12342   131   25.6299   32340    - already picked up
#12478   133   25.6385   22619
#12792   134   25.6255   20154    - already picked up
#12807   135   25.6385   47952
#13194   136   25.6341   30360
#13419   137   25.6429   16604
#13566   138   25.6299   7930     - already picked up
#13828   145   25.6385   14021
#14152   147   25.6255   27711    - already picked up
#14204   148   25.6299   13057    - already picked up
#14337   149   25.6429   50256
#14551   151   25.6474   9784
#14924   159   25.6299   5207     - already picked up
#15078   161   25.6299   9158     - already picked up
#15238   163   25.6429   20250
#15309   164   25.6213   9965     - already picked up
#15910   177   25.6341   34091
#16205   187   25.6299   10813    - already picked up

TEXT 1. Example in which some ions belong to two peaks. Group A was selected first and Group B later. Ions common to Group A and Group B are marked with a yellow background. As shown in Fig. 7, the caryophyllene oxide peak needs the ions of both Group A and Group B, so these two groups should be combined into one peak. The ions with MW 69 and 103 in Group A have a slightly lower RT than the common ions (yellow block), and many ions in Group B (white block) have a slightly higher RT than the common ions.
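One simple way to express the idea that Group A and Group B should become a single peak is to merge any groups that share ions. The sketch below is in Python and is only an illustration of that idea. It is not the procedure of the original Perl script, which decides by comparing reference numbers and retention times. The function name `merge_overlapping` is hypothetical.

```python
def merge_overlapping(groups):
    """Merge groups of ion numbers that share at least one ion.

    `groups` is an iterable of collections of ion numbers; the result is a
    list of disjoint sets.
    """
    merged = []
    for group in groups:
        group = set(group)
        # Every previously merged set that shares an ion with this group
        # is absorbed into it.
        overlapping = [m for m in merged if m & group]
        for m in overlapping:
            merged.remove(m)
            group |= m
        merged.append(group)
    return merged


# A few ion numbers taken from Group A and Group B of TEXT 1,
# plus an unrelated group with made-up numbers.
group_a = {2674, 122, 474, 2015}
group_b = {4685, 85, 122, 474}
unrelated = {90001, 90002}
print(merge_overlapping([group_a, group_b, unrelated]))
```

Because the merged sets are kept disjoint, a new group that bridges two earlier sets joins all of them in one pass.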

Reading... T-Rh-5M-v498 T141Rh5M ... done, now data handling
Standard1: 7.7406
T-Rh-5M-v498_T141Rh5M 7.2679 Pearson R value : 0.34665738410422 (matched MW:42)
T-Rh-5M-v498_T141Rh5M 7.5530 Pearson R value : 0.090842593476779 (matched MW:18)
T-Rh-5M-v498_T141Rh5M 7.6248 Pearson R value : 0.91345877303909 (matched MW:26)
T-Rh-5M-v498_T141Rh5M 7.7316 Pearson R value : 0.94755633717576 (matched MW:51)
T-Rh-5M-v498_T141Rh5M 8.1575 Pearson R value : 0.25943896483032 (matched MW:48)
7.7316 0.947556 (RT difference: 0.0145)
Standard2: 8.4455
T-Rh-5M-v498_T141Rh5M 8.1575 Pearson R value : -0.003954576785697 (matched MW:53)
T-Rh-5M-v498_T141Rh5M 8.5224
T-Rh-5M-v498_T141Rh5M 8.5359 Pearson R value : 0.74702288352677 (matched MW:8)
T-Rh-5M-v498_T141Rh5M 8.5881 Pearson R value : 0.77685996429069 (matched MW:6)
T-Rh-5M-v498_T141Rh5M 8.6037 Pearson R value : 0.79345683455474 (matched MW:7)
T-Rh-5M-v498_T141Rh5M 8.6197
T-Rh-5M-v498_T141Rh5M 8.6358
T-Rh-5M-v498_T141Rh5M 8.6481 Pearson R value : 0.99345282537583 (matched MW:7)
T-Rh-5M-v498_T141Rh5M 8.6680 Pearson R value : -0.032983239373261 (matched MW:50)
T-Rh-5M-v498_T141Rh5M 8.7980
8.6481 0.993453 (RT difference: -0.2026)
Standard3: 26.7453
T-Rh-5M-v498_T141Rh5M 26.5538 Pearson R value : 0.64545217687813 (matched MW:6)
T-Rh-5M-v498_T141Rh5M 26.7081 Pearson R value : 0.98225615638177 (matched MW:97)
T-Rh-5M-v498_T141Rh5M 27.1114 Pearson R value : -0.65831589338132 (matched MW:6)
T-Rh-5M-v498_T141Rh5M 27.1306 Pearson R value : -0.39201080814249 (matched MW:13)
T-Rh-5M-v498_T141Rh5M 27.1422
T-Rh-5M-v498_T141Rh5M 27.1938
T-Rh-5M-v498_T141Rh5M 27.2026
T-Rh-5M-v498_T141Rh5M 27.2134 Pearson R value : -0.21420663654604 (matched MW:8)
26.7081 0.982256 (RT difference: 0.0371999999999986)
The average difference with standards (std - sample): -0.0503000000000006

TEXT 2 Search for GC/MS standards. Using the ion peak area information from the SICs of the three standards (from the G-L-2M-v986 sample), a broad RT range (±0.5) was checked to find the standards in the T-Rh-5M-v498 sample. All three standards were found (blue) with high Pearson correlation coefficients, and we also found that standard 1 is contaminated with o-chlorotoluene. The RT difference for each standard is calculated, and the average RT difference is then calculated and used to adjust the RT of all peaks in each sample. (matched MW: number) gives the number of MWs in common between the standard from the G-L-2M-v986 sample and the sample peak in which the standard is sought. Peaks with five or fewer matched MWs are not selected as standards, and the Pearson correlation coefficient is not calculated for them.
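The search described above can be sketched as follows, in Python rather than the Perl of the original scripts. Two points are assumptions rather than statements from the text: the correlation is computed over zero-filled area vectors covering MW 50 to 650, and a tie between candidates is resolved in favor of the first one found. The names `pearson`, `find_standard` and `average_shift` are hypothetical.

```python
import math

MW_RANGE = range(50, 651)  # MW 50 to 650


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences,
    or None when either sequence has zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def find_standard(std_rt, std_spectrum, peaks, window=0.5, min_matched=6):
    """Return (rt, r, std_rt - rt) for the best-correlated peak, or None.

    `std_spectrum` and each peak spectrum map MW -> area. Peaks outside
    +/- window of std_rt, or sharing fewer than min_matched MWs with the
    standard, are skipped without computing a correlation.
    """
    best = None
    for rt, spectrum in peaks:
        if abs(rt - std_rt) > window:
            continue
        matched = sum(1 for mw in std_spectrum if mw in spectrum)
        if matched < min_matched:
            continue
        r = pearson([std_spectrum.get(mw, 0) for mw in MW_RANGE],
                    [spectrum.get(mw, 0) for mw in MW_RANGE])
        if r is not None and (best is None or r > best[1]):
            best = (rt, r, std_rt - rt)
    return best


def average_shift(differences):
    """Average of the per-standard RT differences (std - sample)."""
    return sum(differences) / len(differences)


# The three RT differences reported in TEXT 2 average to about -0.0503.
print(round(average_shift([0.0145, -0.2026, 0.0372]), 4))
```

Requiring a minimum number of shared MWs before computing the correlation keeps a peak with only a handful of ions from scoring well by chance.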

MW     RT          Area
50     5.069267    6842.087
50     7.023433    7238.846
50     7.2612      194578
50     7.576817    517142.7
50     7.728133    226576.1
50     7.888017    6444.956
50     8.030733    7295.214
50     8.0914      39685.52
50     8.367967    143097.5
…
434    65.40522    7471.153
475    73.62917    6073.172
503    22.87343    21854.12
503    39.07448    6258.12
503    71.77838    5219.388
503    72.37943    5450.3
504    22.87343    8103.913
504    71.55365    5384.8
505    22.86927    7939.277

Supplementary Table 1 Example of an output file from the custom program, operated through Xcalibur executable files, that extracts SIC peaks from MW 50 to MW 650. As shown in the table, it starts with MW 50, extracts peaks according to RT, and continues up to the SIC with MW 650. In this example, no peaks were detected in the SICs above MW 505.

mass1N2-loader_v3.pl
This script uses *_report.txt files (e.g. F-L-7M-v435_F275-F272L7M_report.txt). The format of these files is shown in Supplementary Table 1.

#!/usr/local/bin/perl -w
#This program is for collecting useful data from GC/MS data file.
#MySQL table used: HJK_temp1
#Input file: *_report.txt
#Result file: mass1: *_report_peak.txt - not any more in v1.1
#             mass2: *_report_peak_ready.txt - not any more in v1.1
#             *_report_peak_ready.csv => will be used in mass3
#             $0_duplication.log - not any more in v1.1
#Author : Hyun Jo Koo
#Date : 2009-06-06
#Version 1.0
#Version 2.0: combine divided peaks, combine duplicated ions
#Version 3.0: when combining peaks, peaks that already have Reference number are not changed.
use DBI;
use Getopt::Long;
GetOptions ("path=s"=>\$path, "time=s"=>\$time);
if (!defined $path || !defined $time){
    print "Usage : $0 -path FOLDER_NAME -time SEARCH_TIME_FRAME
 For example: $0 -p project/ginger -t 0.005
 For path option, if you have files in ~/project/files, just input as project/files
 All input files should be *_report.txt files.\n\n";
    exit(1);
}

#Remove previous result file if exist
@file_names = glob ("~/$path/*.txt");
foreach $file_name (@file_names){
    if ($file_name =~ /_report_peak\.txt/){
        unlink ($file_name);
    }
}
#Ready to read the file contents.
@file_names = glob ("~/$path/*_report.txt");
$num_of_files=@file_names;
if ($num_of_files == 0){
    print " No file to process in the folder ~/$path\n\n";
    die;
}

print "Total file number: $num_of_files file\n";
$dbh = DBI->connect("DBI:mysql:your_ID","your_ID","pasword",{RaiseError=>1})
    or die("Connect error: $DBI::errstr");
#Make the data table
$sql1 = "CREATE TABLE HJK_temp1 (Number INT NOT NULL auto_increment, MW INT NOT NULL, RT decimal (2,4) NOT NULL, Intensity INT NOT NULL, Reference INT NOT NULL , primary key(Number))";
$sth1 = $dbh ->prepare ($sql1);
$sth1 ->execute;

$sth1 ->finish;
foreach $file_name (@file_names){
    $log_file = $file_name;
    $log_file =~ s/\.txt/\.log/;
    if (-e $log_file){
        unlink ($log_file);
    }
    open (OUTFILE, ">$log_file") or die "Error-cannot open log file";
    print OUTFILE "time frame: +/- $time\n\n";
    # The file name is F-L-7M-v435_F275-F272L7M_report.txt or
    # F-L-7M-v740_F276L7M.RAW_report.txt
    # The stored file name have path also.
    if ($file_name =~ /[^\s]+$path\/([^\s]+)_([^\s]+)_.+\.txt/){
        print "Reading... $1 $2";
        print OUTFILE "\nReading... $1 $2";
        $sample_id = $1."_".$2;
    }else{
        $file_name =~ /[^\s]+$path\/(.+)\.txt/;
        print "Reading... $1";
        print OUTFILE "\nReading... $1";
        $sample_id = $1;
    }
    #Insert file contents to the table.
    $sql3 = "INSERT INTO HJK_temp1 (MW, RT, Intensity, Reference) VALUES (?,?,?,?)";
    $sth3 = $dbh ->prepare ($sql3);
    open (INFILE, $file_name);
    #Get the information from text file.
    while ($line=<INFILE>){
        chomp ($line);
        ($MW, $RT, $Intensity) = split (/,/, $line);
        $sth3 ->execute ($MW, $RT, $Intensity, 0);
    }
    close (INFILE);
    print " ... done, now data handling\n";
    print OUTFILE " ... done, now data handling\n>: already included in other peak\n\n";
    $picked = "";
    # Data handling.
    $ref_num = 0;
    $prev_ref_num = 0;
    $sql4 = "SELECT MW, RT, Intensity, Number from HJK_temp1 ORDER BY Intensity DESC";
    $sth4 = $dbh ->prepare ($sql4);
    $sth4 ->execute;
    while (@row4 = $sth4->fetchrow_array) {
        print OUTFILE "#$row4[3]\t$row4[0]\t$row4[1]\t$row4[2]";
        if ($picked =~ / $row4[3] /){
            print OUTFILE " >\n";
            next;
        }
        print OUTFILE "\n";

$time1 = $row4[1]-$time; $time2 = $row4[1]+$time; $ref_count_total = 0; # execute; while (@row5 = $sth5 ->fetchrow_array){ print OUTFILE "\t#$row5[3]\t$row5[0]\t$row5[1]\t$row5[2]"; push (@temp_n, $row5[3]); push (@temp_n2, $row5[3]); push (@temp_rt, $row5[1]); $temp_ref = $row5[4]; if ($temp_ref != 0){ pop (@temp_n2); if ($ref_count_total > 0){ if ($temp_ref != $prev_temp_ref){ $third_ref = $temp_ref; $second_ref = $prev_temp_ref; $ref_count_second = $ref_count_others; } } $ref_count_others++; $prev_temp_ref = $temp_ref; }

if ($picked =~ / $row5[3] /){ print OUTFILE "\t(Ref# $row5[4])"; } $picked = "$picked $row5[3] "; $ref_count_total++; print OUTFILE "\n"; } $sth5 ->finish; if ($temp_ref == 0){ # execute; $sth6 ->finish; #

print OUTFILE "\t\t\tnumber: $curr_num\tnew ref: $prev_ref_num\n"; } $print = "\tRef: $prev_ref_num\n";

}elsif ($ref_count_second > 0){ # 0.3 or $ref_count_total < 10){ $sql14 = "SELECT RT from HJK_temp1 WHERE Reference = $second_ref ORDER BY Intensity"; $sth14 = $dbh ->prepare($sql14);

$sth14 ->execute;
@row14 = $sth14 ->fetchrow_array();
$second_rt = $row14[0];
$sth14 ->finish;
$sql13 = "SELECT RT from HJK_temp1 WHERE Reference = $third_ref ORDER BY Intensity";
$sth13 = $dbh ->prepare($sql13);
$sth13 ->execute;
@row13 = $sth13 ->fetchrow_array();
$third_rt = $row13[0];
$sth13 ->finish;

$num_temp = @temp_n2; $how_many_second = 0; $how_many_third = 0; for ($i = 1; $i abs ($curr_rt $third_rt)){ $ref_num = $third_ref; $how_many_third++; }else{ $ref_num = $second_ref; $how_many_second++ } $sql6 = "UPDATE HJK_temp1 SET Reference = $ref_num WHERE Number = $curr_num"; $sth6 = $dbh ->prepare($sql6); $sth6 ->execute; $sth6 ->finish; print OUTFILE "\t\t\t#$curr_num\tnow ref# $ref_num\n";

} $print = "\tRef: $second_ref ($how_many_second) or $third_ref ($how_many_third)\n"; }else{ $prev_ref_num++; $num_temp = @temp_n2; for ($i = 1; $i prepare($sql6); $sth6 ->execute; $sth6 ->finish; print OUTFILE "\t\t\t#$curr_num\tnow ref# $prev_ref_num\n";

#

} $print = "\tRef: $prev_ref_num\n"; }

}else{ # 0.3 or $ref_count_total < 10){ $ref_num = $temp_ref; $print = "\tRef: $ref_num\n"; }else{ $prev_ref_num++; $ref_num = $prev_ref_num; $print = "\tRef: $ref_num\n"; } $num_temp = @temp_n2;

for ($i = 1; $i prepare($sql6); $sth6 ->execute; $sth6 ->finish; print OUTFILE "\t\t\tnumber: $curr_num\tnew ref: $ref_num\n";

#

} } print OUTFILE "$print\n"; } $save_file = $file_name; $save_file =~ s/\.txt/_peak_ready\.csv/; if (-e $save_file){ unlink ($save_file); } open (OUTFILE_SAVE, ">$save_file") or die "Error-cannot open save file"; print OUTFILE_SAVE "sample,avgRT,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78 ,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108, 109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133, 134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158, 159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183, 184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208, 209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233, 234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258, 259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283, 284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308, 309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333, 334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358, 359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383, 384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408, 409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433, 434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458, 459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483, 
484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508, 509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533, 534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558, 559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583, 584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608, 609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633, 634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650\n";

$sql8 = "SELECT COUNT(DISTINCT(MW)), SUM(Intensity), AVG(RT), COUNT(MW), Reference FROM HJK_temp1 GROUP BY Reference"; $sth8 = $dbh ->prepare ($sql8); $sth8 ->execute; while (@row8 = $sth8 ->fetchrow_array){ $mw_num8 = $row8[0]; $Intensity8 = $row8[1]; $rt_avg8 = $row8[2]; $mw_count8 = $row8[3]; $ref8 = $row8[4]; if ($mw_num8 < 6 or $Intensity8 prepare ($sql12); $sth12 ->execute; print OUTFILE "unique mw#(total mw#): $mw_num8 ($mw_count8)\tArea: $Intensity8\tAvgRT: $rt_avg8\t\n"; while (@row12 = $sth12 ->fetchrow_array){ print OUTFILE "\t$row12[0]\t$row12[1]\n"; push (@row12_mw, $row12[0]); push (@row12_intensity, $row12[1]); } push (@row12_mw, 651); push (@row12_intensity, 651); $mw_check = shift (@row12_mw); $intensity_save = shift (@row12_intensity); $mw_check_num = @row12_mw; print OUTFILE "@row12_mw\n@row12_intensity\n\n"; for ($i = 50; $i prepare($sql8); ->execute; ->finish;

}
$sql9 = "DROP table HJK_temp1";
$sth9 = $dbh ->prepare($sql9);
$sth9 ->execute;
$sth9 ->finish;

$sth3 ->finish;
$sth4 ->finish;
$sth8 ->finish;
$sth12 ->finish;
print "\nAll processes have ended.\n\n";

$dbh -> disconnect;

sample

avgRT

50

51

52

684390

8370264

4763806

F-L-7M-v435_F275-F272L7M

8.16213457

F-L-7M-v435_F275-F272L7M

50.37659765

F-L-7M-v435_F275-F272L7M

21.25164299

34252

1334356

955566

F-L-7M-v435_F275-F272L7M

50.21014107

5715

274870

63851

F-L-7M-v435_F275-F272L7M

7.26568919

119708

1564040

733788

F-L-7M-v435_F275-F272L7M

50.12586818

F-L-7M-v435_F275-F272L7M

50.41525846

F-L-7M-v435_F275-F272L7M

50.49499882

F-L-7M-v435_F275-F272L7M

50.29177742

F-L-7M-v435_F275-F272L7M

50.30643889

F-L-7M-v435_F275-F272L7M

50.15315833

F-L-7M-v435_F275-F272L7M

50.44033333

F-L-7M-v435_F275-F272L7M

8.38023333

106515

F-L-7M-v435_F275-F272L7M

50.3185871

5631

F-L-7M-v435_F275-F272L7M

50.427

F-L-7M-v435_F275-F272L7M

7.74061333

144358

424256

69430

F-L-7M-v435_F275-F272L7M

8.44975

62358

567606

181597

F-L-7M-v435_F275-F272L7M

23.01714598

349538

224508

F-L-7M-v435_F275-F272L7M

50.47440625

F-L-7M-v435_F275-F272L7M

50.40039444

F-L-7M-v435_F275-F272L7M

50.34549565

F-L-7M-v435_F275-F272L7M

50.04528936

F-L-7M-v435_F275-F272L7M

50.136

F-L-7M-v435_F275-F272L7M

50.45322857

F-L-7M-v435_F275-F272L7M

68.05275167

F-L-7M-v435_F275-F272L7M

8.644115

F-L-7M-v435_F275-F272L7M

50.18794737

F-L-7M-v435_F275-F272L7M

50.11406818



648

649

650

121033

498276

505019

93716

100377 856310

521289

979429

124235

466217

24335

244559

89701

6418

15113

67082

Supplementary Table 2 Output format of mass1N2-loader_v3.pl, which is also the input format of mass3_RT_adj_v2.pl. For example, this table has 516 rows, each representing one peak in the F-L-7M-v435_F275-F272L7M sample. The avgRT column is the average RT of the ions combined into the peak. It is followed by 601 columns, one for each MW from 50 to 650, and each value is the area of that ion from its SIC.
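The row layout described in the caption (sample, avgRT, then one area column per MW from 50 to 650) can be sketched in Python as below. This is an illustration, not the original Perl output code. It assumes that an MW with no ion in the peak is written as an empty field, and the helper names and example areas are hypothetical.

```python
MW_MIN, MW_MAX = 50, 650


def header():
    """Column names: sample, avgRT, then one column per MW from 50 to 650."""
    return ["sample", "avgRT"] + [str(mw) for mw in range(MW_MIN, MW_MAX + 1)]


def dense_row(sample, avg_rt, areas):
    """Expand a sparse {MW: area} mapping into the fixed 601-column layout.

    Assumption: an MW with no ion in the peak becomes an empty field.
    """
    return [sample, avg_rt] + [areas.get(mw, "")
                               for mw in range(MW_MIN, MW_MAX + 1)]


def to_csv_line(fields):
    """Join fields with commas; no quoting is needed for these values."""
    return ",".join(str(field) for field in fields)


# Made-up areas for a single peak, for illustration only.
row = dense_row("F-L-7M-v435_F275-F272L7M", 8.45, {50: 1000, 52: 2500})
print(len(header()), len(row))
print(to_csv_line(row[:6]))
```

A fixed column per MW makes every peak a vector of the same length, which is convenient when peaks from different samples are compared column by column.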

mass3_RT_adj_v2.pl
This script uses *_report_peak_ready.csv files generated by mass1N2-loader_v3.pl (e.g. F-L-7M-v435_F275-F272L7M_report_peak_ready.csv). The format of these files is shown in Supplementary Table 2.

#!/usr/local/bin/perl -w
#This program is for collecting ions from GC/MS data SIC file.
#MySQL table used: HJK_temp2, pearson, max
#Input file: *_report_peak_ready.csv will be used by mass4
#Author : Hyun Jo Koo
#Date : 2009-06-06
#Version :1.0

use DBI;
use Getopt::Long;
GetOptions ("path=s"=>\$path, "time=s"=>\$time, "1st=s"=>\$st1, "2nd=s"=>\$st2, "3rd=s" =>\$st3);
if (!defined $path){
    print "Usage : $0 -path FOLDER_NAME -t Time_frame_for_standard_search -1st yes -2nd no -3rd yes\n
 For example: $0 -p project/ginger -t time_frame\n
 For path option, if you have files in ~/project/files, just input as project/files\n
 All input files should be *peak_ready.csv files.\n
 Time_frame is the range you want to find standards +/- time range. Default value is 0.5.\n
 1st, 2nd, 3rd are the standard for use or not for RT shift. Default use all three.\n";
    exit(1);
}
if (!defined $time){
    $time = 0.5;
}
if (!defined $st1){
    $st1 = "yes";
}
if (!defined $st2){
    $st2 = "yes";
}
if (!defined $st3){
    $st3 = "yes";
}
#Remove previous result file if exist
@file_names = glob ("~/$path/*.csv");
foreach $file_name (@file_names){
    if($file_name =~ /_peak_ready_RT_adj\.csv/){
        unlink ($file_name);
    }
}
#Ready to read the file contents.
@file_names = glob ("~/$path/*_peak_ready.csv");
$num_of_files=@file_names;
if ($num_of_files == 0){
    print " No file to process in the folder ~/$path\n\n";
    die;
}

236 #GC_standards from G-L-2M-v986_G20-G15L2M_report_peak.csv file #GC_std1 = (GC_standards,7.7406,155210,463481,72186,28833,,19578,,,6707,5270,69372,274953,536062,1407500,320510 ,1181445,74304,8462,,,6791,7391,65486,329482,226771,327857,54718,33188,,,,,,11401,55254,129991,16712 7,157881,123257,1539865,431483,12886255,990330,60743,,,16406,57520,26351,352287,27246,105472,6529,,, ,,,,,8687,12848,,,,,,,,,,10812,10222,64970,144171,1879070,5187913,920599,1566847,106520,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,); #GC_std2 = (GC_standards,8.4455,77146,579809,195119,290623,16091,29662,,113176,263339,204918,15841,27011,109766 ,289628,76741,395307,48472,38548,,,,,,22410,83154,83985,134702,1270458,482287,959513,75005,,,,,8289, 23636,28461,24907,144435,87320,1220109,171795,101876,,,,,10071,8545,9216,31084,203108,899833,649265, 13612812,1048576,37933,,,,,,14251,32161,442101,107121,393845,211090,1552486,7074306,594748,31325,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,); #GC_std3 = 
(GC_standards,26.7453,,238191,,,,,,,,,65330,187758,165083,160828,,,75377,157803,232594,92881,,13267, ,230625,355097,359347,78267,,,,,,,64217,226206,664657,487492,270517,,7117,15420,,,,,,,244735,259762, 494690,161166,376756,415084,190141,91435,22601,27731,42090,136156,234165,84384,126331,22278,19779,,, ,,14141,41573,198632,288499,,96748,11103,,,,,,,45951,,140011,112452,456121,1462349,598941,448974,128 345,12440,,8451,,28269,80721,53503,,,,,,,,,,13073,,,38332,,,,,,,,27717,8040,89843,335230,724698,5531 73,455361,258357,85919,,6172,,14285,37752,15154,33175,9403,10440,,,,,,,,16479,27872,,,,,,,,11846,,21 779,44603,113673,374384,3410285,628318,3190472,374963,1024231,103152,97726,,,,,,,,,,,,,,68592,10958, 86473,,,,,,,,,,14517,,204902,1970665,330364,2284560,296801,1141506,123158,248800,9005,17841,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,); #GC standard with 0 filled # sample,RT,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79, 80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109, 110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134, 135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159, 160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184, 185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209, 210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234, 
235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259, 260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284, 285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309, 310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334, 335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359, 360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384, 385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409, 410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434, 435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459, 460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484, 485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509, 510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534, 535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559, 560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584, 585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609, 610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634, 635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650 # GC_standards,7.7406,155210,463481,72186,28833,0,19578,0,0,6707,5270,69372,274953,536062,1407500,3205 10,1181445,74304,8462,0,0,6791,7391,65486,329482,226771,327857,54718,33188,0,0,0,0,0,11401,55254,129 991,167127,157881,123257,1539865,431483,12886255,990330,60743,0,0,16406,57520,26351,352287,27246,105 
472,6529,0,0,0,0,0,0,0,8687,12848,0,0,0,0,0,0,0,0,0,10812,10222,64970,144171,1879070,5187913,920599, 1566847,106520,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

237 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 # GC_standards,8.4455,77146,579809,195119,290623,16091,29662,0,113176,263339,204918,15841,27011,109766 ,289628,76741,395307,48472,38548,0,0,0,0,0,22410,83154,83985,134702,1270458,482287,959513,75005,0,0, 0,0,8289,23636,28461,24907,144435,87320,1220109,171795,101876,0,0,0,0,10071,8545,9216,31084,203108,8 99833,649265,13612812,1048576,37933,0,0,0,0,0,14251,32161,442101,107121,393845,211090,1552486,707430 6,594748,31325,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 # GC_standards,26.7453,0,238191,0,0,0,0,0,0,0,0,65330,187758,165083,160828,0,0,75377,157803,232594,928 81,0,13267,0,230625,355097,359347,78267,0,0,0,0,0,0,64217,226206,664657,487492,270517,0,7117,15420,0 ,0,0,0,0,0,244735,259762,494690,161166,376756,415084,190141,91435,22601,27731,42090,136156,234165,84 384,126331,22278,19779,0,0,0,0,14141,41573,198632,288499,0,96748,11103,0,0,0,0,0,0,45951,0,140011,11 2452,456121,1462349,598941,448974,128345,12440,0,8451,0,28269,80721,53503,0,0,0,0,0,0,0,0,0,13073,0, 0,38332,0,0,0,0,0,0,0,27717,8040,89843,335230,724698,553173,455361,258357,85919,0,6172,0,14285,37752 ,15154,33175,9403,10440,0,0,0,0,0,0,0,16479,27872,0,0,0,0,0,0,0,11846,0,21779,44603,113673,374384,34 10285,628318,3190472,374963,1024231,103152,97726,0,0,0,0,0,0,0,0,0,0,0,0,0,68592,10958,86473,0,0,0,0 ,0,0,0,0,0,14517,0,204902,1970665,330364,2284560,296801,1141506,123158,248800,9005,17841,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

print "Total file number: $num_of_files\n";
$dbh = DBI->connect("DBI:mysql:your_ID","your_ID","password",{RaiseError=>1}) or die("Connect error: $DBI::errstr");

$file_names[0] =~ /(.+)\/.+/;
$table_file = $1;
$0 =~ /(.+)\.pl/;
$table_file = $table_file."/".$1."_result_table.txt";
#$table_file = "mass3_RT_adj-Pearson_R_and_standards_info.txt";
if (-e $table_file){
    unlink($table_file);
}
open (OUTFILE_TABLE, ">$table_file") or die "Error-cannot open table file";
print OUTFILE_TABLE "sample\t\tstd1_RT(7.7406)\tstd1_R\tstd1_RT_diff\tstd2_RT(8.4455)\tstd2_R\tstd2_RT_diff\tstd3_RT(26.7453)\tstd3_R\tstd3_RT_diff\t\tRT_diff(std - smpl)\tpeak#\ttotal ion_num\taverage of ion_num\n";

#Make the data table
$sql1 = "CREATE TABLE HJK_temp2 (Number INT NOT NULL auto_increment, Sample char (30) NOT NULL, RT decimal (6,4) NOT NULL,
RT_adj decimal (6,4) NOT NULL, Area decimal(10,0), ion_num INT,
area50 INT,area51 INT,area52 INT,area53 INT,area54 INT,area55 INT,area56 INT,area57 INT,area58 INT,area59 INT,
area60 INT,area61 INT,area62 INT,area63 INT,area64 INT,area65 INT,area66 INT,area67 INT,area68 INT,area69 INT,
area70 INT,area71 INT,area72 INT,area73 INT,area74 INT,area75 INT,area76 INT,area77 INT,area78 INT,area79 INT,
area80 INT,area81 INT,area82 INT,area83 INT,area84 INT,area85 INT,area86 INT,area87 INT,area88 INT,area89 INT,
area90 INT,area91 INT,area92 INT,area93 INT,area94 INT,area95 INT,area96 INT,area97 INT,area98 INT,area99 INT,
area100 INT,area101 INT,area102 INT,area103 INT,area104 INT,area105 INT,area106 INT,area107 INT,area108 INT,area109 INT,
area110 INT,area111 INT,area112 INT,area113 INT,area114 INT,area115 INT,area116 INT,area117 INT,area118 INT,area119 INT,
area120 INT,area121 INT,area122 INT,area123 INT,area124 INT,area125 INT,area126 INT,area127 INT,area128 INT,area129 INT,
area130 INT,area131 INT,area132 INT,area133 INT,area134 INT,area135 INT,area136 INT,area137 INT,area138 INT,area139 INT,
area140 INT,area141 INT,area142 INT,area143 INT,area144 INT,area145 INT,area146 INT,area147 INT,area148 INT,area149 INT,
area150 INT,area151 INT,area152 INT,area153 INT,area154 INT,area155 INT,area156 INT,area157 INT,area158 INT,area159 INT,
area160 INT,area161 INT,area162 INT,area163 INT,area164 INT,area165 INT,area166 INT,area167 INT,area168 INT,area169 INT,
area170 INT,area171 INT,area172 INT,area173 INT,area174 INT,area175 INT,area176 INT,area177 INT,area178 INT,area179 INT,
area180 INT,area181 INT,area182 INT,area183 INT,area184 INT,area185 INT,area186 INT,area187 INT,area188 INT,area189 INT,
area190 INT,area191 INT,area192 INT,area193 INT,area194 INT,area195 INT,area196 INT,area197 INT,area198 INT,area199 INT,
area200 INT,area201 INT,area202 INT,area203 INT,area204 INT,area205 INT,area206 INT,area207 INT,area208 INT,area209 INT,
area210 INT,area211 INT,area212 INT,area213 INT,area214
INT,area215 INT,area216 INT,area217 INT,area218 INT,area219 INT,area220 INT,area221 INT,area222 INT,area223 INT,area224 INT,area225 INT,area226 INT,area227 INT,area228 INT,area229 INT,area230 INT,area231 INT,area232 INT,area233 INT,area234 INT,area235 INT,area236 INT,area237 INT,area238 INT,area239 INT,area240 INT,area241 INT,area242 INT,area243 INT,area244 INT,area245 INT,area246 INT,area247 INT,area248 INT,area249 INT,area250 INT,area251 INT,area252 INT,area253 INT,area254 INT,area255 INT,area256 INT,area257 INT,area258 INT,area259 INT,area260 INT,area261 INT,area262 INT,area263 INT,area264 INT,area265 INT,area266 INT,area267 INT,area268 INT,area269 INT,area270 INT,area271 INT,area272 INT,area273 INT,area274 INT,area275 INT,area276 INT,area277 INT,area278 INT,area279 INT,area280 INT,area281 INT,area282 INT,area283 INT,area284 INT,area285 INT,area286 INT,area287 INT,area288 INT,area289 INT,area290 INT,area291 INT,area292 INT,area293 INT,area294 INT,area295 INT,area296 INT,area297 INT,area298 INT,area299 INT,area300 INT,area301 INT,area302 INT,area303 INT,area304 INT,area305 INT,area306 INT,area307 INT,area308 INT,area309 INT,area310 INT,area311 INT,area312 INT,area313 INT,area314 INT,area315 INT,area316 INT,area317 INT,area318 INT,area319 INT,area320 INT,area321 INT,area322 INT,area323 INT,area324 INT,area325 INT,area326 INT,area327 INT,area328 INT,area329 INT,area330 INT,area331 INT,area332 INT,area333 INT,area334 INT,area335 INT,area336 INT,area337 INT,area338 INT,area339 INT,area340 INT,area341 INT,area342 INT,area343 INT,area344 INT,area345 INT,area346 INT,area347 INT,area348 INT,area349 INT,area350 INT,area351 INT,area352 INT,area353 INT,area354 INT,area355 INT,area356 INT,area357 INT,area358 INT,area359 INT,area360 INT,area361 INT,area362 INT,area363 INT,area364 INT,area365 INT,area366 INT,area367 INT,area368 INT,area369 INT,area370 INT,area371 INT,area372 INT,area373 INT,area374 INT,area375 INT,area376 INT,area377 INT,area378 INT,area379 INT,area380 
INT,area381 INT,area382 INT,area383 INT,area384 INT,area385 INT,area386 INT,area387 INT,area388 INT,area389 INT,area390 INT,area391 INT,area392 INT,area393 INT,area394 INT,area395 INT,area396 INT,area397 INT,area398 INT,area399 INT,area400 INT,area401 INT,area402 INT,area403 INT,area404 INT,area405 INT,area406 INT,area407 INT,area408 INT,area409 INT,area410 INT,area411 INT,area412 INT,area413 INT,area414 INT,area415 INT,area416 INT,area417 INT,area418 INT,area419 INT,area420 INT,area421 INT,area422 INT,area423 INT,area424 INT,area425 INT,area426 INT,area427 INT,area428 INT,area429 INT,area430 INT,area431 INT,area432 INT,area433 INT,area434 INT,area435 INT,area436 INT,area437 INT,area438 INT,area439 INT,area440 INT,area441 INT,area442 INT,area443 INT,area444 INT,area445 INT,area446 INT,area447 INT,area448 INT,area449 INT,area450 INT,area451 INT,area452 INT,area453 INT,area454 INT,area455 INT,area456 INT,area457 INT,area458 INT,area459 INT,area460 INT,area461 INT,area462 INT,area463 INT,area464 INT,area465 INT,area466 INT,area467 INT,area468 INT,area469 INT,area470 INT,area471 INT,area472 INT,area473 INT,area474 INT,area475 INT,area476 INT,area477 INT,area478 INT,area479 INT,area480 INT,area481 INT,area482 INT,area483 INT,area484 INT,area485 INT,area486 INT,area487 INT,area488 INT,area489 INT,area490 INT,area491 INT,area492 INT,area493 INT,area494 INT,area495 INT,area496 INT,area497 INT,area498 INT,area499 INT,area500 INT,area501 INT,area502 INT,area503 INT,area504 INT,area505 INT,area506 INT,area507 INT,area508 INT,area509 INT,area510 INT,area511 INT,area512 INT,area513 INT,area514 INT,area515 INT,area516 INT,area517 INT,area518 INT,area519 INT,area520 INT,area521 INT,area522 INT,area523 INT,area524 INT,area525 INT,area526 INT,area527 INT,area528 INT,area529 INT,area530 INT,area531 INT,area532 INT,area533 INT,area534 INT,area535 INT,area536 INT,area537 INT,area538 INT,area539 INT,area540 INT,area541 INT,area542 INT,area543 INT,area544 INT,area545 INT,area546 
INT,area547 INT,area548 INT,area549 INT,area550 INT,area551 INT,area552 INT,area553 INT,area554 INT,area555 INT,area556 INT,area557 INT,area558 INT,area559 INT,area560 INT,area561 INT,area562 INT,area563 INT,area564 INT,area565 INT,area566 INT,area567 INT,area568 INT,area569 INT,area570 INT,area571 INT,area572 INT,area573 INT,area574 INT,area575 INT,area576 INT,area577 INT,area578 INT,area579 INT,area580 INT,area581 INT,area582 INT,area583 INT,area584 INT,area585 INT,area586 INT,area587 INT,area588 INT,area589 INT,area590 INT,area591 INT,area592 INT,area593 INT,area594 INT,area595 INT,area596 INT,area597 INT,area598 INT,area599 INT,area600 INT,area601 INT,area602 INT,area603 INT,area604 INT,area605 INT,area606 INT,area607 INT,area608 INT,area609 INT,area610 INT,area611 INT,area612 INT,area613

INT,area614 INT,area615 INT,area616 INT,area617 INT,area618 INT,area619 INT,area620 INT,area621
INT,area622 INT,area623 INT,area624 INT,area625 INT,area626 INT,area627 INT,area628 INT,area629
INT,area630 INT,area631 INT,area632 INT,area633 INT,area634 INT,area635 INT,area636 INT,area637
INT,area638 INT,area639 INT,area640 INT,area641 INT,area642 INT,area643 INT,area644 INT,area645
INT,area646 INT,area647 INT,area648 INT,area649 INT,area650 INT,
primary key(Number))";
$sth1 = $dbh ->prepare ($sql1);
$sth1 ->execute;
$sth1 ->finish;

$sql17 ="CREATE TABLE pearson (x float NOT NULL, y float NOT NULL)";
$sth17 = $dbh ->prepare ($sql17);
$sth17 ->execute;
$sth17 ->finish;

$sql18 ="CREATE TABLE max (RT float NOT NULL, R float NOT NULL)";
$sth18 = $dbh ->prepare ($sql18);
$sth18 ->execute;
$sth18 ->finish;

$sql21 ="INSERT INTO max (RT, R) VALUES (?,?)"; $sth21 = $dbh ->prepare ($sql21);
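
The script compares each scan with a stored standard spectrum by loading matched ion intensities into the pearson table and computing the Pearson correlation in SQL. As a rough illustration only, the same statistic can be computed directly in Perl. The sketch below is not part of the original script: the subroutine name pearson_r, its array-reference interface, the sample numbers, and the choice to return undef when fewer than two ions match are all assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): Pearson correlation
# between a sample spectrum and a standard spectrum. Only channels where
# both intensities are > 0 are compared, as in the script's
# ($query > 0 and $std > 0) filter.
sub pearson_r {
    my ($query, $std) = @_;    # two array refs of equal length
    my ($n, $sx, $sy, $sxx, $syy, $sxy) = (0, 0, 0, 0, 0, 0);
    for my $i (0 .. $#$query) {
        my ($x, $y) = ($query->[$i], $std->[$i]);
        next unless defined $x && defined $y && $x > 0 && $y > 0;
        $n++;
        $sx  += $x;
        $sy  += $y;
        $sxx += $x * $x;
        $syy += $y * $y;
        $sxy += $x * $y;
    }
    # Assumed cut-off: with fewer than two matched ions r is undefined.
    return (undef, $n) if $n < 2;
    my $denom = ($sxx - $sx * $sx / $n) * ($syy - $sy * $sy / $n);
    return (undef, $n) if $denom <= 0;
    return (($sxy - $sx * $sy / $n) / sqrt($denom), $n);
}

# Made-up intensities: the second channel is skipped because the sample is 0.
my ($r, $matched) = pearson_r([10, 0, 30, 40], [20, 5, 60, 80]);
printf "r = %.4f over %d matched ions\n", $r, $matched;
```

Computing r in memory like this would avoid one INSERT per ion and one TRUNCATE per scan, at the cost of departing from the database-driven design used in the script.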

foreach $file_name (@file_names){
    $log_file = $file_name;
    $log_file =~ s/\.csv/_std\.log/;
    if (-e $log_file){
        unlink ($log_file);
    }
    open (OUTFILE, ">$log_file") or die "Error-cannot open log file";
    # The file name is F-L-7M-v435_F275-F272L7M_report.txt or
    # F-L-7M-v740_F276L7M.RAW_report.txt
    # The stored file name also includes the path.
    if ($file_name =~ /[^\s]+$path\/([^\_]+)_([^\_]+)_.+/){
        print "Reading... $1 $2";
        print OUTFILE "\nReading... $1 $2";
        print OUTFILE_TABLE "$1\t$2\t";
    }else{
        $file_name =~ /[^\s]+$path\/(.+)/;
        print "Reading... $1";
        print OUTFILE "\nReading... $1";
        print OUTFILE_TABLE "$1\t\t";
    }

    #open (OUTFILE_TWO, ">$save_file") or die "Error-cannot open RT_adj file";

    #Insert file contents to the table.
    #One column per m/z value: area50 ... area650
    @area_columns = map { "area$_" } (50 .. 650);
    $sql3 = "INSERT INTO HJK_temp2 (Sample, RT, Area, ion_num,"
        . join (",", @area_columns)
        . ") VALUES ("
        . join (",", ("?") x (4 + scalar (@area_columns)))
        . ")";
    $sth3 = $dbh ->prepare ($sql3);
    open (INFILE, $file_name);
    #Get the information from text file. Also calculate area based on SIC.
    $total_ion_num = 0;
    $m = 0;
    while ($line = <INFILE>){
        $m++;
        if ($m == 1){
            #The first line has column info, so skip it when importing to MySQL
            next;
        }
        chomp ($line);
        #Fields: Sample, RT, then one area value per m/z from 50 to 650
        @area = split (/,/, $line);
        $Sample = shift (@area);
        $RT = shift (@area);
        $area = 0;
        $ion_num = 0;
        foreach $each_area (@area){
            if ($each_area =~ /\d+/){
                $ion_num++;
                $total_ion_num++;
                $area += $each_area;
            }
        }
        # print OUTFILE "\t$RT\t$area\n";
        $sth3 ->execute ($Sample, $RT, $area, $ion_num, @area[0 .. $#area_columns]);
        splice (@area);
    }
    close (INFILE);
    print " ... done, now data handling\n";
    print OUTFILE " ... done, now data handling\n";

    # Data handling.
    $RT_range_r1 = 7.7406 - $time;
    $RT_range_f1 = 7.7406 + $time;
    $RT_range_r2 = 8.4455 - $time;
    $RT_range_f2 = 8.4455 + $time;
    $RT_range_r3 = 26.7453 - $time;
    $RT_range_f3 = 26.7453 + $time;

    $sql2 = "INSERT INTO pearson (x, y) values (?,?)";
    $sth2 = $dbh ->prepare ($sql2);
    #Finding GC standard1
    if ($st1 eq "yes"){
        print OUTFILE "Standard1: 7.7406\n";
        print "Standard1\n";
        #Select Sample, RT, the per-ion columns area50 ... area650, and Number
        $sql4 = "SELECT Sample, RT, "
            . join (",", map { "area$_" } (50 .. 650))
            . ", Number from HJK_temp2 WHERE RT between $RT_range_r1 and $RT_range_f1 ORDER BY RT";
        $sth4 = $dbh ->prepare ($sql4);
        $sth4 ->execute;
        while (@row4 = $sth4->fetchrow_array) {
            #GC_standards from G-L-2M-v986_G20-G15L2M_report_peak.csv file
            #Intensities for m/z 50 ... 650; every value above m/z 129 is 0
            @GC_std1 = (
                155210,463481,72186,28833,0,19578,0,0,6707,5270,
                69372,274953,536062,1407500,320510,1181445,74304,8462,0,0,
                6791,7391,65486,329482,226771,327857,54718,33188,0,0,
                0,0,0,11401,55254,129991,167127,157881,123257,1539865,
                431483,12886255,990330,60743,0,0,16406,57520,26351,352287,
                27246,105472,6529,0,0,0,0,0,0,0,
                8687,12848,0,0,0,0,0,0,0,0,
                0,10812,10222,64970,144171,1879070,5187913,920599,1566847,106520,
                (0) x 521);
            $column1 = shift (@row4);
            $column2 = shift (@row4);
            $n = 0;

for ($i = 1; $i 0 and $std > 0){
        $n++;
        $sth2 ->execute ($query, $std);
    }
}
if ($n prepare($sql8);
    $sth8 ->execute;
    next;
}
$sql5 ="SELECT (sum(x*y)-sum(x)*sum(y)/count(x))/SQRT((sum(x*x)-sum(x)*sum(x)/count(x))*(sum(y*y)-sum(y)*sum(y)/count(y))) from pearson";
$sth5 = $dbh ->prepare ($sql5);
$sth5 ->execute;
@row5 = $sth5 ->fetchrow_array;
$pearson_r = $row5[0];
#

print "\t$column1\t$column2\tPearson R value : $pearson_r (matched MW:$n)\n";
print OUTFILE "\t$column1\t$column2\tPearson R value : $pearson_r (matched MW:$n)\n";
$sth21 ->execute ($column2, $pearson_r);
$sql8 = "TRUNCATE table pearson";
$sth8 = $dbh ->prepare($sql8);
$sth8 ->execute;
}
$sql20 ="SELECT RT, R FROM max ORDER BY R DESC";
$sth20 = $dbh ->prepare ($sql20);
$sth20 ->execute;
@standard_found = $sth20 ->fetchrow_array;
$RT_standard_found1 = $standard_found[0];
$R_standard_found1 = $standard_found[1];
$difference1 = 7.7461 - $RT_standard_found1;
print OUTFILE "\t$RT_standard_found1\t$R_standard_found1\t(RT difference: $difference1)\n";
print OUTFILE_TABLE "$RT_standard_found1\t$R_standard_found1\t$difference1\t";

$sql19 = "TRUNCATE table max"; $sth19 = $dbh ->prepare($sql19); $sth19 ->execute; }else{ print OUTFILE "Standard1: 7.7406 skipped\n"; print OUTFILE_TABLE "\t\t\t"; } #Finding GC standard2 if ($st2 eq "yes"){ print OUTFILE "Standard2: 8.4455\n"; print "Standard2\n"; $sql4 = "SELECT Sample, RT, area50,area51,area52,area53,area54,area55,area56,area57,area58,area59,area60,area61,area62,area63,ar ea64,area65,area66,area67,area68,area69,area70,area71,area72,area73,area74,area75,area76,area77,area 78,area79,area80,area81,area82,area83,area84,area85,area86,area87,area88,area89,area90,area91,area92 ,area93,area94,area95,area96,area97,area98,area99,area100,area101,area102,area103,area104,area105,ar ea106,area107,area108,area109,area110,area111,area112,area113,area114,area115,area116,area117,area11 8,area119,area120,area121,area122,area123,area124,area125,area126,area127,area128,area129,area130,ar ea131,area132,area133,area134,area135,area136,area137,area138,area139,area140,area141,area142,area14 3,area144,area145,area146,area147,area148,area149,area150,area151,area152,area153,area154,area155,ar ea156,area157,area158,area159,area160,area161,area162,area163,area164,area165,area166,area167,area16 8,area169,area170,area171,area172,area173,area174,area175,area176,area177,area178,area179,area180,ar ea181,area182,area183,area184,area185,area186,area187,area188,area189,area190,area191,area192,area19 3,area194,area195,area196,area197,area198,area199,area200,area201,area202,area203,area204,area205,ar

ea206,area207,area208,area209,area210,area211,area212,area213,area214,area215,area216,area217,area218,area219,area220,area221,area222,area223,area224,area225,area226,area227,area228,area229,area230,area231,area232,area233,area234,area235,area236,area237,area238,area239,area240,area241,area242,area243,area244,area245,area246,area247,area248,area249,area250,area251,area252,area253,area254,area255,area256,area257,area258,area259,area260,area261,area262,area263,area264,area265,area266,area267,area268,area269,area270,area271,area272,area273,area274,area275,area276,area277,area278,area279,area280,area281,area282,area283,area284,area285,area286,area287,area288,area289,area290,area291,area292,area293,area294,area295,area296,area297,area298,area299,area300,area301,area302,area303,area304,area305,area306,area307,area308,area309,area310,area311,area312,area313,area314,area315,area316,area317,area318,area319,area320,area321,area322,area323,area324,area325,area326,area327,area328,area329,area330,area331,area332,area333,area334,area335,area336,area337,area338,area339,area340,area341,area342,area343,area344,area345,area346,area347,area348,area349,area350,area351,area352,area353,area354,area355,area356,area357,area358,area359,area360,area361,area362,area363,area364,area365,area366,area367,area368,area369,area370,area371,area372,area373,area374,area375,area376,area377,area378,area379,area380,area381,area382,area383,area384,area385,area386,area387,area388,area389,area390,area391,area392,area393,area394,area395,area396,area397,area398,area399,area400,area401,area402,area403,area404,area405,area406,area407,area408,area409,area410,area411,area412,area413,area414,area415,area416,area417,area418,area419,area420,area421,area422,area423,area424,area425,area426,area427,area428,area429,area430,area431,area432,area433,area434,area435,area436,area437,area438,area439,area440,area441,area442,area44
3,area444,area445,area446,area447,area448,area449,area450,area451,area452,area453,area454,area455,ar ea456,area457,area458,area459,area460,area461,area462,area463,area464,area465,area466,area467,area46 8,area469,area470,area471,area472,area473,area474,area475,area476,area477,area478,area479,area480,ar ea481,area482,area483,area484,area485,area486,area487,area488,area489,area490,area491,area492,area49 3,area494,area495,area496,area497,area498,area499,area500,area501,area502,area503,area504,area505,ar ea506,area507,area508,area509,area510,area511,area512,area513,area514,area515,area516,area517,area51 8,area519,area520,area521,area522,area523,area524,area525,area526,area527,area528,area529,area530,ar ea531,area532,area533,area534,area535,area536,area537,area538,area539,area540,area541,area542,area54 3,area544,area545,area546,area547,area548,area549,area550,area551,area552,area553,area554,area555,ar ea556,area557,area558,area559,area560,area561,area562,area563,area564,area565,area566,area567,area56 8,area569,area570,area571,area572,area573,area574,area575,area576,area577,area578,area579,area580,ar ea581,area582,area583,area584,area585,area586,area587,area588,area589,area590,area591,area592,area59 3,area594,area595,area596,area597,area598,area599,area600,area601,area602,area603,area604,area605,ar ea606,area607,area608,area609,area610,area611,area612,area613,area614,area615,area616,area617,area61 8,area619,area620,area621,area622,area623,area624,area625,area626,area627,area628,area629,area630,ar ea631,area632,area633,area634,area635,area636,area637,area638,area639,area640,area641,area642,area64 3,area644,area645,area646,area647,area648,area649,area650, Number from HJK_temp2 WHERE RT between $RT_range_r2 and $RT_range_f2 ORDER BY RT"; $sth4 = $dbh ->prepare ($sql4); $sth4 ->execute; while (@row4 = $sth4->fetchrow_array) { #GC_standards from G-L-2M-v986_G20-G15L2M_report_peak.csv file @GC_std2 = 
(77146,579809,195119,290623,16091,29662,0,113176,263339,204918,15841,27011,109766,289628,76741,39530 7,48472,38548,0,0,0,0,0,22410,83154,83985,134702,1270458,482287,959513,75005,0,0,0,0,8289,23636,2846 1,24907,144435,87320,1220109,171795,101876,0,0,0,0,10071,8545,9216,31084,203108,899833,649265,136128 12,1048576,37933,0,0,0,0,0,14251,32161,442101,107121,393845,211090,1552486,7074306,594748,31325,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0); $column1 = shift (@row4); $column2 = shift (@row4); $n = 0; for ($i = 1; $i 0 and $std > 0){ $n++; $sth2 ->execute ($query, $std); } } if ($n prepare($sql8); $sth8 ->execute; next; } $sql5 ="SELECT (sum(x*y)-sum(x)*sum(y)/count(x))/SQRT((sum(x*x)sum(x)*sum(x)/count(x))*(sum(y*y)-sum(y)*sum(y)/count(y))) from pearson"; $sth5 = $dbh ->prepare ($sql5); $sth5 ->execute; @row5 = $sth5 ->fetchrow_array; $pearson_r = $row5[0]; #

print "\t$column1\t$column2\tPearson R value : $pearson_r (matched MW:$n)\n";
print OUTFILE "\t$column1\t$column2\tPearson R value : $pearson_r (matched MW:$n)\n";
$sth21 ->execute ($column2, $pearson_r);
$sql8 = "TRUNCATE table pearson";
$sth8 = $dbh ->prepare($sql8);
$sth8 ->execute;
}
$sql20 ="SELECT RT, R FROM max ORDER BY R DESC";
$sth20 = $dbh ->prepare ($sql20);
$sth20 ->execute;
@standard_found = $sth20 ->fetchrow_array;
$RT_standard_found2 = $standard_found[0];
$R_standard_found2 = $standard_found[1];
$difference2 = 8.4455 - $RT_standard_found2;
print OUTFILE "\t$RT_standard_found2\t$R_standard_found2\t(RT difference: $difference2)\n";
print OUTFILE_TABLE "$RT_standard_found2\t$R_standard_found2\t$difference2\t";

$sql19 = "TRUNCATE table max"; $sth19 = $dbh ->prepare($sql19); $sth19 ->execute; }else{ print OUTFILE "Standard2: 8.4455 skipped\n"; print OUTFILE_TABLE "\t\t\t"; } #Finding GC standard3 if ($st3 eq "yes"){ print OUTFILE "Standard3: 26.7453\n"; print "Standard3\n"; $sql4 = "SELECT Sample, RT, area50,area51,area52,area53,area54,area55,area56,area57,area58,area59,area60,area61,area62,area63,ar ea64,area65,area66,area67,area68,area69,area70,area71,area72,area73,area74,area75,area76,area77,area 78,area79,area80,area81,area82,area83,area84,area85,area86,area87,area88,area89,area90,area91,area92 ,area93,area94,area95,area96,area97,area98,area99,area100,area101,area102,area103,area104,area105,ar ea106,area107,area108,area109,area110,area111,area112,area113,area114,area115,area116,area117,area11 8,area119,area120,area121,area122,area123,area124,area125,area126,area127,area128,area129,area130,ar ea131,area132,area133,area134,area135,area136,area137,area138,area139,area140,area141,area142,area14 3,area144,area145,area146,area147,area148,area149,area150,area151,area152,area153,area154,area155,ar ea156,area157,area158,area159,area160,area161,area162,area163,area164,area165,area166,area167,area16 8,area169,area170,area171,area172,area173,area174,area175,area176,area177,area178,area179,area180,ar ea181,area182,area183,area184,area185,area186,area187,area188,area189,area190,area191,area192,area19 3,area194,area195,area196,area197,area198,area199,area200,area201,area202,area203,area204,area205,ar ea206,area207,area208,area209,area210,area211,area212,area213,area214,area215,area216,area217,area21 8,area219,area220,area221,area222,area223,area224,area225,area226,area227,area228,area229,area230,ar ea231,area232,area233,area234,area235,area236,area237,area238,area239,area240,area241,area242,area24 3,area244,area245,area246,area247,area248,area249,area250,area251,area252,area253,area254,area255,ar 
ea256,area257,area258,area259,area260,area261,area262,area263,area264,area265,area266,area267,area26 8,area269,area270,area271,area272,area273,area274,area275,area276,area277,area278,area279,area280,ar ea281,area282,area283,area284,area285,area286,area287,area288,area289,area290,area291,area292,area29 3,area294,area295,area296,area297,area298,area299,area300,area301,area302,area303,area304,area305,ar ea306,area307,area308,area309,area310,area311,area312,area313,area314,area315,area316,area317,area31 8,area319,area320,area321,area322,area323,area324,area325,area326,area327,area328,area329,area330,ar

ea331,area332,area333,area334,area335,area336,area337,area338,area339,area340,area341,area342,area343,area344,area345,area346,area347,area348,area349,area350,area351,area352,area353,area354,area355,area356,area357,area358,area359,area360,area361,area362,area363,area364,area365,area366,area367,area368,area369,area370,area371,area372,area373,area374,area375,area376,area377,area378,area379,area380,area381,area382,area383,area384,area385,area386,area387,area388,area389,area390,area391,area392,area393,area394,area395,area396,area397,area398,area399,area400,area401,area402,area403,area404,area405,area406,area407,area408,area409,area410,area411,area412,area413,area414,area415,area416,area417,area418,area419,area420,area421,area422,area423,area424,area425,area426,area427,area428,area429,area430,area431,area432,area433,area434,area435,area436,area437,area438,area439,area440,area441,area442,area443,area444,area445,area446,area447,area448,area449,area450,area451,area452,area453,area454,area455,area456,area457,area458,area459,area460,area461,area462,area463,area464,area465,area466,area467,area468,area469,area470,area471,area472,area473,area474,area475,area476,area477,area478,area479,area480,area481,area482,area483,area484,area485,area486,area487,area488,area489,area490,area491,area492,area493,area494,area495,area496,area497,area498,area499,area500,area501,area502,area503,area504,area505,area506,area507,area508,area509,area510,area511,area512,area513,area514,area515,area516,area517,area518,area519,area520,area521,area522,area523,area524,area525,area526,area527,area528,area529,area530,area531,area532,area533,area534,area535,area536,area537,area538,area539,area540,area541,area542,area543,area544,area545,area546,area547,area548,area549,area550,area551,area552,area553,area554,area555,area556,area557,area558,area559,area560,area561,area562,area563,area564,area565,area566,area567,area56
8,area569,area570,area571,area572,area573,area574,area575,area576,area577,area578,area579,area580,ar ea581,area582,area583,area584,area585,area586,area587,area588,area589,area590,area591,area592,area59 3,area594,area595,area596,area597,area598,area599,area600,area601,area602,area603,area604,area605,ar ea606,area607,area608,area609,area610,area611,area612,area613,area614,area615,area616,area617,area61 8,area619,area620,area621,area622,area623,area624,area625,area626,area627,area628,area629,area630,ar ea631,area632,area633,area634,area635,area636,area637,area638,area639,area640,area641,area642,area64 3,area644,area645,area646,area647,area648,area649,area650, Number from HJK_temp2 WHERE RT between $RT_range_r3 and $RT_range_f3 ORDER BY RT"; $sth4 = $dbh ->prepare ($sql4); $sth4 ->execute; while (@row4 = $sth4->fetchrow_array) { #GC_standards from G-L-2M-v986_G20-G15L2M_report_peak.csv file @GC_std3 = (0,238191,0,0,0,0,0,0,0,0,65330,187758,165083,160828,0,0,75377,157803,232594,92881,0,13267,0,230625, 355097,359347,78267,0,0,0,0,0,0,64217,226206,664657,487492,270517,0,7117,15420,0,0,0,0,0,0,244735,25 9762,494690,161166,376756,415084,190141,91435,22601,27731,42090,136156,234165,84384,126331,22278,197 79,0,0,0,0,14141,41573,198632,288499,0,96748,11103,0,0,0,0,0,0,45951,0,140011,112452,456121,1462349, 598941,448974,128345,12440,0,8451,0,28269,80721,53503,0,0,0,0,0,0,0,0,0,13073,0,0,38332,0,0,0,0,0,0, 0,27717,8040,89843,335230,724698,553173,455361,258357,85919,0,6172,0,14285,37752,15154,33175,9403,10 440,0,0,0,0,0,0,0,16479,27872,0,0,0,0,0,0,0,11846,0,21779,44603,113673,374384,3410285,628318,3190472 ,374963,1024231,103152,97726,0,0,0,0,0,0,0,0,0,0,0,0,0,68592,10958,86473,0,0,0,0,0,0,0,0,0,14517,0,2 04902,1970665,330364,2284560,296801,1141506,123158,248800,9005,17841,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0); $column1 = shift (@row4); $column2 = shift (@row4); $n = 0; for ($i = 1; $i 0 and $std > 0){ $n++; $sth2 ->execute ($query, $std); } } if ($n prepare($sql8); $sth8 ->execute; next; } $sql5 ="SELECT (sum(x*y)-sum(x)*sum(y)/count(x))/SQRT((sum(x*x)-

sum(x)*sum(x)/count(x))*(sum(y*y)-sum(y)*sum(y)/count(y))) from pearson";
$sth5 = $dbh ->prepare ($sql5);
$sth5 ->execute;
@row5 = $sth5 ->fetchrow_array;
$pearson_r = $row5[0];
#
print "\t$column1\t$column2\tPearson R value : $pearson_r (matched MW:$n)\n";
print OUTFILE "\t$column1\t$column2\tPearson R value : $pearson_r (matched MW:$n)\n";
$sth21 ->execute ($column2, $pearson_r);
$sql8 = "TRUNCATE table pearson";
$sth8 = $dbh ->prepare($sql8);
$sth8 ->execute;
}
$sql20 ="SELECT RT, R FROM max ORDER BY R DESC";
$sth20 = $dbh ->prepare ($sql20);
$sth20 ->execute;
@standard_found = $sth20 ->fetchrow_array;
$RT_standard_found3 = $standard_found[0];
$R_standard_found3 = $standard_found[1];
$difference3 = 26.7453 - $RT_standard_found3;
print OUTFILE "\t$RT_standard_found3\t$R_standard_found3\t(RT difference: $difference3)\n";
print OUTFILE_TABLE "$RT_standard_found3\t$R_standard_found3\t$difference3\t";
$sql19 = "TRUNCATE table max";
$sth19 = $dbh ->prepare($sql19);
$sth19 ->execute;
}else{
    print OUTFILE "Standard3: 26.7453 skipped\n";
    print OUTFILE_TABLE "\t\t\t";
}
#three standards ended here

#calculating average difference
if ($st1 eq "yes" and $st2 eq "yes" and $st3 eq "yes"){
    $RT_difference_with_standard = ($difference1 + $difference2 + $difference3)/3;
    print OUTFILE "\n\tThe average difference with standards (std - sample): $RT_difference_with_standard\n";
    print OUTFILE_TABLE "\t$RT_difference_with_standard\t";
}elsif ($st1 eq "yes" and $st2 eq "yes" and $st3 ne "yes"){
    $RT_difference_with_standard = ($difference1 + $difference2)/2;
    print OUTFILE "\n\tThe average difference with standards (std - sample): $RT_difference_with_standard\n";
    print OUTFILE_TABLE "\t$RT_difference_with_standard\t";
}elsif ($st1 eq "yes" and $st2 ne "yes" and $st3 eq "yes"){
    $RT_difference_with_standard = ($difference1 + $difference3)/2;
    print OUTFILE "\n\tThe average difference with standards (std - sample): $RT_difference_with_standard\n";
    print OUTFILE_TABLE "\t$RT_difference_with_standard\t";
}elsif ($st1 ne "yes" and $st2 eq "yes" and $st3 eq "yes"){
    $RT_difference_with_standard = ($difference2 + $difference3)/2;
    print OUTFILE "\n\tThe average difference with standards (std - sample): $RT_difference_with_standard\n";
    print OUTFILE_TABLE "\t$RT_difference_with_standard\t";
}elsif ($st1 eq "yes" and $st2 ne "yes" and $st3 ne "yes"){
    $RT_difference_with_standard = $difference1;
    print OUTFILE "\n\tThe average difference with standards (std - sample): $RT_difference_with_standard\n";
    print OUTFILE_TABLE "\t$RT_difference_with_standard\t";
}elsif ($st1 ne "yes" and $st2 eq "yes" and $st3 ne "yes"){
    $RT_difference_with_standard = $difference2;
    print OUTFILE "\n\tThe average difference with standards (std - sample): $RT_difference_with_standard\n";
    print OUTFILE_TABLE "\t$RT_difference_with_standard\t";
}elsif ($st1 ne "yes" and $st2 ne "yes" and $st3 eq "yes"){
    $RT_difference_with_standard = $difference3;
    print OUTFILE "\n\tThe average difference with standards (std - sample): $RT_difference_with_standard\n";
    print OUTFILE_TABLE "\t$RT_difference_with_standard\t";
}else{
    $RT_difference_with_standard = 0;
    print OUTFILE "\n\tThe average difference with standards (std - sample): $RT_difference_with_standard\n";
    print OUTFILE_TABLE "\t$RT_difference_with_standard\t";
}

#let's adjust RT and also add area in new file $save_file = $file_name; $save_file =~ s/\.csv/_RT_adj\.csv/; if (-e $save_file){ unlink ($save_file); } open (OUTFILE_NEW, ">$save_file") or die "Error-cannot open save file"; print OUTFILE_NEW "sample,RT,RT_adj,Area,ion_num,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72, 73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104 ,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129 ,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154 ,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179 ,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204 ,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229 ,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254 ,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279 ,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304 ,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329 ,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354 ,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379 ,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404 ,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429 ,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454 ,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479 
,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650\n";
$sql6 = "SELECT RT from HJK_temp2 ORDER BY RT";
$sth6 = $dbh ->prepare ($sql6);
$sth6 ->execute;
while (@row6 = $sth6->fetchrow_array) {
    $RT = $row6[0];
    $RT_adj = $RT + $RT_difference_with_standard;
    $sql13 = "UPDATE HJK_temp2 SET RT_adj = $RT_adj WHERE RT = $RT";
    $sth13 = $dbh ->prepare($sql13);
    $sth13 ->execute;
}

#Write new table to new file $sql7 = "SELECT Sample, RT, RT_adj, Area, ion_num, area50,area51,area52,area53,area54,area55,area56,area57,area58,area59,area60,area61,area62,area63,ar ea64,area65,area66,area67,area68,area69,area70,area71,area72,area73,area74,area75,area76,area77,area 78,area79,area80,area81,area82,area83,area84,area85,area86,area87,area88,area89,area90,area91,area92 ,area93,area94,area95,area96,area97,area98,area99,area100,area101,area102,area103,area104,area105,ar ea106,area107,area108,area109,area110,area111,area112,area113,area114,area115,area116,area117,area11 8,area119,area120,area121,area122,area123,area124,area125,area126,area127,area128,area129,area130,ar ea131,area132,area133,area134,area135,area136,area137,area138,area139,area140,area141,area142,area14 3,area144,area145,area146,area147,area148,area149,area150,area151,area152,area153,area154,area155,ar ea156,area157,area158,area159,area160,area161,area162,area163,area164,area165,area166,area167,area16 8,area169,area170,area171,area172,area173,area174,area175,area176,area177,area178,area179,area180,ar ea181,area182,area183,area184,area185,area186,area187,area188,area189,area190,area191,area192,area19

3,area194,area195,area196,area197,area198,area199,area200,area201,area202,area203,area204,area205,area206,area207,area208,area209,area210,area211,area212,area213,area214,area215,area216,area217,area218,area219,area220,area221,area222,area223,area224,area225,area226,area227,area228,area229,area230,area231,area232,area233,area234,area235,area236,area237,area238,area239,area240,area241,area242,area243,area244,area245,area246,area247,area248,area249,area250,area251,area252,area253,area254,area255,area256,area257,area258,area259,area260,area261,area262,area263,area264,area265,area266,area267,area268,area269,area270,area271,area272,area273,area274,area275,area276,area277,area278,area279,area280,area281,area282,area283,area284,area285,area286,area287,area288,area289,area290,area291,area292,area293,area294,area295,area296,area297,area298,area299,area300,area301,area302,area303,area304,area305,area306,area307,area308,area309,area310,area311,area312,area313,area314,area315,area316,area317,area318,area319,area320,area321,area322,area323,area324,area325,area326,area327,area328,area329,area330,area331,area332,area333,area334,area335,area336,area337,area338,area339,area340,area341,area342,area343,area344,area345,area346,area347,area348,area349,area350,area351,area352,area353,area354,area355,area356,area357,area358,area359,area360,area361,area362,area363,area364,area365,area366,area367,area368,area369,area370,area371,area372,area373,area374,area375,area376,area377,area378,area379,area380,area381,area382,area383,area384,area385,area386,area387,area388,area389,area390,area391,area392,area393,area394,area395,area396,area397,area398,area399,area400,area401,area402,area403,area404,area405,area406,area407,area408,area409,area410,area411,area412,area413,area414,area415,area416,area417,area418,area419,area420,area421,area422,area423,area424,area425,area426,area427,area428,area429,area430,ar
ea431,area432,area433,area434,area435,area436,area437,area438,area439,area440,area441,area442,area44 3,area444,area445,area446,area447,area448,area449,area450,area451,area452,area453,area454,area455,ar ea456,area457,area458,area459,area460,area461,area462,area463,area464,area465,area466,area467,area46 8,area469,area470,area471,area472,area473,area474,area475,area476,area477,area478,area479,area480,ar ea481,area482,area483,area484,area485,area486,area487,area488,area489,area490,area491,area492,area49 3,area494,area495,area496,area497,area498,area499,area500,area501,area502,area503,area504,area505,ar ea506,area507,area508,area509,area510,area511,area512,area513,area514,area515,area516,area517,area51 8,area519,area520,area521,area522,area523,area524,area525,area526,area527,area528,area529,area530,ar ea531,area532,area533,area534,area535,area536,area537,area538,area539,area540,area541,area542,area54 3,area544,area545,area546,area547,area548,area549,area550,area551,area552,area553,area554,area555,ar ea556,area557,area558,area559,area560,area561,area562,area563,area564,area565,area566,area567,area56 8,area569,area570,area571,area572,area573,area574,area575,area576,area577,area578,area579,area580,ar ea581,area582,area583,area584,area585,area586,area587,area588,area589,area590,area591,area592,area59 3,area594,area595,area596,area597,area598,area599,area600,area601,area602,area603,area604,area605,ar ea606,area607,area608,area609,area610,area611,area612,area613,area614,area615,area616,area617,area61 8,area619,area620,area621,area622,area623,area624,area625,area626,area627,area628,area629,area630,ar ea631,area632,area633,area634,area635,area636,area637,area638,area639,area640,area641,area642,area64 3,area644,area645,area646,area647,area648,area649,area650 from HJK_temp2 ORDER BY RT_adj"; $sth7 = $dbh ->prepare ($sql7); $sth7 ->execute; $j = 0; while (@row7 = $sth7->fetchrow_array) { $j++; $i = 0; foreach $each (@row7){ $i++; chomp ($each); if ($i == 1){ print OUTFILE_NEW 
"$each";
        }else{
            print OUTFILE_NEW ",$each";
        }
    }
    print OUTFILE_NEW "\n";
}
$average_ion_num = $total_ion_num/$j;
print OUTFILE_TABLE "$j\t$total_ion_num\t$average_ion_num\n"; #The number of peaks
$sql11 = "TRUNCATE table HJK_temp2";
$sth11 = $dbh ->prepare ($sql11);
$sth11 ->execute;
}
$sth2 ->finish;
if ($st1 eq "yes" or $st2 eq "yes" or $st3 eq "yes"){
    $sth4 ->finish;
    $sth5 ->finish;
    $sth8 ->finish;
    $sth19 ->finish;
    $sth20 ->finish;
}
$sth6 ->finish;
$sth7 ->finish;
$sth21 ->finish;

$sql9 = "DROP table HJK_temp2";
$sth9 = $dbh ->prepare($sql9);
$sth9 ->execute;
$sth9 ->finish;

$sql10 = "DROP table pearson";
$sth10 = $dbh ->prepare($sql10);
$sth10 ->execute;
$sth10 ->finish;

$sql25 = "DROP table max";
$sth25 = $dbh ->prepare($sql25);
$sth25 ->execute;
$sth25 ->finish;

print "\nAll processes have ended.\n\n";
print OUTFILE "\nAll processes have ended.\n\n";

$dbh -> disconnect;
close (OUTFILE);
close (OUTFILE_TABLE);
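The standard-matching and RT-adjustment steps of the script above can also be illustrated outside MySQL. What follows is a minimal Python sketch, not the author's code. It scores each candidate peak against a standard spectrum with the same sums-based Pearson formula used in the SQL query, keeps only ions that are non-zero in both spectra, and averages the RT differences of the standards that were found. The function names, the `min_shared` threshold, and the five-channel example spectra are assumptions made purely for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson r from running sums, mirroring the SQL expression in the listing."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    var_x = sum(x * x for x in xs) - sx * sx / n
    var_y = sum(y * y for y in ys) - sy * sy / n
    if var_x <= 0 or var_y <= 0:
        return None  # undefined when either series is constant
    return (sxy - sx * sy / n) / math.sqrt(var_x * var_y)


def match_standard(peaks, std_spectrum, std_rt, min_shared=3):
    """Return (RT, r, std_rt - RT) for the best-correlated peak, or None.

    peaks is a list of (RT, spectrum) pairs; min_shared is a hypothetical
    lower bound on the number of ions shared with the standard.
    """
    best = None
    for rt, spectrum in peaks:
        # Keep only ions present in both the sample peak and the standard.
        pairs = [(q, s) for q, s in zip(spectrum, std_spectrum) if q > 0 and s > 0]
        if len(pairs) < min_shared:
            continue
        xs, ys = zip(*pairs)
        r = pearson_r(xs, ys)
        if r is not None and (best is None or r > best[1]):
            best = (rt, r)
    if best is None:
        return None
    return best[0], best[1], std_rt - best[0]


def mean_offset(differences):
    """Average the RT differences of the standards that were used (None = skipped)."""
    used = [d for d in differences if d is not None]
    return sum(used) / len(used) if used else 0.0


# Hypothetical five-channel spectra; the real script compares 601 ion channels.
standard = [10, 20, 30, 0, 40]
candidates = [(7.70, [1, 2, 3, 5, 4]), (7.74, [5, 1, 9, 0, 2])]
print(match_standard(candidates, standard, 7.7406))
print(mean_offset([0.004, None, 0.002]))
```

The `mean_offset` helper plays the role of the if/elsif cascade in the listing: skipped standards are left out of the average, and the offset falls back to 0 when no standard is used.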

sample                    RT       RT_adj   Area       ion_num  50      51       52
F-L-7M-v435_F275-F272L7M  5.0897   5.0935   4316447    20       0       5354     0
F-L-7M-v435_F275-F272L7M  5.6779   5.6817   111385     6        0       0        0
F-L-7M-v435_F275-F272L7M  5.7893   5.7931   2271796    23       0       20754    8598
F-L-7M-v435_F275-F272L7M  6.6643   6.6681   204875     9        0       0        0
F-L-7M-v435_F275-F272L7M  6.9659   6.9697   138632     9        0       0        0
F-L-7M-v435_F275-F272L7M  7.1289   7.1327   803258     18       0       9822     0
F-L-7M-v435_F275-F272L7M  7.2656   7.2694   135138827  74       119708  1564040  733788
F-L-7M-v435_F275-F272L7M  7.5509   7.5547   2165246    36       0       17704    7505
F-L-7M-v435_F275-F272L7M  7.6375   7.6413   406482     13       0       5624     0
F-L-7M-v435_F275-F272L7M  7.7406   7.7444   31945021   45       144358  424256   69430
F-L-7M-v435_F275-F272L7M  8.1621   8.1659   770558623  81       684390  8370264  4763806
F-L-7M-v435_F275-F272L7M  8.3802   8.384    65778603   66       106515  856310   521289
F-L-7M-v435_F275-F272L7M  8.4497   8.4535   31320775   54       62358   567606   181597
F-L-7M-v435_F275-F272L7M  8.6441   8.6479   19171365   60       15113   244559   89701
F-L-7M-v435_F275-F272L7M  8.7617   8.7655   971560     22       0       7671     0
F-L-7M-v435_F275-F272L7M  8.8815   8.8853   1150178    34       0       11345    6903
F-L-7M-v435_F275-F272L7M  9.0439   9.0477   684497     17       0       12059    0
F-L-7M-v435_F275-F272L7M  9.1197   9.1235   13315884   54       15976   127235   83719
F-L-7M-v435_F275-F272L7M  9.1783   9.1821   1564063    28       0       0        0
F-L-7M-v435_F275-F272L7M  9.2874   9.2912   877336     24       0       11599    9174
F-L-7M-v435_F275-F272L7M  9.4907   9.4945   5987717    42       6565    75118    40854
F-L-7M-v435_F275-F272L7M  9.5711   9.5749   1024137    17       0       0        0
F-L-7M-v435_F275-F272L7M  9.7051   9.7089   116084     8        0       19552    0
F-L-7M-v435_F275-F272L7M  9.716    9.7198   1034838    16       0       0        0
F-L-7M-v435_F275-F272L7M  10.3347  10.3385  12385901   55       16186   140371   52147
F-L-7M-v435_F275-F272L7M  10.5047  10.5085  359672     9        0       17503    0

Supplementary Table 3. Output format of mass3_RT_adj_v2.pl, which is also the input format of mass4_match_peaks_v2.pl. In this example, the full table has 516 rows, one for each peak in the F-L-7M-v435_F275-F272L7M sample. The RT column is avgRT from the previous file (Supplementary Table 2), the RT_adj column is the RT adjusted by the standards, the Area column is the sum of the ion areas, and the ion_num column is the number of ions constituting the peak. These are followed by 601 columns, one for each MW from 50 to 650; each value is the area of that ion from the SIC.
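The row layout described in this caption can be read with a few lines of code. The following is a minimal Python sketch under stated assumptions, not part of the original scripts: `parse_peak_row` and `MZ_VALUES` are hypothetical names, and because the table shows only ion columns 50 to 52, the remaining ion columns of the example row are zero-filled as a placeholder (so the parsed ions will not add up to the reported ion_num).

```python
import csv
import io

MZ_VALUES = list(range(50, 651))  # 601 ion columns, MW 50 to 650


def parse_peak_row(row):
    """Split one CSV row into its five leading fields and a sparse MW -> area map."""
    sample, rt, rt_adj, area, ion_num = row[:5]
    ion_areas = {mz: int(v) for mz, v in zip(MZ_VALUES, row[5:]) if int(v) > 0}
    return {
        "sample": sample,
        "RT": float(rt),
        "RT_adj": float(rt_adj),
        "Area": int(area),
        "ion_num": int(ion_num),
        "ions": ion_areas,
    }


# First row of Supplementary Table 3. Columns 53 to 650 are not shown in the
# table, so they are zero-filled here purely as a placeholder.
header = ["sample", "RT", "RT_adj", "Area", "ion_num"] + [str(mz) for mz in MZ_VALUES]
values = ["F-L-7M-v435_F275-F272L7M", "5.0897", "5.0935", "4316447", "20", "0", "5354", "0"]
values += ["0"] * (len(header) - len(values))
text = ",".join(header) + "\n" + ",".join(values) + "\n"

reader = csv.reader(io.StringIO(text))
next(reader)  # skip the header line
peak = parse_peak_row(next(reader))
print(peak["sample"], round(peak["RT_adj"] - peak["RT"], 4), peak["ions"])
```

Storing only the non-zero ion areas keeps each peak small, since most of the 601 ion columns are zero for any given peak.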

mass4_match_peaks_v2.pl
This script uses the *peak_ready_RT_adj.csv files generated by mass3_RT_adj_v2.pl (e.g. F-L-7M-v435_F275-F272L7M_report_peak_ready_RT_adj.csv). The format of these files is shown in Supplementary Table 3. The mass4_match_peaks_v2_no_log.pl script is very similar to mass4_match_peaks_v2.pl and is not shown.

#!/usr/local/bin/perl -w
#This program is for comparing peaks across different files.
#MySQL table used: HJK_temp4, HJK_temp5, pearson2
#
#reduced size: removed or altered to reduce log file size
#Author : Hyun Jo Koo
#Date : 2009-06-09
#Version :2.0, reciprocal check added

use DBI;
use Getopt::Long;

GetOptions ("path=s"=>\$path);
if (!defined $path){
    print "Usage : $0 -path FOLDER_NAME\n \n For example: $0 -p project/ginger\n For path option, if you have files in ~/project/files, just input as project/files\n All input files should be *peak_ready_RT_adj.csv files.\n\n";
    exit(1);
}

#Ready to read the file contents.
@file_names = glob ("~/$path/*_peak_ready_RT_adj.csv");
$num_of_files=@file_names;
if ($num_of_files == 0){
    print " No file to process in the folder ~/$path\n\n";
    die;
}

print "Total file number: $num_of_files\n";
$dbh = DBI->connect("DBI:mysql:your_ID","your_ID","password",{RaiseError=>1}) or die("Connect error: $DBI::errstr");

#Result and log files are saved in the folder that holds the input files.
$file_names[0] =~ /(.+)\/.+/;
$save_file2 = $1;
$0 =~ /(.+)\.pl/;
$save_file = $save_file2."/".$1."_result_table.csv";
$log_file = $save_file2."/".$1."_result_table.log";
if (-e $log_file){
    unlink($log_file);
}
if (-e $save_file){
    unlink($save_file);
}
open (OUTFILE_LOG, ">$log_file") or die "Error-cannot open log file";
open (OUTFILE_SAVE, ">$save_file") or die "Error-cannot open save file";

print "Now loading file data to MySQL table...\n";
#area50 .. area650: one INT column per ion (MW 50 to 650)
$area_defs = join (",", map {"area$_ INT"} 50..650);
#RT_adj: MySQL requires precision >= scale, so decimal (2,4) is rejected by current versions; the width (7,4) is an assumption
$sql2 = "CREATE TABLE HJK_temp4 (Number INT NOT NULL auto_increment, Sample char (30) NOT NULL, RT_adj decimal (7,4) NOT NULL, Area decimal (10,0), Ion_num INT, Checking INT, $area_defs, primary key(Number))";
$sth2 = $dbh ->prepare ($sql2);
$sth2 ->execute;
$sth2 ->finish;

$sql17 ="CREATE TABLE pearson2 (x float NOT NULL, y float NOT NULL)";
$sth17 = $dbh ->prepare ($sql17);
$sth17 ->execute;
$sth17 ->finish;

$sql18 ="INSERT INTO pearson2 (x, y) VALUES (?,?)";
$sth18 = $dbh ->prepare ($sql18);
#Pearson correlation coefficient of the (x, y) pairs currently stored in pearson2
$sql7 ="SELECT (sum(x*y)-sum(x)*sum(y)/count(x))/SQRT((sum(x*x)-sum(x)*sum(x)/count(x))*(sum(y*y)-sum(y)*sum(y)/count(y))) from pearson2";
$sth7 = $dbh ->prepare ($sql7);

@first_row = ();
@second_row = ();
#Column names area50 .. area650 (MW 50 to 650), used to build the INSERT statement
@area_cols = map {"area$_"} 50..650;
foreach $file_name (@file_names){
    if ($file_name =~ /[^\s]+$path\/([^\_]+)_([^\_]+)_.+/){
        print "Reading... $1 $2\n";
        print OUTFILE_LOG "Reading... $1 $2\n";
        $first = $1;
        $second= $2;
        $first =~ s/\-//g;
        $second =~ s/\-//g;
        push (@first_row, $first);
        push (@second_row, $second);
    }else{
        $file_name =~ /[^\s]+$path\/(.+)/;
        print "Reading... $1\n";
        print OUTFILE_LOG "Reading... $1\n";
        $first = $1;
        $first =~ s/\-//g;
        push (@first_row, $first);
        push (@second_row, '');
    }
    #Insert file contents to the table.
    #Five fixed columns plus 601 ion area columns, with one placeholder each
    $sql4 = "INSERT INTO HJK_temp4 (Sample, RT_adj, Area, Ion_num, Checking, ".join (",", @area_cols).") VALUES (".join (",", ("?") x (5 + @area_cols)).")";
    $sth4 = $dbh ->prepare ($sql4);
    open (INFILE, $file_name);
    $m = 0;
    while ($line=<INFILE>){
        $m++;
        if ($m == 1){
            #The first line has column info, so skip it when importing to MySQL
            next;
        }else{
            chomp ($line);
            #@areas receives the 601 ion areas (MW 50 to 650); missing fields stay undefined
            ($sample, $rt, $rt_adj, $area, $ion_num, @areas) = (split (/,/, $line))[0..605];
            $sample =~ s/\-//g;
            if ($sample =~ /([^\_]+)\_[^\_]+/){
                $sample = $1;
            }
            $Checking = 0;
            $sth4 ->execute ($sample, $rt_adj, $area, $ion_num, $Checking, @areas);
            $rt = 0; # finish;
#Creating result table
@first_row2 = @first_row;
@first_row3 = @first_row; # execute;
$sth9 ->finish;
unshift (@second_row, "sub_title");
$sql10 ="INSERT INTO HJK_temp5 ($title2) VALUES ($values)";
$sth10 = $dbh ->prepare ($sql10);
$sth10 ->execute (@second_row);
$sth10 ->finish;

print " ... done, now data handling\n";
print OUTFILE_LOG "\nData handling start pearson R (matched ion number, matched ion ratio based on smaller number) less than 5 [matching ion ratio] less than 0.6 [pearson coefficient] less than 0.7 *: it is not best match\n\n";

# Data handling.
$sql25 = "SELECT MAX(Number) from HJK_temp4";
$sth25 = $dbh ->prepare ($sql25);
$sth25 ->execute;
@row25 = $sth25->fetchrow_array();
$total_number = $row25[0];
$sth25 ->finish;
$progress = 1;
$black_list = " ";
#Select every peak with its 601 ion areas (area50 .. area650), largest peak first
$sql5 = "SELECT Number, Sample, RT_adj, Area, Ion_num, ".join (",", map {"area$_"} 50..650)." from HJK_temp4 ORDER BY Area DESC";
$sth5 = $dbh ->prepare ($sql5);
$sth5 ->execute;
while (@row5 = $sth5->fetchrow_array) {

print "$progress/$total_number\n"; $progress++; $number5 = shift (@row5); $sample5 = shift (@row5); $rt_adj5 = shift (@row5); $area5 = shift (@row5); $ion_num5 = shift (@row5); print OUTFILE_LOG "$sample5\t$rt_adj5\t$area5\t$ion_num5\t#$number5"; if ($black_list =~ / $number5 /){ print OUTFILE_LOG "\talready picked1\n"; next; # prepare($sql12); ->execute; ->finish;

print OUTFILE_LOG "\n"; if ($area5 > 5E8){ $fr = 0.65; $rr = 0.1; }elsif ($area5 > 1E8){ $fr = 0.3; $rr = 0.1; }elsif ($area5 > 5E5){ $fr = 0.2; $rr = 0.1; }else{ $fr = 0.15; $rr = 0.1; } $catched = 0; @update4_sample =(); @update4_r = (); @update4_area = (); @update4_number = (); @update4_ion_ratio = (); @matched_area_accumulated = (); $forward_range = $rt_adj5 - $fr; $reverse_range = $rt_adj5 + $rr; $sql6 = "SELECT Number, Sample, RT_adj, Area, Ion_num, area50,area51,area52,area53,area54,area55,area56,area57,area58,area59,area60,area61,area62,area63,ar ea64,area65,area66,area67,area68,area69,area70,area71,area72,area73,area74,area75,area76,area77,area 78,area79,area80,area81,area82,area83,area84,area85,area86,area87,area88,area89,area90,area91,area92 ,area93,area94,area95,area96,area97,area98,area99,area100,area101,area102,area103,area104,area105,ar ea106,area107,area108,area109,area110,area111,area112,area113,area114,area115,area116,area117,area11 8,area119,area120,area121,area122,area123,area124,area125,area126,area127,area128,area129,area130,ar ea131,area132,area133,area134,area135,area136,area137,area138,area139,area140,area141,area142,area14 3,area144,area145,area146,area147,area148,area149,area150,area151,area152,area153,area154,area155,ar ea156,area157,area158,area159,area160,area161,area162,area163,area164,area165,area166,area167,area16 8,area169,area170,area171,area172,area173,area174,area175,area176,area177,area178,area179,area180,ar ea181,area182,area183,area184,area185,area186,area187,area188,area189,area190,area191,area192,area19 3,area194,area195,area196,area197,area198,area199,area200,area201,area202,area203,area204,area205,ar ea206,area207,area208,area209,area210,area211,area212,area213,area214,area215,area216,area217,area21 8,area219,area220,area221,area222,area223,area224,area225,area226,area227,area228,area229,area230,ar ea231,area232,area233,area234,area235,area236,area237,area238,area239,area240,area241,area242,area24 
3,area244,area245,area246,area247,area248,area249,area250,area251,area252,area253,area254,area255,ar ea256,area257,area258,area259,area260,area261,area262,area263,area264,area265,area266,area267,area26 8,area269,area270,area271,area272,area273,area274,area275,area276,area277,area278,area279,area280,ar ea281,area282,area283,area284,area285,area286,area287,area288,area289,area290,area291,area292,area29

261 3,area294,area295,area296,area297,area298,area299,area300,area301,area302,area303,area304,area305,ar ea306,area307,area308,area309,area310,area311,area312,area313,area314,area315,area316,area317,area31 8,area319,area320,area321,area322,area323,area324,area325,area326,area327,area328,area329,area330,ar ea331,area332,area333,area334,area335,area336,area337,area338,area339,area340,area341,area342,area34 3,area344,area345,area346,area347,area348,area349,area350,area351,area352,area353,area354,area355,ar ea356,area357,area358,area359,area360,area361,area362,area363,area364,area365,area366,area367,area36 8,area369,area370,area371,area372,area373,area374,area375,area376,area377,area378,area379,area380,ar ea381,area382,area383,area384,area385,area386,area387,area388,area389,area390,area391,area392,area39 3,area394,area395,area396,area397,area398,area399,area400,area401,area402,area403,area404,area405,ar ea406,area407,area408,area409,area410,area411,area412,area413,area414,area415,area416,area417,area41 8,area419,area420,area421,area422,area423,area424,area425,area426,area427,area428,area429,area430,ar ea431,area432,area433,area434,area435,area436,area437,area438,area439,area440,area441,area442,area44 3,area444,area445,area446,area447,area448,area449,area450,area451,area452,area453,area454,area455,ar ea456,area457,area458,area459,area460,area461,area462,area463,area464,area465,area466,area467,area46 8,area469,area470,area471,area472,area473,area474,area475,area476,area477,area478,area479,area480,ar ea481,area482,area483,area484,area485,area486,area487,area488,area489,area490,area491,area492,area49 3,area494,area495,area496,area497,area498,area499,area500,area501,area502,area503,area504,area505,ar ea506,area507,area508,area509,area510,area511,area512,area513,area514,area515,area516,area517,area51 8,area519,area520,area521,area522,area523,area524,area525,area526,area527,area528,area529,area530,ar 
ea531,area532,area533,area534,area535,area536,area537,area538,area539,area540,area541,area542,area54 3,area544,area545,area546,area547,area548,area549,area550,area551,area552,area553,area554,area555,ar ea556,area557,area558,area559,area560,area561,area562,area563,area564,area565,area566,area567,area56 8,area569,area570,area571,area572,area573,area574,area575,area576,area577,area578,area579,area580,ar ea581,area582,area583,area584,area585,area586,area587,area588,area589,area590,area591,area592,area59 3,area594,area595,area596,area597,area598,area599,area600,area601,area602,area603,area604,area605,ar ea606,area607,area608,area609,area610,area611,area612,area613,area614,area615,area616,area617,area61 8,area619,area620,area621,area622,area623,area624,area625,area626,area627,area628,area629,area630,ar ea631,area632,area633,area634,area635,area636,area637,area638,area639,area640,area641,area642,area64 3,area644,area645,area646,area647,area648,area649,area650 from HJK_temp4 WHERE RT_adj BETWEEN $forward_range AND $reverse_range and Sample != '$sample5' and Checking = 0"; $sth6 = $dbh ->prepare ($sql6); $sth6 ->execute; while (@row6 = $sth6->fetchrow_array) { $number6 = shift (@row6); $sample6 = shift (@row6); $rt_adj6 = shift (@row6); $area6 = shift (@row6); $ion_num6 = shift (@row6); @row51 = @row5; @row61 = @row6; ## reduce size print OUTFILE_LOG "\t$sample6\t$rt_adj6\t$area6\t$ion_num6 \t#$number6"; $rt_adj_reduced = substr $rt_adj6, 0, -2; print OUTFILE_LOG "\t$sample6\t$rt_adj_reduced\t$area6\t$ion_num6 \t#$number6 "; if ($black_list =~ / $number6 /){ print OUTFILE_LOG "\talready picked2\n"; next; # 0){ $n++; $sth18 ->execute ($ion5, $ion6); } } if ($n prepare($sql8); $sth8 ->execute; next; } if ($ion_num5 >= $ion_num6){ $ion_ratio = $n/$ion_num6;

}else{
    $ion_ratio = $n/$ion_num5;
}
$ion_ratio_reduced = substr $ion_ratio, 0, 4;
if ($ion_ratio < 0.6){
    print OUTFILE_LOG "\t$n, [$ion_ratio_reduced]\n";
    $sql8 = "TRUNCATE table pearson2";
    $sth8 = $dbh ->prepare($sql8);
    $sth8 ->execute;
    next;
}
$sth7 ->execute;
@row7 = $sth7 ->fetchrow_array;
$pearson_r = $row7[0];
$pearson_r_reduced = substr $pearson_r, 0, 5;
if ($pearson_r < 0.7){
    print OUTFILE_LOG "\t{$pearson_r_reduced} ($n, $ion_ratio_reduced)\n";
    $sql8 = "TRUNCATE table pearson2";
    $sth8 = $dbh ->prepare($sql8);
    $sth8 ->execute;
    next;
}
$catched = 1;
print OUTFILE_LOG "\t$pearson_r_reduced ($n, $ion_ratio_reduced)\n";
$sql8 = "TRUNCATE table pearson2";
$sth8 = $dbh ->prepare($sql8);
$sth8 ->execute;

push (@update4_sample, $sample6);
push (@update4_rt, $rt_adj6);
push (@update4_r, $pearson_r);
push (@update4_area, $area6);
push (@update4_number, $number6);
push (@update4_ion_number, $ion_num6);
push (@update4_ion_ratio, $ion_ratio);
push (@matched_area_accumulated, @row6);

} # Now let's find best matching peak in each sample if ($catched == 1){ @update7_sample = (); @update7_number = (); @update7_area = (); push (@update4_sample, "it's end"); $up4s1 = shift (@update4_sample); $up4rt1 = shift (@update4_rt); $up4r1 = shift (@update4_r); $up4a1 = shift (@update4_area); $up4n1 = shift (@update4_number); $up4in1 = shift (@update4_ion_number); $up4ir1 = shift (@update4_ion_ratio); @up4aa1 = (); for ($i =50; $i = 0.8 and $up4ir1 >= 0.7 or $up4r1 >= 0.7 and $up4ir1 >= 0.8){ if ($up4a1 > 5E8){ $frr = 0.65; $rrr = 0.1; }elsif ($up4a1 > 1E8){ $frr = 0.3; $rrr = 0.1; }elsif ($up4a1 > 5E5){ $frr = 0.2; $rrr = 0.1; }else{ $frr = 0.15; $rrr = 0.1; } $k = 0; @up4aa1_1 = @up4aa1; $forward_range_2 = $up4rt1 - $frr; $reverse_range_2 = $up4rt1 + $rrr; $sql20 = "SELECT Number, Sample, RT_adj, Area, Ion_num, area50,area51,area52,area53,area54,area55,area56,area57,area58,area59,area60,area61,area62,area63,ar ea64,area65,area66,area67,area68,area69,area70,area71,area72,area73,area74,area75,area76,area77,area 78,area79,area80,area81,area82,area83,area84,area85,area86,area87,area88,area89,area90,area91,area92 ,area93,area94,area95,area96,area97,area98,area99,area100,area101,area102,area103,area104,area105,ar ea106,area107,area108,area109,area110,area111,area112,area113,area114,area115,area116,area117,area11 8,area119,area120,area121,area122,area123,area124,area125,area126,area127,area128,area129,area130,ar ea131,area132,area133,area134,area135,area136,area137,area138,area139,area140,area141,area142,area14 3,area144,area145,area146,area147,area148,area149,area150,area151,area152,area153,area154,area155,ar ea156,area157,area158,area159,area160,area161,area162,area163,area164,area165,area166,area167,area16 8,area169,area170,area171,area172,area173,area174,area175,area176,area177,area178,area179,area180,ar ea181,area182,area183,area184,area185,area186,area187,area188,area189,area190,area191,area192,area19 
3,area194,area195,area196,area197,area198,area199,area200,area201,area202,area203,area204,area205,ar ea206,area207,area208,area209,area210,area211,area212,area213,area214,area215,area216,area217,area21 8,area219,area220,area221,area222,area223,area224,area225,area226,area227,area228,area229,area230,ar ea231,area232,area233,area234,area235,area236,area237,area238,area239,area240,area241,area242,area24 3,area244,area245,area246,area247,area248,area249,area250,area251,area252,area253,area254,area255,ar ea256,area257,area258,area259,area260,area261,area262,area263,area264,area265,area266,area267,area26 8,area269,area270,area271,area272,area273,area274,area275,area276,area277,area278,area279,area280,ar ea281,area282,area283,area284,area285,area286,area287,area288,area289,area290,area291,area292,area29 3,area294,area295,area296,area297,area298,area299,area300,area301,area302,area303,area304,area305,ar ea306,area307,area308,area309,area310,area311,area312,area313,area314,area315,area316,area317,area31 8,area319,area320,area321,area322,area323,area324,area325,area326,area327,area328,area329,area330,ar ea331,area332,area333,area334,area335,area336,area337,area338,area339,area340,area341,area342,area34 3,area344,area345,area346,area347,area348,area349,area350,area351,area352,area353,area354,area355,ar ea356,area357,area358,area359,area360,area361,area362,area363,area364,area365,area366,area367,area36 8,area369,area370,area371,area372,area373,area374,area375,area376,area377,area378,area379,area380,ar ea381,area382,area383,area384,area385,area386,area387,area388,area389,area390,area391,area392,area39

264 3,area394,area395,area396,area397,area398,area399,area400,area401,area402,area403,area404,area405,ar ea406,area407,area408,area409,area410,area411,area412,area413,area414,area415,area416,area417,area41 8,area419,area420,area421,area422,area423,area424,area425,area426,area427,area428,area429,area430,ar ea431,area432,area433,area434,area435,area436,area437,area438,area439,area440,area441,area442,area44 3,area444,area445,area446,area447,area448,area449,area450,area451,area452,area453,area454,area455,ar ea456,area457,area458,area459,area460,area461,area462,area463,area464,area465,area466,area467,area46 8,area469,area470,area471,area472,area473,area474,area475,area476,area477,area478,area479,area480,ar ea481,area482,area483,area484,area485,area486,area487,area488,area489,area490,area491,area492,area49 3,area494,area495,area496,area497,area498,area499,area500,area501,area502,area503,area504,area505,ar ea506,area507,area508,area509,area510,area511,area512,area513,area514,area515,area516,area517,area51 8,area519,area520,area521,area522,area523,area524,area525,area526,area527,area528,area529,area530,ar ea531,area532,area533,area534,area535,area536,area537,area538,area539,area540,area541,area542,area54 3,area544,area545,area546,area547,area548,area549,area550,area551,area552,area553,area554,area555,ar ea556,area557,area558,area559,area560,area561,area562,area563,area564,area565,area566,area567,area56 8,area569,area570,area571,area572,area573,area574,area575,area576,area577,area578,area579,area580,ar ea581,area582,area583,area584,area585,area586,area587,area588,area589,area590,area591,area592,area59 3,area594,area595,area596,area597,area598,area599,area600,area601,area602,area603,area604,area605,ar ea606,area607,area608,area609,area610,area611,area612,area613,area614,area615,area616,area617,area61 8,area619,area620,area621,area622,area623,area624,area625,area626,area627,area628,area629,area630,ar 
ea631,area632,area633,area634,area635,area636,area637,area638,area639,area640,area641,area642,area64 3,area644,area645,area646,area647,area648,area649,area650 from HJK_temp4 WHERE RT_adj BETWEEN $forward_range_2 AND $reverse_range_2 and Sample = '$sample5' and Checking = 0 and Number != $number5"; $sth20 = $dbh ->prepare ($sql20); $sth20 ->execute; while (@row20 = $sth20 ->fetchrow_array) { $number20 = shift (@row20); $sample20 = shift (@row20); $rt_adj20 = shift (@row20); $area20 = shift (@row20); $ion_num20 = shift (@row20); $rt_adj20_reduced = substr $rt_adj20, 0, -2; print OUTFILE_LOG "\t\t\t$sample20\t$rt_adj20_reduced\t$area20\t$ion_num20 \t#$number20 "; $n1 = 0; for ($i = 50; $i 0 and $ion6_1 > 0){ $n1++; $sth18 ->execute ($ion20, $ion6_1); } } if ($ion_num20 >= $up4in1){ $ion_ratio2 = $n1/$up4in1; }else{ $ion_ratio2 = $n1/$ion_num20; } } $ion_ratio2_reduced = substr $ion_ratio2, 0, 4; if ($n1 prepare($sql8); $sth8 ->execute; next; } if ($ion_ratio2 < 0.6){ print OUTFILE_LOG "\t$n1, [$ion_ratio2_reduced]\n"; $sql8 = "TRUNCATE table pearson2"; $sth8 = $dbh ->prepare($sql8); $sth8 ->execute;

next; } $sth7 ->execute; @row7 = $sth7 ->fetchrow_array; $pearson_r2 = $row7[0]; $pearson_r2_reduced = substr $pearson_r2, 0, 5; print OUTFILE_LOG "\t$pearson_r2_reduced ($n1, $ion_ratio2_reduced)\n"; if ($pearson_r2 >= $up4r1){ $k = 1; # prepare($sql8); $sth8 ->execute; } # execute; $sth23 ->finish; print OUTFILE_LOG "\t\t$column\t$row\n"; $black_list = $black_list."$number_for_black_list "; } } # prepare ($sql15); $sth15 ->execute; while (@row15 = $sth15->fetchrow_array) {

$i = 0;
foreach $each (@row15){
    $i++;
    chomp ($each);
    if ($i == 1){
        $each--;
        print OUTFILE_SAVE "$each";
    }else{
        print OUTFILE_SAVE ",$each";
    }
}
print OUTFILE_SAVE "\n";
}

$sth5 ->finish;
$sth6 ->finish;
$sth6 ->finish;
$sth7 ->finish;
$sth8 ->finish;
#$sth13 ->finish;
$sth15 ->finish;
$sth18 ->finish;
$sth20 ->finish;

#$sql9 = "DROP table HJK_temp4";
#$sth9 = $dbh ->prepare($sql9);
#$sth9 ->execute;
#$sth9 ->finish;

$sql10 = "DROP table pearson2";
$sth10 = $dbh ->prepare($sql10);
$sth10 ->execute;
$sth10 ->finish;

$sql16 = "DROP table HJK_temp5";
$sth16 = $dbh ->prepare($sql16);
$sth16 ->execute;
$sth16 ->finish;

$dbh -> disconnect; close (OUTFILE_LOG); close (OUTFILE_SAVE); print "All process finished\n\n";
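The peak-matching criteria in the listing above can be restated compactly. The following Python sketch is not part of the original Perl/SQL workflow; it only re-expresses the thresholds that are visible in the listing (the area-dependent retention-time window, the shared-ion ratio cutoff of 0.6, and the Pearson r cutoff of 0.7). The function names, the representation of a spectrum as a list of per-m/z areas, and the assumption that Ion_num equals the number of ions with a non-zero area are illustrative assumptions.

```python
from math import sqrt


def rt_window(rt_adj, area):
    """Return (lower, upper) retention-time bounds for candidate peaks.

    The forward tolerance widens with peak area; the reverse tolerance
    is 0.1 in every branch of the listing.
    """
    if area > 5e8:
        forward = 0.65
    elif area > 1e8:
        forward = 0.3
    elif area > 5e5:
        forward = 0.2
    else:
        forward = 0.15
    reverse = 0.1
    return rt_adj - forward, rt_adj + reverse


def shared_ions(spec_a, spec_b):
    """Pairs of areas for ions that are non-zero in both spectra."""
    return [(a, b) for a, b in zip(spec_a, spec_b) if a > 0 and b > 0]


def ion_ratio(spec_a, spec_b):
    """Shared ions divided by the ion count of the poorer spectrum.

    Assumption: the ion count is the number of non-zero areas.
    """
    n_a = sum(1 for a in spec_a if a > 0)
    n_b = sum(1 for b in spec_b if b > 0)
    smaller = min(n_a, n_b)
    return len(shared_ions(spec_a, spec_b)) / smaller if smaller else 0.0


def pearson_r(pairs):
    """Pearson correlation of paired areas; 0.0 when undefined."""
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def is_match(spec_a, spec_b, min_ratio=0.6, min_r=0.7):
    """Apply the two cutoffs in the same order as the listing."""
    if ion_ratio(spec_a, spec_b) < min_ratio:
        return False
    return pearson_r(shared_ions(spec_a, spec_b)) >= min_r
```

In this sketch the ion-ratio test runs first, so a spectrum pair with too few shared ions is rejected before any correlation is computed, which mirrors the early `next` statements in the script.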

Number

ORIGIN

RT_adj

FL7Mv435

FL7Mv740

FL7Mv743

FL7Mv747

FR7Mv444

0

sub_title

0

F275F272L7M

F276L7M

F274L7M

F271L7M

F275R7M

1

TRh5Mv498

8.6938

65778603

39329039

45641104

49271483

13568668

2

TRh5Mv870

8.6837

770558623

3810631

502763448

3

TRh5Mv872

8.6611

19171365

520862608

120824824

4

GRh7Mv425

22.9202

13032944

7464269

14812323

8920805

166875839

5

FRh5Mv857

22.9686

82437491

42827427

55910107

39184038

2122999

6

FRh7Mv767

23.1531

1664331

7

FRh5Mv867

22.9592

464311

8

FRh3Mv551

23.0229

9

GRh3Mv595

22.9515

10

FRh5Mv863

23.0006

11

FRh7Mv767

24.1001

117886

269924

3361266

12

FRh3Mv551

8.7258

971560

150672909

3331036

268336832

13

GRh6Mv796

22.9461

14

FRh3Mv579

23.0235

15

FRh3Mv551

23.9851

419197

401879

200074

1630706

16

FRh5Mv867

29.1459

768597

12792421

17

FRh5Mv857

29.1279

139748

130409

18

FRh5Mv863

29.1769

179825

19

FRh3Mv579

8.7111

20

GRh7Mv425

23.2796

16726637

21

FRh7Mv443

30.4399

717043220

22

FRh7Mv764

29.2262

61346087

23

FRh3Mv579

29.0727

486655904

24

FRh7Mv764

29.4699

7564525

25

FRh5Mv865

29.3203

770885672

298179

8387739

127010 109168176 479586608

160534

67088264

13601655

436144651 42622006

224728

287632

8988357

Supplementary Table 4. Result file of mass4_match_peaks_v2.pl. There are two header rows. If the input file has “_” in its name (e.g. G-Rh-7M-v425_G252Rh7M_report.txt), the two parts are placed in the two header rows after removing “-” (e.g. GRh7Mv425 in the first row, G252Rh7M in the second row). The Number column is a unique peak identifier (this file has numbers up to 23349). The ORIGIN column is the sample that has the largest peak area for this peak. The RT_adj column is the RT_adj (Supplementary Table 3) of the peak from the sample shown in the ORIGIN column. The remaining columns are the samples; in this study there are 66 sample columns. For example, the Number 1 peak is shared by the 5 samples shown.
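The header convention described in this caption can be illustrated with a short Python sketch. It is not taken from mass4_match_peaks_v2.pl; the function name `header_rows` and the handling of file names without “_” (an empty second row) are assumptions made only for illustration.

```python
def header_rows(file_name):
    """Split a report file name into the two header-row labels.

    The name is split on "_"; the first two parts become the first and
    second header rows after every "-" is removed. A name without "_"
    is assumed (not confirmed by the source) to give an empty second row.
    """
    parts = file_name.split("_")
    first = parts[0].replace("-", "")
    second = parts[1].replace("-", "") if len(parts) > 1 else ""
    return first, second


# Example from the caption
print(header_rows("G-Rh-7M-v425_G252Rh7M_report.txt"))
```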


APPENDIX D - METABOLIC PROFILING AND PHYLOGENETIC ANALYSIS OF MEDICINAL ZINGIBER SPECIES: TOOLS FOR AUTHENTICATION OF GINGER (ZINGIBER OFFICINALE ROSC.)

The manuscript “Metabolic profiling and phylogenetic analysis of medicinal Zingiber species: tools for authentication of ginger (Zingiber officinale Rosc.)” was published in Phytochemistry.


Metabolic profiling and phylogenetic analysis of medicinal Zingiber species: tools for authentication of ginger (Zingiber officinale Rosc.)

Hongliang Jiang (a,b,c,d), Zhengzhi Xie (a,b,c,d), Hyunjo Koo (a,b,c), Steven P. McLaughlin (a,b), Barbara N. Timmermann (a,e), and David R. Gang (a,b,c,*)

(a) Arizona Center for Phytomedicine Research, College of Pharmacy, University of Arizona, Tucson, AZ 85721, USA
(b) Department of Plant Sciences, College of Agriculture and Life Sciences, University of Arizona, Tucson, AZ 85721, USA
(c) Bio5 Institute, University of Arizona, Tucson, AZ 85721, USA
(d) Department of Pharmaceutical Sciences, College of Pharmacy, University of Arizona, Tucson, AZ 85721, USA
(e) Department of Pharmacology and Toxicology, College of Pharmacy, University of Arizona, Tucson, AZ 85721, USA

Molecular and chemical data can be used to distinguish ginger from other medicinal plants in Zingiber. In addition, phylogeny based on the DNA sequences (A) matched closely the phylogeny based on chemical profiles (B) for these Zingiber species.

[Figure: phylogenetic trees of Z. officinale, Z. zerumbet, Z. montanum, Z. mioga, and Z. spectabile based on (A) DNA sequences and (B) chemical profiles.]


Title: Metabolic profiling and phylogenetic analysis of medicinal Zingiber species: tools for authentication of ginger (Zingiber officinale Rosc.)

Hongliang Jiang (a,b,c,d), Zhengzhi Xie (a,b,c,d), Hyun Jo Koo (a,b,c), Steven P. McLaughlin (a,b), Barbara N. Timmermann (a,e), and David R. Gang (a,b,c,*)

(a) Arizona Center for Phytomedicine Research, College of Pharmacy, University of Arizona

(b) Department of Plant Sciences, College of Agriculture and Life Sciences, University of Arizona

(c) BIO5 Institute, University of Arizona

(d) Department of Pharmaceutical Sciences, College of Pharmacy, University of Arizona

(e) Department of Pharmacology and Toxicology, College of Pharmacy, University of Arizona

*Corresponding author: David R. Gang Department of Plant Sciences and BIO5 Institute, University of Arizona, Tucson, AZ 85721-0036, USA


Tel: 520-621-7154 Fax: 520-621-7186 email: [email protected]


Abstract

Phylogenetic analysis and metabolic profiling were used to investigate the diversity of plant material within the ginger species and between ginger and closely related species in the genus Zingiber (Zingiberaceae). In addition, anti-inflammatory data were obtained for the investigated species. Phylogenetic analysis demonstrated that all Z. officinale samples from different geographical origins were genetically indistinguishable. In contrast, other Zingiber species were significantly divergent, allowing all species to be clearly distinguished using this analysis. In the metabolic profiling analysis, the Z. officinale samples derived from different origins showed no qualitative differences in major volatile compounds, although they did show some significant quantitative differences in non-volatile composition, particularly regarding the content of [6]-, [8]-, and [10]-gingerols, the most active anti-inflammatory components in this species. The differences in gingerol content were verified by HPLC. The metabolic profiles of other Zingiber species were very different, both qualitatively and quantitatively, when compared to Z. officinale and to each other. Comparative DNA sequence/chemotaxonomic phylogenetic trees showed that the chemical characters of the investigated species were able to generate essentially the same phylogenetic relationships as the DNA sequences. This supports the contention that chemical characters can be used effectively to identify relationships between plant species. Anti-inflammatory in vitro assays to evaluate the ability of all extracts from the Zingiber species examined to inhibit LPS-induced PGE2 and TNF-α production suggested that bioactivity may not be easily predicted by either phylogenetic analysis or gross metabolic profiling. Therefore,


identification and quantification of the actual bioactive compounds are required to guarantee the bioactivity of a particular Zingiber sample even after performing authentication by molecular and/or chemical markers.

Key words: Zingiber officinale, Zingiberaceae, ginger, authentication, metabolic profiling, GC/MS, HPLC, gingerols, anti-inflammatory, trnL-F, rps16

1. Introduction

Ginger (Zingiber officinale Rosc.), a member of the tropical and sub-tropical Zingiberaceae, has been cultivated for thousands of years as a spice and for medicinal purposes. It is used extensively in Traditional Chinese Medicine to treat headaches, nausea and colds and in Ayurvedic and Western herbal medicinal practice for the treatment of arthritis, rheumatic disorders and muscular discomfort (Dedov et al., 2002). This species contains biologically active constituents including the main pungent principles, the gingerols and shogaols. The gingerols, a series of chemical homologs differentiated by the length of their unbranched alkyl chains, were identified as the major active components in the fresh rhizome (Govindarajan, 1982), with [6]-gingerol (5-hydroxy-1-[4´-hydroxy-3´-methoxyphenyl]decan-3-one) being the most abundant. In addition, the shogaols, another homologous series and the dehydrated form of the gingerols, which result from the elimination of the OH group at C-5 and the consequent formation of a double bond between C-4 and C-5, are the predominant pungent


constituents in dried ginger (Connell and Sutherland, 1969; Mustafa et al., 1993). [6]-Gingerol has been found to possess various pharmacological and physiological effects including anti-inflammatory, analgesic, antipyretic, gastroprotective, cardiotonic, and antihepatotoxic activities (Bhattarai et al., 2001; Jolad et al., 2004). Due to these properties, ginger has gained considerable attention as a botanical dietary supplement in the USA and Europe in recent years, especially for its use in the treatment of chronic inflammatory conditions. Despite ginger’s widespread medicinal and culinary uses, the authentication of ginger samples remains a difficult problem due to heterogeneity of the plant material, contamination with similar-looking plants and the purposeful adulteration of some commercial samples. Because of these problems, authentication of the raw material is very important to ensure that specific batches of dried, chipped or ground ginger are of the quality desired for the manufacture of reliable botanical dietary supplements. To help address this problem, we investigated a large diversity of plant material within the ginger species and between ginger and several closely related species in the genus Zingiber, using molecular and metabolic profiling coupled with anti-inflammatory activity data. The species analyzed were selected because they have documented medicinal uses (specifically anti-inflammatory), are used as substitutes for ginger as spices/flavoring agents, or are sometimes mistaken in the popular market for ginger (Ando et al., 2005; Nakamura et al., 2004; Miyoshi et al., 2003; Murakami et al., 2004; Yang et al., 1999).


2. Results and discussion

The major goals of this project were to investigate the effect of genetic diversity, rather than environmental influences, on the chemistry and bioactivity (anti-inflammatory activity) of ginger samples and to establish a framework for the authentication of these important botanicals. To accomplish these goals, fresh frozen greenhouse-grown Zingiber samples were used for DNA sequence-based phylogenetic analysis, GC/MS-based metabolic profiling, HPLC quantitation of gingerols, and anti-inflammatory assays. As described in the Experimental section, all samples were grown at the same time under identical conditions in the same greenhouse to ensure that environmental effects were eliminated in this study.

2.1. Phylogenetic analysis based on molecular data

One of our objectives was to investigate the genetic variability of ginger obtained from different sources and the interspecific differences between ginger (Z. officinale) and other medicinal Zingiber species. A phylogeny of 104 species in 41 genera representing all four tribes of the Zingiberaceae was reported by Kress et al. (2002). That study, which was based on DNA sequences of the nuclear internal transcribed spacer (ITS) and plastid matK regions, did not include Z. officinale (ginger), Z. mioga, Z. montanum, Z. spectabile and Z. zerumbet. These two regions, as described in the Experimental section, are unsuitable for the phylogenetic determination of ginger specimens, either because they do not amplify well using standard primers (the matK-trnK flanking intergenic spacer regions) or because the gene (ITS) is present in more than one copy in the genome of ginger, leading to undeterminable sequence data from PCR-product-based sequencing runs. We therefore used solely the rps16 and trnL-F regions for our analysis of Z. officinale, Z. mioga, Z. montanum, Z. spectabile, Z. zerumbet and Alpinia galanga. All species were chosen because of their traditional use as medicinal plants and/or because they are used as adulterants of ginger (Langner et al., 1998). A. galanga also served as the outgroup for our phylogenetic analyses. In the plant specimens that we examined, the intron of rps16 had a total aligned length of 742 bp, and the trnL-F region had a total aligned length of 891 bp. The combined matrix with the indel characters contained 1633 bp. Phylogenetic analysis (using maximum parsimony, see Experimental) of the joined sequences from the intron of rps16 and the trnL-F region resulted in a single consensus unrooted parsimonious tree (Fig. 1A). The consensus tree produced when rps16 and trnL-F regions were used independently to produce parsimonious trees did not differ from the tree produced when the datasets were combined (data not shown). All of the ginger (Z. officinale) samples contained the identical sequence over this entire region, resulting in a single, undifferentiated clade for these samples in the phylogenetic analysis, even though many of these sample lines had been obtained from very different geographical origins (see Table 1). However, the sequence of the ginger samples differed from those of Z. mioga, Z. montanum, Z. spectabile, Z. zerumbet, and the outgroup Alpinia galanga for both the rps16 intron and the trnL-F region, resulting in clear delineation of all species in our analysis. In particular, there are many single nucleotide polymorphisms in these


sequences, allowing us to distinguish ginger from other species based on the sequence data. Some of these sequence differences are illustrated in Fig. 2.
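As a minimal sketch of how single nucleotide differences between two aligned sequences can be located, the Python function below compares equal-length aligned strings position by position and skips alignment gaps. The function name `snp_positions` and the short toy sequences are hypothetical; they are not the rps16 or trnL-F sequences from this study.

```python
def snp_positions(seq_a, seq_b, gap="-"):
    """Return 0-based positions where two aligned sequences differ.

    Positions where either sequence has a gap are skipped, so indels
    are not counted as single nucleotide polymorphisms.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return [
        i
        for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()))
        if a != b and a != gap and b != gap
    ]


# Toy aligned fragments (hypothetical, for illustration only)
print(snp_positions("ACGTTGCA", "ACGATGCA"))
```

Identical sequences, as observed for all Z. officinale samples, would give an empty list in this sketch.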

2.2. GC/MS-based metabolic profiling and fingerprinting of Zingiber species

Because medicinal ginger is often sold as ground rhizome powders or as alcoholic or non-polar solvent extracts from these powders, it may be very difficult to obtain DNA evidence to demonstrate adulteration or authentication of particular samples. Thus, we evaluated the utility of chemical characters derived from metabolic profiling experiments to reconstruct the phylogeny of the medicinal Zingiber species and to distinguish between species. Non-polar compounds were extracted with methyl tert-butyl ether (MTBE) and thereafter analyzed by GC/MS. Compounds detected, identified and quantified by GC/MS are listed in Table 2. Many compounds present in small quantities were not included in this analysis because they could not be readily identified due to insufficient mass spectrum quality or because their relative concentration could not be adequately evaluated. Based on the compounds that were detected and/or identified from the different samples, we found that all ginger (Z. officinale) samples showed very similar metabolic fingerprints, i.e., there was no apparent qualitative difference in their GC/MS chromatograms. These plants were grown at the same time under identical conditions in the same greenhouse. Thus, when all differential environmental factors were eliminated, the ginger samples appeared to be chemically very similar, at least at the metabolic fingerprint level (they produced the same compounds), even though these lines were


originally obtained from very diverse populations around the country. This result matched the molecular data based on the joined sequences of the rps16 intron and the trnL-F region. At the metabolic profile level, however, clear differences between lines could be observed, where many of the compounds were found at different levels in the different ginger lines. This suggests that genetic factors control not only which specific compounds are produced, but also at what levels. This is not surprising, but does suggest that ginger obtained from different sources may have significantly different levels of active compounds. This is addressed further below in section 2.3. Interestingly, the phylogenetic trees generated using the metabolic profiling and the DNA data were almost identical in structure (see Fig. 1B). The only observed difference was that Z. zerumbet was found to be more closely related to Z. officinale based on molecular data whereas Z. montanum was more closely related to Z. officinale based on the chemical data. However, bootstrap support for these differences was not very strong. Thus, major volatile chemical markers were very effective at distinguishing Zingiber species and at reconstructing essentially the same phylogeny as was obtained using the DNA sequence data, at least when the chemical characters were considered on a presence/absence basis. Other phylogenetic studies using both molecular and chemical data have been performed, such as with the genera Peltigera (Peltigeraceae) and Sticta (Stictaceae). Studies of these two genera also showed that both types of characters are useful and complementary (McDonald et al., 2003; Miadlikowska and Lutzoni, 2000).
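The distinction between the fingerprint level (which compounds are present) and the profile level (how much of each) can be made concrete with a small Python sketch. It converts quantitative profiles to presence/absence characters and compares them with a Jaccard distance. The Jaccard distance is a stand-in chosen for brevity and is not the maximum parsimony method used in this study; the compound levels, the detection limit, and the function names are hypothetical.

```python
def to_presence(profile, detection_limit=0.0):
    """Reduce a {compound: level} profile to the set of detected compounds."""
    return {name for name, level in profile.items() if level > detection_limit}


def jaccard_distance(set_a, set_b):
    """1 - |A and B| / |A or B|; 0.0 when both sets are empty."""
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Hypothetical relative levels, for illustration only
ginger_line_1 = {"zingiberene": 12.0, "citronellal": 3.0, "limonene": 0.0}
ginger_line_2 = {"zingiberene": 4.0, "citronellal": 9.0, "limonene": 0.0}
other_species = {"zingiberene": 0.0, "citronellal": 0.0, "limonene": 5.0}

# Same compounds, different levels: identical fingerprints
print(jaccard_distance(to_presence(ginger_line_1), to_presence(ginger_line_2)))
# No shared compounds: maximally different fingerprints
print(jaccard_distance(to_presence(ginger_line_1), to_presence(other_species)))
```

Under these toy values the two ginger lines are indistinguishable on a presence/absence basis even though their quantitative levels differ, which is the situation described for the Z. officinale lines.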


Many of the compounds identified in our metabolic profiling analysis could be used as marker compounds to distinguish between the different Zingiber species. For example, many of these compounds were only detected in Z. officinale samples and not in the other species that we examined. These included the gingerols and their derivatives, the shogaols and paradols, which have not been reported in any species besides Z. officinale, and which represent unique marker compounds for this species. In addition, citronellal, (E)- and (Z)-citral, (+)-cyclosativene, zingiberene, α-cubebene, germacrene D, cedr-8-ene, and α-farnesene, among others (see Table 2), were also only present in Z. officinale. These compounds, which were not detected in the other examined Zingiber species, could also be used as chemical markers to distinguish Z. officinale from other Zingiber species. Similarly, a number of other compounds were present in extracts from only one of the other Zingiber species. For example, 3-carene and limonene were detected only in Z. zerumbet; 1,3-cyclohexadiene, 1-methyl-4-(1-methyl)- and 4-isopropyl-1-methyl-2-cyclohexen-1-ol were detected only in Z. montanum; and 1,4-bis(methoxy)-triquinacene was found only in Z. mioga. These compounds can be used as markers for the identification of ginger (Z. officinale) samples that have been adulterated by these other species. Z. spectabile did not offer any known compound that distinguished it from the other species. We observed, however, the presence of several unknown compounds that were found only in Z. spectabile (Table 2). These compounds, identified as DRG-GM1-N1-8.86-136-93-121, DRG-GM1-N1-9.03-134-119-105 and DRG-GM1-N1-9.11-136-93-68, were named following the nomenclature rules outlined by Bino et al. (2004) for the naming of unknown compounds in metabolic profiling investigations.

We also identified a number of compounds that were not detected in Z. officinale but were detected in more than one of the other Zingiber species. These included 3-thujene; 1-methyl-4-(1-methylethyl)-1,4-cyclohexadiene; and p-menth-1-en-4-ol, among others (Table 2). These compounds could also be used as markers to identify supposedly pure samples of Z. officinale that have actually been adulterated with Z. spectabile (or the other Zingiber species in which they were found).
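The marker logic described above reduces to simple set operations, sketched below under stated assumptions. The compound lists are a small illustrative subset of those named in the text. The assignment of 3-thujene and p-menth-1-en-4-ol to particular non-officinale species is assumed for the example, since the text only states that they occur in more than one of them. The function names are hypothetical.

```python
# Illustrative subset of detected compounds per species. Which non-officinale
# species contain 3-thujene and p-menth-1-en-4-ol is an assumption made for
# this example; consult Table 2 for the actual distribution.
detected = {
    "Z. officinale": {"[6]-gingerol", "zingiberene", "citronellal"},
    "Z. zerumbet": {"3-carene", "limonene", "3-thujene"},
    "Z. montanum": {"3-thujene", "p-menth-1-en-4-ol"},
    "Z. spectabile": {"p-menth-1-en-4-ol"},
}


def compounds_in_others(detected, species):
    """Union of all compounds detected in every species except `species`."""
    return set().union(*(c for s, c in detected.items() if s != species))


def unique_markers(detected, species):
    """Compounds detected in `species` and in no other examined species."""
    return detected[species] - compounds_in_others(detected, species)


def adulteration_markers(detected, target):
    """Compounds absent from `target` but present in some other species."""
    return compounds_in_others(detected, target) - detected[target]


def flag_sample(sample_compounds, detected, target="Z. officinale"):
    """Return, sorted, the compounds in a sample that suggest adulteration."""
    return sorted(set(sample_compounds) & adulteration_markers(detected, target))


# A hypothetical 'ginger' sample that also contains limonene is flagged.
flags = flag_sample(["[6]-gingerol", "zingiberene", "limonene"], detected)
```

A marker that is unique to one species points to a specific adulterant, whereas a marker shared by several non-officinale species only indicates that the sample is not pure Z. officinale.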

2.3. Quantitation of gingerols using HPLC

In order to further investigate potential variation of chemical composition in samples of Z. officinale from different sources, three major bioactive components, [6]-, [8]-, and [10]-gingerol, were quantitatively determined (metabolite target analysis). Calibration curves were derived from three independent injections of five concentrations of [6]-, [8]-, and [10]-gingerols versus the peak area. Linearity was found in the concentration range between 25 and 200 µg·ml⁻¹, with high reproducibility and accuracy. Regression analysis of the experimental data points showed a linear relationship with excellent correlation coefficients (r²) for [6]-, [8]-, and [10]-gingerols, being 0.999 for each, suggesting high precision in this analysis. The linear regression equations for the calibration curves of [6]-, [8]-, and [10]-gingerols were y = 52.85x – 95.27, y = 46.02x – 73.92, and y = 43.57x – 59.58, respectively, where x was the concentration of standard gingerol (µg·ml⁻¹) and y was the total peak area. Although the gingerols are able to undergo dehydration reactions, leading to formation of the shogaols, this only occurs at

elevated temperatures or if solutions of the gingerols are dried in air. We saw no evidence for the presence of the shogaols in the standard solutions used for this analysis (although we were able to detect, identify and quantify the shogaols in extracts from ginger rhizomes, see Table 2). Thus, we believe that our results are not only precise, but are accurate representations of the actual content of the gingerols in these samples. The gingerol contents of the Z. officinale samples, as summarized in Table 3, were determined by HPLC for 10 accessions that were obtained from different origins. The total gingerols content varied from 1.931 to 3.577 mg/g, with [6]-, [8]-, and [10]-gingerols ranging from 1.284 to 1.905, 0.220 to 0.595, and 0.310 to 1.128 mg/g, respectively (see Table 3). The observed differences were significant for most pair-wise comparisons, as determined by ANOVA (P