Primer Prim'r: A web based server for automated primer design

13

Journal of Structural and Functional Genomics 5: 13–21, 2004. © 2004 Kluwer Academic Publishers. Printed in the Netherlands.

Primer Prim’ r: A web based server for automated primer design John K. Everett 1,2, Thomas B. Acton 2 & Gaetano T. Montelione 1,2,* 1

Department of Biochemistry, Robert Wood Johnson Medical School, Piscataway, NJ 08854-5638, USA; Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, and Northeast Structural Genomics Consortium, 679 Hoes Lane, Rutgers University, Piscataway, NJ 08854-5638, USA; *Author for correspondence (Fax: +1-732-235-5633; E-mail: [email protected]) 2

Received 30 December 2003; accepted in revised form 15 January 2004

Key words: domain parsing, Northeast Structural Genomics Consortium, primer design program, structural genomics Abstract The Northeast Structural Genomics Consortium (NESG) is one of nine NIH-funded pilot projects created to develop technologies needed for structural studies of proteins on a genome-wide scale. One of the most challenging aspects of this emerging field is the production of protein samples amenable to structural determination. To do this efficiently, all steps in the protein production pipeline must be automated. Here we describe the Primer Prim’ r program (linked from www-nmr.cabm.rutgers.edu/bioinformatics), a web-based primer design program freely available to the scientific community, which was created to automate this time consuming and laborious task. This program has the ability to simultaneously calculate plasmid specific primer sets for multiple open reading frame (ORF) targets, including 96-well and greater formats. Primer Prim’ r includes a library of commonly used plasmid systems and possesses the ability to upload user-defined plasmid systems. In addition to calculating gene-specific annealing regions for each target, the program also adds appropriate restriction endonuclease recognition or viral recombination sites while preserving a reading frame with plasmid based fusions. Primer Prim’ r has several useful features such as sorting calculated primer sets by target size, facilitating interpretation of PCR amplifications by agarose gel electrophoresis, as well as supplying the molecular biologist with many important characteristics of each target such as the expected size of the PCR amplified DNA fragment and internal restriction sites. The NESG has cloned over 1500 genes using oligonucleotide primers designed by Primer Prim’ r. Introduction In high throughput structural and functional proteomic efforts, a key challenge is the task of maintaining a rapid and continuous flow of high quality protein samples. Although many aspects of cloning, protein expression and protein screening have been automated through robotics, the design of oligonucleotides required for generating protein expression constructs is not as well developed. The design of PCR primers for cloning projects is a time consuming proThe software can be accessed interactively at: http://www-nmr.cabm.rutgers.edu/bioinformatics/ It can be run through its web-based interface using data provided by the Examples Icon on the Entering Cloning Targets page.

cess. Incorrectly designed primers invariably lead to cloning project failures and loss of valuable time. The process of calculating the template annealing region of primers is only half of the design process, and unfortunately is the point at which most primer design software suites stop. Restriction endonuclease sites or viral recombination sites also need to be designed into primers in order to insert the appropriate PCR product into a plasmid while maintaining a proper reading frame. In order to advance and complete the automation of the primer design process, we have developed a web-based server, ‘Primer Prim’ r’. This program designs PCR primers engineered to amplify fixed regions of template DNA that are destined to be inserted into user-defined plasmids, specifica-

14 tions for which are provided by the program’s plasmid libraries or by the user. Guided by the unique requirements of each specific plasmid, endonuclease restriction sites and frameshift corrections or viral recombination sites (e.g., Invitrogen’s Gateway® cloning system [1]) are introduced into designed primer sets. The program has the ability to simultaneously calculate primers for multiple targets, and routinely provides primer sets for use in 96-well format robotic cloning processes.

Primer Prim’ r program The Primer Prim’ r program can be accessed via the internet from: www-nmr.cabm.rutgers.edu/bioinformatics. From the program’s main page (Figure 1A), users may either proceed to the program’s Options section (Figure 1B) or Plasmids section (Figure 1C), or they may directly begin the primer design process. Here we will describe the program as it proceeds through each successive step in the primer design process and then describe the user controlled options.

Materials and methods PCR amplification and agarose gel electrophoresis

Target entry

PCR amplifications were carried out with Advantage HF 2 Polymerase (BD Biosciences) in a volume of 50 µl, using the vendor supplied buffer reagents and procedure. For prokaryotic targets, 10 ng of genomic DNA (ATCC) extracted from each respective organism was used as template for each reaction. A twostep PCR program was used with an ABI 9700 thermocycler, which increased the annealing temperature as the cycles progressed. Powerscript Reverse Transcriptase (BD Biosciences) was used for the reverse transciptase reactions, again using the buffers and procedure supplied by the manufacturer. Oligo dT 12–18 or a gene specific primer was used in the RT reactions. Drosophila melanogaster embryo (6947-1), larva (6946-1) and adult (6945-1) derived PolyA ⫹ RNA from BD Biosciences were used as the template for the RT-PCR reactions for this organism. The template for the human target RT-PCR reactions was PolyA ⫹ RNA derived from human colorectal adenocarcinoma (6586-1), brain (6516-1) and testis (6535-1), also from BD Biosciences. In the case of both eukaryotes, a single large-scale RT-PCR reaction was carried out for each developmental or cell type. For each organism, the resulting reaction products were combined to form a common pool of nascent cDNA and 1/100 volume was used as template for each ensuing PCR reaction. All PCR results were visualized on a 2% agarose gel run in 1× Tris-acetateEDTA with ethidium bromide, using a 100 bp DNA Ladder (New England Biolabs Inc.) as a molecular weight standard.

After accessing the program’s main page, the primer design process begins by clicking the start icon and accessing the target acquisition step (Figure 2A). In the field provided, users enter FASTA formatted DNA target(s) into the program. Here, any number of targets can be entered simultaneously. Though designed to accept large lists of targets (96- or 384-well formats), the program can also be used to design a primer set for a single target.

Target verification Once the program has acquired the targets, each target sequence is screened for errors and potential problems (Figure 2B). Target errors such as nonstandard characters, incomplete codons, internal stop codons and warnings such as rare codons and lack of a start codon are reported before the primer design process continues. Errors must be corrected before the primer design process can continue while warnings may be corrected at the user’s discretion. Warnings indicate target sequence features that do not interfere with the cloning of an intact ORF but may effect expression of the target. Placing the mouse over an icon spawns an information window detailing the error or warning that the icon demarks.

Plasmid selection Once the protein target sequences have been uploaded into the program and verified as described above, users select the plasmid systems for which primer sets are to be designed (Figure 2C). The program displays

15 vectors [2], is the default plasmid set. Users may upload plasmids detailed in a file on their local computer. The program selects the most 5⬘ and most 3⬘ restriction sites of the polylinker of a given plasmid as the default restriction sites. This is done in an attempt to limit the number of expressed non-native residues (nnr), since the addition of extraneous residues can often have a negative effect on protein solubility, folding or structural studies [3]. Primer Prim’ r calculates and lists any non-native residues that would be expressed between the target sequences and plasmid based fusions. Users may manually select restriction sites and the nnr list is dynamically updated in response to their selection. Restriction sites internal to the target of interest are not made available for incorporation into the primer set. Many functional genomics efforts are using ligation-independent cloning technology, which takes advantage of viral recombinases [4]. Primer Prim’ r also has the ability to calculate primers intended for the Gateway® recombinational cloning system. In order to expedite primer design while considering multiple targets, we have introduced an ‘Express’ option. When selected, this option instructs the program to apply the plasmid systems selected for the current target to all additional targets.

Program output

Figure 1. Primer Prim’ r’s main page, showing Program Options and Plasmid Options pages. (A) The Primer Prim’ r main page linked from www.nmrcabm.rutgers.edu/bioinformatics. Here the user gains entry into the primer design process or can access either the program options section or the plasmids options section. (B) In the Options section, users can customize a multitude of parameters for primer calculations, such as melting temperature range and host expression system. (C) The Plasmids Options window allows the user to upload plasmid systems from their computer for use with Primer Prim’ r.

a list of available plasmid systems that the user can employ. The NESG multiplex expression vector system, a collection of customized pET T7 expression

The program initially displays the calculated primers in an outline format (Figure 3A). Users may toggle the output to a tabular format (Figure 3B) or vendor specific formats (Figure 3C) listed atop the output page. The output formats are customizable. For example, the primer sets can be listed such that all forward primers are listed separately from the reverse primers. This is extremely helpful for setting up automated 96-well PCR reactions, in which one plate contains all of the forward primers and another plate has the reverse primers in the corresponding well of each target’s forward primer. The components of each primer sequence are uniquely colored. For example, the target annealing region, restriction endonuclease site, nucleotides added to preserve the translational reading frame, nucleotides added to ensure efficient endonuclease cleavage and stop codons each possess an identifying color. Aside from the calculated primers, the output section reports several other important features useful to the molecular biologist. These include the primer set

16

Figure 2. Target acquisition, target verification and plasmid selection windows. (A) In the target acquisition step, users enter FASTA formatted DNA sequences to be used for primer design. (B) Targets are submitted for target verification. Errors and warnings, such as the presence of internal stop codons and rare codons for the selected expression host, are reported. Red error icons indicate target errors that need to be addressed before the primer design process can continue. Yellow warning icons indicate warnings of potential problems that may be addressed at the user’s discretion. Placing the mouse over an icon spawns an information window that elaborates additional information about the error or warning. (C) Once target sequences are verified, users select the plasmid(s) into which targets will be inserted. Non-native residues (if any) that will be expressed between targets and plasmid based fusions are reported. If users wish to clone into the same plasmid(s) for a list of additional targets, the ‘Express’ button can be checked.

melting temperature (T M) and the number of base pairs expected from the PCR amplification of each target. The output also lists not only which restriction enzymes should be used in the cloning steps, but also

any restriction sites that are internal to the target sequence. Other important features that are listed include which plasmids the primer sets are designed for, and the length of each primer. In addition, the T M of

17

Figure 3. Primer Prim’ r output. (A) Default report form. For each primer set, the forward and reverse primers are listed as well as the intended plasmid. The ‘Dimers’ section reports the T M’s of predicted forward:forward, forward:reverse and reverse:reverse primer dimers. Clicking on the Dimers section reveals a detailed report of the dimer predictions. If requested in the Program Options section, the genomic primer scan (gps) results will be displayed in the report. Roman numerals indicate chromosomes upon which alternative annealing sites reside, and these are color coded according to the length of the alternative site. If the requested genome is not organized into chromosomes, appropriate genomic markers are used. Clicking on the gps section reveals a detailed report of the scan. Additionally, the output can be displayed in a tabular format (B) and vendor specific formats (C) as well.

potential primer dimer structures and the results of genomic primer scans (described below) are reported.

When PCR amplifying several targets at one time, it is time consuming and laborious to verify with agarose gel electrophoresis that each target has an ampli-

18 fication product of the correct length. In addition, complex amplification methods such as RT-PCR may result in multiple DNA bands for each target. The correct band must therefore be identified for excision from the gel. This level of complication is not compatible with high-throughput protein production. In order to circumvent these difficulties, users can organize the targets according to the size of the expected PCR amplified DNA fragment. This organization facilitates the analysis and excision of DNA bands, since the resulting correct set of PCR products forms a ladder pattern on an agarose gel (Figure 4A). This ladder pattern allows for quick and easy detection of an incorrect PCR product. Alternately, users can organize primer sets so that the PCR products form a staggered ladder pattern on an agarose gel (Figure 4B), decreasing the probability of cross-contamination between DNA bands during excision and other manipulations.

Program options Although the default values of Primer Prim’ r are appropriate for most applications, it may be advantageous to change program parameters in order to optimize the primer design process for different systems and strategies. To address this, we have designed Primer Prim’ r to allow customization of several parameters, which are controlled by the menu shown in Figure 1B. The following is a list of customizable parameters: Host expression system Users can define which host the designed expression system constructs are destined for. Defining the host expression system allows the program to identify rare codons that may reduce protein expression in the selected organism. The program currently supports Escherichia coli and Saccharomyces cerevisiae expression systems.

Figure 4. Agarose gel electrophoresis of a PCR amplification using oligonucleotides designed by Primer Prim’ r. In each panel, the NESG target identification code is listed, in this case E for E. coli and R for Rutgers protein production center, followed by the assigned target number. Under each target name, the expected size of the PCR product is listed, the center lane contains a 100 bp ladder. (A) The program can sort the output file such that the targets are ordered in increasing size. Primer plates with this configuration are then used for automated PCR setup, the results form a step-like pattern on the gel. This is of great utility in assaying the success of the PCR reactions. (B) An alternating step pattern can also be selected; this is done in an effort to decrease the probability of crosscontamination between adjacent lanes during band excision.

Codon completion In many instances one or two nucleotides must be added between a target sequence and an endonuclease restriction site or viral recombination site in order to preserve the reading frame between the target and plasmid based fusions. These additional nucleotides lead to the expression of a non-native residue. Since

any of the four standard nucleotides may be used to preserve the frame, the resulting non-native residue may sometimes be tailored to avoid non-native hydrophobic or cysteine residues. In the interest of improved solubility and structural determination, the addition of a more innocuous amino acid residue such as serine rather than a large hydrophobic residue such

19 as tryptophan is often advantageous. Therefore, we have designed the program to use only those residues provided in an input list as additional non-native amino acids in the translated sequence, ranked from most to least preferred. Based on this hierarchy, the program selects the added nucleotides in such a manner as to encode the least disruptive non-native residue(s).

truncation series or the target may be a domain from a multi-domain protein. If a start codon is not present at the beginning of a target sequence, the program can add a start codon to the forward primer. The start codon can be either appended to the beginning of the target sequence or it may replace the first codon of the sequence. Genomic primer scan

Truncation cascade This option offers the calculation of a truncation cascade, which is particularly useful for defining multiple constructs of potential domains within protein sequences. Protein truncations have been shown to exhibit different solubility and oligomerization properties than their full length counterparts [5,6]. The program allows for the creation of a series of truncations from either protein terminus or both, resulting in a set of related constructs with different lengths. The user can define the number of different truncations as well as the number of amino acids removed by each truncation. These truncations are created by the design of additional primer sets that anneal within the interior of target sequences. Non-fusion reverse primer The program allows for the creation of an alternate reverse primer that possesses a stop codon that prevents the expression of plasmid-based carboxy terminal fusions. The alternative reverse primer allows for the creation of two constructs (C terminal fusion and non-fusion) with only three primers (forward, reverse and reverse with stop codon) and a C terminal fusion plasmid. Melting temperature range All primer sets created for a group of targets possess a T M that falls within a T M range. This option allows users to define the T M range for the designed primers. The default melting temperature range is 60 °C to 70 °C. The T M of all primer sets are geared towards the center of the user-defined T M range.

PCR reactions that employ genomic DNA as a PCR template are often plagued with difficulties resulting from primers annealing to unintended regions of the genomic template. The program has the ability to scan a requested genome for such alternate annealing sites within the complete genome sequence of a particular organism. If significant alternate annealing sites are identified, the primer set T M range should be increased in order to yield longer primers with greater specificity. The program currently supports scanning of the Caenorhabditis elegans, Drosophila melanogaster, Escherichia coli, Saccharomyces cerevisiae and Homo sapiens genomes. T M formula There are several formulas currently used by scientists to calculate the melting temperature of complementary DNA fragments. The default used in the Primer Prim’ r program is the Wallace formula [7]. However, we have enabled the program to employ three other formulas for this calculation (see below). If the user chooses to employ a formula that requires the input of the PCR buffer conditions, the options section provides fields (cation type, cation concentration, primer concentration) to characterize the PCR reaction. The program offers the following T M calculation formulas [7–10]: Wallace et al.

T m ⫽ 2 ⴱ AT ⫹ 4 ⴱ GC

Rychlik et al.

Tm ⫽

⌬H ⌬S ⫹ R ⴱ ln 共C/4兲 ⫹ 16.6 log 关K兴 ⫺ 273.15

SantaLucia et al. Missing start codon This option instructs the program how to handle instances of targets that are missing start codons. A target’s start codon may be absent due to a requested

Tm ⫽

⌬H 共共⌬S ⫹ 共0.386 ⴱ 共len ⫺ 1兲 ⴱ ln 关K兴兲兲 ⫹ R ⴱ ln 共C/4兲兲 ⫺ 273.15

20

LeNovere

Tm ⫽

⌬H ⌬S ⫹ R ⴱ ln 共C/2兲 ⫹ 12.5 log [Na兴 ⫺ 273.15

Plasmids options The plasmid systems available to the program are defined in the plasmids options, shown in Figure 1C. The default plasmids are the T7 polymerase-based NESG multiplex expression vectors [2]. These default plasmid systems are defined in a plasmid library file located on the program’s server. Users may instruct the program to use a plasmid library file on their own computer in order to design primer sets for their own plasmid systems. Extensive documentation is provided explaining how to upload user plasmid systems and their use in the Primer Prim’ r program.

Results and discussion The Northeast Structural Genomics Consortium is on pace to produce approximately 2500 expression constructs per year. Designing the large number of corresponding primers necessary for producing these clones would be an extremely daunting task if done manually, and certainly prone to errors. Therefore it was essential to find the means to automate the primer design step in protein production. After searching literature databases as well as internet resources, it became apparent that there was a lack of software available to meet our needs. As a structural genomics project, unlike other large scale genomic efforts such as microarray studies, our focus is on producing large quantities of high-quality protein samples. This necessitates cloning targets into protein expression vectors, which brings about a new set of requirements for primer design programs since reading frames need to be preserved and vector and genomic-specific issues need to be considered. Specifically, we require a resource that can consider multiple targets (96-well format) and add the proper cloning sites for each target while preserving any open reading frame with plasmid based affinity tags. In order to fulfill this requirement in our protein production pipeline, we created the Primer Prim’ r program, a web-based primer design program. In keeping with the spirit of the National Institute of General Medical Sciences (NIGMS) / National Institute of Health (NIH) Protein

Table 1. Rate of successful PCR amplifications using Primer Prim’ r. Organism

Correct size product (%) # of targets

E. coli B. subtillis Archea A. aeolicus D. melanogaster (RT-PCR) H. sapiens (RT-PCR)

99 95 93 91 87 70

120 141 116 96 151 596

Structure Initiative, this resource is freely available to the scientific community. Utilizing the Primer Prim’ r program, we have successfully PCR amplified and cloned over 1500 targets for the NESG structural genomics effort. In Table 1, we list some of the organisms from which we have amplified protein open reading frames that have been cloned and amplified using Primer Prim’ r based oligonucleotides, together with statistics on the percentage of correctly amplified products. Overall, considering genes from prokaryotic sources, greater than ninety percent of the oligonucleotide primers designed by Primer Prim’ r have the correct-size DNA product and sequence after amplification. In addition to the organisms listed in this table, we have had similar success with Aeropyrum pernix, Aquifex aeolicus, Archaeglobus fulgidis, Brucella melitensis, Campylobacter jejuni, Caulobacter crescentus, Deinococcus radiodurans, Fusobacterium nucleatum, Lactococcus lactis, Neisseria meningitides, Pyrococcus furiosus, Pyrococcus horikoshi, Staphylococcus aureus, Streptococcus pyogenes, Streptomyces coelicolor and Thermus thermophilus ORFs. In the case of the eukaryotic ORFs, including Drosophila melanogaster and Homo sapiens targets, the percentage of correct PCR products isolated is lower (87% and 70%, respectively) than that obtained using bacterial targets (Table 1). However, for D. melanogaster RTPCR reactions were employed using PolyA RNA isolated from embryonic, larval and adult life stages as template, a significantly more complex reaction. For the human targets, RT-PCR reactions are used with PolyA-RNA isolated from adenocarcinoma, brain and testes cells used as template. A likely explanation of the decreased success rates for eukaryotic targets is the fact that many of the ORFs targeted in these RTPCR experiments are not expressed as mRNA transcripts in these cell types under the harvested stage of development. Indeed, somewhat better success was

21 obtained with the Drosophila targets, in which a greater coverage of developmental stages and organs is included in the RNA template pool, than with the narrower pool of human RNA transcripts. In conclusion, we have instituted the Primer Prim’ r program into the NESG protein production pipeline with great success and it has become an invaluable resource for the consortium.

2.

3. 4. 5. 6. 7.

Acknowledgements We thank K. Ho for expert technical assistance. This work was supported by a grant from the Protein Structure Initiative of the National Institutes of Health (P50 GM62413).

References 1.

Walhout, A.J., Temple, G.F., Brasch, M.A., Hartley, J.L., Lorson, M.A., van den Heuvel, S. and Vidal, M. (2000) Methods Enzymol. 328, 575–592.

8. 9. 10.

Acton, T.B., Xiao, R., Clement, T., Gunsalus, K, Everett, J.K., Shastry, R., Ma, L.C., Chiang, Y., Wu, M., Ho, C.K., Palacios, D., Montelione, G.T., to be published. Qamar, S., Islam, M. and Tayyab, S. (1993) J. Biochem. (Tokyo) 114, 786–792. Yokoyama, S. (2003) Curr. Opin. Chem. Biol., 7, 39–43. Scott, E.E., Spatzenegger, M. and Halpert, J.R. (2001) Arch. Biochem. Biophys. 395, 57–68. Rebrin, I., Geha, R.M., Chen, K. and Shih, J.C. (2001) J. Biol. Chem. 276, 29499–29506. Wallace, R.B., Shaffer, J., Murphy, R.F., Bonner, J., Hirose, T. and Itakura, K. (1979) Nucleic Acids Res. 6, 3543–3557. Rychlik, W., Spencer, W.J. and Rhoads, R.E. (1990) Nucleic Acids Res. 18, 6409–6412. SantaLucia, J. Jr, Allawi, H.T. and Seneviratne, P.A. (1996) Biochemistry 35, 3555–3562. Le Novere, N. (2001) Bioinformatics 17, 1226–1227.