Introduction to Programming: Perl for Biologists
Timothy M. Kunau
Center for Biomedical Research Informatics
Academic Health Center
University of Minnesota
[email protected]
Bioinformatics Summer Institute 2007

Introduction to Programming: Day two

Day I

• Art and Programming
• Getting Started
• Biology and Computer Science
• Bioinformatics Data
• Perl basics:
  • Strings and Variables
  • Math and Logic
  • Looping, operators, and functions

Day II

• Assignment discussion
• Data from outside the program
• Writing out data
• Data into arrays and hashes
• Array operations
• Scope and Good practices
• RegEx

Day I: assignment review.

1. Calculate the reverse complement of a DNA strand using the tr/// operation.
2. Read about file handling. (Safari on-line documentation is available.)
3. Read about Regular Expressions (regex). (Safari)
4. Find CPAN.ORG and locate a module that would be useful to you as a biologist.
5. Read about that module and email me ([email protected]) the following details:
   1. Name of the module.
   2. The name of the person who wrote it.
   3. What it does.
   4. How it would be useful to you.

Day I: assignment review.

1. Calculate the reverse complement of a DNA strand using the tr/// operation.


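The tr/// operator (translate)

• Replaces each character listed in the first section with the character at the same position in the second.
• The sections are plain character lists, not regular expressions, so ranges such as A-Z need no square brackets.

• $dna =~ tr/A-Z/a-z/;     # lowercase
• $dna =~ tr/A-Z/B-ZA/;    # shift cipher
• $dna =~ tr/ACGT/TGCA/;   # complement
• $dna = reverse($dna);    # reversing the complement gives the revcom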

s/// operator (substitute)

• Allows you to substitute whatever is matched in the first section with the value in the second section. (See m//.)

• $sport =~ s/football/soccer/g;
• $tdfwinner =~ s/Lance Armstrong/Ivan Basso/g;
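As a biological illustration of s///, the sketch below transcribes DNA to RNA by substituting U for T. The helper name transcribe and the sample sequences are assumptions made for this example, not part of the original slides.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): transcribe DNA to RNA with s///.
sub transcribe {
    my ($dna) = @_;
    (my $rna = $dna) =~ s/T/U/g;   # copy first, then substitute on the copy
    return $rna;
}

# With the /g flag, s/// returns the number of substitutions it made.
my $seq   = 'ATGTTTTAA';
my $count = ($seq =~ s/T/U/g);
print "Replaced $count bases: $seq\n";   # Replaced 5 bases: AUGUUUUAA
print transcribe('GATTACA'), "\n";       # GAUUACA
```

Copying before substituting leaves the caller's sequence untouched, which matters once the same DNA string is needed again later in a program.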

Reverse complement of a DNA strand

#!/usr/bin/perl -w
# Calculating the reverse complement of a strand of DNA

# The DNA
my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
print "Here is the starting DNA:\n\n$DNA\n\n";

# Calculate the reverse complement
my $revcom = reverse $DNA;

# The Perl translate/transliterate command is just what we need:
$revcom =~ tr/ACGTacgt/TGCAtgca/;
print "Here is the reverse complement DNA:\n\n$revcom\n";

CPAN


Day I: assignment review, CPAN modules

1. Name of the module.
2. The name of the person who wrote it.
3. What it does.
4. How it would be useful to you.

Getting Data from Files

open(HANDLE, "contig2_MT.fa") || die $!;
while (defined($line = <HANDLE>)) {
    if( $line =~ /^\>/ ) {
        print $line, "\n";
    }
}
close(HANDLE);

% ./file-handles.pl
>ContigId:Contig2 AssemblyProcessId:MtSC AssemblyProcessVersion:1

Getting Data from Files

open(HANDLE, "contig2_MT.fa") || die $!;
while (<HANDLE>) {
    if( $_ =~ /^\>/ ) {      # tests whether the line is a header
        print $_, "\n";      # prints the header line
    }
}
close(HANDLE);

% ./file-handlesII.pl
>ContigId:Contig2 AssemblyProcessId:MtSC AssemblyProcessVersion:1

Getting Data from Files

open(HANDLE, "contig2_MT.fa") || die $!;
@slurp = <HANDLE>;
print @slurp;
close(HANDLE);

% ./file-handlesIII.pl
>ContigId:Contig2 AssemblyProcessId:MtSC AssemblyProcessVersion:1
GGGTATACTTCCTCCTCCATTGTTTGAGATATCACAAGACTTGAAATTGA
GCACGACCCATATTCTACTTCAAGGCGTTGAAGCAAAAACTCACCATGGG
AAACTAAACAGGTTAGTAAGTAGGCATCACCATCATTTTATATCGATATG
GATAATAATGCACAAGACTTTCAAAGTTATCTTCAGATTCTTCCCCCTGT
TGAGTTTGCTTGCGTTTATGGATCATCTCTTCATCCAACCAATCATGACA
AGACAACCATGGTTGATTATATTCTTGGAGTTTCTGACCCTATACAATGG
CATTCTGAGAATCCGAAAATGAATAAGCATCACTATGCGTCATGGATGGT
GCACCTTGGTGGAGAGAGGCTGATTACCGCAGATGCAGATAAAATTGGTG
TGGGAGTACATTTCAACCCTTTTG
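The three file-reading slides use bareword filehandles and two-argument open. A minimal sketch of the same header scan with a lexical filehandle and three-argument open, which is a common alternative style, might look like this. The helper name fasta_headers and the in-memory sample data are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the header lines from an open FASTA filehandle.
sub fasta_headers {
    my ($fh) = @_;
    my @headers;
    while (my $line = <$fh>) {
        chomp $line;
        push @headers, $line if $line =~ /^>/;
    }
    return @headers;
}

# An in-memory "file" stands in for contig2_MT.fa so the sketch runs anywhere.
my $fasta = ">seq1 demo\nACGT\n>seq2 demo\nGGCC\n";
open(my $in, '<', \$fasta) or die "Cannot open in-memory file: $!";
print "$_\n" for fasta_headers($in);
close($in);
```

To read a real file, replace \$fasta with the file name, for example 'contig2_MT.fa'.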

Pass data into a program

while(<STDIN>) {
    print "stdin read: $_";
}

Pass data into a program

# the trailing | tells open to run the command and read its output
open(GREP, "grep '>' $filename |") || die $!;
my $i = 0;
while(<GREP>) {
    $i++;
}
close(GREP);
print "$i sequences in file\n";

Writing out data

open(OUT, ">outname") || die $!;
print OUT "sequence report\n";
close(OUT);

Writing out data

# appending with >>
open(OUT, ">>outname") || die $!;
print OUT "append this\n";
close(OUT);

Filehandles as variables

my $var = \*STDIN;

Filehandles as variables

open($fh, ">report.txt") || die $!;
print $fh "line 1\n";

Filehandles as variables

open($fh2, "report") || die $!;
$fh = $fh2;
while(<$fh>) {
    # something interesting goes here
}

Zero based economy...

• The first element is '0' for an index or first character in a string
  • computer scientists like it this way
  • as do most programming languages, including Perl
• Biologists often number the first base in a sequence as '1'
  • GenBank
  • BioPerl
• Interbase coordinates (Kent-UCSC, Chado-GMOD)

Coordinate systems

• Zero based, interbase coordinates
  A A T G G G T A G A
  0 1 2 3 4 5 6 7 8 9

• 1 based coordinates
  A T G G G T A G A
  1 2 3 4 5 6 7 8 9
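The difference between the two conventions shows up as soon as a 1-based position from a sequence record is used with Perl's substr, which counts from 0. The following is a small sketch of the conversion; the helper name subseq_1based is hypothetical, and the sequence is the one from the 1-based row above.

```perl
use strict;
use warnings;

# Hypothetical helper: extract bases using 1-based, inclusive coordinates
# (the convention biologists usually use) from a Perl string, which is 0-based.
sub subseq_1based {
    my ($seq, $start, $end) = @_;
    return substr($seq, $start - 1, $end - $start + 1);
}

my $dna = 'ATGGGTAGA';
print substr($dna, 0, 3), "\n";          # ATG: first three bases by 0-based offset
print subseq_1based($dna, 1, 3), "\n";   # ATG: the same three bases, 1-based
print subseq_1based($dna, 7, 9), "\n";   # AGA: the last three bases
```

Keeping the conversion in one small routine avoids scattering "minus one" adjustments through a program, which is where off-by-one errors tend to creep in.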

Arrays as Lists

• Lists are sets of items
• Can be mixed types of scalars (numbers, strings, floats)
• Perl uses lists extensively
• Variables are prefixed by @

List operations

• reverse              # reverse list order
• $list[$n]            # get the item at index $n (counting from 0)
• $two = $list[2];     # get which item?

List operations

• reverse              # reverse list order
• $list[$n]            # get the item at index $n (counting from 0)
• $three = $list[2];   # get the third item

List operations

• scalar                   # get length of array
• $len = scalar @list;
• $last_index = $#list;
• delete $list[10];        # delete entry

Autovivification

• Autovivify: to bring oneself to life.
• Automatically allocates space for an array element:
  $array[0] = 'apple';
  $array[4] = 'elephant';
  $array[25] = 'zebra';
  delete $array[25];
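To see what the automatic allocation does to the array, this sketch assigns the same three elements and then inspects the result. The helper count_defined is an assumption added for the example.

```perl
use strict;
use warnings;

my @array;
$array[0]  = 'apple';
$array[4]  = 'elephant';
$array[25] = 'zebra';

print scalar(@array), "\n";    # 26: indices 0..25 now exist

my $state = defined($array[1]) ? 'set' : 'undef';
print "$state\n";              # undef: the gaps hold no value

# Hypothetical helper: count only the slots that actually hold a value.
sub count_defined {
    my @items = @_;
    return scalar grep { defined $_ } @items;
}
print count_defined(@array), "\n";   # 3
```

The length of the array therefore says nothing about how many elements were really stored, which is worth remembering before looping over a sparsely filled array.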

pop,push,shift,unshift

# remove last item
$last = pop @list;

# remove first item
$first = shift @list;

# add to end of list
push @list, $last;

# add to beginning of list
unshift @list, $first;

splicing an array

splice ARRAY,OFFSET,LENGTH,LIST
splice ARRAY,OFFSET,LENGTH
splice ARRAY,OFFSET
splice ARRAY

splicing an array

@list = ('alice', 'chad', 'rod');
($x, $y) = splice(@list, 1, 2);
splice(@list, 1, 0, ('marvin', 'alex'));
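Tracing the slide's two splice calls step by step makes the OFFSET and LENGTH arguments concrete. The sketch below only adds print statements and variable declarations to the code above.

```perl
use strict;
use warnings;

my @list = ('alice', 'chad', 'rod');

# Remove 2 items starting at offset 1; splice returns what it removed.
my ($x, $y) = splice(@list, 1, 2);
print "removed: $x $y; left: @list\n";   # removed: chad rod; left: alice

# LENGTH 0 removes nothing, so the new items are inserted at offset 1.
splice(@list, 1, 0, ('marvin', 'alex'));
print "after insert: @list\n";           # after insert: alice marvin alex
```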

Sorting with sort

@list = ('tree', 'frog', 'log');
@sorted = sort @list;

# reverse order
@sorted = sort { $b cmp $a } @list;

Sorting with arrays of numbers

@list = (25,21,12,17,9,8);

# sort based on numerics
@sorted = sort { $a <=> $b } @list;

# reverse order of sort
@revsorted = sort { $b <=> $a } @list;
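The numeric comparison matters because a plain sort compares its items as strings. This short sketch runs both on the slide's list to show the difference.

```perl
use strict;
use warnings;

my @list = (25, 21, 12, 17, 9, 8);

my @as_strings = sort @list;                 # default sort compares as strings
my @as_numbers = sort { $a <=> $b } @list;   # <=> compares as numbers

print "@as_strings\n";   # 12 17 21 25 8 9
print "@as_numbers\n";   # 8 9 12 17 21 25
```

In the string ordering "8" and "9" land after "25" because the character "2" sorts before "8", which is rarely what is wanted for positions or scores.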

LAB: files

% pico files2arrays.pl

#!/usr/bin/perl -w
#
# Reading protein sequence data file.

# File containing the sequence data
my $fastafilename = 'contig2_MT.fa';

# First we have to "open" the file
open(FASTAFILE, $fastafilename);

# Read the FASTA from the file, and store it
# into the array variable @fasta
@fasta = <FASTAFILE>;

# Print the protein onto the screen
print @fasta;

# Close the file.
close FASTAFILE;

exit;

LAB: files

% pico files2arrays.pl

#!/usr/bin/perl -w
#
# Reading protein sequence data file.

# File containing the sequence data
my $fastafilename = 'contig2_MT.fa';

# First we have to "open" the file
open(FASTAFILE, $fastafilename) || die $!;

# Read the FASTA from the file, and store it
# into the array variable @fasta
@fasta = <FASTAFILE>;

# Print the protein onto the screen
print @fasta;

# Close the file.
close FASTAFILE;

exit;

LAB: get a file in FASTA format

http://www.ncbi.nlm.nih.gov/

LAB: navigate to GenBank

LAB: search for your favorite protein

LAB: favorite protein entries, change display

LAB: change display to FASTA

LAB: we return to our program, already in progress

% pico kinase.fa
% pico files2arrays.pl

Add the name of the FASTA file you created to the program. Run the program.

#!/usr/bin/perl -w
#
# Reading protein sequence data file.

# File containing the sequence data
my $fastafilename = 'kinase.fa';

# First we have to "open" the file
open(FASTAFILE, $fastafilename) || die $!;

# Read the FASTA from the file, and store it
# into the array variable @fasta
@fasta = <FASTAFILE>;

# Print the protein onto the screen
print @fasta;

# Close the file.
close FASTAFILE;

exit;

LAB: break it.

What happens when?:

1. You added the file?
2. Did the error message go away?
3. How would you protect your user from an error like this?

Did you think that was harder than it needed to be?

LAB: a safer method

% pico files2arrays.pl
% ./files2arrays.pl

Run the program.

#!/usr/bin/perl -w
# Reading data from a file using a loop

# File containing the sequence data
my $fastafilename = 'kinase.fa';

open(FASTAFILE, $fastafilename) || die $!;

# Read file one line at a time and print
while ($protein = <FASTAFILE>) {
    print $protein;
}

close FASTAFILE;

exit;

LAB: breaking it.

Why is this safer than reading the file into an array?

#!/usr/bin/perl -w
# Reading data from a file using a loop

# File containing the sequence data
my $fastafilename = 'kinase.fa';

open(FASTAFILE, $fastafilename) || die $!;

# Read file one line at a time and print
while ($protein = <FASTAFILE>) {
    print $protein;
}

close FASTAFILE;

exit;
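One way to think about the question: a loop only ever holds the current line, so its memory use does not grow with the size of the file. The sketch below, offered as one possible illustration rather than the intended lab answer, totals the sequence length while keeping a single line in memory. The helper name sequence_length and the sample data are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: total sequence length in a FASTA filehandle,
# holding only one line in memory at a time.
sub sequence_length {
    my ($fh) = @_;
    my $total = 0;
    while (my $line = <$fh>) {
        next if $line =~ /^>/;   # skip header lines
        chomp $line;
        $total += length($line);
    }
    return $total;
}

# An in-memory "file" stands in for kinase.fa so the sketch runs anywhere.
my $fasta = ">demo\nMKVLA\nGGT\n";
open(my $in, '<', \$fasta) or die "Cannot open in-memory file: $!";
print sequence_length($in), " residues\n";   # 8 residues
close($in);
```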

A brief break


Scope (TM Procter & Gamble)

• Section or subsection of a program where a variable is valid.
• Defined by braces { }
• Use 'my' to declare variables.
• use strict;     # mandates declaration of variables
• use warnings;   # or '-w' on shebang line

Good practices

• The 'my' operator declares a variable or a list of variables to be local (private) to the enclosing block, subroutine, or file. It will also be recognized in blocks contained by that region.

• The region in which the private variable is recognized is called its scope; variables declared with 'my' are called lexically scoped variables.

• Lexical (private) variables are not recognized outside of their scope.

• A private variable of a function will not be recognized in another function called by that function. If you want that to happen, declare the variable as 'local'.

• It is recommended that you declare all of your variables with 'my'.
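The difference between 'my' and 'local' is easiest to see when one subroutine calls another. The sketch below is an illustration with made-up names ($global, show_global, with_local, with_my); note that 'local' works on package variables, declared here with 'our'.

```perl
use strict;
use warnings;

our $global = 'outer';           # package variable, visible everywhere

sub show_global { return $global; }

sub with_local {
    local $global = 'temporary'; # called subroutines see the new value
    return show_global();
}

sub with_my {
    my $private = 'hidden';      # visible only inside this block
    return show_global();        # show_global cannot see $private
}

print with_local(), "\n";    # temporary
print with_my(), "\n";       # outer
print show_global(), "\n";   # outer: the local value was undone on exit
```

Because a 'local' change silently reaches into every subroutine called from that point, 'my' remains the safer default, as the last bullet recommends.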

Someone else’s code @list = (‘aardvark’, ‘baboon’, ‘cat’, ‘dog’,’lamb’,’kangaroo’); for $animal ( @list ) if( length($animal) print “$animal is } else { print “$animal is } }

{ 3, ‘cherry’ =>30, ‘lemon’ => 2, ‘peach’ => 6, ‘kiwi’ => 3);


Using hashes

• { } operator
• Set a value
  $fruithash{'cherry'} = 10;
• Access a value
  print $fruithash{'cherry'}, "\n";
• Remove an entry
  delete $fruithash{'cherry'};

Get the Keys

• 'keys' function will return a list of the hash keys
  my @keys = keys %fruithash;
  for my $key ( keys %fruithash ) {
      print "$key => $fruithash{$key}\n";
  }
• produces: 'apple', 'pear', ...
• Order of keys is NOT guaranteed!

Get just the values

• Similarly:
  # creates an array of hash values
  my @fruitcnt = values %fruithash;
  for my $itemcount ( @fruitcnt ) {
      print "val is $itemcount\n";
  }

Iterate through a set

• Order is not guaranteed!
  while( my ($key,$value) = each %fruithash ) {
      print "$key => $value\n";
  }
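Hashes are a natural fit for counting things in sequence data. As a sketch of the operations above applied to DNA, the following counts each base and prints the counts in sorted key order, since the order returned by keys is not guaranteed. The helper name base_counts is an assumption; the sequence is the one used in the reverse complement example.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each base occurs in a sequence.
sub base_counts {
    my ($seq) = @_;
    my %count;
    $count{$_}++ for split //, $seq;
    return %count;
}

my %count = base_counts('ACGGGAGGACGGGAAAATTACTACGGCATTAGC');

# Sorting the keys gives a stable order; keys alone does not.
for my $base (sort keys %count) {
    print "$base => $count{$base}\n";
}
```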

References

• Are “pointers” to the data object instead of object itself.

• A shorthand to refer to a variable and pass it around.

• Must “dereference” whatever is pointed at to get its actual value, the “reference” is just a location in memory.


Reference Operators

• \ in front gets its memory location
  my $ptr = \@vals;

• Pointers can be assigned directly: [ ] for arrays, { } for hashes
  my $ptr = [ ('owlmonkey', 'lemur') ];
  my $hashptr = { 'cdrom' => 'III',
                  'start' => 23 };

Dereferencing

• Need to cast reference back to datatype:
  my @list = @$ptr;
  my %hash = %$hashref;

• Can also use '{ }' to clarify
  my @list = @{$ptr};
  my %hash = %{$hashref};

Really not so hard...

my @list = ('fugu', 'human', 'worm', 'fly');
my $list_ref = \@list;
my $list_ref_copy = [@list];
for my $item ( @$list_ref ) {
    print "$item\n";
}
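The two references on this slide behave differently: \@list points at the original array, while [@list] points at a fresh copy. The sketch below extends the slide's code with one push to show the distinction.

```perl
use strict;
use warnings;

my @list          = ('fugu', 'human', 'worm', 'fly');
my $list_ref      = \@list;     # points at @list itself
my $list_ref_copy = [@list];    # points at a new, separate copy

push @$list_ref, 'mouse';       # changes @list through the reference

print scalar(@list), "\n";            # 5: @list was modified
print scalar(@$list_ref_copy), "\n";  # 4: the copy is unaffected
print "$list_ref->[0]\n";             # fugu: arrow syntax for one element
```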

Why use references?

• Simplify argument passing to subroutines
• Allows updating data without making multiple copies.
• What if we wanted to pass in 2 arrays to a subroutine?
  sub func {
      my (@v1,@v2) = @_;
  }
• How do we know when one stops and another starts?

Why use references?

• Passing in two arrays to intermix.
  sub func {
      my ($v1,$v2) = @_;
      my @mixed;
      while( @$v1 || @$v2 ) {
          push @mixed, shift @$v1 if @$v1;
          push @mixed, shift @$v2 if @$v2;
      }
      return \@mixed;
  }
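A usage sketch for the routine above, renamed intermix here for readability (the name and the sample arrays are assumptions). It also shows a side effect worth knowing: because the routine shifts items off the arrays it was handed by reference, the caller's arrays end up empty.

```perl
use strict;
use warnings;

# The subroutine from the slide, under an illustrative name.
sub intermix {
    my ($v1, $v2) = @_;
    my @mixed;
    while (@$v1 || @$v2) {
        push @mixed, shift @$v1 if @$v1;
        push @mixed, shift @$v2 if @$v2;
    }
    return \@mixed;
}

my @odd  = (1, 3, 5);
my @even = (2, 4);
my $mixed = intermix(\@odd, \@even);

print "@$mixed\n";          # 1 2 3 4 5
print scalar(@odd), "\n";   # 0: shift emptied the caller's array
```

If the originals are still needed afterwards, passing copies such as [@odd] and [@even] would leave them intact.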

References also allow Arrays of Arrays

my @lst;
push @lst, ['milk', 'butter', 'cheese'];
push @lst, ['wine', 'sherry', 'port'];
push @lst, ['bread', 'bagels', 'croissants'];

# parentheses build a list of three row references;
# outer square brackets would store a single reference instead
my @matrix = ( [1, 0, 0],
               [0, 1, 0],
               [0, 0, 1] );
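Once an array of arrays exists, individual cells are read with two indices. This is a minimal sketch of element access on the identity matrix; the helper name trace is hypothetical.

```perl
use strict;
use warnings;

my @matrix = ( [1, 0, 0],
               [0, 1, 0],
               [0, 0, 1] );

# Hypothetical helper: sum the diagonal of a square array of arrays.
sub trace {
    my @rows = @_;
    my $sum = 0;
    $sum += $rows[$_][$_] for 0 .. $#rows;
    return $sum;
}

print scalar(@matrix), "\n";   # 3 rows
print $matrix[1][1], "\n";     # 1: row 1, column 1 (both zero-based)
print trace(@matrix), "\n";    # 3
```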

Hashes of arrays

$hash{'dogs'} = ['beagle', 'shepherd', 'lab'];
$hash{'cats'} = ['calico', 'tabby', 'siamese'];
$hash{'fish'} = ['gold', 'beta', 'tuna'];
for my $key ( keys %hash ) {
    print "$key => ", join("\t", @{$hash{$key}}), "\n";
}

Subroutines

• Set of code that can be reused.
• Can also be referred to as procedures and functions.
• Often the result of refactoring and refining your solution.
• Have little to do with submarines.

Defining a subroutine

• sub routine_name { }    # declaring a subroutine

• Calling the routine:
  routine_name;
  &routine_name;          # & is optional

Passing data to a subroutine

• Pass in a list of data
  &dosomething($var1,$var2);

  sub dosomething {
      my ($v1,$v2) = @_;
  }

  sub dosomethingelse {
      my $v1 = shift @_;
      my $v2 = shift;
  }

Returning data from a subroutine

• The value of the last expression evaluated in the routine is its return value.
  sub dothis {
      my $c = 10 + 20;
  }
  print dothis(), "\n";

• Better to specify the return value explicitly and/or use a condition to leave the routine early.

Subroutine returns true (1) if codon is a stop codon (standard genetic code)

sub is_stopcodon {
    my $val = shift @_;
    if( length($val) != 3 ) {
        return -1;
    } elsif( $val eq 'TAA' ||
             $val eq 'TAG' ||
             $val eq 'TGA' ) {
        return 1;
    } else {
        return 0;
    }
}
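A sketch of how is_stopcodon might be used: scanning a sequence codon by codon for the first in-frame stop. The helper first_stop and the test sequences are assumptions for illustration. Since the routine returns -1 for input of the wrong length, and -1 counts as true in Perl, the caller compares against 1 explicitly.

```perl
use strict;
use warnings;

# is_stopcodon as on the slide.
sub is_stopcodon {
    my $val = shift @_;
    if (length($val) != 3) {
        return -1;
    } elsif ($val eq 'TAA' || $val eq 'TAG' || $val eq 'TGA') {
        return 1;
    } else {
        return 0;
    }
}

# Hypothetical helper: 0-based offset of the first in-frame stop codon, or -1.
sub first_stop {
    my ($dna) = @_;
    for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
        return $i if is_stopcodon(substr($dna, $i, 3)) == 1;
    }
    return -1;
}

print first_stop('ATGGGTTAGAAA'), "\n";   # 6
print first_stop('ATGGGT'), "\n";         # -1
```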

LAB: subroutines

% pico subroutine.pl

#!/usr/bin/perl -w
# A program with a subroutine to append AAAAT to DNA

# The original DNA
$dna = 'CGACGTCTTCTCAGGCGA';

# The call to the subroutine "addPOLYA".
# argument passed in is $dna; result is $longer_dna
$longer_dna = addPOLYA($dna);

print "I added AAAAT to $dna and got $longer_dna\n\n";

# Here is the definition for subroutine "addPOLYA"
sub addPOLYA {
    my($dna) = @_;
    $dna .= 'AAAAT';
    return $dna;
}

exit;

LAB: break it.

Can you?:

1. Create better variable names?
2. Find a potential problem with subroutines and variable scope?
3. Get it to work with GLOBAL variables?
4. Explain why this might be a problem?

LAB: add to it.

Can you?:

1. Find another way to concatenate the strings?
2. Add a subroutine that provides a reverse transcription service?
3. Test for a poly-A tail before adding a poly-A tail and add one only if it isn't already there?
4. Create a file of FASTA entries and run them through your program?

Funny operators my @bases = qw(C A G T); my $msg =