All data will be in /net/share/bioperl_problem_data
Use http://doc.bioperl.org/ or http://search.cpan.org/~birney/bioperl-1.2.3/ for online documentation.
Write a script to filter sequences by length (or some other criteria) using the Bioperl module Bio::SeqIO to read in sequences.
For example, print out the names of all the sequences which are 100aa long or less.
Try this on the sequences in bioperl_problem_data/yeast
How about all the sequences in SwissProt which have a keyword such as 'Malaria'? Use bioperl_problem_data/swissprot/sprot42.sp
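The length filter described above can be sketched as follows. This is a minimal sketch, assuming FASTA input and a default cutoff of 100 residues; the helper name short_ids and the command-line arguments are illustrative and not part of the problem set. Bio::SeqIO is loaded only when a file name is given.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect the display IDs of all sequences from a
# Bio::SeqIO-style stream that are $max_len residues long or shorter.
sub short_ids {
    my ($seqio, $max_len) = @_;
    my @ids;
    while (my $seq = $seqio->next_seq) {
        push @ids, $seq->display_id if $seq->length <= $max_len;
    }
    return @ids;
}

# Assumed usage: perl filter_by_length.pl sequences.fasta [max_length]
if (@ARGV) {
    require Bio::SeqIO;
    my ($file, $max_len) = @ARGV;
    $max_len = 100 unless defined $max_len;
    my $in = Bio::SeqIO->new(-file => $file, -format => 'fasta');
    print "$_\n" for short_ids($in, $max_len);
}
```

The same loop can test some other criterion (for example a pattern match on the description) by changing the condition inside short_ids.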
Read in one of the sequence files:
BOSS_DROME.sp CSAMYLOID.emb U21879.gbk
located in the directory /home/bioperl/data/problem_sets
Get the annotations for the sequences. To get the annotation object for the sequence object, use the 'annotation' method: my $ann = $seq->annotation;
This is a Bio::Annotation::Collection object.
Print out information from the annotations.
For example, the DBlinks for the swissprot entry provide cross-references to nucleotide sequences. See if you can print them out and use other modules (Bio::DB::EMBL or Bio::DB::GenBank) to retrieve these sequences.
DBlinks are stored in the annotation with the 'dblink' key. Print out the dbname and primary_id fields from a DBlink.
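A minimal sketch of printing the DBlinks, assuming a SwissProt-format input file such as BOSS_DROME.sp. The helper name dblink_pairs is illustrative; the database name is read with the 'database' method of the DBLink object.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a [database, primary_id] pair for every
# DBLink stored under the 'dblink' key of the sequence's annotation.
sub dblink_pairs {
    my ($seq) = @_;
    return map { [ $_->database, $_->primary_id ] }
           $seq->annotation->get_Annotations('dblink');
}

# Assumed usage: perl print_dblinks.pl BOSS_DROME.sp
if (@ARGV) {
    require Bio::SeqIO;
    my $in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'swiss');
    while (my $seq = $in->next_seq) {
        print $seq->display_id, "\n";
        printf "  %s\t%s\n", @$_ for dblink_pairs($seq);
    }
}
```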
References are stored with the 'reference' key. Print out the authors, paper title, and journal name (the 'location' field) from the reference objects.
Use the Bio::Index::Fasta module to index the FASTA protein file from the Yeast genome, yeast/orf_trans.fasta.
Then try doing lookups for specific sequence accessions: YAL032C YML062C YOL122C
[Bonus] Additionally, use the id_parser method to reset the indexing scheme so you can index on the gene name or SGDID (the second and third items in the FASTA description line).
>YML062C MFT1 SGDID:S0004527
(see the Bio::Index::Fasta module and the FAQ on the bioperl website for more information)
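A minimal sketch of a custom id_parser, assuming header lines shaped like the example above. The parser names gene_name_id and sgdid_id, the index file name, and the command-line arguments are illustrative. The parser receives the full header line, including the leading '>'.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical parser: index on the gene name, the second item of a
# header such as ">YML062C MFT1 SGDID:S0004527".
sub gene_name_id {
    my ($header) = @_;
    return unless $header =~ /^>\s*\S+\s+(\S+)/;
    return $1;
}

# Hypothetical parser: index on the SGDID, the third item of the header.
sub sgdid_id {
    my ($header) = @_;
    return unless $header =~ /SGDID:(\w+)/;
    return $1;
}

# Assumed usage: perl index_by_gene.pl yeast.idx yeast/orf_trans.fasta MFT1
if (@ARGV) {
    require Bio::Index::Fasta;
    my ($index_file, $fasta_file, @ids) = @ARGV;
    my $inx = Bio::Index::Fasta->new(-filename   => $index_file,
                                     -write_flag => 1);
    $inx->id_parser(\&gene_name_id);   # swap in \&sgdid_id to index on SGDID
    $inx->make_index($fasta_file);
    for my $id (@ids) {
        my $seq = $inx->fetch($id);
        if ($seq) {
            print $seq->display_id, "\t", $seq->length, "\n";
        } else {
            print "$id not found\n";
        }
    }
}
```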
[Bonus 2]
Try the Bio::DB::Flat module and index the same sequence files OR try indexing the swissprot format sequences either with Bio::DB::Flat or Bio::Index::Swissprot in swissprot/sprot42.sp (this is large and may take a while).
Write a script to parse a blast report (cel_vs_dmel.BLASTP) in the bioperl_problem_data/reports directory.
Write this script so it prints out the name and length of the query sequence and the name of the database we are searching against. For each hit, print out the name, description, and length of the sequence. For each of the HSPs, print out the evalue and the start and end coordinates of the pairwise alignment in both the query and hit sequences.
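A minimal sketch of such a report parser using Bio::SearchIO. The helper name hsp_line and the output layout are illustrative choices, not a required format.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: format the evalue and the query/hit coordinates
# of one HSP as a single line of text.
sub hsp_line {
    my ($hsp) = @_;
    return sprintf "    evalue=%s query=%d..%d hit=%d..%d",
        $hsp->evalue,
        $hsp->start('query'), $hsp->end('query'),
        $hsp->start('hit'),   $hsp->end('hit');
}

# Assumed usage: perl parse_blast.pl cel_vs_dmel.BLASTP
if (@ARGV) {
    require Bio::SearchIO;
    my $in = Bio::SearchIO->new(-file => $ARGV[0], -format => 'blast');
    while (my $result = $in->next_result) {
        printf "Query: %s (length %s)  Database: %s\n",
            $result->query_name, $result->query_length,
            $result->database_name;
        while (my $hit = $result->next_hit) {
            my $desc = $hit->description;
            printf "  Hit: %s  %s  (length %s)\n",
                $hit->name, defined $desc ? $desc : '', $hit->length;
            while (my $hsp = $hit->next_hsp) {
                print hsp_line($hsp), "\n";
            }
        }
    }
}
```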
[Bonus]
Given a set of ESTs BLASTed to the genome (reports/dmelest_vs_cdna.BLASTN), identify the number of mismatches, and whether they are internal or on the edge of the alignment. Produce a table with each EST, the number of mismatches it had, and their locations.
Parse the HMMER report (a sequence searched against Pfam, the Hidden-Markov-Model database of domains) and print out the query name and all the hits with an evalue better than 1.
One of the queries has 7 copies of the WD40 domain. Write a second script (or incorporate this into your 1st one) which can identify when a query sequence has N (a variable you'll input) hits from the same domain. Have it print out the sequence name, the domain, and the number of hits the domain had.
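A minimal sketch of the repeated-domain check, assuming that Bio::SearchIO reports each Pfam domain as a hit and each copy of that domain in the query as one HSP of that hit. The helper name repeated_domains and the command-line arguments are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: for one search result, return
# [query name, domain name, copy count] for every domain that occurs at
# least $min_copies times. Assumes one HSP per domain copy.
sub repeated_domains {
    my ($result, $min_copies) = @_;
    my @rows;
    while (my $hit = $result->next_hit) {
        my @hsps = $hit->hsps;
        next if @hsps < $min_copies;
        push @rows, [ $result->query_name, $hit->name, scalar @hsps ];
    }
    return @rows;
}

# Assumed usage: perl count_domains.pl hmmer_report N
if (@ARGV) {
    require Bio::SearchIO;
    my ($file, $n) = @ARGV;
    $n = 1 unless defined $n;
    my $in = Bio::SearchIO->new(-file => $file, -format => 'hmmer');
    while (my $result = $in->next_result) {
        print join("\t", @$_), "\n" for repeated_domains($result, $n);
    }
}
```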
Use the script in bioperl/scripts/biographics/render_sequence.pl to render a sequence as a graphic.
This is in your downloaded Bioperl.
Look some more at the latest set of scripts that are distributed with Bioperl in /home/bioperl/pkg/bioperl-live/scripts.
These should be some easy-to-use scripts for common bioinformatics needs. Try utilities/search2gff.PLS, which will convert BLAST, HMMER, and FASTA output to GFF for you.
Try bp_sreformat.PLS to convert multiple sequence alignment files from clustalw to nexus or phylip. (Hint: read the documentation by running perl bp_sreformat.PLS -h)