BIOL5858 Assignment 7

Procedure: How to retrieve protein subcellular locations

Predicting protein subcellular locations is an important and interesting research topic in bioinformatics. When you design or evaluate a prediction tool, you need to collect known protein subcellular locations, which usually come from wet-lab experiments reported in published research articles. The UniProt Swiss-Prot dataset contains subcellular locations that curators have collected from published, experimentally determined information, so it is a good source to use. I have used these data in my research to evaluate prediction accuracy when developing prediction methods for constructing protein subcellular location (subloc) databases. Here are the related publications if you would like to learn more:

http://bioinformatics.ysu.edu/publication/secretome_methods.pdf

http://bioinformatics.ysu.edu/publication/MetazSecKB.pdf


Below are the steps for retrieving subloc information and sequences:

1) Download data from UniProt. Two files are needed: under "Reviewed (Swiss-Prot)", download the FASTA file and the text file.
https://www.uniprot.org/downloads

After downloading the files to your own computer, do not try to open them: they are big files and opening them may freeze your computer. If you would like to view them, please come to the lab and use the Linux computer, where you can run "less your_file" to view the content.
You can also view a small part of the data in the sample file: uniprot_sprot_mammals.dat

2) Use "retriSubloc.pl" to retrieve the subcellular locations of proteins from one kingdom of species (you need to change the code to choose a kingdom; the default is the plant kingdom).
$ perl retriSubloc.pl input output
input: the text file downloaded from UniProt;
output: the subloc data

3) Use the output file from step 2 as the input file and run 'separate_sub_locations.pl input' to separate the subloc data by subcellular location, with each location in its own file (several files will be produced).

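4) To retrieve protein sequences, run 'seqRetriH.pl' as "perl seqRetriH.pl fasta_file seqID_file retrieved_seq".

Here fasta_file is the FASTA file downloaded in step 1, seqID_file is one of the per-location files generated in step 3, and retrieved_seq is the output file for the retrieved sequences.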
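
To make the data flow of steps 2 to 4 easier to follow, here is a minimal Python sketch of the same idea. It is not the course Perl code and may differ from it in many details. It assumes the standard Swiss-Prot text layout (two-letter line codes such as AC, OC and CC, with "//" ending each entry) and UniProt FASTA headers of the form ">sp|ACCESSION|NAME ...". The function names, the toy records and the rule of keeping only the first term of each location sentence are all assumptions made for illustration.

```python
import re
from collections import defaultdict

def extract_locations(blocks):
    """Reduce SUBCELLULAR LOCATION comment text to a list of top-level location names."""
    names = []
    for block in blocks:
        block = re.sub(r"\{[^}]*\}", "", block)        # drop evidence tags such as {ECO:...}
        block = re.sub(r"^\[[^\]]*\]:\s*", "", block)  # drop an isoform prefix such as "[Isoform 1]:"
        block = block.split("Note=")[0]                # ignore free-text notes
        for sentence in block.split("."):
            name = sentence.split(";")[0].strip()      # keep the location, not the topology
            if name and name not in names:
                names.append(name)
    return names

def parse_swissprot(lines):
    """Yield (accession, lineage, locations) for each entry of a Swiss-Prot text file."""
    acc, lineage, blocks, in_subloc = None, [], [], False
    for line in lines:
        code, text = line[:2], line[5:].rstrip()
        if code == "AC" and acc is None:
            acc = text.split(";")[0].strip()           # primary accession comes first
        elif code == "OC":
            lineage += [t.strip() for t in text.rstrip(".").split(";") if t.strip()]
        elif code == "CC":
            if text.startswith("-!-"):                 # a new comment topic starts
                in_subloc = text.startswith("-!- SUBCELLULAR LOCATION:")
                if in_subloc:
                    blocks.append(text.split(":", 1)[1].strip())
            elif in_subloc and text.startswith(" "):   # indented continuation line
                blocks[-1] += " " + text.strip()
            else:
                in_subloc = False
        elif code == "//":                             # end of entry
            yield acc, lineage, extract_locations(blocks)
            acc, lineage, blocks, in_subloc = None, [], [], False

def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def select_sequences(fasta_lines, wanted):
    """Yield FASTA records whose accession (2nd field of 'sp|ACC|NAME') is in wanted."""
    for header, seq in read_fasta(fasta_lines):
        fields = header.split("|")
        acc = fields[1] if len(fields) >= 3 else (header.split() or [""])[0]
        if acc in wanted:
            yield header, seq

# Toy records, invented for illustration only (not real UniProt entries).
SAMPLE_DAT = """\
ID   TOY1_PLANT              Reviewed;          10 AA.
AC   X00001;
OC   Eukaryota; Viridiplantae; Streptophyta.
CC   -!- SUBCELLULAR LOCATION: Cytoplasm {ECO:0000269|PubMed:1}. Nucleus
CC       {ECO:0000250}.
//
ID   TOY2_PLANT              Reviewed;          12 AA.
AC   X00002; X00009;
OC   Eukaryota; Viridiplantae.
CC   -!- FUNCTION: Toy entry.
CC   -!- SUBCELLULAR LOCATION: Secreted.
CC   ---------------------------------------------------------------------------
//
""".splitlines()

SAMPLE_FASTA = """\
>sp|X00001|TOY1_PLANT Toy protein 1 OS=Toy plant OX=0 PE=1 SV=1
MKTAYIAKQR
>sp|X00002|TOY2_PLANT Toy protein 2 OS=Toy plant OX=0 PE=1 SV=1
MSSLLAVVGG
AA
""".splitlines()

by_location = defaultdict(list)                 # step 3: one ID list per location
for acc, lineage, locations in parse_swissprot(SAMPLE_DAT):
    if "Viridiplantae" in lineage:              # step 2: keep one kingdom, here plants
        for loc in locations:
            by_location[loc].append(acc)

secreted = list(select_sequences(SAMPLE_FASTA, set(by_location["Secreted"])))  # step 4
print(dict(by_location))
print(secreted)
```

For the real downloads, the same functions can be given open file handles instead of the toy lists, since they read one line at a time and never load the whole file into memory.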

What to turn in?

      Run the procedure and submit one of your subloc files together with its FASTA sequence file to show that you have done it. You can choose any kingdom of species. The "protist" kingdom is a bit tricky, however, because no "protist" kingdom is listed; instead, you need to change the code to select organisms that belong to "Eukaryote" but not to "plants", "fungi", or "animals" [please check the Perl code or the text file for the correct terms].
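
      The kingdom test described above can be sketched in Python as follows. This is only an illustration, not the course Perl code. It assumes the taxon names that appear on the OC (organism classification) lines of the UniProt text file, namely Eukaryota, Viridiplantae (plants), Fungi and Metazoa (animals); please confirm them against your own download. The function names and the abbreviated lineages are made up for the example.

```python
def parse_oc(oc_text):
    """Split the joined text of the OC lines into a list of taxon names."""
    return [t.strip() for t in oc_text.rstrip(". \n").split(";") if t.strip()]

def kingdom_of(lineage):
    """Map a lineage (list of taxon names) to a coarse kingdom label."""
    taxa = set(lineage)
    if "Eukaryota" not in taxa:
        return "non-eukaryote"
    if "Viridiplantae" in taxa:
        return "plant"
    if "Fungi" in taxa:
        return "fungus"
    if "Metazoa" in taxa:
        return "animal"
    # A eukaryote that is not a plant, fungus or animal is treated as a protist here.
    return "protist"

# Abbreviated lineages, shortened for illustration.
examples = [
    "Eukaryota; Viridiplantae; Streptophyta.",
    "Eukaryota; Fungi; Dikarya.",
    "Eukaryota; Metazoa; Chordata.",
    "Eukaryota; Sar; Alveolata; Apicomplexa.",
    "Bacteria.",
]
for oc in examples:
    print(oc, "->", kingdom_of(parse_oc(oc)))
```

      The order of the checks matters only for the final fall-through: an entry is called a protist only after the three named kingdoms have been ruled out.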
       
      As the dataset is large, my Perl code is optimized with some tricks, so I hope you can run it on your own computer. However, if you cannot run the program at home, please come to the lab. If you have problems during the process, please contact me or see me in the lab.