Frequently Asked Questions about the TargetIdentifier Server

Who are we?
TargetIdentifier Server was implemented by Dr. Xiangjia (Jack) Min, when he worked at Concordia University, Montreal, Quebec, Canada, with support from Drs. G. Butler, R. Storms and A. Tsang at Concordia University. Alex Spurmanis designed the logos and Wei Ding assisted in the development of the server interface. The original site was installed at Concordia University. This mirror site is supported by Youngstown State University.

Our motivation
Generating expressed sequence tags (ESTs) remains a primary method for gene discovery in most organisms. Identifying full-length cDNA clones with a specific function for downstream characterization in the laboratory is laborious to do by hand, which creates a strong need for automation. Our server aims to automate the identification of full-length cDNAs from EST sequences.

How does it work?
TargetIdentifier uses the reading frames predicted by a pre-run BLASTX search. BLASTX from the 'NCBI-blastall' package is run with the parameters "-v 1 -b 1 -e 1e-5" (Note: we used version 2.2.19; earlier or later versions may not work properly). The BLASTX results are then used to identify full-length EST-derived sequences according to the definitions described below. The server integrates (i) a prediction of whether an EST or cDNA sequence is full-length (i.e. whether it contains a putative translation initiation codon), (ii) a prediction of whether the open reading frame (ORF) has been completely sequenced, and (iii) functional annotation based on the BLASTX output.

Definition of each category of prediction

Full-length:
A query is considered full-length when it satisfies one of the following two criteria. (1) The sequence has a 5' stop codon followed by a start codon. (2) The sequence does not have a 5' stop codon but there is an in-frame start codon present prior to the 10th codon of the subject sequence.
Short full-length:
The sequence has an in-frame start codon that is aligned to a position between the 10th and 100th codon downstream from the start codon of the subject sequence. A number is calculated to indicate the location of the potential start codon relative to the start codon in the subject.
Possible full-length:
If the query sequence, after adding back the region removed by LUCY (a quality trimming tool), is estimated to be long enough to encode the missing amino-terminal portion of the protein, the corresponding cDNA clone is categorized as "possible full-length".
Ambiguous:
The sequence has a 5' stop codon but does not have a start codon. This type of anomaly probably arose because a sequencing error introduced a frame shift. This can occur in EST sequences generated by single-pass sequencing.
Partial:
A sequence that was not assigned to one of the above categories.
3'-sequenced partial:
If the query sequence aligns to the subject sequence in a negative frame (-1, -2, or -3) and is not categorized as a full-length or a short full-length sequence, it is classified as "3'-sequenced partial". Please note that the clone may have been sequenced from the 5' end with the insert inverted in the vector.
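
The category definitions above can be read as a decision procedure. The following Python sketch is one possible reading, not the server's actual Perl implementation. The QueryEvidence fields, the order in which the rules are tested, and the rule that three trimmed bases are needed per missing codon are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryEvidence:
    """Hypothetical summary of one query's top BLASTX hit (field names are illustrative)."""
    frame: int                      # BLASTX reading frame: 1..3 or -1..-3
    has_5prime_stop: bool           # an in-frame stop codon lies 5' of the candidate start
    start_offset: Optional[int]     # subject codon to which the query's in-frame start
                                    # codon aligns (1 = subject start); None if no start
    missing_nterm_codons: int = 0   # subject codons 5' of the alignment not covered
    lucy_trimmed_bases: int = 0     # bases removed from the sequencing end by LUCY


def classify(q: QueryEvidence) -> str:
    """Assign a category; the order of the tests is an assumption."""
    has_start = q.start_offset is not None
    if has_start:
        # Full-length: a 5' stop followed by a start codon, or no 5' stop
        # but a start codon before the 10th codon of the subject.
        if q.has_5prime_stop or q.start_offset < 10:
            return "full-length"
        # Short full-length: start aligned between the 10th and 100th codon.
        if 10 <= q.start_offset <= 100:
            return "short full-length"
    # Negative frame and neither of the two categories above.
    if q.frame < 0:
        return "3'-sequenced partial"
    # Ambiguous: a 5' stop codon but no start codon.
    if q.has_5prime_stop and not has_start:
        return "ambiguous"
    # Possible full-length: the LUCY-trimmed region could encode the missing
    # N-terminus (assumed here to need 3 bases per missing codon).
    if q.lucy_trimmed_bases > 0 and q.lucy_trimmed_bases >= 3 * q.missing_nterm_codons:
        return "possible full-length"
    return "partial"


if __name__ == "__main__":
    example = QueryEvidence(frame=1, has_5prime_stop=False, start_offset=40)
    print(classify(example))
```

Testing the negative-frame rule immediately after the two full-length categories follows the wording of the "3'-sequenced partial" definition; the relative order of the remaining categories is not stated in the definitions and is a guess.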

Input

  1. A file containing cDNA sequences (ESTs or contig sequences assembled from ESTs) in FASTA format.
  2. A BLASTX output file containing the BLASTX results for all queries in the FASTA file. Please note that a complete, non-truncated BLASTX output is needed for the program to work correctly. Users should provide their pre-run BLASTX output if they have installed the NCBI-blastall package. To minimize the size of the BLAST output file for uploading, the following parameters are suggested for the user's BLASTX run: "-I T -v 1 -b 1 -e 1e-5".
  3. Optional: the ace file generated by Phrap. Phrap is a program used for assembling ESTs. The ace file from Phrap provides the information regarding individual ESTs being assembled in a contig.
  4. Optional: the LUCY file generated by LUCY. LUCY is a program used to remove vector and adaptor sequences, as well as low quality sequences. After LUCY processing, only sequences with a certain good quality are kept. The LUCY file provides the length of any low quality sequence removed from the sequencing end of each EST sequence by LUCY.
  5. Note: The two optional files affect only the number of sequences in the "possible full-length" category. The total combined size of the data files is limited to 100 Mb. Please note that your input files must be plain text: save them with NotePad, not with WordPad or MS Word.
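
For users preparing the BLASTX output themselves, the suggested parameters can be assembled into a blastall command line. The Python sketch below only builds and prints the command. The file names and the database name "nr" are placeholders, and the function name is hypothetical. The flags -p, -d, -i and -o are the standard blastall options for program, database, query file and output file.

```python
import shlex


def build_blastx_command(fasta_path, database, output_path):
    """Return the blastall argument list with the parameters suggested above."""
    return [
        "blastall",
        "-p", "blastx",      # translated query against a protein database
        "-d", database,      # protein database (placeholder name)
        "-i", fasta_path,    # FASTA file of EST or contig sequences
        "-o", output_path,   # BLASTX output file to upload
        "-I", "T",           # show GI numbers in definition lines
        "-v", "1",           # one one-line description per query
        "-b", "1",           # one alignment per query
        "-e", "1e-5",        # E-value cutoff
    ]


if __name__ == "__main__":
    # Hypothetical file and database names, for illustration only.
    cmd = build_blastx_command("ests.fasta", "nr", "ests.blastx")
    print(shlex.join(cmd))
    # To run it, pass cmd to subprocess.run(cmd, check=True) on a machine
    # where the NCBI-blastall package is installed.
```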

Output

TargetIdentifier: The default output is in MS Excel (.xls) format. Each field (column) is tab-delimited. The fields are: (i) the name/function of the subject protein in the high-scoring segment pair (HSP) in BLASTX, (ii) query identifier, (iii) E-value, (iv) a prediction of whether the query is full-length, short full-length, possible full-length, ambiguous or partial, (v) start codon position or 'NO' if no start codon is predicted, (vi) the sequence status of the query regarding whether or not the ORF has been completely sequenced, and (vii) BLASTX output for the HSP that includes the subject definition line, length, score, E-value, identities, positives, and reading frame. A sample of TargetIdentifier output, generated using our Aspergillus niger data, is available.

Annotator: a program implemented using the TargetIdentifier algorithm. Its output is slightly different. The default output is in MS Excel (.xls) format. Each field (column) is tab-delimited. The fields are: (i) query identifier, (ii) provisional function with or without a qualifier (Table 1), (iii) a prediction of whether the query sequence is full-length, short full-length, possible full-length, ambiguous or partial, (iv) start codon position or 'NO' if no start codon is predicted, and (v) the sequence status of the query regarding whether or not the ORF has been completely sequenced. A sample of Annotator output, generated using our Aspergillus niger data, is available.

Table 1. Qualifiers added to functional annotation by Annotator to indicate different levels of similarity between a cDNA-derived query sequence and the annotated hit subject in the BLASTX HSP.
E-value        Qualifier
E <= 1e-100    No qualifier
E <= 1e-50     Homologous to
E <= 1e-30     Highly similar to
E <= 1e-10     Similar to
E <= 1e-5      Weakly similar to
E <= 0.1       Very weakly similar to
E > 0.1        Extremely weakly similar to
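
Table 1 amounts to a lookup from E-value to qualifier, with the rows tested from the most to the least stringent threshold. The Python sketch below shows that lookup. The function names are hypothetical, and placing the qualifier in front of the subject's function text is an assumption about how Annotator combines the two.

```python
# Thresholds from Table 1, ordered from most to least stringent.
QUALIFIER_THRESHOLDS = [
    (1e-100, ""),                       # no qualifier
    (1e-50, "Homologous to"),
    (1e-30, "Highly similar to"),
    (1e-10, "Similar to"),
    (1e-5, "Weakly similar to"),
    (0.1, "Very weakly similar to"),
]


def qualifier_for(e_value):
    """Return the Table 1 qualifier for a BLASTX E-value."""
    for threshold, qualifier in QUALIFIER_THRESHOLDS:
        if e_value <= threshold:
            return qualifier
    return "Extremely weakly similar to"


def provisional_function(e_value, subject_function):
    """Prefix the subject's function with its qualifier (assumed format)."""
    qualifier = qualifier_for(e_value)
    return f"{qualifier} {subject_function}" if qualifier else subject_function


if __name__ == "__main__":
    # "example protein" is a placeholder, not real annotation.
    print(provisional_function(1e-60, "example protein"))
```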

Accuracy evaluation
The accuracy of the algorithm was evaluated using the human UniGene dataset and our in-house generated A. niger EST sequences. The overall accuracy was > 90%. See details.

Security of user submitted data
The data submitted to our server will be automatically deleted after they are processed. We do not keep data submitted by a user.

How to obtain user's results
The results can be downloaded from the server web site. The output file(s) are kept for only 2 days after they are generated and are then deleted. The results are saved in Microsoft Excel (.xls) format and can be converted to "text" format, as each field (column) is tab-delimited.
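
Because the fields are tab-delimited, a downloaded result file can also be read by a script. The Python sketch below assumes the .xls file is plain tab-delimited text, as described above. The function name and the file name are hypothetical, and no assumption is made about whether the first row is a header.

```python
import csv


def read_tab_delimited(path):
    """Read a tab-delimited result file into a list of rows (lists of fields)."""
    with open(path, newline="") as handle:
        # QUOTE_NONE keeps any quote characters in BLASTX definition lines as-is.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for row in reader]


if __name__ == "__main__":
    import sys

    # Usage: python read_results.py results.xls  (hypothetical file name)
    if len(sys.argv) > 1:
        for fields in read_tab_delimited(sys.argv[1]):
            print(len(fields), "fields:", fields[:2])
```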

How to cite us
Min, X.J., Butler, G., Storms, R. and Tsang, A. TargetIdentifier: a web server for identifying full-length cDNAs from EST sequences. Nucleic Acids Res., 2005, Vol. 33, Web Server Issue, W669-W672. Our server URL (http://proteomics.ysu.edu/tools/TargetIdentifier.html) can also be used as your reference.

Standalone TargetIdentifier Software availability
The standalone version of the TargetIdentifier software is available free of charge for academic use only. It is written in Perl and is easy to run on any operating system. Please contact Dr. Min in the YSU Bioinformatics Lab.

Comments and suggestions
Please contact Dr. Min at the YSU Bioinformatics Lab.

