STAP “phylogenetic tree-based Ss-rrnaTaxonomy and Alignment Pipeline”

The comparative analysis of ss-rRNA sequences is one of the most powerful approaches for studying phylogenetic relationships among all organisms. Our ss-rRNA Taxonomy Assigning Pipeline (STAP) combines publicly available packages such as, PHYML, BLASTN, and CLUSTALW with our own automatic alignment masking and tree parsing programs. STAP makes automatic taxonomic assignments for ss-rRNAs based on neighbor-joining or maximum likelihood phylogenetic trees rather than on the top BLASTN hits, and thus its results are more reliable and accurate. Most importantly, the automation yields results comparable to those achievable by manual analysis, yet offers speed and capacity that are unattainable by manual efforts.

First, ss-rRNA sequences obtained either by PCR of environmental samples or by metagenomic shotgun sequencing are searched against our ss-rRNA database by BLASTN to select a related data set. STAP then automatically generates, masks, and trims the multiple sequence alignments. Next, it builds a phylogenetic tree by either the maximum likelihood or neighbor-joining method. Automated analysis of the tree yields taxonomic assignments for each query sequence.

A paper describing STAP has been published (7/2/08) in PLoS One.

These are the requirements to run the STAP pipeline:

  <1> Linux OS (kernel version 2.6 or later)
  <2> Perl (version 5.0 or later)
  <3> NCBI blast package (download ncbi.tar.gz from ftp://ftp.ncbi.nih.gov/blast/executables/LATEST/)
  <4> PHYML  (more information from  http://atgc.lirmm.fr/phyml/scripts/binaries.php) [Note: You might need to change the permissions of this file, using "chmod 577"to execute it.]
  <5> CLUSTALW (download from www.ebi.ac.uk/Tools/clustalw)
  <6> Perl Module: BioPerl 1.4 or later (more information from http://www.bioperl.org/wiki/Main_Page)
  <7> Perl Module: XML::DOM (more information from http://search.cpan.org/~tjmather/XML-DOM-1.44/lib/XML/DOM.pm)
      Note:  If you are using MacOS, you might find it is easiest to install both Bioperl and XML-DOM using fink: http://www.finkproject.org/download/index.php.
      For more information, see: http://www.bioperl.org/wiki/Getting_BioPerl#MacOS_X_using_fink.  Both Bioperl and XML-DOM will be installed if you use this.
      The one possible problem is that fink puts the bioperl libraries in /sw/lib, so you might need to add this to your path (e.g., export PERL5LIB=$PERL5LIB:/sw/lib/perl5/5.8.6)

Installation:

  <1> download database db.tar.gz into the desirable directory
      tar -xzvf db.tar.gz
      cd db_dir
      Write down the full path of the database directory db_dir
  <2> Download rRNA_scripts.tar.gz into the desirable directory
      tar -xzvf rRNA_scripts.tar.gz
      cd rRNA_pipeline_scripts
  <3> Launch the perl setup program.
      perl setup.pl
      Answer the questions when prompt to set the pipline options.
  <4> To run STAP, run this script:
      rRNA_pipeline_for_one.pl
      This script can be located anywhere you want, but the directory
      rRNA_pipeline_scripts and its content should be kept intact.

Run the Program

<1> Prepare the input sequence file in fasta format, one sequence per input file

Example: file test.seq
  >accession
  TTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGAAACGAGAAGTAGCTTGCTACTTCGGCGT
  CGAGCGGCGGACGGGTGAGTAATGCATAGGAAGTTGCCCAGTAGAGGGGGATAACCATTGGAAACGATGG
  CTAATACCGCATAACCTCTTCGGAGCAAAGCGGGGGACCTTCGGGCCTCGCGCTACTGGATACGCCTATG
  TGGGATTAGCTAGTTGGTGAGGTAATGGCTCACCAAGGCGACGATCCCTAGCTGGTCTGAGAGGATGATC
  AGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGG
  CGAAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGTTGTGA
  GGAAGGTTTCGTAGTTAATAACTGCGTTGCTTGACGTTAGCAACAGAAGAAGCACCGGCTAACTCCGTGC
  CAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCATGCAGGTGG
  TTTGTTAAGTCAGATGTGAAAGCCCGGGGCTCAACCTCGGAAGGTCATTTGAAACTGGCAAACTAGAGTA
  CTGTAGAGGGGGGTAGAATTTCAGGTGTAGCGGTGAAATGCGTAGAGATCTGAAGGAATACCAGTGGCGA
  AGGCGGCCCCCTGGACAGATACTGACACTCAGATGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCT
  GGTAGTCCACGCCGTAAACGATGTCTACTTGGAGGTTGTGGCCTTGAGCCGTGGCTTTCGGAGCTAACGC
  GTTAAGTAGACCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAATGAATTGACGGGGGCCCGCACAA
  GCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTACTCTTGACATCCAGAGAACTT
  TCCAGAGATGGATTGGTGCCTTCGGGAACTCTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGT
  GAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTGTTTGCCAGCGAGTAATGTCGGGAAC
  TCCAGGGAGACTGCCGGTGATAAACCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGAG
  TAGGGCTACACACGTGCTACAATGGCGCATACAGAGGGCAGCAAGCTAGCGATAGTGAGCGAATCCCAAA
  AAGTGCGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGA
  TCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGCTGC
  AAAAGAAGTAGGTAGTTTAACCTTCGGGAGGACGCTTACCACTTTGTGGTTCATGACTGGGGTGAAGTCG
  TAACAAGGTAGCCCTAGGGGAACC

<2>Run the Program:

rRNA_pipeline_for_one.pl -i [input]

where [input] is replaced by the name of the input sequence file.

Example: rRNA_pipeline_for_one.pl -i test.seq

More parameters:

-i 	(required) Input format (one sequence per file)

-n 	(optional) The number of unclassified database ss-rRNA nearest-neighbors to
                  ignore when identifying the nearest-neighbor in the first tree
                  Allowed entries:		0-20
                  Default:			5
-t 	(optional) When building the second tree, the number of taxonomic levels to
                  step up from the taxonomic assignment made in the first tree.
                  Default:	1
-o 	(optional) Output directory
                  Default:	the current directory
-d 	(optional) Domain of the input sequence
                  Allowed entries:	P (prokaryote)
                                       E (eukaryote)
                  If left blank, the program will assign the domain automatically.

Read the results

The following files will be generated by the STAP program:

  accession.seq:                   sequence file
  accession.mltree1:               the first tree
  accession.mltree1.with_taxonomy: The first tree with taxonomy infomation attached
  accession.mltree1.tab:           the parsing output of the first tree (based upon mid-point rooting)
  accession.mltree2:               the second tree
  accession.mltree2.with_taxonomy: the second tree with taxonomy infomation attached
  accession.mltree1.tab:           the parsing output of the scond tree (rooted by an outgroup defined by the first tree)
  accession.results:               the summary of the taxonomy assgnments

Format of the tree parsing results:

Each line is tab-delimited with the information illustrated in the following figure: The output file is a tab-delimited text file, i.e one record per line, the data fields separated by tabs. The contents of each column is explained in the following list and figure:

  Column 1: Accession of the input sequences
  Column 2: Accession of a database sequence
  Column 3: Distance between the input and the database entry
            (highlighted by the blue line)
  Column 4: Distance between the nodes that the input and the database entry are attached to
            (highlighted by the orange line)
  Column 6: The branching position of the database sequence (marked by the red circle)
            along the path that connects the input and the outgroup
            (highlighted by the red line)
  Cloumn 7: The taxonomy annotation of the database entry

Image:Treetab.jpg
The parsing results are sorted by column 6, 5, 4 and 3 to identify the nearest neighbor of the input sequence.
Format of the final result summary file:

The results summary file is also a tab-delimited text file. The contents of each column is illustrated in the following list and figure:

Column 1:  Accession of the input sequence
Column 2:  The top Blast hit and its taxonomy
Column 3:  The nearest neighbor in the first tree and its taxonomy
Column 4:  The nearest neighbor in the second tree and its taxonomy.
           This taxonomy is the final taxonomy assignment to the query.
Column 5:  The domain of the query sequence if automatic domain identification
           process was triggered.
           Domain is determined by a representative neighbor joining tree
           illustrated by the following figure. The branching position of
           the query related to the domain braching point (indicated by the
           red triangle) in terms of number of nodes (indicated by the rec
           circles) is also reported in this column

Image:Domaintree.jpg

Increasing the speed of the analysis

Two approaches can be used to decrease the analysis time.

<1> If the user knows the domain of a query sequence, the -d option should always be defined to skip the timeconsuming domain-identification step.

<2> A linux cluster is required for processing massive datasets. The STAP program can be run in parallel on nodes of a linux cluster (seehttp://www.rocklinux.org/wiki/Main_Page for more details).

An extra program to align large ss-rRNA sequences to each other

In the package, we include a script to align large ss-rRNA sequences against each other. The script is align_to_profile.pl. It takes a ss-rRNA multifasta format and builds alignments of all the sequences in the input file using the clustalw profile aligning method.

   Usage: align_to_rRNA_profile.pl -i [input] -d [domain] -o [output]
          Input parameters:
          -i: sequence file in multifasta format
          -d: domain (A for archaeal, B for bacteria, E for eukaryotes, P for prokaryotes,
              X for all domain; Default X)
          -o: output file (in aligned multifasta format)

For large amount of sequences, it is recommended to break the multifasta file into smaller sections and run align_to_profile.pl on a linux cluster. As long as the same domains are defined for all the processes, the output files of align_to_profile.pl can be concatenated into a large aligned multifasta file in which all the sequences are aligned to each other. In the package, we include the script trim_all_gap.pl to remove the all gapped columns of the concatenated multiple alignments.

   Usage: trim_all_gap.pl  -i [input] -o [output]

5 thoughts on “STAP “phylogenetic tree-based Ss-rrnaTaxonomy and Alignment Pipeline””

  1. Dear Dr. Eisen,
    I have been unsuccesful in downloading the STAP software, it appears that the bobcat server is offline or has been removed from the DNS. Is there any other way to get the software and database?

    thanks a lot,
    Diego

    Like

  2. I just installed STAP on our server however it requires old executables (blastall and formatdb) that are no longer being released by NCBI.

    A workaround is to change program checks and actual commands by adding “legacy_blast” before the program names.
    This allows to not have to change any of the parameters (which have also changed names in the more recent versions).

    Also, doing this on a Mac, the indexing of the sequences filled up our hard drive. Not sure what happened but it behaved normally on a Linux machine.

    Like

Leave a comment