HGT and Trait Correlations

I am using phylogenetic comparative methods to look for correlates of HGT in the ~800 sequenced genomes that I’ve built a tree for. Here are my notes.

36 Responses to HGT and Trait Correlations

  1. jennomics says:

    To do the correlations I need three things.
    1. A tree.
    2. Some traits (metadata) for each of the genomes.
    3. Some measure of HGT in each of the genomes.

    I have the first, and I (mostly) have the second. I’m going to focus on the third for a few days.

  2. jennomics says:

    SIGI-HMM has been run on all of the complete genomes. Now, I need to run it on all of the draft genomes. Of course, I have no recollection of how I did this before (which is why taking these notes is a good idea!!)

    I’m starting here:
    /home/jlmorgan/HGT/SIGI-HMM/genbank_files/draft_genomes
    Here is what the evidence suggests I did:
    1. Make a list of all of the draft genomes “draft.list”. This is a list of taxon names, e.g., Acetobacter_pomorum_DM001_uid65823
    2. use wget.pl to get all of those genomes
    #!/usr/bin/perl

    open (LIST, “draft.list”) || die;

    @list = ;
    foreach $thing (@list) {
    print $thing;
    chomp $thing;

    system (“wget –continue –timeout 10 -t 0 –retry-connrefused ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT/$thing/*scaffold.gbk.tgz“);
    }

    Now, I am doing this:

    3. extract all of those genomes.
    find . -name “*.tgz” -exec tar xzf {} \;
    4. I found the script “system.pl” which appears to do what I need:
    #!/usr/bin/perl
    open (LIST, “complete_genomes/complete.list”) || die;
    @list = ;
    foreach $thing (@list) {
    chomp $thing;
    system (“SigiHMM input=./complete_genomes/$thing output=./complete_genomes/$thing.out.gbk > ./complete_genomes/$thing.log”);
    }

    So, I’m using it for all of these freshly-extracted draft genomes…

    This takes time, so I am moving on with the SIGI-HMM output from the completed genomes.

  3. jennomics says:

    Parsing SIGI-HMM output:

    1. I’m using grep+excel to create a file that has columns with the taxID \t total number of genes(BioObjects) \t number of alien genes (PHXEmissions)
    2. Note that each of these .gbk files is a scaffold, so several may belong to a single genome. Hopefully, I will be able to find an easy way to piece them together.
    3. grep “SOURCE ” *.gbk gets the taxon names for each of the scaffolds
    (moved *.out.gbk into a subdirectory so I wouldn’t get everything twice, grepped from the original .gbk files because some were missing when I used the .out.gbk files. Not sure why, but I’m sure it’ll come out in the end)
    4. used excel to put them all together here:
    /home/jlmorgan/HGT/SIGI-HMM/genbank_files/complete_genomes/completed_counts.txt

    Tomorrow,
    1. add scaffold numbers together when they belong to the same organism
    2. figure out how to map the taxon names to the things in my tree
    3. figure out what metric to use for the HGT

  4. jennomics says:

    1. wrote script (add_scaffolds.pl) to add scaffolds and get HGT/NORMAL ratios for the genomes.
    2. add_scaffolds is looking for a file with:
    scaffold ID HGT Count Normal Count ratio species name
    3. Some scaffolds have different names for the same organism. Here is an example of something stupid that is screwing me up right now:

    Nostoc sp. PCC 7120
    Nostoc sp. PCC 7120 (Anabaena sp. PCC 7120)
    Nostoc sp. PCC 7120 (Anabaena sp. PCC7120)

    super annoying…

  5. jennomics says:

    I think a lot of the draft genome scaffolds have not been annotated. At least, the genbank files that I downloaded do not have the annotations. I will just work with what I have and go back and figure it out manually if I don’t already have all of the genomes that I need.

    So, now I need to parse the SISI-HMM output for the draft genomes, to catch them up to the complete genomes.

    1. move files into subdirectories.
    2. grep BioObjects from *.log to get total gene number
    3. grep PHXEmissions from *.log to get alien genes
    4. grep SOURCE from *.gbk to get the taxon name
    5. generate tab-delimited file (draft.counts) that looks like this:
    scaffold_ID HGT_Count Normal_Count ratio species_name
    6. run add_scaffolds.pl to collapse scaffold data
    7. double-check names for errors (easiest in excel)
    8. save error-checked file as draft.counts

    Concatenated complete.counts and draft.counts into HGT.counts here:
    /home/jlmorgan/HGT/SIGI-HMM/genbank_files

  6. jennomics says:

    Need to map HGT data to taxa in my tree. I’m guessing that the best way to do this will be via the taxon names. Well, I got 385 this way with a first pass, grepping the 841.tax_info.list file with the list of HGT.tax.list.

  7. jennomics says:

    Well, that leaves me with 464 to retrieve manually, minus the organelles. Obviously, that is not reasonable! So, I’m trying to figure out why I don’t have all of them. First, I noticed by spot-checking that they may not have the required gbk file format for SIGI-HMM. I don’t think there’s anything I can do about this. In at least one other case, the gbk file is there, but it didn’t download for some reason. I created a list of “rouge” taxa and I’m just using wget.pl to try to retrieve them again.

    That got me 117 more genomes (for some mysterious reason). I’m going to process them to make sure that all is well before I search for the other missing genomes.

    1. find . -name “*.tgz” -exec tar xzf {} \;
    2. 117 unique taxa were unzipped, so that’s good
    3. run SIGI-HMM on these 117 organisms (5554 scaffolds) with system.pl
    * this didn’t work because these gbk files only have nucleotide data. It seems that for these, what I’m looking for can be found in the contig.gbk files instead of the scaffold.gbk files. So, I’ll get all of those and see how it goes.
    4. That seems to have pulled 227 from the Bacteria_DRAFT genomes. I’ll see if SIGI-HMM works with those and then see if I get any from the completed genomes.
    5. This did not work. Most of these gbk files do not have the protein translations. Many are really small.
    6. I think I’m going to have to use only the completed genomes for this part of the analysis. So, let’s see how many genomes that will leave me with…

    7. There are a total of 1397 completed genomes with HGT predictions. There are 415 complete genomes with HGT predictions on my tree, spanning 31 phyla.

  8. jennomics says:

    I had a discussion with Aaron about 3 options: A. Use my tree and the 415 complete genomes for the analysis, B. Use all 1397 complete genomes and C. use the genomes from a paper Jonthan sent us today. Option C was quickly abandoned because that dataset is so old. Aaron and I both prefer option B even though it makes me pretty sad to not be using my Big Tree. :( (
    We also discussed the fact that because many of the complete genomes are in multiple scaffolds, sometimes representing multiple replicons, some of which may have no ribosomal proteins on them, that the genbank files need to be concatenated before analyzing with SIGI-HMM.

    So, the next step for me is to figure out how to concatenate the genbank files using bioperl. Ugh.

  9. jennomics says:

    I found a bioperl script that will do what I want with a little help. Now, I just need to make it work for me. It is here on edhar:

    /home/jlmorgan/HGT/merge_gbk.pl

    This script doesn’t work, but in a forum, the author was advised that this would fi it:

    Bio::SeqUtils->cat(@seqs);
    my $mergedseqs=$seqs[0];

  10. jennomics says:

    Jonathan feels strongly that I should use my own tree for the HGT analysis instead of building a new one. I am OK with this because I already have the metadata for mine. He is OK with using only a subset of my tree and likes the idea of only using completed genomes.

    Guillaume helped me debug the concatenation script. Yay! It looks like this now:

    !/usr/bin/perl
    #
    #

    use strict;
    use warnings;
    use Bio::SeqIO;
    use Bio::SeqUtils;
    use Getopt::Long;
    # get command-line arguments, or die with a usage statement

    my $USAGE=< \$listfile, #list of genbank files to concatenate
    ‘o|outfile=s’ => \$outfile, #concatenated genbank file
    );

    ########################################################################

    if (!$listfile) { die “$USAGE\nNo input file! (-listfile)\n”; }
    if (!$outfile) { die “$USAGE\nNo output file! (-outfile)\n”; }

    my $seqout = Bio::SeqIO->new(-file=> “>$outfile”, -format => “genbank”);
    my @seqs;

    open (LIST, “$listfile”);
    my @list = ;

    #put each genbank file on the list into an array of genbank files

    foreach my $file (@list){
    chomp $file;
    my $seqin = Bio::SeqIO->new(-file => $file, -format=>”genbank”);
    my $primaryseq = $seqin->next_seq();
    push (@seqs, $primaryseq);
    }

    # concatenate the sequences held in the array @seqs.
    #
    #This function updates the first element in the array to hold the concatenated file. The author apologizes for this awkward way of doing it.

    Bio::SeqUtils->cat(@seqs);
    my $mergedseqs=$seqs[0];

    # write the new merged sequence into the specified outfile
    $seqout->write_seq($mergedseqs);
    exit;

  11. jennomics says:

    Now that I’m back to option A, here is what i need to be focused on doing:

    1. Decide on a cutoff for a number of scaffolds a genome must be in in order to include it in my analysis. Asked Jonathan, he says 20-25.
    2. Concatenate the genbank files for all genomes with more than one.
    3. Run SIGI-HMM on all of the concatenated genbank files.
    4. Parse SIGI-HMM output for mapping.
    5. Map HGT data to my genomes.

  12. jennomics says:

    I spent the weekend making my list of organism names match up with the file names in ncbi so that I can download those genbank files. So, now I have a mapping file here on edhar:

    /home/jlmorgan/HGT/831.list

    For some reason wget.pl wasn’t working on edhar “HTTP does not accept wildcards”. So, I am downloading the genomes onto merlot (where the wget script is working) and then I’ll move them over to edhar for concatenation.

  13. jennomics says:

    Here’s what’s going on now:

    1. the completed genomes are still downloading
    2. the draft genomes have finished downloading and they are unzipped and now I need to merge the genbank files
    3. the GEBA genomes have been downloaded with multiple scaffold in a single genbank formatted file, separated by //. I’m not sure if it’s necessary, but I am going to merge them into a single record before using SIGI-HMM.

    So, tonight, I need to merge all of the genbank files.

  14. jennomics says:

    There are 513 completed genomes, 225 draft genomes, and 37 GEBA genomes. So, that’s 775 genomes. 56 missing. Not too bad. I’m just going to process these genomes and then come back and figure out why they are missing.

    Now on to merge the genbank files. The draft ones are in the gbk format and are ready to be merged.

    1. Get the unique taxon ids by grepping “db_xref=\”taxon:” from the .gbk files.
    2. Create files for which the name is the taxon id and the content is a list of the .gbk files to be merged. -> “make.merge.lists.pl
    3. done! these files are here on merlot:
    /home/jlmorgan/HGT/SIGI-HMM/genbank_files/draft_genomes/gbk_files

    4. Did the same thing for the completed genomes. Moved all of the .gbk files into a tmp directory just to make the listing process easier.

    Tomorrow:

    I need to either write the script to split apart and then merge the GEBA genomes or test with SIGI-HMM whether or not it matters that the .gbk files contain more than one scaffold. As of right now, Sigi-HMM is (of course) not working, so i couldn’t test that.

  15. jennomics says:

    Sigi-hmm wasn’t running because you have to run it from within the directory in which it is installed. It seems to be working now.

    It also doesn’t like the concatenated genbank files, so I will need to split those up and then merge them.

    1. Split up the concatenated genbank files with split_files.pl:
    #!/usr/bin/perl

    open (IN, “list”) || die “no list”;
    @list = ;

    foreach $thing (@list){
    print $thing;
    chomp $thing;
    $count=0;

    open (IN2, “$thing”) || “cannot open $thing”;

    @gbk = ;
    $line = join(//, @gbk);

    @array = split(/\/\//, $line);
    foreach $file (@array) {
    open (OUT, “>$thing.$count”);
    print OUT $file;
    close OUT;
    $count = $count+1;

    }
    }
    `perl -pi -e “s/^1//g” *.gbk.*`

    2. create the list files for merging with make (I just added this to the split_files.pl script.

    This produces empty genbank files and list files with a blank line at the end. I had to clean all of that up. If I have to do this again, I should fix the script.
    3. merge the files with run_merge_gbk.pl
    4. Did this for the GEBA genomes, now on to the others…

    **I accidentally put all of the completed genomes in the draft_genomes directory on edhar, so I will have to remember to put them back where they belong on merlot after they’ve been merged***

    Now, all of the completed and draft files are getting merged on edhar. This takes some time…

  16. jennomics says:

    1. Move all of the merged genbank files to merlot, remembering to put them back to the proper directory.
    2. Sigi-HMM is not working on any of these files. I sent an email to the guy who is developing the software to see if he can help. I may also try to do some file format conversion to see if I can fix whatever is causing the problem.
    3. So, I think there are two problems with the genbank files. One of them is that there are some lines that look like this:
    misc_feature complement(order(3338001..3338006,3338010..3338012, 3338031..3338033,3338043..3338045,3338238..3338240, 3338031..3338033,3338043..3338045,3338238..3338240, 3338244..3338246,3338256..3338261,3338313..3338315, 3338325..3338333,3338592..3338594))

    with newlines at the end of each line, and those newlines are screwing things up. This appears to be a problem with the original genbank files. I can fix this (in theory) with a simple script.

    The other “problem” is that Sigi-HMM actually WILL take a file that has several concatenated genbank records, so the whole process of merging them into a single record has been a total waste of time!

    So, tonight:

    1. Concatenate the draft genomes and test Sigi-HMM. Fix the newline problem where necessary.

  17. jennomics says:

    Sigi-HMM is running on the concatenated genbank files, but the program is running out of memory:

    java.lang.OutOfMemoryError: Java heap space”

    I got advice from the developer that I can modify the SigiHMM file, which is just a single java command line:

    java -cp .:./libs/biojava.jar:./libs/bytecode-0.92.jar:./libs/commons-cli.jar:./libs/commons-collections-2.1.jar:./libs/commons-dbcp-1.1.jar:./libs/commons-pool-1.1.jar -Xms256m -Xmx256m SigiHMM $*

    I just changed -Xmx256m to -Xmx5000m and -Xms256m to -Xms5000m

    That did not work. I got a different sort of error from the java heap space error I was getting before. Now, it crashes with “GC overhead limit exceeded”

    Then, I tried to run it with -XX:-UseGCOverheadLimit which should make the program keep running even though the GC issue is happening. But, when I did this, it crashed with the heap space error again…

    Sent an input file to the developer, so hopefully, he will be able to help me get it running!

  18. jennomics says:

    Now, I will try to get this running on the completed genomes. Maybe the issue is with the large numbers of scaffolds in the genbank file.

    1. Concatenate the genbank records with catfiles.pl

    #!/usr/bin/perl
    open (IN, “lists.list”);
    @list = ;
    foreach $thing (@list) {
    chomp $thing;
    open (LIST, “$thing.mergelist”);
    @array=();
    @files = ;
    foreach $record (@files){
    print $record;
    chomp $record;
    push (@array, “$record “);
    }
    `cat @array > $thing.cat`;
    }

    2. Move files over to merlot.
    3. Rename files for finicky sigi-hmm:
    rename “.cat” “.cat.gbk” complete_genomes/cat_gbk_files/*.cat
    4. Modify and run system.pl.

    Did not run! – there’s a parsing error with these genbank files. The same one that I saw before.

    Actually, this parsing error has nothing to do with merging or concatenating the files. The original files that i downloaded from genbank (for the completed genomes, not the draft genomes) are unreadable by Sigi-HMM or readseq, which is usually a pretty flexible reader of file formats.

    I fixed it with a little one-liner, just replacing /\,\n/ with /\,/ and the replacing /\,\s+/ with /\,/

    Now, readseq will read the file, but Sigi-HMM still will not. Argh!

    Sigi give this error:
    File-format error: org.biojava.bio.BioException: Could not read sequence
    No valid input file specified!

    I sent Aaron an email asking for his help on this…

  19. jennomics says:

    I found the problem with my genbank files after many hours of detective work. It was something small and stupid. No need to chronicle it here.

    Now, Sigi-HMM is running. The developer was kind enough to send me a modified version that was able to handle my concatenated files. (I have been dealing with a few unrelated problems)

    So, I am cautiously optimistic that I will have some Sigi-HMM output to work with soon.

  20. jennomics says:

    They are still running, but will be done soon. I have a script that calculates the % HGT “calculate_HGT.pl”. Right now, it lives here on merlot:
    /home/jlmorgan/HGT/SIGI-HMM/Colombo_v4.0_100

    It produces 5 columns: taxid, #normal genes, # HGT genes, % of genes that are HGT, # of HGTs on the plus strand (that last one was available, and I thought it could be interesting; almost exactly 50% of the HGT genes are on the plus, maybe that’s interesting, but I don’t know.)

    It looks like this:

    #!/usr/bin/perl
    open (LIST, “draft.list”) || die “cannot open list”;
    @list =;
    foreach $thing (@list) {
    chomp $thing;
    open (IN, “draft_genomes/$thing.gff”) || die “cannot open gff”;
    @lines = ;
    @PUTAL = grep(/SigiHMM PUTAL/, @lines);
    my $HGT = @PUTAL;
    @plus_strand = grep(/\+/, @PUTAL);
    my $plus = @plus_strand;
    @NORMAL = grep(/NORMAL/, @lines);
    my $norm = @NORMAL;
    $avg = ($HGT/($HGT+$norm)*100);
    print “$thing\t$norm\t$HGT\t$avg\t$plus\n”;
    }

  21. jennomics says:

    Notes:

    Moorella and Hamiltonells have two different NCBI taxon IDs. I deleted the older one from my list of genomes, but the HGT data for them is still there. Just don’t want to get confused later.

  22. jennomics says:

    Bookkeeping time! I should have 831 genomes, but for some reason (probably several), only 769 of them are accounted for at this point. I have managed to get the JGI tax ID for all of them except for 6, which are just not present in the IMG database (or, if they are, it is not clear from the name or any numbers associated with them.)

    Now, I’ve found that there are 69 genomes for which I do not have any of the metadata. So, now I need to get those. And, since I’m going through this effort, I’m just going to scrap what I have and get new metadata for all of the genomes because maybe they’ve updated the database since I last got the data from there.

    I have this handy command to generate the wget commands:

    more 769.JGI2NCBI.map | perl -ne ‘if(/^\d/){($id,$sp)=split(/\t/); chomp $sp; $output=$id.”.dat”; $url=”http://img.jgi.doe.gov/cgi-bin/geba/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=$id”; print “wget -nv –user-agent=\”Mozilla\5\.0 \(Macintosh\; U\; Intel Mac OS X 10\.5\; en-US\; rv\:1.9.1.3\) Gecko\/20090824 Firefox\/3.5.3\” -O $output \”$url\”\n”;}’ > cmd.list

    Just need to add this bit at the top and make it executable:

    #!/bin/sh
    #$ -cwd
    #$ -S /bin/bash

    Then, ./cmd.list to get all of the files.

    Now, to extract the data. The format of the traits file should be:
    traitID taxaID value

    Working on this script. The same strategy will not work for every trait, so it’s taking a little while.

    In the meantime, the new IMG metadata files have been downloaded. I requested 769 (although 6 of those are not_in_IMG) so I am missing three.

    Tomorrow/tonight:
    1. Track down the 3 missing IMG data files.
    2. Finish script to parse out all of the metadata for the traits file. It’s here on merlot:
    ~/HGT/getTrait.pl

    3. Maybe start thinking about why I have only tracked down 769 genomes instead of the 831 I was hoping for.

  23. jennomics says:

    getTrait.pl won’t work for everything. Here’s what to do for the others:
    grep -1 -m 1 “total number of bases” *.dat | grep right > Genome_Size.map
    grep -2 “DNA G+C number of bases” *.dat | grep sup > G+C.map
    grep Lineage -1 *.dat | grep TaxonList > Phylum.map
    grep Lineage -1 *.dat | grep TaxonList > Class.map

  24. jennomics says:

    grep -2 “DNA coding number of bases” ../data_files/*.dat | grep % > Percent_Coding.map

  25. jennomics says:

    In order to do the correlations, I have to have numbered values for the traits and, as far as I can tell, there is no way to do the analysis with discrete, unordered characters. So, I can only do this with data that i can code in a numerical, ordered way (binary is OK, too).

    766 Biotic_Relationships: Free living=0, Symbiotic=1
    76 Body_Product: unable to code
    220 Body_Site: unable to code
    85 Body_Subsite: unable to code
    267 Cell_Arrangement: unable to code
    660 Cell_Shape: unable to code
    807 Class: unable to code
    665 Diseases: None=0, Any=1
    289 Ecosystem: unable to code
    307 Energy_Source: unable to code
    807 G+C: continuous
    807 Genome_Size: continuous
    771 Gram_Staining: Grampos=1, Gramneg=0
    749 Habitat: unable to code
    814 HGT: continuous
    25 Host_Gender: N/A
    48 Host_Health: N/A
    282 Host_Name: N/A
    746 Isolation: unable to code
    203 Metabolism: unable to code
    595 Motility: Nonmotile=0, Motile=1
    699 Oxygen_Requirement: Anaerobe=0, Facultative=1, Aerobe=2
    807 Percent_Coding: Continuous
    109 pH: Continuous
    807 Phylum: unable to code
    776 Relevance: unable to code
    47 Salinity: too few
    807 Sequencing_Status: Draft=0, Permanent Draft=1, Finished=2
    444 Sporulation: Nonsporulating=0, Sporulating=1
    47 Symbiont_Name: N/A
    31 Symbiotic_Physical_Interaction: N/A
    28 Symbiotic_Relationship: N/A
    385 Temperature_Optimum: Continuous
    784 Temperature_Range: Psychrophile=0, Psychrotolerant=1, Mesophile=2, Thermotolerant=3, Thermophile=4, Hyperthermophile=5

  26. jennomics says:

    I ran correlations for the 13 traits for which I was able to code the data. None of them are significant. I have to figure out exactly what is reported in the phylocom output in terms of significance testing, but all of the correlation coefficients are basically 0. I do need to figure out how to report significance for these, but I think I’m either going to ask David for help or maybe use Dongying’s program which just reports an R-squared and a p-value, which makes life easy. I’ll have to reformat the data a little to do that, but I don’t feel like doing that right now, so I will come back to it once I am finished with the next thing.

  27. jennomics says:

    None of the correlations were significant. I’m OK with this negative result, but I do want to try to run some sort of positive control. I am going to look at the presence/absence of the recA and mutL genes.

    1. convert the genbank files to fasta sequence with readseq – done
    2. align the recA_ref.pep sequences from the 4th domain paper – done
    3. use hmmbuild build an hmm from the recA alignment – done
    3. use Aaron’s six frame translator to convert the nucleotide fasta sequences to amino acids for the hmmsearch -done
    4. use hmmsearch to search the genomes – working (this will take a while, so I am moving on to other things for the rest of the day)
    5. code present or absent for each genome for the correlation

    I will need to figure out some sort of cutoff for the hmmsearch, but I guess I’ll look at those results first and then decide what to use

  28. jennomics says:

    need to do group-specific correlations…

    working here: /home/jlmorgan/HGT/Metadata/by_phylum

    1. create lists of taxa for each phylum with more than 4 members (makefiles.pl)
    2. Use these lists to generate subtrees (tree_prune.pl) for use with Dongying’s code (independent_contrasts.pl)
    3. run correlations for all phyla and all traits (run_contrasts.pl)
    4. the output is summarized here: correlations.out

    5. Asked David to help with the significance testing. Once he helps me with that, don’t forget to tell Dongying what we did so that he can code it.

  29. jennomics says:

    Tomorrow:

    1. Finish the recA correlation
    2. Do significance testing
    3. Do correlations for the class-level taxonomy.

  30. jennomics says:

    I haven’t run the correlations for recA, but the avg HGT in all genomes is 4.76% and the average among those that do not have recA is 4.35%, so I don’t think it’s worth pursuing further. Two proteins involved in NHEJ (see Popa et al) are Ku and LigD. They found a significant positive correlation between the presence of both of those genes and HGT rate. So…
    1. Download LigD and Ku sequences from NCBI
    LigD: I downloaded 400 random sequences and then added 64 archaeal sequences
    KU: 400 random sequences + 31 Archaeal sequences
    2. align and build hmms
    align with fsa –fast, use hmmbuild with default parameters
    3. search genomes with hmms
    hmmsearch -E 1e-10
    4. figure out which genomes have both genes
    5. run correlations

  31. jennomics says:

    I also did not see a correlation between the presence of LigD and HGT. Two-sample t-test results:

    No LigD 674 4.704 5.1214 0.197
    Yes LigD 2 143 4.826 3.9659 0.332

    Population 1 ≠ Population 2: P-Value = 0.7886
    Population 1 Population 2: P-Value = 0.3943

    Upon reading that Popa paper a bit more carefully, they do not actually show a correlation between the presence of NHEJ and HGT rate. Instead, they show that the distance between donor-recipient pairs is greater when the recipient has NHEJ.

  32. jennomics says:

    David helped me with a regression analysis of my data and found several significant and interesting results. Because I can’t think of a way to incorporate the phylogenetic tree into the regression, I am going to provide him with clade data, which should allow us to somewhat control for phylogeny. – done

    Also, Jonathan suggested that I build trees with some of the genes that are predicted to have been transferred.

    1. get PUTAL genes from the sigi output genbank files – done
    2. figure out a way to get homologs and build trees for each of them – I’m not sure of the best way to do this. I asked Guillaume if he could help me assign them to one of the protein families. I should be able to use those alignments for the tree building.
    3. What am I going to do with these trees???? Some thoughts:
    a. compare them to my reference phylogeny. I did tree comparisons for the last paper, but I basically did them “manually” using phylip, and that will not work for a large number of comparisons, so I will have to think of another way
    b. also build trees from a comparable subsample of the “NORMAL” genes output by SIGI-HMM. Ask whether the “normal” gene trees are more similar to the reference phylogeny than the “putal” genes are

  33. jennomics says:

    It’s too much work right now to deal with these trees, so I am not going to do it. I think I’ll just take what I have and see if I can make something out of it.

  34. jennomics says:

    I’m meeting with David tomorrow to discuss the regression analysis. Here are some questions that I need to have answered:

    1. Which factors are correlated with HGT and in what direction?
    2. Do some phyla have a higher or lower rate of HGT than others?

  35. jennomics says:

    From the Smillie et al paper:

    Quartet mapping
    To test whether phylogenetic reconstruction supports our inference of HGT, we
    performed quartet mapping, in which all possible four member trees are generated and
    analyzed to simulate analysis of the larger and more computationally challenging parent
    tree. We followed a similar approach to the quartet mapping described by Daubin and
    Ochman 30. Briefly, we searched all 2,235 genomes in our analysis for homologs to each
    HGT event (defined as best reciprocal BLAST hits with > 60% nucleotide identity over >
    60% of the length of the transferred gene; see note on homology below). For HGT events
    with at least two homologs, we used MUSCLE (with default settings) to construct an
    alignment of the HGT sequences and all other non-HGT sequences. Events with fewer
    than two non-HGT homologs – 23% of the total – cannot be used to generate a quartet and
    so could not be analyzed by quartet mapping. For the quartets that remained, we used
    Tree Puzzle to analyze all possible quartet topologies among the aligned HGT and non-
    HGT sequences. With Tree Puzzle we used exact parameter estimates and gamma
    distributed rates with four rate categories. To provide phylogenetic confirmation of our
    putative HGT events, we computed the likelihood of obtaining a quartet grouping the
    HGT events together, versus the alternative, vertical model that would group sequences
    by the topology of the species phylogeny. A previously published likelihood ratio 30 was
    then used to place phylogenetic confidence in each HGT event. We used the most
    stringent confidence threshold possible, requiring a likelihood ratio of 1.0 to support
    HGT inference.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s