After lots of variations on this analysis pipeline, I've found that joining paired ends unnecessarily removes lots of sequences. Likely this is due to quality dropoff at about 250bp (fig below).
So I concatenated the forward reads from both MiSeq runs (no joining) and then clipped everything to 250 bp. The UPARSE documentation calls for this, but it is not very straightforward in their pipeline, so I used the FastX toolkit to do it.
So this script is a mix of UPARSE, QIIME, and FastX. This has resulted in keeping more sequences (especially in low yield samples) than any other method I tried.
############# commands to run interactive node on aciss cluster
qsub -q fatnodes -I
module load usearch
module load fastx_toolkit/0.0.13
module load qiime/1.8.0
printenv PATH  # check to make sure lots of qiime dependencies are loaded.
# Qiime has so many dependencies, it is difficult to load them all.
# So might have to reload module version. This makes no sense.
source /usr/local/packages/Modules/setmodule 3.2.9
source /usr/local/packages/Modules/setmodule 3.2.10
################

##### Check quality - this goes into R script Pickle2014/R/quality/
# split_libraries_fastq.py -v -q 0 -i raw1/r1readTRIM.fastq -b raw1/barcodesRenamed.fastq -o splitLib1F/ -m map.txt --barcode_type 16
# split_libraries_fastq.py -v -q 0 -i raw2/r1readTRIM.fastq -b raw2/barcodesRenamed.fastq -o splitLib2F/ -m map.txt --barcode_type 16
# split_libraries_fastq.py -v -q 0 -i raw1/r2readTRIM.fastq -b raw1/barcodesRenamed.fastq -o splitLib1R/ -m map.txt --barcode_type 16
# split_libraries_fastq.py -v -q 0 -i raw2/r2readTRIM.fastq -b raw2/barcodesRenamed.fastq -o splitLib2R/ -m map.txt --barcode_type 16

# Ended up combining forward reads and just analyzing these instead of joining.
cat raw1/r1readTRIM.fastq raw2/r1readTRIM.fastq > seqs.fastq
cat raw1/barcodesRenamed.fastq raw2/barcodesRenamed.fastq > barcodes.fastq
split_libraries_fastq.py -v -q 0 --store_demultiplexed_fastq -i seqs.fastq -b barcodes.fastq -o splitLib/ -m map.txt --barcode_type 16 # -n 300

# trim to 250 length. This is not straightforward or well documented in UPARSE, so
# farm out to fastx.
fastx_trimmer -l 250 -i splitLib/seqs.fastq -o splitLib/seqs.trimmed.fastq -Q33

# get quality stats
usearch -fastq_stats splitLib/seqs.trimmed.fastq -log splitLib/seqs.stats.log

# remove low quality reads
mkdir qF
usearch -fastq_filter splitLib/seqs.trimmed.fastq -fastq_maxee 0.5 -fastaout qF/seqs.filtered.fasta

# dereplicate sequences. Last step with files separate.
mkdir deRep
usearch -derep_fulllength qF/seqs.filtered.fasta -output deRep/seqs.filtered.derep.fasta -sizeout

# filter singletons - This removes singletons - Decided to do without
# mkdir filterSingles
# usearch -sortbysize deRep/seqs.filtered.derep.fasta -minsize 2 -output filterSingles/seqs.filtered.derep.mc2.fasta

# cluster OTUs
mkdir OTUs
usearch -cluster_otus deRep/seqs.filtered.derep.fasta -otus OTUs/seqs.filtered.derep.repset.fasta

# reference chimera check
mkdir chiCheck
usearch -uchime_ref OTUs/seqs.filtered.derep.repset.fasta -db scripts/gold.fa -strand plus -nonchimeras chiCheck/seqs.filtered.derep.repset.nochimeras.fasta

# label OTUs using python script from UPARSE
mkdir labelOTUs
python scripts/fasta_number.py chiCheck/seqs.filtered.derep.repset.nochimeras.fasta OTU_ > labelOTUs/seqs.filtered.derep.repset.nochimeras.otus.fasta

# match original quality filtered reads back to otus - this is with bash derep workaround.
mkdir matchOTUs
usearch -usearch_global qF/seqs.filtered.fasta -db labelOTUs/seqs.filtered.derep.repset.nochimeras.otus.fasta -strand plus -id 0.97 -uc matchOTUs/otu.map.uc

# make otu table
mkdir otuTable
python scripts/uc2otutab_mod.py matchOTUs/otu.map.uc > otu-table.txt

# convert to biom
biom convert --table-type="OTU table" -i otu-table.txt -o otu-table.biom

# **use QIIME 1.7, not 1.8** Dependency problem
# assign taxonomy
assign_taxonomy.py -t gg_13_5_otus/taxonomy/97_otu_taxonomy.txt -r gg_13_5_otus/rep_set/97_otus.fasta -i labelOTUs/seqs.filtered.derep.repset.nochimeras.otus.fasta -o assigned_taxonomy

# add taxonomy to BIOM table
biom add-metadata --sc-separated taxonomy --observation-header OTUID,taxonomy --observation-metadata-fp assigned_taxonomy/seqs.filtered.derep.repset.nochimeras.otus_tax_assignments.txt -i otu-table.biom -o otu_table.biom

# check sequencing depth.
# print_biom_table_summary.py -i otu_table.biom  ## for qiime <1.8
biom summarize-table -i otu_table.biom -o otu_table_summary.txt  # for qiime >=1.8
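To make the trim-then-filter logic in the script less of a black box, here is a minimal Python sketch of what the `fastx_trimmer -l 250` and `-fastq_maxee 0.5` steps do to a single read. It assumes Phred+33 quality strings (matching the `-Q33` flag) and uses the standard expected-error definition, the sum of per-base error probabilities 10^(-Q/10). The function names and the toy reads are made up for illustration and are not part of either tool.

```python
def expected_errors(quality, offset=33):
    """Sum of per-base error probabilities for one Phred-encoded quality string."""
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def trim_and_filter(records, length=250, max_ee=0.5):
    """Truncate each read to `length` bases, then drop reads whose
    expected errors exceed `max_ee`.

    `records` is an iterable of (label, sequence, quality) tuples.
    Reads shorter than `length` are kept at their own length.
    """
    for label, seq, qual in records:
        seq, qual = seq[:length], qual[:length]
        if expected_errors(qual) <= max_ee:
            yield label, seq


# Toy reads (hypothetical): 'I' is Q40, '+' is Q10 in Phred+33.
reads = [
    ("good", "ACGT", "IIII"),
    ("noisy", "ACGTAC", "++++++"),
]
print(list(trim_and_filter(reads)))            # the noisy read is dropped
print(list(trim_and_filter(reads, length=4)))  # truncation rescues it
```

The second call shows why clipping before filtering keeps more reads: the low-quality tail is removed before the expected errors are summed.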
After reviewing the joined-read results for the pickle box analysis, I decided to combine the forward reads from both runs and avoid joining altogether. The reverse reads were pretty poor across the board, so we were losing way too many sequences and joining others incorrectly.
Here is the R script regarding quality: http://rpubs.com/jfmeadow/27882
Mostly it came down to really low reverse read quality that tanked the total output number, as seen here:
The x and y axes here are equal at the top sequence count. Thus the forward read in the second run had the most sequences, and by combining both forward runs, I can get the most from the total number of sequences.
I've been trying to find alternative methods to the standard QIIME OTU clustering. UPARSE (published here)
Here is the workflow I am currently using to process sequence data using a combination of QIIME and UPARSE. Ann W put this together based on Mike Robeson's post here.
The only step that takes lots of time is the OTU table python script. No idea why it is so slow, and it might be worth rewriting that script to speed things up.
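If that script ever gets rewritten, a single pass with dictionary counters should be enough. Below is a rough Python sketch, not the actual `uc2otutab_mod.py`. It assumes the usual 10-column tab-separated `.uc` layout (record type in column 1, query label in column 9, target label in column 10) and QIIME-style read labels of the form `SampleID_readnumber`. The `#OTU ID` header and both function names are my own placeholders.

```python
from collections import Counter, defaultdict


def otu_table_from_uc(lines):
    """Tally hits ('H' records) from a .uc file into per-OTU sample counts."""
    counts = defaultdict(Counter)  # OTU label -> Counter(sample -> reads)
    samples = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 or fields[0] != "H":
            continue  # skip non-hits and malformed lines
        read_label, otu = fields[8], fields[9]
        # Assumed label format: "SampleID_123 original header ..."
        sample = read_label.split()[0].rsplit("_", 1)[0]
        counts[otu][sample] += 1
        samples.add(sample)
    return counts, sorted(samples)


def format_table(counts, samples):
    """Render the counts as a tab-delimited classic OTU table."""
    rows = ["#OTU ID\t" + "\t".join(samples)]
    for otu in sorted(counts):
        rows.append(otu + "\t" + "\t".join(str(counts[otu][s]) for s in samples))
    return "\n".join(rows)


# Hypothetical records, just to show the shape of the input.
example = [
    "H\t0\t250\t98.0\t+\t0\t0\t250M\tS1_1 HWI:1\tOTU_1",
    "H\t0\t250\t97.5\t+\t0\t0\t250M\tS1_2 HWI:2\tOTU_1",
    "H\t1\t250\t99.0\t+\t0\t0\t250M\tS2_3 HWI:3\tOTU_2",
    "N\t*\t250\t*\t*\t*\t*\t*\tS2_4 HWI:4\t*",
]
table_counts, table_samples = otu_table_from_uc(example)
print(format_table(table_counts, table_samples))
```

For a real run the function can be fed an open file handle instead of a list, so the `.uc` file is never held in memory all at once.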
We performed a Light Box trial on April 8, 2014 using both Pseudomonas and E. coli. We placed five 35mm petri dishes containing nutrient agar down the center of each box for both organisms (a total of 10 plates in each box). All nine boxes were used in this trial (3 UV-transparent, 3 visible-transparent, and 3 dark). Location 1 was closest to the window and location 5 was furthest from the window. To inoculate the plates, we pipetted 100 uL of a 10^-6 dilution for Pseudomonas and a 10^-7 dilution for E. coli. The OD600 reading was 0.575 and 0.803 for Pseudomonas and E. coli respectively. The day was overcast in the morning and sunny in the afternoon. The boxes were on the roof for 6.5 total hours.
Definitions:
We defined "counts" as the number of viable colonies growing on the plates after incubation for approximately 24 hours. We defined survival factor as the count divided by the mean count of the dark boxes for each organism.
Results:
For E. coli, we found that there was a slight treatment effect for the UV boxes (viable colonies increased as exposure to light decreased). There was not a clear treatment effect in the visible boxes (viable colonies were relatively consistent throughout the box scheme). Note that the mean of the dark box counts was 11.73 colonies (we want between 20-200 colonies to try and decrease variation). The data for E. coli is shown below.
For Pseudomonas, we saw the viability of colonies drastically increase as UV light exposure decreased. We saw hardly any growth in the visible boxes, and the growth we did see was on the plates exposed to the least visible light. The mean of the dark box counts was 37.55 colonies, which was in the target range of countable colonies (20-200). The data for Pseudomonas is shown below.
Consequences:
In this experiment, we saw that E. coli and Pseudomonas responded differently to the various light treatments. This was the first time that we tested E. coli on this scale. The data for Pseudomonas are consistent with what we saw in the most recent past trial.
Next steps:
Going forward, I think that it would be a good idea to continue to perform this type of experiment for both E. coli and Pseudomonas to have more power in our data.
We received great feedback on the knitr documents that were submitted with the Lillis microbial surface paper that was published in Microbiome Journal. The editors actually wrote a great piece about our efforts! So I've been asked to speak recently about how to pull this off. This has given me an opportunity to create some teaching materials. They are mostly composed of a small subset of those data, along with analysis scripts and example manuscript documents, all created in the dynamic analysis document style using knitr and pandoc.
This is a topic I've also been asked to talk about during my upcoming visit to San Luis Potosi, Mexico. So I'll get to reuse this potentially several times.
Lately we've been working really hard getting the next round of PickleBox studies off the ground, including IRB and planning with ESBL. We're also almost ready to submit our phone microbiome paper.
This is an update to my previous post on using bootstrapping to test for differences in the slopes of 2 lines.
I tested my previous method by additionally comparing the slope of lines that are apparently parallel (fig below).
This is a very similar situation to the previous example, just a different data set. When I ran the old method on the new data I got a very significant difference with a t-test comparing the means of these slopes, which I shouldn't be seeing. This poked a big hole in the way I had run things previously, and as I thought about it, it made sense. The t-test is very sensitive to the n of the two populations being compared. So the 10,000 replicates from each bootstrapped sample were going to reveal a significant difference between the two almost no matter how similar the slopes were.
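To see how strongly n drives the result, here is a small Python sketch of the Welch t statistic computed from summary values alone. The means and standard deviations are invented for illustration (a slope difference of 0.001 with a spread of 0.01), not taken from my data. Only n changes between the three lines.

```python
import math


def welch_t(mean_a, sd_a, mean_b, sd_b, n):
    """Welch t statistic for two samples that each have n observations."""
    return (mean_a - mean_b) / math.sqrt(sd_a ** 2 / n + sd_b ** 2 / n)


# Hypothetical slope distributions: same tiny difference, same spread.
for n in (10, 100, 10_000):
    print(n, round(welch_t(0.026, 0.01, 0.025, 0.01, n), 2))
# 10 0.22
# 100 0.71
# 10000 7.07
```

The statistic grows with the square root of n, so treating 10,000 bootstrap replicates as 10,000 independent observations makes even a trivial difference look overwhelming.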
So I searched the internet a little to see what other folks had come up with. I found this stackexchange that provided an answer.
In a nutshell, what I ended up doing was to use the same bootstrapped slopes from my previous method, take the difference between the two populations of slopes (producing 10,000 differences), and then among those differences I calculated the proportion of those that were less than or equal to zero (which is basically the probability that the slopes were equal or overlapping). I then multiplied the proportion by two (to account for two tails) and then used the result to assess significance.
Here's the R code and output from what I did on the parallel lines above:
# take the differences, calculate the % of samples where the difference is <= 0
# (where values are equal or overlap) and multiply that by 2.
bootdiff.eea <- boot.eea.dry$t - boot.eea.wet$t
hist(bootdiff.eea)
overlap.eea <- sum(ifelse(bootdiff.eea <= 0, 1, 0))
# > overlap.eea
# [1] 3352
2*overlap.eea/length(bootdiff.eea)
# [1] 0.6704
Here's the histogram of differences:
So you can see that these two parallel slopes are definitely not significantly different.
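For anyone who wants to try the idea outside of R, here is a rough Python sketch of the same procedure using only the standard library. It is not a port of my script. The function names are placeholders, the resampling is a plain row bootstrap, and the p-value doubles the smaller of the two tails (capped at 1) so the result stays valid whichever slope happens to be larger. With the counts reported above (3352 of 10,000 differences at or below zero) it gives the same 0.6704.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def bootstrap_slopes(x, y, n_boot=1000, seed=0):
    """Resample (x, y) pairs with replacement and refit the slope each time."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # a slope is undefined when every x is identical
        slopes.append(ols_slope(xs, [y[i] for i in idx]))
    return slopes


def slope_difference_p(slopes_a, slopes_b):
    """Two-sided p-value from the share of paired differences on either side of 0."""
    diffs = [a - b for a, b in zip(slopes_a, slopes_b)]
    p_le = sum(d <= 0 for d in diffs) / len(diffs)
    p_ge = sum(d >= 0 for d in diffs) / len(diffs)
    return min(1.0, 2 * min(p_le, p_ge))


# Reproduce the arithmetic above: 3352 of 10,000 differences are <= 0.
diffs_a = [-1] * 3352 + [1] * 6648
print(slope_difference_p(diffs_a, [0] * 10000))
```

In practice the two inputs would come from `bootstrap_slopes` run once per group with the same `n_boot`.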
Here's the same analysis with the data from the previous post:
We ran an intermediate sized Light Box trial (lightboxtrial_20140320) on Pseudomonas this week. We used all 9 boxes (3 Dark, 3 UV-transparent, 3 Vis-transparent) with 5 plates down the center of the box. Location 1 was closest to the window and location 5 was furthest. It was a very clear, sunny day and we had some of the highest light exposure values of any of our experiments yet.
Results:
All plates in the Visible boxes were clean. No viable colonies grew. UV plates displayed a nice shift of viability front-to-back based on expected light exposure levels. Here's the data plotted below. Black points are means, survival factor = (treatment count)/(mean of dark box counts).
Counts from the Dark boxes were fairly consistent within a box, but there were definitely differences between them. See this plot of the count data (black points are means, colors represent points from different boxes):
It looks like we're finally to the point where we've got apparently repeatable data between replicate boxes.
Next move:
I think it makes sense to expand to full-box runs with Pseudomonas and repeat the experiment several times to generate a larger data set to explore spatial patterns.
UPDATE: this method appears to be flawed because it relies on a t-test comparison. See this follow-up post for a new way to address this problem.
I was working on the manuscript from my Master's work this weekend and had a statistical revelation regarding bootstrapping and comparing slopes that other lab folks thought would be worth sharing here.
A little context: the data I'm working with are from microbial community analysis of soil samples. I was working with a Bray-Curtis distance matrix and a geographic distance matrix to look at spatial patterns, via distance-decay.
So I wanted to perform distance-decay analysis on two subsets of my data: samples from wet sites and samples from dry sites. I generated a figure that basically looks like this:
Then I wanted to compare the slopes of these two lines and determine if the turnover of communities from dry sites was significantly different from wet sites. In order to do that I found a method in Horner-Devine's Nature paper from 2004 that utilizes bootstrapping. I do something slightly different, but I think it achieves the same result.
A popular method to compare slopes is to use an ANCOVA, however because the points in my regressions aren't independent (they're pair-wise comparisons to generate a distance matrix) I can't do this. So instead, I bootstrapped the slopes of each line and then compared the bootstrapped slopes with a two sample t-test.
Here's my non-statistician way of explaining bootstrapping (feel free to correct me): you want to compare two statistics (like slopes of lines), but you can't just run a t-test on those two numbers because there's no sample variation to take into account, which is necessary for statistical analysis. So you create a population of possible slopes by randomly resampling (with replacement) from each data set, and recalculating the slope each time -- this is bootstrapping. Then you get a population distribution that looks something like this:
Note that the slope from my original regression was 0.026.
I did this for both the dry and wet (or whatever categories you're comparing) slopes and then did a t-test to compare those two populations of slopes.
Here's snippets of the relevant code:
library(boot)  # provides boot()

# define the statistic you want, in this case I want to bootstrap the slope of a regression
slope <- function(formula, data, indices) {
  d <- data[indices,]  # allows boot to select sample
  fit <- lm(formula, data=d)
  return(summary(fit)$coefficients[2])
}

# run the bootstrap
boot.dry <- boot(data=table.dry, statistic=slope, R=9999, formula=log(X2)~log(X1))
boot.wet <- boot(data=table.wet, statistic=slope, R=9999, formula=log(X2)~log(X1))

# look at spread of values
plot(boot.dry)

# test to see if there's a difference between the two slopes
t.test(boot.dry$t, boot.wet$t)
#boom
R in Ecology // Comparing two regression slopes by means of an ANCOVA: http://r-eco-evo.blogspot.com/2011/08/comparing-two-regression-slopes-by.html
Appendix to An R and S-PLUS Companion to Applied Regression // Bootstrapping Regression Models: http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-bootstrapping.pdf
I was making a post on the BioBE blog today and needed to adjust the indentation of a list in HTML. We cannot directly mess with the CSS because the university's service doesn't allow us to (ugh), so I had to come up with a workaround. I found this forum post to be very helpful.
Here's a snippet of the code that ended up working:
<style type="text/css">
ul.mylist {list-style-position: inside;}
ul.mylist li {text-indent: -1em; position: relative; left: 2em; margin-right: 2em;}
ul.mylist li p {display: inline;}
</style>
<ul class="mylist">
<li>...
Today Adam and I are working on a growth curve for E. coli. We decided to inoculate 300 mL of nutrient broth with E. coli (from a plate I propagated on 2/28/14) in a 500 mL Erlenmeyer flask. We then put this flask directly into the shaker (150 rpm, 37 degrees C). Every hour (and possibly every 1/2 hour after the culture really starts to grow), we will take a 5 mL sample out of the Erlenmeyer flask and place it directly into the fridge. This is a slightly different procedure from what we have done in the past, where we would make up the culture, distribute it into individual test tubes that were all placed in the shaker, and just remove a tube at each time interval. We are using a larger volume to try to let the liquid culture mix more thoroughly in the shaker and get a better representation of the growth of E. coli.
Last week we ran some small trials in the Light Boxes with E. coli and Pseudomonas monteilii directly plated onto 35mm nutrient agar.
All samples were placed down the middle of the boxes. Location 1 is nearest the window, 5 is the furthest. Counts are diluted 10^-7. Survival Factor is the proportion of viable counts from the light treatment sample compared to the average of the Dark box samples.
Raw E. coli counts:
Location   Vis     UV      Dark
1          142     144     190
2          174     177     157
3          165     191     167
4          181     183     187
5          151     224     212
Average    162.6   183.8   182.6
We see that Visible treatments tend to result in lower counts compared to the Dark & UV boxes for both organisms. Also, there is a slight decrease in counts near the front of the light treatment boxes.
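As a worked example of the Survival Factor definition, here is a short Python sketch applied to the raw E. coli counts in the table above. The function name is just a placeholder. The calculation is each treatment count divided by the mean of the Dark counts.

```python
from statistics import mean

# Raw E. coli counts by location (1-5), copied from the table above.
counts = {
    "Vis": [142, 174, 165, 181, 151],
    "UV": [144, 177, 191, 183, 224],
    "Dark": [190, 157, 167, 187, 212],
}


def survival_factors(treatment_counts, dark_counts):
    """Divide each treatment count by the mean of the Dark box counts."""
    dark_mean = mean(dark_counts)
    return [c / dark_mean for c in treatment_counts]


for treatment in ("Vis", "UV"):
    factors = survival_factors(counts[treatment], counts["Dark"])
    print(treatment, [round(f, 2) for f in factors])
```

A value below 1 means fewer viable colonies than the Dark average, and a value above 1 means more.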
The script for the first chapter of our graphic novel about the urban microbiome, Noli timere, is complete. Steve Green laid out the first storyboard (24 pages) and is now working on the drawings. All members of our team met at the Angouleme International Comics Festival, where I received my first "Auteur" Badge.
Our J^4 Seagrass Microbiome project (Jonathan Eisen, Jenna Lang, Jay Stachowicz, Jess Green) is micrometers away from being formalized. More on that soon.
Helene Morlon's lab has recently moved from Ecole Polytechnique to the Ecole Normale Superieure. I will continue coming to Ecole Polytechnique, and will also join Helene at ENS once a week.
I have given 7 of my 10 Blaise Pascal International Research Chair lectures, all of which I will be posting on this notebook in the coming weeks.
My Indoor Air Editorial has been accepted and will be released next month, prior to the AAAS conference.
Finally dug into the phyloseq package in R. This is designed to make microbiome studies easier to analyze in R. I found that most of the functionality is sort of useless for me - since I already have functions and routines for everything they do. But they do have a nice quick function to read OTU tables in .biom format, and a similarly easy qiime mapping file entry.
Installation instructions are here: http://joey711.github.io/phyloseq/install
So here is my code for
installing phyloseq
importing data
putting those data into normal R format for analysis.
Install:
They are not on CRAN, but they do have a nice bioconductor function for installation:
I've been working to run new Urban Air sequencing data through QIIME recently. While trying to parallelize the pick_open_reference_otus.py script I ran into a problem where 1 of the 12 parallel jobs failed. It turns out there's a poller.py script that waits for all the jobs to finish and put their output in a specific directory before the script moves on to subsequent steps. So if one of the jobs fails, the poller continues to look for these output files indefinitely -- see more details here.
As those more detailed instructions say, you just need to extract and rerun the commands for the failed job in the file ending with _jobs.txt found in the output directory for the parallel script. There is also a script to help you figure out which job failed: identify_missing_files.py.
For example, this is part of a PBS script to rerun the failed job:
I received 137.9 GB of sequence data from the Genomics Core at the University of Oregon after submitting 36 samples for paired end 150 bp reads on the HiSeq run in MiSeq mode.
It was run with a companion of genomic DNA, due to the fact that a HiSeq can’t cope without sufficient sequence complexity. At first the sequence data was provided to me already demultiplexed, and this initially presented a problem because of the pipeline that I had originally vetted through QIIME.
I talked to Nick Stiffler, and he concluded the following: “The qiime docs show this when using files that were already demultiplexed, we need to create a special fasta label that includes the fields from the mapping file: http://qiime.org/documentation/file_formats.html-already-demultiplexed-samples. It doesn’t look like there is a script to handle this, but we could write one fairly easily.”
In the meantime, he gave me the raw data to use. In the future, I would have to use the QIIME script referenced above, along with a custom script to create a special fasta label.
It is important to remember that when logging on to the htseq server to download your data, you need to do so from your ACISS account, otherwise you will get an error: permission denied.
starting over with raw data (not demultiplexed)
Log into htseq server where the sequence data is, and download all relevant files.
The preferred method by QIIME developers, this script clusters reads against a reference sequence collection, and any reads which do not hit in the reference collection are subsequently clustered de novo. It includes taxonomy assignment, sequence alignment, and tree-building steps.
So instead I tried running these last few steps by themselves, because I knew that I wanted to use at least a tree for my next script, core_diversity_analyses.py
I've been working all week to migrate lots of data onto our new storage system. We're trying to maintain a consistent directory structure for each project so that we can all navigate easily in each other's projects well after we each leave.
We published the Lillis Dust paper in PLOS ONE this week, so I've also been fielding press for that. It has shown up in Gizmodo, Popular Science, Quartz, Fast Company, Futurity, and lots of other news sites that just printed the press release. So that's fun.
The surface paper was also accepted in Microbiome Journal so that will finally be published soon.