Functional Transcription Factor Target Networks Illuminate Control of Epithelial Remodelling. Overton et al, Cancers (Basel). 2020 Sep 30;12(10):E2823.






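Not Analyzing ChIP-Seq Data
I took part as an assistant in the "Next-Generation Sequencer (NGS) Hands-on Workshop" run by NBDC and the Agricultural Bioinformatics education and research unit at the University of Tokyo's Faculty of Agriculture.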
http://biosciencedbc.jp/human/human-resources/workshop/h27
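Starting with the installation of BioLinux, it covers how to use the Linux command line, the basics of scripting languages and NGS analysis, and the analysis methods for each of the major NGS applications. It is a hardcore workshop that runs for about two weeks. Hardcore as it is, for people who have been wanting to try data analysis but don't know where to start, I think it is a perfect opportunity to bite the bullet, set aside the time, and get thoroughly trained. Partly because applicants who could attend the full schedule were given priority, my impression is that the participants were more motivated than in other training courses.
However, I suspect that many of the people who come to such a demanding workshop rarely get the chance to be around informaticians who analyze data every day, or to see their techniques. If so, all the more reason to worry: what you took the time to learn is forgotten if you don't use it, and the truly hard part comes after the training, in working what you learned into your daily routine.
In that respect, even an assistant cannot offer support casually. For example, when someone asks for help saying "I typed the command exactly as the instructor said, but it doesn't work", simply fixing it with "do this and it will run" is easy. But unless I also show the method, namely "guess why it didn't work" and "look for a solution based on the suspected cause", they won't be able to apply it in another situation, and the precious training opportunity goes to waste. Back in their own labs there is neither instructor nor assistant, so perhaps the best outcome is for participants to use the two-week course as a chance to notice the habits and ways of thinking of people who routinely analyze data on computers, and to take home skills that stay useful for a long time. That is what I was thinking while assisting. Admittedly, it all boils down to one line: "paste the error message into Google as is and read every document from the top."
If assisting is hard, teaching is harder. Running a hands-on session for a large group is, from my own experience too, truly exhausting. Still, the two courses I attended, the Python course by Hattori-san of Amelieff Corporation and the ChIP-Seq data analysis course by Morioka-san of RIKEN, were both substantial and well done. A hands-on session with many participants has the advantage of reaching many people at once, but it also risks falling flat depending on the spread of the participants' interests, technical levels, and literacy levels. I think both of their courses went very well. If you are interested, the materials have been uploaded to this page, so please check them out.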
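The instructor's energy level is off the charts
Now, having finally written an entry after a long while, I have of course not come just to drop off a goody-two-shoes book report. So what is this about? First, please take a look at the course page for the ChIP-Seq data analysis course, delivered with delightful talk by Morioka-san of RIKEN, a.k.a. @suimye. https://github.com/suimye/NGS_handson2015
Skim the part that lays out the flow of the course, from the principles of ChIP-Seq through to plotting, and note the link at the very bottom of the page: "Intermediate-and-above course (a challenge for the lone wolves)". I don't quite get what the text in parentheses means, but clicking it takes you to this page. Please open it. https://github.com/suimye/NGS_handson2015/wiki/NGS_senior
To put it mildly, the page is far too funky, but I like this kind of thing, so I don't mind. In any case, the one-day training carried on at roughly this energy level from morning to evening. Throughout, the participants wore an expression that said "I don't know what kind of face to make at a time like this", which was highly entertaining to watch from the sidelines. I think you should just smile.
Anyway, the web page. Setting aside for now the energy level that makes you wonder whether he is on something seriously strong, we find a section labeled "Problem".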
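The file kaeru.nazo.fastq.gz is ChIP-seq data derived from human cells. Using the operations taught in the practical sessions and lectures, together with your own ideas, find the name of the transcription factor for this ChIP-seq data.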
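In other words, he seems to have prepared this challenge with a certain mad participant in mind, the one who has already blasted through the ChIP-Seq training course alone and says: too easy, give me a tougher task! Thoughtful, I suppose. The frog photo has no context and I suspect he just wanted to paste it in, but, well, whatever. Everyone likes this sort of thing, right?
Let's try it
So, I'll take on this challenge. A "lovely present" gives me nothing but a bad feeling, though. It's bound to be a stuffed frog.
The orthodox approach would be to map the fq to the human genome, call peaks, and infer the transcription factor from the motifs in the bound regions, but that isn't much fun. As a database person, I will try to identify the transcription factor using databases alone.
First, download the fastq and look inside. It is gzip-compressed, but never mind, just view it with less.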
$ less kaeru_nazo.fastq.gz
@KAERU.13652678 HWUSI-EAS366_145:6:26:14947:20927/1
TGGCTGGCAACAATAGATACTGGGGACTACTAGACA
+
####################################
@KAERU.46814293 HWUSI-EAS366_145:6:95:12328:13457/1
AATGGAATTGAATGGAATGGAATTGAATGGAATGGG
+
,7>>21?#############################
@KAERU.9734561 HWUSI-EAS366_145:6:19:12517:8892/1
TTGAGATGGAGTCTTGCTCTGTCGCCCAGGCTGGAG
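My guess was that the answer would be a not-too-obscure transcription factor with well-behaved data; that the instructor's own research data would be awkward to use, given the hassle of getting consent from collaborators, so the data probably came from a public database; and that if it was public data, the SRA ID might be sitting in the read ID strings of the fastq. With that in mind I looked inside the file, but it says KAERU (Japanese for "frog") or something. Such attention to detail, or whatever you want to call it.
If there were an SRA ID, searching for it would settle things immediately, but apparently that won't be allowed this time. Too bad. Still, it's a step forward. In the ID line beginning with @, the part starting with "HWUSI..." looks like output from an Illumina sequencer: an ID unique to the sequencing machine, followed by the lane number, the tile number, and the cluster coordinates. There is a string in front of this instrument name, and the fact that someone took the trouble to rewrite it as KAERU suggests that it was originally an SRA Run ID (SRR ID).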
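To make the read ID layout described above concrete, here is a minimal Python sketch that splits such a header into its parts. The field names (`prefix`, `instrument`, `lane`, `tile`, `x`, `y`) and the helper names are chosen for this illustration only, and the regular expression assumes the layout seen in the two header styles shown in this post, not every Illumina format.

```python
import re
from typing import NamedTuple, Optional


class ReadId(NamedTuple):
    prefix: str      # e.g. "KAERU.13652678" or "SRR445819.13652678"
    instrument: str  # machine-specific ID, e.g. "HWUSI-EAS366_145"
    lane: int
    tile: int
    x: int           # cluster x coordinate
    y: int           # cluster y coordinate


# Assumed layout: @<prefix> <instrument>:<lane>:<tile>:<x>:<y>[/<mate>] [length=N]
HEADER_RE = re.compile(
    r"^@(?P<prefix>\S+)\s+(?P<instrument>[^:\s]+)"
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
)


def parse_header(line: str) -> Optional[ReadId]:
    """Parse a FASTQ header line; return None if it does not fit the layout."""
    m = HEADER_RE.match(line)
    if m is None:
        return None
    return ReadId(m["prefix"], m["instrument"], int(m["lane"]),
                  int(m["tile"]), int(m["x"]), int(m["y"]))


def same_cluster(a: ReadId, b: ReadId) -> bool:
    """True if both reads share instrument, lane, tile and coordinates."""
    return a[1:] == b[1:]
```

Comparing everything except the prefix is what lets a renamed read (KAERU) be matched against its original in a public run.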
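There was no ID, but it's too early to give up. If I use the machine-specific sequencer ID as a key and list the sequence data in public databases that came from the same machine, I might learn something.
I had never run into this kind of situation before, so unfortunately I have no database at hand that links each public dataset to the machine-specific ID of its sequencer. Even if I were to scrape the sequencer ID out of every public fastq produced by Illumina sequencers, the HiSeq 2000 alone accounts for something like 560,000 experiment sets, so it would take a while. Well, nothing to lose: let's throw it at Google.
Something came up!!
Only two hits, but hits nonetheless. Apparently having had the same idea, an internet user called @ma_ko looked at this file and got fed up; finding that as the second hit is a catch of its own, but the first hit is unmistakably a read ID string from a public fastq. What's more, judging from the URL, it seems to be last year's test data from the same NGS workshop. Suspicious. A twist like "Actually, this was the data we used last year! Ha ha!" seems extremely plausible.
However, the search hit has lane number 3, whereas the challenge fq has lane number 6. Doctoring the lane number on purpose is hard to imagine, so I'll proceed on the suspicion that this is "different data from the same sequencer". To look at things per project, let's check this SRR445816 at http://sra.dbcls.jp/search.
Tooting my own horn here: this is a search system we build that outputs reports on public data. It doesn't yet support the new SRA schema, no new data has been added for a while, and the system is currently being updated. Anyway, let's try a search.
Something plausible came up!!
It appears to be a set of sequencing data from a project that maps the transcription factors Oct4, Sox2, Klf4, and c-Myc onto the genome. I sense we are getting close to the answer. The instructor also said, "Don't trust the peaks in a paper. Peak-calling thresholds are set arbitrarily to suit the purpose, so if you want to use them in your own research, recompute from the raw data", so it seems quite possible that he reanalyzed this paper.
The project yielded seven Runs in total. The first one, SRR445816, is the one that came up on Google and was used as test data in last year's workshop. If one of the runs listed below it was sequenced on lane 6, that is the prime suspect. I click "FTP" to connect to the DDBJ FTP site, download the data, and go through each dataset in turn.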
$ less SRR445817.fastq.bz2
@SRR445817.1114 HWUSI-EAS366_145:4:1:5805:1023 length=36
NNNNNNNNNTNNNNNNNNNNNNNNNCNGNNNNNNAN
+SRR445817.1114 HWUSI-EAS366_145:4:1:5805:1023 length=36
####################################
@SRR445817.1117 HWUSI-EAS366_145:4:1:5974:1021 length=36
NNNNNNNNNANNNNNNNNNNNNNNNANCNNNNNNAN
+SRR445817.1117 HWUSI-EAS366_145:4:1:5974:1021 length=36
####################################
@SRR445817.1118 HWUSI-EAS366_145:4:1:6008:1017 length=36
NNNNNNNNNCNNNNNNNNNNNNNNNTNTNNNNNNCN
Lane 4. Nope.
$ less SRR445818.fastq.bz2
@SRR445818.781 HWUSI-EAS366_145:5:1:1107:1023 length=36
NGTCCNNNNNNNGNNNNGGNAAGAACGNNAAANNNG
+SRR445818.781 HWUSI-EAS366_145:5:1:1107:1023 length=36
####################################
@SRR445818.782 HWUSI-EAS366_145:5:1:1156:1021 length=36
NGTATNNNNNNNNNNNNGNNNACTTTGNNNANNNNC
+SRR445818.782 HWUSI-EAS366_145:5:1:1156:1021 length=36
####################################
@SRR445818.784 HWUSI-EAS366_145:5:1:1335:1022 length=36
NAAGANNNNNNNNNNNNATNNGGGGGCNNTTTNNNA
Lane 5. Nope.
$ less SRR445819.fastq.bz2
@SRR445819.912 HWUSI-EAS366_145:6:1:1164:1034 length=36
GAGGCNNNNNNANNNNNNNACTGGCACAANATTNAA
+SRR445819.912 HWUSI-EAS366_145:6:1:1164:1034 length=36
####################################
@SRR445819.913 HWUSI-EAS366_145:6:1:1323:1025 length=36
NGNNNNNNNNNNNNNNNNNNNNNAANNCNNNNNNNN
+SRR445819.913 HWUSI-EAS366_145:6:1:1323:1025 length=36
####################################
@SRR445819.914 HWUSI-EAS366_145:6:1:1420:1030 length=36
NAGGNNNNNNNNNNNNNNNANCGAGGGCNNGANNNN
Lane 6!!!
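So SRR445819 turns out to be data from the same lane of the same sequencing machine. Of course, that does not yet guarantee that it is the same Run, so let's check.
First, the decompressed kaeru_nazo.fastq has 31,236,448 lines (7,809,112 reads) at 946 MB, while SRR445819.fastq has 240,113,360 lines (60,028,340 reads) at 12 GB. The sizes are completely different, but 7 million reads is low for a ChIP-Seq of human cells, so it is more natural to assume that the data was reduced to lower the difficulty (or rather the compute time) of the exercise. In that case, if the reads in kaeru_nazo.fastq also turn up in SRR445819.fastq, I think we can call it a match.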
$ head kaeru_nazo.fastq
@KAERU.13652678 HWUSI-EAS366_145:6:26:14947:20927/1
TGGCTGGCAACAATAGATACTGGGGACTACTAGACA
+
####################################
$ grep -A 4 'HWUSI-EAS366_145:6:26:14947:20927' SRR445819.fastq
@SRR445819.13652678 HWUSI-EAS366_145:6:26:14947:20927 length=36
TGGCTGGCAACAATAGATACTGGGGACTACTAGACA
+SRR445819.13652678 HWUSI-EAS366_145:6:26:14947:20927 length=36
####################################
Bingo.
I should probably check a bit more carefully, but, well, I think this will do...
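For anyone who does want the more careful check, one possible approach is to confirm that every read in the reduced file appears in the full run with the same cluster coordinates and the same sequence. The sketch below is not what was done in this post; the function names are made up for illustration, and it assumes well-formed four-line FASTQ records and enough memory to hold the smaller file's reads in a dictionary.

```python
from typing import Dict, Iterable, Iterator, Tuple


def fastq_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (coordinate_key, sequence) for each 4-line FASTQ record."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # '+' separator line
        next(it)  # quality line
        # The second whitespace-separated field holds
        # instrument:lane:tile:x:y, optionally followed by "/1".
        key = header.split()[1].split("/")[0]
        yield key, seq


def count_missing(subset_lines: Iterable[str], full_lines: Iterable[str]) -> int:
    """Count subset reads that have no identical counterpart in the full run."""
    wanted: Dict[str, str] = dict(fastq_records(subset_lines))
    for key, seq in fastq_records(full_lines):
        if wanted.get(key) == seq:
            del wanted[key]
    return len(wanted)
```

Both arguments can be open file handles, for example from `gzip.open(path, "rt")` or `bz2.open(path, "rt")`, so the compressed downloads can be streamed directly. A result of 0 would mean every read in the reduced file was found in the full run.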
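I'll take it that kaeru_nazo.fastq comes from SRR445819. The answer then follows from finding out which transcription factor's antibody was used for SRR445819 among the sequencing runs of the paper mentioned above. Normally you would hope that the Sample information describes the antibody and cell line used, but...
Not written anywhere. Yay.
Don't worry. It happens all the time. Let's see whether this sample's data is registered or described in some other database.
The sample ID corresponding to SRR445819 is SRS300774. The IDs associated with this dataset all begin with S, which indicates that the data was submitted to NCBI (incidentally, IDs beginning with E are data accepted by EBI, and those beginning with D by DDBJ). NCBI, EBI, and DDBJ run a database called BioSample that separately manages the sample information for various kinds of submitted data, and the information is likely to be there. Without further ado, I plug the SRS ID into NCBI BioSample and search.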
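The prefix rule mentioned above is simple enough to write down as a lookup. This is only a sketch of that rule for SRA-style IDs such as SRR, SRS, ERR, or DRR; the function name is invented for illustration, and other accession families are not handled.

```python
# First letter of an SRA-style accession -> archive that accepted the submission
ARCHIVE_BY_PREFIX = {"S": "NCBI", "E": "EBI", "D": "DDBJ"}


def submitting_archive(accession: str) -> str:
    """Return the archive implied by the first letter, or "unknown"."""
    return ARCHIVE_BY_PREFIX.get(accession[:1].upper(), "unknown")


for acc in ("SRR445819", "SRS300774"):
    print(acc, submitting_archive(acc))
```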
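A sloppy search, but never mind.
Yay!
Information came up for the Sample used in a ChIP-Seq with a c-Myc antibody. In other words, this is the identity of the original sample behind the sequence data above.
I managed to reach a plausible answer using databases alone. I got there, but somehow it doesn't feel like much of an achievement... Maybe I should build a system where you throw in a fastq and it answers "it's this one".
After the course ended and the participants had left, I quietly told the instructor, "That's c-Myc, right?" "Correct!!!" So it was correct. "I'll give you a present!! I've got two, so you get two!!!"
I ended up with two. Even though I cheated... Oh well. I wonder what's inside.
What on earth is this..............................
To the organizers at NBDC, to Kadota-san of Agribio and all the assistants, and to the great instructors: thank you for your hard work. The end.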
EpigeneticSeq 2015 Workshop
In the last decade, genome-wide approaches have become key tools for understanding genome structure and gene regulation. With the advent of high-throughput sequencing technologies, the amount of genome-wide data available today is impressive and constantly growing, but this growth has not been matched by equally efficient data analysis.
Given the success of the first edition, Next Generation Intelligence and the Human Genetics Foundation (HuGeF), with the support of the University of Torino and Illumina, are proud to announce the second edition of the EpigeneticSeq Workshop 2015, a practical course for beginners on ChIP-Seq and Bisulfite-Seq data analysis and their integration with RNA-Seq data.
For more information:
http://epigeneticseq.hugef-research.org
or write an e-mail to:
workshop[at]nextgenintelligence.com
EpigeneticSeq 2014 Workshop
In the last decade, genome-wide approaches have become key tools for understanding genome structure and gene regulation. With the advent of high-throughput sequencing technologies, the amount of genome-wide data available today is impressive and constantly growing, but this growth has not been matched by equally efficient data analysis.
With this in mind, Next Generation Intelligence and the Human Genetics Foundation (HuGeF), with the support of the University of Torino and Illumina, are proud to announce the first edition of the EpigeneticSeq Workshop 2014, a practical course for beginners on ChIP-Seq and Bisulfite-Seq data analysis and their integration with RNA-Seq data.
For more information:
http://epigeneticseq.hugef-research.org
or write an e-mail to:
workshop[at]nextgenintelligence.com
Slides from a presentation
Recently, I gave a presentation for my lab group at school, and I wanted to do an "experiment" and try to do everything without PowerPoint.
Crazy, you say? Well, I considered some options, and I went ahead and made my presentation using Slidify!!
I was super impressed with Slidify because it was very easy to get started and it uses R Markdown syntax, which I have already used for some blog posts. To get started, you just write a text file and separate the slides with a line containing three dashes "---". After just a few minutes, you can start generating HTML slideshows that work great.
It also helps to have some web design experience when using Slidify, because you can style Markdown with CSS. Here is a discussion about how to align images in a Markdown document, and I particularly like the second answer. Unfortunately, I had some trouble aligning images too, so I could probably do with some practice (see Figure 1).
Figure 1. Example slide from my presentation with a figure that is cut off. I used the CSS style "float:right" but for some reason it didn't stay on the page.
I also wanted to publish my slides on the web, and found that you can use Slidify to publish to your GitHub repository using GitHub Pages. Again, this process was surprisingly simple: you make a branch on your repository named "gh-pages" and push all your Slidify slides to it!
The outcome: Comparison and analysis of ChIP-seq experiments
Feel free to check out the slides, and you can also see the code for generating the slides on my GitHub branch. Soon, I may also update the code for the slides to generate the figures instead of just linking to them.

Adding a custom normalization to MACS
One of the issues with ChIP-seq is that each dataset has its own special characteristics. What works well on one might perform poorly on another! There are variations in the sequencing depth as well as in the enrichment of the protocol, and the ChIP protocols themselves can vary widely, for example when they are specialized for small numbers of cells or other scenarios.
In the paper "Normalization of ChIP-seq with control" (Liang and Keles, 2012), the authors make a convincing case for how to estimate the normalization factor used to scale the background distribution of reads. They analyzed a couple of different openly available ChIP-seq datasets, and I found these figures very illustrative of the unique shape that each dataset has.
Supp Figures 1-3, from Liang et al. (2012) - These figures show the binned read counts of various ChIP-seq experiments versus their controls. Each experiment has a unique shape that depends on the sequencing depth and the enrichment of the protocols.
How to estimate the normalization the NCIS way?
In applications like MACS, the ChIP-seq and control data are normalized by the sequencing depth. However, we might not want to scale down the peaks from the ChIP-seq data: we might only want to normalize the "background", the non-specific and noisy DNA-protein interactions.
We know that the binding sites are basically peaks in the data, so a "reasonable" way to estimate the background is to use all binned read counts that have fewer than "t" reads. This corresponds to the set Bw(t) of all bins B of width w that have fewer than t reads. With that in mind, we can look at the first figure of the NCIS paper and actually make sense of it.
Figure 1. from Liang et al (2012) - Shows the estimated ratio of the background reads in the ChIP-seq/control for a given threshold t. The optimal t is chosen once at least 3/4 of the genome is included in the background distribution and the estimated ratio then starts "going up" (r(t)>r(t-1)), indicating that we are starting to 'grab' the ChIP-seq signal; the search stops there.
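The stopping rule in that caption can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the NCIS package's implementation: it thresholds on the combined ChIP plus control count of each bin, uses a single fixed bin width, and returns the ratio at the last threshold before the rise. The function name `ncis_ratio_sketch` and its parameters are invented for this example.

```python
from typing import Sequence, Tuple


def ncis_ratio_sketch(chip: Sequence[int], control: Sequence[int],
                      min_fraction: float = 0.75) -> Tuple[float, int]:
    """Return (background ratio, threshold t) for per-bin read counts."""
    if len(chip) != len(control) or not chip:
        raise ValueError("chip and control need the same, non-zero length")
    totals = [c + k for c, k in zip(chip, control)]
    n_bins = len(totals)
    prev = None  # (ratio, t) at the previous usable threshold
    for t in range(1, max(totals) + 1):
        # Candidate background: bins whose combined count is at most t
        background = [i for i, total in enumerate(totals) if total <= t]
        control_sum = sum(control[i] for i in background)
        if control_sum == 0:
            continue  # ratio undefined, try a larger threshold
        ratio = sum(chip[i] for i in background) / control_sum
        enough = len(background) / n_bins >= min_fraction
        if prev is not None and enough and ratio > prev[0]:
            return prev  # the ratio started rising: signal is leaking in
        prev = (ratio, t)
    if prev is None:
        raise ValueError("control has no reads in any bin")
    return prev


# Toy data: six background bins and two strongly enriched bins
chip = [1, 1, 1, 1, 1, 1, 10, 12]
control = [1, 1, 1, 1, 1, 1, 1, 1]
print(ncis_ratio_sketch(chip, control))  # background ratio and threshold
print(sum(chip) / sum(control))          # plain sequencing-depth ratio
```

On the toy data the depth ratio is pulled up by the two enriched bins, while the background estimate only reflects the six quiet bins.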
How can we apply this technique to MACS?
Originally, I found NCIS on the MACS mailing list and I wanted to work out the answer to this question for myself. As it turns out, MACS can perform pitifully on some datasets. This includes datasets like Zheng et al, which I have looked at in my previous posts. One particular experiment with SEG1 produces some very troublesome results when MACS is used to find peaks. Therefore, I wanted to see if we can use the NCIS ratio estimates with MACS. Unfortunately, in MACS 1.4.2 there is no way to specify a custom normalization factor, so I made a quick hack that ignores the effect of the default --to-small=BOOL flag if a new flag that I added, called --ratio=FLOAT, is set. The --ratio value is a scaling factor that affects the control dataset, but I could have designed it to affect the ChIP-seq dataset as well.
Calculation of the NCIS ratio for the SEG1 dataset
As in the NCIS paper, I also tried to analyze the SEG1 dataset using MACS. So first I tried to reproduce some of the NCIS results from their SEG1 application, but to do this I had to do some guesswork about the NCIS methods. First, I deduced that they had pooled all the SEG1 replicates together, which gives me a ratio estimate very close to their published result.
SEG1 ratio given in the NCIS paper:
NCIS estimated ratio: 1.265
estimated background proportion: 0.763
SEG1 rep1,2,3 pooled ratio based on my calculation using NCIS:
NCIS estimated ratio: 1.2627640
estimated background proportion: 0.7672308
Sidenote: The NCIS package is very easy to use in R, which is great, but I also ran into some memory problems. I would try using the MCS or BED format instead of the AlignedRead format. Personally, I used the MACS script elandresult2bed to convert the eland_result files to BED instead of using readAligned to load the eland_result files into memory directly.
Peak calling using the modified MACS
After calculating the normalization, I did the peak calling with and without the ratio parameter to see how this performed. Again, I had to guess the peak calling parameters because they aren't given in the NCIS paper. Unfortunately, my numbers didn't match their numbers, but the effects of using the --ratio flag are clear. First, I called peaks without the --ratio flag:
macs14 -g 1.2e7 -m 4,30 -t SEG1ChIP_rep123u_eland_result.txt -c SEG1inputu_eland_result.txt -n SEG1rep123
This command resulted in only 703 called peaks and 971 negative peaks (yikes! bad FDR). I also tried adding the parameter --to-small, which downscales the ChIP-seq data (6M reads) instead of upscaling the control data (4M reads), but it didn't improve much: 366 peaks called and 713 negative peaks (yikes again!).
Then, using the --ratio flag that I added to MACS, I ran MACS with the NCIS ratio estimate, r=1.2627640. Compare this with the raw sequencing depth ratio that MACS applies, r=1.64.
macs14 -t SEG1ChIP_rep123u_eland_result.txt -c SEG1inputu_eland_result.txt -m 4,30 -g 1.2e7 --ratio 1.26 -n SEG1rep123-ncismod
This resulted in 991 peaks and only 19 negative peaks. Arguably, this result is a lot better! Changing only the --ratio from ~1.6 to ~1.2 gives much better control of the FDR.
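To put the three runs side by side, a rough summary is the number of negative peaks divided by the number of called peaks. The snippet below is only a back-of-the-envelope comparison using the counts reported above; it is not how MACS itself reports a per-peak FDR, and the function name and run labels are made up for illustration.

```python
def negative_to_positive_ratio(negative_peaks: int, positive_peaks: int) -> float:
    """Crude overall FDR proxy: negative (control) peaks per called peak."""
    if positive_peaks <= 0:
        raise ValueError("need at least one called peak")
    return negative_peaks / positive_peaks


# (negative peaks, called peaks) as reported in this post
runs = {
    "default depth scaling": (971, 703),
    "--to-small": (713, 366),
    "--ratio 1.26": (19, 991),
}
for label, (neg, pos) in runs.items():
    print(f"{label}: {negative_to_positive_ratio(neg, pos):.3f}")
```

A value above 1 means more peaks were called in the swapped control-versus-ChIP comparison than in the real one, which is the situation in the first two runs.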
Why is it so much of a better result?
We can see visually how the different ratios compare using the figures from Liang et al. again. Here the sequencing depth ratio (black) greatly exceeds the NCIS ratio (blue), which follows the background distribution. If the sequencing depth ratio is used as the scaling factor, the control data will be amplified into the region where true peaks can occur, so many negative peaks will be called and many positive peaks will be obscured!
Figure 4. from Liang et al (2012) - The black line represents the difference in sequencing depth, the blue line is the NCIS estimate
Conclusion
These are positive results for using the NCIS estimator with MACS, and it only involved a very simple patch to the MACS source code to achieve. Overall, I found that the new --ratio parameter combined with the NCIS ratio estimate was very effective at fixing the problems with the high FDR and weak peak calling observed in SEG1.
I'm surprised I had not heard of this technique earlier, since it appears to also be implemented to some degree by cisgenome and others. I guess I like surprises though! There are additional results from the NCIS paper regarding the control of the FDR that I would like to figure out still. I also should probably try it on different datasets and see how it compares.
Download
If you want to try out MACS 1.4.2 with the --ratio flag, I made a GitHub fork of the code and added a tag "add-ratio-branch", which lets you download the package as a zip here https://github.com/tonto/MACS/tags
You can also see the exact changes that I made to MACS using this commit log here https://github.com/tonto/MACS/commit/d1c9f9931871ee8706e0aca1041f647857b2f8e5
Hello all!
I recently completed my undergrad senior thesis and I wanted to share a little bit with you. All in all, it was an amazing experience! I had a lot of help with my thesis from my advisors and other students, but it was also a very independent project for me, and I feel very privileged to have gotten involved in such a cool field: bioinformatics!
The thesis is titled "Comparison of transcription factor binding sites from ChIP-seq experiments" and I can make it available on request or once it's published on my university's website. It contains a lot of stuff that I have already talked about on this blog, but a lot of new stuff too!
For my final project, one of the main things that I did was to use the limma package from Bioconductor as a way to compare ChIP-seq data. My original goal was to use some standard t-tests to do comparisons, but digging into the literature I found that there are improvements, including the moderated t-test from the limma package.
As a starting point, I use the binned read counts that MACS outputs as wiggle files and find the bins that are differentially bound by comparing the entire datasets using limma (Figure 1). Intuitively, this identifies as differential the read counts that are high in one experiment but low in the other.
Figure 1. Differential ChIP-seq read counts from two ChIP-seq experiments that are identified using limma
The next thing that I did was to merge all the nearby binding sites together. I used a very simple approach with BEDTools, merging positions less than 20 bp apart. It turns out that the merging performs very well, and many nearby binding sites among the high-confidence sites that limma identifies are merged together (Figure 2).
Figure 2. The ratio of the merged peaks (output) to the input bins (input). The decreasing ratio is a sign that many nearby bins identified by limma are being merged together at different p-value thresholds.
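The merge step described above was done with BEDTools, but the idea fits in a few lines of Python. The following is a sketch of the same kind of merge, not the BEDTools implementation; the function name, the tuple layout, and the strict "less than 20 bp" gap rule are assumptions made for this illustration.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def merge_nearby(intervals: Iterable[Interval], max_gap: int = 20) -> List[Interval]:
    """Merge intervals on the same chromosome separated by less than max_gap."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] < max_gap:
            # Close enough to the previous interval: extend it
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


bins = [("chrI", 0, 10), ("chrI", 25, 35), ("chrI", 60, 70), ("chrII", 5, 15)]
print(merge_nearby(bins))
```

The output-to-input ratio plotted in Figure 2 would then be `len(merge_nearby(bins)) / len(bins)` for the bins passing each p-value threshold.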
The next thing I did was inspect the 'signals' to show that these peaks intuitively resemble differential peaks. Indeed, many of these peaks are highly significant and highly differentially bound (Figure 3). I attribute the success of this approach to the limma package. In my paper, I also looked at how my method compares with other software such as MACS and DIME for finding differential binding sites.
Figure 3 An example of a differential binding site, showing a peak that is reproduced in the replicates from one strain but that is not present in the replicates of the other strain.
These differential binding sites appear in many different forms, and one question is what effect they have. Binding sites essentially regulate the expression of nearby genes, so I looked at the gene expression data that was also made available and assessed the differential gene expression against the differential binding (Figure 4). This method is inspired by plotMM from the Rcade package in Bioconductor. The gene expression data was obtained with help from the GEO2R interface, a fairly new interface for GEOquery on the NCBI website!
Figure 4. The differences in binding sites are associated with changes to gene expression also. Comparison of the log ratio of differential binding sites vs the log ratio of the gene expression data. The lightly colored points are the gene expression from a random gene, and the dark colored points are gene expression for the nearest gene to the binding site
Overall, I was able to explore one approach for comparing ChIP-seq data in real depth. I think the availability of high-throughput sequencing data on the GEO database is very cool, and many new types of bioinformatics tools are needed to help analyze it! I don't think my work will stop just because I finished my project, and I hope to find new and interesting things in the future!
PS: there are some interesting comparisons of limma and edgeR that I did not consider in my project but which would be interesting to analyze.
Comparison of ChIP-seq data
In my last post, we looked at how normalization can be used for analyzing high throughput sequencing data. I am interested now in how to identify differences across conditions.
To compare the ChIP-seq data from different experiments, we looked at the yeast data from [3]. To start, the reads overlapping each genome position are binned into windows (spacing=10) using the wiggle files from MACS [1]. MACS then identifies peaks in the data, which are potential transcription factor binding sites, and we find overlapping and non-overlapping (unique) peaks using bedtools [2].
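As a rough picture of what the binning step produces, here is a small Python sketch that counts how many reads overlap each 10 bp window. It is an illustration of the idea only, not how MACS writes its wiggle files; the function name and the half-open (start, end) read representation are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def bin_read_counts(reads: Iterable[Tuple[int, int]], width: int = 10) -> Dict[int, int]:
    """Map each window start to the number of reads overlapping that window.

    Reads are half-open (start, end) intervals on a single chromosome.
    """
    counts: Counter = Counter()
    for start, end in reads:
        first_bin = start // width
        last_bin = (end - 1) // width
        for b in range(first_bin, last_bin + 1):
            counts[b * width] += 1
    return dict(counts)


# Two overlapping 36 bp reads
print(bin_read_counts([(0, 36), (5, 41)]))
```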
When we compared biological replicate experiments of the S96 yeast strain, we saw that most of the differences (unique peaks) are near y=x, showing that they are not significantly different from each other (Figure 1).
Figure 1. Comparison of S96 replicates shows high correlation (rho~0.9) and the peaks that are marked unique are clustered around the line of equality. The log2 plot does not show large differences either.
We can use the same procedure to show how the two different strains compare. In this case, we compared the S96 and HS959 strains and found noticeable differences in the peaks that are called unique. The raw data shows large differences between the peaks that are called unique in each dataset. Looking at the log-transform shows that most of these positions have fold-change differences greater than 2.
Figure 2. Comparison of the S96 and HS959 genomes. (Left) The raw read scores show that the unique peaks have significantly different groupings between the strains. (Right) The log-transform shows that large fold-change differences exist amongst the peaks called unique.
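The fold-change criterion mentioned above can be written out explicitly: a fold change greater than 2 in either direction means the absolute log2 ratio is greater than 1. The sketch below adds a pseudocount of 1 to avoid dividing by zero, which is an assumption for this example and not something stated in the post; the function names are also invented.

```python
import math


def log2_fold_change(a: float, b: float, pseudocount: float = 1.0) -> float:
    """log2 ratio of two read scores, with a pseudocount to avoid log(0)."""
    return math.log2((a + pseudocount) / (b + pseudocount))


def exceeds_fold_change(a: float, b: float, min_fold: float = 2.0,
                        pseudocount: float = 1.0) -> bool:
    """True if the scores differ by more than min_fold in either direction."""
    return abs(log2_fold_change(a, b, pseudocount)) > math.log2(min_fold)


print(log2_fold_change(7, 1))      # score of 7 in one strain, 1 in the other
print(exceeds_fold_change(7, 1))
print(exceeds_fold_change(10, 10))
```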
The comparison of the raw read scores is convincing that some of the peaks are differentiated but it is also clear from comparing some experiments that the read depth is not equal which results in the “banana shape” (Figure 2). We explored several options for normalizing the data including background subtraction, scaling, background subtraction and scaling, and normalized difference (NormDiff) [3]. I will include these in a new post...
To process the data, I used plyr (an R package), which fits a neat design pattern called split-apply-combine. In this case I split the data across factors (chromosome) and apply functions (selections), and the results are then combined automatically back into one table. This cut some of the fat off my old overweight code!
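For readers who have not met the pattern, split-apply-combine looks roughly like this when written out by hand. The analysis in this post used plyr in R; the Python below is only an illustration of the pattern, with made-up row data and a hypothetical helper name.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, TypeVar

Row = TypeVar("Row")
Result = TypeVar("Result")


def split_apply_combine(rows: Iterable[Row],
                        key: Callable[[Row], Hashable],
                        func: Callable[[List[Row]], Result]) -> Dict[Hashable, Result]:
    """Split rows by key, apply func to each group, combine into one dict."""
    groups: Dict[Hashable, List[Row]] = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)                       # split
    return {k: func(group) for k, group in groups.items()}  # apply + combine


# Hypothetical (chromosome, position, read score) rows
rows = [("chrI", 100, 12), ("chrI", 110, 30), ("chrII", 50, 7)]
max_score = split_apply_combine(rows, key=lambda r: r[0],
                                func=lambda g: max(score for _, _, score in g))
print(max_score)
```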
[1] Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol (2008) vol. 9 (9) pp. R137
[2] Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841–842.
[3] Zheng et al. 2010 Genetic analysis of variation in transcription factor binding in yeast
Additionally, the R packages plyr, stringr, and RColorBrewer were used to analyze this data.