To start using PRI-CAT, please go to the
'Data uploading' tab and submit your data.
See our TUTORIAL
to learn how to use PRI-CAT; it is very easy.
If you just want to browse our DNA binding
maps, please use a DAS-compatible genome browser. We strongly recommend the
Integrated Genome Browser (IGB):
1) Launch IGB.
2) Choose 'configure' in the data access tab.
3) In the data sources tab, add 'PRI-CAT' to Name,
'http://www.ab.wur.nl/pricat/quickload' to URL, and 'quickload' to Type.
You will then have access to all
the DNA binding maps publicly available in PRI-CAT.
You can directly download the maps from here.
Remember that PRI-CAT is fully compatible with GALAXY,
so you can do genome-wide analysis of your data or any publicly available
dataset from PRI-CAT using GALAXY (see this
for more detailed information).
Please cite
Muino et al. NAR (2011)
if you use PRI-CAT.
Information about PRI-CAT:
· Motivation.
· Objectives.
· Primary analysis of plant ChIP-seq data.
· Input data format.
· How the data is analyzed.
· Visualizing DNA binding maps.
· Galaxy.
· Tutorial.
Motivation
How are genes transcriptionally regulated? How do
transcription factors (TFs) choose their targets? Why do some genes show an
enormous change in expression upon the binding of a TF, while other genes bound
by the same TF are apparently unaffected? Perhaps the second
region is highly methylated? Perhaps another TF is needed to modify the
expression of the gene? If you have ever wondered about these questions,
you will surely find something interesting among the tools and datasets
included in PRI-CAT.
PRI-CAT is an ambitious project which aims to be a
central place for the generation, analysis and visualization of DNA binding
maps in plant genomes. PRI-CAT analyzes the raw sequence data obtained by
next generation sequencers and provides the user with results consisting of a list
of binding sites and potential target genes, as well as the DNA binding map of
the analyzed protein. The results can be analyzed in combination with other
datasets available in PRI-CAT thanks to its compatibility with GALAXY. All
publicly available DNA binding maps can be visualized with any DAS-compatible
genome browser. At this moment PRI-CAT contains more than 25 binding maps
resulting from the reanalysis of publicly available ChIP-chip and ChIP-seq
experiments, as well as in silico predictions. We are committed to
reanalyzing more ChIP-seq data as they become available, but users are encouraged
to use their own raw data with PRI-CAT and release their results to the public,
or to send us their DNA binding maps obtained with other software.
The future success of PRI-CAT strongly depends on you,
in particular on how many users make their DNA binding maps, analyzed with
PRI-CAT or other tools, publicly available. Please make your DNA binding maps
publicly available after publication!
Objectives:
PRI-CAT was created with three clear objectives:
1) Provide a user-friendly environment for the primary analysis of plant ChIP-seq data.
2) Provide a central place for the storage of plant DNA binding maps.
3) Provide the possibility of advanced analysis combining other binding maps/genomic
information. At this moment, this is achieved through its compatibility with
GALAXY.
At this moment we mainly focus on Arabidopsis thaliana,
but we are planning to extend our web-tool to other plants. Arabidopsis lyrata
and Solanum lycopersicum will probably be the next plant genomes included in
PRI-CAT. We currently don't have plans to include non-plant organisms.
Primary analysis of plant ChIP-seq data.
Chromatin Immunoprecipitation (ChIP) combined with next
generation sequencing (ChIP-seq) is a formidable tool to generate precise and
accurate genome-wide DNA binding maps of proteins of interest. However, there
is a complete lack of web-tools for the analysis of this kind of experiment in
the plant field. The computational requirements of the algorithms used for this
type of analysis are usually beyond the computing power of the typical PC that a
biologist may have. For these reasons, we built our web-tool on our powerful
in-house computer servers, so users will not be limited by their own computer
resources. The simplicity of PRI-CAT is based on our extensive experience
analyzing plant ChIP-seq data [ref], which helped us identify which
steps of the analysis need special attention from the user, and
which ones can be completely automated.
We implemented PRI-CAT with the biologist in mind as the
final user. Users need to enter biological information into the system,
for example the average size of the DNA fragments submitted to sequencing,
but all other statistical parameters of the analysis are optimized by
PRI-CAT. Users are advised to check the quality of their sequenced libraries with
the graphical output generated by PRI-CAT.
The DNA binding maps generated by PRI-CAT report score
values based on the ratio between IP and control samples after statistical
normalization. Ratios are more suitable for a straightforward comparison of
different experiments than Poisson-based scores, since they are less dependent
on the statistical power of the test (e.g., number of replicates or
sequencing depth). However, users should be careful when comparing experiments
that use different types of controls (e.g., input DNA, IP on mutants, ...) and when
the homogeneity of the samples is very different. Plant ChIP experiments are
much more challenging than their animal counterparts since they are usually not
based on cell culture. Therefore, plant samples represent complex mixtures of
cell types (non-homogeneous samples) and this should be taken into account when
comparing different ChIP-seq analysis results. PRI-CAT uses the same algorithms
(see below) for all the data that it analyzes, which also facilitates the
direct comparison of different experiments.
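To illustrate why a ratio score is less sensitive to sequencing depth, here is a minimal sketch in Python. It is not the normalization used by PRI-CAT or CSAR: the total-count scaling, the function name `ratio_scores` and the `control_floor` parameter are assumptions made only for this example.

```python
def ratio_scores(ip_cov, control_cov, control_floor=10.0):
    """Per-position IP/control ratio after a simple depth normalization.

    Hypothetical illustration: the control is rescaled so that its total
    coverage matches the IP sample, and low control values are raised to
    `control_floor` to avoid dividing by (nearly) zero.
    """
    if len(ip_cov) != len(control_cov):
        raise ValueError("IP and control must cover the same positions")
    ip_total = sum(ip_cov)
    control_total = sum(control_cov)
    # Assumed normalization: scale control to the IP sequencing depth.
    scale = ip_total / control_total if control_total else 1.0
    scores = []
    for ip, ctrl in zip(ip_cov, control_cov):
        ctrl_norm = max(ctrl * scale, control_floor)
        scores.append(ip / ctrl_norm)
    return scores


ip = [20, 40, 0, 40]
control = [10, 10, 10, 70]
deeper_control = [2 * c for c in control]  # same library sequenced twice as deep

shallow = ratio_scores(ip, control)
deep = ratio_scores(ip, deeper_control)
```

In this toy example `shallow` and `deep` are identical, because the rescaling removes the effect of the deeper control library. A count-based test statistic would instead become more extreme as depth increases.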
Users are encouraged to make their raw data publicly
available, especially their control samples, which are as expensive to generate
as IP samples. Control samples (e.g., input DNA) usually don't contain sensitive
information. However, they can be used by other users to analyze their IP
samples without generating an expensive control of their own. PRI-CAT allows
users to analyze their own raw sequence data in combination with the data
publicly available on our server.
Input data format.
PRI-CAT accepts input data in FASTA or FASTQ format. The
files should be compressed before submitting them to the server. Accepted
compression formats include .zip, .tar, .gz and .tar.gz. Uploads are
limited to 3 GB. This usually represents 2-3 ChIP-seq
experiments, including controls. Users who want to analyze a larger
group of files are advised to contact us directly for a more efficient way
to transmit the data. Sequencing file formats provided by genomic center
facilities and private sequencing companies can usually be used directly in
PRI-CAT.
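As a convenience, the packaging step described above could be scripted before uploading. The sketch below uses only the Python standard library. The function name `pack_for_upload` and the file names are hypothetical, and treating 3 GB as 3 * 1024**3 bytes is an assumption, since the exact definition of the limit is not stated.

```python
import os
import tarfile

# Upload limit stated above; the exact byte count is an assumption.
MAX_UPLOAD_BYTES = 3 * 1024**3


def pack_for_upload(paths, archive_path):
    """Bundle FASTA/FASTQ files into one .tar.gz archive.

    Returns (archive_size_in_bytes, fits_within_limit).
    """
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in paths:
            # Store only the file name, not the full local directory path.
            archive.add(path, arcname=os.path.basename(path))
    size = os.path.getsize(archive_path)
    return size, size <= MAX_UPLOAD_BYTES
```

For example, `pack_for_upload(["ip_sample.fastq", "control_sample.fastq"], "upload.tar.gz")` would create a single archive and report whether it is small enough to submit through the 'Data uploading' tab.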
How the data is analyzed.
The sequencing files submitted by the user are mapped to
the appropriate genome using SOAP2 (http://soap.genomics.org.cn/),
allowing a maximum of two mismatches. Sequences that map to multiple positions,
or to mitochondrial or chloroplast regions, are discarded. Enrichment is detected using
CSAR, an R package specially developed to analyze plant ChIP-seq data. The parameters used were:
For SOAP2: -r 0 -n 20 -l 30 and default parameters
For CSAR: backg=10 norm=-1 test='Ratio' times=1e6 digits=2 considerStrand='Minimum' npermutations=5 and default parameters
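The read-filtering rule described above (discard reads that map to multiple positions or to organellar genomes) can be sketched as follows. This is only an illustration and not the PRI-CAT code: the record layout (a dictionary with `hits` and `chrom` keys) is hypothetical, and the organelle names `ChrM` and `ChrC` are an assumption about how the reference sequences are labeled.

```python
# Assumed labels for the mitochondrial and chloroplast sequences.
ORGANELLES = {"ChrM", "ChrC"}


def keep_read(hit_count, chrom):
    """Keep only uniquely mapped reads on nuclear chromosomes."""
    return hit_count == 1 and chrom not in ORGANELLES


def filter_alignments(alignments):
    """Apply keep_read to a list of hypothetical alignment records."""
    return [a for a in alignments if keep_read(a["hits"], a["chrom"])]


example = [
    {"id": "r1", "hits": 1, "chrom": "Chr1"},  # unique, nuclear: kept
    {"id": "r2", "hits": 3, "chrom": "Chr2"},  # multiple positions: discarded
    {"id": "r3", "hits": 1, "chrom": "ChrC"},  # chloroplast: discarded
    {"id": "r4", "hits": 1, "chrom": "ChrM"},  # mitochondria: discarded
]
kept = filter_alignments(example)
```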
We chose SOAP2
because it maps short next generation sequences with high accuracy (see
Table 1), and because of its speed.
| Algorithm | Mapped reads (no mutations) | Mapped reads (mutations) | Falsely mapped reads (no mutations) | Falsely mapped reads (mutations) |
|---|---|---|---|---|
| SOAP2 | 92.14% | 91.92% | 0.0002% | 0.0065% |
| BWA | 92.14% | 91.92% | 0.0000% | 0.0063% |
| Bowtie | 84.08% | 83.93% | 0.0001% | 0.0014% |
| Maq | 85.99% | 85.86% | 0.0001% | 0.0012% |
Table 1. Results of the mapping process for several
algorithms. Sequences of 36 bp length were randomly generated from the
Arabidopsis genome (TAIR9) using the Maq program, with simulated
sequencing errors (`mutations`) or without them (`no mutations`). Table 1 summarizes
the results as the average over 5 simulations.
CSAR was implemented specifically for the analysis of plant ChIP-seq data (see Nature
Protocols and Plant Methods). It was developed to be robust against PCR artifacts.
Taking the average size of DNA fragments subjected to sequencing into account,
the software calculates single-nucleotide read-enrichment values. After
normalization, sample and control are compared using a test based on the ratio
or the Poisson distribution. Test statistic thresholds to control the false
discovery rate are obtained through random permutations. CSAR shows better
performance analyzing plant ChIP data than the other software we compared (see Fig
2).
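The idea of single-nucleotide read-enrichment values can be sketched as follows: each read is extended from its 5' end to the average fragment size, and the number of extended reads overlapping each position is counted. This is a simplified illustration in Python, not the CSAR implementation (CSAR is an R package); the function name, the 0-based coordinates and the input layout are assumptions.

```python
def extended_coverage(reads, chrom_length, fragment_size):
    """Count extended reads overlapping each position of one chromosome.

    reads: list of (five_prime_pos, strand) with 0-based positions and
    strand '+' or '-'. For '-' reads the 5' end is the rightmost base,
    so the fragment is extended to the left.
    """
    # Difference array: +1 where a fragment starts, -1 just after it ends.
    diff = [0] * (chrom_length + 1)
    for pos, strand in reads:
        if strand == "+":
            start, end = pos, pos + fragment_size
        else:
            start, end = pos - fragment_size + 1, pos + 1
        # Clip fragments that run past the chromosome ends.
        start = max(start, 0)
        end = min(end, chrom_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    coverage = []
    running = 0
    for delta in diff[:chrom_length]:
        running += delta
        coverage.append(running)
    return coverage


toy_cov = extended_coverage([(2, "+"), (7, "-")], chrom_length=10, fragment_size=4)
# toy_cov == [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]
```

In the toy example the two extended reads overlap at positions 4 and 5, which is where a binding event between a forward and a reverse read would be expected to produce the highest enrichment value.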

Figure 2. ChIP-seq method comparison. (A)
Enrichment of CArG boxes (the known SEP3 binding motif) in peak areas detected by
the different methods. Libraries representing a SEP3 ChIP-seq experiment
were reanalyzed with the default options of each method. For
comparison purposes, all scores reported by the different methods were
transformed into rank scores, with 0 being the most significant peak for each
method. (B) Proportion of peaks detected by the different methods with at least
one differentially expressed gene. An AP1 ChIP-seq experiment
was reanalyzed with the default options of each method. The list of
genes whose expression is affected by AP1 was downloaded from here; we used the list
denoted “Agilent and_or Operon_BH-0h”.
Visualizing DNA binding maps.
The output of PRI-CAT includes a list of binding
sites and potential target genes of the protein of interest in an Excel-compatible
format (.csv). At this moment each output has a
unique ID so that it can be properly referenced; however, we are considering changing its
format in the coming months. We are contacting TAIR to find a proper way to
create stable IDs for each output that can be referenced in a similar way to
GEO IDs. We strongly recommend that users visualize their results in a genome
browser, in particular in combination with other genomic maps included in
PRI-CAT. DNA conservation, CpG islands, or the binding of other TFs can be
extremely useful for interpreting the user's results. We strongly encourage the use
of the Integrated Genome Browser (IGB);
it is fully compatible with the DAS server where we store all our maps. It
also presents striking advantages compared with other browsers, for example its
speed, the low computing power it requires on the server side, and the possibility to apply
statistical transformations and small analyses to the maps on the
fly. The possibility to locate DNA sequence motifs on the fly is also an
extremely useful tool for interpreting DNA binding data.
Galaxy
GALAXY is a framework for tool and genomic data
integration. PRI-CAT is fully compatible with GALAXY, and users can
interactively download their desired PRI-CAT datasets directly from GALAXY. For
more help regarding GALAXY, please see this.
At this moment, we have a GALAXY instance running on our
servers to exemplify the power of PRI-CAT when it is combined with GALAXY. If you
want to work with your own GALAXY instance, please download and install this tool in your GALAXY instance (see the readme
inside for installation instructions).
Enjoy PRI-CAT!!!
On behalf of the PRI-CAT team,
Dr. Jose M Muino