Galaxy CLIP-Explorer

Welcome to the Galaxy CLIP-Explorer – a webserver to process, analyse and visualise CLIP-Seq data.

1. Getting Started with Galaxy CLIP-Explorer

Are you new to Galaxy? Are you returning after a long time and looking for help to get started? Then take a guided tour through the user interface of Galaxy.

Do you have CLIP-Seq data but need some guidance for the analysis? Take a look at the CLIP-Seq data analysis tutorial on the Galaxy Training Network, where you can analyse CLIP-Seq data of RBFOX2 from human liver cancer cells (Hep G2). The tutorial will help you to understand the analysis steps and the most important parameters and tools that are used in CLIP-Explorer.

The underlying workflow of the tutorial can be found here.

We recommend following the FastQC tutorial for quality checks and the IGV tutorial for data inspection.

The Galaxy Training Network tutorial uses eCLIP data from human liver cancer cells (Hep G2), which is hosted on Zenodo:

Galaxy CLIP-Explorer can process large eCLIP and iCLIP datasets. We processed eCLIP data with around 20 million reads from Nostrand et al. (2016). CLIP-Explorer can handle multiplexed and de-multiplexed eCLIP and iCLIP data in FASTQ and FASTA format.

2. Galaxy CLIP-Explorer – Many Possibilities

(A) Galaxy CLIP-Explorer workflows and tools. (B) Output of multiBamSummary and plotCorrelation comparing two biological replicates of a CLIP-Seq experiment and one control sample. (C) Output of plotFingerprint showing the read coverage for the CLIP-Seq and control samples. (D) Output of CollectInsertSizeMetrics estimating the insert size for the read libraries. (E) Output of FastQC showing the duplication levels of the read libraries. (F) Sequence motifs found by MEME-ChIP (DREME and MEME) in potential binding regions (peaks) obtained from a peak caller such as PEAKachu, Piranha or PureCLIP. (G-I) Example output of RCAS (RNA Centric Annotation System): (G) the binding coverage for the transcript and the 5’ and 3’ UTR, (H) the binding coverage around the exon-intron boundaries, and (I) a target distribution plot showing which kinds of RNA the protein of interest predominantly binds to.

3. Workflows

Use the following workflows for an automated analysis of iCLIP and eCLIP data. The data needs to be in FASTA or FASTQ format and can be either multiplexed or de-multiplexed. All workflows, except the robust peak analysis, require the data as a list of dataset pairs. A tutorial to create a list of dataset pairs can be found in the CLIP-Seq data analysis tutorial or here. Please keep in mind that all workflows need additional input files from the user.

3.1 Quick Example Run

If you want to do a quick run with example data, download this example eCLIP data of RBFOX2 and run the workflow of the CLIP-Seq training material from the Galaxy Training Network. Alternatively, use the workflow for the eCLIP data of Nostrand et al. (2016). Keep in mind that you have to provide the input data as a list of dataset pairs. A tutorial to create a list of dataset pairs can be found in the CLIP-Seq data analysis tutorial or here.

3.2 From scratch to de-multiplexed FASTQ files

If your data is not de-multiplexed yet, use the workflows in this section. You have to provide the in-line barcodes in a tab-delimited tabular format (sample name, then barcode), for example:

rep1 TTAG
rep2 TGGC
rep3 TTAA

The raw data needs to be in FASTA or FASTQ format as a list of dataset pairs.
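
Before uploading the barcode table, it can be worth checking that it really is tab-delimited and free of duplicates. The following is a minimal sketch in Python, not part of CLIP-Explorer; the function name `parse_barcodes` is hypothetical, and the only assumption taken from the text above is one sample name and one barcode per line, separated by a tab.

```python
import csv
import io


def parse_barcodes(text):
    """Parse a tab-delimited barcode table into {sample_name: barcode}."""
    barcodes = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated columns, got {row!r}")
        name, barcode = row[0].strip(), row[1].strip().upper()
        if set(barcode) - set("ACGTN"):
            raise ValueError(f"non-nucleotide character in barcode {barcode!r}")
        if name in barcodes or barcode in barcodes.values():
            raise ValueError(f"duplicate sample name or barcode: {row!r}")
        barcodes[name] = barcode
    return barcodes


# The example table from above, with explicit tab characters.
example = "rep1\tTTAG\nrep2\tTGGC\nrep3\tTTAA\n"
print(parse_barcodes(example))
```

A table that uses spaces instead of tabs is rejected by this check, which is a common cause of a de-multiplexing step that silently finds no barcodes.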

3.3 From scratch with de-multiplexed FASTQ files

You can choose between three different types of peak calling for the data analysis of eCLIP and iCLIP data. The data specification of each of the peak calling algorithms is listed below:

Table 1: Data specification of the different peak calling algorithms.

Tool	Biological Replicates (Yes/No)	Control Data (Yes/No)
PEAKachu	Yes	Yes
PureCLIP	No	Yes
Piranha	No	No

Note if you have used the de-multiplexing workflows:

If you used the preceding workflows for de-multiplexing, remove the Cutadapt and UMI-tools extract steps from the following workflows before analysing your data. Simply import the workflow into your account, remove the tools and connect the loose end directly to the alignment step.

Note if you use eCLIP data of Nostrand et al. (2016):

The workflow for the eCLIP data of Nostrand et al. (2016) was used to analyse the data of RBFOX2. Be careful when using other data from the study of Nostrand et al. (2016), because the size of the unique molecular identifier (UMI) can be different. The workflow is set to a UMI of five nucleotides. You can change this by importing the workflow into your account and amending the parameter Cut bases from reads before adapter trimming of the second Cutadapt step for the CLIP and control data.

eCLIP

iCLIP

3.4 Further optional peak analysis

The following workflow can be used if you have picked a peak calling algorithm that does not support biological replicates. The workflow finds and analyses robust binding regions shared between different peak files.

Robust peak analysis

4. Remarks

Please follow the CLIP-Seq data analysis tutorial for a deeper understanding of the tools of CLIP-Explorer.

4.1 Changing Workflows

You can change the workflows at any time. Simply import the workflow into your account and change the necessary tools or tool parameters.

4.2 Adapter sequences

The workflows use Cutadapt to remove standard eCLIP and iCLIP adapter sequences. You need to change the Cutadapt parameters if your read library contains other adapter sequences. Cutadapt cannot automatically detect standard Illumina or other standard adapters; you have to provide the sequence.

4.3 UMI and in-line barcodes

The workflows use Cutadapt to trim off the length of the UMI (+ barcode) from one side of the read pair. The length depends on whether you follow the iCLIP protocol, the eCLIP protocol or your own protocol. Please check or change the parameter in Cutadapt based on your UMI and in-line barcode. For more information follow the CLIP-Seq data analysis tutorial.
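
The effect of trimming a fixed length can be pictured with a minimal Python sketch. It is only an illustration, not how the workflow runs Cutadapt; the function name `cut_bases` and the example read are hypothetical. The sign convention (a positive length trims the start of the read, a negative length trims the end) is assumed to match Cutadapt's -u/--cut option.

```python
def cut_bases(seq, qual, length):
    """Remove a fixed number of bases (and their qualities) from one end of a read.

    length > 0 trims the 5' end, length < 0 trims the 3' end,
    and length == 0 leaves the read unchanged.
    """
    if length >= 0:
        return seq[length:], qual[length:]
    return seq[:length], qual[:length]


# Hypothetical read that starts with a 5-nt UMI.
seq, qual = "ACGTAGGCTTAACC", "IIIIIHHHHHHHHH"
print(cut_bases(seq, qual, 5))   # ('GGCTTAACC', 'HHHHHHHHH')
```

If your UMI plus in-line barcode is, for example, ten nucleotides long, the value to set is 10 rather than 5.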

CLIP-Explorer uses UMI-tools extract to find the UMIs inside your reads. Change the pattern of UMI-tools extract based on your read library preparation.
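
To make the idea of an extraction pattern concrete, here is a simplified Python model of a string pattern applied to the 5' end of a read. It is a sketch under stated assumptions, not the UMI-tools implementation: `extract_umi` is a hypothetical name, only the pattern characters N (UMI base) and X (base that stays in the read) are handled, and the UMI is appended to the read name after an underscore.

```python
def extract_umi(read_name, seq, qual, pattern="NNNNN"):
    """Move the bases marked 'N' in `pattern` from the read start into the read name.

    'N' marks a UMI base; 'X' marks a base that is kept in the read.
    Other pattern characters are not supported in this sketch.
    """
    if set(pattern) - set("NX"):
        raise ValueError(f"unsupported pattern {pattern!r}")
    if len(seq) < len(pattern):
        raise ValueError("read is shorter than the pattern")
    umi = "".join(base for base, p in zip(seq, pattern) if p == "N")
    kept = [i for i, p in enumerate(pattern) if p == "X"]
    new_seq = "".join(seq[i] for i in kept) + seq[len(pattern):]
    new_qual = "".join(qual[i] for i in kept) + qual[len(pattern):]
    return f"{read_name}_{umi}", new_seq, new_qual


# Hypothetical read with a 5-nt UMI at its start.
print(extract_umi("read1", "ACGTATTTGGG", "IIIIIHHHHHH"))
# ('read1_ACGTA', 'TTTGGG', 'HHHHHH')
```

A library with a longer UMI would need a correspondingly longer run of N characters in the pattern.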

4.4 Read alignment

We use STAR for the read alignment. STAR combines genome and transcriptome data. CLIP-Explorer focusses only on uniquely mapped reads. Furthermore, STAR is executed with soft-clipping turned off. For more information follow the CLIP-Seq data analysis tutorial.

You can replace STAR with any other read mapper by importing the corresponding workflow into your account. To assess the mapping quality, look at the MultiQC report.

STAR has many parameters, and we recommend leaving them at their defaults. However, STAR may report a large number of reads as unmapped because they are too short. In that case you might want to lower the two parameters Minimum alignment score, normalized to read length (--outFilterScoreMinOverLread), and Minimum number of matched bases, normalized to read length (--outFilterMatchNminOverLread).
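
Why lowering these two parameters rescues "too short" reads can be seen in a small Python model of the two normalised filters. This is a simplified sketch, not STAR's code; the function name and the example numbers are hypothetical. The default of 0.66 for both parameters is taken from the STAR manual, where the read length of a paired-end read is the sum of both mates' lengths.

```python
def passes_star_length_filters(score, matched_bases, read_length,
                               score_min_over_lread=0.66,
                               match_nmin_over_lread=0.66):
    """Return True if an alignment passes both normalised STAR filters.

    read_length is the total read length (sum of both mates for paired-end data).
    """
    return (score >= score_min_over_lread * read_length
            and matched_bases >= match_nmin_over_lread * read_length)


# Hypothetical pair of 2 x 50 nt mates of which only 45 bases align.
print(passes_star_length_filters(45, 45, 100))            # False
print(passes_star_length_filters(45, 45, 100, 0.3, 0.3))  # True
```

Lower thresholds let shorter alignments through, so it is worth re-checking the mapping quality in the MultiQC report after changing them.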

4.5 Peak calling with PEAKachu, PureCLIP, and Piranha

PEAKachu

You need to specify the insert size of your paired-end reads for PEAKachu. To get an estimate for that parameter, check the output image of CollectInsertSizeMetrics.

The three parameters Mad Multiplier (default 2.0), Fold Change Threshold (default 2.0), and Adjusted p-value Threshold (default 0.05) are the primary filters to select significant peaks. Start with the defaults, then adjust them based on your question.
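
As a rough illustration of how two of these thresholds act together, the Python sketch below filters a list of peaks by fold change and adjusted p-value. It is not PEAKachu's implementation: the record keys `fold_change` and `padj`, the example values and the function name are all hypothetical, the fold change is assumed to be on a linear (not log2) scale, and the Mad Multiplier is left out.

```python
def select_significant(peaks, fc_threshold=2.0, padj_threshold=0.05):
    """Keep peaks whose fold change and adjusted p-value pass both thresholds."""
    return [
        p for p in peaks
        if p["fold_change"] >= fc_threshold and p["padj"] <= padj_threshold
    ]


# Hypothetical peaks, not real PEAKachu output.
peaks = [
    {"id": "peak1", "fold_change": 4.1, "padj": 0.001},
    {"id": "peak2", "fold_change": 1.5, "padj": 0.020},
    {"id": "peak3", "fold_change": 3.0, "padj": 0.200},
]
print([p["id"] for p in select_significant(peaks)])  # ['peak1']
```

Raising the fold change threshold or lowering the adjusted p-value threshold gives fewer but more confident peaks; relaxing them does the opposite.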

PureCLIP

PureCLIP works best with only one mate of the paired-end reads, namely the one where the cross-linking event occurs. Thus, CLIP-Explorer filters out the other mate before the peak calling. Remove the Bam filter tool to disable this behavior, or change Bam filter to pick the correct mate.

Important parameters for PureCLIP are the Bandwidth for kernel density estimation used to access enrichment (-bw) and the Bandwidth for kernel density estimation used to estimate n for binomial distributions (-bwn). Choose these two parameters carefully, because they control the fitting of the model. Decreasing them results in overfitting.

If PureCLIP does not finish because of a memory error, or if PureCLIP takes too long, try learning the model on just a few chromosomes of the reference. See the parameter Genomic chromosomes to learn HMM parameters (-iv).

Piranha

For CLIP-Seq data, Piranha works best with a zero-truncated negative binomial (default) or with a negative binomial distribution. The choice of distribution has a strong influence on the results. You can change it under Select distribution type (-d).

Another important parameter is Indicates that input is raw reads and should be binned into bins of this size (-b), which controls the fitting of the data. Decreasing this parameter results in overfitting; a value around 50 is a good baseline. The parameter Merge significant bins within certain distance? (-u) also controls overfitting. Set it to No to keep more detail. Set it to Yes and give it a value greater than 0 to merge peaks that lie very close together. Also set the Significance threshold for sites (-p) to 0.05.
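
The two ideas behind the bin size and the merge distance can be sketched in a few lines of Python. This is an illustration only, not Piranha's implementation; the function names are hypothetical, positions are read start coordinates on a single chromosome, and bins are written as 0-based, half-open intervals.

```python
from collections import Counter


def bin_counts(positions, bin_size=50):
    """Count read start positions per fixed-size bin; keys are bin start coordinates."""
    return Counter((pos // bin_size) * bin_size for pos in positions)


def merge_bins(bins, max_gap):
    """Merge (start, end) intervals that lie at most max_gap bases apart."""
    merged = []
    for start, end in sorted(bins):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical read starts and significant bins.
print(dict(bin_counts([3, 10, 49, 50, 120])))
# {0: 3, 50: 1, 100: 1}
print(merge_bins([(0, 50), (50, 100), (200, 250)], max_gap=10))
# [(0, 100), (200, 250)]
```

Smaller bins follow the data more closely and so fit noise more easily, while a larger merge distance joins neighbouring bins into fewer, broader peaks.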

4.6 Extension of the binding regions

CLIP-Explorer uses SlopBED to extend the peaks by a few base pairs to the left and right, because the peak calling algorithms tend to underestimate the binding regions. For more information follow the CLIP-Seq data analysis tutorial. Remove the tool or change the parameter of SlopBED to change this behavior.
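
What the extension does to a single peak can be shown with a minimal Python sketch. It is not the SlopBED code; the function name and the example coordinates are hypothetical. Intervals are assumed to be 0-based and half-open, as in BED files, and the result is clipped to the chromosome boundaries.

```python
def extend_interval(start, end, flank, chrom_length):
    """Extend a BED interval by `flank` bases on both sides, clipped to the chromosome."""
    return max(0, start - flank), min(chrom_length, end + flank)


# Hypothetical peaks on a chromosome of length 1000, extended by 20 bases.
print(extend_interval(100, 130, 20, 1000))  # (80, 150)
print(extend_interval(5, 30, 20, 1000))     # (0, 50)
print(extend_interval(980, 995, 20, 1000))  # (960, 1000)
```

A larger flank captures more sequence context for the motif search but also adds more non-bound sequence to each region.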

Our Data Policy

Registered Users

User data on UseGalaxy.eu (i.e. datasets, histories) will be available as long as they are not deleted by the user. Once marked as deleted, the datasets will be permanently removed within 14 days. If the user "purges" a dataset in Galaxy, it will be removed immediately and permanently. An extended quota can be requested for a limited time period in special cases.

Unregistered Users

Processed data will only be accessible during one browser session, using a cookie to identify your data. This cookie is not used for any other purposes (e.g. tracking or analytics). If the UseGalaxy.eu service is not accessed for 90 days, those datasets will be permanently deleted.

GDPR Compliance

The Galaxy service complies with the EU General Data Protection Regulation (GDPR). You can read more about this on our Terms and Conditions.