Analysing differential expression with RNA-seq data

We have gathered all of our learning material for RNA-seq data analysis on this page to help you get started!
If RNA-seq is a completely new topic to you, we suggest watching the introduction to RNA-seq webinar (1h), which introduces the steps of differential expression analysis. The video also covers the basics of using Chipster.

After the webinar, you might like to try the analysis tools yourself. For that we suggest going through our course exercises (listed below). Open the exercises PDF document, log in to Chipster, and just follow the instructions!

Some parts of the analysis can raise more questions. We have tried to clarify those steps with more detailed instructions and tutorials, which are linked on this page.

Course material

Please find below the links to our RNA-seq course materials.
Course example sessions are available in Chipster (File -> Open example session -> "course_RNAseq...").

Tutorial videos

We have an RNA-seq playlist on our YouTube channel. The playlist includes tutorials for some common analysis needs:

Getting started with Chipster

Importing your data to Chipster

Analysis steps for finding differentially expressed genes

These steps take you from raw reads to differentially expressed genes (please find a figure showing the different steps and file formats at the end of the page):
  1. Quality control of raw reads
    Check the number, quality encoding, strandedness and inner distance of your reads, and inspect them for base quality, biases and adapters.
    Tools under Quality control category (input FASTQ files):

  2. Preprocessing raw reads
    If the reads contain low quality bases or adapter sequences, you might like to trim or filter them. Note that when preprocessing paired end data, you need to give the two read files simultaneously to the analysis tool in order to preserve the order of read pairs.
    Trimming or filtering is not absolutely necessary, because many bad quality reads are either removed or dealt with when aligning reads to the reference genome in the next step. This depends on the aligner used, as aligners differ in their ability to cope with mismatching bases.
    Tools under Preprocessing category (input FASTQ files):
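
As an aside to the paired end note in step 2: the minimal Python sketch below illustrates why both read files have to be processed together. It is not how the Chipster tools are implemented. The function names, the threshold of 20 and the example reads are hypothetical, and Phred+33 quality encoding is assumed. A pair is kept only if both mates pass, so the two outputs stay in the same order.

```python
def mean_quality(quality_string, offset=33):
    """Mean Phred score of one read, assuming Phred+33 quality encoding."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def filter_pairs(read1_records, read2_records, min_mean_quality=20):
    """Keep a read pair only if BOTH mates pass the quality threshold.

    Each record is a (name, sequence, quality) tuple. The two lists must be
    in the same order, so that record i in both lists belongs to the same pair.
    """
    kept1, kept2 = [], []
    for mate1, mate2 in zip(read1_records, read2_records):
        if (mean_quality(mate1[2]) >= min_mean_quality
                and mean_quality(mate2[2]) >= min_mean_quality):
            kept1.append(mate1)
            kept2.append(mate2)
    return kept1, kept2


# Hypothetical example reads: the second mate of pair2 has very low quality.
read1 = [("pair1/1", "ACGT", "IIII"), ("pair2/1", "ACGT", "IIII"), ("pair3/1", "ACGT", "IIII")]
read2 = [("pair1/2", "TTGA", "IIII"), ("pair2/2", "TTGA", "####"), ("pair3/2", "TTGA", "IIII")]

kept1, kept2 = filter_pairs(read1, read2)
print([record[0] for record in kept1])
print([record[0] for record in kept2])
```

If the two files were filtered independently, only the read 2 file would lose pair2, and every later read would be matched with the wrong mate.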
  3. Alignment of reads to reference genome
    Chipster offers a large selection of reference genomes. If your reference genome is not available, you can import it as a fasta file and use the "own genome" version of the alignment tool. You can also ask us to add the genome in Chipster if it is publicly available and commonly used.
    Note that some sequencing platforms generate several FASTQ files per sample. For example, Illumina NextSeq generates 8 files per sample for paired end data. In this case you need to first generate file name lists (one for read 1 files and another for read 2 files), and give those list files and all the FASTQ files as input for the aligner. Please read the manual for the Utilities/Make a list of file names tool for more info.
    Note that aligners need strandedness / library type information in order to align reads correctly. As there are two possible strandedness types and several kits for producing stranded sequencing libraries, we have made a summary of strandedness options and nomenclature.
    Tools under Alignment category (input FASTQ files; note that there are separate TopHat tools for single and paired end data, and for a user-supplied reference genome):
  4. Alignment level quality control:
    Check what proportion of the reads mapped to exons, whether the coverage is uniform over the transcript length, and whether novel splice sites were found.
    Tools under Quality control category (input BAM files):
  5. Quantitation (for BAM files):
    Next we want to count the reads per gene/transcript. The quantitation is done separately for each BAM file. At this step you also need to report the library type / strandedness, so make sure you choose the correct parameter!
    Tools (under RNA-seq category): HTSeq, DEXSeq

  6. Define NGS experiment (for the count files):
    Now we need to combine all the count files (samplename.tsv) into one table (ngs-data-table.tsv), where rows represent genes and columns represent samples. For this, select all the blue count files and run the tool Utilities/Define NGS experiment.
    This tool generates the table and a phenodata file. The phenodata file is your way to describe your samples in Chipster: your next task is to fill in the "group" column and maybe add some more columns that describe which of your samples are "controls" or "treatments", which come from the same batch or patient, and so on. You also want to make sure the "Description" column is, well, descriptive, and that the titles there are short enough, as these are used by many visualisation tools. There are a few things to keep in mind when filling in the phenodata file, so it might be advisable to check out our video tutorial (3 min) and/or the manual page on how to fill in the phenodata file.
    Tool (under Utilities category): Define NGS experiment
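
To make the shape of the combined table in step 6 concrete, here is a minimal Python sketch of merging per-sample count files into one genes-by-samples table. It is only an illustration and not the Define NGS experiment tool itself. The function names, sample names and counts are hypothetical, and the count files are assumed to have two tab-separated columns (gene id, read count).

```python
import csv
import io


def read_counts(handle):
    """Read a two-column, tab-separated count file: gene id, read count."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and any header line whose second column is not a number.
        if len(row) < 2 or not row[1].isdigit():
            continue
        counts[row[0]] = int(row[1])
    return counts


def build_count_table(per_sample_counts):
    """Combine {sample: {gene: count}} into a header row plus one row per gene."""
    samples = list(per_sample_counts)
    genes = sorted(set().union(*(c.keys() for c in per_sample_counts.values())))
    table = [["gene"] + samples]
    for gene in genes:
        # Assumption: a gene missing from a sample's file is treated as zero reads.
        table.append([gene] + [per_sample_counts[s].get(gene, 0) for s in samples])
    return table


# Hypothetical in-memory stand-ins for two samplename.tsv files.
files = {
    "control1": io.StringIO("geneA\t10\ngeneB\t0\n"),
    "treated1": io.StringIO("geneA\t25\ngeneB\t3\n"),
}

table = build_count_table({name: read_counts(fh) for name, fh in files.items()})
for row in table:
    print("\t".join(str(value) for value in row))
```

The values in the table stay as raw counts, which is what the later quality control and differential expression steps expect.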

  7. HINT: If your FASTQ and/or BAM files are big, at this point it might be wise to make sure your session is saved, then remove the large files from your session, and save the session again with another name. It makes your session smaller and easier to handle! You can always return to the earlier session with the bigger files if you notice a need for that later on. Note that the cloud sessions are not stored forever, so for longer storage, save the sessions (also) locally! If the sessions are huge, it might be a good idea to store them in Taito.

  8. Experiment level quality control (for the count table):
    Now we have our data in one table, and it is time to do some experiment level quality control. This is the exciting part where you get to see whether your samples really differ in expression!
    Note: the tool expects you to use raw counts (in the count table), so don't do any normalisation!
    Check from the PCA plot and heatmap that the samples cluster as you would expect (controls in one group and so on) and that there are no outliers.
    Now you can also see whether there are batch effects lurking in your data (see the Drosophila example session for what you should be looking for). If you notice something, make sure you take it into account in the statistical testing / modeling phase!
    Tool (under Quality control category): PCA and heatmap of samples using DESeq2

  9. Differential expression analysis (for the count table):


    Time for statistics! Estimating differential expression is actually fairly tricky for a few reasons: we are usually testing thousands or tens of thousands of genes (so multiple testing correction is needed), the expression values are not normally distributed (instead, negative binomial distributions and generalised linear modeling are used), and the range in which a gene's expression values vary differs from gene to gene (dispersion estimation). Luckily, there are tools that do all these tricky things for you.
    These tools are presented in our video tutorial:
    Differential expression analysis tools for RNA-seq tutorial video (3 min)
    Things get a bit trickier if you have multiple variables, such as treatment, gender and batch, to take into account simultaneously. For these cases, you need to use the tool called edgeR for multivariate experiments.
    Check out the video tutorial for that here (6 min),
    and a specific tutorial for cases where you need to use the "nested" option here (4 min).



    Note: the tools expect raw counts (in the count table) as input, so DON'T do any normalisation and don't use FPKM values or similar!
    Tools (under RNA-seq category): DESeq2, edgeR, edgeR for multivariate analysis, DEXSeq
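
To give a feel for the multiple testing correction mentioned in step 9, here is a minimal Python sketch of the Benjamini-Hochberg adjustment, one widely used correction. It is an illustration only: the tools above compute adjusted p-values for you. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, so that an adjusted
    # value never exceeds the adjusted value of a larger raw p-value.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
for p, q in zip(raw, adjusted):
    print(f"raw p = {p:.3f}, adjusted p = {q:.3f}")
```

With thousands of genes tested, the adjusted value is the one to apply the significance cutoff to, because many raw p-values fall below a cutoff by chance alone.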

  10. Visualising your data:
    Video tutorial for interactive visualisations (7 min)