Understanding RNA-Seq Analysis, a Guide for Biologists
![]() |
Many researchers use RNA-Seq in their study. Some basic understanding of the data analysis process will help you better evaluate your data and get the most out of the results. This guide is written based on our experience working with real bench scientists to answer specific biological questions using RNA-Seq data. BioInfoRx provides service to analyze RNA-Seq data for all steps. Contact us if you need help and mention this blog to get free analysis for your first sample. |
|
Step 1. Basic Data Analysis Procedure There are many different pipelines to analyze RNA-Seq data, and your core facility, service providers or collaborators probably will use one of their favorite pipelines to process the raw data for you. We recommend the TopHat2-HTSeq-EdgeR pipeline, which is well-documented (see Nature Protocol) and has very robust performance. This pipeline is ideal to identify differentially expressed genes. Note this pipeline works on gene level, but it does not give you transcript isoform information. There are tools like Cufflinks and RSEM which are designed to detect isoforms and their differentially expression. However, there are still a lot of debates on the methodology and accuracy of isoform detection from RNA-Seq data. For most researchers, we recommend using the gene level results. |
|
Step 2. Quality Assessment of Your Data It is important to avoid garbage in, garbage out when processing genomic data. Here are a few QC you should be aware with your RNA-Seq data to ensure reliable and accurate results. |
|
Raw Data Quality First, you need to make sure that the raw sequencing reads are of high quality. We often use FastQC to perform QC checks on raw fastq files. Here is an example showing two samples, one has good qualify for all 100 base pairs, the other has a significant drop of data qualify towards the 3' end. For the second sample, you may want to consider trimming low quality reads from 3' before analysis. Using FastQC can also identify other issues like high duplication levels, untrimmed adapters, etc. |
|
Sample Relationship Next, you want to make sure your biological replicates are really reproducible. The idea is simple, samples from the same condition should look similar, and treated and control samples should look different. In the example shown at right, a breast cancer cell line is treated with control (DMSO), endogenous estrogen (E2), or environmental estrogens (GEN: genistein, environmental estrogen found in soybean; BPA: bisphenol A, environmental estrogen found in plastics; see ref. here), and two replicates from each condition are measured using RNA-Seq. We present two ways to look at the sample relationship. First we can create a MDS (multidimensional scaling) plot. You can see that the biological replicates are indeed similar to each other. Also, E2 and GEN are more different from DMSO than BPA. A more informative method is to cluster the samples based on gene expression levels. Here we plot the most variable genes with a heatmap. The sample clustering is similar to what we have seen in MDS Plot. From the heatmap, it is clear that E2 and GEN change the expression of many genes, and there are more up regulated gene. Overall E2 has a stronger effect than GE. BPA only has limited effect on gene expression. From this step, you can also identify outliers (e.g. a control sample which looks like treated, or vice versa), which you may want to remove before moving to downstream analysis. |
MDS Plot
Heatmap after Clustering |
|
Step 3. Visualize Expression Profiles Gene expression data can be easily uploaded to BxGenomicDB for online visualization. Let's look at some examples using the ER data set. |
|
Gene Profiles of Estrogen Induced Genes First we want to look at all the genes that are induced by endogenous estrogen E2. With BxGenomicDB, it is easy to filter these genes by setting Fold Change>2 and FDR value From the plot, it is clear that for most of the E2 induced genes, they are also induced by GEN to a lower extent. BPA may induced some of these genes, but the change is not significant enough to be detected by this experiment (two replicates).
|
|
Search and View Specific Groups of Genes So are there any BPA induced genes? With FDR cut off at 0.05, there isn't any hits. With a relaxed criteria ( P-value <=0.05 and two fold cutoff, see screen shot below), 9 genes show up. Considering poor FDR values, more experiments (e.g. Q-PCR) are probably needed to verify them. Let's look at the expression profiles of the top three genes with two different ways (use plot options to select graphic format). The bar graph is straightforward, but I like the data points with average bar better because values from individual replicates are shown clearly.
|
|
Step 4. Obtain Biological Insights from RNA-Seq Data Now you can explore the data and use various tools to answer biological questions and make new discoveries on the mechanism of gene regulation. We are showing some examples below to give you an idea of the commonly used tools. |
|
Differentially Expressed Genes to Functional Category This is one of the most commonly asked questions and we have a separate blog dedicated to this topic. Briefly, let's see what are the functional categories enriched in E2 induced genes in the breast cancer cell line T-47D ( ref.). With BxGenomicDB, we can direct send the gene list to DAVID and get a list of Gene Ontology categories. We can also export the list, and copy the symbols to the Online Functional Enrichment Tool (based on Homer tool). The screen shots from both methods are shown at left. A direct link to the homer results is here. From DAVID, it looks like membrane proteins and intracellular signaling cascade and up-regulated by E2 in T-47D cell line. From Homer results, the top matches are the up list in response to estradiol from other experiments, and up regulated genes in a few other breast cancer tissues. Overall, some interesting results and a lot of data mining and follow up can be done here. |
|
Gene Set Enrichment Analysis (GSEA) The methods (DAVID, Homer) we mentioned above all start with a gene list, and often we need to use a somewhat arbitrary cutoff to get the list. Sometimes different cutoff may lead to different results. GSEA is another very popular method to search for functional enrichment, but it doesn't need a cutoff. GSEA looks at the difference between two biological states of all genes, then come up with what gene sets are statistically significant. We run GSEQ using ranked gene list based E2 induction. The results shows many interesting functional categories. Again, top matches are the up list in response to estradiol from other publications, but we also identify that the ribosome genes are strongly enriched in E2 induced genes. Next we will look into more details at the ribosome genes. |
![]() |
|
Data Overlap and Venn Diagram From the example of ribosome genes above, these genes are mostly affected only by E2, but not by the two environmental estrogens. To view the overlap of induced genes globally, we can use Venn Diagram. First we export the up-regulated genes from each of the three treatment, then we can enter the gene IDs to the online Data Overlap Tool to compute the overlap and create an Venn Diagram. |
|
Other Integrative Analysis There are a lot more interesting analyses one can perform on RNA-Seq data. For example, you can compare with expression data from other labs, or integrate ChIP-Seq with RNA-Seq results. We will cover some of these topics in future blogs.
Interested in using the approach described here but need some help? Please contact us to get a free quote for our biologists-friendly analysis service. |


