# Documentation

*DiffExpress* is a platform to support researchers in the study of quantitative changes in gene expression levels between experimental groups. The user must provide count-tables from RNA-seq assays and information on the experimental design. The software can filter the read-count data and will normalize it, estimate the data dispersion, fit an statistical model and run a statistical analysis to test if, for a given gene, the difference in the read count average within experimental groups is significant. The analysis also generates supporting images, such as MDS plots, MA plots and heatmaps. The user can explore the results in an interactive interface, share the access to it with collaborators or download it.

### Overview

*DiffExpress* is based on the resources offered by the R (1) package *edgeR* (2). It offers methods developed by Robinson and Smyth (3, 4) to determine differential expression in sequence-based gene expression data. It can be applied to RNA-seq, Tag-seq, SAGE, CAGE, Illumina/Solexa, 454 or ABI SOLiD assay's data (5). It analyses **count tables** generated by those assays, which presents, for each sample/library (columns), the observed number of sequenced reads mapped to the genomic feature of interest (genes, exon) which are generally referred to as ** tags** (rows). A second table informing the relationship between samples is also required, which is labelled as

**metadata**. This table will inform how many groups/conditions can be found in the experiment, the relevant features to be analyzed or information on bias sources. Each column in the count table should have a correspondent row in the metadata table with a matching label (Simplicity will verify and inform if there is no correspondence).

*DiffExpress* was designed to deal with variable table formats and the user can see how the software is interpreting their files on the go and change the settings straight away. One of the key features of this platform is to **support the user to define the statistical design** that best represents the experiment. *v* offers a form where the user has to pick at least one of the features in the metadata table and fit a statistical model. The software informs if any missing value (represented by "NA") is found and checks if it is statistically possible to fit the model. Any issues are informed to the user and some solutions are presented in order to support the user's decision on how to deal with the problem. All changes and fitted models can be tracked in the **History tab.**

*DiffExpress* was implemented using *edgeR* (2) due to its extensive use among the scientific community, its statistical modelling possibilities and general quality. However, it is important to clarify that *edgeR* does differential expression analysis rather than the quantification of expression levels. In other words, it deals with relative changes in expression levels between groups/conditions but does not directly estimate absolute expression levels (6).

### Count Table

In order to work, the *edgeR* methodology requires original observed expression counts, so these should not be transformed in any way by users prior to analysis. *edgeR* automatically takes into account the total size (total read number) of each sample/library in all calculations of fold-changes, concentration, and statistical significance. RPKM values *should* not be used(5).

The count table can be a CSV or TXT file presenting the *tags’* read-count for each sample. *DiffExpress* offers an interface where the user can indicate the type of column separator being used and other formatting features that are relevant to properly read the file. There is also the possibility to transpose the data. The final version of the table should present the *tag* ID in the first column, one row per *tag* and one column per sample/library. No missing data (“NA”) is accepted (*DiffExpress* will inform the tags being removed due to the presence of missing data) and all tags and samples/libraries must have unique identifiers.

### Metadata Table

The metadata table is also presented in a CSV or TXT file and the user can also define the criteria to read the file as described for the count table input interface. The metadata table must have a row for each sample/library in the count table and a column for each variable(s) he/she wants to study. Each column in the count table should have a correspondent row in the metadata table with a matching label.

The metadata table may contain any relevant information to understand the data, such as phenotypic features, clinical outcomes or experimental information (such as collection day, batch, institution). Later, the user will inform which of the information will be used in the statistical design, therefore there is no issue if the table contains variables beyond the ones that are intended to be used in the differential expression analysis (DEA).

### Modelling the Data

The most common type of experimental design involves *m* experimental groups/conditions with *n* independent biological replicates in each group/condition. The researcher will then compare the tag expression among m-groups/conditions. This experiment design is fairly simple to model, but they can also benefit from adjusting for extra confounding information such as experimental batch, patient age, gender, etc. Moreover, experiments frequently have more than one factor to be studied, for example factors such as cell type, treatment, and changes through time. Modeling multiple factor experiments can be more challenging and there may not be enough replicates to fit all parameters the user wants to account for. Here, we provide an interface to easily model the dataset features and test the model compliance before submitting the analysis.

The user is required to pick **features** available on the metadata table and 1) inform which group's/condition's level is the **baseline** (the ‘reference’ state), 2) identify if it is a **categorical/nominal** *or* **numeric/continuous** variable and 3) if he/she wants to run a differential expression analysis for that particular feature or simply include it in the model to **adjust the sample-imbalance bias** attributable to that feature (it is a **confounding variable** All those variables will be represented in a matrix that is commonly referred as the *design matrix* *DiffExpress* will run a **validation** procedure to verify if the model can be fitted with the dataset available. Based on the results, it provides guidance to the user on how to fix the identified issues. This feature avoids waiting for the *edgeR* analysis to run (and fail) to only then discover that the dataset was not adequate for the statistical model fitting.

*DiffExpress* offers the possibility to design statistical models that are based on the sum of features (to which we refer as "**simple**" and also it’s possible to include **interactions** between variables. Interactions combine the conditions present in two or more “new variables”.

Example:

Samples | Tissue | Treatment |
---|---|---|

A | Control | Control |

B | Control | Control |

C | Control | Treated |

D | Control | Treated |

E | Tumor | Control |

F | Tumor | Control |

G | Tumor | Treated |

H | Tumor | Treated |

Modeling the interaction between features *Type* and *Treatment* (*~ Type:Treatment* would generate a combination between the levels of each variable:

TissueControlTreatmentControl : samples A,B

TissueControlTreatmentTreated : samples C,D

TissueTumorTreatmentControl : samples E,F

TissueTumorTreatmentControl : samples G,H

It is possible to include variables in the **simple** modeling and in **interaction** models simultaneously. If you wish to understand how *edgeR* interprets the models and how the analysis is done, we suggest referring to the *edgeR* User Guide at Bioconductor.

### Analysis Parameters

The core steps in *DiffExpress* analysis are:

- Removal of
*tags*with NA values - Filtering out
*tags*with low count (optional) - Normalizing with weighted trimmed mean of M-values ("TMM") method
- Dispersion estimative (robust or non-robust)
- Dispersion plot (Biological coefficient of variation – BCV )
- lNB-GLM fitting
- Likelihood ratio test for the features of interest
- Heatmaps

Here we provide a description of the methods implemented and whereas the user can customize them or not. For further details, we recommend you to consult the *edgeR* User Guide and the cited sources.

#### Low-count filtering

The dataset usually has thousands of tags and not all of them have enough reads to contribute to the differential expression analysis. In addition, these low counts may interfere with some of the statistical methods used in the pipeline. Therefore, it is strongly recommended to filter them out prior to further analysis (6). Nonetheless, we allow the user to make the decision if he/she wants to filter low counts out and what is the minimum CPM (count per million) that a *tag* must have in order to be kept in the analysis. The filtered is done based on CPM as recommended by the *edgeR* authors.

#### Normalization

The dataset is normalized for RNA composition by finding a set of scaling factors for the samples/libraries sizes that minimize the log-fold changes between the samples for most tags (6). These scaling factors are calculated using trimmed means of M-values (TMM) between each pair of samples (7), and is the default methodology implemented on *edgeR*. In *DiffExpress*, this is a mandatory step. This procedure is especially important when a small number of genes are very highly expressed in one sample/library, but not in all others, causing the remaining tags to be under-sampled in that first sample/library. Therefore this normalization step adjusts the RNA composition effect, avoiding that the remaining genes falsely appear to be down-regulated in that sample/library (6). It is noteworthy that, because we are interested in calculating relative expressions, the dataset does not need to be adjusted for GC-content and gene length (6).

#### Mutidimensional Scaling - MDS Plots

After the data is filtered and normalized, MDS plots are generated to inform the user about the relationship between samples. The MDS plot presents the leading log-fold-change between each pair of samples (6). Similar to a plot based on Principal component analysis, the MDS plot helps to identify structure in the data, heterogeneity and provides an overview if that is a clear separation between the groups/conditions of interest. In *DiffExpress*, an MDS plot is generated for all variables included in the statistical design.

#### Dispersion

The *tag*-specific dispersion estimation is necessary so that *tags* that behave consistently across replicates (the ones with lower estimated dispersion) are ranked higher than the ones that do not (3, 4, 6). The method calculates the matrix of likelihoods for each *tag* at a set of dispersion grid points and then applies weighted likelihood empirical Bayes method to obtain posterior dispersion estimates (3). It calculates the adjusted profile log-likelihood for each *tag* based on the design matrix and then maximizes it (6). *DiffExpress* uses *edgeR*’s Cox-Reid profile-adjusted likelihood (CR) method for all *tags* It fits generalized linear models (GLM) with a design matrix, allowing for all systematic sources of variation to be accounted for in the estimative (6, 8, 9). For this reason, we recommend that the user informs the sources of bias when defining the statistical model in the input interface. By including all known sources of undesired bias (e.g. batch) in the statistical model, the data analysis will take it into consideration all those factors which will provide more precise estimations (6). Besides that, the user may decide if the analysis should be robustified against potential outlier genes or not. When using robustified methods please cite (10).

#### Dispersion Plot (Biological Coeficient of Variation - BCV)

“Biological CV (BCV) is the coefficient of variation with which the (unknown) true abundance of the gene varies between replicate RNA samples. It represents the CV that would remain between biological replicates if sequencing depth could be increased indefinitely. The technical CV decreases as the size of the counts increases. BCV on the other hand does not. BCV is therefore likely to be the dominant source of uncertainty for high-count genes, so reliable estimation of BCV is crucial for realistic assessment of differential expression in RNA-Seq experiments.” (6).

#### GLM Fitting

Once the steps above are completed, the dataset is ready to be fitted to a log-Negative Binomial Generalized log-Linear Model (lNB-GLM), as described by McCarthy et al. (2012) (9). It conducts a gene-wise statistical test for a given coefficient or coefficient contrast of the variable(s) of interest.

#### Likelihood Ratio Test for the Selected Variables

This method is applied to test the ratio of deviances between nested models with and without the estimation of coefficients or coefficient contrast of the variable(s) of interest in the lNB-GLM model, respectively. It is at this *stage* of the analysis that genes **significantly** differentially expressed between groups/conditions are actually identified and the gene-wise p-values are corrected for multiple comparisons using the Benjamini-Hochberg false discovery method (11).

#### Results Table

Simplicity will generate multiple tables with the results. The number of tables depend on:

- the variables included in the statistical design,
- the number of levels which each of the categorical/nominal variable has, and
- if the user set the program to do compare expression between every level of the categorical/nominal variables or only contrasts the levels against the baseline.

Analysis of differential expression based on Continuous/Numeric variables will produce a single table which presents the logFC of the variation per unit. Features marked as confounding factors (checked the box for “I want to remove this effect”) will not have tags expression analysed.

In the table, the ** tag ID** is in the 1st column, followed by the

**logFC**(the log2 of the ratio of the estimated read counts average between either two levels of a categorical/nominal variable or the estimated increment of one unit of a numeric/continuous variable being analyzed). The 3rd column contains the mean

**logCPM**(log2 of the estimated read count mean per million) for each

*tag*The other columns present the test statistic (likelihood ratio -

**LR**the estimated

**p-value**and, the

**FDR adjusted p-value**(p-value adjusted for multiple comparison using the false discovery ratio). The researcher should interpret the differential expression significance based on the FDR adjusted p-values.

#### Plot MA

MA plots are an interesting way to visualize differential expression results. It consists of plotting the estimated mean log2 ratio between any two levels of a categorical/nominal variable in the ordinate (vertical axe) against the estimated log2 average CPM in the abscissa (horizontal axe). Therefore, for every differential expression analysis there is an MA plot. *Tags* with absolute logFC above the established threshold (horizontal dotted grey lines) and with FDR adjusted p-values below the significance level are indicated as red color dots.

#### Heatmaps

Heatmaps are a popular way to visually search for pattern in the tags expression that could explain the variables of interest. In *DiffExpress*, we plot the heatmaps with the dataset scaled/normalized logCPM using the resources of the pheatmap package (12). Due to the usual large list of genomic features we have established some criteria to define which tags will be presented in the heatmap:

- All differentially expressed tags in all comparisons (according to the logFC and FDR adjusted p-value threshold established by the user). It is necessary to have at least two differentially expressed
*tags* - The
*tags*with the higher average logCPM; and, - The differentially expressed
*tags*for each comparison (according to the logFC and FDR adjusted p-value threshold established by the user). It is necessary to have at least two differentially expressed*tags*

For every *tag* list, 3 versions of heatmaps are generated:

- Simple – the samples/conditions are ordered according to the features presented by the first variable in the experimental design and the tags are not sorted;
- Tag Clustered - the samples/conditions are ordered according to the features presented by the first variable in the experimental design and a complete-linkage agglomerative hierarchical clustering using the Euclidean distance as dissimilarity measure is used to cluster the tags; and,
- Bi-Clustered – both samples/conditions and
*tags*are clustered.

All heatmaps will display the experimental model features as top rows with a legend and a color key on the right side of the plot.

### Bibliography

- The R Foundation for Statistical Computing (2017) R.
- Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.
*Bioinformatics*26(1):139–140. - Robinson MD, Smyth GK (2007) Moderated statistical tests for assessing differences in tag abundance.
*Bioinformatics*23(21):2881–2887. - Robinson MD, Smyth GK (2008) Small-sample estimation of negative binomial dispersion, with applications to SAGE data.
*Biostatistics*9(2):321–332. - Robinson M, McCarthy D, Chen Y edgeR: differential expression analysis of digital gene expression data.
*BioConductor*

Available at: http://www.bioconductor.org/packages/2.8/bioc/html/edgeR.html [Accessed December 5, 2017]. - Chen Y, McCarthy D, Ritchie M, Robinson M, K.Smyth G (2010)
*edgeR:differential expression analysis of digital gene expression data User’s Guide.* - Robinson MD, Oshlack A (2010) A scaling normalization method for differential expression analysis of RNA-seq data.
*Genome Biol*11(3):R25. - Chen Y, Lun ATL, Smyth GK (2014) in
*Statistical Analysis of Next Generation Sequence Data*eds Datta S, Nettleton DS (Springer). - McCarthy DJ, Chen Y, Smyth GK (2012) Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.
*Nucleic Acids Res*40(10):4288–4297. - Zhou X, Lindsay H, Robinson MD (2014) Robustly detecting differential expression in RNA sequencing data using observation weights.
*Nucleic Acids Res*42(11):e91. - Hochberg Y, Benjamini Y (1990) More powerful procedures for multiple significance testing.
*Stat Med*9(7):811–818. - Kolde R (2015)
*pheatmap: Pretty Heatmaps*(R).

### Citation

The DiffExpress^{[1]} pipeline depends on previous work of other researchers, please consider the list below when presenting the results generated here.

**Example:**

The differential expression analysis was done in DiffExpress pipeline ^{[1]}(https://simplicity.nsilico.com/DEA), a pipeline that supports the analysis using edgeR^{[2][3]}.
In the analysis, the dataset was normalized for RNA composition using trimmed mean of M-values (TMM)^{[4]} and the gene-specific dispersion was estimated using edgeR’s Cox-Reid profile-adjusted likelihood (CR) method for all tags^{[5][6]}.
Later, the dataset was fitted to a log-Negative Binomial Generalized Linear Model (lNB-GLM)^{[7]}, followed by the assessment for differentially expressed genes between the groups/conditions.
The analysis was robustified against potential outlier genes^{[8]} and the gene-wise p-values were corrected for multiple comparisons using the Benjamini-Hochberg false discovery method^{[9]}.
The heatmaps were also provided by DiffExpress using the pheatmap^{[10]}R^{[11]} library.

#### When using DiffExpress you are also using:

- Palu CC, Ribeiro-Alves M, Wu YX, Lawlor B, Kelly B, Walsh P (2018)
*DiffExpress: bespoken cloud-based interface for bulk RNA-seq differential expression modelling and analysis.***Unpublished manuscript** - Robinson MD, McCarthy DJ, Smyth GK (2010)
*edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.***Bioinformatics**26(1):139–140. - Robinson MD, Smyth GK (2007)
*Moderated statistical tests for assessing differences in tag abundance.***Bioinformatics**23(21):2881–2887. - Robinson MD, Oshlack A (2010)
*A scaling normalization method for differential expression analysis of RNA-seq data.***Genome Biol**11(3):R25. - Robinson MD, Smyth GK (2008)
*Small-sample estimation of negative binomial dispersion, with applications to SAGE data.***Biostatistics**9(2):321–332. - Chen Y, Lun ATL, Smyth GK (2014)
*Differential expression analysis of complex RNAseq experiments using edgeR*in**Statistical Analysis of Next Generation Sequence Data**, eds Datta S, Nettleton DS (Springer). - McCarthy DJ, Chen Y, Smyth GK (2012)
*Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.***Nucleic Acids Res**40(10):4288–4297. - Hochberg Y, Benjamini Y (1990)
*More powerful procedures for multiple significance testing.***Stat Med**9(7):811–818. - Zhou X, Lindsay H, Robinson MD (2014)
*Robustly detecting differential expression in RNA sequencing data using observation weights.***Nucleic Acids Res**42(11):e91. - The R Foundation for Statistical Computing (2017)
*R*. - Kolde R (2015)
*pheatmap: Pretty Heatmaps*(R).

**If you used robustified analysis**

**If you use the heatmaps**