Introduction

DiffExpress is a platform that helps researchers study quantitative changes in gene expression levels between experimental groups. The user must provide count tables from RNA-seq assays and information on the experimental design. The software can filter the read-count data; it then normalizes the data, estimates the dispersion, fits a statistical model and tests whether, for a given gene, the difference in average read counts between experimental groups is significant. The analysis also generates supporting images, such as MDS plots, MA plots and heatmaps. The user can explore the results in an interactive interface, share access to them with collaborators or download them.

The objective of DiffExpress is to allow researchers with or without prior bioinformatics knowledge to model a differential expression analysis. DiffExpress achieves this through a user-friendly, intuitive, flexible and interactive cloud-based platform. The platform also provides clarity, real-time answers and data validation.

Transcriptomics

An organism’s transcriptome is the sum of all the RNA transcripts it possesses. Depending on the experimental condition, the transcriptome of a cell can comprise all RNAs present (such as tRNA and sRNA) or just mRNA. The transcriptome reflects the genes being expressed at a particular time and, unlike the genome, is affected by external environmental conditions. It can be defined as the complete set of transcripts, and their quantity, for a specific developmental stage or physiological condition, in one cell or in a population of cells. Transcriptomics allows the examination of whole-transcriptome changes across a variety of biological conditions. Over time, RNA sequencing has become the most effective and precise transcriptomic technology and analytical tool.

Differential Expression Analysis

Differential Expression Analysis (DEA) is the term used here for applying statistical analysis to normalized read-count data to discover quantitative changes in expression levels between experimental groups. Available methods include edgeR and DESeq, both of which model read counts with the negative binomial distribution; edgeR can also perform multiple comparisons. DiffExpress uses edgeR.


Accessing the DiffExpress pipeline

1. You can access the Simplicity™ homepage at https://simplicity.nsilico.com. Here, you can access the login portal, use the tutorial, find out more about NSilico and the Simplicity™ platform, and contact NSilico staff.

Figure 13 - NSilico Home

2. At the login portal, you can access your account by entering your email and password. You can also sign up for a new account or recover a lost password.

Figure 14 - Login portal

3. Upon logging in, you are brought to the user home page. You can start a new pipeline or check your current pipelines by clicking one of the two blue links in the yellow highlighted box. Live features and new features in development are displayed below the yellow highlighted box.

Figure 15 - User home page

4. 'Start a new pipeline' will bring you to a page with three options. To carry out a DEA, click the interactive image in the yellow highlighted box (Transcriptomics).

Figure 16 - Pipeline menu

5. Next, you can select the type of transcriptomic pipeline you wish to use. Currently only Differential Expression Analysis is available (highlighted in the yellow box).

Figure 17 - Types of transcriptomics-based DiffExpress available

Input Interface

1. The initial DiffExpress interface has only three fields enabled (labeled A, B and C and highlighted in yellow). Here, you can (A) enter the project title for your DEA, (B) upload your RNA-seq count table and (C) upload the metadata. Metadata are information regarding the samples, experimental design and sources of bias. The other functions in the interface only become available once both tables are successfully uploaded. The project title identifies the pipeline in the user’s 'My Pipelines' area and in the results interface, so it is recommended that you use a name that will help you distinguish each individual pipeline later. (D) If you wish to start your analysis from scratch, you can reset the form at any moment. (E) The step wizard provides tips and explanations of what is required to use the DiffExpress platform successfully; the information it provides changes as you navigate the DiffExpress interface.

Figure 18 - DiffExpress input webpage

2. Uploading the count table

2.1 To upload the RNA-seq count table, you can drag the count table file into the 'Drag & Drop' box (highlighted in yellow). Alternatively, if you would like to browse through your files, you can click the 'Drag & Drop' box to open the file explorer. Note that the read counts must come from NGS data.

Figure 19 - Uploading RNA-seq count table

2.2 In the file explorer (this tutorial uses Microsoft Windows as an OS), you should select and open the count table (tab-delimited .txt or .csv file). It is imperative that the count table contains only numbers, with the only exception being the tag ID in the first column and the sample names in the first row.

Figure 20 - Selecting corresponding file
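
The rule in step 2.2 can be checked on your own machine before uploading. The following is a minimal Python sketch and not part of DiffExpress: the function name `find_non_numeric_cells` and the example table are hypothetical, and it assumes a tab-separated file whose first row holds sample names and whose first column holds tag IDs.

```python
import csv
import io
import math


def find_non_numeric_cells(text, delimiter="\t"):
    """Return (tag_id, sample, value) for every cell that is not a finite number.

    The first row (sample names) and the first column (tag IDs) are not checked.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    problems = []
    for row in body:
        if not row:  # skip blank lines
            continue
        tag_id = row[0]
        for sample, value in zip(header[1:], row[1:]):
            try:
                is_number = math.isfinite(float(value))
            except ValueError:
                is_number = False
            if not is_number:
                problems.append((tag_id, sample, value))
    return problems


# Hypothetical two-sample table with one invalid cell.
example = "tag\tS1\tS2\ngeneA\t10\t0\ngeneB\t7\tn/a\n"
print(find_non_numeric_cells(example))
```

An empty list means every data cell could be read as a number; any entry in the list points to a cell that should be corrected before the upload.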

2.3 Once the count table file has been selected, you can adjust the variables in the highlighted box (header/transpose, separator and quote) as needed before submitting the RNA-seq count table.

Figure 21 - Uploading RNA-seq count table

2.4 Your count table should have the sample IDs in the first row. If it doesn’t have sample IDs, uncheck the 'Header' box and the interface will create the IDs.

Figure 22 - Count table header

2.5 If your original count table has the samples in rows, you don’t need to reformat the file; check the 'Transpose' box instead.

Figure 23 - Transposing tables
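
To illustrate what transposing does to a table, here is a small Python sketch. It is an illustration only and says nothing about how DiffExpress implements the 'Transpose' option; the function name `transpose` and the example values are hypothetical.

```python
def transpose(table):
    """Swap rows and columns of a rectangular list-of-lists table."""
    return [list(column) for column in zip(*table)]


# Hypothetical table with one sample per row.
samples_by_rows = [
    ["sample", "geneA", "geneB"],
    ["S1", "10", "7"],
    ["S2", "0", "3"],
]

# After transposing, each row is a tag and each column is a sample.
tags_by_rows = transpose(samples_by_rows)
for row in tags_by_rows:
    print(row)
```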

2.6 You can also adjust the separator (comma, semicolon or tab) so that the count table is split into distinguishable columns and rows. If the interface successfully distinguishes the columns, a sorting button appears for each column.

Figure 24 - Adjusting Separator

2.7 You can adjust the 'quote' variable from none to single or double. 'Double' (“) results in defined column headings and a noticeable lack of quotation marks. At this point, you can submit the table or close the upload menu.

Figure 25 - Adjusting Quote
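
The effect of the separator and quote settings can be reproduced with Python's standard `csv` module. This is a sketch under assumptions: the helper `parse_table` and the sample text are hypothetical, and the DiffExpress parser itself may behave differently in edge cases.

```python
import csv
import io


def parse_table(text, separator=",", quote=None):
    """Split text into rows and columns; quote=None keeps quote characters as data."""
    if quote is None:
        reader = csv.reader(io.StringIO(text), delimiter=separator,
                            quoting=csv.QUOTE_NONE)
    else:
        reader = csv.reader(io.StringIO(text), delimiter=separator,
                            quotechar=quote)
    return list(reader)


# Hypothetical semicolon-separated file with double-quoted text fields.
raw = '"tag";"S1";"S2"\n"geneA";10;0\n'

print(parse_table(raw, separator=","))             # wrong separator: one column per row
print(parse_table(raw, separator=";"))             # columns found, quotes still visible
print(parse_table(raw, separator=";", quote='"'))  # columns found, quotes removed
```

With the wrong separator every row collapses into a single column, which is why the preview shows no per-column sorting buttons until the separator matches the file.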

2.8 The interface reports whether the upload was successful.

Figure 26 - Successful Count Table upload

3. Uploading metadata

3.1 To upload the metadata, you can drag the metadata file into the 'Drag & Drop' box (highlighted in yellow). Alternatively, if you would like to browse through your files, you can click the 'Drag & Drop' box to open the file explorer. The metadata table indicates which groups/conditions are present in the experiment. It may contain data such as phenotypic features, clinical outcomes or experimental information.

Figure 27 - Drag & Drop box to upload Metadata

3.2 In the file explorer (this tutorial uses Microsoft Windows as an OS), you can select and open the metadata table (tab-delimited .txt or .csv file).

Figure 28 - Selecting corresponding file

3.3 As with the count table upload, the same settings need to be adjusted for the metadata. Notice that, in the example below, the interface was not able to recognize the columns.

Figure 29 - Metadata Upload

3.4 You can adjust the separator (comma, semicolon or tab) so that the metadata table is split into distinguishable columns and rows.

Figure 30 - Adjusting Separator

3.5 You can adjust the “quote” variable from none to single or double. Double (“) results in defined column headings and a noticeable lack of quotation marks. At this point you can submit your table or close the upload menu.

Figure 31 - Adjusting Quote

3.6 If the IDs in the metadata do not match the IDs in the count table, the highlighted message below is displayed.

Figure 32 - Error message (Mismatch)
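
A mismatch between the two tables can be located before uploading by comparing the sample IDs directly. The sketch below is hypothetical (the function `compare_sample_ids` and the IDs are invented for illustration) and does not reproduce the exact matching rule DiffExpress applies.

```python
def compare_sample_ids(count_ids, metadata_ids):
    """Report which sample IDs are shared and which appear in only one table."""
    count_set, metadata_set = set(count_ids), set(metadata_ids)
    return {
        "shared": sorted(count_set & metadata_set),
        "only_in_counts": sorted(count_set - metadata_set),
        "only_in_metadata": sorted(metadata_set - count_set),
    }


# Hypothetical IDs: S3 is missing from the metadata, S4 from the count table.
report = compare_sample_ids(["S1", "S2", "S3"], ["S1", "S2", "S4"])
print(report)
```

IDs listed under `only_in_counts` or `only_in_metadata` are the ones to correct, for example when a sample name was typed with different capitalization in one of the files.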

3.7 When matching data are uploaded or previous errors are successfully rectified, the highlighted message below is displayed.

Figure 33 - Successful data upload

4. Statistical design

4.1 By clicking the text in the highlighted box, you can review the files you have uploaded at any time.

Figure 34 - Review input files

4.2 In this region, you will define the statistical model based on the variables presented in the metadata table. You should select the variables of interest to be included in the analysis. (A) You can add a new variable/field, (B) show all fields from the metadata table and (C) remove all statistical fields present.

Figure 35 - Simple tab

4.3 By interacting with the highlighted section, you can choose a value for the variable.

Figure 36 - Select variables for the statistical model

4.4 Several options are available to you in order to modify the variable as needed. (A) Firstly, you can change the variable that corresponds to the columns of the metadata table. (B) Secondly, you can remove an effect from a confounding variable or a source of bias. (C) Thirdly, you can mark a variable as continuous or leave the box unchecked to leave the variable as categorical. (D) If the variable is categorical, you can then choose a baseline value that indicates a reference value. (E) Finally, you can remove the variable altogether.

Figure 37 - Variable options

4.5 Use the simple model to analyze differences between groups. For example, if the fields 'Treatment' and 'Gender' are chosen, one analysis is carried out between the groups 'Female' and 'Male' and another between 'Drug A' and 'Control'.

Figure 38 - Simple Tab
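
The role of the baseline chosen for a categorical variable can be illustrated with treatment (dummy) coding, a standard way of turning categories into model columns. Whether DiffExpress builds its model exactly this way is an assumption; the function `treatment_coding` and the sample values below are hypothetical.

```python
def treatment_coding(values, baseline):
    """Encode a categorical variable as an intercept plus one 0/1 column per
    non-baseline level, so each column measures a difference from the baseline."""
    if baseline not in values:
        raise ValueError(f"baseline {baseline!r} does not occur in the variable")
    other_levels = sorted(set(values) - {baseline})
    column_names = ["Intercept"] + other_levels
    rows = [[1] + [int(value == level) for level in other_levels]
            for value in values]
    return column_names, rows


# Hypothetical 'Treatment' column for four samples, with 'Control' as baseline.
treatment = ["Control", "Drug A", "Drug A", "Control"]
names, design = treatment_coding(treatment, baseline="Control")
print(names)
for row in design:
    print(row)
```

In this coding the 'Drug A' column is 1 only for treated samples, so its coefficient expresses the change relative to 'Control'.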

4.6 The highlighted error message appears when unknown values (NA) are present. NA values are not accepted, so DiffExpress offers two options: 1) remove the variables in question or 2) remove the samples with NA values.

Figure 39 - Error message outputted due to NA value(s)
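
The two options have different consequences: removing a variable keeps every sample but drops a column, whereas removing samples keeps the column but shrinks the data set. The Python sketch below illustrates this on a hypothetical metadata table; the function names are invented and `None` stands in for NA.

```python
def drop_variable(metadata, variable):
    """Option 1: remove the variable (column) from every sample."""
    return {sample: {name: value for name, value in fields.items()
                     if name != variable}
            for sample, fields in metadata.items()}


def drop_samples_with_na(metadata, variable):
    """Option 2: remove the samples whose value for the variable is unknown."""
    return {sample: fields for sample, fields in metadata.items()
            if fields.get(variable) is not None}


# Hypothetical metadata: the age of sample S2 is unknown.
metadata = {
    "S1": {"Treatment": "Drug A", "Age": 34},
    "S2": {"Treatment": "Control", "Age": None},
}

print(drop_variable(metadata, "Age"))
print(drop_samples_with_na(metadata, "Age"))
```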

4.7 The interaction tab is used to study the combined effect of two or more variables. At least one column should be selected in the simple or interaction tab. You can also combine information from both tabs. We recommend that you include information about confounding variables in the model. Additionally, you can select the values for the baseline (control) in the interaction tab. A number of options are available here. (A) You can add a new interaction box. (B) You can include variables to study their combined effect, regardless of whether these variables are already present in the simple model. Each variable can be marked as continuous or left as categorical (the default) and, if categorical, given a baseline. (C) Specify the baseline for the categorical variables. Make sure to always use the same baseline value for a given variable. (D) If there are more than two variables in the interaction, it is possible to remove them individually. (E) More variables can be added to the interaction and (F) you can remove an interaction in its entirety.

Figure 40 - Interaction Tab

4.8 The highlighted error message is displayed when the same variable is defined as continuous in the simple model but has a baseline in the interaction model.

Figure 41 - Error message (Interaction)
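
The consistency rule behind this message can be expressed as a short check. The sketch below is only an illustration: the dictionary layout and the function `find_baseline_conflicts` are hypothetical and do not reflect how DiffExpress stores its model settings.

```python
def find_baseline_conflicts(simple_model, interaction_model):
    """Return the variables marked continuous in the simple model that also
    carry a baseline in the interaction model."""
    conflicts = []
    for name, settings in simple_model.items():
        is_continuous = settings.get("continuous", False)
        baseline = interaction_model.get(name, {}).get("baseline")
        if is_continuous and baseline is not None:
            conflicts.append(name)
    return sorted(conflicts)


# Hypothetical settings: 'Age' is continuous in one tab but has a baseline in the other.
simple = {"Age": {"continuous": True}, "Treatment": {"continuous": False}}
interaction = {"Age": {"baseline": 30}, "Treatment": {"baseline": "Control"}}
print(find_baseline_conflicts(simple, interaction))
```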

4.9 In the removed samples section (highlighted in yellow), you can see a history of the samples that had to be removed. (A) lists the samples removed from the count table because they were not present in the metadata table and (B) lists the samples removed because they had unknown values for the variable 'Age'.

Figure 42 - Removed Samples

4.10 In the history tab, you can see the recorded history of messages displayed by the platform when validating the files and statistical models.

Figure 43 - History Tab

4.11 In the statistics tab, you can adjust analysis features as needed. Options include whether to (A) remove genes with low counts, (B) make the dispersion estimate robust against potential outliers and (C) compare gene expression between every category within the selected variables. Furthermore, (D) you can adjust the threshold criteria for plot generation, namely logFC and the adjusted p-value. (E) You can also reset the statistical parameters to their default values if necessary.

Figure 44 - Statistics Tab
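
As an illustration of how logFC and adjusted p-value thresholds select tags, consider the following Python sketch. The function `select_significant`, the result records and the cut-off values (1.0 and 0.05) are all assumptions chosen for the example; they are not the DiffExpress defaults.

```python
def select_significant(results, min_abs_logfc=1.0, max_adj_p=0.05):
    """Keep tags whose absolute logFC and adjusted p-value both pass the thresholds."""
    return [record["tag"] for record in results
            if abs(record["logFC"]) >= min_abs_logfc
            and record["adj_p"] <= max_adj_p]


# Hypothetical DEA results for four tags.
results = [
    {"tag": "geneA", "logFC": 2.3, "adj_p": 0.001},
    {"tag": "geneB", "logFC": -1.5, "adj_p": 0.2},
    {"tag": "geneC", "logFC": 0.4, "adj_p": 0.01},
    {"tag": "geneD", "logFC": -3.0, "adj_p": 0.04},
]
print(select_significant(results))
```

Here geneB fails the p-value threshold and geneC fails the logFC threshold, so only geneA and geneD are selected. The absolute value is used so that both up-regulated and down-regulated tags can pass.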

Results Interface

1. Project Overview

1.1 In the project overview section, you can view the pipeline ID, the pipeline title and the date of pipeline submission. The analysis parameters are also displayed below the highlighted regions.

Figure 45 - Project overview

1.2 You can also view the multi-dimensional scaling (MDS) plots and see the background processes running on the Simplicity™ platform.

Figure 46 - Project overview

1.3 Highlighted below is an example of a Multi-Dimensional Scaling (MDS) plot. You can interact with the highlighted regions to view other MDS plots with different statistical measures.

Figure 47 - MDS Plot

2. Output explorer

2.1 In the output explorer, you can view heatmaps of the results obtained, view the total number of comparisons obtained and view each comparison individually. Moreover, you can select a specific filter for the comparisons.

Figure 48 - Output explorer

2.2 When you click a comparison, you are brought to a page with the analysis results for that comparison. (A) The comparison is identified in the top left corner of the page, and you can also view (B) the MA plot and (C) the heatmaps. (D) You can use the search box to retrieve information on a specific tag. (E) The table presents the DEA results (F) for each tag present in the count table.

Figure 49 - Output explorer

2.3 Highlighted below is a heatmap, a method to visually search for patterns in tag expression that could provide information about the variables of interest. You can interact with the highlighted regions to view other heatmaps with different statistical measures.

Figure 50 - Heatmap

3. Other options

3.1 You can download your pipeline results by clicking “download all files” and view the citation and references by clicking “citation and references”.

Figure 51 - Project Overview

Glossary

Baseline

reference value (control) to which all other categories will be compared. If continuous, the baseline will automatically be the smallest value.

Confounding factor

a variable that can distort the association between the other variables being analyzed due to its strong relationship with one or more of them.

CPM

counts per million. The counts are scaled by total number of reads for each sample.

DEA

differential expression analysis

FDR

false discovery rate, the expected proportion of type I errors (false positives) among the results declared significant; it is controlled in null hypothesis testing when conducting multiple comparisons.

logCPM

log2 of the estimated read count mean per million for each tag

logFC

the log2 of the ratio of the estimated read counts average between either two levels of a categorical/nominal variable or the estimated increment of one unit of a numeric/continuous variable being analyzed

LR

likelihood ratio

Metadata

a set of data that describes and gives information about other data. In the DEA context, metadata describe the samples that are presented in the count table.

Tags

the genomic feature of interest (genes, exons, etc)
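
Worked example

Three of the quantities defined in this glossary (CPM, logFC and FDR-adjusted p-values) can be illustrated with a short Python sketch. It makes simplifying assumptions: the function names and numbers are hypothetical, CPM is computed from raw library totals, logFC is a plain log2 ratio of two means, and the adjustment uses the Benjamini-Hochberg procedure. The estimates reported by edgeR come from a fitted model and will differ from these simplified calculations.

```python
import math


def cpm(counts):
    """Scale the counts of one sample to counts per million reads."""
    total = sum(counts.values())
    return {tag: count * 1_000_000 / total for tag, count in counts.items()}


def log_fc(mean_group, mean_baseline):
    """log2 of the ratio between two average expression values."""
    return math.log2(mean_group / mean_baseline)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value
    # never exceeds the one ranked above it.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical sample with 1,000 reads in total.
print(cpm({"geneA": 250, "geneB": 750}))
# A tag with mean 8 in the treated group and mean 2 in the baseline group.
print(log_fc(8, 2))
# Four hypothetical raw p-values.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

A logFC of 2 corresponds to a four-fold increase over the baseline, and a logFC of -2 to a four-fold decrease.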