DiffExpress is a platform that helps researchers study quantitative changes in gene expression levels between experimental groups. The user must provide count tables from RNA-seq assays and information on the experimental design. The software can filter the read-count data; it then normalizes the data, estimates the dispersion, fits a statistical model and tests whether, for a given gene, the difference in average read counts between experimental groups is significant. The analysis also generates supporting images, such as MDS plots, MA plots and heatmaps. The user can explore the results in an interactive interface, share access to them with collaborators or download them.
The objective of DiffExpress is to allow researchers with or without prior bioinformatics knowledge to model a differential expression analysis. DiffExpress achieves this through a user-friendly, intuitive, flexible and interactive cloud-based platform. The platform also provides clarity, real-time answers and data validation.
An organism's transcriptome is the sum of all the RNA transcripts it possesses. Depending on the experimental condition, the transcriptome of a cell can refer to all RNAs present (such as tRNA and sRNA) or just mRNA. The transcriptome reflects all the genes being expressed at a particular time and, unlike the genome, is affected by external environmental conditions.
The transcriptome can be defined as the complete set of transcripts in a cell, and their quantity, for a specific developmental stage or physiological condition, or as the set of all RNA molecules in one cell or in a population of cells. Studying it allows the examination of whole-transcriptome changes across a variety of biological conditions. Over time, RNA sequencing has become the most effective and precise transcriptomic technology and analytical tool.
Differential Expression Analysis
DEA (Differential Expression Analysis) is the acronym used here for the use of normalized read-count data and statistical analysis to discover quantitative changes in expression levels between experimental groups. Available methods include edgeR, which can perform multiple comparisons, and DESeq; both model read counts with negative binomial distributions. DiffExpress uses edgeR.
Accessing the DiffExpress pipeline
1. You can access the Simplicity™ homepage at https://simplicity.nsilico.com. There, you can access the login portal, use the tutorial, find out more about NSilico and the Simplicity™ platform, and contact NSilico staff.
2. At the login portal, you can access your account by inputting your email and password. You can also sign up for a new account or recover lost passwords.
3. Upon logging in, you are brought to the user home page. You can start a new pipeline or check your current pipelines by clicking one of the two blue links in the highlighted yellow box. Live features and new features in development are displayed below the yellow highlighted box.
4. 'Start a new pipeline' will bring you to a page with three options. To carry out a DEA, you must click on the interactive image in the yellow highlighted box (Transcriptomics).
5. Next, you can select the type of transcriptomic pipeline you wish to use. Currently only Differential Expression Analysis is available (highlighted in the yellow box).
1. The initial DiffExpress interface has only three fields enabled (labeled A, B and C and highlighted in yellow). Here, (A) you can input the project title for your DEA, (B) upload your RNA-seq count table and (C) upload metadata. Metadata are information regarding the samples, experimental design and sources of bias. (D) If you wish to start your analysis from scratch, you can reset the form at any moment. The other functions in the interface only become available once both tables are successfully uploaded. The project title identifies the pipeline in the user's 'My Pipelines' area and in the results interface, so it is recommended that you use a name that will help distinguish each individual pipeline later. (E) The step wizard provides tips and explanations of what is required to use the DiffExpress platform successfully; the information it provides changes as you navigate the DiffExpress interface.
2. Uploading the count table
2.1 When uploading the RNA-seq count table, you can move the count table file into the 'Drag & Drop' box (highlighted in yellow). Alternatively, if you would like to browse through your files, you can click on the 'Drag & Drop' box to open the file explorer. Note that the read counts must come from NGS data.
2.2 In the file explorer (this tutorial uses Microsoft Windows as an OS), you should select and open the count table (tab-delimited .txt or .csv file). It is imperative that the count table contain only numbers, the only exceptions being the tag IDs in the first column and the sample names in the first row.
2.3 Once the count table file has been selected, you can adjust the variables in the highlighted box (header/transpose, separator and quote) as needed before submitting the RNA-seq count table.
2.4 Your count table should have the sample IDs in the first row. If it doesn't have sample IDs, uncheck the 'Header' box and the interface will create the IDs.
2.5 If your original count table has the samples in rows, you don't need to reformat the file. Instead, check the 'Transpose' box.
2.6 You can also adjust the separator (comma, semicolon or tab) so that the columns and rows of the count table are correctly distinguished. Note that, if the interface successfully distinguishes the columns, there will be a sorting button for each column.
2.7 You can adjust the 'quote' variable from none to single or double. 'Double' (") results in defined column headings and a noticeable lack of quotation marks. At this point, you can submit the table or close the upload menu.
2.8 The interface reports whether the upload was successful.
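The upload options in steps 2.3 to 2.7 can be pictured with a short sketch. The code below is not DiffExpress's own parser; it is a minimal illustration, using only Python's standard csv module, of how the header, transpose, separator and quote settings change the way a count table is read. The function name read_count_table and the generated IDs ('Sample1', 'Sample2', ...) are assumptions made for this example.

```python
import csv
import io


def read_count_table(text, separator="\t", quote=None, header=True, transpose=False):
    """Parse count-table text into (sample_ids, {tag_id: [counts]}).

    Hypothetical helper for illustration only; it mirrors the upload
    options described in the tutorial, not the platform's real code.
    """
    if quote is None:
        # 'none': quote characters are kept as ordinary characters
        reader = csv.reader(io.StringIO(text), delimiter=separator,
                            quoting=csv.QUOTE_NONE)
    else:
        # 'single' or 'double': the quote character is stripped from fields
        reader = csv.reader(io.StringIO(text), delimiter=separator,
                            quotechar=quote)
    rows = [row for row in reader if row]
    if not rows:
        raise ValueError("the count table is empty")
    if transpose:
        # samples were given by rows: flip so that samples become columns
        rows = [list(column) for column in zip(*rows)]
    if header:
        sample_ids, body = rows[0][1:], rows[1:]
    else:
        # no header row: create sample IDs (this naming scheme is an assumption)
        sample_ids = [f"Sample{i + 1}" for i in range(len(rows[0]) - 1)]
        body = rows
    counts = {}
    for row in body:
        tag_id, values = row[0], row[1:]
        if len(values) != len(sample_ids):
            raise ValueError(f"tag {tag_id!r} has {len(values)} values, "
                             f"expected {len(sample_ids)}")
        try:
            counts[tag_id] = [float(v) for v in values]
        except ValueError:
            raise ValueError(f"non-numeric count for tag {tag_id!r}") from None
    return sample_ids, counts


# Comma-separated table with double-quoted sample names
example = 'tag,"S1","S2"\ngeneA,10,0\ngeneB,3,7\n'
samples, table = read_count_table(example, separator=",", quote='"')
```

With the wrong separator the whole line is read as a single column, which is the situation in which the interface shows no per-column sorting buttons.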
3. Uploading metadata
3.1 When uploading the metadata, you can move the metadata file into the 'Drag & Drop' box (highlighted in yellow). Alternatively, if you would like to browse through your files, you can click on the 'Drag & Drop' box to open the file explorer. The metadata table indicates which groups/conditions are present in the experiment. It may contain data such as phenotypic features, clinical outcomes or experimental information.
3.2 In the file explorer (this tutorial uses Microsoft Windows as an OS), you can select and open the metadata table (tab-delimited .txt or .csv file).
3.3 As in the count table upload, the same setting options need to be adjusted for the metadata. Notice that, in the example below, the interface was not able to recognize the columns.
3.4 You can adjust the separator (comma, semicolon or tab) so that the columns and rows of the metadata table are correctly distinguished.
3.5 You can adjust the 'quote' variable from none to single or double. 'Double' (") results in defined column headings and a noticeable lack of quotation marks. At this point, you can submit your table or close the upload menu.
3.6 If the IDs in the metadata do not match the IDs in the count table, the highlighted message below is displayed.
3.7 When matching data is inputted or previous errors are successfully rectified, the highlighted message below is outputted.
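The ID check in steps 3.6 and 3.7 amounts to comparing the sample IDs of the two uploaded tables. The following is a minimal sketch of such a comparison, not the platform's actual validation; the function name match_sample_ids and the example IDs are hypothetical.

```python
def match_sample_ids(count_ids, metadata_ids):
    """Compare the sample IDs of the count table and the metadata table.

    Returns (shared, only_in_counts, only_in_metadata), each in the
    order in which the IDs appear in its source table.
    """
    count_set, metadata_set = set(count_ids), set(metadata_ids)
    shared = [s for s in count_ids if s in metadata_set]
    only_in_counts = [s for s in count_ids if s not in metadata_set]
    only_in_metadata = [s for s in metadata_ids if s not in count_set]
    return shared, only_in_counts, only_in_metadata


shared, only_counts, only_meta = match_sample_ids(
    ["S1", "S2", "S3"],   # sample IDs from the count table header
    ["S2", "S3", "S4"],   # sample IDs from the metadata table
)
if not shared:
    print("No sample ID in the metadata matches the count table.")
```

In this example 'S1' appears only in the count table, so it has no metadata to describe it and could not take part in the analysis.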
4. Statistical design
4.1 By clicking the text in the highlighted box, you can review the files you have uploaded at any time.
4.2 In this region, you will define the statistical model based on the variables presented in the metadata table. You should select the variables of interest to be included in the analysis. (A) You can add a new variable/field, (B) show all fields from the metadata table and (C) remove all statistical fields present.
4.3 By interacting with the highlighted section, you can choose a value for the variable.
4.4 Several options are available to you in order to modify the variable as needed. (A) Firstly, you can change the variable that corresponds to the columns of the metadata table. (B) Secondly, you can remove an effect from a confounding variable or a source of bias. (C) Thirdly, you can mark a variable as continuous or leave the box unchecked to leave the variable as categorical. (D) If the variable is categorical, you can then choose a baseline value that indicates a reference value. (E) Finally, you can remove the variable altogether.
4.5 Use the simple model to analyze differences between groups. For example, if the fields 'Treatment' and 'Gender' are chosen, an analysis will be carried out between the groups 'Female' and 'Male' as well as between 'Drug A' and 'Control'.
4.6 The highlighted error message occurs when unknown values (NA) are present. NA values are not acceptable; therefore, DiffExpress offers two options: 1) remove the variables in question or 2) remove the samples with NA values.
4.7 The interaction tab is used to study the combined effect of two or more variables. At least one column should be selected in the simple or interaction tab. You can also combine information from both tabs. We recommend that you include the information about confounding variables in the model. Additionally, you can select the values for the baseline (control) in the interaction tab. There are a number of options available to you here. (A) You can add a new interaction box. (B) You can include the variables whose combined effect you want to study, regardless of whether these variables are already present in the simple model. Each variable can be marked as continuous or left as categorical (the default), in which case it has a baseline. (C) Specify the baseline for the categorical variables. Make sure to always use the same baseline value for a given variable. (D) If there are more than two variables in the interaction, it is possible to remove them individually. (E) More variables can be added to the interaction and (F) you can remove an interaction in its entirety.
4.8 The highlighted error message is outputted when the same variable is defined as continuous in the simple model but has a baseline in the interaction model.
4.9 In the removed samples section (highlighted in yellow), you can see a history of the samples that had to be removed. (A) shows the samples removed from the count table because they were not present in the metadata table and (B) shows the samples removed because they had unknown values for the variable 'Age'.
4.10 In the history tab, you can see the recorded history of messages outputted by the platform when validating the files and statistical models.
4.11 In the statistics tab, you can adjust analysis features as needed. Options include whether to: (A) remove genes with low counts; (B) make the dispersion estimates robust against potential outliers and (C) compare the gene expression between every category within selected variables. Furthermore, (D) you can adjust the threshold criteria for plot generation; these include logFC and the adjusted p-value. (E) You can also reset the statistical parameters to their default values if necessary.
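The choices made in the statistical design (continuous or categorical variables, baselines, interactions and the handling of NA values) determine the columns of the model that is fitted. The sketch below illustrates the idea with treatment (dummy) coding, which is the default coding for categorical variables in R, the language in which edgeR is written. How DiffExpress builds its model internally is not described here, so the function names, the column naming and the NA markers below are all assumptions made for illustration.

```python
NA_VALUES = {"NA", ""}  # assumption: how unknown values appear in the metadata


def drop_na_samples(metadata, variables):
    """Option 2 of step 4.6: set aside samples with NA in any model variable."""
    kept, removed = [], []
    for sample, fields in metadata.items():
        has_na = any(fields.get(v, "NA") in NA_VALUES for v in variables)
        (removed if has_na else kept).append(sample)
    return kept, removed


def design_columns(name, values, continuous=False, baseline=None):
    """Model columns for one variable (an intercept column is implied)."""
    if continuous:
        return {name: [float(v) for v in values]}
    levels = sorted(set(values))
    if baseline is None:
        baseline = levels[0]
    if baseline not in levels:
        raise ValueError(f"baseline {baseline!r} is not a value of {name}")
    # one 0/1 indicator column per non-baseline category
    return {
        f"{name}{level}": [1.0 if v == level else 0.0 for v in values]
        for level in levels if level != baseline
    }


def interaction_columns(cols_a, cols_b):
    """Interaction terms: element-wise products of the two variables' columns."""
    return {
        f"{a}:{b}": [x * y for x, y in zip(va, vb)]
        for a, va in cols_a.items()
        for b, vb in cols_b.items()
    }


metadata = {
    "S1": {"Treatment": "Control", "Gender": "Female", "Age": "34"},
    "S2": {"Treatment": "Drug A", "Gender": "Male", "Age": "29"},
    "S3": {"Treatment": "Drug A", "Gender": "Female", "Age": "51"},
    "S4": {"Treatment": "Control", "Gender": "Male", "Age": "47"},
    "S5": {"Treatment": "Drug A", "Gender": "Male", "Age": "NA"},
}
kept, removed = drop_na_samples(metadata, ["Treatment", "Gender", "Age"])
treatment = design_columns("Treatment", [metadata[s]["Treatment"] for s in kept],
                           baseline="Control")
gender = design_columns("Gender", [metadata[s]["Gender"] for s in kept],
                        baseline="Female")
combined = interaction_columns(treatment, gender)
```

Here sample 'S5' is removed because its 'Age' is unknown, 'Control' and 'Female' act as baselines, and the interaction column is non-zero only for samples that are both 'Drug A' and 'Male'.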
1. Project Overview
1.1 In the project overview section, you can view the pipeline ID, the pipeline title and the date of pipeline submission. The analysis parameters are also displayed below the highlighted regions.
1.2 You can also view the multi-dimensional scaling (MDS) plots and see the background processes occurring within the Simplicity™ platform.
1.3 Highlighted below is an example of a Multi-Dimensional Scaling (MDS) plot. You can interact with the highlighted regions to view other MDS plots with different statistical measures.
2. Output explorer
2.1 In the output explorer, you can view heatmaps of the results obtained, view the total number of comparisons obtained and view each comparison individually. Moreover, you can select a specific filter for the comparisons.
2.2 When you click on a comparison, you are brought to a page with the analysis results related to that comparison. (A) The comparison is identified in the top left corner of the page, and you can also view (B) the MA plot and (C) heatmaps. (D) You can use the search box to retrieve information on a specific tag. (E) The table presents the DEA results (F) for each tag present in the count table.
2.3 Highlighted below is a heatmap, a method to visually search for patterns in tag expression that could provide information about the variables of interest. You can interact with the highlighted regions to view other heatmaps with different statistical measures.
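When reading the results of a comparison, the logFC and adjusted p-value thresholds set in the statistics tab decide which tags stand out. A minimal sketch of such a rule is shown below; the function name classify_tag and the default thresholds (an absolute logFC of 1 and an adjusted p-value of 0.05) are hypothetical values chosen for illustration, not the platform's documented defaults.

```python
def classify_tag(log_fc, adjusted_p, min_abs_log_fc=1.0, max_adjusted_p=0.05):
    """Label one tag from its logFC and adjusted p-value.

    The thresholds are illustrative assumptions, not DiffExpress defaults.
    """
    if adjusted_p <= max_adjusted_p and abs(log_fc) >= min_abs_log_fc:
        return "up" if log_fc > 0 else "down"
    return "not significant"


# Hypothetical result rows: (tag, logFC, adjusted p-value)
results = [("geneA", 2.3, 0.001), ("geneB", -1.5, 0.01), ("geneC", 0.4, 0.001)]
labels = {tag: classify_tag(lfc, p) for tag, lfc, p in results}
```

A tag needs to pass both criteria: 'geneC' has a small adjusted p-value but its change in expression is below the logFC threshold.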
3. Other options
3.1 You can download your pipeline results by clicking on 'download all files' and view citations and references by clicking on 'citation and references'.
Baseline: reference value (control) to which all other categories will be compared. If continuous, the baseline will automatically be the smallest value.
Confounding variable: a variable that can distort the association between the other variables being analyzed due to its strong relationship with one or more of them.
CPM: counts per million. The counts are scaled by the total number of reads for each sample.
DEA: differential expression analysis.
FDR: false discovery rate, a method to estimate the rate of type I errors in null hypothesis testing when conducting multiple comparisons.
logCPM: the log2 of the estimated mean read count per million for each tag.
logFC: the log2 of the ratio of the estimated average read counts between two levels of a categorical/nominal variable, or the estimated change for an increment of one unit of a numeric/continuous variable being analyzed.
Metadata: a set of data that describes and gives information about other data. In the DEA context, metadata describe the samples that are present in the count table.
Tag: the genomic feature of interest (genes, exons, etc.).
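The quantities defined above (counts per million, the log2 ratio between groups and the false discovery rate) can be illustrated with a few lines of arithmetic. This is a simplified sketch, not the edgeR implementation: edgeR additionally applies normalization factors and a prior count, and the Benjamini-Hochberg adjustment shown here is assumed to be the multiple-testing correction because it is edgeR's usual default. The function names are hypothetical.

```python
import math


def cpm(counts, library_sizes):
    """Counts per million: each count scaled by its sample's total reads."""
    return [c * 1_000_000 / n for c, n in zip(counts, library_sizes)]


def log_fc(mean_group, mean_baseline):
    """log2 of the ratio between a group mean and the baseline mean."""
    return math.log2(mean_group / mean_baseline)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value down, keeping the adjustment monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


scaled = cpm([10, 5], [1_000_000, 2_000_000])           # one tag in two samples
change = log_fc(8.0, 2.0)                               # four-fold increase
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

A logFC of 2 therefore corresponds to a four-fold higher average in the group than in the baseline, and a logFC of -1 to a halving.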