NGS-Trex Help


Introduction

Next-generation sequencing (NGS) technology has greatly increased the ability to sequence DNA in a massively parallel manner. Nevertheless, NGS data analysis requires bioinformatics skills that are still beyond the reach of many laboratories focused on “wet biology”. Moreover, many projects require only a few sequencing cycles and the ability to carefully inspect the resulting data to identify the genes, transcripts and splice variants present in the biological sample. These projects can benefit from easy-to-use systems that analyze sequences automatically and allow data mining without requiring strong bioinformatics knowledge.

To fill this gap we developed NGS-Trex (NGS TRanscriptome profile Explorer), an automatic system for the analysis of Next Generation Sequencing data obtained from large-scale transcriptome studies. The system is available through a simple web interface: the user uploads raw sequences and, after setting the few parameters required to tune the analysis procedure, obtains an accurate characterization of the transcriptome profile. The system can also assess differential expression at both the gene and transcript level (i.e. splicing isoforms) by comparing the expression profiles of different samples (e.g. normal and disease status).

The overall procedure involves three steps.

  • Data Submission: Creation of a project and upload of fasta/fastq sequences – called datasets – related to the project.
  • Analysis: Setup of analysis parameters. Although NGS-Trex can be run with default parameters, the analysis can be tuned by filling in simple forms. In this step you can define the pre-processing of the sequences, the criteria for mapping onto a reference genome, and the annotation rules used to compare reads to the genome annotation.
  • Data Mining: Through the query forms the user can obtain lists of genes, transcripts and splice sites, ranked and filtered according to several criteria. Data can be viewed as tables, as text files or through a simple genome browser that helps visual inspection.

Data Submission

To create a project and upload your data you need to be logged in. If you do not yet have an account, please contact us.

Once you are logged in you can view all your private projects and create a new one through the CREATE NEW PROJECT link of the main navigation bar.

Fill in the form by entering a name (spaces are not allowed), optionally a short description (max 60 characters) and the reference genome your project refers to. On form submission the main page of the newly created project opens. It includes: the project summary, with the project description and a summary of the datasets already analyzed; the list of datasets belonging to the project (empty for a newly created project); and the ADD DATASET form.

Go to the ADD DATASET panel to upload your raw sequences. Click the "Click here to upload file" link: a window will open listing the files you have previously uploaded via an FTP client (such as FireFTP or FileZilla), using your NGS-Trex username and password as credentials and 'www.ngs-trex.org' as Host. Select one or more fasta/fastq files to load (if you select more than one file, they are merged into a single sequence file) and go back to the form. Assign a label (spaces are not allowed) and optionally a short description to the dataset, then click the Add Dataset button.

For multiplexed samples, a demultiplexing procedure is provided. Set the "Multiplexed sequencing?" field to Y and provide a comma-separated label,barcode list. The system will split the reads into their respective sample datasets.
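The idea behind demultiplexing can be sketched as follows. This is only an illustration, not the NGS-Trex implementation: it assumes one label,barcode pair per line, barcodes located at the 5' end of each read, and exact matching; the function names are hypothetical.

```python
def parse_barcode_list(text):
    """Parse 'label,barcode' pairs, one pair per line (assumed layout)."""
    barcodes = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        label, barcode = (field.strip() for field in line.split(","))
        barcodes[barcode.upper()] = label
    return barcodes


def demultiplex(reads, barcodes):
    """Group reads by the barcode found at their 5' end (assumed position).

    reads: iterable of (read_id, sequence) tuples.
    Returns {label: [(read_id, sequence_without_barcode), ...]}; reads that
    match no barcode are collected under the key 'unassigned'.
    """
    datasets = {label: [] for label in barcodes.values()}
    datasets["unassigned"] = []
    for read_id, seq in reads:
        for barcode, label in barcodes.items():
            if seq.upper().startswith(barcode):
                # Strip the barcode and store the read in its sample dataset.
                datasets[label].append((read_id, seq[len(barcode):]))
                break
        else:
            datasets["unassigned"].append((read_id, seq))
    return datasets
```

For example, with the list "normal,ACGT" and "disease,TGCA", a read starting with ACGT would end up in the dataset labeled normal.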

The newly created dataset(s) will be shown in the DATASETS panel.

Analysis Procedure

To submit a dataset to the analysis procedure, click the Set Analysis Params link provided for each sample in the DATASETS panel of the project main page, and set the analysis parameters in the simple form that opens.

After submitting the form, click "submit job to analysis queue". The progress of each dataset analysis is shown in the corresponding status column of the DATASETS panel.

The analysis of NGS data is divided into 3 steps: (1) Pre-processing of sequences; (2) Mapping; (3) Annotation. In the pre-processing step the reads are filtered by quality (only for fastq files) and the user-specified 5' and 3' linkers/cloning sequences are trimmed off. Filtered reads are then aligned to the user-defined reference genome using one of two mapping programs, depending on read length:

- GMAP version 2012-07-20, a genomic mapping and alignment program for mRNA and EST sequences designed by Thomas D. Wu and Colin K. Watanabe (Bioinformatics 2005), for long reads such as 454 (Roche) sequences;

- TopHat v2.0.9, a fast read mapper for RNA-Seq reads designed by the University of Maryland Center for Bioinformatics and Computational Biology and the University of California, Berkeley Departments of Mathematics and Molecular and Cell Biology, for short reads such as Illumina or SOLiD sequences.

Mapped sequences that satisfy the user-defined thresholds on similarity, coverage and number of mapping positions over the genome are then compared to the genome annotation. Annotation is performed at both the gene and transcript level.

Reads mapping onto the gene region and satisfying the minimum overlapping criteria are assigned to the gene g and labeled as Genic (G).

Reads mapping onto a user-defined region surrounding the gene and not satisfying the minimum overlapping criteria are assigned to the gene g and labeled as Proximal (P).

Reads mapping outside the gene and its surrounding region are not assigned to any gene and are labeled as O.
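The three labels above can be illustrated with a small sketch. It is not the NGS-Trex code: it treats each read as a single interval (ignoring splicing), assumes the minimum overlap is expressed in bases, and uses hypothetical parameter names (min_overlap, flank).

```python
def classify_read(read_start, read_end, gene_start, gene_end,
                  min_overlap, flank):
    """Label a mapped read relative to one gene as 'G', 'P' or 'O'.

    Coordinates are 1-based inclusive positions on the same chromosome.
    min_overlap: minimum number of read bases inside the gene (assumed unit,
                 expected to be at least 1).
    flank: size of the surrounding region on each side of the gene.
    """
    # Number of bases shared by the read and the gene (<= 0 if disjoint).
    overlap = min(read_end, gene_end) - max(read_start, gene_start) + 1
    if overlap >= min_overlap:
        return "G"  # Genic
    # Does the read touch the gene extended by the flanking region?
    region_start, region_end = gene_start - flank, gene_end + flank
    if read_end >= region_start and read_start <= region_end:
        return "P"  # Proximal
    return "O"  # outside the gene and its surrounding region
```

With a gene spanning positions 1000 to 2000, a 500-base flank and a 20-base minimum overlap, a read at 1500-1600 is Genic, a read at 600-700 is Proximal and a read at 100-200 is labeled O.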

Genic reads mapping onto a RefSeq transcript with no insertions or deletions and satisfying user-defined criteria on trimming at 5' and 3' end of the alignment (A) and on extension of the alignment (B), are assigned to the transcript T and labeled as T.

[Figure: (A) trimming at the 5' and 3' ends of the alignment; (B) extension of the alignment]

If the assignment of reads to genes is ambiguous, the system tries (if requested by the user) to resolve the ambiguity in the following scenarios: (1) the read R aligns to two overlapping genes G1 and G2; (2) the read R aligns to multiple non-overlapping genes.

In case (1) R is assigned to gene G1 if: R is labeled as T for G1; R shares a splice site with the known transcript T1 of G1; or R is spliced, G1 and G2 are on opposite strands, and the orientation of G1 supports a canonical intron. In case (2) the read is assigned to those genes for which it is labeled as T.

In all other cases the ambiguity is not resolved, and reads are either assigned to all competing genes or discarded, depending on the user settings.
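The decision rules above can be summarized in a short sketch. The data structure and function names are hypothetical, and the handling of a read supported for both overlapping genes (left unresolved here) is an assumption, since that case is not described.

```python
from typing import NamedTuple


class GeneEvidence(NamedTuple):
    """Evidence linking one read to one candidate gene (hypothetical fields)."""
    gene: str
    labeled_t: bool = False                   # read is labeled T for this gene
    shares_known_splice_site: bool = False    # with a known transcript of the gene
    canonical_intron_on_strand: bool = False  # gene orientation supports a canonical intron


def resolve_overlapping(g1, g2, read_is_spliced, opposite_strands):
    """Case (1): the read aligns to two overlapping genes.

    Returns a one-element list with the chosen gene, or None if unresolved.
    """
    def supported(ev):
        return (ev.labeled_t
                or ev.shares_known_splice_site
                or (read_is_spliced and opposite_strands
                    and ev.canonical_intron_on_strand))

    winners = [ev.gene for ev in (g1, g2) if supported(ev)]
    # Assumption: resolved only when exactly one gene is supported.
    return winners if len(winners) == 1 else None


def resolve_non_overlapping(candidates):
    """Case (2): keep the genes for which the read is labeled T."""
    winners = [ev.gene for ev in candidates if ev.labeled_t]
    return winners or None
```

A return value of None stands for an unresolved ambiguity; the caller would then assign the read to all competing genes or discard it, according to the user settings.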

Besides the three steps described so far, the analysis procedure includes a post-processing step for the identification of significant features such as highly represented genes, putative new splice sites and differences in expression levels.

To assess the differential expression of genes/splice sites between two datasets within the same project, a cumulative hypergeometric distribution is computed. Specifically, the probability of a gene/splice site X being differentially expressed within a reference dataset A compared to a dataset B is computed from four read counts: N, the total number of reads supporting genes/splice sites within the two datasets; n, the total number of reads supporting the gene/splice site X within the two datasets; M, the number of reads supporting genes/splice sites within the dataset B; and m, the number of reads supporting the gene/splice site X within the dataset B. Over-representation is considered significant when the p-value is less than 0.01 or 0.05.
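As a rough illustration of this test, the sketch below computes a cumulative hypergeometric p-value from the four counts N, n, M and m. It assumes the upper-tail form P(X >= m), i.e. the sum over i from m to min(n, M) of C(M, i) * C(N - M, n - i) / C(N, n); the choice of tail and the function name are assumptions, not a statement of how NGS-Trex computes it.

```python
from math import comb


def hypergeom_upper_tail(N, n, M, m):
    """Upper-tail cumulative hypergeometric probability P(X >= m).

    N: total reads supporting genes/splice sites in the two datasets
    n: reads supporting feature X in the two datasets
    M: reads supporting genes/splice sites in dataset B
    m: reads supporting feature X in dataset B
    """
    upper = min(n, M)
    # Exact integer arithmetic; comb() returns 0 for impossible terms.
    numerator = sum(comb(M, i) * comb(N - M, n - i)
                    for i in range(m, upper + 1))
    return numerator / comb(N, n)
```

Exact binomial coefficients keep the sketch short but become slow for counts in the millions; a log-space implementation would be preferable for real read counts.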

The Benjamini-Hochberg correction for multiple testing is applied to the resulting p-values.
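For reference, a minimal sketch of the standard Benjamini-Hochberg step-up adjustment is shown below; the function name is hypothetical and the code is not taken from NGS-Trex.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(p_values)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted
```

Each raw p-value is multiplied by n/rank, where rank is its position in ascending order, and the running minimum ensures that adjusted values never decrease as the raw p-values decrease.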

Data Mining

Upon completion of the analysis process, the status column of the DATASETS table is set to "Done" and the results can be explored.

You can view a statistical summary of the analyzed data by clicking the statistics symbol corresponding to the dataset you are interested in. Alternatively, you can mine the data related to one or more selected datasets by clicking the QUERY RESULTS link in the project navigation bar at the top left of the page. Several mining tools are provided by the system:

  • Query gene: explore a specific gene by specifying its HUGO name or Entrez Gene Identifier and the organism it belongs to. This tool returns a table listing, for each selected dataset, the Entrez Gene Identifier (eg_id) and corresponding HUGO name of the specified gene, its sequence coverage, sequencing depth, focus index and RPKM value, and the number of annotated and putative new splice sites found by the analysis. It also provides a detailed list of the putative new splice sites and a profile of the differential expression of the selected gene across the datasets under examination.
  • Advanced Search: filter your data by sequence coverage, sequencing depth and focus index. It is possible to select only reads mapping on annotated exons or all reads mapping onto the entire gene region. This tool provides a list of all genes satisfying the filtering criteria in the selected samples. For each gene the table shows: the Entrez Gene Identifier (eg_id), the HUGO name, sequence coverage, sequencing depth, focus index and the RPKM value.
  • Differentially expressed genes: query results for genes differentially expressed between the selected datasets. This tool returns the list of differentially expressed (DE) genes in the reference sample compared to the other ones. For each DE gene the table shows: the Entrez Gene Identifier (eg_id), the HUGO name, the reference sample name, the other sample name, the number of reads supporting the gene in the reference (Ref count), the number of reads supporting the gene in the other samples (Other count), the p-value (pValue) assessing the statistical significance of the differential expression and the fold change.
  • Differentially expressed introns: query results for splice sites differentially expressed between the selected datasets. This tool returns the list of DE splice sites between the reference sample and the other ones. For each DE splice site the table shows: the chromosomal location (chromosome name, chromosome start, chromosome end), the Entrez Gene Identifier (eg_id) and the HUGO name of the gene it belongs to, the reference and the other sample names, the number of reads supporting the gene in both the reference and the other samples (Ref count and Other count), the number of reads supporting the DE splice site in both the reference and the other samples, the associated pValue and the fold change.
  • New Introns: query results for new putative splice sites. By using this tool you get a list showing for each unannotated splice site: the chromosomal location (chromosome name, chromosome start, chromosome end), the strand, the Entrez Gene Identifier (eg_id) and the HUGO name of the gene it belongs to, the sequence around the 5' splice site (Donor) and the 3' splice site (Acceptor) and the donor/acceptor dinucleotides (D/A).
  • Transcripts: retrieve RefSeq transcripts represented in the selected samples. By using this tool you get a table reporting for each transcript: the Entrez Gene Identifier (eg_id) and the HUGO name of the gene, the RefSeq accession number identifying the transcript, the rna length (rna_length), the rna strand, the coding sequence location on the transcript (cds start and cds end), the number of reads supporting the transcript and the mapping position of reads related to the transcript.
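Several of the tables above report an RPKM value. A minimal sketch, assuming the standard definition of RPKM (reads per kilobase per million mapped reads), is given below. Which length (whole gene or annotated exons only) and which read total NGS-Trex uses are not stated here, so the parameter names are assumptions.

```python
def rpkm(feature_reads, total_mapped_reads, feature_length_bp):
    """Reads Per Kilobase per Million mapped reads (standard definition).

    feature_reads: reads assigned to the gene or transcript
    total_mapped_reads: all mapped reads in the dataset (assumed denominator)
    feature_length_bp: length of the gene or transcript in base pairs
    """
    return feature_reads * 1e9 / (total_mapped_reads * feature_length_bp)
```

For instance, 500 reads on a 2,000 bp feature in a dataset of 10 million mapped reads correspond to an RPKM of 25.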

Once you have filtered your results through the query forms and obtained the corresponding table, you can download it as a text file by clicking the download symbol at the top right of the RESULTS page. You can also visualize your data through a simple genome browser, accessible via the Entrez Gene Identifier link of the gene you are interested in.


To get an overview of your results, click the statistics symbol corresponding to the dataset you are interested in. You get a summary of the results obtained at each step of the analysis workflow. Specifically, you can view the list of parameters you set for the analysis, the graphical report of the pre-processing step (only for fastq files), the statistical summaries of the trimming, mapping and annotation procedures, and a summary table showing the specific features identified by the downstream analysis.