7. Wiki NGS & Bioinformatics¶
To be completed
7.1. Glossary¶
We define hereafter a series of abbreviations, terms and concepts that appear recurrently in the literature on NGS analysis. This document aims to support the interpretation of analysis reports.
Other resources:
7.1.1. A¶
7.1.2. B¶
7.1.3. C¶
- ChIP-exo:
- ChIP-seq:
- cigar: alignment (more).
- Cloud:
- Copy number variation:
- Core:
7.1.4. D¶
- DEG: Differentially Expressed Gene.
7.1.5. E¶
- e-value (E): indicates the number of false positives expected by chance, for a given p-value threshold. Since this number can exceed 1, it is not a probability, and therefore not a p-value.
E = <FP> = P . m
Where m is the number of tests (e.g. genes), FP the number of false positives, the notation < > denotes the random expectation, and P is the nominal p-value of the considered gene.
Note that the e-value is a positive number ranging from 0 to m (number of tests). It is thus not a p-value, since probabilities are by definition comprised between 0 and 1.
7.1.6. F¶
- Family-wise error rate (FWER): indicates the probability of observing at least one false positive among the multiple tests.
FWER = P(FP >= 1)
- fastq (file format): raw sequences + quality (more).
- False discovery rate (FDR): indicates the expected proportion of false positives among the cases declared positive. For example, if a differential analysis reports 200 differentially expressed genes with an FDR threshold of 0.05, we should expect 0.05 x 200 = 10 false positives among them.
7.1.7. G¶
- genome (file format):
- genomic input:
- gff (file format): genome feature file - annotations (more). See also gtf.
- gtf (file format): variant of GFF, with two fields for annotation (more).
- gtf2 (file format): Gene annotation (more).
- GSM: Gene Expression Omnibus Sample identifier.
- GSE: Gene Expression Omnibus Series identifier (a collection of samples related to the same publication or topic).
7.1.8. H¶
7.1.9. I¶
- input: In peak-calling, the word “input” is used in a very specific sense: it designates a set of sequences used to estimate the read densities expected by chance as a function of the genomic position. The typical methods are the genomic input (currently the most widely used) and the mock.
7.1.10. J¶
7.1.11. K¶
7.1.12. L¶
- Library: Term sometimes used ambiguously, depending on the context. Biologists refer to a DNA library to designate … (to be defined). Bioinformaticians speak of a sequence library to designate the set of reads obtained by sequencing a single sample. Computer scientists use “library” for code modules that group a series of functions and procedures.
7.1.13. M¶
- m: number of tests in a multiple-testing schema (e.g. number of genes in differential expression analysis).
- Mapped read:
- Mapping: Identifying genomic positions for the raw reads of a sequence library.
- mock: type of control for the peak-calling in ChIP-seq. It is an input obtained by using a non-specific antibody (e.g. anti-GFP) for the immunoprecipitation, in order to estimate the rate of non-specific sequencing for each genomic region. The advantage of the mock is that it is a control performed under the same conditions as the specific ChIP-seq. Its weakness is that library sizes are sometimes so small that the background estimate is far from robust.
- motif:
- Multiple testing: the multiple testing problem arises from the application of a given statistical test to a large number of cases. For example, in differential expression analysis, each gene/transcript is submitted to a test of equality between two conditions. A single analysis thus typically involves several tens of thousands of tests. The general problem of multiple testing is that the risk of a false positive indicated by the nominal p-value is incurred anew for each tested element, so that false positives accumulate over the whole series of tests. Various types of corrections for multiple testing have been defined (Bonferroni, e-value, FWER, FDR).
7.1.14. N¶
- Negative control:
- NGS: Next Generation Sequencing.
7.1.15. O¶
7.1.16. P¶
- p-value (P): the nominal p-value is the p-value attached to one particular element in a series of multiple tests. For example, in differential analysis, one nominal p-value is computed for each gene. This p-value indicates the probability of obtaining, under the null hypothesis (i.e. in the absence of regulation), an effect at least as large as the one observed.
- padj (abbr.): adjusted p-value. Statistic derived from the nominal p-value in order to correct for the effects of multiple testing (see Bonferroni correction, e-value).
The most commonly used correction is the FDR, which can be estimated in various ways.
- Paired end:
- Peak:
- Peak-calling:
- pileup (file format): base-pair information at each chromosomal position (more).
7.1.17. Q¶
- q-value:
7.1.18. R¶
- RAM:
- Raw read: non-aligned read.
- Read: short sequence (typically 25-75bp) obtained by high-throughput sequencing.
- Region-calling:
- Replicate: … distinguish technical replicates from biological replicates
- RNA-seq:
7.1.19. S¶
- sam (file format): aligned reads (more).
- Single end:
- Single nucleotide polymorphism:
- SRA: Sequence Read Archive. Database maintained by the NCBI.
- SRX: Short Read Experiment. See documentation.
- SRR: Short Read Run. See documentation.
7.1.20. T¶
7.1.21. U¶
7.1.24. X¶
7.1.25. Y¶
7.1.26. Z¶
7.2. Notes on multiple testing corrections¶
7.2.1. The problem with multiple tests¶
The multiple testing problem arises from the application of a given statistical test to a large number of cases. For example, in differential expression analysis, each gene/transcript is submitted to a test of equality between two conditions. A single analysis thus typically involves several tens of thousands of tests.
The general problem of multiple testing is that the risk of a false positive indicated by the nominal p-value is incurred anew for each tested element, so that false positives accumulate over the whole series of tests.
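As a rough illustration of the scale of the problem, the following Python sketch simulates an analysis in which no gene is regulated at all. It assumes that p-values are uniformly distributed under the null hypothesis; the number of tests (20,000) and the threshold (0.05) are arbitrary values chosen for the example, and the function name is hypothetical.

```python
import random


def count_false_positives(m, alpha, seed=0):
    """Simulate m tests under the null hypothesis and count p-values <= alpha."""
    rng = random.Random(seed)
    # Assumption: under the null hypothesis, p-values are uniform on [0, 1).
    p_values = [rng.random() for _ in range(m)]
    return sum(1 for p in p_values if p <= alpha)


m = 20000     # hypothetical number of genes tested
alpha = 0.05  # nominal p-value threshold

expected = alpha * m
observed = count_false_positives(m, alpha)
print(f"Expected false positives:  {expected:.0f}")
print(f"Simulated false positives: {observed}")
```

Every gene counted here is a false positive, since the simulation contains no true effect. The simulated count fluctuates around the expected value \(\alpha \cdot m\) from one random seed to the next.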
7.2.2. P-value and derived multiple testing corrections¶
7.2.3. P-value (nominal p-value)¶
The nominal p-value is the p-value attached to one particular element in a series of multiple tests. For example, in differential analysis, one nominal p-value is computed for each gene. This p-value indicates the probability of obtaining, under the null hypothesis (i.e. in the absence of regulation), an effect at least as large as the one observed.
7.2.4. Bonferroni correction¶
7.2.5. E-value¶
The e-value indicates the number of false positives expected by chance, for a given p-value threshold.
\(E = \langle FP \rangle = P \cdot m\)
Where \(m\) is the number of tests (e.g. genes), \(FP\) the number of false positives, the notation \(\langle \; \rangle\) denotes the random expectation, and \(P\) is the nominal p-value of the considered gene.
Note that the e-value is a positive number ranging from \(0\) to \(m\) (number of tests). It is thus not a p-value, since probabilities are by definition comprised between 0 and 1.
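The e-value formula translates directly into code. The sketch below is only an illustration: the function name `e_value` and the number of tests are hypothetical choices, not part of any particular tool.

```python
def e_value(p_value, m):
    """Expected number of false positives: E = P * m."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    return p_value * m


m = 20000  # hypothetical number of genes tested
for p in (1e-6, 1e-4, 0.01):
    print(f"P = {p:g}  ->  E = {e_value(p, m):g}")
```

With 20,000 tests, a nominal p-value of 0.01 corresponds to about 200 expected false positives, which shows concretely why the e-value cannot be read as a probability.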
7.2.6. Family-wise error rate (FWER)¶
The Family-Wise Error Rate (FWER) indicates the probability of observing at least one false positive among the multiple tests.
\(FWER = P(FP \geq 1)\)
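The definition does not say how this probability is computed. A minimal sketch, under the additional assumption (not made in the text) that the \(m\) tests are independent and that each test is declared positive when its nominal p-value falls below a threshold \(\alpha\): the probability of observing no false positive is then \((1 - \alpha)^m\), which gives \(FWER = 1 - (1 - \alpha)^m\). The function name is hypothetical.

```python
def fwer_independent(alpha, m):
    """FWER = P(FP >= 1) = 1 - (1 - alpha)**m, assuming m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


# The FWER grows quickly with the number of tests at a fixed threshold.
for m in (1, 10, 100):
    print(f"m = {m:>3}  FWER = {fwer_independent(0.05, m):.3f}")
```

Under this assumption, a threshold of 0.05 applied to 100 tests already makes at least one false positive almost certain. With correlated tests (as is common for genes), the value differs, so this should be read as an order of magnitude only.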
7.2.7. False Discovery Rate (FDR)¶
The False Discovery Rate (FDR) indicates the expected proportion of false positives among the cases declared positive. For example, if a differential analysis reports 200 differentially expressed genes with an FDR threshold of 0.05, we should expect \(0.05 \cdot 200 = 10\) false positives among them.
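The FDR can be estimated in several ways, and this page does not prescribe one. As one common choice, the Benjamini-Hochberg procedure can be sketched as follows; the function name and the example p-values are made up for the illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted in increasing order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, so that adjusted
    # values never decrease with increasing nominal p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_values = [0.001, 0.008, 0.039, 0.041, 0.60]  # made-up nominal p-values
padj = benjamini_hochberg(p_values)
# Keep the tests whose adjusted p-value passes an FDR threshold of 0.05.
significant = [p for p, q in zip(p_values, padj) if q <= 0.05]
print(padj)
print(significant)
```

In this example, only the two smallest p-values pass the 0.05 threshold after adjustment, although four of the five nominal p-values are below 0.05.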
7.2.8. What is an adjusted p-value?¶
An adjusted p-value is a statistic derived from the nominal p-value in order to correct for the effects of multiple testing.
Various types of corrections for multiple testing have been defined (Bonferroni, e-value, FWER, FDR). Note that some of these corrections are not actual “adjusted p-values”.
- the original Bonferroni correction consists in adapting the \(\alpha\) threshold rather than correcting the p-value.
- the e-value is a number that can exceed 1; it is thus not a probability, and therefore not a p-value.
The most commonly used correction is the FDR, which can be estimated in various ways.
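To make the difference between adapting the threshold and adjusting the p-value concrete, the sketch below applies both forms of the Bonferroni correction to the same made-up p-values. The threshold form \(\alpha / m\) is the usual statement of the original correction. The adjusted form \(\min(1, P \cdot m)\) is the convention commonly found in statistical software; it is an assumption here, since the text does not define it. Function names are hypothetical.

```python
def bonferroni_threshold(alpha, m):
    """Original Bonferroni correction: adapt the threshold to alpha / m."""
    return alpha / m


def bonferroni_adjust(p_values):
    """Adjusted p-values: multiply by m, capped at 1 to remain a probability."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


p_values = [0.0001, 0.004, 0.03, 0.5]  # made-up nominal p-values
alpha = 0.05

# Decision 1: compare nominal p-values to the adapted threshold.
by_threshold = [p <= bonferroni_threshold(alpha, len(p_values)) for p in p_values]
# Decision 2: compare adjusted p-values to the original threshold.
by_adjustment = [q <= alpha for q in bonferroni_adjust(p_values)]
print(by_threshold)
print(by_adjustment)
```

Both approaches select the same tests. Without the cap at 1, the product \(P \cdot m\) is exactly the e-value, which is why the e-value is not itself an adjusted p-value.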
7.3. Useful links¶
7.3.1. Versioning, code sharing¶
7.3.3. Miscellaneous¶
- QC Fail Sequencing
- FastQC results interpretation
- A Wikipedia list of sequence alignment software
- Genome sizes for common organisms
- A list of formats maintained by the UCSC
- The IFB cloud and its documentation
- A catalogue of NGS-related tools: Sequencing (OmicTools)
- Elixir’s Tools and Data Services Registry.
- Wikipedia list of biological databases
7.4. Bibliography¶
7.4.1. ChIP-seq guidelines¶
- Bailey et al., 2013. Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data.
- ENCODE & modENCODE consortia, 2012. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia.
7.4.2. Tutorials¶
7.4.2.1. French¶
- Thomas-Chollier et al. 2012. A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs.
- TODO add JvH & MTC tutos
- TODO Roscoff bioinformatics school: link
- RNA-seq tutorial