Bioinformatics Tutorial

3.1.GO


1) Pipeline

2) Data Structure

ENSG00000001036
ENSG00000003756
ENSG00000008018
ENSG00000012048
ENSG00000043355
ENSG00000074755
ENSG00000079616
ENSG00000089280
ENSG00000100591
ENSG00000100941
ENSG00000101109
ENSG00000101974
ENSG00000104611
ENSG00000104738
ENSG00000105738
ENSG00000113318
ENSG00000114867
ENSG00000116221
ENSG00000116857
ENSG00000117724
ENSG00000119285
ENSG00000121774
ENSG00000127663
ENSG00000127884
ENSG00000128159
ENSG00000129187
ENSG00000130640
ENSG00000131473
ENSG00000134287
ENSG00000134644
ENSG00000136628
ENSG00000137273
ENSG00000146263
ENSG00000153187
ENSG00000160285
ENSG00000164818
ENSG00000164944
ENSG00000167325
ENSG00000167548
ENSG00000170448
ENSG00000179632
ENSG00000183207
ENSG00000187954
ENSG00000196700
ENSG00000196924
ENSG00000198604
ENSG00000198886
ENSG00000198899
ENSG00000206503
ENSG00000223609
ENSG00000272822
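Before uploading a list like the one above, it can help to sanity-check it. The sketch below is only an illustration: the function name `load_gene_ids` is hypothetical, and the pattern assumes human Ensembl gene IDs of the form `ENSG` followed by 11 digits, without a version suffix.

```python
import re

# Assumption: human Ensembl gene IDs without a version suffix (ENSG + 11 digits).
ENSEMBL_GENE_ID = re.compile(r"ENSG\d{11}")


def load_gene_ids(lines):
    """Split an iterable of text lines into (valid_ids, rejected).

    Blank lines are skipped. Duplicates and entries that do not look
    like Ensembl gene IDs are collected in `rejected`.
    """
    valid, rejected, seen = [], [], set()
    for line in lines:
        gene_id = line.strip()
        if not gene_id:
            continue
        if ENSEMBL_GENE_ID.fullmatch(gene_id) and gene_id not in seen:
            seen.add(gene_id)
            valid.append(gene_id)
        else:
            rejected.append(gene_id)
    return valid, rejected


# Small in-memory sample: two IDs from the list above, a blank line,
# a duplicate, and a gene symbol that is not an Ensembl ID.
sample = [
    "ENSG00000001036\n",
    "ENSG00000003756\n",
    "\n",
    "ENSG00000001036\n",
    "BRCA1\n",
]
valid, rejected = load_gene_ids(sample)
print(len(valid), "valid,", len(rejected), "rejected")
```

For a real file, an open file handle can be passed in place of `sample`, for example `with open("GO_gene.txt") as fh: valid, rejected = load_gene_ids(fh)`.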

2a) Inputs

| File format | Information contained in file | File description | Notes |
| --- | --- | --- | --- |
| txt | Ensembl gene IDs | The file contains the Ensembl gene IDs, one per line | - |

2b) Outputs

| File format | Information contained in file | File description | Notes |
| --- | --- | --- | --- |
| txt | Output information | The gene ontology of each gene | - |

3) Running Steps

3a) Input gene name

3b) Output the results

| | Reference list | User upload |
| --- | --- | --- |
| Mapped IDs: | 21042 out of 21042 | 50 out of 50 |
| Unmapped IDs: | 0 | 1 |
| Multiple mapping information: | 0 | 0 |

3c) Display the results

We only display results with False Discovery Rate (FDR) < 0.05.

| GO term | Reference | Input number | expected | Fold Enrichment | +/- | raw P value | FDR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DNA replication | 208 | 6 | 0.49 | 12.14 | + | 1.11E-05 | 1.25E-02 |
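The FDR reported above is a P value adjusted for the fact that many GO terms are tested at once. The page does not say which correction the website applies, so the sketch below assumes the Benjamini-Hochberg procedure, a common choice. The function name and the raw P values are hypothetical and only illustrate the mechanics.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, so each adjusted value is
    # never larger than the adjusted value of the next-ranked test.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P values for five GO terms (not taken from the example).
raw_p = [0.001, 0.008, 0.039, 0.041, 0.6]
fdr = benjamini_hochberg(raw_p)
significant = [i for i, q in enumerate(fdr) if q < 0.05]
print(fdr)
print(significant)
```

In this made-up input, four terms have a raw P value below 0.05, but only the first two remain below 0.05 after adjustment.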

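  • Comparing against the database shows that, of the 21042 genes in the database's reference genome, 208 are annotated with DNA replication; of the 50 recognized genes uploaded by the user, 6 are annotated with DNA replication.

  • expected: 0.4942 = 208*50/21042

  • Fold Enrichment: 12.14 = 6/0.4942

  • +/-: enrichment is marked with "+"

  • raw P value: calculated from the following four counts:

    • N: the number of genes of the organism that are annotated with GO, or the number of genes in the user-provided background. Here N = 21042.

    • n: the number of genes in the query list that are mapped to the background. Here n = 50.

    • K: the number of genes annotated with the GO term. Here K = 208.

    • k: the number of genes in the query list that are mapped to the GO term. Here k = 6.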
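The arithmetic above can be checked with a short script. For the raw P value, the sketch assumes a one-sided hypergeometric (over-representation) test, that is, the probability of drawing at least k annotated genes: the sum over i from k to min(n, K) of C(K, i) * C(N-K, n-i) / C(N, n). This is the usual test for this setup, but the website may use a different exact test, so small differences from the reported value are possible. The function name `hypergeom_enrichment` is hypothetical.

```python
from math import comb


def hypergeom_enrichment(N, K, n, k):
    """Return (expected, fold_enrichment, p_value) for one GO term.

    N: genes in the background        K: background genes in the GO term
    n: mapped genes in the query      k: query genes in the GO term
    """
    expected = K * n / N
    fold_enrichment = k / expected
    # Upper tail P(X >= k) of the hypergeometric distribution.
    # Exact integer arithmetic is used until the final division.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    p_value = tail / comb(N, n)
    return expected, fold_enrichment, p_value


expected, fold, p = hypergeom_enrichment(N=21042, K=208, n=50, k=6)
print(f"expected        = {expected:.4f}")
print(f"fold enrichment = {fold:.2f}")
print(f"raw P value     = {p:.2e}")
```

The expected count and fold enrichment reproduce 0.4942 and 12.14. The P value should come out on the order of 1E-05, close to but not necessarily identical to the 1.11E-05 in the table.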

4) Tips/Utilities

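The example input GO_gene.txt contains the Ensembl IDs of a set of human genes. To reproduce the example, download it as described in "How to obtain the files" and open http://geneontology.org/.

Some other web tools and R packages are also commonly used for enrichment analysis; interested readers can explore them on their own:

  • Web tools: metascape, gprofiler, David (the [3.2.kegg](https://book.ncrnalab.org/teaching/part-ii.-basic-analyses/3.function-analysis/3.2.kegg) section of this tutorial is based on David, for reference)

  • R package: clusterProfiler

5) Homework

  1. In the example above, how are Fold Enrichment and P value calculated? Write out the formulas and explain the principle. In addition, when defining significantly enriched GO terms, why do we usually compute an FDR and use it as the criterion instead of relying on the size of the P value?

  2. From wt.light.vs.dark.all.txt (the wild-type result we obtained in the Differential Expression section), select the significantly up-regulated genes (FDR<0.05, logFC>1) and perform a GO analysis on them.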
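Selecting the up-regulated genes from wt.light.vs.dark.all.txt could look like the sketch below. The layout of that file is not shown on this page, so the tab delimiter and the column names `gene_id`, `logFC` and `FDR` are assumptions to adjust to the real header. The function name and the sample rows are hypothetical.

```python
import csv
import io


def select_upregulated(handle, fdr_cutoff=0.05, logfc_cutoff=1.0,
                       id_column="gene_id"):
    """Return IDs of rows with FDR < fdr_cutoff and logFC > logfc_cutoff.

    Assumes a tab-separated table with a header row containing the
    columns `id_column`, "logFC" and "FDR" (assumed names).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    selected = []
    for row in reader:
        if float(row["FDR"]) < fdr_cutoff and float(row["logFC"]) > logfc_cutoff:
            selected.append(row[id_column])
    return selected


# Hypothetical rows standing in for the real differential expression table.
example = io.StringIO(
    "gene_id\tlogFC\tFDR\n"
    "geneA\t2.5\t0.001\n"   # up-regulated and significant
    "geneB\t-3.0\t0.0005\n"  # significant but down-regulated
    "geneC\t1.8\t0.2\n"     # up-regulated but not significant
    "geneD\t1.2\t0.04\n"    # up-regulated and significant
)
up_genes = select_upregulated(example)
print("\n".join(up_genes))
```

With the real file, an open file handle replaces `example`, and the printed IDs can be pasted into the GO analysis website.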