3.1.GO

1) Pipeline

2) Data Structure

Following the file download instructions, download the GO_gene.txt file, which contains a set of Ensembl IDs for human genes.

2a) Inputs

| File format | Information contained in file | File description | Notes |
| --- | --- | --- | --- |
| txt | Ensembl gene IDs | The file contains the Ensembl gene IDs | - |
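Before uploading, it can help to check that every line of the input really looks like a human Ensembl gene ID. The sketch below assumes one ID per line and the usual `ENSG` plus 11 digits form (with an optional version suffix). The function name `read_gene_ids` and the example IDs are illustrative only and are not taken from GO_gene.txt.

```python
import re

# Human Ensembl gene IDs: "ENSG" + 11 digits, optionally followed by ".version".
ENSG_PATTERN = re.compile(r"^ENSG\d{11}(\.\d+)?$")


def read_gene_ids(lines):
    """Split an iterable of lines into (valid_ids, rejected_ids), dropping blanks and duplicates."""
    valid, rejected, seen = [], [], set()
    for line in lines:
        gene_id = line.strip()
        if not gene_id or gene_id in seen:
            continue  # skip empty lines and repeated IDs
        seen.add(gene_id)
        if ENSG_PATTERN.match(gene_id):
            valid.append(gene_id)
        else:
            rejected.append(gene_id)
    return valid, rejected


# Illustrative input: one valid ID (repeated), one gene symbol, one blank line.
example_lines = ["ENSG00000141510\n", "ENSG00000141510\n", "TP53\n", "\n"]
valid_ids, rejected_ids = read_gene_ids(example_lines)
print(valid_ids)
print(rejected_ids)
```

For the real file, the same function can be called on an open file handle, for example `with open("GO_gene.txt") as fh: valid_ids, rejected_ids = read_gene_ids(fh)`.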

2b) Outputs

| File format | Information contained in file | File description | Notes |
| --- | --- | --- | --- |
| txt | Output information | The gene ontology of each gene | - |

3) Running Steps

Open http://geneontology.org/

3a) Input gene name

3b) Output the results

|  | Reference list | User upload |
| --- | --- | --- |
| Mapped IDs: | 21042 out of 21042 | 50 out of 50 |
| Unmapped IDs: | 0 | 1 |
| Multiple mapping information: | 0 | 0 |

3c) Display the results

We only display results with False Discovery Rate (FDR) < 0.05.

| GO term | Reference | Input number | expected | Fold Enrichment | +/- | raw P value | FDR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DNA replication | 208 | 6 | 0.49 | 12.14 | + | 1.11E-05 | 1.25E-02 |

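  • Comparing against the database shows that, of the 21042 genes in the database's reference genome, 208 are annotated to DNA replication, and of the 50 recognized genes uploaded by the user, 6 are annotated to DNA replication.

  • expected: 0.4942 = 208 × 50 / 21042, the number of DNA replication genes expected among the 50 uploaded genes by chance.

  • Fold Enrichment: 12.14 = 6 / 0.4942, the observed count divided by the expected count.

  • +/-: enrichment is marked with "+".

  • raw P value: can be computed with the formula below.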

[Figure: goout2 (formula for the raw P value)]
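Assuming the formula referred to here is the standard one-sided hypergeometric test, which the symbols N, n, K and k defined below suggest, the raw P value is the probability of seeing at least k annotated genes when n genes are drawn without replacement from the N background genes:

$$
p = P(X \ge k) = 1 - \sum_{i=0}^{k-1} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}
$$

The website may use a different test or a newer annotation release, so a value computed this way need not match the reported 1.11E-05 exactly.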

  • N: the number of genes of the organism that are annotated with GO, or the number of genes in the user-provided background. Here N = 21042.

  • n: the number of genes in the query list that map to the background. Here n = 50.

  • K: the number of background genes annotated to the GO term. Here K = 208.

  • k: the number of genes in the query list that are annotated to the GO term. Here k = 6.
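The arithmetic for the DNA replication row can be checked with a few lines of Python. This is a minimal sketch that assumes the raw P value comes from a one-sided hypergeometric test. The helper name `hypergeom_upper_tail` is made up for this example, and the tail probability it prints is not guaranteed to equal the website's reported value.

```python
from math import comb

# Values taken from the DNA replication row of the results table.
N = 21042  # genes in the reference list
K = 208    # reference genes annotated to the GO term
n = 50     # uploaded genes mapped to the reference
k = 6      # uploaded genes annotated to the GO term

# Count expected by chance, and observed/expected ratio.
expected = K * n / N
fold_enrichment = k / expected


def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) when n genes are drawn without replacement from N, of which K carry the term."""
    # Sum the upper tail with exact integers, then divide once at the end.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return favourable / comb(N, n)


p_value = hypergeom_upper_tail(N, K, n, k)

print(f"expected        = {expected:.4f}")
print(f"fold enrichment = {fold_enrichment:.2f}")
print(f"P(X >= {k})       = {p_value:.3e}")
```

Summing the upper tail directly avoids subtracting a number very close to 1, which matters when the P value is small.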

4) Tips/Utilities

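Other web tools and R packages are also commonly used for enrichment analysis; interested readers can explore them on their own:

  • Web tools:

    • DAVID (section [3.2.kegg](https://book.ncrnalab.org/teaching/part-ii.-basic-analyses/3.function-analysis/3.2.kegg) of this tutorial is implemented with DAVID, for reference)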

5) Homework

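  1. From wt.light.vs.dark.all.txt (the wild-type result obtained in the differential expression section), select the significantly up-regulated genes (FDR<0.05, logFC>1) and run a GO analysis on them.

  2. In the example above, how are Fold Enrichment and P value calculated? Write out the formulas and explain the principle. In addition, when defining significantly enriched GO terms, why do we generally compute an FDR as the criterion instead of relying on the P value itself?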
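As a starting point for the first task, the filtering step might look like the sketch below. It assumes the table is tab-delimited with a header row and columns named `gene_id`, `logFC` and `FDR`. These column names, the function `select_upregulated` and the demo rows are assumptions for illustration, so adjust them to the actual header of wt.light.vs.dark.all.txt.

```python
import csv
import io


def select_upregulated(handle, fdr_cutoff=0.05, logfc_cutoff=1.0):
    """Return gene IDs with FDR < fdr_cutoff and logFC > logfc_cutoff from a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    selected = []
    for row in reader:
        # Column names are assumed; rows with non-numeric values would raise ValueError here.
        if float(row["FDR"]) < fdr_cutoff and float(row["logFC"]) > logfc_cutoff:
            selected.append(row["gene_id"])
    return selected


# Made-up demo table standing in for the real differential expression result.
demo_table = (
    "gene_id\tlogFC\tFDR\n"
    "geneA\t2.3\t0.001\n"   # up-regulated and significant: kept
    "geneB\t0.4\t0.001\n"   # significant but logFC too small
    "geneC\t3.0\t0.2\n"     # large logFC but not significant
    "geneD\t-2.0\t0.001\n"  # significant but down-regulated
)

up_genes = select_upregulated(io.StringIO(demo_table))
print(up_genes)  # ['geneA']
```

For the real file, pass an open file handle instead of the `io.StringIO` object, then write the selected IDs one per line for upload.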
