首先计算基因之间的相关系数,构建基因网络(correlation network of genes),然后将具有相似表达模式的基因划分成模块(module)。随后计算各个模块与样本表型数据之间的相关性,对特定的感兴趣的模块分析核心基因(hub gene,通常是转录因子等关键的调控因子),并将特定模块的基因提取出来,进行GO/KEGG等分析。
RPKM是Reads Per Kilobase per Million mapped reads的缩写,代表每百万reads中来自于某基因每千碱基长度的reads数。RPKM是将map到基因的read数除以map到基因组上的所有read数(以million为单位)与RNA的长度(以KB为单位)。这是一个衡量基因表达丰度的单位。
# ENSG00000210082 ENSG00000198712 ENSG00000198804 ENSG00000210845
#GSM1172844 78053.20 103151.73 112917.53 92808.32
#GSM1172845 96200.86 157203.85 163847.92 93501.17
#GSM1172846 18259.11 40704.97 13357.02 75183.17
#GSM1172847 33184.15 43673.63 15360.50 91278.61
#GSM1172848 35255.48 55018.26 17715.40 65460.84
#### each row represents a cell line (sample), each column represents the fpkm value of a gene
#[1] 56 5000
#### 56 cell lines (samples), 5000 genes
#In R:datTraits[1:4,]dim(datTraits)
A brief look of the datTraits:
# gsm cellline subtype
#GSM1172844 GSM1172844 184A1 Non-malignant
#GSM1172845 GSM1172845 184B5 Non-malignant
#GSM1172846 GSM1172846 21MT1 Basal
#GSM1172847 GSM1172847 21MT2 Basal
#### each row represents a cell line (sample), each column provides the gsm number, the cell line name and the cell line subtype information
#[1] 56 3
#### 56 cell lines (samples)
#### The rownames of datExpr and datTraits are matched.
3d) Pick the soft thresholding power
#In R:options(stringsAsFactors =FALSE)#open multithreading enableWGCNAThreads()powers =c(c(1:10), seq(from=12, to=20, by=2))#Call the network topology analysis function,choose a soft-threshold to fit a scale-free topology to the networksft=pickSoftThreshold(datExpr, powerVector = powers, verbose =5)
#In R:pdf(file="/home/bioc/soft_thresholding.pdf",width=9, height=5)#Plot the results:par(mfrow =c(1,2))cex1 =0.9# Scale-free topology fit index as a function of the soft-thresholding powerplot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2], xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n", main =paste("Scale independence"));text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2], labels=powers,cex=cex1,col="red");#Red line corresponds to using an R^2 cut-offabline(h=0.90,col="red")#Mean connectivity as a function of the soft-thresholding powerplot(sft$fitIndices[,1], sft$fitIndices[,5], xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n", main =paste("Mean connectivity"))text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")dev.off()
#In R:#Convert labels to colors for plottingmergedColors =labels2colors(net$colors)table(mergedColors)# black blue brown cyan darkgreen # 175 355 305 84 44 # darkgrey darkorange darkred darkturquoise green # 40 38 47 41 270 # greenyellow grey grey60 lightcyan lightgreen # 101 246 67 77 62 # lightyellow magenta midnightblue orange pink # 61 124 84 40 168 # purple red royalblue salmon tan # 108 241 60 87 88 # turquoise white yellow # 1671 37 279 #Plot the dendrogram and the module colors underneathpdf(file="/home/bioc/module_visualization.pdf",width=9, height=5)plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]],"Module colors", dendroLabels =FALSE, hang =0.03, addGuide =TRUE, guideHang =0.05)## assign all of the gene to their corresponding module ## hclust for the genes.dev.off()
3g) Quantify module similarity by eigengene correlation
#In R:#Quantify module similarity by eigengene correlation#Eigengene: One of a set of right singular vectors of a gene's x samples matrix that tabulates, e.g., the mRNA or gene expression of the genes across the samples.
#Recalculate module eigengenesmoduleColors =labels2colors(net$colors)MEs =moduleEigengenes(datExpr, moduleColors)$eigengenes#Add the weight to existing module eigengenesMET =orderMEs(MEs)#Plot the relationships between the eigengenes and the traitpdf(file="/home/bioc/eigengenes_trait_relationship.pdf",width=7, height=9)par(cex =0.9)plotEigengeneNetworks(MET,"", marDendro=c(0,4,1,2), marHeatmap=c(3,4,1,2), cex.lab=0.8, xLabelsAngle=90)dev.off()
The top part of this plot represents the eigengene dendrogram and the lower part of this plot represents the eigengene adjacency heatmap.
3h) Find the relationships between modules and traits
模块与性状之间的关系
#In R:#Plot the relationships between modules and traitsdesign =model.matrix(~0+ datTraits$subtype)colnames(design) = levels(datTraits$subtype)moduleColors =labels2colors(net$colors)nGenes =ncol(datExpr)nSamples =nrow(datExpr)# Recalculate MEs with color labelsMEs0 =moduleEigengenes(datExpr, moduleColors)$eigengenesMEs =orderMEs(MEs0)moduleTraitCor =cor(MEs, design, use ="p")moduleTraitPvalue =corPvalueStudent(moduleTraitCor, nSamples)pdf(file="/home/bioc/module_trait_relationship.pdf",width=9, height=10)#Display the correlations and their p-valuestextMatrix =paste(signif(moduleTraitCor, 2), "\n(",signif(moduleTraitPvalue, 1), ")", sep ="")dim(textMatrix) = dim(moduleTraitCor)par(mar =c(6, 8.5, 3, 3))#Display the correlation values within a heatmap plotlabeledHeatmap(Matrix = moduleTraitCor, xLabels =colnames(design), yLabels =names(MEs), ySymbols =names(MEs), colorLabels =FALSE, colors =blueWhiteRed(50), textMatrix = textMatrix, setStdMargins =FALSE, cex.text =0.6, zlim =c(-1,1), main =paste("Module-trait relationships"))dev.off()
We choose the "brown" module in trait “Luminal” for further analyses.
3i.1) Intramodular connectivity, module membership, and screening for intramodular hub genes
#In R:#Intramodular connectivity, module membership, and screening for intramodular hub genes#Intramodular connectivityconnet=abs(cor(datExpr,use="p"))^6Alldegrees1=intramodularConnectivity(connet, moduleColors)#Relationship between gene significance and intramodular connectivitywhich.module="brown"Luminal=as.data.frame(design[,3])names(Luminal) = "Luminal"GS1 =as.numeric(cor(datExpr, Luminal, use ="p"))GeneSignificance=abs(GS1)#Generalizing intramodular connectivity for all genes on the arraydatKME=signedKME(datExpr, MEs, outputColumnName="MM.")#Display the first few rows of the data frame#Finding genes with high gene significance and high intramodular connectivity in specific modules#abs(GS1)>.8 #adjust parameter based on actual situations#abs(datKME$MM.black)>.8 #at least larger than 0.8FilterGenes=abs(GS1)>0.8&abs(datKME$MM.brown)>0.8table(FilterGenes)#FilterGenes#FALSE TRUE #4997 3#find 3 hub geneshubgenes <-rownames(datKME)[FilterGenes]hubgenes#[1] "ENSG00000124664" "ENSG00000129514" "ENSG00000143578"
3i.2) Export the network
#In R:#Export the network#Recalculate topological overlapTOM =TOMsimilarityFromExpr(datExpr, power =6)# Select modulemodule ="brown"# Select module probesprobes =colnames(datExpr)inModule = (moduleColors==module)modProbes = probes[inModule] # Select the corresponding Topological OverlapmodTOM = TOM[inModule, inModule]dimnames(modTOM) = list(modProbes, modProbes)#Export the network into edge and node list files Cytoscape can read#default threshold = 0.5, we could adjust parameter based on actual situations or in Cytoscapecyt =exportNetworkToCytoscape(modTOM, edgeFile =paste("CytoscapeInput-edges-", paste(module, collapse="-"), ".txt", sep=""), nodeFile =paste("CytoscapeInput-nodes-", paste(module, collapse="-"), ".txt", sep=""), weighted =TRUE, threshold =0.02, nodeNames = modProbes, nodeAttr = moduleColors[inModule])#Screen the top genesnTop =10IMConn =softConnectivity(datExpr[, modProbes])top = (rank(-IMConn) <= nTop)filter <- modTOM[top, top]cyt =exportNetworkToCytoscape(filter, edgeFile =paste("CytoscapeInput-edges-filter-", paste(module, collapse="-"), ".txt", sep=""), nodeFile =paste("CytoscapeInput-nodes-filter-", paste(module, collapse="-"), ".txt", sep=""), weighted =TRUE, threshold =0.02, nodeNames =rownames(filter), nodeAttr = moduleColors[inModule][1:nTop])
This plot is visualized by Cytoscape.
The output files look like:
#Bash Shell command:headCytoscapeInput-edges-filter-brown.txt#fromNode toNode weight direction fromAltName toAltName#ENSG00000108639 ENSG00000124664 0.0597956525080059 undirected NA NA#ENSG00000108639 ENSG00000214530 0.0618251730361535 undirected NA NA#ENSG00000108639 ENSG00000129514 0.0339542369643292 undirected NA NA#ENSG00000108639 ENSG00000065361 0.0360849044869507 undirected NA NA##Bash Shell command:headCytoscapeInput-nodes-filter-brown.txt#nodeName altName nodeAttr.nodesPresent...#ENSG00000108639 NA brown#ENSG00000124664 NA brown#ENSG00000214530 NA brown#ENSG00000129514 NA brown
3i.3) Extract gene IDs in specific module
#In R:#Extract gene IDs in specific module#Select modulemodule ="brown"#Select module probes (gene ID)probes =colnames(datExpr)inModule = (moduleColors == module)modProbes = probes[inModule]write.table(modProbes,file="/home/bioc/geneID_brown.txt",sep="\t",quote=F,row.names=F,col.names=F)
We could use the gene ID list for GO/KEGG analysis.
4) Homework
Input data: homework_FemaleLiver-01-dataInput.RData (文件位于该链接中Files needed by this Tutorial中的清华云Bioinformatics Tutorial / Files路径下的相应文件夹中). It contains 134 samples, 3600 genes; each row represents a sample, each column represents a gene.
Please try to construct an automatic network and detect module (choosing the soft-thresholding power, one-step network construction and module detection) from the homework_FemaleLiver-01-dataInput.RData.
Please plot two figures:
a. Analysis of network topology(Scale independence, Mean connectivity) for various soft-thresholding powers.
b. The dendrogram and the module colors underneath.