3.2.Plot with R
本章我们介绍如何使用 R 进行数据可视化。
R语言对于科学作图提供了强大的支持。在R语言中主要存在两套作图系统,一套是R语言原生的Base图形系统,一套是基于R包grid中实现的图形语法进行作图的一系列工具,后者相对而言更加灵活方便。
上机任务
:
首先,请选择性练习下面各章的 plot 方法。
- 把图片标题设为"Sepal Length Distribution",加粗居中(可使用labs函数和theme函数)
- 把y轴范围设为0.5到7之间(可使用scale_y_continuous函数)
- 三个Species的对应的填充颜色分别设为#C44E52, #55A868和#4C72B0(可使用scale_fill_manual函数)
本方案需要先按照我们上节课介绍的方法配置好R语言和rstudio,并加载一个我们提供的文件:
R markdown是一种markdown文件的扩展,rstudio可以加载R markdown文件,运行R markdown中的R代码,并将输入输出内嵌在文件中进行展示。
- (4) 安装需要的package:
- (5) 打开
.Rmd
文件
用Rstudio打开
all.Rmd
文件, 即可阅读教程,并执行相关代码。如果你更喜欢每个文件仅包含一节的内容(一种 plot 类型),可以先打开index.Rmd
,安装需要的 packages,然后依次打开每一节对应的.Rmd
文件(动画展了第1、2小节对应的1.box-plots.Rmd
和2.violin-plots.Rmd
)
如果你在使用方案一时遇到了问题,也可以用我们提供的 Docker(里面已经预装好了 R 语言和需要的 packages)。
首先进入容器:
docker exec -it bioinfo_tsinghua bash
本章的操作均在
/home/test/plot/
下进行:cd /home/test/plot/
进入容器后,输入
R
回车进入R的交互式环境:R
在实际画图时,依次将下文给出的 R 代码复制到 Terminal 中运行。
- 在R语言中也提供了操作文件系统的函数,例如可以用
dir.create
建立一个新的目录
dir.create('output')
- 用
read.table
函数将表格数据读取到数据框中(上一节中我们已对read.table
函数进行了介绍)
# Read the input files
# “header=T” means that the data has a title, and sep="\t" is used as the separator
data <-read.table("input/box_plots_mtcars.txt",header=T,sep="\t")
df <- data[, c("mpg", "cyl", "wt")]
df2 <-read.table("input/histogram_plots.txt",header=T,sep="\t")
df3 <- read.table("input/volcano_plots.txt", header=T)
df4 <- read.table("input/manhattan_plots_gwasResults.txt",header=T,sep="\t")
df5 <-read.table("input/heatmaps.txt",header=T,sep="\t")
# Covert data into matrix format
# nrow(df5) and ncol(df5) return the number of rows and columns of matrix df5 respectively.
dm <- data.matrix(df5[1:nrow(df5),2:ncol(df5)])
# Get the row names
row.names(dm) <- df5[,1]
df6 <- read.table("input/ballon_plots_GO.txt", header=T, sep="\t")
df7 <- read.table("input/box_plots_David_GO.txt",header=T,sep="\t")
df7 <- df7[1:10,]
Docker 中已经装好所需要的 R 包,如果你是在自己电脑上运行,则需要安装 ggplot2, qqman, gplots, pheatmap, scales, reshape2, RColorBrewer 和 plotrix(使用
install.packages()
, 如 install.packages('ggplot2')
)。library(ggplot2) # R语言中最常用的基于grid的可视化工具
# 另外两个比较常见的作图package
library(gplots)
library(plotrix)
library(qqman) # 用于GWAS数据可视化
library(pheatmap) #用于绘制热图,ComplexHeatmap也是另外一个常用的package
library(scales) # map numeric value to color
library(RColorBrewer) #提供常见的配色方案
# reshape data in R
library(reshape2)
library(plyr)
这里我们介绍保存作图结果的两种方式:
- 1.在作图代码前加上
pdf("path-to-save.pdf")
,代码后加上dev.off()
。这样R语言会将图片保存到路径path-to-save.pdf
中。如果想保存成pdf之外的其他格式,可将pdf()换成png()等相应的函数。这种方式对于原生R语言的作图结果和ggplot2的作图结果都是适用的。以下给出了一个简单的例子:
# 指定输出pdf,路径为output/1.1.Basic_boxplot.pdf,高度宽度均为3
pdf("output/1.1.Basic_boxplot.pdf", height = 3, width = 3)
# ggplot从数据框df中读取作图所需的数据
# aes(x=cyl, y=mpg)告诉ggplot2将数据框中的cyl列作为x轴,mpg列作为y轴
ggplot(df, aes(x=cyl, y=mpg))+ # 加号在ggplot中意思是 在当前的ggplot对象上进行修改
# draw the boxplot and fill it with gray
geom_boxplot(fill="gray")+
# Use the labs function to set the title and modify x and y
labs(title="Plot of mpg per cyl",x="Cyl", y = "Mpg")+
# Set the theme style
theme_classic()
# Save the plot
dev.off()
- 1.使用
ggplot2
中的ggsave
函数,它只适用于保存ggplot2以及基于ggplot2的一些package的作图结果
# Begin to plot
p <- ggplot(df, aes(x=cyl, y=mpg)) +
geom_boxplot(fill="gray")+
labs(title="Plot of mpg per cyl",x="Cyl", y = "Mpg")+
theme_classic()
# Sava as pdf
ggsave("output/1.1.Basic_boxplot.pdf", plot=p, height = 3, width = 3)
完成作图后,可以将作图结果复制到共享目录中,在宿主机上进行查看
- 在箱线图(box plot)中,我们按某个离散变量对数据进行分组展示,即x轴为类别变量,y轴通常为连续变量
# ggplot2通过数据类型是否为factor类型确定一个变量是不是类别变量,用因子的次序确定可视化结果中数据排布的次序
# 所以如果希望作为x轴的变量不是factor类型,需要进行手动转换
df$cyl <- as.factor(df$cyl)
head(df)
### mpg cyl wt
### Mazda RX4 21.0 6 2.620
### Mazda RX4 Wag 21.0 6 2.875
### Datsun 710 22.8 4 2.320
### Hornet 4 Drive 21.4 6 3.215
### Hornet Sportabout 18.7 8 3.440
### Valiant 18.1 6 3.460
ggplot(df, aes(x=cyl, y=mpg)) +
geom_boxplot(fill="gray")+
labs(title="Plot of mpg per cyl",x="Cyl", y = "Mpg")+
theme_classic()

ggplot(df, aes(x=cyl, y=mpg, fill=cyl)) + # fill=cyl: 用颜色表示cyl一列的数值
geom_boxplot()+
labs(title="Plot of mpg per cyl",x="Cyl", y = "Mpg") +
scale_fill_brewer(palette="Blues") + # palette="Blues": 定义了一种数值到颜色的对应关系,数值越大蓝色的颜色越深
theme_bw()

- lattice和ggplot2一样,也是一个比较常用的package,大家有兴趣可自行了解
#Read the data table
data=read.csv("boxplot_example.csv")
###################
#I.Prepare the data
#1.Normalize the data, etc
for (i in 12:17){
data[,i]=log(data[,i]+1e-3) # log some expression values
}
for (i in 9:17) {
maxValue=max(data[,i]) #scale the data into 0-1
minValue=min(data[,i])
range=maxValue-minValue
data[,i]=(data[,i]-minValue)/range
}
data$X8.Identity=data$X8.Identity/100
#2.Make the new matrix for boxplot: cleaning the data table
library("reshape")
m=melt(data[,c(2,7:12,14:17)], id=1)# remove some columns not to show and reshape the matrix into 3 columns for boxplot drawing in bwplot
colnames(m)=c("Type","Feature","Normalized_Value") #define the new column names
#3.Clean the names of each type and each feature
#Merge sub-types of different elements
m[,1]=sub ("ncRNA_selected","RNAI", m[,1])
m[,1]=sub ("ncRNA_3019","RNAII", m[,1])
m[,1]=sub ("exon_CCDS","CDS", m[,1])
m[,1]=sub ("five_prime_UTR","UTR", m[,1])
m[,1]=sub ("three_prime_UTR","UTR", m[,1])
m[,1]=sub ("ancestral_repeat","AP", m[,1])
#Rename the feature
m[,2]=sub('X7.GC','01.GC Content',m[,2])
m[,2]=sub('X8.Identity','02.DNA Conservation',m[,2])
m[,2]=sub('X9.z_score','03.RNA Struc. Free Energy',m[,2])
m[,2]=sub('X10.SCI','04.RNA Struc. Cons.',m[,2])
m[,2]=sub('X11.tblastx_score','05.Protein Conservation',m[,2])
m[,2]=sub('X12.polyA_RNAseq_MAX','06.PolyA+ RNA-seq',m[,2])
m[,2]=sub('X14.small_RNAseq_MAX','07.Small RNA-seq',m[,2])
m[,2]=sub('X15.Array_totalRNA_MAX','08.Total RNA Array',m[,2])
m[,2]=sub('X16.Array_polyA_MAX','09.PolyA+ RNA Array',m[,2])
m[,2]=sub('X17.Array_nonpolyA_MAX','10.PolyA- RNA Array',m[,2])
###########################
#Making Boxplot
library("lattice")
png("boxplot.png",width=1500,height=500) # pdf is recommended for most cases, or png for figure with huge amount of data points
#pdf("boxplot.pdf")
attach(m)
bwplot(Normalized_Value ~ Type|Feature,fill=c("green","red","yellow","blue","light blue"),layout=c(10,1))
dev.off()

- ggplot2支持很多个性化的配置,可以进行非常复杂的可视化
- 基于这样的package,可以用少量代码实现比较复杂的功能,大家可以根据具体的需求选择使用
geom_boxplot
: 作箱线图geom_point
: 展示出每个点的数 值(对类别变量x轴的位置引入一定的随机性,避免点的重合,方便展示y轴每个点的分布)scale_fill_brewer
: 使用RColorBrewer的配色theme
: 对各种各样的属性进行配置,可结合具体需求进行调整panel.grid=element_blank()
: 不绘制网格panel.border=element_blank()
: 不添加边框axis.line = element_line(size=1, colour = "black")
: 设置坐标轴颜色和粗细legend.title = element_text(face="bold", color="black",family = "Arial", size=24)
:设置图注标题属性,文本格式都可以通过element_text
函数设置- ...
stat_compare_means
: ggpubr提供的函数,用于标注统计显著性,输入为需要进行的两两比较列表labs
: 设置坐标轴标题等
library(ggplot2)
library(ggpubr)
data(iris)
print(levels(iris$Species))
comparisons <- list(c("versicolor","setosa"),c("virginica","versicolor"),c("virginica","setosa"))
ggplot(iris,aes(x=Species,y=Sepal.Length,fill=Species))+geom_boxplot(alpha = 1, size = 1, position = position_dodge(1.1),outlier.size=0,outlier.alpha = 0)+
geom_point(size = 1, position = position_jitterdodge(dodge.width=0.3,jitter.width = 0.3))+
scale_fill_brewer(palette="Blues") +
theme_bw()+
theme(legend.position="right",
panel.grid=element_blank(),
panel.border=element_blank(),
axis.line = element_line(size=1, colour = "black"),
legend.title = element_text(face="bold", color="black",family = "Arial", size=24),
legend.text= element_text(face="bold", color="black",family = "Arial", size=24),
plot.title = element_text(hjust = 0.5,size=24,face="bold"),
axis.text.x = element_text(face="bold", color="black", size=24,angle = 45,hjust = 1),
axis.text.y = element_text(face="bold", color="black", size=24),
axis.title.x = element_text(face="bold", color="black", size=24),
axis.title.y = element_text(face="bold",color="black", size=24))+
stat_compare_means(comparisons = comparisons,
method = "wilcox.test",
method.args = list(alternative = "greater"),
label = "p.signif"
)+labs(x="",title="Boxplot and statistical test", face="bold")

和箱线图一样,Violin plots 中横轴为类别变量,纵轴为连续变量
df$cyl <- as.factor(df$cyl)
head(df)
### mpg cyl wt
### Mazda RX4 21.0 6 2.620
### Mazda RX4 Wag 21.0 6 2.875
### Datsun 710 22.8 4 2.320
### Hornet 4 Drive 21.4 6 3.215
### Hornet Sportabout 18.7 8 3.440
### Valiant 18.1 6 3.460
ggplot(df, aes(x=cyl, y=mpg)) +
geom_violin(trim=FALSE) +
labs(title="Plot of mpg per cyl", x="Cyl", y = "Mpg")

ggplot(df, aes(x=cyl, y=mpg)) +
geom_violin(trim=FALSE) +
labs(title="Plot of mpg per cyl", x="Cyl", y = "Mpg") +
stat_summary(fun.y=mean, geom="point", shape=23, size=2, color="red")

or
ggplot(df, aes(x=cyl, y=mpg)) +
geom_violin(trim=FALSE) +
labs(title="Plot of mpg per cyl", x="Cyl", y = "Mpg") +
geom_boxplot(width=0.1)

ggplot(df, aes(x=cyl, y=mpg)) +
geom_violin(trim=FALSE) +
labs(title="Plot of mpg per cyl", x="Cyl", y = "Mpg") +
stat_summary(fun.data="mean_sdl", fun.args = list(mult = 1), geom="crossbar", width=0.1 )

or
ggplot(df, aes(x=cyl, y=mpg)) +
geom_violin(trim=FALSE) +
labs(title="Plot of mpg per cyl", x="Cyl", y = "Mpg") +
stat_summary(fun.data=mean_sdl, fun.args = list(mult = 1), geom="pointrange", color="red")

ggplot(df, aes(x=cyl, y=mpg, fill=cyl)) +
geom_violin(trim=FALSE) +
geom_boxplot(width=0.1, fill="white") +
labs(title="Plot of mpg per cyl", x="Cyl", y = "Mpg") +
scale_fill_brewer(palette="Blues") +
theme_classic()

head(df2)
### sex weight
### 1 F 49
### 2 F 56
### 3 F 60
### 4 F 43
### 5 F 57
### 6 F 58
ggplot(df2, aes(x=weight)) + geom_histogram(binwidth=1)

ggplot(df2, aes(x=weight)) +
geom_histogram(binwidth=1, color="black", fill="white") +
geom_vline(aes(xintercept=mean(weight)),color="black", linetype="dashed", size=0.5)

##Use the plyr package to calculate the average weight of each group :
mu <- ddply(df2, "sex", summarise, grp.mean=mean(weight))
head(mu)
### sex grp.mean
### 1 F 54.70
### 2 M 65.36
##draw the plot
ggplot(df2, aes(x=weight, color=sex)) +
geom_histogram(binwidth=1, fill="white", position="dodge")+
geom_vline(data=mu, aes(xintercept=grp.mean, color=sex), linetype="dashed") +
scale_color_brewer(palette="Paired") +
theme_classic()+
theme(legend.position="top")

head(df2)
### sex weight
### 1 F 49
### 2 F 56
### 3 F 60
### 4 F 43
### 5 F 57
### 6 F 58
ggplot(df2, aes(x=weight)) +
geom_density()

ggplot(df2, aes(x=weight)) +
geom_density() +
geom_vline(aes(xintercept=mean(weight)), color="black", linetype="dashed", size=0.5)

##Use the plyr package plyr to calculate the average weight of each group :
mu <- ddply(df2, "sex", summarise, grp.mean=mean(weight))
head(mu)
### sex grp.mean
### 1 F 54.70
### 2 M 65.36
draw the plot
ggplot(df2, aes(x=weight, fill=sex)) +
geom_density(alpha=0.7)+
geom_vline(data=mu, aes(xintercept=grp.mean, color=sex), linetype="dashed")+
labs(title="Weight density curve",x="Weight(kg)", y = "Density") +
scale_color_brewer(palette="Paired") +
scale_fill_brewer(palette="Blues") +
theme_classic()

ggplot(df2, aes(x=weight, color=sex)) +
geom_density()+
geom_vline(data=mu, aes(xintercept=grp.mean, color=sex), linetype="dashed")+
labs(title="Weight density curve",x="Weight(kg)", y = "Density") +
scale_color_brewer(palette="Paired") +
theme_classic()

ggplot(df2, aes(x=weight, color=sex, fill=sex)) +
geom_histogram(binwidth=1, aes(y=..density..), alpha=0.5, position="identity") +
geom_density(alpha=.2) +
labs(title="Weight density curve",x="Weight(kg)", y = "Density") +
scale_color_brewer(palette="Paired") +
scale_fill_brewer(palette="Blues") +
theme_classic()

df$cyl <- as.factor(df$cyl) #我们这里同样希望ggplot2把x轴当作类别变量
head(df)
### mpg cyl wt
### Mazda RX4 21.0 6 2.620
### Mazda RX4 Wag 21.0 6 2.875
### Datsun 710 22.8 4 2.320
### Hornet 4 Drive 21.4 6 3.215
### Hornet Sportabout 18.7 8 3.440
### Valiant 18.1 6 3.460
ggplot(df, aes(x=cyl, y=mpg)) +
geom_dotplot(binaxis='y', stackdir='center', binwidth=1)

ggplot(df, aes(x=cyl, y=mpg)) +
geom_dotplot(binaxis='y', stackdir='center', binwidth=1) +
stat_summary(fun.data="mean_sdl", fun.args = list(mult=1), geom="crossbar", width=0.5)

or
ggplot(df, aes(x=cyl, y=mpg)) +
geom_dotplot(binaxis='y', stackdir='center', binwidth=1) +
stat_summary(fun.data="mean_sdl", fun.args = list(mult=1), geom="pointrange", color="red")

ggplot(df, aes(x=cyl, y=mpg, fill=cyl, shape=cyl)) +
geom_dotplot(binaxis='y', stackdir='center', binwidth=1, dotsize=0.8) +
labs(title="Plot of mpg per cyl",x="Cyl", y = "Mpg") +
#stat_summary(fun.data="mean_sdl", fun.args = list(mult=1), geom="crossbar", width=0.5) +
scale_fill_brewer(palette="Blues") +
#scale_color_brewer(palette="Blues") +
theme_classic()

ggplot(df, aes(x=cyl, y=mpg, color=cyl, shape=cyl)) +
geom_jitter(position=position_jitter(0.1), cex=2)+
labs(title="Plot of mpg per cyl",x="Cyl", y = "Mpg") +
scale_color_brewer(palette="Blues") +
theme_classic()

df$cyl <- as.factor(df$cyl)
head(df)
### mpg cyl wt
### Mazda RX4 21.0 6 2.620
### Mazda RX4 Wag 21.0 6 2.875
### Datsun 710 22.8 4 2.320
### Hornet 4 Drive 21.4 6 3.215
### Hornet Sportabout 18.7 8 3.440
### Valiant 18.1 6 3.460
ggplot(df, aes(x=wt, y=mpg)) +
geom_point(size=1.5)

ggplot(df, aes(x=wt, y=mpg, color=cyl, shape=cyl)) +
geom_point(size=1.5) +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE) +
theme_classic()

data(cars)
ggscatter(cars, x = "speed", y = "dist",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.coeff.args = list(method = "spearman", label.x = 15, label.y = 0.05,label.sep = "\n",size = 8))+
theme(legend.position="right",
panel.grid=element_blank(),
legend.title = element_text(face="bold", color="black",family = "Arial", size=20),
legend.text= element_text(face="bold", color="black",family = "Arial", size=20),
plot.title = element_text(hjust = 0.5,size=24,face="bold"),
axis.text.x = element_text(face="bold", color="black", size=20),
axis.text.y = element_text(face="bold", color="black", size=20),
axis.title.x = element_text(face="bold", color="black", size=24),
axis.title.y = element_text(face="bold",color="black", size=24))

data(iris)
library(Hmisc)
library(corrplot)
res2 <- rcorr(as.matrix(iris[c("Sepal.Width","Petal.Length","Petal.Width")]))
corrplot(corr = res2$r,tl.col="black",type="lower", order="original",tl.pos = "d",tl.cex=1.2,
p.mat = res2$P, sig.level = 0.05,insig = "blank")

head(df3)
### Gene log2FoldChange pvalue padj
### 1 DOK6 0.5100 1.861e-08 0.0003053
### 2 TBX5 -2.1290 5.655e-08 0.0004191
### 3 SLC32A1 0.9003 7.664e-08 0.0004191
### 4 IFITM1 -1.6870 3.735e-06 0.0068090
### 5 NUP93 0.3659 3.373e-06 0.0068090
### 6 EMILIN2 1.5340 2.976e-06 0.0068090
# 把基因归为上调,不变,下调三类,用因子表示,放在threshold一列,用于定义颜色
df3$threshold <- as.factor(ifelse(df3$padj < 0.05 & abs(df3$log2FoldChange) >=1,ifelse(df3$log2FoldChange > 1 ,'Up','Down'),'Not'))
ggplot(data=df3, aes(x=log2FoldChange, y =-log10(padj), color=threshold,fill=threshold)) +
scale_color_manual(values=c("blue", "grey","red"))+ #手动指定三类基因的颜色
geom_point(size=1) +
xlim(c(-3, 3)) +
theme_bw(base_size = 12, base_family = "Times") +
geom_vline(xintercept=c(-1,1),lty=4,col="grey"