Bioinformatics Tutorial
Files Needed
  • Getting Started
    • Setup
    • Run jobs in a Docker
    • Run jobs in a cluster [Advanced]
  • Part I. Programming Skills
    • 1.Linux
      • 1.1.Basic Command
      • 1.2.Practice Guide
      • 1.3.Linux Bash
    • 2.R
      • 2.1.R Basics
      • 2.2.Plot with R
    • 3.Python
  • PART II. BASIC ANALYSES
    • 1.Blast
    • 2.Conservation Analysis
    • 3.Function Analysis
      • 3.1.GO
      • 3.2.KEGG
      • 3.3.GSEA
    • 4.Clinical Analyses
      • 4.1.Survival Analysis
  • Part III. NGS DATA ANALYSES
    • 1.Mapping
      • 1.1 Genome Browser
      • 1.2 bedtools and samtools
    • 2.RNA-seq
      • 2.1.Expression Matrix
      • 2.2.Differential Expression with Cufflinks
      • 2.3.Differential Expression with DEseq2 and edgeR
    • 3.ChIP-seq
    • 4.Motif
      • 4.1.Sequence Motif
      • 4.2.Structure Motif
    • 5.RNA Network
      • 5.1.Co-expression Network
      • 5.2.miRNA Targets
      • 5.3. CLIP-seq (RNA-Protein Interaction)
    • 6.RNA Regulation - I
      • 6.1.Alternative Splicing
      • 6.2.APA (Alternative Polyadenylation)
      • 6.3.Chimeric RNA
      • 6.4.RNA Editing
      • 6.5.SNV/INDEL
    • 7.RNA Regulation - II
      • 7.1.Translation: Ribo-seq
      • 7.2.RNA Structure
    • 8.cfDNA
      • 8.1.Basic cfDNA-seq Analyses
  • Part IV. MACHINE LEARNING
    • 1.Machine Learning Basics
      • 1.1 Data Pre-processing
      • 1.2 Data Visualization & Dimension Reduction
      • 1.3 Feature Extraction and Selection
      • 1.4 Machine Learning Classifiers/Models
      • 1.5 Performance Evaluation
    • 2.Machine Learning with R
    • 3.Machine Learning with Python
  • Part V. Assignments
    • 1.Precision Medicine - exSEEK
      • Help
      • Archive: Version 2018
        • 1.1.Data Introduction
        • 1.2.Requirement
        • 1.3.Helps
    • 2.RNA Regulation - RiboShape
      • 2.0.Programming Tools
      • 2.1.RNA-seq Analysis
      • 2.2.Ribo-seq Analysis
      • 2.3.SHAPE Data Analysis
      • 2.4.Integration
    • 3.RNA Regulation - dsRNA
    • 4.Single Cell Data Analysis
      • Help
  • 5.Model Programming
  • Appendix
    • Appendix I. Keep Learning
    • Appendix II. Databases & Servers
    • Appendix III. How to Backup
    • Appendix IV. Teaching Materials
    • Appendix V. Software and Tools
    • Appendix VI. Genome Annotations
Powered by GitBook
On this page
  • 1) 数据说明
  • 2) Load R packages
  • 3) 数据预处理
  • 4) 数据集划分
  • 5) 特征选择
  • 6) 调参和模型拟合
  • 7)模型性能评估
  • 8) Homework
  • 8.1)
  • 8.2)
  • 9) More reading

Was this helpful?

Edit on GitHub
  1. Part IV. MACHINE LEARNING

2.Machine Learning with R

Previous1.5 Performance EvaluationNext3.Machine Learning with Python

Last updated 1 year ago

Was this helpful?

  • 在本节中,我们将利用glmnet package,caret和pROC三个R package来实现一个简单的二分类任务。

1) 数据说明

我们这里用到的还是BreastCancer数据集。R的mlbench package提供了该数据集,大家也可以从下载。

  1. 文件包括11个列,第1列为样本编号,第2-10列为特征,11列为标签(benign为良性,malignant为恶性)

  2. 数据集有458个良性(benign)样本和241个恶性(malignant)样本

  3. 数据集有Cl.thickness,Cell.size,Cell.shape,Marg.adhesion,Epith.c.size,Bare.nuclei,Bl.cromatin,Normal.nucleoli,Mitoses9个特征,均为取值在0-10之间的整数。

2) Load R packages

library(glmnet)
library(caret)
library(pROC)
library(mlbench)
  • glmnet: 实现了一系列带正则化的广义线性模型,我们用到的logistic regression对应binomial glm的情形

  • caret: 对多种分类器进行了封装,提供统一的接口,为机器学习中的调参,特征选择等常见的任务提供了便于使用的接口

  • pROC: 实现了一系列用于性能评估的常用函数

  • mlbench: 提供了一些常用的toy dataset,包括我们用到的BreastCancer

3) 数据预处理

  • 我们这里用均值补全缺失值,用Z-score对特征进行scaling

data(BreastCancer) # load data
y <- BreastCancer$Class 
x <- BreastCancer[,2:10]
x <- apply(x,2,as.numeric)
feature.mean <- colMeans(x,na.rm = T)
# fill missing value with average value of this feature 
x[is.na(x)] <- matrix(rep(feature.mean,each=length(y)), nrow=length(y))[is.na(x)]
# Z score scaling
x <- scale(x,center = TRUE,scale = TRUE)

4) 数据集划分

我们用caret提供的createDataPartition方法划分训练集和测试集:

set.seed(666) # 固定random seed保证结果可重复
# 80% data for training, 20% for performance evaluation
train.indices <- createDataPartition(y,p=0.8,times = 1,list=T)$Resample1
x.train <- x[train.indices,]
x.test <- x[ -train.indices,]
y.train <- y[train.indices]
y.test <- y[ -train.indices]

5) 特征选择

  • 我们这里使用caret实现的递归特征消除(recursive feature elimination)方法

  • 我们先考虑一个基于随机森林的方法:

# twoClassSummary是caret定义的一个函数,用于二分类的性能评估
# rfFuncs是rfFuncs是caret自带的一个S3 list,用于存储基于随机森林做特征选择相关的一些函数
# 我们这里把twoClassSummary用作模型评估,其他保留默认设置
rfFuncs$summary <- twoClassSummary 

# caret中可以通过通过rfeControl函数定义一个list(这里的rfectrl),用于控制RFE的行为
# method指的是性能评估的方式,可选参数有"boot", "cv", "LOOCV", "repeatedcv"
# 我们这里使用默认的boot,即通过bootstrapping(有放回的抽样)进行性能评估
# number控制了cv的fold和bootstrapping的重复次数
rfectrl <- rfeControl(functions=rfFuncs,
                      verbose = TRUE,
                      method="boot",number=10)
rfe.results <- rfe(x.train,y.train, 
               sizes=2:9, 
               rfeControl=rfectrl,
               metric = "ROC")
selected.features <- predictors(rfe.results)
selected.features
# "Bare.nuclei"  "Cl.thickness" "Cell.size"    "Bl.cromatin" 
  • 可见RFE最终保留了4个特征

  • 我们还可以用rfe.results$results打印出交叉验证的结果

  • 如果把rfFuncs参数换成lrFuncs,我们就可以进行基于logistic regression的特征选择:

lrFuncs$summary <- twoClassSummary 
rfectrl <- rfeControl(functions=lrFuncs,
                      verbose = TRUE,
                      method="boot",number=10)
rfe.results <- rfe(x.train,y.train, 
               sizes=2:9, 
               rfeControl=rfectrl,
               metric = "ROC")
predictors(rfe.results)
# "Bare.nuclei"     "Cl.thickness"    "Bl.cromatin"     "Marg.adhesion"   "Mitoses"         "Normal.nucleoli"

6) 调参和模型拟合

  • 我们使用caret提供的train函数,以glmnet package实现的带正则化的logistic regression为例通过交叉验证调参

params.grid <- expand.grid(alpha = c(0,0.5,1),lambda = c(0,0.01,0.1,1))
# alpha: relative weighting of L1 and L2 regularization
# lambda: degree of regularization, see glmnet documentation for detail

# train是caret封装的一个用于调参的函数
# 和rfeControl类似,trainControl也返回一个列表,用于控制train函数的行为
# method同样可以是"boot","cv"等,number参数控制了交叉验证的折数和bootstraping的重复次数
tr.ctrl <- trainControl(method="cv",
                        number = 5,
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE)
cv.fitted <- train(x.train[,predictors(rfe.results)],y.train,
                   method="glmnet",
                   family="binomial",
                   metric = "ROC",
                   tuneGrid = params.grid,
                   preProcess = NULL,
                   trControl = tr.ctrl )
  • 我们可以查看交叉验证得到的最好的一组超参数组合

cv.fitted$bestTune
#  alpha lambda
#    1    0.01
  • 也可以通过cv.fitted$results属性查看交叉验证的结果等

7)模型性能评估

  • 获取测试集上的预测结果

# caret为不同分类器的预测提供了统一的接口
y.test.pred.prob <- predict(cv.fitted,newdata=x.test,type="prob")
# predict是一个generic function,它根据输入对象的类型判断调用哪一个函数
# caret::train返回的是一个"train"类,所以predict调用的实际上是caret实现的predict.train函数

# 接下来我们用pROC::roc函数产生一个"roc"对象
roc.curve <- roc(y.test,y.test.pred.prob[,2])
plot(roc.curve,print.auc=T)
  • pROC实现了对分类性能指标置信区间的估计:

ci.auc(roc.curve,conf.level = 0.95)
# 95% CI: 0.9893-1 (DeLong) 

ci.coords(roc.curve,x="best",conf.level = 0.95,ret ="recall",best.method="youden",best.policy="random")
# 95% CI (2000 stratified bootstrap replicates):
# threshold recall.low recall.median recall.high
#      best     0.9375        0.9792           1

ci.coords(roc.curve,x="best",conf.level = 0.95,ret ="precision",best.method="youden",best.policy="random")
#95% CI (2000 stratified bootstrap replicates):
# threshold precision.low precision.median precision.high
#      best        0.8571           0.9412              1

 ci.coords(roc.curve,x="best",conf.level = 0.95,ret ="specificity",best.method="youden",best.policy="random")
#95% CI (2000 stratified bootstrap replicates):
# threshold specificity.low specificity.median specificity.high
#      best          0.9121              0.967                1

8) Homework

8.1)

  • 数据预处理

    • 用均值或中位数补全空缺值

    • 对数据进行scaling

  • 数据可视化

    • 用PCA对数据进行可视化

  • 数据集划分:预留20%数据用于评估模型泛化能力,剩下的用于模型拟合

  • 模型选择和模型拟合

    • 任选一种分类器即可

    • 特征选择: 用RFE或其他方式均可,特征数量不限

    • 调参: 根据所选的分类器对相应的超参数进行搜索

  • 模型评估

    • 计算预留数据集上的AUROC

    • 绘制ROC曲线

  • 请提交代码,必要的文字解释和ROC曲线

8.2)

请大家查阅资料,回答以下两个关于随机森林的问题:

  • 随机森林中树的数量是不是一个需要通过交叉验证调整的超参数?为什么?

  • 请问什么是随机森林的out-of-bag (OOB) error?它和bootstrapping有什么关系?

9) More reading

    • randomForest.R: Random Forest

    • logistic_regression.R: Logistic Regression

    • svm.R: SVM

    • plot_result.R: Plot your training and testing performance

  • more R packages for machine learning

通过改变train函数的method参数,我们就可以很容易的拟合其他分类器,caret封装了大量可选的分类和回归模型,请大家参考

我们提供了一个qPCR数据集,第1列为sample id,第2-12列为特征(11个基因的表达量),第13列为样本标签(负例为健康人:NC,正例为肝癌病人:HCC)。请大家用R语言完成以下任务:

其他机器学习模型相关:

Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier etc (142479 downloads)

Recursive Partitioning and Regression Trees. (135390)

A collection of network analysis tools. (122930)

Feed-forward Neural Networks and Multinomial Log-Linear Models. (108298)

Breiman and Cutler's random forests for classification and regression. (105375)

package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. (87151)

Kernel-based Machine Learning Lab. (62064)

Lasso and elastic-net regularized generalized linear models. (56948)

Visualizing the performance of scoring classifiers. (51323)

Generalized Boosted Regression Models. (44760)

A Laboratory for Recursive Partitioning. (43290)

Mining Association Rules and Frequent Itemsets. (39654)

Classification and regression trees. (27882)

Classification and visualization. (27828)

R/Weka interface. (26973)

Improved Predictors. (22358)

Least Angle Regression, Lasso and Forward Stagewise. (19691)

Multivariate Adaptive Regression Spline Models. (15901)

Classification, regression, feature evaluation and ordinal evaluation. (13856)

Model-Based Boosting. (13078)

BreastCancer.csv
http://topepo.github.io/caret/train-models-by-tag.html
qPCR_data.csv
An Introduction to Machine Learning with R
caret documentation
代码
e1071
rpart
igraph
nnet
randomForest
caret
kernlab
glmnet
ROCR
gbm
party
arules
tree
klaR
RWeka
ipred
lars
earth
CORElearn
mboost
ROC curve