# 2.Machine Learning with R

* 在本节中，我们将利用glmnet package，caret和pROC三个R package来实现一个简单的二分类任务。

## 1) 数据说明

我们这里用到的还是BreastCancer数据集。R的mlbench package提供了该数据集，大家也可以从[BreastCancer.csv](https://cloud.tsinghua.edu.cn/lib/1e25eb24-ecb1-4efe-be8d-13dac2f68b91/file/Files/PART_IV/BreastCancer.csv)下载。

1. 文件包括11个列，第1列为样本编号，第2-10列为特征，11列为标签（benign为良性,malignant为恶性)
2. 数据集有458个良性(benign)样本和241个恶性(malignant)样本
3. 数据集有Cl.thickness，Cell.size，Cell.shape，Marg.adhesion，Epith.c.size，Bare.nuclei，Bl.cromatin，Normal.nucleoli，Mitoses9个特征，均为取值在0-10之间的整数。

## 2) Load R packages

```r
library(glmnet)
library(caret)
library(pROC)
library(mlbench)
```

* glmnet: 实现了一系列带正则化的广义线性模型，我们用到的logistic regression对应binomial glm的情形
* caret: 对多种分类器进行了封装，提供统一的接口，为机器学习中的调参，特征选择等常见的任务提供了便于使用的接口
* pROC: 实现了一系列用于性能评估的常用函数
* mlbench: 提供了一些常用的toy dataset，包括我们用到的BreastCancer

## 3) 数据预处理

* 我们这里用均值补全缺失值，用Z-score对特征进行scaling

```r
data(BreastCancer) # load data
y <- BreastCancer$Class 
x <- BreastCancer[,2:10]
x <- apply(x,2,as.numeric)
feature.mean <- colMeans(x,na.rm = T)
# fill missing value with average value of this feature 
x[is.na(x)] <- matrix(rep(feature.mean,each=length(y)), nrow=length(y))[is.na(x)]
# Z score scaling
x <- scale(x,center = TRUE,scale = TRUE)
```

## 4) 数据集划分

我们用caret提供的`createDataPartition`方法划分训练集和测试集:

```r
set.seed(666) # 固定random seed保证结果可重复
# 80% data for training, 20% for performance evaluation
train.indices <- createDataPartition(y,p=0.8,times = 1,list=T)$Resample1
x.train <- x[train.indices,]
x.test <- x[ -train.indices,]
y.train <- y[train.indices]
y.test <- y[ -train.indices]
```

## 5) 特征选择

* 我们这里使用caret实现的递归特征消除(recursive feature elimination)方法
* 我们先考虑一个基于随机森林的方法:

```r
# twoClassSummary是caret定义的一个函数，用于二分类的性能评估
# rfFuncs是rfFuncs是caret自带的一个S3 list，用于存储基于随机森林做特征选择相关的一些函数
# 我们这里把twoClassSummary用作模型评估，其他保留默认设置
rfFuncs$summary <- twoClassSummary 

# caret中可以通过通过rfeControl函数定义一个list(这里的rfectrl)，用于控制RFE的行为
# method指的是性能评估的方式，可选参数有"boot", "cv", "LOOCV", "repeatedcv"
# 我们这里使用默认的boot，即通过bootstrapping(有放回的抽样)进行性能评估
# number控制了cv的fold和bootstrapping的重复次数
rfectrl <- rfeControl(functions=rfFuncs,
                      verbose = TRUE,
                      method="boot",number=10)
rfe.results <- rfe(x.train,y.train, 
               sizes=2:9, 
               rfeControl=rfectrl,
               metric = "ROC")
selected.features <- predictors(rfe.results)
selected.features
# "Bare.nuclei"  "Cl.thickness" "Cell.size"    "Bl.cromatin" 
```

* 可见RFE最终保留了4个特征
* 我们还可以用`rfe.results$results`打印出交叉验证的结果
* 如果把`rfFuncs`参数换成`lrFuncs`，我们就可以进行基于logistic regression的特征选择:

```r
lrFuncs$summary <- twoClassSummary 
rfectrl <- rfeControl(functions=lrFuncs,
                      verbose = TRUE,
                      method="boot",number=10)
rfe.results <- rfe(x.train,y.train, 
               sizes=2:9, 
               rfeControl=rfectrl,
               metric = "ROC")
predictors(rfe.results)
# "Bare.nuclei"     "Cl.thickness"    "Bl.cromatin"     "Marg.adhesion"   "Mitoses"         "Normal.nucleoli"
```

## 6) 调参和模型拟合

* 我们使用caret提供的train函数，以glmnet package实现的带正则化的logistic regression为例通过交叉验证调参

```r
params.grid <- expand.grid(alpha = c(0,0.5,1),lambda = c(0,0.01,0.1,1))
# alpha: relative weighting of L1 and L2 regularization
# lambda: degree of regularization, see glmnet documentation for detail

# train是caret封装的一个用于调参的函数
# 和rfeControl类似，trainControl也返回一个列表，用于控制train函数的行为
# method同样可以是"boot","cv"等，number参数控制了交叉验证的折数和bootstraping的重复次数
tr.ctrl <- trainControl(method="cv",
                        number = 5,
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE)
cv.fitted <- train(x.train[,predictors(rfe.results)],y.train,
                   method="glmnet",
                   family="binomial",
                   metric = "ROC",
                   tuneGrid = params.grid,
                   preProcess = NULL,
                   trControl = tr.ctrl )
```

* 我们可以查看交叉验证得到的最好的一组超参数组合

```r
cv.fitted$bestTune
#  alpha lambda
#    1    0.01
```

* 也可以通过`cv.fitted$results`属性查看交叉验证的结果等
* 通过改变`train`函数的`method`参数，我们就可以很容易的拟合其他分类器，caret封装了大量可选的分类和回归模型，请大家参考<http://topepo.github.io/caret/train-models-by-tag.html>

## 7)模型性能评估

* 获取测试集上的预测结果

```r
# caret为不同分类器的预测提供了统一的接口
y.test.pred.prob <- predict(cv.fitted,newdata=x.test,type="prob")
# predict是一个generic function，它根据输入对象的类型判断调用哪一个函数
# caret::train返回的是一个"train"类，所以predict调用的实际上是caret实现的predict.train函数

#　接下来我们用pROC::roc函数产生一个"roc"对象
roc.curve <- roc(y.test,y.test.pred.prob[,2])
plot(roc.curve,print.auc=T)
```

![ROC curve](https://4115668567-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LPVsf5VZbQ7h14X29qW%2Fuploads%2Fgit-blob-e3f9a473a10e76158a97aeff83e2540023f1215e%2Fiv.2.ml.with.r.pROC.plot.png?alt=media)

* pROC实现了对分类性能指标置信区间的估计:

```r
ci.auc(roc.curve,conf.level = 0.95)
# 95% CI: 0.9893-1 (DeLong) 

ci.coords(roc.curve,x="best",conf.level = 0.95,ret ="recall",best.method="youden",best.policy="random")
# 95% CI (2000 stratified bootstrap replicates):
# threshold recall.low recall.median recall.high
#      best     0.9375        0.9792           1

ci.coords(roc.curve,x="best",conf.level = 0.95,ret ="precision",best.method="youden",best.policy="random")
#95% CI (2000 stratified bootstrap replicates):
# threshold precision.low precision.median precision.high
#      best        0.8571           0.9412              1

 ci.coords(roc.curve,x="best",conf.level = 0.95,ret ="specificity",best.method="youden",best.policy="random")
#95% CI (2000 stratified bootstrap replicates):
# threshold specificity.low specificity.median specificity.high
#      best          0.9121              0.967                1
```

## 8) Homework

### 8.1)

我们提供了一个qPCR数据集[qPCR\_data.csv](https://cloud.tsinghua.edu.cn/d/ad22768345664924b202/?p=%2FFiles%2FPART_IV\&mode=list)，第1列为sample id，第2-12列为特征(11个基因的表达量)，第13列为样本标签(负例为健康人:NC,正例为肝癌病人:HCC)。请大家用R语言完成以下任务:

* 数据预处理
  * 用均值或中位数补全空缺值
  * 对数据进行scaling
* 数据可视化
  * 用PCA对数据进行可视化
* 数据集划分：预留20%数据用于评估模型泛化能力，剩下的用于模型拟合
* 模型选择和模型拟合
  * 任选一种分类器即可
  * 特征选择: 用RFE或其他方式均可，特征数量不限
  * 调参: 根据所选的分类器对相应的超参数进行搜索
* 模型评估
  * 计算预留数据集上的AUROC
  * 绘制ROC曲线
* 请提交代码，必要的文字解释和ROC曲线

### 8.2)

请大家查阅资料，回答以下两个关于随机森林的问题：

* 随机森林中树的数量是不是一个需要通过交叉验证调整的超参数?为什么?
* 请问什么是随机森林的out-of-bag (OOB) error?它和bootstrapping有什么关系?

## 9) More reading

* [An Introduction to Machine Learning with R](https://lgatto.github.io/IntroMachineLearningWithR/)
* [caret documentation](http://topepo.github.io/caret)
* 其他机器学习模型相关[代码](https://github.com/urluzhi/scripts/tree/master/Rscript/machine_learning):
  * `randomForest.R`: Random Forest
  * `logistic_regression.R`: Logistic Regression
  * `svm.R`: SVM
  * `plot_result.R`: Plot your training and testing performance
* more R packages for machine learning
  * [e1071](http://cran.r-project.org/web/packages/e1071/)Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier etc *(142479 downloads)*
  * [rpart](http://cran.r-project.org/web/packages/rpart/) Recursive Partitioning and Regression Trees. (135390)
  * [igraph](http://cran.r-project.org/web/packages/igraph/) A collection of network analysis tools. (122930)
  * [nnet](http://cran.r-project.org/web/packages/nnet/) Feed-forward Neural Networks and Multinomial Log-Linear Models. (108298)
  * [randomForest](http://cran.r-project.org/web/packages/randomForest/) Breiman and Cutler's random forests for classification and regression. (105375)
  * [caret](http://cran.r-project.org/web/packages/caret/) package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. (87151)
  * [kernlab](http://cran.r-project.org/web/packages/kernlab/) Kernel-based Machine Learning Lab. (62064)
  * [glmnet](http://cran.r-project.org/web/packages/glmnet/) Lasso and elastic-net regularized generalized linear models. (56948)
  * [ROCR](http://cran.r-project.org/web/packages/ROCR/) Visualizing the performance of scoring classifiers. (51323)
  * [gbm](http://cran.r-project.org/web/packages/gbm/) Generalized Boosted Regression Models. (44760)
  * [party](http://cran.r-project.org/web/packages/party/) A Laboratory for Recursive Partitioning. (43290)
  * [arules](http://cran.r-project.org/web/packages/arules/) Mining Association Rules and Frequent Itemsets. (39654)
  * [tree](http://cran.r-project.org/web/packages/tree/) Classification and regression trees. (27882)
  * [klaR](http://cran.r-project.org/web/packages/klaR/) Classification and visualization. (27828)
  * [RWeka](http://cran.r-project.org/web/packages/RWeka/) R/Weka interface. (26973)
  * [ipred](http://cran.r-project.org/web/packages/ipred/) Improved Predictors. (22358)
  * [lars](http://cran.r-project.org/web/packages/lars/) Least Angle Regression, Lasso and Forward Stagewise. (19691)
  * [earth](http://cran.r-project.org/web/packages/earth/) Multivariate Adaptive Regression Spline Models. (15901)
  * [CORElearn](http://cran.r-project.org/web/packages/CORElearn/) Classification, regression, feature evaluation and ordinal evaluation. (13856)
  * [mboost](http://cran.r-project.org/web/packages/mboost/) Model-Based Boosting. (13078)
