包含read数量信息的文件类似如下,对于column1 row2的13178信息,它表示Sample_N1经过preprocess中的cutadpat操作后留下的reads数目;对于column3的各个属性的含义,libSizeN表示rawdata的reads数目,rRNA_N表示map到rRNA index上的reads的数目,keptN表示去除rRNA后fastq文件剩余的reads的数目,sequentialMap的< some type of RNA >表示map到< some type of RNA > index上的reads的数目,hg38other表示map到hg38的reads的数目,nonHuman_N表示未map的reads的数目:
Reads are trimmed using a proprietary version of cutAdapt. We map to transcriptome for a better sensitivity (see details in protocol and example).
outputs
File format
Information contained in file
File description
Notes
tsv
gene (ncRNA) quantifications
Non-normalized counts.
2b) Running Scripts
2b.1) Software/Tools
FeatureCounts
2b.2) FeatureCounts
对Mapping步骤得到的不同样本不同RNA类型的<sample>.<some type of RNA>.rsem.clean.bam文件,进行Raw Counts的统计(无需统计hg38other),结果可输出到.../04.counts/<sample>/<sample>.<some type of RNA>.featureCounts.counts
rnanames = ['miRNA', 'piRNA', 'Y_RNA', 'snRNA','srpRNA','tRNA',
'lncRNA','mRNA','other_genomic_region','non_human',
]
rna_ratio = pd.read_table(file1)
x = rnanames
colours = tableau20[:len(x)]/255.
y = rna_ratio.mean()
z = np.array([float('{:.4f}'.format(y[i])) for i in range(y.shape[0])])*100
fig1, ax1 = plt.subplots(figsize=(10,10))
patches, texts = ax1.pie(y, colors=colours, #autopct='%1.1f%%',
shadow=True, startangle=90)
labels = ['{0} - {1:1.2f} %'.format(i,j) for i,j in zip(x, z)]
sort_legend = True
if sort_legend:
patches, labels, dummy = zip(*sorted(zip(patches, labels, y),
key=lambda x: x[2],
reverse=True))
plt.legend(patches, labels, loc='center', bbox_to_anchor=(1.1, .7),
fontsize=8)
def plot_pie(data, rnanames):
'''
data: table_ratio
rnanames: rna type names
adjustment: merge RNA with small percent together
'''
from bokeh.io import output_file, show
from bokeh.palettes import Category20c
from bokeh.plotting import figure
from bokeh.transform import cumsum
x = np.array(rnanames)
y = np.array(data.loc[:,x].mean())
z = np.array([float('{:.10f}'.format(y[i])) for i in range(y.shape[0])])*100
labels = rnanames
dataframe = pd.DataFrame(np.concatenate((x.reshape(-1,1),z.reshape(-1,1)),axis=1))
dataframe.columns=['rna','percent']
dataframe["percent"] = pd.to_numeric(dataframe["percent"])
dataframe['angle'] = dataframe['percent']/dataframe['percent'].sum() * 2*pi
dataframe['color'] = Category20c[len(x)]
p = figure(plot_height=400, title="Pie Chart", toolbar_location=None,
tools="hover", tooltips="@rna: @percent", x_range=(-0.5, 1.0))
p.wedge(x=0, y=1, radius=0.4,
start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
line_color="white", fill_color='color', legend="rna", source=dataframe)
p.axis.axis_label=None
p.axis.visible=False
p.grid.grid_line_color = None
show(p)
lengthdata = pd.read_table(file2)
length = np.array(lengthdata.T)
fig,ax=plt.subplots(10,1,figsize=(20,50))
for i in range(length.shape[0]):
ax[i].plot(length[i],label=rnanames[i],color=colours[i])
ax[i].legend(loc='upper right')
ax[i].set_xticklabels(np.arange(lengthdata.index[0]-5,lengthdata.index[-1],5))
fig, ax = plt.subplots(figsize=(100,20))
sns.boxplot(data = rna_ratio,ax=ax,boxprops=dict(alpha=.5))
ax.set_title(u'RNA percentage in different samples',fontsize=80)
ax.set_xticks(range(10))
ax.set_xticklabels(rnanames,fontsize=40)
ax.set_yticks(np.arange(0,1,0.1))
ax.set_yticklabels(['{:.1f}%'.format(i*10) for i in range(10)],fontsize=40)
example_sce <- SingleCellExperiment(
assays = list(counts = as.matrix(scirepcounts)),
colData = samples_scirep
)
keep_gene <- rowSums(counts(example_sce)) > 0
example_sce <- example_sce[keep_gene,]
## Apply TMM normalisation taking into account all genes
example_sce <- normaliseExprs(example_sce, method = "TMM")
## normalize the object using the saved size factors
example_sce <- normalize(example_sce)
mat_tmm <- normcounts(example_sce)
## Apply RLE normalisation taking into account all genes
example_sce <- normaliseExprs(example_sce, method = "RLE")
## normalize the object using the saved size factors
example_sce <- normalize(example_sce)
mat_rle <- normcounts(example_sce)
library(EDASeq)
library(RUVSeq)
library(sva)
library(scRNA.seq.funcs)
##sample info ranked by mat
if(unique(is.na(sample_info$sample_id)))
stop("sample_id not in file")
rownames(sample_info) = sample_info$sample_id
sample_info=sample_info[names(mat),]
rownames(sample_info) <- c()
scIdx <- matrix(-1, ncol = max(table(sample_info$label)), nrow = 2)
tmp <- which(sample_info$label == "Colorectal Cancer")
scIdx[1, 1:length(tmp)] <- tmp
tmp <- which(sample_info$label == "Healthy Control")
scIdx[2, 1:length(tmp)] <- tmp
cIdx <- rownames(mat)
mat <- log(mat+0.001)
ruvs <- RUVs(as.matrix(mat), cIdx, k = 10, scIdx = scIdx, isLog = TRUE)
exp(ruv$normalizedCounts)
batch_info <-read.table(batchinfo_path,sep='\t',row.names=1,header=T,check.names = FALSE)
batchname <-toString(names(batch_info)[batch_column])
batch_info=batch_info[names(mat),]
mod <- model.matrix(~ 1, data = batch_info)
if (!dim(mat)[2]==dim(batch_info)[1])
stop('sample numbers in batch info and expression matrix should be same')
combat <- ComBat(
dat = log(mat+0.001),
batch = factor(batch_info[,batch_column]),
mod = mod,
par.prior = TRUE,
prior.plots = FALSE
)
mat <- exp(combat)
random_state = np.random.RandomState(1289237)
x = random_state.normal(10, 2, size=1000)
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler
scalers = {
'Standard': StandardScaler(),
'MinMax': MinMaxScaler(),
'MaxAbs': MaxAbsScaler(),
'Robust': RobustScaler()
}
scalernames = ['Standard','MinMax','MaxAbs','Robust']
fig, axes = plt.subplots(1,4, figsize=(20, 4))
for i in range(4):
x_scaled = scalers[scalernames[i]].fit_transform(x.reshape(-1,1)).ravel()
sns.distplot(x_scaled, ax=axes[i])
axes[i].set_title(scalernames[i])