0) 编程环境

  • cellrangerigblast安装在路径/data/2023-bioinfo-shared/software
  • 包含一些单细胞分析常用的python和R package的conda环境,可以在本地通过以下命令创建。
conda env create -f ITP.yml

1) 参考基因组

  • cellranger比对的参考基因路径
# GEX Reference
# VDJ Reference
  • Igblast database路径

2) 构建count matrix和BCR contig组装

  • 对于10x基因表达数据可以使用cellranger count对fastq进行比对
cellranger count --id={sample_id}-5LIB \
--transcriptome={ref} \
--fastqs={input.fastq} \
--sample={sample_id}-5LIB \
--localcores=20 \
  • 对于10x V(D)J数据可以使用cellranger vdj进行BCR contig的组装
cellranger vdj --id={sample_id}-BCR \
--reference={ref} \
--fastqs={fastq} \
--sample={sample_id}-BCR \
--localcores=10 \

3) GEX分析

  • R包Seurat使用教程 使用R语言Seurat包进行分析的同学可以参考Seurat官网提供的PBMC3K数据分析标准流程代码,该教程没有提供批次校正的代码,可以查阅官网中其他教程或是其他论坛的代码进行学习:
  • 批次效应校正 批次校正的方法有许多并且对于不同的样本、数据可能会有不同的选择,常见的有harmony、CCA等,在一次分析中往往选择一种方法使用,但可能需要对比不同方法对后续分析产生的影响,感兴趣的同学可以查阅有关批次效应校准方法的对比的文献:
Tran H T N, Ang K S, Chevrier M, et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data[J]. Genome biology, 2020, 21: 1-32.
  • python相关库导入
import logging, matplotlib, os, sys
import copy
import scanpy as sc
import numpy as np
import scipy as sp
import pandas as pd
import scrublet as scr
import matplotlib.pyplot as plt
from matplotlib.pyplot import rc_context
from anndata import AnnData
from matplotlib import rcParams
from matplotlib import colors
import seaborn as sb
import warnings
# from rpy2.robjects.packages import importr
%matplotlib inline
# %matplotlib notebook
# configure file
sc.settings.autoshow = False
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=150, dpi_save=300, format='png', frameon=False, transparent=True, fontsize=10)
plt.rcParams["image.aspect"] = "equal"
plt.rcParams["figure.figsize"] = ([3,3])
warnings.simplefilter(action='ignore', category=FutureWarning)
  • 导入cellranger结果文件到scanpy
sam1 = sc.read_10x_mtx("./ITP_1/outs/raw_feature_bc_matrix"); sam1.obs['id'] = "ITP_1"; sam1.obs['disease'] = "ITP";
sam2 = sc.read_10x_mtx("./ITP_2/outs/raw_feature_bc_matrix"); sam2.obs['id'] = "ITP_2"; sam2.obs['disease'] = "ITP";
samples = [sam1,sam2]
samplescopy = copy.deepcopy(samples)
adata = sam1.concatenate(samples[1:])
  • Quality control
scrub = scr.Scrublet(adata.X)
adata.obs['doublet_scores'], adata.obs['predicted_doublets'] = scrub.scrub_doublets()
adata.var['mt'] = adata.var_names.str.startswith('MT-') # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True), ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],jitter=0.4, multi_panel=True,show=True, save="n_genes_by_counts-total_counts-pct_counts_mt.png")
# Base filtering for trivial QC failures:
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_cells(adata, min_counts=400)
sc.pp.filter_cells(adata, max_counts=20000)
adata = adata[adata.obs.pct_counts_mt < 10, :]
sc.pp.filter_genes(adata, min_cells=3) # Only filter genes here;
adata = adata[adata.obs['predicted_doublets']==False,:]
print('Number of cells after gene filter: {:d}'.format(adata.n_obs))
  • Normalization
adata.layers["counts"] = adata.X.copy() ## preserve counts
scales_counts = sc.pp.normalize_total(adata, target_sum=None, inplace=False)
# log1p transform
adata.layers["log1p_norm"] = sc.pp.log1p(scales_counts["X"], copy=True)
  • Feature Selection and Dimensionality Reduction
adata.X = adata.layers["log1p_norm"]
adata.var["highly_variable"] = adata.var["highly_deviant"]
sc.pp.pca(adata, svd_solver="arpack", use_highly_variable=True), n_pcs=30, log=True, show=True, save='pca_variance.pdf')
  • Clustering, resolution=1.5, key_added='leiden_r15')
  • Major cell type annotation, color=['CD34','CD3D','KLRF1','FCN1','HBD','CD79A'], show=True, size=10,
edgecolor="none", save="_major_celltype.png",ncols=3, legend_loc=None, wspace=0.2, color_map='Reds')

4) BCR分析

  • Analysis using Change-O toolkit
  • Converting 10X V(D)J data into the AIRR Community standardized format igblast -s filtered_contig.fasta -b igblast_1.19 \
--organism human --loci ig --format blast
#The -b argument specifies the path containing the database, internal_data, and optional_file directories required by IgBLAST.
# The output is "filtered_contig_igblast.fmt7". igblast -i filtered_contig_igblast.fmt7 -s filtered_contig.fasta -r \
imgt_human_*.fasta \
--10x filtered_contig_annotations.csv --extended
# The output is "filtered_contig_igblast_db-pass.tsv", which overwrites the V, D and J gene assignments generated by Cell Ranger and uses those generated by IgBLAST instead.
# Standalone IgBLAST blast-style tabular output is parsed by the igblast subcommand of to generate the standardized tab-delimited database file on which all subsequent Change-O modules operate.
# The optional --extended argument adds extra columns to the output database containing IMGT-gapped CDR/FWR regions and alignment metrics.
# -r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta
  • Identifying clones from B cells in AIRR formatted 10X V(D)J data
  1. 1.
    Splitting into separate light and heavy chain files. To group B cells into clones from AIRR Rearrangement data, the output from MakeDb must be parsed into a light chain file and a heavy chain file: select -d filtered_contig_igblast_db-pass.tsv -f locus -u "IGH" \
--logic all --regex --outname temp select -d temp_parse-select.tsv -f productive -u T --outname temp2 select -d temp2_parse-select.tsv -f v_call j_call c_call -u "IGH" \
--logic all --regex --outname heavy select -d filtered_contig_igblast_db-pass.tsv -f locus -u "IG[LK]" \
--logic all --regex --outname temp select -d temp_parse-select.tsv -f productive -u T --outname temp2 select -d temp2_parse-select.tsv -f v_call j_call c_call -u "IG[LK]" \
--logic all --regex --outname light
# the outputs are "heavy_parse-select.tsv" and "light_parse-select.tsv". Non-productive sequences were removed.
#records with disagreements between the C-region primers and the reference alignment were removed too
#(vjc are both IGH or IGL/LGK).
  1. 2.
    Calculating nearest neighbor distances based on heavy chains
data(ExampleDb, package="alakazam")
heavy_parse <- read.table("./heavy_parse-select.tsv",sep = "\t",header = T,stringsAsFactors = F)
dist_ham <- distToNearest(heavy_parse, sequenceColumn="junction",
vCallColumn="v_call", jCallColumn="j_call",
model="ham", normalize="len", nproc=1)
output_ham <- findThreshold(dist_ham$dist_nearest, method="density")
dist_s5f <- distToNearest(ExampleDb, sequenceColumn="junction",
vCallColumn="v_call", jCallColumn="j_call",
model="hh_s5f", normalize="none", nproc=1)
output_s5f <- findThreshold(dist_s5f$dist_nearest, method="density")
output_ham@threshold;output_s5f@threshold# show the threshold
  1. 3.
    Clonal assignment using shazam -d heavy_parse-select.tsv --act set --model ham \
--norm len --dist 0.16
# or use other models: -d heavy_parse-select.tsv --act set --model hh_s5f --norm none --dist **
# output is "heavy_parse-select_clone-pass.tsv"
#Correct clonal groups based on light chain data: -d heavy_parse-select_clone-pass.tsv -e light_parse-select.tsv \
-o 10X_clone-pass.tsv
#The algorithm will (1) remove cells associated with more than one heavy chain and
#(2) correct heavy chain clone definitions based on an analysis of the light chain partners
#associated with the heavy chain clone.
  • Reconstructing germline sequences -d 10X_clone-pass.tsv -g dmask --cloned\
-r $SCRATCH/projects/B/immcantation/germlines/imgt/human/vdj/imgt_human_*.fasta
# The output is "10X_clone-pass_germ-pass.tsv".
#this will generate a single germline of consensus length for each clone
  • Define SHM (mutation analysis)
  • Clonal abundance and diversity