Label transfer
Here, we show how SuperSCC implements marker-gene-based label transfer.
[1]:
import SuperSCC as scc
import pandas as pd
import scanpy as sc
import os
from sklearn.metrics import confusion_matrix, cohen_kappa_score, matthews_corrcoef, accuracy_score, f1_score
from sklearn.model_selection import train_test_split
[37]:
# read reference data
data = pd.read_csv('/mnt/disk5/zhongmin/superscc/师兄整理的肺数据/未去批次效应couns数据/没有去除批次效应_Banovich_Kropski_2020数据.csv', index_col=0)
cell_type = pd.read_csv('/home/fengtang/jupyter_notebooks/working_script/evulate_clustering/cell_type_info/finest/Banovich_Kropski_2020_finest_celltype.csv', index_col = 0)
[ ]:
# split train and test data
Xtrain, Xtest, Ytrain, Ytest = train_test_split(data, cell_type, test_size= 0.3)
[43]:
# do log-normalization for training and testing data
Xtrain = sc.AnnData(Xtrain.select_dtypes("number"))
sc.pp.normalize_total(Xtrain, target_sum = 1e4)
sc.pp.log1p(Xtrain)
Xtrain_norm = pd.DataFrame(Xtrain.X)
Xtrain_norm.columns = Xtrain.var_names
Xtrain_norm.index = Xtrain.obs_names
Xtrain_norm.loc[:, "cell_type"] = Ytrain.cell_type.values
Xtest = sc.AnnData(Xtest.select_dtypes("number"))
sc.pp.normalize_total(Xtest, target_sum = 1e4)
sc.pp.log1p(Xtest)
Xtest_norm = pd.DataFrame(Xtest.X)
Xtest_norm.columns = Xtest.var_names
Xtest_norm.index = Xtest.obs_names
Xtest_norm.loc[:, "cell_type"] = Ytest.cell_type.values
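To make the log-normalization step concrete, the following is a minimal NumPy sketch of the arithmetic that `sc.pp.normalize_total(..., target_sum = 1e4)` followed by `sc.pp.log1p` is understood to perform with default settings: each cell (row) is scaled to a total of 10,000 counts and then transformed with log(1 + x). The helper name `log_normalize` and the toy count matrix are illustrative assumptions, not part of SuperSCC or scanpy.

```python
import numpy as np

def log_normalize(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # guard against division by zero for empty cells
    return np.log1p(counts / totals * target_sum)

# toy count matrix: 2 cells x 3 genes (hypothetical values)
toy_counts = np.array([[1, 3, 0],
                       [10, 0, 10]])
toy_norm = log_normalize(toy_counts)
```

Because the scaling is done per cell, normalizing the training and testing sets separately, as in the cell above, does not leak information between them.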
[ ]:
# find informative features in training data
my_logger = scc.log_file("logger", "a")
info_features = scc.feature_selection.feature_selection(Xtrain_norm.copy(), label_column = "cell_type", model = "svm", normalization_method = "Min-Max", save = True, logger = my_logger)
info_features = [i[0] for i in info_features["final_feature_selection_by_ensemble"]] # use ensemble-selection features
[ ]:
# model training on training data
model = scc.label_transfer.model_training(Xtrain_norm.copy(), label_column = "cell_type", features = info_features, model = "svm", normalization_method = "Min-Max", save = True, logger = my_logger)
2025-02-07 15:41:12 start model training
2025-02-07 15:41:12 model traning based on svm algorithm
2025-02-07 15:41:13 doing Min-Max normalization
2025-02-07 15:41:13 doing label encoding
2025-02-07 15:41:13 grid search below paramters getting the best model
* C: [0.01 0.12 0.23 0.34 0.45 0.56 0.67 0.78 0.89 1. ]
* kernel: ['rbf', 'poly', 'sigmoid', 'linear']
/home/fengtang/anaconda3/envs/SuperSCC/lib/python3.11/site-packages/sklearn/model_selection/_split.py:700: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
warnings.warn(
2025-02-07 15:55:10 finish model training
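The log above reports Min-Max normalization followed by a grid search over `C` and `kernel`. As a rough illustration of what such a search looks like, here is a minimal scikit-learn sketch. It is not SuperSCC's actual implementation: the `Pipeline` layout, the 5-fold cross-validation (suggested by `n_splits=5` in the warning), and the iris toy data standing in for the expression matrix are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# toy data standing in for the normalized expression matrix and cell-type labels
X, y = load_iris(return_X_y=True)

# scale features to [0, 1] inside the pipeline so each CV fold is scaled independently
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("svm", SVC()),
])

# same grid as printed in the log: 10 values of C from 0.01 to 1, four kernels
param_grid = {
    "svm__C": np.round(np.linspace(0.01, 1, 10), 2).tolist(),
    "svm__kernel": ["rbf", "poly", "sigmoid", "linear"],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
best_params = search.best_params_
```

The `UserWarning` in the output indicates that at least one cell type has a single member in the training set, so it cannot appear in every cross-validation fold. Predictions for such rare types should be interpreted with care.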
[ ]:
# do label transfer
pred = scc.label_transfer.predict_label(Xtest_norm, models = ".+training_model.+pkl$", wk_dir = os.getcwd(), save=True, logger = my_logger)
2025-02-07 16:06:59 start label prediction based on svm_training_model_2025-02-07 15:55:10.pkl model
2025-02-07 16:07:28 finish label prediction based on svm_training_model_2025-02-07 15:55:10.pkl
[55]:
pred["svm_training_model_2025-02-07 15:55:10.pkl"]["prediction"][0:5] # glance at the predicted labels
[55]:
['Alveolar macrophages',
'EC venous pulmonary',
'Monocyte-derived Mph',
'Non-classical monocytes',
'AT2']
[56]:
# compare predicted labels with ground truth labels
confusion_matrix(Ytest, pred["svm_training_model_2025-02-07 15:55:10.pkl"]["prediction"])
[56]:
array([[ 19, 1, 7, ..., 0, 0, 4],
[ 1, 195, 5, ..., 0, 0, 0],
[ 4, 7, 1061, ..., 0, 0, 0],
...,
[ 0, 0, 0, ..., 7, 0, 0],
[ 0, 0, 0, ..., 0, 11, 0],
[ 3, 0, 1, ..., 2, 0, 109]])
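The raw array above carries no row or column names, which makes it hard to tell which cell types are confused with each other. One possible way to label it is to pass an explicit `labels` list to `confusion_matrix` and wrap the result in a DataFrame. The sketch below uses small toy label lists (hypothetical values) rather than the real `Ytest` and predictions.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# toy ground-truth and predicted labels (hypothetical)
y_true = ["AT2", "AT2", "AT1", "Basal", "AT1"]
y_pred = ["AT2", "AT1", "AT1", "Basal", "AT1"]

# fix the label order so rows and columns can be named consistently
labels = sorted(set(y_true) | set(y_pred))
cm = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=labels),
    index=labels,
    columns=labels,
)
cm.index.name = "true"
cm.columns.name = "predicted"
```

Rows correspond to ground-truth labels and columns to predicted labels, so off-diagonal entries such as `cm.loc["AT2", "AT1"]` count cells of one type that were assigned to another.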
[66]:
# evaluate the prediction
{
"accuracy_score": accuracy_score(Ytest, pred["svm_training_model_2025-02-07 15:55:10.pkl"]["prediction"]),
"f1_score": f1_score(Ytest, pred["svm_training_model_2025-02-07 15:55:10.pkl"]["prediction"], average= "weighted"),
"cohen_kappa_score": cohen_kappa_score(Ytest, pred["svm_training_model_2025-02-07 15:55:10.pkl"]["prediction"]),
"matthews_corrcoef": matthews_corrcoef(Ytest, pred["svm_training_model_2025-02-07 15:55:10.pkl"]["prediction"])
}
[66]:
{'accuracy_score': 0.9083333333333333,
'f1_score': 0.9055509253043628,
'cohen_kappa_score': 0.9001176613948768,
'matthews_corrcoef': 0.9001806199319379}
By default, SuperSCC-based label transfer transfers every label from the reference to the query. However, the accuracy and reliability of label transfer depend heavily on the quality and comprehensiveness of the reference data. When the reference lacks cell types that are present in the query, label transfer may assign incorrect labels to those cells. To mitigate this limitation, users can keep only the high-confidence transferred labels and mark low-confidence ones as ‘uncertain’, indicating a possible cell type or state that is unique to the query. This can be done as follows:
[ ]:
# model training on training data
model = scc.label_transfer.model_training(Xtrain_norm.copy(),
label_column = "cell_type",
features = info_features,
model = "svm",
normalization_method = "Min-Max",
save = True,
logger = my_logger,
probability = True) # set probability to True to return the per-cell transfer confidence for each reference label
# do label transfer
pred = scc.label_transfer.predict_label(Xtest_norm,
models = ".+training_model.+pkl$",
pred_confidence_cutoff = 0.7, # set pred_confidence_cutoff to 0.7 to keep only reference labels transferred with confidence above 0.7 in the query
wk_dir = os.getcwd(),
save=True,
logger = my_logger)
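The idea behind a confidence cutoff can be sketched independently of SuperSCC: given a matrix of per-cell class probabilities, keep the most probable label only when its probability exceeds the cutoff, and otherwise fall back to a placeholder. The helper `apply_confidence_cutoff`, the fallback string, and the toy probabilities below are assumptions for illustration and may differ from what `predict_label` does internally.

```python
import numpy as np

def apply_confidence_cutoff(proba, classes, cutoff=0.7, fallback="uncertain"):
    """Return the top label per cell, or `fallback` when its probability is not above `cutoff`."""
    proba = np.asarray(proba, dtype=float)
    classes = np.asarray(classes, dtype=object)
    best = proba.argmax(axis=1)             # index of the most probable class per cell
    confident = proba.max(axis=1) > cutoff  # True where the top probability exceeds the cutoff
    labels = classes[best]                  # fancy indexing returns a copy, safe to modify
    labels[~confident] = fallback
    return labels.tolist()

# toy probabilities: 3 cells x 3 reference labels (hypothetical values)
toy_proba = [[0.90, 0.05, 0.05],
             [0.40, 0.35, 0.25],
             [0.10, 0.75, 0.15]]
toy_classes = ["AT2", "Alveolar macrophages", "EC venous pulmonary"]
toy_labels = apply_confidence_cutoff(toy_proba, toy_classes)
```

In this toy case the second cell has no label with probability above 0.7, so it is the only one replaced by the fallback. A higher cutoff trades coverage for reliability: more cells are left unlabeled, but the labels that remain are less likely to be wrong.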
When features of the reference are missing from the query, they are padded with zeros by default so that the reference and the query have consistent dimensions. Alternatively, MAGIC-based imputation can be activated by setting the magic_based_imputation argument to True.
[ ]:
# do label transfer with MAGIC-based imputation
pred = scc.label_transfer.predict_label(Xtest_norm,
models = ".+training_model.+pkl$",
magic_based_imputation = True, # set to True to activate MAGIC-based imputation
save=True,
wk_dir = os.getcwd(),
logger = my_logger)
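For comparison with MAGIC-based imputation, the default zero-padding behaviour described above can be approximated in pandas with `DataFrame.reindex`. This is a sketch of the general technique, not SuperSCC's internal code; the helper `pad_missing_features` and the toy gene names and values are illustrative assumptions.

```python
import pandas as pd

def pad_missing_features(query, reference_features, fill_value=0.0):
    """Align query columns to the reference feature list, filling absent features with `fill_value`."""
    # reindex adds missing columns, drops extra ones, and matches the reference order
    return query.reindex(columns=list(reference_features), fill_value=fill_value)

# toy query expression matrix lacking one reference feature (hypothetical values)
toy_query = pd.DataFrame(
    {"SFTPC": [2.1, 0.0], "MARCO": [0.0, 3.4]},
    index=["cell_1", "cell_2"],
)
toy_reference_features = ["SFTPC", "PECAM1", "MARCO"]
toy_padded = pad_missing_features(toy_query, toy_reference_features)
```

Zero-padding is cheap and keeps the matrix shape the trained model expects, but it treats a missing gene as unexpressed. Imputation may be preferable when many informative features are absent from the query.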