Feature selection
Here, we show how to use SuperSCC to find highly variable genes for the overall dataset or markers of each cluster/cell type for scRNAseq data.
[24]:
import SuperSCC as scc
import pandas as pd
import scanpy as sc
import numpy as np
[6]:
data = pd.read_csv('/mnt/disk5/zhongmin/superscc/师兄整理的肺数据/未去批次效应couns数据/没有去除批次效应_Banovich_Kropski_2020数据.csv', index_col=0)
cell_type = pd.read_csv('/home/fengtang/jupyter_notebooks/working_script/evulate_clustering/cell_type_info/finest/Banovich_Kropski_2020_finest_celltype.csv', index_col = 0)
[ ]:
adata = sc.AnnData(data.select_dtypes("number"))
sc.pp.normalize_total(adata, target_sum = 1e4)
sc.pp.log1p(adata)
data_norm = pd.DataFrame(adata.X)
data_norm.columns = adata.var_names
data_norm.index = adata.obs_names
Since SuperSCC’ feature selection is an supervised process, it neeeds clustering or cell type labels. So we need to do add this information into the normalized counts matrix:
[ ]:
data_norm.loc[:, "cell_type"] = cell_type.cell_type.values
[ ]:
# find highly variable genes for the overall dataset
my_logger = scc.log_file("logger", "a") # create a logging object
hvgs = scc.feature_selection.find_signature_genes(data_norm.copy(),
label_column = "cell_type",
save = True,
logger = my_logger,
variance_threshold = "mean",
n_features_to_select = 0.15,
model = "svm",
mutual_info = False,
F_test = True,
normalization_method = "Min-Max"
)
# alternatively
# hvgs = scc.feature_selection.feature_selection(data_norm.copy(),
# label_column = "cell_type",
# model = "svm",
# normalization_method = "Min-Max",
# save = True,
# logger = my_logger)
[13]:
hvgs.keys()
[13]:
dict_keys(['retained_features_ranking_by_variance', 'retained_features_ranking_by_correlation', 'retained_features_ranking_by_embedding', 'retained_features_ranking_by_wrapping', 'final_feature_selection_by_ensemble', 'model_accuracy', 'params_used_for_feature_selection', 'retained_features_by_filtering', 'retained_features_by_embedding', 'retained_features_by_wrapping', 'final_feature_selection_by_intersection'])
The output includes:
retained_features_ranking_by_variance: Feature rankings ordered by variance.retained_features_ranking_by_correlation: Feature rankings ordered by correlation-test p-value.retained_features_ranking_by_embedding: Feature rankings ordered by induction-model-derived feature importances.retained_features_ranking_by_wrapping: Feature rankings ordered by RFE-induction-model-derived feature importances.retained_features_by_filtering: Feature subset from variance and correlation filter.retained_features_by_embedding: Feature subset from induction-model.retained_features_by_wrapping: Feature subset from RFE-induction-model.model_accuracy: Accuracy scores for the training model.params_used_for_feature_selection: Settings used for running thefeature_selectionfunction.final_feature_selection_by_intersection: Ideal feature subset from aggregating different feature subsets.final_feature_selection_by_ensemble: Ideal feature subset from aggregating differentfeature rankings.label_classes: The unique label class of cell type in the data.
[ ]:
# find markers of each cluster/cell type
markers = scc.feature_selection.find_markers_ovr(data_norm.copy(),
label_column = "cell_type",
model = "svm",
normalization_method = "Min-Max",
save = True,
logger = my_logger,
variance_threshold = "mean",
n_features_to_select = 0.15,
mutual_info = False,
F_test = True,
normalization_method = "Min-Max"
)
[ ]:
markers.keys(), markers["0"].keys()
(dict_keys(['3', '0', '4', '1', '2', '5']),
dict_keys(['features', 'sub_high_expression_genes']))
For each cluster/cell type, besides the output mentioned above stored in the features key, it also includes the feature expression intensity and expression ratio info stored in sub_high_expression_genes key.
[ ]:
markers["0"]["sub_high_expression_genes"]["0"]
| feature | expression1 | pct1 | rank1 | expression2 | pct2 | rank2 | score | |
|---|---|---|---|---|---|---|---|---|
| ENSG00000100246 | ENSG00000100246 | 0.089330 | 0.100143 | 1 | NaN | NaN | NaN | NaN |
| ENSG00000051596 | ENSG00000051596 | 0.090060 | 0.102182 | 2 | NaN | NaN | NaN | 804.086531 |
| ENSG00000139579 | ENSG00000139579 | 0.090104 | 0.101774 | 3 | NaN | NaN | NaN | NaN |
| ENSG00000006757 | ENSG00000006757 | 0.090352 | 0.100551 | 4 | NaN | NaN | NaN | NaN |
| ENSG00000117505 | ENSG00000117505 | 0.090451 | 0.101163 | 5 | 0.162982 | 0.140160 | 1176.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ENSG00000251562 | ENSG00000251562 | 4.220352 | 0.969610 | 4638 | 4.312087 | 0.978009 | 4321.0 | 1246.790112 |
| ENSG00000198804 | ENSG00000198804 | 4.278880 | 0.986131 | 4639 | 4.397296 | 0.986289 | 4323.0 | 1110.438579 |
| ENSG00000198712 | ENSG00000198712 | 4.430985 | 0.986947 | 4640 | 4.689565 | 0.990395 | 4326.0 | 1562.438389 |
| ENSG00000168878 | ENSG00000168878 | 4.643004 | 0.944728 | 4641 | 0.329463 | 0.174074 | 2878.0 | 1692.315966 |
| ENSG00000168484 | ENSG00000168484 | 5.248846 | 0.863553 | 4642 | 0.949610 | 0.454262 | 3990.0 | 1672.888264 |
4642 rows × 8 columns
expression1, pct1 and rank1 represent the expression intensity, expression ratio and rank in the interst group, while expression2, pct2 and rank2 represent the expression intensity, expression ratio and rank in other groups except interst group. Such info can be used to remove negative markers (good for distinguishing interst group from other groups but have low expression in the interest group) out of informative features stored in features key. You can do:
[31]:
# get the expression intensity and ratio info
df = markers["0"]["sub_high_expression_genes"]["0"]
# use feature subset from aggregating feature rankings
important_features = [i[0] for i in markers["0"]["features"]["final_feature_selection_by_ensemble"]]
# remove little informative features
df = df.loc[df.feature.isin(important_features)]
# calculate log2 fold change
df.loc[:, "log2_fold_change"] = np.log2(df.expression1 / df.expression2)
# only keep features with satified expression intensity and ratio in the interest group
df = df.loc[((df.expression1 > 1) & (1 - (df.pct2/df.pct1) > 0.5)) | (np.isnan(df.pct2)) ].sort_values("score", ascending = False)
df.head(5)
/tmp/ipykernel_3943216/2645709130.py:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df.loc[:, "log2_fold_change"] = np.log2(df.expression1 / df.expression2)
[31]:
| feature | expression1 | pct1 | rank1 | expression2 | pct2 | rank2 | score | log2_fold_change | |
|---|---|---|---|---|---|---|---|---|---|
| ENSG00000124107 | ENSG00000124107 | 4.134322 | 0.908016 | 4634 | 0.512570 | 0.210174 | 3497.0 | 1692.613278 | 3.011830 |
| ENSG00000168878 | ENSG00000168878 | 4.643004 | 0.944728 | 4641 | 0.329463 | 0.174074 | 2878.0 | 1692.315966 | 3.816870 |
| ENSG00000161055 | ENSG00000161055 | 1.940696 | 0.587803 | 4514 | 0.458728 | 0.228191 | 3355.0 | 1691.958975 | 2.080864 |
| ENSG00000131400 | ENSG00000131400 | 3.110635 | 0.843769 | 4602 | NaN | NaN | NaN | 1691.304883 | NaN |
| ENSG00000166347 | ENSG00000166347 | 2.939046 | 0.944116 | 4586 | 0.717823 | 0.402265 | 3822.0 | 1690.470148 | 2.033649 |