Feature selection

Here, we show how to use SuperSCC to find highly variable genes for the overall dataset or markers of each cluster/cell type for scRNAseq data.

[24]:

import SuperSCC as scc
import pandas as pd
import scanpy as sc
import numpy as np

[6]:

data = pd.read_csv('/mnt/disk5/zhongmin/superscc/师兄整理的肺数据/未去批次效应couns数据/没有去除批次效应_Banovich_Kropski_2020数据.csv', index_col=0)
cell_type = pd.read_csv('/home/fengtang/jupyter_notebooks/working_script/evulate_clustering/cell_type_info/finest/Banovich_Kropski_2020_finest_celltype.csv', index_col = 0)

[ ]:

adata = sc.AnnData(data.select_dtypes("number"))
sc.pp.normalize_total(adata, target_sum = 1e4)
sc.pp.log1p(adata)

data_norm = pd.DataFrame(adata.X)
data_norm.columns = adata.var_names
data_norm.index = adata.obs_names

Since SuperSCC’ feature selection is an supervised process, it neeeds clustering or cell type labels. So we need to do add this information into the normalized counts matrix:

[ ]:

data_norm.loc[:, "cell_type"] = cell_type.cell_type.values

[ ]:

# find highly variable genes for the overall dataset

my_logger = scc.log_file("logger", "a") # create a logging object

hvgs = scc.feature_selection.find_signature_genes(data_norm.copy(),
                                label_column = "cell_type",
                                save = True,
                                logger = my_logger,
                                variance_threshold = "mean",
                                n_features_to_select = 0.15,
                                model = "svm",
                                mutual_info = False,
                                F_test = True,
                                normalization_method = "Min-Max"
                                )

# alternatively
# hvgs = scc.feature_selection.feature_selection(data_norm.copy(),
#                              label_column = "cell_type",
#                              model = "svm",
#                              normalization_method = "Min-Max",
#                              save = True,
#                              logger = my_logger)

[13]:

hvgs.keys()

[13]:

dict_keys(['retained_features_ranking_by_variance', 'retained_features_ranking_by_correlation', 'retained_features_ranking_by_embedding', 'retained_features_ranking_by_wrapping', 'final_feature_selection_by_ensemble', 'model_accuracy', 'params_used_for_feature_selection', 'retained_features_by_filtering', 'retained_features_by_embedding', 'retained_features_by_wrapping', 'final_feature_selection_by_intersection'])

The output includes:

retained_features_ranking_by_variance: Feature rankings ordered by variance.
retained_features_ranking_by_correlation: Feature rankings ordered by correlation-test p-value.
retained_features_ranking_by_embedding: Feature rankings ordered by induction-model-derived feature importances.
retained_features_ranking_by_wrapping: Feature rankings ordered by RFE-induction-model-derived feature importances.
retained_features_by_filtering: Feature subset from variance and correlation filter.
retained_features_by_embedding: Feature subset from induction-model.
retained_features_by_wrapping: Feature subset from RFE-induction-model.
model_accuracy: Accuracy scores for the training model.
params_used_for_feature_selection: Settings used for running the feature_selection function.
final_feature_selection_by_intersection: Ideal feature subset from aggregating different feature subsets.
final_feature_selection_by_ensemble: Ideal feature subset from aggregating differentfeature rankings.
label_classes: The unique label class of cell type in the data.

[ ]:

# find markers of each cluster/cell type
markers = scc.feature_selection.find_markers_ovr(data_norm.copy(),
                               label_column = "cell_type",
                               model = "svm",
                               normalization_method = "Min-Max",
                               save = True,
                               logger = my_logger,
                               variance_threshold = "mean",
                               n_features_to_select = 0.15,
                               mutual_info = False,
                               F_test = True,
                               normalization_method = "Min-Max"
                               )

[ ]:

markers.keys(), markers["0"].keys()

(dict_keys(['3', '0', '4', '1', '2', '5']),
 dict_keys(['features', 'sub_high_expression_genes']))

For each cluster/cell type, besides the output mentioned above stored in the features key, it also includes the feature expression intensity and expression ratio info stored in sub_high_expression_genes key.

[ ]:

markers["0"]["sub_high_expression_genes"]["0"]

	feature	expression1	pct1	rank1	expression2	pct2	rank2	score
ENSG00000100246	ENSG00000100246	0.089330	0.100143	1	NaN	NaN	NaN	NaN
ENSG00000051596	ENSG00000051596	0.090060	0.102182	2	NaN	NaN	NaN	804.086531
ENSG00000139579	ENSG00000139579	0.090104	0.101774	3	NaN	NaN	NaN	NaN
ENSG00000006757	ENSG00000006757	0.090352	0.100551	4	NaN	NaN	NaN	NaN
ENSG00000117505	ENSG00000117505	0.090451	0.101163	5	0.162982	0.140160	1176.0	NaN
...	...	...	...	...	...	...	...	...
ENSG00000251562	ENSG00000251562	4.220352	0.969610	4638	4.312087	0.978009	4321.0	1246.790112
ENSG00000198804	ENSG00000198804	4.278880	0.986131	4639	4.397296	0.986289	4323.0	1110.438579
ENSG00000198712	ENSG00000198712	4.430985	0.986947	4640	4.689565	0.990395	4326.0	1562.438389
ENSG00000168878	ENSG00000168878	4.643004	0.944728	4641	0.329463	0.174074	2878.0	1692.315966
ENSG00000168484	ENSG00000168484	5.248846	0.863553	4642	0.949610	0.454262	3990.0	1672.888264

4642 rows × 8 columns

expression1, pct1 and rank1 represent the expression intensity, expression ratio and rank in the interst group, while expression2, pct2 and rank2 represent the expression intensity, expression ratio and rank in other groups except interst group. Such info can be used to remove negative markers (good for distinguishing interst group from other groups but have low expression in the interest group) out of informative features stored in features key. You can do:

[31]:

# get the expression intensity and ratio info
df = markers["0"]["sub_high_expression_genes"]["0"]

# use feature subset from aggregating feature rankings
important_features = [i[0] for i in markers["0"]["features"]["final_feature_selection_by_ensemble"]]

# remove little informative features
df = df.loc[df.feature.isin(important_features)]

# calculate log2 fold change
df.loc[:, "log2_fold_change"] = np.log2(df.expression1 / df.expression2)

# only keep features with satified expression intensity and ratio in the interest group
df = df.loc[((df.expression1 > 1) & (1 - (df.pct2/df.pct1) > 0.5)) | (np.isnan(df.pct2)) ].sort_values("score", ascending = False)
df.head(5)

/tmp/ipykernel_3943216/2645709130.py:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[:, "log2_fold_change"] = np.log2(df.expression1 / df.expression2)

[31]:

	feature	expression1	pct1	rank1	expression2	pct2	rank2	score	log2_fold_change
ENSG00000124107	ENSG00000124107	4.134322	0.908016	4634	0.512570	0.210174	3497.0	1692.613278	3.011830
ENSG00000168878	ENSG00000168878	4.643004	0.944728	4641	0.329463	0.174074	2878.0	1692.315966	3.816870
ENSG00000161055	ENSG00000161055	1.940696	0.587803	4514	0.458728	0.228191	3355.0	1691.958975	2.080864
ENSG00000131400	ENSG00000131400	3.110635	0.843769	4602	NaN	NaN	NaN	1691.304883	NaN
ENSG00000166347	ENSG00000166347	2.939046	0.944116	4586	0.717823	0.402265	3822.0	1690.470148	2.033649