Feature selection

Here, we show how to use SuperSCC to find highly variable genes for the overall dataset or markers of each cluster/cell type for scRNAseq data.

[24]:
import SuperSCC as scc
import pandas as pd
import scanpy as sc
import numpy as np
[6]:
data = pd.read_csv('/mnt/disk5/zhongmin/superscc/师兄整理的肺数据/未去批次效应couns数据/没有去除批次效应_Banovich_Kropski_2020数据.csv', index_col=0)
cell_type = pd.read_csv('/home/fengtang/jupyter_notebooks/working_script/evulate_clustering/cell_type_info/finest/Banovich_Kropski_2020_finest_celltype.csv', index_col = 0)
[ ]:
adata = sc.AnnData(data.select_dtypes("number"))
sc.pp.normalize_total(adata, target_sum = 1e4)
sc.pp.log1p(adata)

data_norm = pd.DataFrame(adata.X)
data_norm.columns = adata.var_names
data_norm.index = adata.obs_names

Since SuperSCC’ feature selection is an supervised process, it neeeds clustering or cell type labels. So we need to do add this information into the normalized counts matrix:

[ ]:
data_norm.loc[:, "cell_type"] = cell_type.cell_type.values
[ ]:
# find highly variable genes for the overall dataset

my_logger = scc.log_file("logger", "a") # create a logging object

hvgs = scc.feature_selection.find_signature_genes(data_norm.copy(),
                                label_column = "cell_type",
                                save = True,
                                logger = my_logger,
                                variance_threshold = "mean",
                                n_features_to_select = 0.15,
                                model = "svm",
                                mutual_info = False,
                                F_test = True,
                                normalization_method = "Min-Max"
                                )

# alternatively
# hvgs = scc.feature_selection.feature_selection(data_norm.copy(),
#                              label_column = "cell_type",
#                              model = "svm",
#                              normalization_method = "Min-Max",
#                              save = True,
#                              logger = my_logger)
[13]:
hvgs.keys()
[13]:
dict_keys(['retained_features_ranking_by_variance', 'retained_features_ranking_by_correlation', 'retained_features_ranking_by_embedding', 'retained_features_ranking_by_wrapping', 'final_feature_selection_by_ensemble', 'model_accuracy', 'params_used_for_feature_selection', 'retained_features_by_filtering', 'retained_features_by_embedding', 'retained_features_by_wrapping', 'final_feature_selection_by_intersection'])

The output includes:

  1. retained_features_ranking_by_variance: Feature rankings ordered by variance.

  2. retained_features_ranking_by_correlation: Feature rankings ordered by correlation-test p-value.

  3. retained_features_ranking_by_embedding: Feature rankings ordered by induction-model-derived feature importances.

  4. retained_features_ranking_by_wrapping: Feature rankings ordered by RFE-induction-model-derived feature importances.

  5. retained_features_by_filtering: Feature subset from variance and correlation filter.

  6. retained_features_by_embedding: Feature subset from induction-model.

  7. retained_features_by_wrapping: Feature subset from RFE-induction-model.

  8. model_accuracy: Accuracy scores for the training model.

  9. params_used_for_feature_selection: Settings used for running the feature_selection function.

  10. final_feature_selection_by_intersection: Ideal feature subset from aggregating different feature subsets.

  11. final_feature_selection_by_ensemble: Ideal feature subset from aggregating differentfeature rankings.

  12. label_classes: The unique label class of cell type in the data.

[ ]:
# find markers of each cluster/cell type
markers = scc.feature_selection.find_markers_ovr(data_norm.copy(),
                               label_column = "cell_type",
                               model = "svm",
                               normalization_method = "Min-Max",
                               save = True,
                               logger = my_logger,
                               variance_threshold = "mean",
                               n_features_to_select = 0.15,
                               mutual_info = False,
                               F_test = True,
                               normalization_method = "Min-Max"
                               )
[ ]:
markers.keys(), markers["0"].keys()
(dict_keys(['3', '0', '4', '1', '2', '5']),
 dict_keys(['features', 'sub_high_expression_genes']))

For each cluster/cell type, besides the output mentioned above stored in the features key, it also includes the feature expression intensity and expression ratio info stored in sub_high_expression_genes key.

[ ]:
markers["0"]["sub_high_expression_genes"]["0"]
feature expression1 pct1 rank1 expression2 pct2 rank2 score
ENSG00000100246 ENSG00000100246 0.089330 0.100143 1 NaN NaN NaN NaN
ENSG00000051596 ENSG00000051596 0.090060 0.102182 2 NaN NaN NaN 804.086531
ENSG00000139579 ENSG00000139579 0.090104 0.101774 3 NaN NaN NaN NaN
ENSG00000006757 ENSG00000006757 0.090352 0.100551 4 NaN NaN NaN NaN
ENSG00000117505 ENSG00000117505 0.090451 0.101163 5 0.162982 0.140160 1176.0 NaN
... ... ... ... ... ... ... ... ...
ENSG00000251562 ENSG00000251562 4.220352 0.969610 4638 4.312087 0.978009 4321.0 1246.790112
ENSG00000198804 ENSG00000198804 4.278880 0.986131 4639 4.397296 0.986289 4323.0 1110.438579
ENSG00000198712 ENSG00000198712 4.430985 0.986947 4640 4.689565 0.990395 4326.0 1562.438389
ENSG00000168878 ENSG00000168878 4.643004 0.944728 4641 0.329463 0.174074 2878.0 1692.315966
ENSG00000168484 ENSG00000168484 5.248846 0.863553 4642 0.949610 0.454262 3990.0 1672.888264

4642 rows × 8 columns

expression1, pct1 and rank1 represent the expression intensity, expression ratio and rank in the interst group, while expression2, pct2 and rank2 represent the expression intensity, expression ratio and rank in other groups except interst group. Such info can be used to remove negative markers (good for distinguishing interst group from other groups but have low expression in the interest group) out of informative features stored in features key. You can do:

[31]:
# get the expression intensity and ratio info
df = markers["0"]["sub_high_expression_genes"]["0"]

# use feature subset from aggregating feature rankings
important_features = [i[0] for i in markers["0"]["features"]["final_feature_selection_by_ensemble"]]

# remove little informative features
df = df.loc[df.feature.isin(important_features)]

# calculate log2 fold change
df.loc[:, "log2_fold_change"] = np.log2(df.expression1 / df.expression2)

# only keep features with satified expression intensity and ratio in the interest group
df = df.loc[((df.expression1 > 1) & (1 - (df.pct2/df.pct1) > 0.5)) | (np.isnan(df.pct2)) ].sort_values("score", ascending = False)
df.head(5)
/tmp/ipykernel_3943216/2645709130.py:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[:, "log2_fold_change"] = np.log2(df.expression1 / df.expression2)
[31]:
feature expression1 pct1 rank1 expression2 pct2 rank2 score log2_fold_change
ENSG00000124107 ENSG00000124107 4.134322 0.908016 4634 0.512570 0.210174 3497.0 1692.613278 3.011830
ENSG00000168878 ENSG00000168878 4.643004 0.944728 4641 0.329463 0.174074 2878.0 1692.315966 3.816870
ENSG00000161055 ENSG00000161055 1.940696 0.587803 4514 0.458728 0.228191 3355.0 1691.958975 2.080864
ENSG00000131400 ENSG00000131400 3.110635 0.843769 4602 NaN NaN NaN 1691.304883 NaN
ENSG00000166347 ENSG00000166347 2.939046 0.944116 4586 0.717823 0.402265 3822.0 1690.470148 2.033649