SuperSCC.feature_selection.feature_selection

SuperSCC.feature_selection.feature_selection(data, label_column, filename=None, logger=None, rank_method='dense', merge_rank_method='geom.mean', variance_threshold='mean', mutual_info=False, chi_square_test=False, F_test=True, model='svm', random_foreast_threshold=None, n_estimators=100, random_state=10, normalization_method='Min-Max', logistic_multi_class='ovr', linear_svm_multi_class='ovr', class_weight='balanced', n_features_to_select=0.15, step=100, cv=5, n_jobs=-1, save=True)

Performs feature selection using filter, embedded, or wrapper methods, either individually or in combination.
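The three families of methods named above can be sketched with scikit-learn. This is not SuperSCC's implementation: the toy data, the choice of scikit-learn classes, and the cutoff of 10 features for the F-test are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    RFECV,
    SelectFromModel,
    SelectKBest,
    VarianceThreshold,
    f_classif,
)
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Toy stand-in for an expression matrix: 120 cells x 40 features, 3 labels.
X, y = make_classification(
    n_samples=120, n_features=40, n_informative=8, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=10,
)
X = MinMaxScaler().fit_transform(X)
all_idx = np.arange(X.shape[1])

# Filter, step 1: keep features whose variance exceeds the mean variance
# (assumed here to be what a "mean" variance cutoff means).
var_filter = VarianceThreshold(threshold=X.var(axis=0).mean()).fit(X)
idx_var = all_idx[var_filter.get_support()]

# Filter, step 2: keep the features with the highest ANOVA F scores.
k = min(10, idx_var.size)  # hypothetical cutoff, not a SuperSCC default
f_filter = SelectKBest(score_func=f_classif, k=k).fit(X[:, idx_var], y)
idx_filtered = idx_var[f_filter.get_support()]
X_filtered = X[:, idx_filtered]

# Embedded: fit a linear SVM and keep features whose coefficient
# magnitude is above the mean (SelectFromModel's default threshold).
svm = LinearSVC(dual=False, class_weight="balanced")
embedded = SelectFromModel(svm).fit(X_filtered, y)
idx_embedded = idx_filtered[embedded.get_support()]

# Wrapper: recursive feature elimination with 5-fold cross-validation.
wrapper = RFECV(
    LinearSVC(dual=False, class_weight="balanced"), step=1, cv=5
).fit(X_filtered, y)
idx_wrapper = idx_filtered[wrapper.support_]

print("after filtering:", idx_filtered.tolist())
print("embedded selection:", idx_embedded.tolist())
print("wrapper selection:", idx_wrapper.tolist())
```

Each stage reports column indices into the original matrix, so the selections from different stages can be compared or intersected directly.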

Parameters:

data – A log-normalized expression matrix (rows are cells; columns are features) with an extra column containing clustering or cell type labels.
label_column – The name of the cell type column in data.
filename – A string to name the output file. Default is None.
logger – A log_file object to write log information into disk. Default is None.
rank_method – A string specifying how the coefficient values returned by the different estimators are ranked. Default is “dense”. Other available options are “average”, “min”, “max” and “ordinal”.
merge_rank_method – A string specifying how the rankings from the different estimators are combined. Default is “geom.mean”. Other available options are “mean”, “min” and “max”.
variance_threshold – A string specifying the variance cutoff used to filter out features. Either “zero” or “mean”. Default is ‘mean’.
mutual_info – A bool specifying whether mutual information is used to filter out further features. When True, F_test and chi_square_test should be set to False. Default is False.
chi_square_test – A bool specifying whether a chi-square test is used to filter out further features. When True, F_test and mutual_info should be set to False. Default is False.
F_test – A bool specifying whether an F-test is used to filter out further features. When True, chi_square_test and mutual_info should be set to False. Default is True.
model – A string specifying the model used for embedding- and wrapper-based feature selection. One of “random_foreast”, “logistic” or “svm”. Default is ‘svm’.
random_foreast_threshold – A float or int setting the feature importance cutoff (feature_importances_) for embedding-based feature selection with the random forest model. It only takes effect when model is set to ‘random_foreast’. Default is None, in which case 1 / (the total number of features) is used automatically.
n_estimators – An int giving the number of trees in the forest. It only takes effect when model is set to ‘random_foreast’. Default is 100.
random_state – An int controlling the randomness of the bootstrapping of the samples. It takes effect when model is set to ‘random_foreast’ or “logistic”. Default is 10.
normalization_method – A string specifying how the data are normalized. Default is “Min-Max”. The other available option is “Standardization”.
logistic_multi_class – A string specifying how the logistic model handles multiclass classification. Default is “ovr”. Other available options are “multinomial” and “auto”.
linear_svm_multi_class – A string specifying how the linear SVM model handles multiclass classification. Default is “ovr”. The other available option is “ovo”. This parameter only takes effect when model is set to ‘svm’.
class_weight – A string or None specifying whether class weights are applied. If None, all classes are given a weight of one. The “balanced” mode uses the values of y to adjust weights automatically, inversely proportional to class frequencies in the input data, as n_samples / (n_classes * np.bincount(y)). Default is ‘balanced’.
n_features_to_select – An int or float controlling the number of features to select. If an int, it is the absolute number of features to select. If a float between 0 and 1, it is the fraction of features to select. Default is 0.15.
step – An int or float controlling the number of features removed in each round of the RFECV algorithm. If greater than or equal to 1, step is the (integer) number of features to remove at each iteration. If within (0.0, 1.0), step is the percentage (rounded down) of features to remove at each iteration. Default is 100.
cv – An int giving the number of cross-validation folds in the RFECV algorithm. Default is 5.
n_jobs – An int giving the number of threads used by the program. Default is -1, meaning all available threads are used.
save – A bool specifying whether to write the output to disk. Default is True.
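To make rank_method and merge_rank_method concrete, the sketch below ranks the scores from two estimators with dense ranking and then merges the per-feature ranks. It is an illustration only, not SuperSCC's code: the helper names dense_rank and merge_ranks, the example scores, and the convention that the largest score receives rank 1 are all assumptions.

```python
from statistics import geometric_mean


def dense_rank(scores):
    """Rank scores so the largest gets rank 1; ties share a rank, with no gaps."""
    distinct = sorted(set(scores), reverse=True)
    position = {value: rank for rank, value in enumerate(distinct, start=1)}
    return [position[value] for value in scores]


def merge_ranks(rank_lists, method="geom.mean"):
    """Combine several rank lists feature by feature."""
    reducers = {
        "geom.mean": geometric_mean,
        "mean": lambda ranks: sum(ranks) / len(ranks),
        "min": min,
        "max": max,
    }
    reduce_ranks = reducers[method]
    # zip(*rank_lists) groups the ranks that belong to the same feature.
    return [reduce_ranks(per_feature) for per_feature in zip(*rank_lists)]


# Hypothetical absolute coefficients for four features from two estimators.
svm_scores = [0.9, 0.1, 0.5, 0.5]
logistic_scores = [0.2, 0.8, 0.8, 0.1]

svm_ranks = dense_rank(svm_scores)
logistic_ranks = dense_rank(logistic_scores)
merged = merge_ranks([svm_ranks, logistic_ranks], method="geom.mean")

print(svm_ranks)
print(logistic_ranks)
print([round(value, 3) for value in merged])
```

The geometric mean rewards features that rank well under every estimator, whereas “min” keeps a feature's best rank and “max” its worst.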