
Code enters infinite loop when trying to select features

Posted on 2020-11-28 18:23:32

I am trying to use scikit-learn's recursive feature elimination with cross-validation (RFECV) on a (5000, 37) dataset with a binary class label, and whenever I fit the model the algorithm enters an infinite loop. I am following this example on how to apply the algorithm: https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html

My data is:

    from sklearn.svm import SVC
    from sklearn.model_selection import StratifiedKFold
    from sklearn.feature_selection import RFECV
    import numpy as np

    # randint needs integer bounds; the labels need one entry per sample, not per feature
    X = np.random.randint(0, 363175645, size=(5000, 37))
    Y = np.random.choice([0, 1], size=(5000,))

This is what I tried in order to select the features:

    svc = SVC(kernel="linear")
    rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
                  scoring='accuracy')
    
    rfecv.fit(X, Y)

The code hangs as if in an infinite loop. However, when I use another estimator such as ExtraTreesClassifier it works just fine. What is going on? Please help.

Questioner: rick horn

Answered by StupidWolf, 2020-11-29 03:49:46

Because an SVM is distance based, it makes sense to scale your feature variables, especially in your case where they are huge; with unscaled values this large the solver can take an extremely long time to converge, which looks like an infinite loop. You can also check out an introduction to SVMs. Using an example dataset:

from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

scaler = StandardScaler()

# Two informative columns from the blobs, padded with 35 columns of huge random noise
X, y = make_blobs(n_samples=5000, centers=3, shuffle=False, random_state=42)
X = np.concatenate((X, np.random.randint(0, 363175645, size=(5000, 35))), axis=1)
y = (y == 1).astype('int')

X_scaled = scaler.fit_transform(X)
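A quick sanity check (my addition, not part of the original answer) confirms what standardization does here: every column, including the huge noise columns, ends up with zero mean and unit variance, so no single feature dominates the SVM's distance computations:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Same construction as above: 2 informative columns plus 35 huge noise columns
X, y = make_blobs(n_samples=5000, centers=3, shuffle=False, random_state=42)
X = np.concatenate((X, np.random.randint(0, 363175645, size=(5000, 35))), axis=1)

X_scaled = StandardScaler().fit_transform(X)

# After standardization every column has (approximately) zero mean and unit variance
print(np.allclose(X_scaled.mean(axis=0), 0))  # True
print(np.allclose(X_scaled.std(axis=0), 1))   # True
```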

This dataset has only 2 useful variables in the first two columns, as you can see from the plot:

plt.scatter(x=X_scaled[:,0],y=X_scaled[:,1],c=['k' if i else 'b' for i in y])

[Scatter plot: the points form two clearly separated clusters in the first two scaled features]

Now we run RFECV on the scaled data and see that it returns the first two columns as the top-ranked variables:

from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV

svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),scoring='accuracy')
rfecv.fit(X_scaled, y)

rfecv.ranking_

array([ 1,  2, 17, 28, 33, 22, 23, 26,  6, 19, 20,  4, 10, 25,  3, 27, 11,
        8, 18,  5, 29, 14,  7, 21,  9, 13, 24, 30, 35, 31, 32, 34, 16, 36,
       37, 12, 15])
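As a follow-up (my own sketch, not from the original answer): rather than remembering to scale by hand before every fit, the scaler and RFECV can be chained in a scikit-learn Pipeline, so data passed to `predict()` later is transformed with the statistics learned during `fit()`. The sizes here are shrunk from the original example just to keep the run fast:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same setup as above, shrunk (500 samples, 10 noise columns) for speed
X, y = make_blobs(n_samples=500, centers=3, shuffle=False, random_state=42)
X = np.concatenate((X, np.random.randint(0, 363175645, size=(500, 10))), axis=1)
y = (y == 1).astype('int')

# Scaling is fitted as a pipeline step, so it is applied consistently
# both during feature selection and at prediction time.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', RFECV(SVC(kernel='linear'), step=1,
                     cv=StratifiedKFold(2), scoring='accuracy')),
])
pipe.fit(X, y)

# Ranking of all 12 columns; the fitted selector lives in the 'select' step
print(pipe.named_steps['select'].ranking_)
```

Note that the scaler must come before RFECV as a pipeline step; RFECV itself requires an estimator that exposes `coef_` or `feature_importances_`, which is why a bare `SVC(kernel="linear")` is passed to it directly.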