
Code enters infinite loop when trying to select features

Posted on 2020-11-28 18:23:32

I am trying to use scikit-learn's recursive feature elimination with cross-validation (RFECV) on a (5000, 37) dataset with a binary class label, and whenever I fit the model the algorithm enters an infinite loop. I am following this example on how to apply the algorithm: https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html

My data is:

    from sklearn.svm import SVC
    from sklearn.model_selection import StratifiedKFold
    from sklearn.feature_selection import RFECV
    import numpy as np

    # randint needs integer bounds; the labels need one entry per sample, not per feature
    X = np.random.randint(0, 363175645, size=(5000, 37))
    Y = np.random.choice([0, 1], size=(5000,))

This is what I tried in order to select the features:

    svc = SVC(kernel="linear")
    rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
                  scoring='accuracy')
    
    rfecv.fit(X, Y)

The code hangs as if in an infinite loop. However, when I use another estimator such as ExtraTreesClassifier it works just fine. What is going on? Please help.

Questioner: rick horn

Answered by StupidWolf, 2020-11-29 03:49:46

Because an SVM is distance based, it makes sense to scale your feature variables, especially in your case where they are huge; with unscaled values this large the solver can take an extremely long time to converge, which looks like an infinite loop. You can also check out an introduction to SVMs. Using an example dataset:

from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

scaler = StandardScaler()

# Two informative columns from the blobs, padded with 35 columns of huge random noise
X, y = make_blobs(n_samples=5000, centers=3, shuffle=False, random_state=42)
X = np.concatenate((X, np.random.randint(0, 363175645, size=(5000, 35))), axis=1)
y = (y == 1).astype('int')

X_scaled = scaler.fit_transform(X)
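A quick sanity check (my addition, not part of the original answer) confirms what standardization does here: every column, including the huge noise columns, ends up with zero mean and unit variance, so no single feature dominates the SVM's distance computations:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Same construction as above: 2 informative columns plus 35 huge noise columns
X, y = make_blobs(n_samples=5000, centers=3, shuffle=False, random_state=42)
X = np.concatenate((X, np.random.randint(0, 363175645, size=(5000, 35))), axis=1)

X_scaled = StandardScaler().fit_transform(X)

# After standardization every column has (approximately) zero mean and unit variance
print(np.allclose(X_scaled.mean(axis=0), 0))  # True
print(np.allclose(X_scaled.std(axis=0), 1))   # True
```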

This dataset has only 2 useful variables in the first two columns, as you can see from the plot:

plt.scatter(x=X_scaled[:,0],y=X_scaled[:,1],c=['k' if i else 'b' for i in y])

[Scatter plot: the points form two clearly separated clusters in the first two scaled features]

Now we run RFECV on the scaled data and see that it returns the first two columns as the top-ranked variables:

from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV

svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),scoring='accuracy')
rfecv.fit(X_scaled, y)

rfecv.ranking_

array([ 1,  2, 17, 28, 33, 22, 23, 26,  6, 19, 20,  4, 10, 25,  3, 27, 11,
        8, 18,  5, 29, 14,  7, 21,  9, 13, 24, 30, 35, 31, 32, 34, 16, 36,
       37, 12, 15])
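As a follow-up (my own sketch, not from the original answer): rather than remembering to scale by hand before every fit, the scaler and RFECV can be chained in a scikit-learn Pipeline, so data passed to `predict()` later is transformed with the statistics learned during `fit()`. The sizes here are shrunk from the original example just to keep the run fast:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same setup as above, shrunk (500 samples, 10 noise columns) for speed
X, y = make_blobs(n_samples=500, centers=3, shuffle=False, random_state=42)
X = np.concatenate((X, np.random.randint(0, 363175645, size=(500, 10))), axis=1)
y = (y == 1).astype('int')

# Scaling is fitted as a pipeline step, so it is applied consistently
# both during feature selection and at prediction time.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', RFECV(SVC(kernel='linear'), step=1,
                     cv=StratifiedKFold(2), scoring='accuracy')),
])
pipe.fit(X, y)

# Ranking of all 12 columns; the fitted selector lives in the 'select' step
print(pipe.named_steps['select'].ranking_)
```

Note that the scaler must come before RFECV as a pipeline step; RFECV itself requires an estimator that exposes `coef_` or `feature_importances_`, which is why a bare `SVC(kernel="linear")` is passed to it directly.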