{"id":15979,"date":"2022-04-24T19:09:43","date_gmt":"2022-04-24T18:09:43","guid":{"rendered":"https:\/\/complex-systems-ai.com\/?page_id=15979"},"modified":"2022-11-27T21:08:37","modified_gmt":"2022-11-27T20:08:37","slug":"selection-des-colonnes","status":"publish","type":"page","link":"https:\/\/complex-systems-ai.com\/en\/data-analysis\/selection-of-columns\/","title":{"rendered":"Selection of columns"},"content":{"rendered":"<div data-elementor-type=\"wp-page\" data-elementor-id=\"15979\" class=\"elementor elementor-15979\">\n\t\t\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-6f22d74 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"6f22d74\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-33 elementor-top-column elementor-element elementor-element-12395e3\" data-id=\"12395e3\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-f47fa18 elementor-align-justify elementor-widget elementor-widget-button\" data-id=\"f47fa18\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"button.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<div class=\"elementor-button-wrapper\">\n\t\t\t\t\t<a class=\"elementor-button elementor-button-link elementor-size-sm\" href=\"https:\/\/complex-systems-ai.com\/en\/data-analysis\/\">\n\t\t\t\t\t\t<span class=\"elementor-button-content-wrapper\">\n\t\t\t\t\t\t\t\t\t<span class=\"elementor-button-text\">Data analysis<\/span>\n\t\t\t\t\t<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t<div class=\"elementor-column elementor-col-33 
elementor-top-column elementor-element elementor-element-a6c743b\" data-id=\"a6c743b\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c75a9c0 elementor-align-justify elementor-widget elementor-widget-button\" data-id=\"c75a9c0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"button.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<div class=\"elementor-button-wrapper\">\n\t\t\t\t\t<a class=\"elementor-button elementor-button-link elementor-size-sm\" href=\"https:\/\/complex-systems-ai.com\/en\/\">\n\t\t\t\t\t\t<span class=\"elementor-button-content-wrapper\">\n\t\t\t\t\t\t\t\t\t<span class=\"elementor-button-text\">Home page<\/span>\n\t\t\t\t\t<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t<div class=\"elementor-column elementor-col-33 elementor-top-column elementor-element elementor-element-e2c918a\" data-id=\"e2c918a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-a0d0a27 elementor-align-justify elementor-widget elementor-widget-button\" data-id=\"a0d0a27\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"button.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<div class=\"elementor-button-wrapper\">\n\t\t\t\t\t<a class=\"elementor-button elementor-button-link elementor-size-sm\" href=\"https:\/\/en.wikipedia.org\/wiki\/Data_analysis\" target=\"_blank\" rel=\"noopener\">\n\t\t\t\t\t\t<span class=\"elementor-button-content-wrapper\">\n\t\t\t\t\t\t\t\t\t<span 
class=\"elementor-button-text\">Wiki<\/span>\n\t\t\t\t\t<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-47942c9 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"47942c9\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-46273d5\" data-id=\"46273d5\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b46b9b6 elementor-widget elementor-widget-text-editor\" data-id=\"b46b9b6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Modern datasets are rich in information, with data collected from millions of IoT devices and sensors. This produces high-dimensional data: datasets with hundreds of features are common, and it&#039;s not unusual to see them grow into tens of thousands.<\/p><p>Column\/feature selection is a critical element of a data scientist&#039;s workflow. When presented with data of very high dimensionality, models usually choke because:<\/p><ol><li>Training time grows rapidly with the number of features.<\/li><li>The risk of overfitting increases with the number of features.<\/li><\/ol><p>Feature selection methods help solve these problems by reducing the number of dimensions without much loss of total information. 
They also help to make sense of the features and their importance.<\/p><p><img decoding=\"async\" class=\"aligncenter wp-image-11096 size-full\" src=\"https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2020\/09\/cropped-Capture.png\" alt=\"column selection\" width=\"97\" height=\"97\" title=\"\"><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-cd73c46 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"cd73c46\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-530ce84\" data-id=\"530ce84\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-e4cdfcf elementor-widget elementor-widget-heading\" data-id=\"e4cdfcf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_82_2 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Contents<\/p>\n<span class=\"ez-toc-title-toggle\"><a href=\"#\" class=\"ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle\" aria-label=\"Toggle Table of Content\"><span class=\"ez-toc-js-icon-con\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" 
class=\"list-377408\" width=\"20px\" height=\"20px\" viewbox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewbox=\"0 0 24 24\" version=\"1.2\" baseprofile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/span><\/a><\/span><\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/complex-systems-ai.com\/en\/data-analysis\/selection-of-columns\/#Selection-des-colonnes\" >Selection of columns<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/complex-systems-ai.com\/en\/data-analysis\/selection-of-columns\/#Methodes-de-filtrage\" >Filtering methods<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/complex-systems-ai.com\/en\/data-analysis\/selection-of-columns\/#F-Test\" >F Test<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/complex-systems-ai.com\/en\/data-analysis\/selection-of-columns\/#Information-mutuelle\" >Mutual information<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/complex-systems-ai.com\/en\/data-analysis\/selection-of-columns\/#Seuil-de-variance\" >Variance threshold<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link 
ez-toc-heading-6\" href=\"https:\/\/complex-systems-ai.com\/en\/data-analysis\/selection-of-columns\/#Methodes-demballage-wrapper\" >Wrapping methods (wrapper)<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/complex-systems-ai.com\/en\/data-analysis\/selection-of-columns\/#Forward-Search\" >ForwardSearch<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/complex-systems-ai.com\/en\/data-analysis\/selection-of-columns\/#Recursive-Feature-Elimination\" >Recursive Feature Elimination<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/complex-systems-ai.com\/en\/data-analysis\/selection-of-columns\/#Methodes-embarquees-embedded\" >Embedded methods<\/a><\/li><\/ul><\/nav><\/div>\n<h2 class=\"elementor-heading-title elementor-size-default\"><span class=\"ez-toc-section\" id=\"Selection-des-colonnes\"><\/span>Selection of columns<span class=\"ez-toc-section-end\"><\/span><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-39961dc elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"39961dc\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-b78f8b6\" data-id=\"b78f8b6\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-9472183 elementor-widget elementor-widget-text-editor\" data-id=\"9472183\" 
data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>On this page, I discuss the following feature selection techniques and their characteristics:<\/p><ol><li>Filtering methods<\/li><li>Wrapper methods<\/li><li>Embedded methods<\/li><\/ol>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-e4036a6 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"e4036a6\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-d631d80\" data-id=\"d631d80\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d92e6bd elementor-widget elementor-widget-heading\" data-id=\"d92e6bd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><span class=\"ez-toc-section\" id=\"Methodes-de-filtrage\"><\/span>Filtering methods<span class=\"ez-toc-section-end\"><\/span><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-e589ebc elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"e589ebc\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container 
elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-d24ce7d\" data-id=\"d24ce7d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-9991695 elementor-widget elementor-widget-text-editor\" data-id=\"9991695\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"ada8\" class=\"pw-post-body-paragraph jx jy ja jz b ka mp kc kd ke mq kg kh ki mr kk kl km ms ko kp kq mt ks kt ku it gc\" data-selectable-paragraph=\"\">Filter methods score each feature with a statistical measure, usually based on its relationship with the target variable, to estimate how significant the feature is.<\/p><h3 id=\"823b\" class=\"mu ls ja bn lt mv mw mx lx my mz na mb ki nb nc mf km nd ne mj kq nf ng mn nh gc\"><span class=\"ez-toc-section\" id=\"F-Test\"><\/span>F Test<span class=\"ez-toc-section-end\"><\/span><\/h3><p id=\"a715\" class=\"pw-post-body-paragraph jx jy ja jz b ka mp kc kd ke mq kg kh ki mr kk kl km ms ko kp kq mt ks kt ku it gc\" data-selectable-paragraph=\"\">The F-test is a statistical test used to compare models and check whether the difference between them is significant.<\/p><p id=\"5796\" class=\"pw-post-body-paragraph jx jy ja jz b ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku it gc\" data-selectable-paragraph=\"\">For feature selection, the F-test compares two models, X and Y, where X is a model built from a constant only and Y is a model built from a constant and one feature.<\/p><p id=\"93f2\" class=\"pw-post-body-paragraph jx jy ja jz b ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku it gc\" data-selectable-paragraph=\"\">The least-squares errors of the two models are compared to check whether the difference in errors between 
models X and Y is significant or due to chance.<\/p><p id=\"5253\" class=\"pw-post-body-paragraph jx jy ja jz b ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku it gc\" data-selectable-paragraph=\"\">The F-test is helpful in feature selection because it shows how much each feature improves the model.<\/p><p id=\"86d0\" class=\"pw-post-body-paragraph jx jy ja jz b ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku it gc\" data-selectable-paragraph=\"\">Scikit-learn provides F-test scoring functions that can be used to keep the top K features (for example with SelectKBest). For regression targets:<\/p><pre class=\"kw kx ky kz gz ni bt nj\"><span id=\"2831\" class=\"gc mu ls ja nk b do nl nm l nn\" data-selectable-paragraph=\"\">sklearn.feature_selection.f_regression<\/span><\/pre><p id=\"988c\" class=\"pw-post-body-paragraph jx jy ja jz b ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku it gc\" data-selectable-paragraph=\"\">For classification targets:<\/p><pre class=\"kw kx ky kz gz ni bt nj\"><span id=\"1d8f\" class=\"gc mu ls ja nk b do nl nm l nn\" data-selectable-paragraph=\"\">sklearn.feature_selection.f_classif<\/span><\/pre><p id=\"674a\" class=\"pw-post-body-paragraph jx jy ja jz b ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku it gc\" data-selectable-paragraph=\"\">There are some downsides to using the F-test to select features. The F-test only captures linear relationships between features and labels. 
A feature that is highly correlated with the target receives a higher score, and less correlated features receive lower scores.<\/p><ol><li id=\"0bff\" class=\"ld le ja jz b ka kb ke kf ki lf km lg kq lh ku li lj lk ll gc\" data-selectable-paragraph=\"\"><a href=\"https:\/\/complex-systems-ai.com\/en\/correlation-and-regressions\/\">Correlation<\/a> can be very misleading because it does not capture strong nonlinear relationships.<\/li><li id=\"bf4e\" class=\"pw-post-body-paragraph jx jy ja jz b ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku it gc\">Relying on summary statistics such as correlation can be a bad idea, as Anscombe&#039;s quartet illustrates.<\/li><\/ol><p><img fetchpriority=\"high\" decoding=\"async\" class=\"aligncenter wp-image-15985 size-full\" src=\"https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_oMbcPjuDprAu_QAGizFf7g.png\" alt=\"\" width=\"990\" height=\"720\" title=\"\" srcset=\"https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_oMbcPjuDprAu_QAGizFf7g.png 990w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_oMbcPjuDprAu_QAGizFf7g-300x218.png 300w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_oMbcPjuDprAu_QAGizFf7g-768x559.png 768w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_oMbcPjuDprAu_QAGizFf7g-18x12.png 18w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_oMbcPjuDprAu_QAGizFf7g-600x436.png 600w\" sizes=\"(max-width: 990px) 100vw, 990px\" \/><\/p><p>Francis Anscombe constructed four distinct data sets with nearly identical means, variances, and correlations to show that &quot;summary statistics&quot; do not fully describe a data set and can be quite misleading.<\/p><h3 id=\"41f8\" class=\"mu ls ja bn lt mv mw mx lx my mz na mb ki nb nc mf km nd ne mj kq nf ng mn nh gc\"><span class=\"ez-toc-section\" id=\"Information-mutuelle\"><\/span>Mutual information<span class=\"ez-toc-section-end\"><\/span><\/h3><p id=\"6224\" 
class=\"pw-post-body-paragraph jx jy ja jz b ka mp kc kd ke mq kg kh ki mr kk kl km ms ko kp kq mt ks kt ku it gc\" data-selectable-paragraph=\"\">The mutual information between two variables measures how much one variable depends on the other. If X and Y are two variables:<\/p><ol class=\"\"><li id=\"9b2e\" class=\"ld le ja jz b ka kb ke kf ki lf km lg kq lh ku li lj lk ll gc\" data-selectable-paragraph=\"\">If X and Y are independent, no information about Y can be obtained by knowing X or vice versa. Therefore, their <a href=\"https:\/\/complex-systems-ai.com\/en\/data-partitioning\/external-quality-criteria\/\">mutual information<\/a> is 0.<\/li><li id=\"c398\" class=\"ld le ja jz b ka lm ke ln ki lo km lp kq lq ku li lj lk ll gc\" data-selectable-paragraph=\"\">If X is a deterministic, invertible function of Y, then X can be determined from Y and Y from X, and their mutual information reaches its maximum (1 on a normalized scale).<\/li><li id=\"4d4b\" class=\"ld le ja jz b ka lm ke ln ki lo km lp kq lq ku li lj lk ll gc\" data-selectable-paragraph=\"\">When Y depends on X together with other variables, Y = f(X, Z, M, N), the normalized mutual information between X and Y typically lies between 0 and 1.<\/li><\/ol><p id=\"23f2\" class=\"pw-post-body-paragraph jx jy ja jz b ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku it gc\" data-selectable-paragraph=\"\">We can select features from the feature space by ranking them by their mutual information with the target variable.<\/p><p id=\"3c42\" class=\"pw-post-body-paragraph jx jy ja jz b ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku it gc\" data-selectable-paragraph=\"\">The advantage of mutual information over the F-test is that it also captures nonlinear relationships between a feature and the target variable.<\/p><p id=\"d38a\" class=\"pw-post-body-paragraph jx jy ja jz b ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku it gc\" data-selectable-paragraph=\"\">Scikit-learn offers mutual-information feature scoring for both <a 
href=\"https:\/\/complex-systems-ai.com\/en\/correlation-and-regressions\/data-transformation-and-regression\/\">regression<\/a> and classification.<\/p><pre class=\"kw kx ky kz gz ni bt nj\"><span id=\"c6a2\" class=\"gc mu ls ja nk b do nl nm l nn\" data-selectable-paragraph=\"\">sklearn.feature_selection.mutual_info_regression <br \/>sklearn.feature_selection.mutual_info_classif<\/span><\/pre><figure class=\"kw kx ky kz gz la gn go paragraph-image\"><div class=\"nq nr dq ns cf nt\" tabindex=\"0\" role=\"button\"><div><img decoding=\"async\" class=\"aligncenter wp-image-15986 size-large\" src=\"https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_Kr8bbyVaQNLCZBPUXs8b1g-1024x357.png\" alt=\"\" width=\"1024\" height=\"357\" title=\"\" srcset=\"https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_Kr8bbyVaQNLCZBPUXs8b1g-1024x357.png 1024w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_Kr8bbyVaQNLCZBPUXs8b1g-300x105.png 300w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_Kr8bbyVaQNLCZBPUXs8b1g-768x268.png 768w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_Kr8bbyVaQNLCZBPUXs8b1g-18x6.png 18w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_Kr8bbyVaQNLCZBPUXs8b1g-600x209.png 600w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_Kr8bbyVaQNLCZBPUXs8b1g.png 1400w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/div><div>The F-test captures linear relationships well. Mutual information captures any type of relationship between two variables. 
Source: http:\/\/scikit-learn.org\/stable\/auto_examples\/feature_selection\/plot_f_test_vs_mi.html<\/div><div>\u00a0<\/div><\/div><\/figure><h3 id=\"7bbf\" class=\"mu ls ja bn lt mv mw mx lx my mz na mb ki nb nc mf km nd ne mj kq nf ng mn nh gc\"><span class=\"ez-toc-section\" id=\"Seuil-de-variance\"><\/span>Variance threshold<span class=\"ez-toc-section-end\"><\/span><\/h3><p id=\"7323\" class=\"pw-post-body-paragraph jx jy ja jz b ka mp kc kd ke mq kg kh ki mr kk kl km ms ko kp kq mt ks kt ku it gc\" data-selectable-paragraph=\"\">This method removes features whose variance is below a certain threshold.<\/p><p id=\"93b7\" class=\"pw-post-body-paragraph jx jy ja jz b ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku it gc\" data-selectable-paragraph=\"\">The idea is that when a feature doesn&#039;t vary much on its own, it usually has very little predictive power.<\/p><pre class=\"kw kx ky kz gz ni bt nj\"><span id=\"717e\" class=\"gc mu ls ja nk b do nl nm l nn\" data-selectable-paragraph=\"\">sklearn.feature_selection.VarianceThreshold<\/span><\/pre><p id=\"ddaf\" class=\"pw-post-body-paragraph jx jy ja jz b ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku it gc\" data-selectable-paragraph=\"\">The variance threshold does not take into account the relationship between the features and the target variable.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-b63b5d3 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b63b5d3\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-f8b4e65\" data-id=\"f8b4e65\" data-element_type=\"column\" 
data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-76a5407 elementor-widget elementor-widget-heading\" data-id=\"76a5407\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><span class=\"ez-toc-section\" id=\"Methodes-demballage-wrapper\"><\/span>Wrapping methods (wrapper)<span class=\"ez-toc-section-end\"><\/span><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-fd474eb elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"fd474eb\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-b4d8433\" data-id=\"b4d8433\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5913be2 elementor-widget elementor-widget-text-editor\" data-id=\"5913be2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"9e54\" class=\"pw-post-body-paragraph jx jy ja jz b ka mp kc kd ke mq kg kh ki mr kk kl km ms ko kp kq mt ks kt ku it gc\" data-selectable-paragraph=\"\">Wrapper methods train models on different subsets of features and measure the performance of each model.<\/p><h3 id=\"8506\" class=\"mu ls ja bn lt mv mw mx lx my mz na mb ki nb nc mf km nd ne mj kq nf ng mn 
nh gc\"><span class=\"ez-toc-section\" id=\"Forward-Search\"><\/span>ForwardSearch<span class=\"ez-toc-section-end\"><\/span><\/h3><p id=\"47b7\" class=\"pw-post-body-paragraph jx jy ja jz b ka mp kc kd ke mq kg kh ki mr kk kl km ms ko kp kq mt ks kt ku it gc\" data-selectable-paragraph=\"\">This method finds the feature that most improves model performance and adds features to the subset one at a time.<\/p><p data-selectable-paragraph=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-15987 size-large\" src=\"https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_YnXZS86uR2HibB3jlJB10A-1024x823.png\" alt=\"\" width=\"1024\" height=\"823\" title=\"\" srcset=\"https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_YnXZS86uR2HibB3jlJB10A-1024x823.png 1024w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_YnXZS86uR2HibB3jlJB10A-300x241.png 300w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_YnXZS86uR2HibB3jlJB10A-768x617.png 768w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_YnXZS86uR2HibB3jlJB10A-15x12.png 15w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_YnXZS86uR2HibB3jlJB10A-600x482.png 600w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_YnXZS86uR2HibB3jlJB10A.png 1400w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p><p class=\"pw-post-body-paragraph jx jy ja jz b ka mp kc kd ke mq kg kh ki mr kk kl km ms ko kp kq mt ks kt ku it gc\" data-selectable-paragraph=\"\">Forward selection used to pick the best 3 of 5 features: features 3, 2 and 5 are selected as the best subset.<\/p><p>For data with n features,<\/p><p>-&gt; In the first round, \u201cn\u201d models are created, each with a single feature, and the most predictive feature is selected.<\/p><p>-&gt; In the second round, &quot;n-1&quot; models are created, each combining one remaining feature with the previously selected 
feature.<\/p><p>-&gt; This is repeated until the best subset of \u201cm\u201d features is selected.<\/p><h3 id=\"b047\" class=\"mu ls ja bn lt mv mw mx lx my mz na mb ki nb nc mf km nd ne mj kq nf ng mn nh gc\"><span class=\"ez-toc-section\" id=\"Recursive-Feature-Elimination\"><\/span>Recursive Feature Elimination<span class=\"ez-toc-section-end\"><\/span><\/h3><p id=\"ae64\" class=\"pw-post-body-paragraph jx jy ja jz b ka mp kc kd ke mq kg kh ki mr kk kl km ms ko kp kq mt ks kt ku it gc\" data-selectable-paragraph=\"\">As the name suggests, this method eliminates the worst-performing features for a particular model one after the other until the best subset of features is found.<\/p><p data-selectable-paragraph=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-15988 size-large\" src=\"https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_qXqx7_hDtsO9ez7_nxSXOw-1024x689.png\" alt=\"\" width=\"1024\" height=\"689\" title=\"\" srcset=\"https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_qXqx7_hDtsO9ez7_nxSXOw-1024x689.png 1024w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_qXqx7_hDtsO9ez7_nxSXOw-300x202.png 300w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_qXqx7_hDtsO9ez7_nxSXOw-768x517.png 768w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_qXqx7_hDtsO9ez7_nxSXOw-18x12.png 18w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_qXqx7_hDtsO9ez7_nxSXOw-600x404.png 600w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_qXqx7_hDtsO9ez7_nxSXOw.png 1400w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p><figure class=\"kw kx ky kz gz la gn go paragraph-image\"><figcaption class=\"nu bm gp gn go nv nw bn b bo bp co\" data-selectable-paragraph=\"\">Recursive elimination removes the least explanatory features one after the other. 
Features 2, 3 and 5 are the best subset of features arrived at by recursive elimination.<\/figcaption><\/figure><p>For data with n features,<\/p><p>-&gt; In the first round, models with \u201cn-1\u201d features are created, each using all features except one. The worst-performing feature is removed.<\/p><p>-&gt; In the second round, models with &quot;n-2&quot; features are created by removing another feature.<\/p><p>Wrapper methods promise a better feature set through an extensive greedy search.<\/p><p>But the main disadvantage of wrapper methods is the number of models that need to be trained. This is very computationally expensive and impractical with a large number of features.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-4af10e0 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"4af10e0\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-5578c60\" data-id=\"5578c60\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-9699df1 elementor-widget elementor-widget-heading\" data-id=\"9699df1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><span class=\"ez-toc-section\" id=\"Methodes-embarquees-embedded\"><\/span>Embedded methods<span 
class=\"ez-toc-section-end\"><\/span><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-b974a29 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b974a29\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-99ceac8\" data-id=\"99ceac8\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-4e9476d elementor-widget elementor-widget-text-editor\" data-id=\"4e9476d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"70ea\" class=\"pw-post-body-paragraph jx jy ja jz b ka mp kc kd ke mq kg kh ki mr kk kl km ms ko kp kq mt ks kt ku it gc\" data-selectable-paragraph=\"\">Feature selection can also be achieved through information provided by some machine learning models.<\/p><p id=\"6d9f\" class=\"pw-post-body-paragraph jx jy ja jz b ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku it gc\" data-selectable-paragraph=\"\">LASSO linear regression can be used for feature selection. Lasso regression is performed by adding an extra penalty term, proportional to the sum of the absolute values of the coefficients, to the linear regression cost function. 
This, in addition to helping prevent overfitting, also shrinks the coefficients of less important features to zero.<\/p><p data-selectable-paragraph=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-15989 size-full\" src=\"https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_jb_Zlb85QsArbwC6PEjrHg.png\" alt=\"\" width=\"802\" height=\"830\" title=\"\" srcset=\"https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_jb_Zlb85QsArbwC6PEjrHg.png 802w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_jb_Zlb85QsArbwC6PEjrHg-290x300.png 290w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_jb_Zlb85QsArbwC6PEjrHg-768x795.png 768w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_jb_Zlb85QsArbwC6PEjrHg-12x12.png 12w, https:\/\/complex-systems-ai.com\/wp-content\/uploads\/2022\/04\/1_jb_Zlb85QsArbwC6PEjrHg-600x621.png 600w\" sizes=\"(max-width: 802px) 100vw, 802px\" \/><\/p><figure class=\"kw kx ky kz gz la gn go paragraph-image\"><figcaption class=\"nu bm gp gn go nv nw bn b bo bp co\" data-selectable-paragraph=\"\">This graph plots the coefficients as \u019b in the cost function is varied. We observe that for \u019b ~=0, the coefficients of most features tend to zero. In the graph above, we can see that only \u201clcavol\u201d, \u201csvi\u201d and \u201clweight\u201d are the features with non-zero coefficients when \u019b = 0.4.<\/figcaption><\/figure><p>Tree-based models calculate feature importance because they need to keep the best-performing features as close as possible to the root of the tree. Building a decision <a href=\"https:\/\/complex-systems-ai.com\/en\/graph-theory-2\/trees-and-trees\/\">tree<\/a> involves calculating the best predictive feature.<\/p><p>Decision trees keep the most important features close to the root. 
In this decision tree, we find that the number of legs is the most important feature, followed by whether it hides under the bed and whether it is delicious, etc.<\/p><p>Feature importance in tree-based models is calculated based on the <a href=\"https:\/\/complex-systems-ai.com\/en\/data-analysis\/gini-entropy-and-error\/\">Gini<\/a> index, entropy or chi-square value.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>","protected":false},"excerpt":{"rendered":"<p>Data Analysis Wiki Home Page Modern datasets are very rich in information with data collected from millions of devices... <\/p>","protected":false},"author":1,"featured_media":0,"parent":15503,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-15979","page","type-page","status-publish","hentry"],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/complex-systems-ai.com\/en\/wp-json\/wp\/v2\/pages\/15979","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/complex-systems-ai.com\/en\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/complex-systems-ai.com\/en\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/complex-systems-ai.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/complex-systems-ai.com\/en\/wp-json\/wp\/v2\/comments?post=15979"}],"version-history":[{"count":4,"href":"https:\/\/complex-systems-ai.com\/en\/wp-json\/wp\/v2\/pages\/15979\/revisions"}],"predecessor-version":[{"id":17880,"href":"https:\/\/complex-systems-ai.com\/en\/wp-json\/wp\/v2\/pages\/15979\/revisions\/17880"}],"up":[{"embeddable":true,"href":"https:\/\/complex-systems-ai.com\/en\/wp-json\/wp\/v2\/pages\/15503"}],"wp:attachment":[{"href":"https:\/\/complex-systems-ai.com\/en\/wp-json\/wp\/v2\/media?parent=15979"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated"
:true}]}}