Effective feature selection using feature vector graph for classification

Guodong Zhao a,b, Yan Wu a,n, Fuqiang Chen a, Junming Zhang a, Jing Bai a

a School of Electronics and Information, Tongji University, Shanghai 201804, China

b School of Mathematics and Physics, Shanghai Dian Ji University, Shanghai 201306, China

Article info

Article history:

Received 22 March 2014
Received in revised form 4 July 2014
Accepted 15 September 2014
Communicated by Haowei Liu
Available online 30 September 2014

Keywords:

Feature selection

Community modularity

Relevant independency

Feature vector graph

Classification accuracy

Abstract

Optimal feature subset selection is often required as a preliminary step in machine learning and data mining. The choice of feature subset determines the classification accuracy, so constructing an efficient feature selection algorithm is a crucial task. Here, by constructing the feature vector graph, a new feature evaluation criterion based on community modularity in complex networks is proposed to select the most informative features. To eliminate the relevant redundancy among features, a conditional mutual information-based criterion is used to capture the relevant independency between features, which is the amount of information they can predict about the label variable but do not share. The most informative features with maximum relevant independency are added to the optimal subset. Integrating these two points, a method named community modularity Q value-based feature selection (CMQFS) is put forward in this paper. Furthermore, our method based on community modularity can be certified by k-means cluster theory. We compared the proposed algorithm with other state-of-the-art methods in several experiments, which indicate that CMQFS is more efficient and accurate.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

The rapid development of information technology makes it easy to accumulate data sets with high dimensionality. Nevertheless, most of the features in a huge dataset are irrelevant or redundant, which typically deteriorates the performance of machine learning algorithms. An effective way to mitigate the problem is to reduce the dimensionality of the feature space with an efficient feature selection technique. Feature selection identifies a subset of the original feature set of the data, which improves the quality of the data. Therefore, feature selection becomes more and more important in machine learning and data mining.

According to the way they combine the optimal feature subset search with the construction of learning models, feature selection methods can be roughly divided into three types, i.e., embedded, wrapper and filter methods. Embedded and wrapper methods [1,2] are classifier-dependent, evaluating the features using a learning algorithm. They outperform filter methods in terms of accuracy, but they suffer from poor generalization ability to other classifiers and high computational complexity in the learning process for high dimensional datasets, because they are tightly coupled with specified learning algorithms [25]. On the contrary, the filter methods, which are not specific to a learning algorithm, evaluate the discrimination capability of each feature by investigating only the intrinsic properties of the data. Hence they are not coupled with any learning algorithm. Compared to the above two types, the filter methods [3–5] are more commonly adopted for feature selection because of their simplicity, computational efficiency and scalability to very high-dimensional datasets. The filter methods have attracted great attention and a large number of filter algorithms have been developed in the past decades. In this paper, we focus on the filter methods.

The filter methods are generally subdivided into two classes: ranking and subset selection. In the first class, the ranking methods evaluate the significance of discrimination for each feature based on different evaluation criteria. A weight (or score) is first calculated for each feature according to a specified weighting function, and the features with the top weights (or scores) are picked into the optimal subset while the rest are discarded. In the second class, finding the optimal feature subset with the highest accuracy is NP hard [51]. In order to avoid the combinatorial search problem of finding an optimal subset, the most popular variable selection methods mainly include forward, backward and floating sequential schemes, which always use heuristic approaches to provide a sub-optimal solution. In this study, a new scoring criterion based on community modularity in complex networks has been developed, which can well identify the discriminative information of each feature.


The analysis of the relevant independency between features can effectively deal with the relevant redundancy among the selected features. Hence, through the proposed method, not only are the most relevant features selected and redundant features eliminated, but useful intrinsic feature groups are also retained.

2. Related work

As mentioned previously, a large number of filter-based feature selection (FS) algorithms have been presented in the past few decades for mining the optimal feature subset from high-dimensional feature spaces. According to the way label information is utilized in the feature selection process, FS methods can be divided into three classes: unsupervised FS, semi-supervised FS and supervised FS.

The feature ranking based selection methods calculate the features' scores by constructing different scoring functions. The Variance Score method [6] selects the features with maximum variances by calculating the variance of each feature to reflect its representative power. The Laplacian Score method [6] is another popular unsupervised FS method, built on the assumption that two close samples should have similar feature values, and that a good feature should have similar values for samples in the same class and large margins between samples in different classes. Using normalized mutual information to measure the dependence between a pair of features, Yazhou Ren [7] proposed an improved Laplacian Score-based feature selection method based on local and global structure preserving. As a supervised FS method, Fisher Score chooses features with the best discriminative ability [8–10]. A number of supervised learning algorithms have been used to implement the filter methods, including the Relief family [11–15] and Fuzzy-Margin-based Relief (FM Relief) [16]. Battiti [17] investigated the application of the mutual information criterion to evaluate candidate features and to select the top ranked features used as input data for a neural network classifier. These score-function-based FS algorithms are widely used in data mining and pattern recognition. However, they have been criticized for ignoring the redundancy among features, which may lead to the selection of many redundant features and harm the performance of the following classifiers. To overcome this problem, most researchers have turned to optimal feature subset based methods that consider the redundancy among the selected features. Mutual Information (MI) is a measure of the amount of information shared by two random variables; it is symmetric and non-negative, and is zero if and only if the variables are independent. Methods based on MI have therefore become popular.

Yu and Liu [18] introduced a novel framework that decoupled relevance analysis and redundancy analysis. They proposed a correlation-based subset selection method named FCBF for relevance and redundancy analysis, and then removed redundant features with the approximate Markov Blanket technique. The MIFS algorithm [17] calculates, for each feature, the mutual information both with respect to the class variable and with respect to the already selected features, and selects those features that have maximum mutual information with the class labels but are less redundant with the selected features. However, the MIFS algorithm ignores feature synergy, so MIFS and its variants may cause a big bias when features are combined to cooperate together. To avoid this drawback, Gang Wang [45] proposed a novel feature selection method for text categorization called conditional mutual information maximin (CMIM), in which a triplet form is used to estimate conditional mutual information (CMI), greatly relieving the computational overhead and selecting a set of individually discriminating and weakly dependent features. Based on information gain and MI, FESLP was proposed by Ye Xu [46] to address the link prediction problem; its main advantage is that the features with the greatest discriminative power are selected while the correlations among features are simultaneously minimized, so that redundancy in the learned feature space is as small as possible. The CMIF method was proposed by Hongrong Cheng [19] based on the link between interaction information and conditional mutual information; it not only takes account of both redundancy and synergy interactions of features and identifies discriminative features, but also combines feature redundancy evaluation with classification tasks. Kwak and Choi [20] improved the MIFS method under the assumption of uniform distributions of information of the input features, and put forward an algorithm called MIFS-U. Both MIFS and MIFS-U involve a redundancy parameter β, which is used to weight the redundancy among input features. If β = 0, the MI among input features is not taken into consideration and the method degenerates into the FCBF method. However, if β is chosen too large, the algorithms consider only the relations among the input features and neglect the relation between the individual input features and the class [21]. Peng proposed the 'minimal redundancy maximal relevance' (mrmr) method [22], a special case of the MIFS algorithm when β = 1/|S|, where |S| is the number of features in S. Then, the 'normalized mutual information feature selection' (NMIFS) algorithm [23] used the MI normalized by the minimum entropy of both features, and the average normalized MI as a measure of the redundancy between an individual feature and the subset of selected features; its authors claimed that NMIFS was an enhancement over the MIFS, MIFS-U and mrmr algorithms. Based on a metric applied to continuous and discrete data representations, José Martínez Sotoca [24] built a dissimilarity space using an information theoretic measure, in particular the conditional mutual information between features with respect to a relevant variable that represents the class labels; applying hierarchical clustering, the algorithm searches for a compression of the information contained in the original set of features. Sun [25] presented a new scheme for feature relevance, interdependence and redundancy analysis using information theoretic criteria, whose primary characteristic is that each feature is weighted according to its interaction with the selected features, and the weights are dynamically updated after each candidate feature has been selected. Gavin Brown [26] pointed out that common heuristics for information based feature selection (including Markov Blanket algorithms as a special case) are approximate iterative maximizations of the conditional likelihood, and presented a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation. From the new viewpoint of finding unique information, Okan Sakar [27] proposed a method called Kernel Canonical Correlation Analysis based minimum Redundancy-Maximum Relevance (KCCAmrmr), which explores and uses all the correlated functions (covariants) between variables to compute their unique (conditional) information about the target. However, most traditional information-theoretic selectors ignore features which have strong discriminatory power as a group but are weak as individuals. To cope with this problem, Sun [28,29] introduced a cooperative game theory based framework to evaluate the power of each feature, which serves as a metric of the importance of each feature according to the intricate and intrinsic interrelations among features and provides weighted features to the feature selection algorithm; it is more stable for complex datasets, but suffers from high runtime complexity. More detailed descriptions of feature selection methods can be found in [30–32].

In the past few decades, complex network theories have also become extremely useful for representing a wide variety of systems in different areas, such as biological, social, technological, and information networks. Network analysis has become crucial to understanding the features of these systems. In this paper, a new feature evaluation criterion based on community modularity in network analysis is proposed to evaluate the discriminatory power of each


feature by constructing the feature vector graph, under the assumption that, for an informative feature, two close samples from the same class should have similar feature values and there should be large margins between samples from different classes, as in the Laplacian Score algorithm [6]. The measure of relevant independency between features with respect to the class variable is utilized to eliminate the relevant redundancy among features. The detailed framework of the proposed method is organized in the next two sections.

3. Preliminaries

In this section, the preliminaries of feature selection are presented, and the concepts of relevant independency and relevant redundancy based on information theory and community modularity theory are introduced.

3.1. Community modularity in complex networks

Many real-world systems can be described by complex networks. An important common feature of most real-world complex networks is community structure, which means that most networks are composed of communities or groups that have dense connections between nodes within a group but relatively sparse connections between groups [33].

A community consists of nodes and edges, where nodes cluster into tightly knit groups with a high density of within-group edges and a low density of between-group edges. So far, no quantitative definition of community is universally accepted. In Fig. 1, a schematic example of a graph with three communities is shown to illustrate the community structure.

In recent years, numerous algorithms have been proposed to detect communities, i.e., to identify good partitions of a graph. But what is a good partition? It is necessary to have a quantitative criterion to assess the goodness of a graph partition. The most widely used quality function is the modularity of Newman and Girvan [34]. The modularity Q can be written as follows:

Q = \frac{1}{2M} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2M} \right] \delta(C_i, C_j)    (1)

where the sum runs over all pairs of nodes, A is the adjacency matrix, M is the total number of edges in the graph, and k_i and k_j represent the degrees of node i and node j, respectively. The δ-function equals one if nodes i and j are in the same community, and zero otherwise. Another widely used formulation of the modularity Q can be written as follows:

Q = \sum_{c=1}^{n_c} \left[ \frac{l_c}{M} - \left( \frac{d_c}{2M} \right)^2 \right]    (2)

Here, n_c is the number of communities, l_c is the total number of edges joining nodes of community c, and d_c is the sum of the degrees of the nodes of c. The range of the modularity Q is [-1, 1]. Modularity-based methods [44] assume that a high value of modularity indicates a good partition. In other words, the higher the modularity Q is, the more significant the community structure is.

From the definition of community, the within-community distance is small and the between-community distance is large. If a graph has a clear community structure, the nodes in the same community can be locally and linearly separated easily, as can be seen in Fig. 1. Furthermore, a feature that minimizes the within-cluster distance and maximizes the between-cluster distance is preferred and hence gets a higher weight [43]. Inspired by the above, if the sample graph built on a feature, called the Feature Vector Graph (FVG) in this paper, has an apparent community structure, the feature has strong discriminative power, because it makes the intra-class distance small and the inter-class distance large. This is proved by means of k-means cluster theory in Section 6.

3.2. Relevant independency and relevant redundancy

3.2.1. Information theory

Entropy and mutual information (MI) [21] are fundamental concepts in information theory. Entropy refers to the uncertainty of random variables and MI is the measure of the information shared by them. The entropy H(X) of a discrete random variable X is defined as follows:

H(X) = -\sum_{x \in X} p(x) \log p(x)    (3)

Here, p(x) is the probability mass function of X. It can be concluded that entropy does not depend on the actual values of X, but only on the probabilities.

When some other variable Y is known, the remaining uncertainty of variable X is measured by the conditional entropy H(X|Y):

H(X|Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(x|y)    (4)

where p(x, y) and p(x|y) are the joint probability mass function and the conditional probability mass function, respectively.

The MI I(X;Y) between two variables is defined as

I(X;Y) = I(Y;X) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}    (5)

or

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)    (6)

Thus, MI is the reduction in the uncertainty of one random variable when the other is known. If the mutual information between two random variables is large (small), the two variables are closely (not closely) related.

Conditional mutual information (CMI) is defined as the amount of information shared by variables X and Y when Z is given. It is formally defined by

I(X;Y|Z) = H(X|Z) - H(X|Y, Z) = H(Y|Z) - H(Y|X, Z)    (7)

Conditional mutual information (CMI) can also be read as the reduction in the uncertainty of X due to knowledge of Y when Z is given.

3.2.2. Relevant independency and relevant redundancy

In the mrmr algorithm, shown in formula (8), the entropy-based mrmr score is calculated as the mutual information (MI) between the candidate variable and the target variable (the relevance term) minus the average MI between the candidate variable and the variables already selected (the redundancy term). The higher the MI is for a feature, the more the feature is needed.

Fig. 1. A simple graph with three communities, enclosed by the dashed circles. Reprinted figure with permission from [35].

\max_{x_j \in F - S} \left[ I(x_j; C) - \frac{1}{|S|} \sum_{x_i \in S} I(x_j; x_i) \right]    (8)

Here, F is the whole set of features, C is the target variable, x_i is the i-th feature, and |S| is the number of features in S.
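To make the greedy step of formula (8) concrete, the following sketch (ours, not from the paper) scores each remaining feature given a precomputed relevance vector I(x_j; C) and a pairwise MI matrix I(x_j; x_i); how those MI values are estimated is left to the caller:

import numpy as np

def mrmr_step(relevance, pairwise_mi, selected, remaining):
    # relevance[j]      ~ I(x_j; C)
    # pairwise_mi[j, i] ~ I(x_j; x_i)
    # Returns the index in `remaining` that maximizes formula (8).
    best_j, best_score = None, -np.inf
    for j in remaining:
        redundancy = np.mean([pairwise_mi[j, i] for i in selected]) if selected else 0.0
        score = relevance[j] - redundancy
        if score > best_score:
            best_j, best_score = j, score
    return best_j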

It can be clearly seen that formula (8) only measures the quantity of irrelevant redundancy (IR), shown in Fig. 2, i.e., how strongly the values of the candidate variables and the selected variables are correlated; it does not deal with relevant redundancy (RR), also shown in Fig. 2, which is the common information that they carry about the target variable. Thus, it chooses some irrelevant variables too early and some useful variables too late [37,38]. The reasons are that features that are irrelevant with respect to the class but have little redundancy with the selected features can be chosen early, whereas informative features that carry unique information about the target variable but seem highly redundant with the previously selected variables may be picked late. In fact, such features possessing strong discriminative power are useful for classification. To avoid this drawback, Sotoca [24] used conditional mutual information to estimate relevant redundancy and relevant independency. The concept of relevant independency (RI) between features X_i and X_j with respect to the target variable C measures the amount of information they can predict about C but do not share. Conversely, the complementary concept of relevant redundancy captures the information they share which can be used to predict the target variable C.

The relevant independency (RI) between features X_i and X_j can be defined by [24]

RI(X_i, X_j) = \frac{I(X_i; C | X_j) + I(X_j; C | X_i)}{2 H(C)}    (9)

where the entropy of the target variable, through the factor 2H(C), is a normalization factor making RI range between 0 and 1.

The RI function is the amount of information that feature X_i can, but feature X_j cannot, predict about the target variable C, and vice versa. In fact, RI is a measure of the relevant independency between features X_i and X_j in terms of the target variable C, which is also the amount of information they can predict about C but do not share. An intuitive illustration of the concepts of IR, RR and RI is shown in Fig. 2. It can easily be concluded that the larger RI(X_i, X_j) is, the more independent the two features X_i and X_j are with respect to the target variable C, which means the combination of features X_i and X_j has strong discriminatory power as a group.

Fig. 2. Visualization of IR, RR and RI between features X_i and X_j.
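The quantities in Eqs. (3)-(7) and the RI of Eq. (9) can be estimated from discrete (or discretized) feature and label vectors. The sketch below is our own plain frequency-count implementation, given only as an illustration; the paper itself uses the FEAST tool for these estimates (see Section 5.2).

import numpy as np
from collections import Counter

def entropy(x):
    # H(X) of Eq. (3), natural log, from empirical frequencies.
    n = len(x)
    return -sum((c / n) * np.log(c / n) for c in Counter(x).values())

def mutual_information(x, y):
    # I(X;Y) of Eq. (6) via H(X) + H(Y) - H(X,Y).
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def conditional_mi(x, y, z):
    # I(X;Y|Z) of Eq. (7) via H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z).
    xz, yz, xyz = list(zip(x, z)), list(zip(y, z)), list(zip(x, y, z))
    return entropy(xz) + entropy(yz) - entropy(z) - entropy(xyz)

def relevant_independency(xi, xj, c):
    # RI(X_i, X_j) of Eq. (9), normalized by 2 H(C).
    return (conditional_mi(xi, c, xj) + conditional_mi(xj, c, xi)) / (2.0 * entropy(c))

With such estimators the RI of Eq. (9) stays in [0, 1], since I(X_i;C|X_j) + I(X_j;C|X_i) cannot exceed 2H(C).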

4. Community modularity Q-based feature selection

4.1. A new feature evaluation scheme

As mentioned above, features that minimize the within-class distance and maximize the between-class distance are preferred [43]. The discriminative capability of a feature means that samples in the same class should have similar feature values but large margins between values in different classes. Hence, the FVG of an informative feature should have a clear community structure, corresponding to a larger value of the community modularity Q.

In this paper, a different evaluation criterion is developed based on the community modularity Q value under the assumption mentioned above. A useful feature should have close feature values for samples in the same class and a large gap between different classes. In other words, the feature values in the same class should form a tight cluster, and values from different classes should be far apart. If a good feature is turned into a feature vector graph (FVG), the graph should have a distinct community structure, that is, the feature values of samples in the same class should fall in the same community; the community modularity Q value of the FVG is then high, and the number of communities equals the number of classes. According to Eq. (2), the higher the community modularity Q value of an FVG is, the more apparent its community structure is, which means the feature is more relevant with respect to the target variable. Based on the above idea, we can calculate the community modularity Q value of each feature vector graph using the community structure known in advance from the class variable. The features with the top community modularity Q values are picked into the optimal subset. The key point is how to construct the FVG.

Given the r-th feature, the feature vector is f_r = [f_{r1}, f_{r2}, ..., f_{ri}, ..., f_{rm}]^T, where f_{ri} stands for the feature value of the i-th sample in the r-th feature, i = 1, ..., m, r = 1, ..., n. The sample vector X_i = [f_{i1}, f_{i2}, ..., f_{in}] is a row vector. For convenience, we take a binary classification data set as an example and assume that the first k samples are in one class and the remaining m - k samples are in the other class. Thus, the r-th feature vector can be represented as

f_r = [f_{r1}, f_{r2}, ..., f_{rk}, f_{r,k+1}, ..., f_{rm}]^T

For a feature with discriminative information, the first k feature values {f_{ri}}_{i=1,...,k} should be close to each other; in other words, the k - 1 nearest neighbors N_{k-1}(f_{ri}) of any value f_{ri} in {f_{ri}}_{i=1,...,k} should be the remaining k - 1 values in {f_{ri}}_{i=1,...,k}. We join f_{ri} with every value in N_{k-1}(f_{ri}) by an edge to establish a sub-FVG, which forms a strong community. The same is done for the last m - k feature values {f_{ri}}_{i=k+1,...,m}. Therefore, the rule for constructing an FVG can be summarized as follows:

For a feature vector f_r = [f_{ri}]_{i=1,...,m}, we put an edge between nodes i and j, which correspond to the feature values f_{ri} and f_{rj} in f_r, if f_{ri} and f_{rj} are close to each other, i.e. f_{ri} \in N_{p_j - 1}(f_{rj}) or f_{rj} \in N_{p_i - 1}(f_{ri}), where p_i is the number of samples in the class that contains sample X_i, and similarly for p_j.
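A direct, if naive, reading of this construction rule can be sketched as follows. The code is our own and assumes the class sizes are given; it uses a brute-force neighbor search rather than the fast K-NNG construction mentioned in Section 4.3.

import numpy as np

def build_fvg(feature_values, labels):
    # feature_values: 1-D array of f_r; labels: class label of each sample.
    # Nodes i and j are joined if f_ri is among the (p_j - 1) nearest values
    # to f_rj, or f_rj is among the (p_i - 1) nearest values to f_ri,
    # where p_i is the size of the class containing sample i.
    f = np.asarray(feature_values, dtype=float)
    labels = np.asarray(labels)
    m = len(f)
    class_size = {c: int((labels == c).sum()) for c in np.unique(labels)}
    dist = np.abs(f[:, None] - f[None, :])
    np.fill_diagonal(dist, np.inf)
    A = np.zeros((m, m), dtype=int)
    for i in range(m):
        p_i = class_size[labels[i]]
        neighbors = np.argsort(dist[i])[: p_i - 1]   # N_{p_i - 1}(f_ri)
        A[i, neighbors] = 1
    return np.maximum(A, A.T)                        # symmetrize: the "or" rule

The resulting adjacency matrix can then be scored with the modularity function sketched after Eq. (2), using the class labels as the known community assignment.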

From the example above, it can be clearly seen that the idea behind our relevance measurement is based on the intrinsic characteristics of the data, which means that features minimizing the intra-class distance and maximizing the inter-class distance are preferred. A larger inter-class distance implies that the local margin of any sample is large enough. By large margin theory [47], the upper bound of the leave-one-out cross-validation error of a nearest-neighbor classifier in the feature space is then minimized, and such a classifier usually generalizes well on unseen test data [48,49]. In contrast, traditional mutual information based evaluation of the relevance between a feature and the class is based on the amount of information by which the knowledge provided by the feature vector decreases the uncertainty about the class, which is a function of the joint probability distribution of the class c and feature f and is independent of the sample values. Under certain circumstances this evaluation cannot accurately measure the discriminative power of a feature. To illustrate this, for simplicity, the feature vectors f_1, f_2 and the class vector c are defined as follows:

f_1 = (1 1 1 1 1 3 3 3 3 3)^T
f_2 = (1 1 1 1 1 6 6 6 6 6)^T
c  = (0 0 0 0 0 1 1 1 1 1)^T

According to Eq. (5), I(f_1; c) = I(f_2; c), which means that feature f_1 has the same relevancy as f_2. In our method, feature f_2 has more discriminative power than f_1, because the community modularity Q of the FVG of feature f_2 is larger than that of feature f_1. Obviously, feature f_2 should be more relevant than f_1, since its between-class distance is larger than that of f_1, yet the MI based method cannot capture the difference between f_1 and f_2. Therefore, our relevancy evaluation criterion based on the community modularity Q is efficient and accurate. However, our evaluation criterion is not good at imbalanced datasets, especially those with far fewer samples in one class than in the other for binary classification, because modularity optimization is widely criticized for the resolution limit illustrated in Fig. 3, which may prevent it from detecting clusters that are comparatively small with respect to the graph as a whole [35]. This can bring about the problem that the maximum modularity does not correspond to a good community structure, that is, features with a higher Q value may not be relevant. Resolving this problem is left to future work.

4.2. Community modularity Q based feature selection method

4.2.1. Community modularity Q based feature selection

In the previous subsection, we introduced a new feature evaluation criterion based on constructing the FVG. The community modularity Q_r value can be calculated from the community structure known in advance, one class corresponding to one community. The larger the Q_r value is, the better the r-th feature is. Thus, we propose a community modularity Q value based feature selection method (CMQFS) for ranking features. In this paper, we employ the feature selection procedure in a straightforward way. Firstly, for all features in the feature space, the FVGs are built independently. Then, the community modularity Q_r value of each FVG is calculated. Features are ranked in descending order with respect to the Q_r values, and the subset of the top λ predefined features with the highest Q_r values is selected. To help the reader better understand our evaluation scheme, we take the UCI dataset iris as an example, which consists of 150 samples and 4 features and is divided into 3 classes. For convenience, we select all the samples in classes {1, 2}, 50 samples each, and group them by class. For the fourth feature:

f_4 = [ 1, 1, ..., 1  (50 samples) | 2, 2, ..., 2  (50 samples) ;
        0.2, 0.3, ..., 0.2         | 1.4, 1.5, ..., 1.3         ]^T

The top row is the label vector and the bottom row is the feature vector. Apparently, the feature values of samples in the same class are close but differ between classes, as discussed previously. Now we construct the FVG of the fourth feature. The first value f_{41} in f_4 is 0.2 and p_1 = 50, so N_49(0.2) consists of the remaining 49 feature values in class {1}, and we put an edge between f_{41} = 0.2 and every value in N_49(0.2). The same is done for the values in class {2}. At this point, the FVG of f_4 is complete. Next, we calculate the community modularity Q_r value of the FVG above, according to the two community structures known in advance, to score the fourth feature based on Eq. (1) or (2). The FVGs of the remaining three features are constructed in the same way. Fig. 4 illustrates the FVGs of all four features of the iris dataset, and the four community modularity Q_r values are shown in Table 1.

As shown in Fig. 4, it is clear that the FVGs of the third and fourth features have more obvious community structure than those of the first and second features, and correspondingly Table 1 shows that their community modularity values are larger. Therefore, the relevancy of the four features to the target variable C can be sorted as {3, 4, 1, 2}, as most methods find on the iris dataset, which means that our method (CMQFS) is effective and accurate at evaluating discrimination with respect to the target variable. For a multi-class dataset, the FVG of each feature can be built in the same way as for a binary-class dataset to identify the relevancy.

4.2.2. Relevant independency analysis

As discussed in the previous subsection, CMQFS prefers features with a higher community modularity value. However, like other ranking-based filter methods built on a score function, it still cannot handle high redundancy among the selected features. To relieve this problem, the relevant independency (RI) defined in Eq. (9), instead of the relevancy between features, is applied to capture the relevancy information among the selected features, and the features with larger RI with respect to the selected set S are picked.

Next, we use the sum of the relevant independency (RI) between a feature f_r and all features in S to estimate the relevancy of f_r given the selected features, denoted I_r(f_r; C|S), which can be defined as follows:

I_r(f_r; C|S) = \sum_{f_j \in S} RI(f_r, f_j)    (10)

A larger value of I_r(f_r; C|S) indicates that f_r has high relevant independency with respect to the features in S. Since the values of RI_r and Q_r may vary greatly, we apply a linear transformation to map them to the range [0, 1] as follows:

NQ_r = \frac{Q_r - Q_{min}}{Q_{max} - Q_{min}}    (11)

NRI_r = \frac{RI_r - RI_{min}}{RI_{max} - RI_{min}}    (12)

where Q_{min} and Q_{max} denote the minimum and maximum of {Q_1, Q_2, ..., Q_n}, respectively. Thus NQ_r takes values in [0, 1]; a value of 0 or 1 indicates that Q_r is the minimum or maximum of {Q_1, Q_2, ..., Q_n}, respectively. RI_{min}, RI_{max} and NRI_r are defined similarly. Then, as a linear combination of NQ_r and NRI_r, the main criterion of CMQFS is to iteratively select the feature r that maximizes w_r:

w_r = \beta \cdot NQ_r + (1 - \beta) \cdot NRI_r    (13)

where β is a control parameter taken in [0, 1]; β = 0 means that NRI_r completely dominates, and NQ_r dominates when β = 1. In this study, β = 0.3.

Fig. 3. Resolution limit of modularity optimization. The natural community structure of the graph, represented by the individual cliques (circles), is not recognized by optimizing modularity if the cliques are smaller than a scale depending on the size of the graph. In this case, the maximum modularity corresponds to a partition whose clusters include two or more cliques (like the groups indicated by the dashed contours) [35,50].
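The scaling of Eqs. (11) and (12) and the weighted criterion of Eq. (13) amount to only a few lines. The sketch below is our own, with β = 0.3 as in the text, and assumes the raw Q_r and RI_r scores of the candidate features are already available:

import numpy as np

def minmax(v):
    v = np.asarray(v, dtype=float)
    span = v.max() - v.min()
    return np.zeros_like(v) if span == 0 else (v - v.min()) / span

def cmqfs_scores(q_values, ri_values, beta=0.3):
    # w_r = beta * NQ_r + (1 - beta) * NRI_r, Eq. (13).
    return beta * minmax(q_values) + (1.0 - beta) * minmax(ri_values)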

The details of CMQFS are presented in Table 2.

4.3. The time complexity of CMQFS

As shown in Table 2, CMQFS mainly includes three steps. The first step is to construct the FVG of each feature. The second step is to calculate the community modularity Q value of each FVG. The third step is to analyze the relevant independency among features. The most time-consuming step is building the FVGs, whose time complexity is about O(nm^2 log m), where n is the number of features in the feature space and m is the number of samples in the dataset. Fortunately, the fast K-Nearest Neighbor Graph (K-NNG) construction method [41] can be applied to the construction of the FVGs, which reduces the time complexity from O(nm^2 log m) to O(nm^1.14 log m). The time spent in the second step is approximately O(n log n). The algorithm has linear complexity O(n) for calculating the relevant independency (RI), so the computational cost of the third step is about O(λ^2 n); in general the threshold value λ (1 < λ < n) is much smaller than n, and the cost is O(n^2) in the worst case when all features are selected. Thus, the overall time cost of CMQFS is about O(nm^1.14 log m) + O(n log n) + O(λ^2 n).

5. Experiments

In this section, to test the proposed approach, we conduct experiments on artificial datasets including binary-class and multi-class datasets. We compare our algorithm with five typical feature selection algorithms mentioned above: MIFS (β = 1.0), MIFS_U (β = 1.0), mrmr, Fisher Score, and Laplacian Score (NeighborMode = 'KNN', k = 5, WeightMode = 'HeatKernel', t = 1.0). We use the existing code available in [26] to implement the MIFS, MIFS_U and mrmr methods.

To evaluate the effectiveness of feature selection, the nearest neighbor classifier (1NN) with Euclidean distance and a support vector machine (SVM) using a radial basis function (RBF) kernel with default parameters are employed to test the performance of these feature selection algorithms. We utilize the LIBSVM package [36] for binary and multi-class classification. All experiments are conducted in Matlab 2012b on a PC with an Intel Core i3-2310 CPU @ 2.10 GHz and 4 GB main memory.

Fig. 4. The FVGs of the four features of the iris dataset. The red nodes correspond to the feature values in class {1} and the blue nodes to class {2}. (a) The FVG of the fourth feature, (b) the FVG of the third feature, (c) the FVG of the second feature, (d) the FVG of the first feature. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 1
The community modularity Q values of the four features of the iris dataset.

r-th feature      1       2       3       4
Modularity Q_r    0.2142  0.1824  0.4883  0.4828

5.1. Datasets and preprocessing

To verify the effectiveness of our method for both binary classification and multi-classification, 10 datasets from the Libsvm datasets1 [36] and UCI and two microarray datasets [39] are adopted in our simulation experiments. The Libsvm collection contains many classification, regression and multi-label data sets stored in LIBSVM format; many are from UCI, Statlog, StatLib and other collections, and each attribute of most sets has been linearly scaled to [-1, 1] or [0, 1]. In our experiments, all the features in the datasets are uniformly scaled to zero mean and unit variance. The details of the 12 datasets are shown in Table 3.

5.2. Feature selection and classification results

In this subsection, we use classification performance, one of the most effective and direct ways, to validate the feature selection methods. To estimate the performance of the classification algorithms, 10-fold cross-validation is used to avoid over-fitting. In the cross-validation process, the data is randomly divided into 10 nearly equal-sized folds; each of the 10 folds is in turn used as testing data while the remaining 9 folds are used for training. To reduce incidental effects, all the experimental results are the average of 10 independent runs. For the compared methods and our method, we generate feature subsets by picking the top p selected features, p = 1, ..., λ, and assess each method in terms of classification accuracy. In the mutual information computations during the analysis of relevant independency among selected features, we discretize continuous features into nine discrete levels as in [27,52]: feature values between μ - σ/2 and μ + σ/2 are converted to 0, the four intervals of size σ to the right of μ + σ/2 to discrete levels 1 to 4, and the four intervals of size σ to the left of μ - σ/2 to discrete levels -1 to -4, whereas very large positive or very small negative feature values are truncated and discretized to ±4 appropriately. In this paper, we use the FEAST tool2 [26] to calculate MI and conditional mutual information (CMI).
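The nine-level discretization described above can be written down directly; the following sketch is our own interpretation of that scheme (feature-wise binning around the mean μ with bin width σ, clipped to ±4):

import numpy as np

def discretize_nine_levels(x):
    # Map values in [mu - sigma/2, mu + sigma/2] to 0, then successive
    # intervals of width sigma to levels 1..4 (right) and -1..-4 (left);
    # anything beyond is truncated to +4 or -4.
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    if sigma == 0:
        return np.zeros(len(x), dtype=int)
    levels = np.sign(x - mu) * np.ceil((np.abs(x - mu) - sigma / 2.0) / sigma)
    levels = np.clip(levels, -4, 4)
    levels[np.abs(x - mu) <= sigma / 2.0] = 0
    return levels.astype(int)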

Figs. 5 and 6 show the classification rates using the 1-NN and SVM classifiers, respectively, with respect to the subset of p features selected by each method. The x-axis represents the number of selected features, whereas the y-axis shows the average classification accuracy obtained by each method. Tables 4 and 5 summarize the average classification rates of the two classifiers for subsets of different sizes p selected by each method on each dataset; a bold value in the original tables marks the largest one among the feature selection methods for the same classifier and the same number of selected features. To avoid being influenced by the scarcity of the data, we also report, for each selector, the 'Avg.' value, which is the average of the accuracy rates over the different p and all datasets for that method.

In addition, for each compared method on each dataset, the time consumption, i.e., the average run time until all features have been selected into S, is reported in Table 6 to allow an accuracy-time comparison.
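For reference, the evaluation protocol (10-fold cross-validation of a 1-NN classifier on the top-p selected features) can be reproduced with standard tooling. The sketch below is an illustrative Python/scikit-learn equivalent of the protocol, not the Matlab/LIBSVM code actually used in the paper:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def accuracy_top_p(X, y, ranked_features, p):
    # Mean 10-fold CV accuracy of 1-NN on the top-p ranked features.
    subset = ranked_features[:p]
    clf = KNeighborsClassifier(n_neighbors=1)
    return cross_val_score(clf, X[:, subset], y, cv=10).mean()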

According to the results in Tables 4 and 5 and Figs. 5 and 6, it can be concluded that CMQFS performs better than the others with both classifiers: its 'Avg.' values are 78.99% with 1-NN and 80.99% with SVM, higher than those of the other methods. Furthermore, in most cases its average accuracies with the two classifiers are much higher than those of the other selectors, and it consistently achieves significantly higher classification accuracy with a smaller number of selected features than the other methods, which means CMQFS achieves the goal of feature selection: selecting a smaller subset of features while obtaining higher learning accuracy. On several datasets (Sonar, Glass, Madelon, Vehicle), CMQFS is wholly superior to the other algorithms for the selected p features. From the accuracy results in Tables 4 and 5, it can be said that CMQFS in most cases attains higher classification accuracy than the other methods.

Table 3
Characteristics of the data sets in our experiment.

No.  Dataset      Samples  Features  Classes  Source
1.   Wine         178      13        3        Libsvm
2.   Sonar        208      60        2        Libsvm
3.   Svmguide2    391      20        3        Libsvm
4.   Glass        214      9         6        Libsvm
5.   Vehicle      846      18        4        Libsvm
6.   Madelon      2000     500       2        UCI
7.   Letter       4500     16        26       Libsvm
8.   Segment      2310     19        7        Libsvm
9.   DLBCL_C      58       3795      4        [39]
10.  Breast_A     98       1213      3        [39]
11.  Hill_valley  606      100       2        Libsvm
12.  Zoo          101      16        7        UCI
13.  Magic04      19,020   10        2        UCI

Table 2
CMQFS: community modularity Q based feature selection.

Input: a training dataset D_{m x n} with m samples and n features in space F and the target C; a predefined feature number λ; control parameter β.
Output: selected feature subset S.

1. Initialize parameters: S = {}.
2. Group the training samples D by class.
3. Calculate the value of NQ_r for each feature f_r in F according to Eqs. (2) and (11), r = 1, 2, ..., n.
4. [re_NQ_r, re_F] = sort(NQ_r, 'descend');
   add re_F(1) to S: S <- S ∪ {re_F(1)}; set F <- re_F \ re_F(1).
5. While (|S| <= λ)
6.   For each feature f_r in F
7.     Calculate RI_r(f_r, S) according to Eqs. (10) and (12) for all pairs (f_r, s) with f_r in F, s in S, if not yet available; calculate w_r according to Eq. (13).
8.   End for
9.   Select the feature f_r with the maximum value of w_r.
     Set S <- S ∪ {f_r}; F <- F \ {f_r}.
10. End while

1 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
2 http://www.cs.man.ac.uk/~gbrown/fstoolbox/


On the microarray datasets, e.g. DLBCL_C and Breast_A, CMQFS still performs better than the MI based methods, since there are not enough samples to estimate the MI exactly, which illustrates that our method works well for datasets with few samples. CMQFS is also efficient for large datasets such as Magic04. However, it can also be observed that CMQFS does not win over some algorithms by a very large margin in some cases (e.g. Letter). Therefore, the one-tailed paired two-sample t-test for means is used to assess the statistical significance of the differences in accuracy between all compared methods for both classifiers. In this test, the null hypothesis is that the average accuracy of CMQFS over the different numbers of selected features is not higher than that of the other feature selection algorithms; conversely, the alternative hypothesis is that CMQFS is superior to the compared feature selection algorithms. For example, to compare the performance of CMQFS with the Fisher Score method (CMQFS vs. Fisher Score), the null and alternative hypotheses can be defined as H_0: μ_CMQFS <= μ_FisherScore and H_1: μ_CMQFS > μ_FisherScore, where μ_CMQFS and μ_FisherScore are the average classification accuracies of CMQFS and the Fisher Score method over the different numbers of selected features on the 12 datasets.

Fig. 5. The average classification accuracy using the 1-NN classifier with respect to the subset of p features selected by each method.

The significance level is set at 5%. From the test results in Tables 7 and 8, it can be found that the p-values obtained by all pairwise t-tests are less than 0.05, which means the proposed CMQFS significantly outperforms the other algorithms.

From the results in Table 6, our method is comparatively effective with respect to the accuracy-time trade-off: its time consumption is tolerable given its classification accuracy. The Fisher Score method is the most efficient but has poor performance.

To further verify the merit of our feature evaluation criterion, Fig. 7(a-f) shows the decision boundaries of the 1-NN classifier in the space of the two most informative features selected by each method for the Wine dataset. It can be clearly observed that the two features selected by CMQFS are informative, as are those selected by the MIFS and mrmr methods (e.g., Fig. 7(c, e, f)), helping to separate part of the sample data well, whereas the features selected by the remaining methods are relatively noisy (e.g., Fig. 7(a, b, d)). It is reasonable to expect that higher classification accuracy can be achieved in the space of the two most informative features, and indeed the average classification accuracies obtained by CMQFS with two features are larger than those of the other methods for both classifiers, as shown in Figs. 5 and 6 and Tables 4 and 5. Furthermore, from Figs. 5 and 6 it can be clearly concluded that most of the average classification accuracies obtained by CMQFS with the single top-ranked feature (the one most relevant to the class variable) are superior to those of any other method (i.e., Laplacian Score, Fisher Score, MIFS, MIFS_U and mrmr), which means our method is able to select the most informative feature. From the above, it is sufficient to conclude that CMQFS captures the intrinsic characteristics of each feature and the relevant redundancy among features, and that it selects informative features with little redundancy for the subsequent classification tasks. CMQFS can perform better than the other feature selection algorithms.

6. Justification of CMQFS based on k-means clustering

In this section, the justifiability of the proposed feature evaluation criterion based on the community modularity Q value is demonstrated by means of k-means cluster theory, that is, we show why a feature whose FVG has a higher Q value is more discriminative.

The k-means cluster algorithm [42] is the most well-known clustering algorithm; it iteratively attempts to solve the following objective: given a set of points in a Euclidean space and a positive integer k (the number of clusters), split the points into k clusters so that the total sum of the squared Euclidean distances of each point to its nearest cluster center is minimized, which can be defined as follows:

J(C, \mu) = \sum_{t=1}^{C} \sum_{i \in c_t} \| x_i - \mu_{c_t} \|^2    (14)

Here, x_i and \mu_{c_t} are the i-th sample point and its nearest cluster center, respectively, and \| \cdot \| is the L_2-norm.

In feature weighting k-means, a feature that minimizes the within-cluster distance and maximizes the between-cluster distance is preferred and hence gets a higher weight [43]. One may naturally ask whether a feature with a higher community modularity Q value in our method also minimizes the within-cluster distance and maximizes the between-cluster distance. It is necessary to confirm this.

According to Eq. (2), Q = \sum_{c=1}^{n_c} [ l_c/M - (d_c/(2M))^2 ], so making the Q value higher is equivalent to maximizing the inner edges l_c and minimizing d_c, with l_c = d_in/2 and d_c \propto d_out. That is to say, each community of the FVG of the feature has a larger inner degree d_in and a smaller out-degree d_out between communities; in other words, if the feature is more discriminative as an individual, sample points with the same class label are connected within the same community as much as possible and across different communities as little as possible. Now, the expected number of sample points correctly classified in the feature space can be calculated using Neighborhood Components Analysis [44].

Given a candidate feature f, each sample point i in the f feature space selects another sample point j as its neighbor with some probability P_{ij}. P_{ij} can be defined using a softmax over Euclidean distances:

P_{ij} = \frac{\exp(-\|x_i - x_j\|^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2)} = \frac{\exp(-\|x_i - x_j\|^2)}{D_i}, \quad P_{ii} = 0    (15)

Under this stochastic selection rule, we can compute the probability P_i that point i will be correctly classified (denoting the set of points in the same class as i by C_i = { j | c_j = c_i }):

P_i = \sum_{j \in C_i} P_{ij}    (16)

Hence, the expected number of sample points in the f feature space correctly classified (ENC) into the same class is defined by

ENC(f) = \sum_i P_i = \sum_i \sum_{j \in C_i} P_{ij}    (17)

f = \arg\max_{f \in F} ENC(f)    (18)

A feature f with a larger ENC is more discriminative.
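The two quantities compared in this section, the NCA-style expected number of correct assignments ENC(f) of Eq. (17) and the k-means objective J(C, μ) of Eq. (14) with one cluster per class, can both be computed directly for a single feature. The sketch below is our own and is meant only as a numerical illustration on one-dimensional sample values x with class labels y:

import numpy as np

def enc(x, y):
    # ENC(f) of Eq. (17) with the softmax neighbor probabilities of Eq. (15).
    x = np.asarray(x, dtype=float)
    d2 = (x[:, None] - x[None, :]) ** 2
    w = np.exp(-d2)
    np.fill_diagonal(w, 0.0)                    # P_ii = 0
    P = w / w.sum(axis=1, keepdims=True)        # P_ij = exp(-d2) / D_i
    same = np.equal.outer(y, y)
    np.fill_diagonal(same, False)
    return float((P * same).sum())              # sum_i sum_{j in C_i} P_ij

def kmeans_objective(x, y):
    # J(C, mu) of Eq. (14), one cluster per class label.
    x = np.asarray(x, dtype=float)
    return float(sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in np.unique(y)))

# A feature with tighter classes has a larger ENC and a smaller J.
y = np.array([0] * 5 + [1] * 5)
tight = np.array([0.0, 0.1, -0.1, 0.05, -0.05, 3.0, 3.1, 2.9, 3.05, 2.95])
loose = np.array([0.0, 1.0, -1.0, 0.5, -0.5, 3.0, 4.0, 2.0, 3.5, 2.5])
print(enc(tight, y), enc(loose, y))                          # ENC: tight > loose
print(kmeans_objective(tight, y), kmeans_objective(loose, y))  # J: tight < loose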

From the equations above, a conclusion can be drawn that maximizing ENC is equivalent to minimizing the k-means cluster objective J(C, \mu).

Proof (1). Minimizing J(C, \mu) \Rightarrow maximizing ENC(f).

Given a feature f \in F, substituting Eq. (15) into Eq. (17) gives

ENC(f) = \sum_i P_i = \sum_i \sum_{j \in C_i} \frac{\exp(-\|x_i - x_j\|^2)}{D_i}
       = \sum_{t=1}^{C} \sum_{i \in c_t, j \in c_t \setminus \{i\}} \frac{\exp(-\|x_i - x_j\|^2)}{D_i}
       > \sum_{t=1}^{C} \sum_{i \in c_t, j \in c_t \setminus \{i\}} \frac{\exp(-\|x_i - x_j\|^2)}{D_{max}}, \quad D_{max} = \max\{D_1, D_2, ..., D_n\}
       > \frac{1}{D_{max}} \sum_{t=1}^{C} \exp\Big(-\sum_{i, j \in c_t} \|x_i - x_j\|^2\Big),

where C is the number of clusters. The lower bound of ENC(f) is denoted by ENC_{L\_bound}:

ENC_{L\_bound} = \frac{1}{D_{max}} \sum_{t=1}^{C} \exp\Big(-\sum_{i, j \in c_t} \|x_i - x_j\|^2\Big) = \frac{1}{D_{max}} \sum_{t=1}^{C} \exp\Big(-2 \sum_{i, j \in c_t, i < j} \|x_i - x_j\|^2\Big)

Hence ENC(f) can be maximized by maximizing its lower bound ENC_{L\_bound}, or equivalently by minimizing \sum_{t=1}^{C} \sum_{i, j \in c_t, i < j} \|x_i - x_j\|^2. As we know,

\sum_{t=1}^{C} \sum_{i, j \in c_t, i < j} \|x_i - x_j\|^2 \le 2 \sum_{t=1}^{C} \sum_{i \in c_t} \|x_i - \mu_{c_t}\|^2 = 2 J(C, \mu) \propto J(C, \mu),

which shows that the lower bound ENC_{L\_bound} is maximized, and ENC(f) attains its maximum value, when the k-means objective of Eq. (14) is minimized.

Proof (2). Maximizing ENC(f) \Rightarrow minimizing J(C, \mu).

From the results in proof (1), and

ENC(f) < \sum_{t=1}^{C} \sum_{i \in c_t, j \in c_t \setminus \{i\}} \frac{\exp(-\|x_i - x_j\|^2)}{D_{min}}, \quad D_{min} = \min\{D_1, D_2, ..., D_n\},

maximizing ENC(f) is equivalent to minimizing \sum_{t=1}^{C} \sum_{i \in c_t, j \in c_t \setminus \{i\}} \|x_i - x_j\|^2, and because

\sum_{t=1}^{C} \sum_{i \in c_t, j \in c_t \setminus \{i\}} \|x_i - x_j\|^2 = \sum_{t=1}^{C} \sum_{i \in c_t, j \in c_t \setminus \{i\}} n_t \frac{\|x_i - x_j\|^2}{n_t} \ge \sum_{t=1}^{C} \sum_{i \in c_t} n_t \|x_i - \mu_{c_t}\|^2 \propto J(C, \mu),

the k-means cluster function J(C, \mu) is minimized while \sum_{t=1}^{C} \sum_{i \in c_t, j \in c_t \setminus \{i\}} \|x_i - x_j\|^2 is minimized, i.e., while ENC(f) is maximized. Here, n_t is the number of samples in class t.

Fig. 6. The average classification accuracy using the SVM classifier with respect to the subset of p features selected by each method.

So far, it can be concluded that J(C, \mu) in the f feature space must be minimized while the community modularity Q value of the FVG in the f feature space gets a higher value, which indicates that the feature selected by our method is able to minimize the within-cluster distance. Similarly, the expected number of points incorrectly classified is defined by ENIC(f) = n - ENC(f), where n is the number of samples. A smaller ENIC(f) gives rise to fewer edges between communities and a greater between-cluster distance. It can be reasonably believed that the feature with a higher Q value is more relevant to the class labels, as it not only minimizes the within-cluster distance but also maximizes the between-cluster distance. Therefore, the features selected by our method have strong discriminative power.

Table 4
Accuracy of different FS algorithms based on 1NN (%).

Dataset      #R    FisherScore  LaplacianScore  MIFS   MIFS_U  mrmr   CMQFS
Wine         p=2   84.24        84.86           92.12  84.80   92.71  92.71
             p=8   97.15        95.49           94.34  97.30   94.93  98.44
             p=9   95.52        96.60           96.63  96.07   97.22  98.43
Sonar        p=5   70.19        73.97           72.59  73.11   72.07  76.45
             p=10  76.91        80.83           82.14  77.85   83.11  85.59
             p=15  78.28        84.09           83.69  82.33   83.19  86.97
Glass        p=2   56.38        51.51           61.68  62.64   63.63  64.52
             p=3   71.47        58.37           60.69  61.29   59.76  72.83
             p=4   71.99        59.78           55.56  67.16   68.65  79.37
Vehicle      p=4   60.51        61.01           51.18  59.93   51.66  67.25
             p=8   63.58        65.35           66.18  65.12   63.83  67.51
             p=12  66.91        65.94           69.14  69.14   66.78  70.08
Svmguide2    p=4   69.53        60.11           66.77  67.01   61.65  71.12
             p=6   66.25        65.72           71.60  68.56   69.81  71.82
             p=12  69.83        65.69           70.61  68.26   68.26  73.89
Madelon      p=10  86.15        49.00           52.10  71.40   52.95  87.95
             p=20  72.80        48.80           51.65  59.10   57.10  88.60
             p=30  65.95        48.80           53.80  52.10   53.70  78.25
Letter       p=2   22.06        20.75           24.20  24.22   23.35  26.68
             p=3   38.08        41.15           41.71  41.57   42.00  47.84
             p=4   46.53        60.35           58.88  54.22   53.97  60.33
Segment      p=2   84.19        14.50           86.40  86.83   86.71  88.34
             p=3   88.01        56.40           96.75  96.49   96.62  97.89
             p=4   88.48        79.78           96.19  96.88   96.96  96.67
DLBCL_C      p=5   66.66        58.33           74.66  77.33   74.00  81.00
             p=10  65.33        56.00           77.66  74.33   84.66  90.00
             p=20  63.33        60.33           77.33  74.33   83.00  88.33
Breast_A     p=5   83.66        67.22           86.77  83.77   90.77  94.00
             p=10  80.55        73.33           92.88  85.77   87.77  92.88
             p=15  84.77        69.33           86.66  89.66   88.89  89.77
Hill_valley  p=5   48.17        45.88           49.33  53.82   47.84  55.27
             p=10  46.04        46.37           50.49  50.35   48.49  54.61
             p=15  47.20        47.71           49.02  52.30   50.13  52.63
Zoo          p=3   83.27        88.09           81.27  81.09   84.09  90.09
             p=6   86.18        91.00           95.00  92.18   96.18  96.09
             p=9   91.00        91.09           93.09  91.00   94.18  96.09
Magic04      p=5   78.96        76.89           75.05  78.61   76.29  79.82
             p=6   81.67        80.52           76.66  78.84   76.49  83.21
             p=7   81.63        80.85           77.88  78.76   77.91  82.21
Avg.               71.27        64.66           71.80  72.65   72.34  78.99

Table 5
Accuracy of different FS algorithms based on SVM (%).

Dataset      #R    FisherScore  LaplacianScore  MIFS   MIFS_U  mrmr   CMQFS
Wine         p=2   90.88        89.93           93.85  90.94   94.93  94.96
             p=8   97.15        98.33           97.10  97.22   96.11  97.77
             p=9   96.56        97.71           97.22  97.22   97.77  98.88
Sonar        p=5   74.59        71.66           71.71  71.21   72.16  76.47
             p=10  69.69        76.02           79.83  71.64   82.14  87.97
             p=15  79.26        80.73           86.52  82.16   81.26  90.33
Glass        p=2   51.32        46.19           66.68  66.75   65.81  66.32
             p=3   65.82        66.03           63.13  62.59   62.57  67.33
             p=4   69.54        67.33           61.16  66.88   66.29  70.06
Vehicle      p=4   62.87        62.40           53.66  61.68   53.42  71.98
             p=8   70.80        74.35           74.34  72.33   71.96  77.78
             p=12  77.78        82.03           78.59  79.55   75.78  82.85
Svmguide2    p=4   75.92        65.74           74.41  75.72   75.46  77.48
             p=8   79.83        72.64           76.20  76.98   76.98  81.08
             p=10  79.53        75.73           76.98  78.27   79.01  81.36
Madelon      p=10  86.05        49.10           54.90  75.50   56.50  86.55
             p=20  76.85        50.35           55.50  64.35   57.80  87.55
             p=30  70.50        49.95           57.90  60.95   60.50  77.85
Letter       p=2   27.33        29.15           33.28  33.35   33.46  30.64
             p=3   45.75        45.53           46.97  47.11   47.31  48.82
             p=4   51.33        58.06           59.57  56.24   56.33  60.84
Segment      p=2   84.80        17.44           81.99  81.77   82.16  84.58
             p=3   86.32        56.75           95.93  96.01   95.97  96.32
             p=4   86.92        76.14           96.75  96.75   96.40  97.23
DLBCL_C      p=5   65.33        40.00           80.00  83.00   78.33  84.34
             p=10  67.33        56.66           79.00  77.33   76.33  89.33
             p=20  76.00        71.00           67.33  84.66   84.00  90.00
Breast_A     p=5   74.44        62.33           79.66  86.66   92.77  92.98
             p=10  80.44        75.66           86.00  86.77   84.88  92.88
             p=15  76.55        77.44           87.77  89.77   87.88  91.77
Hill_valley  p=5   54.10        52.94           54.45  57.57   53.14  59.39
             p=10  53.96        52.99           53.31  56.43   54.12  60.74
             p=15  54.62        55.62           53.95  57.74   54.59  60.56
Zoo          p=3   79.27        88.09           87.18  89.18   90.00  92.00
             p=6   86.09        87.00           92.18  93.09   96.09  96.00
             p=9   95.00        89.00           90.09  88.00   95.00  96.09
Magic04      p=5   82.49        83.00           80.51  82.26   80.95  83.67
             p=6   85.11        85.84           81.41  82.48   81.42  87.23
             p=7   85.94        86.14           82.57  83.35   82.46  88.56
Avg.               73.69        67.26           74.09  75.93   75.13  80.99

Table 6
The running time results of different methods on different datasets (time: s).

Datasets     FisherScore  LaplacianScore  MIFS   MIFS_U  mrmr   CMQFS
Wine         0.026        0.079           0.012  0.042   0.020  0.080
Sonar        0.037        0.079           0.152  0.237   0.054  0.081
Glass        0.011        0.064           0.007  0.029   0.019  0.068
Vehicle      0.034        0.154           0.093  0.114   0.042  0.176
Svmguide2    0.029        0.087           0.056  0.084   0.031  0.091

7. Discussion and conclusions

In this paper, by constructing a feature vector graph (FVG), a novel feature evaluation criterion using the community modularity Q value from complex networks is presented to measure the relevancy of each feature to the target variable. The idea behind our feature evaluation criterion can be justified by k-means cluster theory. To overcome the redundancy problem of ranking based filter methods, the relevant independency (RI) based on information theoretic criteria, instead of the irrelevant redundancy (IR) between features, is utilized to analyze the relevant redundancy among the selected features. Combining the above two points, a new feature selection method, CMQFS, is proposed for feature subset selection. Within this framework, the average relevancy of the selected feature subset is maximized and, simultaneously, the average relevant independency (RI) among the selected features is maximized (equivalently, the relevant redundancy (RR) is minimized). Another advantage of our method is that CMQFS is free of parameters.

The proposed method is compared with five other feature selection methods using two different classification schemes on 12 publicly available datasets.

Table 7
The pair-wise one-tailed paired two-sample (means) t-test results of CMQFS and the other algorithms with 1NN.

Pair-wise t-test            p-value
CMQFS vs. Fisher Score      6.41E-03
CMQFS vs. LaplacianScore    2.28E-04
CMQFS vs. MIFS              2.10E-02
CMQFS vs. MIFS_U            2.74E-02
CMQFS vs. mrmr              2.54E-02

Table 8
The pair-wise one-tailed paired two-sample (means) t-test results of CMQFS and the other algorithms with SVM.

Pair-wise t-test            p-value
CMQFS vs. Fisher Score      3.58E-09
CMQFS vs. LaplacianScore    1.01E-06
CMQFS vs. MIFS              1.61E-06
CMQFS vs. MIFS_U            1.73E-07
CMQFS vs. mrmr              4.56E-06
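For readers who want to reproduce the kind of significance test reported in Tables 7 and 8, the snippet below shows a one-tailed paired two-sample t-test with SciPy. The accuracy vectors are hypothetical placeholders, not values from this paper.

```python
# A minimal sketch of the one-tailed paired t-test used to compare two methods
# across datasets. The accuracy lists below are hypothetical placeholders.
from scipy import stats

acc_cmqfs    = [94.9, 76.4, 66.3, 71.9, 77.4]   # per-dataset accuracies (illustrative)
acc_baseline = [90.8, 74.5, 51.3, 62.8, 75.9]

t_stat, p_two_sided = stats.ttest_rel(acc_cmqfs, acc_baseline)
# One-tailed p-value for the alternative "CMQFS mean accuracy is higher".
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(f"t = {t_stat:.3f}, one-tailed p = {p_one_sided:.4f}")
```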

[Fig. 7 comprises six scatter-plot panels; in each panel the horizontal axis is Feature 1 and the vertical axis is feature 2.]
Fig. 7. Decision boundary of the 1-NN classifier on samples described by the two best informative features selected by different methods. Three colors represent three classes. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) (a) LaplacianScore, (b) Fisher Score, (c) MIFS, (d) MIFS_U, (e) mrmr and (f) CMQFS.

Table 6 (continued)

Datasets     FisherScore  LaplacianScore  MIFS    MIFS_U  mrmr   CMQFS
Madelon      0.152        0.483           31.975  30.814  8.257  35.231
Letter       0.098        0.147           0.388   0.393   0.126  0.431
Segment      0.062        0.671           0.274   0.286   0.095  0.781
DLBCL_A      0.174        0.078           1.506   2.649   0.422  2.827
Breast_A     0.314        0.102           1.849   3.118   0.516  3.211
Hill_valley  0.036        0.106           0.735   0.826   0.212  0.934
Magic04      0.031        0.093           0.629   0.576   0.198  1.532


The experimental results demonstrate that our method achieves a remarkable improvement in, or a comparable level of, classification accuracy with a smaller feature subset, which verifies the ability of the proposed method to select a subset with high discriminative power. In addition, the one-tailed paired two-sample t-test for means is used to assess the statistical significance of the differences in accuracy between all compared methods for both classifiers. The p-values obtained by all pair-wise t-tests are less than 0.05, which means that the proposed CMQFS significantly outperforms the other algorithms. In short, the three key reasons why our method performs better than the other MI-based methods compared are as follows:

1. Our novel feature evaluation scheme based on the community modularity Q value captures the discriminative power of each feature while the intrinsic structure of the data is well preserved. A higher Q value indicates that the FVG of a feature has a clear community structure, corresponding to a larger inner degree but a smaller out-degree within communities [40]; that is, the sample points in that feature space have a smaller within-cluster distance and a larger between-cluster distance and can be easily clustered. Hence, the feature is more relevant to the label variable.

2. In contrast, the MI-based feature selection methods compared above have been criticized for a crucial limitation: intrinsic information in the raw data can be lost because the probability distribution of the feature vectors is estimated by discretizing the feature variables. Furthermore, MI-based methods perform poorly when samples are insufficient, since the probability of a feature vector is then estimated incorrectly (a short illustration is given after this list).

3. The relevancy among selected features is analyzed using relevant independency, instead of the irrelevant redundancy and relevant redundancy widely used in many MI-based methods, which enables our method to select discriminative features as groups.
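The illustration referred to in point 2 above: when mutual information is estimated by discretising a continuous feature, the estimate depends on the bin count and becomes unstable for small samples. The snippet is a hedged example on synthetic data; the sample size, bin counts and random seed are arbitrary choices.

```python
# Illustration of point 2: an MI estimate obtained by discretisation shifts with
# the number of bins, especially for a small sample. Synthetic data only.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=60)                  # small sample, two classes
x = y + rng.normal(scale=1.0, size=60)           # feature weakly related to the label

for bins in (5, 10, 20):
    edges = np.histogram_bin_edges(x, bins=bins)[1:-1]
    xd = np.digitize(x, edges)
    print(bins, round(mutual_info_score(y, xd), 3))   # estimate varies with binning
```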

As presented above, the CMQFS algorithm in some cases suffers from the resolution limit and does not handle imbalanced datasets well; furthermore, our method is supervised, so an unsupervised variant of CMQFS is required when label information is not available. Our future research will be directed at solving these issues.

Acknowledgment

The authors would like to thank the anonymous referees for their comments, which significantly helped improve the paper.

References

[1] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (2002) 389–422.
[2] P. Bermejo, L. de la Ossa, J.A. Gámez, J.M. Puerta, Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking, Knowl.-Based Syst. 25 (2012) 35–44.
[3] M. ElAlami, A filter model for feature subset selection based on genetic algorithm, Knowl.-Based Syst. 22 (2009) 356–362.
[4] Q. He, C. Wu, D. Chen, S. Zhao, Fuzzy rough set based attribute reduction for information systems with fuzzy decisions, Knowl.-Based Syst. 24 (2011) 689–696.
[5] H. Liu, J. Sun, L. Liu, H. Zhang, Feature selection with dynamic mutual information, Pattern Recognit. 42 (2009) 1330–1339.
[6] X.F. He, D. Cai, P. Niyogi, Laplacian score for feature selection, in: Proceedings of Neural Information Processing Systems, Cambridge, 2005, pp. 505–512.
[7] Y.Z. Ren, G.J. Zhang, G.X. Yu, X. Li, Local and global structure preserving based feature selection, Neurocomputing 89 (2012) 147–157.
[8] W. Hu, K.-S. Choi, Y. Gu, S. Wang, Minimum–maximum local structure information for feature selection, Pattern Recognit. Lett. 34 (5) (2013) 527–535.
[9] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, USA, 1996, pp. 1–18.
[10] D.Q. Zhang, S.C. Chen, Z.H. Zhou, Constraint score: a new filter method for feature selection with pairwise constraints, Pattern Recognit. 41 (5) (2008) 1440–1451.
[11] K. Kira, L.A. Rendell, The feature selection problem: traditional methods and a new algorithm, in: Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI Press, 1992, pp. 129–134.
[12] K. Kira, L.A. Rendell, A practical approach to feature selection, in: D. Sleeman, P. Edwards (Eds.), Proceedings of the Ninth International Workshop on Machine Learning, Morgan Kaufmann, 1992, pp. 249–256.
[13] I. Kononenko, Estimating attributes: analysis and extensions of Relief, in: L.D. Raedt, F. Bergadano (Eds.), Proceedings of the European Conference on Machine Learning, Springer Verlag, 1994, pp. 171–182.
[14] M.R. Sikonja, I. Kononenko, An adaptation of Relief for attribute estimation in regression, in: D.H. Fisher (Ed.), Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann, 1997, pp. 296–304.
[15] M.R. Sikonja, I. Kononenko, Theoretical and empirical analysis of ReliefF and RReliefF, Mach. Learn. 53 (1–2) (2003) 23–69.
[16] Z. Deng, F. Chung, S. Wang, Robust relief-feature weighting, margin maximization, and fuzzy optimization, IEEE Trans. Fuzzy Syst. 18 (4) (2010) 726–744.
[17] R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Trans. Neural Netw. 5 (1994) 537–550.
[18] L. Yu, H. Liu, Efficient feature selection via analysis of relevance and redundancy, J. Mach. Learn. Res. 5 (2004) 1205–1224.
[19] H.R. Cheng, Z.G. Qin, C.S. Feng, Y.W., F.G. Li, Conditional mutual information-based feature selection analyzing for synergy and redundancy, ETRI J. 33 (2) (2011) 210–217.
[20] N. Kwak, C.H. Choi, Input feature selection for classification problems, IEEE Trans. Neural Netw. 13 (1) (2002) 143–159.
[21] S. Cang, H. Yu, Mutual information based input feature selection for classification problems, Decis. Support Syst. 54 (2012) 691–698.
[22] H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. 27 (8) (2005) 1226–1238.
[23] P.A. Estévez, M. Tesmer, C.A. Perez, J.M. Zurada, Normalized mutual information feature selection, IEEE Trans. Neural Netw. 20 (2) (2009) 189–201.
[24] J. Martínez Sotoca, F. Pla, Supervised feature selection by clustering using conditional mutual information-based distances, Pattern Recognit. 43 (2010) 2068–2081.
[25] X. Sun, Feature selection using dynamic weights for classification, Knowl.-Based Syst. 37 (2013) 541–549.
[26] G. Brown, A. Pocock, M.J. Zhao, M. Luján, Conditional likelihood maximisation: a unifying framework for information theoretic feature selection, J. Mach. Learn. Res. 13 (2012) 27–66.
[27] C.O. Sakar, A feature selection method based on kernel canonical correlation analysis and the minimum redundancy-maximum relevance filter method, Exp. Syst. Appl. 39 (2012) 3432–3437.
[28] X. Sun, Feature evaluation and selection with cooperative game theory, Pattern Recognit. 45 (2012) 2992–3002.
[29] X. Sun, Y. Liu, J. Li, J. Zhu, X. Liu, H. Chen, Using cooperative game theory to optimize the feature selection problem, Neurocomputing 97 (2012) 86–93.
[30] H. Liu, L. Yu, Toward integrating feature selection algorithms for classification and clustering, IEEE Trans. Knowl. Data Eng. 17 (2005) 491–502.
[31] L.C. Molina, L. Belanche, A. Nebot, Feature selection algorithms: a survey and experimental evaluation, in: Proceedings of IEEE International Conference on Data Mining, IEEE Computer Society, 2002, pp. 306–313.
[32] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[33] G.D. Zhao, Y. Wu, Y.F. Ren, M. Zhu, EAMCD: an efficient algorithm based on minimum coupling distance for community identification in complex networks, Eur. Phys. J. B 86 (2013) 14.
[34] M.E.J. Newman, M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E 69 (2) (2004) 026113.
[35] S. Fortunato, Community detection in graphs, Phys. Rep. 486 (2010) 75–174.
[36] C.W. Hsu, C.J. Lin, A comparison of methods for multi-class support vector machines, IEEE Trans. Neural Netw. 13 (2) (2002) 415–425.
[37] Q. He, C. Wu, D. Chen, S. Zhao, Fuzzy rough set based attribute reduction for information systems with fuzzy decisions, Knowl.-Based Syst. 24 (2011) 689–696.
[38] Y. Chen, D. Miao, R. Wang, K. Wu, A rough set approach to feature selection based on power set tree, Knowl.-Based Syst. 24 (2011) 275–281.
[39] Y. Hoshida, J.P. Brunet, P. Tamayo, T.R. Golub, J.P. Mesirov, Subclass mapping: identifying common subtypes in independent disease data sets, PLoS One 2 (11) (2007) e1195, http://dx.doi.org/10.1371/journal.pone.0001195.
[40] H.J. Li, X.S. Zhang, Analysis of stability of community structure across multiple hierarchical levels, Europhys. Lett. 103 (5) (2014) 8002.
[41] W. Dong, M. Charikar, K. Li, Efficient k-nearest neighbor graph construction for generic similarity measures, in: Proceedings of the International World Wide Web Conference Committee (IW3C2), March 28–April 1, 2011, Hyderabad, India.
[42] C. Boutsidis, P. Drineas, M.W. Mahoney, Unsupervised feature selection for the k-means clustering problem, Advances in Neural Information Processing Systems (2009) 153–161.
[43] S. Alelyani, J. Tang, H. Liu, Feature selection for clustering: a review, in: Data Clustering: Algorithms and Applications, Arizona State University Book, 2013, pp. 29–60.
[44] J. Goldberger, S. Roweis, G. Hinton, R. Salakhutdinov, Neighbourhood components analysis, Advances in Neural Information Processing Systems 17 (2005) 513–520.
[45] G. Wang, F. Lochovsky, Q. Yang, Feature selection with conditional mutual information maximin in text categorization, in: CIKM '04: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 342–349.
[46] Y. Xu, D. Rockmore, Feature selection for link prediction, in: PIKM '12: Proceedings of the 5th Ph.D. Workshop on Information and Knowledge, 2012, pp. 25–32.
[47] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[48] Y. Sun, Iterative RELIEF for feature weighting: algorithms, theories, and applications, IEEE Trans. Pattern Anal. Mach. Intell. 29 (6) (2007) 1035–1051.
[49] B. Chen, H. Liu, J. Chai, Large margin feature weighting method via linear programming, IEEE Trans. Knowl. Data Eng. 21 (10) (2009) 1475–1488.
[50] S. Fortunato, M. Barthélemy, Resolution limit in community detection, Proc. Natl. Acad. Sci. USA 104 (2007) 36–41.
[51] G. Chandrashekar, F. Sahin, A survey on feature selection methods, Comput. Electr. Eng. 40 (2014) 16–28.
[52] O. Kursun, C.O. Sakar, O. Favorov, N. Aydin, F. Gurgen, Using covariates for improving the minimum redundancy maximum relevance feature selection method, Turk. J. Electr. Eng. Comput. Sci. 18 (6) (2010) 975–989.

Guodong Zhao is a Ph.D. candidate at the School of Electronics and Information Engineering of Tongji University, Shanghai, China. Currently, he is a teacher at Shanghai Dianji University, Shanghai, China. He received his M.Sc. degree from Huazhong University of Science and Technology, Wuhan, China, in 2009. His current research interests include machine learning, pattern recognition and bioinformatics.

Yan Wu is a full professor and Doctoral Advisor in the College of Electronics and Information Engineering, Tongji University, Shanghai, China. She received her Ph.D. degree in traffic information engineering and control from Shanghai Tiedao University, China, in 1999. From 2000 to 2003, she worked as a Postdoctoral Research Fellow in the Department of Electric Engineering, Fudan University, China. She has published more than 100 papers in important national and international journals and conference proceedings. She is now mainly engaged in deep learning, intelligent information processing, and pattern recognition.

Fuqiang Chen is a master candidate in the Computational Intelligence and Technology Group at the College of Electronics and Information Engineering, Tongji University, Shanghai, China. He received his B.Sc. degree in Applied Mathematics from Shandong University, Weihai, China, in 2012. His current research interests include machine learning, data mining, computer vision and pattern recognition.

Junming Zhang is currently a Ph.D. student at the School of Electronics and Information Engineering of Tongji University, Shanghai, China. He received his M.S. in control engineering from Changchun University of Technology, Changchun, China, in 2009 and his B.S. in computer science and technology from Henan University of Science and Technology, Henan, China, in 2006. His research interests include machine learning, biomedical signal processing, and automatic sleep staging.

Jing Bai is a master candidate at the School of Electronics and Information Engineering of Tongji University, Shanghai, China. She received her B.Sc. degree in information management and information system with honors from Henan University, Henan, China, in 2012. Her current research interests include machine learning and pattern recognition, especially unsupervised learning in deep learning.

