Mining ArgoUML with Dynamic Analysis to Establish a Set of Key Classes for Program Comprehension
Research and Application of Dense Subgraph Search Algorithms in Large-Scale Temporal Graphs

A temporal graph is a graph model commonly used to describe time-series data, and it is widely applied in many fields such as traffic-flow analysis and social network analysis.
As data volumes continue to grow, the analysis and mining of large-scale temporal graphs has become an important research topic.
In particular, the study and application of dense subgraph search algorithms play a significant role in the analysis of large-scale temporal graphs.
A dense subgraph is a subgraph whose nodes are heavily interconnected, that is, one with a high edge density.
In a temporal graph, a dense subgraph can indicate similarity or correlation among its nodes.
Searching for dense subgraphs in a large-scale temporal graph therefore helps us discover sets of nodes with similar features or strong correlations, which in turn supports deeper analysis and mining of the underlying regularities in the time-series data.
Researchers have proposed many effective dense subgraph search algorithms for large-scale temporal graphs.
One commonly used approach is graph-based clustering.
Such an algorithm partitions the nodes of the temporal graph into clusters, each of which represents a dense subgraph.
By computing similarity measures between nodes and grouping similar nodes together, the dense subgraphs are identified.
Another common approach is based on subgraph mining.
This type of algorithm enumerates all possible subgraphs, computes the density of each, and returns the subgraph with the highest density as the dense subgraph.
To improve efficiency, researchers have also proposed pruning strategies that reduce the computational cost of subgraph enumeration.
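To make the density-based idea concrete, here is a hedged Python sketch: a generic illustration, not one of the temporal-graph algorithms surveyed here. It measures the density of a node set as internal edges per node and applies the classic greedy peeling heuristic, repeatedly removing the lowest-degree node and keeping the densest intermediate subgraph. The adjacency-list representation and function names are assumptions of this sketch.

```python
from collections import defaultdict

def density(adj, nodes):
    """Average-degree density: edges inside `nodes` divided by |nodes|."""
    nodes = set(nodes)
    edges_inside = sum(1 for u in nodes for v in adj[u] if v in nodes and u < v)
    return edges_inside / len(nodes) if nodes else 0.0

def greedy_densest_subgraph(adj):
    """Greedy peeling: repeatedly drop the minimum-degree node,
    remembering the densest intermediate subgraph seen along the way."""
    nodes = set(adj)
    deg = {u: len(adj[u]) for u in nodes}
    best_nodes, best_density = set(nodes), density(adj, nodes)
    while len(nodes) > 1:
        u = min(nodes, key=deg.get)          # lowest-degree node
        nodes.remove(u)
        for v in adj[u]:
            if v in nodes:
                deg[v] -= 1
        d = density(adj, nodes)
        if d > best_density:
            best_nodes, best_density = set(nodes), d
    return best_nodes, best_density

# Toy example: a 4-clique loosely attached to two extra nodes.
adj = defaultdict(set)
for u, v in [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]:
    adj[u].add(v)
    adj[v].add(u)
print(greedy_densest_subgraph(adj))  # expected: the 4-clique {1, 2, 3, 4}, density 1.5
```

Pruning strategies of the kind mentioned above would typically be layered on top of such a search, for example by discarding candidate node sets whose best achievable density already falls below the current best.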
Beyond algorithmic research, dense subgraph search on large-scale temporal graphs has also been widely applied.
For example, in traffic-flow analysis, searching for dense subgraphs can reveal sets of nodes with similar vehicle-movement patterns, which helps predict traffic congestion.
In social network analysis, dense subgraph search helps discover groups of users with similar interests, supporting applications such as personalized recommendation.
In summary, the research and application of dense subgraph search algorithms in large-scale temporal graphs is of considerable importance.
These algorithms allow us to extract node sets with similar features or strong correlations from massive temporal data, providing support for data analysis and mining.
As data volumes keep growing, there is still ample room for further development of dense subgraph search algorithms, and they will continue to have a substantial impact on research and practice across many fields.
dynamicreports Crosstabs: Overview and Explanation

1. Introduction. 1.1 Overview. Dynamic reporting is a highly customizable and automated approach to report generation; a dynamic reporting library can be used to create a wide variety of reports, including but not limited to crosstabs.
In data analysis and data visualization, the crosstab is a commonly used tool for showing the relationships between variables and their influence on an outcome.
This article focuses on the crosstab feature of the dynamicreports library, a powerful Java report-generation library that offers rich reporting options and flexible layouts.
It helps developers easily produce polished reports and is both extensible and customizable.
In this article, we first introduce the basic concepts of dynamic reporting, including its principles and usage.
We then describe the features and capabilities of the dynamicreports library in detail, along with the support it provides for building crosstabs.
We explain the concept and applications of crosstabs and show how to use the dynamicreports library to create crosstab reports.
In the conclusion, we summarize the advantages of dynamic reporting, including its flexibility and efficiency in report generation.
We also evaluate the crosstab functionality of the dynamicreports library and discuss directions for future development.
After reading this article, readers will have a comprehensive understanding of dynamic reporting and crosstabs, and of how to use the dynamicreports library to create crosstab reports.
Readers will also gain deeper insight into the library's features and capabilities and can explore its value in data analysis and report generation.
1.2 Article structure. This article is organized as follows. The first part is the introduction, which gives an overview of the subject and describes the structure and purpose of the article.
The introduction presents the basic concepts of dynamic reporting and crosstabs, as well as the goals of this article.
The second part is the main body, which centers on dynamic reporting and crosstabs.
It first introduces the basic concepts of dynamic reporting, including its uses and characteristics.
It then describes the dynamicreports library in detail, including its features and application scenarios.
Next, it covers the concept and applications of crosstabs and their importance in data analysis.
Finally, it explains the crosstab functionality of the dynamicreports library in detail, including how to create and customize crosstabs and how to use them for data analysis and visualization.
From Data Mining to Knowledge Discovery in Databases

s Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media atten-tion of late. What is all the excitement about?This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges in-volved in real-world applications of knowledge discovery, and current and future research direc-tions in the field.A cross a wide variety of fields, data arebeing collected and accumulated at adramatic pace. There is an urgent need for a new generation of computational theo-ries and tools to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of digital data. These theories and tools are the subject of the emerging field of knowledge discovery in databases (KDD).At an abstract level, the KDD field is con-cerned with the development of methods and techniques for making sense of data. The basic problem addressed by the KDD process is one of mapping low-level data (which are typically too voluminous to understand and digest easi-ly) into other forms that might be more com-pact (for example, a short report), more ab-stract (for example, a descriptive approximation or model of the process that generated the data), or more useful (for exam-ple, a predictive model for estimating the val-ue of future cases). At the core of the process is the application of specific data-mining meth-ods for pattern discovery and extraction.1This article begins by discussing the histori-cal context of KDD and data mining and theirintersection with other related fields. A briefsummary of recent KDD real-world applica-tions is provided. Definitions of KDD and da-ta mining are provided, and the general mul-tistep KDD process is outlined. This multistepprocess has the application of data-mining al-gorithms as one particular step in the process.The data-mining step is discussed in more de-tail in the context of specific data-mining al-gorithms and their application. Real-worldpractical application issues are also outlined.Finally, the article enumerates challenges forfuture research and development and in par-ticular discusses potential opportunities for AItechnology in KDD systems.Why Do We Need KDD?The traditional method of turning data intoknowledge relies on manual analysis and in-terpretation. For example, in the health-careindustry, it is common for specialists to peri-odically analyze current trends and changesin health-care data, say, on a quarterly basis.The specialists then provide a report detailingthe analysis to the sponsoring health-care or-ganization; this report becomes the basis forfuture decision making and planning forhealth-care management. In a totally differ-ent type of application, planetary geologistssift through remotely sensed images of plan-ets and asteroids, carefully locating and cata-loging such geologic objects of interest as im-pact craters. Be it science, marketing, finance,health care, retail, or any other field, the clas-sical approach to data analysis relies funda-mentally on one or more analysts becomingArticlesFALL 1996 37From Data Mining to Knowledge Discovery inDatabasesUsama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth Copyright © 1996, American Association for Artificial Intelligence. All rights reserved. 
0738-4602-1996 / $2.00areas is astronomy. Here, a notable success was achieved by SKICAT ,a system used by as-tronomers to perform image analysis,classification, and cataloging of sky objects from sky-survey images (Fayyad, Djorgovski,and Weir 1996). In its first application, the system was used to process the 3 terabytes (1012bytes) of image data resulting from the Second Palomar Observatory Sky Survey,where it is estimated that on the order of 109sky objects are detectable. SKICAT can outper-form humans and traditional computational techniques in classifying faint sky objects. See Fayyad, Haussler, and Stolorz (1996) for a sur-vey of scientific applications.In business, main KDD application areas includes marketing, finance (especially in-vestment), fraud detection, manufacturing,telecommunications, and Internet agents.Marketing:In marketing, the primary ap-plication is database marketing systems,which analyze customer databases to identify different customer groups and forecast their behavior. Business Week (Berry 1994) estimat-ed that over half of all retailers are using or planning to use database marketing, and those who do use it have good results; for ex-ample, American Express reports a 10- to 15-percent increase in credit-card use. Another notable marketing application is market-bas-ket analysis (Agrawal et al. 1996) systems,which find patterns such as, “If customer bought X, he/she is also likely to buy Y and Z.” Such patterns are valuable to retailers.Investment: Numerous companies use da-ta mining for investment, but most do not describe their systems. One exception is LBS Capital Management. Its system uses expert systems, neural nets, and genetic algorithms to manage portfolios totaling $600 million;since its start in 1993, the system has outper-formed the broad stock market (Hall, Mani,and Barr 1996).Fraud detection: HNC Falcon and Nestor PRISM systems are used for monitoring credit-card fraud, watching over millions of ac-counts. The FAIS system (Senator et al. 1995),from the U.S. Treasury Financial Crimes En-forcement Network, is used to identify finan-cial transactions that might indicate money-laundering activity.Manufacturing: The CASSIOPEE trou-bleshooting system, developed as part of a joint venture between General Electric and SNECMA, was applied by three major Euro-pean airlines to diagnose and predict prob-lems for the Boeing 737. To derive families of faults, clustering methods are used. CASSIOPEE received the European first prize for innova-intimately familiar with the data and serving as an interface between the data and the users and products.For these (and many other) applications,this form of manual probing of a data set is slow, expensive, and highly subjective. In fact, as data volumes grow dramatically, this type of manual data analysis is becoming completely impractical in many domains.Databases are increasing in size in two ways:(1) the number N of records or objects in the database and (2) the number d of fields or at-tributes to an object. Databases containing on the order of N = 109objects are becoming in-creasingly common, for example, in the as-tronomical sciences. Similarly, the number of fields d can easily be on the order of 102or even 103, for example, in medical diagnostic applications. Who could be expected to di-gest millions of records, each having tens or hundreds of fields? 
We believe that this job is certainly not one for humans; hence, analysis work needs to be automated, at least partially.The need to scale up human analysis capa-bilities to handling the large number of bytes that we can collect is both economic and sci-entific. Businesses use data to gain competi-tive advantage, increase efficiency, and pro-vide more valuable services to customers.Data we capture about our environment are the basic evidence we use to build theories and models of the universe we live in. Be-cause computers have enabled humans to gather more data than we can digest, it is on-ly natural to turn to computational tech-niques to help us unearth meaningful pat-terns and structures from the massive volumes of data. Hence, KDD is an attempt to address a problem that the digital informa-tion era made a fact of life for all of us: data overload.Data Mining and Knowledge Discovery in the Real WorldA large degree of the current interest in KDD is the result of the media interest surrounding successful KDD applications, for example, the focus articles within the last two years in Business Week , Newsweek , Byte , PC Week , and other large-circulation periodicals. Unfortu-nately, it is not always easy to separate fact from media hype. Nonetheless, several well-documented examples of successful systems can rightly be referred to as KDD applications and have been deployed in operational use on large-scale real-world problems in science and in business.In science, one of the primary applicationThere is an urgent need for a new generation of computation-al theories and tools toassist humans in extractinguseful information (knowledge)from the rapidly growing volumes ofdigital data.Articles38AI MAGAZINEtive applications (Manago and Auriol 1996).Telecommunications: The telecommuni-cations alarm-sequence analyzer (TASA) wasbuilt in cooperation with a manufacturer oftelecommunications equipment and threetelephone networks (Mannila, Toivonen, andVerkamo 1995). The system uses a novelframework for locating frequently occurringalarm episodes from the alarm stream andpresenting them as rules. Large sets of discov-ered rules can be explored with flexible infor-mation-retrieval tools supporting interactivityand iteration. In this way, TASA offers pruning,grouping, and ordering tools to refine the re-sults of a basic brute-force search for rules.Data cleaning: The MERGE-PURGE systemwas applied to the identification of duplicatewelfare claims (Hernandez and Stolfo 1995).It was used successfully on data from the Wel-fare Department of the State of Washington.In other areas, a well-publicized system isIBM’s ADVANCED SCOUT,a specialized data-min-ing system that helps National Basketball As-sociation (NBA) coaches organize and inter-pret data from NBA games (U.S. News 1995). ADVANCED SCOUT was used by several of the NBA teams in 1996, including the Seattle Su-personics, which reached the NBA finals.Finally, a novel and increasingly importanttype of discovery is one based on the use of in-telligent agents to navigate through an infor-mation-rich environment. Although the ideaof active triggers has long been analyzed in thedatabase field, really successful applications ofthis idea appeared only with the advent of theInternet. These systems ask the user to specifya profile of interest and search for related in-formation among a wide variety of public-do-main and proprietary sources. 
For example, FIREFLY is a personal music-recommendation agent: It asks a user his/her opinion of several music pieces and then suggests other music that the user might like (<http:// www.ffl/>). CRAYON(/>) allows users to create their own free newspaper (supported by ads); NEWSHOUND(<http://www. /hound/>) from the San Jose Mercury News and FARCAST(</> automatically search information from a wide variety of sources, including newspapers and wire services, and e-mail rele-vant documents directly to the user.These are just a few of the numerous suchsystems that use KDD techniques to automat-ically produce useful information from largemasses of raw data. See Piatetsky-Shapiro etal. (1996) for an overview of issues in devel-oping industrial KDD applications.Data Mining and KDDHistorically, the notion of finding useful pat-terns in data has been given a variety ofnames, including data mining, knowledge ex-traction, information discovery, informationharvesting, data archaeology, and data patternprocessing. The term data mining has mostlybeen used by statisticians, data analysts, andthe management information systems (MIS)communities. It has also gained popularity inthe database field. The phrase knowledge dis-covery in databases was coined at the first KDDworkshop in 1989 (Piatetsky-Shapiro 1991) toemphasize that knowledge is the end productof a data-driven discovery. It has been popular-ized in the AI and machine-learning fields.In our view, KDD refers to the overall pro-cess of discovering useful knowledge from da-ta, and data mining refers to a particular stepin this process. Data mining is the applicationof specific algorithms for extracting patternsfrom data. The distinction between the KDDprocess and the data-mining step (within theprocess) is a central point of this article. Theadditional steps in the KDD process, such asdata preparation, data selection, data cleaning,incorporation of appropriate prior knowledge,and proper interpretation of the results ofmining, are essential to ensure that usefulknowledge is derived from the data. Blind ap-plication of data-mining methods (rightly crit-icized as data dredging in the statistical litera-ture) can be a dangerous activity, easilyleading to the discovery of meaningless andinvalid patterns.The Interdisciplinary Nature of KDDKDD has evolved, and continues to evolve,from the intersection of research fields such asmachine learning, pattern recognition,databases, statistics, AI, knowledge acquisitionfor expert systems, data visualization, andhigh-performance computing. The unifyinggoal is extracting high-level knowledge fromlow-level data in the context of large data sets.The data-mining component of KDD cur-rently relies heavily on known techniquesfrom machine learning, pattern recognition,and statistics to find patterns from data in thedata-mining step of the KDD process. A natu-ral question is, How is KDD different from pat-tern recognition or machine learning (and re-lated fields)? The answer is that these fieldsprovide some of the data-mining methodsthat are used in the data-mining step of theKDD process. KDD focuses on the overall pro-cess of knowledge discovery from data, includ-ing how the data are stored and accessed, howalgorithms can be scaled to massive data setsThe basicproblemaddressed bythe KDDprocess isone ofmappinglow-leveldata intoother formsthat might bemorecompact,moreabstract,or moreuseful.ArticlesFALL 1996 39A driving force behind KDD is the database field (the second D in KDD). 
Indeed, the problem of effective data manipulation when data cannot fit in the main memory is of fun-damental importance to KDD. Database tech-niques for gaining efficient data access,grouping and ordering operations when ac-cessing data, and optimizing queries consti-tute the basics for scaling algorithms to larger data sets. Most data-mining algorithms from statistics, pattern recognition, and machine learning assume data are in the main memo-ry and pay no attention to how the algorithm breaks down if only limited views of the data are possible.A related field evolving from databases is data warehousing,which refers to the popular business trend of collecting and cleaning transactional data to make them available for online analysis and decision support. Data warehousing helps set the stage for KDD in two important ways: (1) data cleaning and (2)data access.Data cleaning: As organizations are forced to think about a unified logical view of the wide variety of data and databases they pos-sess, they have to address the issues of map-ping data to a single naming convention,uniformly representing and handling missing data, and handling noise and errors when possible.Data access: Uniform and well-defined methods must be created for accessing the da-ta and providing access paths to data that were historically difficult to get to (for exam-ple, stored offline).Once organizations and individuals have solved the problem of how to store and ac-cess their data, the natural next step is the question, What else do we do with all the da-ta? This is where opportunities for KDD natu-rally arise.A popular approach for analysis of data warehouses is called online analytical processing (OLAP), named for a set of principles pro-posed by Codd (1993). OLAP tools focus on providing multidimensional data analysis,which is superior to SQL in computing sum-maries and breakdowns along many dimen-sions. OLAP tools are targeted toward simpli-fying and supporting interactive data analysis,but the goal of KDD tools is to automate as much of the process as possible. Thus, KDD is a step beyond what is currently supported by most standard database systems.Basic DefinitionsKDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimate-and still run efficiently, how results can be in-terpreted and visualized, and how the overall man-machine interaction can usefully be modeled and supported. The KDD process can be viewed as a multidisciplinary activity that encompasses techniques beyond the scope of any one particular discipline such as machine learning. In this context, there are clear opportunities for other fields of AI (be-sides machine learning) to contribute to KDD. KDD places a special emphasis on find-ing understandable patterns that can be inter-preted as useful or interesting knowledge.Thus, for example, neural networks, although a powerful modeling tool, are relatively difficult to understand compared to decision trees. KDD also emphasizes scaling and ro-bustness properties of modeling algorithms for large noisy data sets.Related AI research fields include machine discovery, which targets the discovery of em-pirical laws from observation and experimen-tation (Shrager and Langley 1990) (see Kloes-gen and Zytkow [1996] for a glossary of terms common to KDD and machine discovery),and causal modeling for the inference of causal models from data (Spirtes, Glymour,and Scheines 1993). 
Statistics in particular has much in common with KDD (see Elder and Pregibon [1996] and Glymour et al.[1996] for a more detailed discussion of this synergy). Knowledge discovery from data is fundamentally a statistical endeavor. Statistics provides a language and framework for quan-tifying the uncertainty that results when one tries to infer general patterns from a particu-lar sample of an overall population. As men-tioned earlier, the term data mining has had negative connotations in statistics since the 1960s when computer-based data analysis techniques were first introduced. The concern arose because if one searches long enough in any data set (even randomly generated data),one can find patterns that appear to be statis-tically significant but, in fact, are not. Clearly,this issue is of fundamental importance to KDD. Substantial progress has been made in recent years in understanding such issues in statistics. Much of this work is of direct rele-vance to KDD. Thus, data mining is a legiti-mate activity as long as one understands how to do it correctly; data mining carried out poorly (without regard to the statistical as-pects of the problem) is to be avoided. KDD can also be viewed as encompassing a broader view of modeling than statistics. KDD aims to provide tools to automate (to the degree pos-sible) the entire process of data analysis and the statistician’s “art” of hypothesis selection.Data mining is a step in the KDD process that consists of ap-plying data analysis and discovery al-gorithms that produce a par-ticular enu-meration ofpatterns (or models)over the data.Articles40AI MAGAZINEly understandable patterns in data (Fayyad, Piatetsky-Shapiro, and Smyth 1996).Here, data are a set of facts (for example, cases in a database), and pattern is an expres-sion in some language describing a subset of the data or a model applicable to the subset. Hence, in our usage here, extracting a pattern also designates fitting a model to data; find-ing structure from data; or, in general, mak-ing any high-level description of a set of data. The term process implies that KDD comprises many steps, which involve data preparation, search for patterns, knowledge evaluation, and refinement, all repeated in multiple itera-tions. By nontrivial, we mean that some search or inference is involved; that is, it is not a straightforward computation of predefined quantities like computing the av-erage value of a set of numbers.The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system and preferably to the user) and poten-tially useful, that is, lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some postprocessing.The previous discussion implies that we can define quantitative measures for evaluating extracted patterns. In many cases, it is possi-ble to define measures of certainty (for exam-ple, estimated prediction accuracy on new data) or utility (for example, gain, perhaps indollars saved because of better predictions orspeedup in response time of a system). No-tions such as novelty and understandabilityare much more subjective. In certain contexts,understandability can be estimated by sim-plicity (for example, the number of bits to de-scribe a pattern). 
An important notion, calledinterestingness(for example, see Silberschatzand Tuzhilin [1995] and Piatetsky-Shapiro andMatheus [1994]), is usually taken as an overallmeasure of pattern value, combining validity,novelty, usefulness, and simplicity. Interest-ingness functions can be defined explicitly orcan be manifested implicitly through an or-dering placed by the KDD system on the dis-covered patterns or models.Given these notions, we can consider apattern to be knowledge if it exceeds some in-terestingness threshold, which is by nomeans an attempt to define knowledge in thephilosophical or even the popular view. As amatter of fact, knowledge in this definition ispurely user oriented and domain specific andis determined by whatever functions andthresholds the user chooses.Data mining is a step in the KDD processthat consists of applying data analysis anddiscovery algorithms that, under acceptablecomputational efficiency limitations, pro-duce a particular enumeration of patterns (ormodels) over the data. Note that the space ofArticlesFALL 1996 41Figure 1. An Overview of the Steps That Compose the KDD Process.methods, the effective number of variables under consideration can be reduced, or in-variant representations for the data can be found.Fifth is matching the goals of the KDD pro-cess (step 1) to a particular data-mining method. For example, summarization, clas-sification, regression, clustering, and so on,are described later as well as in Fayyad, Piatet-sky-Shapiro, and Smyth (1996).Sixth is exploratory analysis and model and hypothesis selection: choosing the data-mining algorithm(s) and selecting method(s)to be used for searching for data patterns.This process includes deciding which models and parameters might be appropriate (for ex-ample, models of categorical data are differ-ent than models of vectors over the reals) and matching a particular data-mining method with the overall criteria of the KDD process (for example, the end user might be more in-terested in understanding the model than its predictive capabilities).Seventh is data mining: searching for pat-terns of interest in a particular representa-tional form or a set of such representations,including classification rules or trees, regres-sion, and clustering. The user can significant-ly aid the data-mining method by correctly performing the preceding steps.Eighth is interpreting mined patterns, pos-sibly returning to any of steps 1 through 7 for further iteration. This step can also involve visualization of the extracted patterns and models or visualization of the data given the extracted models.Ninth is acting on the discovered knowl-edge: using the knowledge directly, incorpo-rating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties. This process also includes checking for and resolving po-tential conflicts with previously believed (or extracted) knowledge.The KDD process can involve significant iteration and can contain loops between any two steps. The basic flow of steps (al-though not the potential multitude of itera-tions and loops) is illustrated in figure 1.Most previous work on KDD has focused on step 7, the data mining. However, the other steps are as important (and probably more so) for the successful application of KDD in practice. 
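As a schematic, hedged sketch of the multistep nature of the process (not code from the article), the skeleton below wires a few of the steps together as plain functions around a trivial "mining" step; every function body is a placeholder standing in for the much richer activities described above, and all names and data are invented for illustration.

```python
def select_target_data(database):
    """Step 2: focus on the records relevant to the goal of the analysis."""
    return [rec for rec in database if rec is not None]

def clean_and_preprocess(records):
    """Step 3: drop records with missing fields (a placeholder cleaning rule)."""
    return [rec for rec in records if "income" in rec and "debt" in rec]

def reduce_and_project(records):
    """Step 4: keep only the features judged useful for the task."""
    return [(rec["income"], rec["debt"], rec["defaulted"]) for rec in records]

def mine_patterns(samples):
    """Step 7: the data-mining step proper -- here only a trivial summary pattern."""
    defaulted = [s for s in samples if s[2]]
    return {"default_rate": len(defaulted) / len(samples) if samples else 0.0}

def interpret(patterns):
    """Step 8: evaluate and interpret; in practice this loops back to earlier steps."""
    return f"estimated default rate: {patterns['default_rate']:.1%}"

database = [
    {"income": 50, "debt": 30, "defaulted": True},
    {"income": 90, "debt": 10, "defaulted": False},
    None,  # a corrupted record that the selection step drops
]
print(interpret(mine_patterns(reduce_and_project(
    clean_and_preprocess(select_target_data(database))))))
```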
Having defined the basic notions and introduced the KDD process, we now focus on the data-mining component,which has, by far, received the most atten-tion in the literature.patterns is often infinite, and the enumera-tion of patterns involves some form of search in this space. Practical computational constraints place severe limits on the sub-space that can be explored by a data-mining algorithm.The KDD process involves using the database along with any required selection,preprocessing, subsampling, and transforma-tions of it; applying data-mining methods (algorithms) to enumerate patterns from it;and evaluating the products of data mining to identify the subset of the enumerated pat-terns deemed knowledge. The data-mining component of the KDD process is concerned with the algorithmic means by which pat-terns are extracted and enumerated from da-ta. The overall KDD process (figure 1) in-cludes the evaluation and possible interpretation of the mined patterns to de-termine which patterns can be considered new knowledge. The KDD process also in-cludes all the additional steps described in the next section.The notion of an overall user-driven pro-cess is not unique to KDD: analogous propos-als have been put forward both in statistics (Hand 1994) and in machine learning (Brod-ley and Smyth 1996).The KDD ProcessThe KDD process is interactive and iterative,involving numerous steps with many deci-sions made by the user. Brachman and Anand (1996) give a practical view of the KDD pro-cess, emphasizing the interactive nature of the process. Here, we broadly outline some of its basic steps:First is developing an understanding of the application domain and the relevant prior knowledge and identifying the goal of the KDD process from the customer’s viewpoint.Second is creating a target data set: select-ing a data set, or focusing on a subset of vari-ables or data samples, on which discovery is to be performed.Third is data cleaning and preprocessing.Basic operations include removing noise if appropriate, collecting the necessary informa-tion to model or account for noise, deciding on strategies for handling missing data fields,and accounting for time-sequence informa-tion and known changes.Fourth is data reduction and projection:finding useful features to represent the data depending on the goal of the task. With di-mensionality reduction or transformationArticles42AI MAGAZINEThe Data-Mining Stepof the KDD ProcessThe data-mining component of the KDD pro-cess often involves repeated iterative applica-tion of particular data-mining methods. This section presents an overview of the primary goals of data mining, a description of the methods used to address these goals, and a brief description of the data-mining algo-rithms that incorporate these methods.The knowledge discovery goals are defined by the intended use of the system. We can distinguish two types of goals: (1) verification and (2) discovery. With verification,the sys-tem is limited to verifying the user’s hypothe-sis. With discovery,the system autonomously finds new patterns. We further subdivide the discovery goal into prediction,where the sys-tem finds patterns for predicting the future behavior of some entities, and description, where the system finds patterns for presenta-tion to a user in a human-understandableform. In this article, we are primarily con-cerned with discovery-oriented data mining.Data mining involves fitting models to, or determining patterns from, observed data. 
The fitted models play the role of inferred knowledge: whether the models reflect useful or interesting knowledge is part of the overall, interactive KDD process where subjective human judgment is typically required. Two primary mathematical formalisms are used in model fitting: (1) statistical and (2) logical. The statistical approach allows for nondeterministic effects in the model, whereas a logical model is purely deterministic. We focus primarily on the statistical approach to data mining, which tends to be the most widely used basis for practical data-mining applications given the typical presence of uncertainty in real-world data-generating processes.

Most data-mining methods are based on tried and tested techniques from machine learning, pattern recognition, and statistics: classification, clustering, regression, and so on. The array of different algorithms under each of these headings can often be bewildering to both the novice and the experienced data analyst. It should be emphasized that of the many data-mining methods advertised in the literature, there are really only a few fundamental techniques. The actual underlying model representation being used by a particular method typically comes from a composition of a small number of well-known options: polynomials, splines, kernel and basis functions, threshold-Boolean functions, and so on. Thus, algorithms tend to differ primarily in the goodness-of-fit criterion used to evaluate model fit or in the search method used to find a good fit.

In our brief overview of data-mining methods, we try in particular to convey the notion that most (if not all) methods can be viewed as extensions or hybrids of a few basic techniques and principles. We first discuss the primary methods of data mining and then show that the data-mining methods can be viewed as consisting of three primary algorithmic components: (1) model representation, (2) model evaluation, and (3) search. In the discussion of KDD and data-mining methods, we use a simple example to make some of the notions more concrete. Figure 2 shows a simple two-dimensional artificial data set consisting of 23 cases. Each point on the graph represents a person who has been given a loan by a particular bank at some time in the past. The horizontal axis represents the income of the person; the vertical axis represents the total personal debt of the person (mortgage, car payments, and so on). The data have been classified into two classes: (1) the x's represent persons who have defaulted on their loans and (2) the o's represent persons whose loans are in good status with the bank. Thus, this simple artificial data set could represent a historical data set that can contain useful knowledge from the point of view of the bank making the loans. Note that in actual KDD applications, there are typically many more dimensions (as many as several hundreds) and many more data points (many thousands or even millions).

Figure 2. A Simple Data Set with Two Classes Used for Illustrative Purposes.
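To make this example concrete, the following hedged Python sketch builds a synthetic stand-in for the 23-case loan data (the numbers are invented, not the article's actual figures) and fits the simplest possible model, a single income threshold. It illustrates what "fitting a model to observed data" means at this scale; real applications would use richer models and far larger data.

```python
import random

# Hypothetical stand-in for the paper's 23-case loan data set: each record is
# (income, debt), labeled defaulted = True/False by an invented rule.
random.seed(0)
data = [(random.uniform(20, 120), random.uniform(0, 60)) for _ in range(23)]
labeled = [(inc, debt, debt > 0.4 * inc) for inc, debt in data]  # True = defaulted

def best_income_threshold(records):
    """Fit the simplest possible model: a single decision boundary
    'classify as defaulted if income < t', choosing t to minimize errors."""
    candidates = sorted({inc for inc, _, _ in records})
    best_t, best_err = None, len(records) + 1
    for t in candidates:
        err = sum((inc < t) != defaulted for inc, _, defaulted in records)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

t, err = best_income_threshold(labeled)
print(f"threshold income = {t:.1f}, training errors = {err} of {len(labeled)}")
```

That such a one-dimensional boundary leaves training errors is itself instructive: the structure in this toy data depends on both income and debt, which is exactly why model representation and evaluation matter in the discussion that follows.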
A Survey of Clustering Data Mining Techniques

A Survey of Clustering Data Mining Techniques
Pavel Berkhin, Yahoo!, Inc., pberkhin@

Summary. Clustering is the division of data into groups of similar objects. It disregards some details in exchange for data simplification. Formally, clustering can be viewed as data modeling concisely summarizing the data, and, therefore, it relates to many disciplines from statistics to numerical analysis. Clustering plays an important role in a broad range of applications, from information retrieval to CRM. Such applications usually deal with large datasets and many attributes. Exploration of such data is a subject of data mining. This survey concentrates on clustering algorithms from a data mining perspective.

1 Introduction

The goal of this survey is to provide a comprehensive review of different clustering techniques in data mining. Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. Representing data with fewer clusters necessarily loses certain fine details (akin to lossy data compression), but achieves simplification. It represents many data objects by few clusters, and hence, it models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. Therefore, clustering is unsupervised learning of a hidden data concept. Data mining applications add to this general picture three complications: (a) large databases, (b) many attributes, (c) attributes of different types. This imposes severe computational requirements on data analysis. Data mining applications include scientific data exploration, information retrieval, text mining, spatial databases, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. They present real challenges to classic clustering algorithms.
These challenges led to the emergence of powerful broadly applicable data2Pavel Berkhinmining clustering methods developed on the foundation of classic techniques.They are subject of this survey.1.1NotationsTo fix the context and clarify terminology,consider a dataset X consisting of data points (i.e.,objects ,instances ,cases ,patterns ,tuples ,transactions )x i =(x i 1,···,x id ),i =1:N ,in attribute space A ,where each component x il ∈A l ,l =1:d ,is a numerical or nominal categorical attribute (i.e.,feature ,variable ,dimension ,component ,field ).For a discussion of attribute data types see [106].Such point-by-attribute data format conceptually corresponds to a N ×d matrix and is used by a majority of algorithms reviewed below.However,data of other formats,such as variable length sequences and heterogeneous data,are not uncommon.The simplest subset in an attribute space is a direct Cartesian product of sub-ranges C = C l ⊂A ,C l ⊂A l ,called a segment (i.e.,cube ,cell ,region ).A unit is an elementary segment whose sub-ranges consist of a single category value,or of a small numerical bin.Describing the numbers of data points per every unit represents an extreme case of clustering,a histogram .This is a very expensive representation,and not a very revealing er driven segmentation is another commonly used practice in data exploration that utilizes expert knowledge regarding the importance of certain sub-domains.Unlike segmentation,clustering is assumed to be automatic,and so it is a machine learning technique.The ultimate goal of clustering is to assign points to a finite system of k subsets (clusters).Usually (but not always)subsets do not intersect,and their union is equal to a full dataset with the possible exception of outliersX =C 1 ··· C k C outliers ,C i C j =0,i =j.1.2Clustering Bibliography at GlanceGeneral references regarding clustering include [110],[205],[116],[131],[63],[72],[165],[119],[75],[141],[107],[91].A very good introduction to contem-porary data mining clustering techniques can be found in the textbook [106].There is a close relationship between clustering and many other fields.Clustering has always been used in statistics [10]and science [158].The clas-sic introduction into pattern recognition framework is given in [64].Typical applications include speech and character recognition.Machine learning clus-tering algorithms were applied to image segmentation and computer vision[117].For statistical approaches to pattern recognition see [56]and [85].Clus-tering can be viewed as a density estimation problem.This is the subject of traditional multivariate statistical estimation [197].Clustering is also widelyA Survey of Clustering Data Mining Techniques3 used for data compression in image processing,which is also known as vec-tor quantization[89].Datafitting in numerical analysis provides still another venue in data modeling[53].This survey’s emphasis is on clustering in data mining.Such clustering is characterized by large datasets with many attributes of different types. 
Though we do not even try to review particular applications,many important ideas are related to the specificfields.Clustering in data mining was brought to life by intense developments in information retrieval and text mining[52], [206],[58],spatial database applications,for example,GIS or astronomical data,[223],[189],[68],sequence and heterogeneous data analysis[43],Web applications[48],[111],[81],DNA analysis in computational biology[23],and many others.They resulted in a large amount of application-specific devel-opments,but also in some general techniques.These techniques and classic clustering algorithms that relate to them are surveyed below.1.3Plan of Further PresentationClassification of clustering algorithms is neither straightforward,nor canoni-cal.In reality,different classes of algorithms overlap.Traditionally clustering techniques are broadly divided in hierarchical and partitioning.Hierarchical clustering is further subdivided into agglomerative and divisive.The basics of hierarchical clustering include Lance-Williams formula,idea of conceptual clustering,now classic algorithms SLINK,COBWEB,as well as newer algo-rithms CURE and CHAMELEON.We survey these algorithms in the section Hierarchical Clustering.While hierarchical algorithms gradually(dis)assemble points into clusters (as crystals grow),partitioning algorithms learn clusters directly.In doing so they try to discover clusters either by iteratively relocating points between subsets,or by identifying areas heavily populated with data.Algorithms of thefirst kind are called Partitioning Relocation Clustering. They are further classified into probabilistic clustering(EM framework,al-gorithms SNOB,AUTOCLASS,MCLUST),k-medoids methods(algorithms PAM,CLARA,CLARANS,and its extension),and k-means methods(differ-ent schemes,initialization,optimization,harmonic means,extensions).Such methods concentrate on how well pointsfit into their clusters and tend to build clusters of proper convex shapes.Partitioning algorithms of the second type are surveyed in the section Density-Based Partitioning.They attempt to discover dense connected com-ponents of data,which areflexible in terms of their shape.Density-based connectivity is used in the algorithms DBSCAN,OPTICS,DBCLASD,while the algorithm DENCLUE exploits space density functions.These algorithms are less sensitive to outliers and can discover clusters of irregular shape.They usually work with low-dimensional numerical data,known as spatial data. 
Spatial objects could include not only points,but also geometrically extended objects(algorithm GDBSCAN).4Pavel BerkhinSome algorithms work with data indirectly by constructing summaries of data over the attribute space subsets.They perform space segmentation and then aggregate appropriate segments.We discuss them in the section Grid-Based Methods.They frequently use hierarchical agglomeration as one phase of processing.Algorithms BANG,STING,WaveCluster,and FC are discussed in this section.Grid-based methods are fast and handle outliers well.Grid-based methodology is also used as an intermediate step in many other algorithms (for example,CLIQUE,MAFIA).Categorical data is intimately connected with transactional databases.The concept of a similarity alone is not sufficient for clustering such data.The idea of categorical data co-occurrence comes to the rescue.The algorithms ROCK,SNN,and CACTUS are surveyed in the section Co-Occurrence of Categorical Data.The situation gets even more aggravated with the growth of the number of items involved.To help with this problem the effort is shifted from data clustering to pre-clustering of items or categorical attribute values. Development based on hyper-graph partitioning and the algorithm STIRR exemplify this approach.Many other clustering techniques are developed,primarily in machine learning,that either have theoretical significance,are used traditionally out-side the data mining community,or do notfit in previously outlined categories. The boundary is blurred.In the section Other Developments we discuss the emerging direction of constraint-based clustering,the important researchfield of graph partitioning,and the relationship of clustering to supervised learning, gradient descent,artificial neural networks,and evolutionary methods.Data Mining primarily works with large databases.Clustering large datasets presents scalability problems reviewed in the section Scalability and VLDB Extensions.Here we talk about algorithms like DIGNET,about BIRCH and other data squashing techniques,and about Hoffding or Chernoffbounds.Another trait of real-life data is high dimensionality.Corresponding de-velopments are surveyed in the section Clustering High Dimensional Data. 
The trouble comes from a decrease in metric separation when the dimension grows. One approach to dimensionality reduction uses attribute transformations (DFT, PCA, wavelets). Another way to address the problem is through subspace clustering (algorithms CLIQUE, MAFIA, ENCLUS, OPTIGRID, PROCLUS, ORCLUS). Still another approach clusters attributes in groups and uses their derived proxies to cluster objects. This double clustering is known as co-clustering.

Issues common to different clustering methods are overviewed in the section General Algorithmic Issues. We talk about assessment of results, determination of the appropriate number of clusters to build, data preprocessing, proximity measures, and handling of outliers.

For the reader's convenience we provide a classification of clustering algorithms closely followed by this survey:

• Hierarchical Methods: Agglomerative Algorithms; Divisive Algorithms
• Partitioning Relocation Methods: Probabilistic Clustering; K-medoids Methods; K-means Methods
• Density-Based Partitioning Methods: Density-Based Connectivity Clustering; Density Functions Clustering
• Grid-Based Methods
• Methods Based on Co-Occurrence of Categorical Data
• Other Clustering Techniques: Constraint-Based Clustering; Graph Partitioning; Clustering Algorithms and Supervised Learning; Clustering Algorithms in Machine Learning
• Scalable Clustering Algorithms
• Algorithms for High Dimensional Data: Subspace Clustering; Co-Clustering Techniques

1.4 Important Issues

The properties of clustering algorithms we are primarily concerned with in data mining include:

• Type of attributes the algorithm can handle
• Scalability to large datasets
• Ability to work with high dimensional data
• Ability to find clusters of irregular shape
• Handling outliers
• Time complexity (we frequently simply use the term complexity)
• Data order dependency
• Labeling or assignment (hard or strict vs. soft or fuzzy)
• Reliance on a priori knowledge and user-defined parameters
• Interpretability of results

Realistically, with every algorithm we discuss only some of these properties.
The list is in no way exhaustive.For example,as appropriate,we also discuss algorithms ability to work in pre-defined memory buffer,to restart,and to provide an intermediate solution.6Pavel Berkhin2Hierarchical ClusteringHierarchical clustering builds a cluster hierarchy or a tree of clusters,also known as a dendrogram.Every cluster node contains child clusters;sibling clusters partition the points covered by their common parent.Such an ap-proach allows exploring data on different levels of granularity.Hierarchical clustering methods are categorized into agglomerative(bottom-up)and divi-sive(top-down)[116],[131].An agglomerative clustering starts with one-point (singleton)clusters and recursively merges two or more of the most similar clusters.A divisive clustering starts with a single cluster containing all data points and recursively splits the most appropriate cluster.The process contin-ues until a stopping criterion(frequently,the requested number k of clusters) is achieved.Advantages of hierarchical clustering include:•Flexibility regarding the level of granularity•Ease of handling any form of similarity or distance•Applicability to any attribute typesDisadvantages of hierarchical clustering are related to:•Vagueness of termination criteria•Most hierarchical algorithms do not revisit(intermediate)clusters once constructed.The classic approaches to hierarchical clustering are presented in the sub-section Linkage Metrics.Hierarchical clustering based on linkage metrics re-sults in clusters of proper(convex)shapes.Active contemporary efforts to build cluster systems that incorporate our intuitive concept of clusters as con-nected components of arbitrary shape,including the algorithms CURE and CHAMELEON,are surveyed in the subsection Hierarchical Clusters of Arbi-trary Shapes.Divisive techniques based on binary taxonomies are presented in the subsection Binary Divisive Partitioning.The subsection Other Devel-opments contains information related to incremental learning,model-based clustering,and cluster refinement.In hierarchical clustering our regular point-by-attribute data representa-tion frequently is of secondary importance.Instead,hierarchical clustering frequently deals with the N×N matrix of distances(dissimilarities)or sim-ilarities between training points sometimes called a connectivity matrix.So-called linkage metrics are constructed from elements of this matrix.The re-quirement of keeping a connectivity matrix in memory is unrealistic.To relax this limitation different techniques are used to sparsify(introduce zeros into) the connectivity matrix.This can be done by omitting entries smaller than a certain threshold,by using only a certain subset of data representatives,or by keeping with each point only a certain number of its nearest neighbors(for nearest neighbor chains see[177]).Notice that the way we process the original (dis)similarity matrix and construct a linkage metric reflects our a priori ideas about the data model.A Survey of Clustering Data Mining Techniques7With the(sparsified)connectivity matrix we can associate the weighted connectivity graph G(X,E)whose vertices X are data points,and edges E and their weights are defined by the connectivity matrix.This establishes a connection between hierarchical clustering and graph partitioning.One of the most striking developments in hierarchical clustering is the algorithm BIRCH.It is discussed in the section Scalable VLDB Extensions.Hierarchical clustering initializes a cluster system as a set of singleton 
clusters (agglomerative case) or a single cluster of all points (divisive case) and proceeds iteratively merging or splitting the most appropriate cluster(s) until the stopping criterion is achieved. The appropriateness of a cluster(s) for merging or splitting depends on the (dis)similarity of cluster(s) elements. This reflects a general presumption that clusters consist of similar points. An important example of dissimilarity between two points is the distance between them.

To merge or split subsets of points rather than individual points, the distance between individual points has to be generalized to the distance between subsets. Such a derived proximity measure is called a linkage metric. The type of a linkage metric significantly affects hierarchical algorithms, because it reflects a particular concept of closeness and connectivity. Major inter-cluster linkage metrics [171], [177] include single link, average link, and complete link. The underlying dissimilarity measure (usually, distance) is computed for every pair of nodes with one node in the first set and another node in the second set. A specific operation such as minimum (single link), average (average link), or maximum (complete link) is applied to the pair-wise dissimilarity measures:

d(C1, C2) = Op{ d(x, y), x ∈ C1, y ∈ C2 }

Early examples include the algorithm SLINK [199], which implements single link (Op = min), Voorhees' method [215], which implements average link (Op = Avr), and the algorithm CLINK [55], which implements complete link (Op = max). It is related to the problem of finding the Euclidean minimal spanning tree [224] and has O(N²) complexity. The methods using inter-cluster distances defined in terms of pairs of nodes (one in each respective cluster) are called graph methods. They do not use any cluster representation other than a set of points. This name naturally relates to the connectivity graph G(X, E) introduced above, because every data partition corresponds to a graph partition. Such methods can be augmented by so-called geometric methods in which a cluster is represented by its central point. Under the assumption of numerical attributes, the center point is defined as a centroid or an average of two cluster centroids subject to agglomeration. It results in centroid, median, and minimum variance linkage metrics. All of the above linkage metrics can be derived from the Lance-Williams updating formula [145]:

d(C_i ∪ C_j, C_k) = a(i) d(C_i, C_k) + a(j) d(C_j, C_k) + b·d(C_i, C_j) + c·|d(C_i, C_k) − d(C_j, C_k)|

Here a, b, c are coefficients corresponding to a particular linkage. This formula expresses a linkage metric between a union of the two clusters and the third cluster in terms of the underlying nodes. The Lance-Williams formula is crucial to making the dis(similarity) computations feasible. Surveys of linkage metrics can be found in [170], [54]. When distance is used as a base measure, linkage metrics capture inter-cluster proximity. However, a similarity-based view that results in intra-cluster connectivity considerations is also used, for example, in the original average link agglomeration (Group-Average Method) [116].

Under reasonable assumptions, such as the reducibility condition (graph methods satisfy this condition), linkage metrics methods suffer from O(N²) time complexity [177]. Despite the unfavorable time complexity, these algorithms are widely used. As an example, the algorithm AGNES (AGlomerative NESting) [131] is used in S-Plus. When the connectivity N×N matrix is sparsified, graph methods directly dealing with the connectivity graph G can be used. In particular, hierarchical divisive MST (Minimum Spanning
Tree)algorithm is based on graph parti-tioning [116].2.1Hierarchical Clusters of Arbitrary ShapesFor spatial data,linkage metrics based on Euclidean distance naturally gener-ate clusters of convex shapes.Meanwhile,visual inspection of spatial images frequently discovers clusters with curvy appearance.Guha et al.[99]introduced the hierarchical agglomerative clustering algo-rithm CURE (Clustering Using REpresentatives).This algorithm has a num-ber of novel features of general importance.It takes special steps to handle outliers and to provide labeling in assignment stage.It also uses two techniques to achieve scalability:data sampling (section 8),and data partitioning.CURE creates p partitions,so that fine granularity clusters are constructed in parti-tions first.A major feature of CURE is that it represents a cluster by a fixed number,c ,of points scattered around it.The distance between two clusters used in the agglomerative process is the minimum of distances between two scattered representatives.Therefore,CURE takes a middle approach between the graph (all-points)methods and the geometric (one centroid)methods.Single and average link closeness are replaced by representatives’aggregate closeness.Selecting representatives scattered around a cluster makes it pos-sible to cover non-spherical shapes.As before,agglomeration continues until the requested number k of clusters is achieved.CURE employs one additional trick:originally selected scattered points are shrunk to the geometric centroid of the cluster by a user-specified factor α.Shrinkage suppresses the affect of outliers;outliers happen to be located further from the cluster centroid than the other scattered representatives.CURE is capable of finding clusters of different shapes and sizes,and it is insensitive to outliers.Because CURE uses sampling,estimation of its complexity is not straightforward.For low-dimensional data authors provide a complexity estimate of O (N 2sample )definedA Survey of Clustering Data Mining Techniques9 in terms of a sample size.More exact bounds depend on input parameters: shrink factorα,number of representative points c,number of partitions p,and a sample size.Figure1(a)illustrates agglomeration in CURE.Three clusters, each with three representatives,are shown before and after the merge and shrinkage.Two closest representatives are connected.While the algorithm CURE works with numerical attributes(particularly low dimensional spatial data),the algorithm ROCK developed by the same researchers[100]targets hierarchical agglomerative clustering for categorical attributes.It is reviewed in the section Co-Occurrence of Categorical Data.The hierarchical agglomerative algorithm CHAMELEON[127]uses the connectivity graph G corresponding to the K-nearest neighbor model spar-sification of the connectivity matrix:the edges of K most similar points to any given point are preserved,the rest are pruned.CHAMELEON has two stages.In thefirst stage small tight clusters are built to ignite the second stage.This involves a graph partitioning[129].In the second stage agglomer-ative process is performed.It utilizes measures of relative inter-connectivity RI(C i,C j)and relative closeness RC(C i,C j);both are locally normalized by internal interconnectivity and closeness of clusters C i and C j.In this sense the modeling is dynamic:it depends on data locally.Normalization involves certain non-obvious graph operations[129].CHAMELEON relies heavily on graph partitioning implemented in the library HMETIS(see the section6). 
Agglomerative process depends on user provided thresholds.A decision to merge is made based on the combinationRI(C i,C j)·RC(C i,C j)αof local measures.The algorithm does not depend on assumptions about the data model.It has been proven tofind clusters of different shapes,densities, and sizes in2D(two-dimensional)space.It has a complexity of O(Nm+ Nlog(N)+m2log(m),where m is the number of sub-clusters built during the first initialization phase.Figure1(b)(analogous to the one in[127])clarifies the difference with CURE.It presents a choice of four clusters(a)-(d)for a merge.While CURE would merge clusters(a)and(b),CHAMELEON makes intuitively better choice of merging(c)and(d).2.2Binary Divisive PartitioningIn linguistics,information retrieval,and document clustering applications bi-nary taxonomies are very useful.Linear algebra methods,based on singular value decomposition(SVD)are used for this purpose in collaborativefilter-ing and information retrieval[26].Application of SVD to hierarchical divisive clustering of document collections resulted in the PDDP(Principal Direction Divisive Partitioning)algorithm[31].In our notations,object x is a docu-ment,l th attribute corresponds to a word(index term),and a matrix X entry x il is a measure(e.g.TF-IDF)of l-term frequency in a document x.PDDP constructs SVD decomposition of the matrix10Pavel Berkhin(a)Algorithm CURE (b)Algorithm CHAMELEONFig.1.Agglomeration in Clusters of Arbitrary Shapes(X −e ¯x ),¯x =1Ni =1:N x i ,e =(1,...,1)T .This algorithm bisects data in Euclidean space by a hyperplane that passes through data centroid orthogonal to the eigenvector with the largest singular value.A k -way split is also possible if the k largest singular values are consid-ered.Bisecting is a good way to categorize documents and it yields a binary tree.When k -means (2-means)is used for bisecting,the dividing hyperplane is orthogonal to the line connecting the two centroids.The comparative study of SVD vs.k -means approaches [191]can be used for further references.Hier-archical divisive bisecting k -means was proven [206]to be preferable to PDDP for document clustering.While PDDP or 2-means are concerned with how to split a cluster,the problem of which cluster to split is also important.Simple strategies are:(1)split each node at a given level,(2)split the cluster with highest cardinality,and,(3)split the cluster with the largest intra-cluster variance.All three strategies have problems.For a more detailed analysis of this subject and better strategies,see [192].2.3Other DevelopmentsOne of early agglomerative clustering algorithms,Ward’s method [222],is based not on linkage metric,but on an objective function used in k -means.The merger decision is viewed in terms of its effect on the objective function.The popular hierarchical clustering algorithm for categorical data COB-WEB [77]has two very important qualities.First,it utilizes incremental learn-ing.Instead of following divisive or agglomerative approaches,it dynamically builds a dendrogram by processing one data point at a time.Second,COB-WEB is an example of conceptual or model-based learning.This means that each cluster is considered as a model that can be described intrinsically,rather than as a collection of points assigned to it.COBWEB’s dendrogram is calleda classification tree.Each tree node(cluster)C is associated with the condi-tional probabilities for categorical attribute-values pairs,P r(x l=νlp|C),l=1:d,p=1:|A l|.This easily can be recognized as a C-specific Na¨ıve Bayes classifier.During 
2.3 Other Developments

One of the early agglomerative clustering algorithms, Ward's method [222], is based not on a linkage metric but on the objective function used in k-means. The merger decision is viewed in terms of its effect on the objective function.

The popular hierarchical clustering algorithm for categorical data COBWEB [77] has two very important qualities. First, it utilizes incremental learning. Instead of following divisive or agglomerative approaches, it dynamically builds a dendrogram by processing one data point at a time. Second, COBWEB is an example of conceptual or model-based learning. This means that each cluster is considered as a model that can be described intrinsically, rather than as a collection of points assigned to it. COBWEB's dendrogram is called a classification tree. Each tree node (cluster) C is associated with the conditional probabilities for categorical attribute-value pairs,

$Pr(x_l = \nu_{lp} \mid C), \quad l = 1{:}d, \; p = 1{:}|A_l|.$

This can easily be recognized as a C-specific Naïve Bayes classifier. During the classification tree construction, every new point is descended along the tree and the tree is potentially updated (by an insert/split/merge/create operation). Decisions are based on the category utility [49]

$CU\{C_1,\dots,C_k\} = \frac{1}{k}\sum_{j=1:k} CU(C_j),$

$CU(C_j) = \sum_{l,p}\left( Pr(x_l = \nu_{lp} \mid C_j)^2 - Pr(x_l = \nu_{lp})^2 \right).$

Category utility is similar to the GINI index. It rewards clusters C_j for increases in predictability of the categorical attribute values ν_lp. Being incremental, COBWEB is fast, with a complexity of O(tN), though it depends non-linearly on the tree characteristics packed into the constant t. There is a similar incremental hierarchical algorithm for all-numerical attributes called CLASSIT [88]. CLASSIT associates normal distributions with cluster nodes. Both algorithms can result in highly unbalanced trees.

Chiu et al. [47] proposed another conceptual or model-based approach to hierarchical clustering. This development contains several different useful features, such as the extension of scalability preprocessing to categorical attributes, outlier handling, and a two-step strategy for monitoring the number of clusters, including BIC (defined below). A model associated with a cluster covers both numerical and categorical attributes and constitutes a blend of Gaussian and multinomial models. Denote the corresponding multivariate parameters by θ. With every cluster C we associate the logarithm of its (classification) likelihood

$l_C = \sum_{x_i \in C} \log p(x_i \mid \theta).$

The algorithm uses maximum likelihood estimates for the parameter θ. The distance between two clusters is defined (instead of a linkage metric) as the decrease in log-likelihood

$d(C_1, C_2) = l_{C_1} + l_{C_2} - l_{C_1 \cup C_2}$

caused by merging the two clusters under consideration. The agglomerative process continues until the stopping criterion is satisfied. As such, determination of the best k is automatic. This algorithm has a commercial implementation (in SPSS Clementine). The complexity of the algorithm is linear in N for the summarization phase.

Traditional hierarchical clustering does not change a point's membership in a once-assigned cluster, due to its greedy approach: after a merge or a split is selected, it is not refined. Though COBWEB does reconsider its decisions, its …
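To make the category-utility criterion above concrete, here is a small non-incremental sketch that scores a given partition of categorical records (the toy data and all names are ours; COBWEB's insert/split/merge machinery is not reproduced).

```python
from collections import Counter

def category_utility(clusters, records):
    """CU{C_1..C_k} = (1/k) * sum_j sum_{l,p} [ P(x_l=v|C_j)^2 - P(x_l=v)^2 ]."""
    k, n, d = len(clusters), len(records), len(records[0])
    # Marginal attribute-value counts over the whole data set.
    marginals = [Counter(r[l] for r in records) for l in range(d)]
    total = 0.0
    for cluster in clusters:                       # a cluster is a list of record indices
        nj = len(cluster)
        for l in range(d):
            conditional = Counter(records[i][l] for i in cluster)
            for value, count in marginals[l].items():
                p_cond = conditional.get(value, 0) / nj
                p_marg = count / n
                total += p_cond ** 2 - p_marg ** 2
    return total / k

if __name__ == "__main__":
    # Toy categorical records (color, size) and one candidate partition.
    data = [("red", "small"), ("red", "small"), ("blue", "large"),
            ("blue", "large"), ("blue", "small")]
    print(round(category_utility([[0, 1, 4], [2, 3]], data), 3))
```

A partition whose clusters make the attribute values more predictable than the marginals receives a higher score, which is exactly what drives the insert/split/merge decisions.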
322 - 韩蒙: RAKING: An Efficient Algorithm for Mining K Maximal Frequent Patterns in Uncertain Graphs

… Reference [17] surveys the latest techniques for uncertain data, but these studies are still mainly oriented toward traditional data items. Research on uncertain graphs has only just begun; existing topics include computing the most reliable subgraphs of an uncertain graph [18][19] and efficient top-k querying over uncertain graphs [20]. Zou proposed several effective algorithms for mining frequent patterns on uncertain graphs [21,22,23].
RAKING: An Efficient Algorithm for Mining K Maximal Frequent Patterns in Uncertain Graphs
韩蒙 1) 张炜 2) 李建中 1) 2)
1) (School of Computer Science and Technology, Heilongjiang University, Harbin 150080, Heilongjiang)  2) (School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150…, Heilongjiang)
Abstract: … possible graph instances, so frequent graph pattern mining algorithms designed for the deterministic graph model are usually hard to run efficiently over a collection of uncertain graphs. This paper proposes a random-walk-based algorithm for mining K maximal frequent subpatterns over uncertain graph data sets. First, each uncertain graph is converted into a corresponding deterministic graph and candidate frequent patterns are mined; then, the candidate frequent patterns are restored to uncertain graphs and a search space of maximal frequent patterns is generated; finally, K maximal frequent patterns are selected at random with equal probability by a random walk. Theoretical analysis and experimental results show that the proposed algorithm can efficiently obtain the K maximal frequent patterns of a collection of uncertain graphs.
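The final step of the abstract, selecting K maximal patterns with equal probability by a random walk, can be illustrated with a generic Metropolis-corrected walk whose stationary distribution is uniform. The toy "search space" and all names below are placeholders, not the RAKING data structures.

```python
import random

def metropolis_uniform_walk(neighbors, start, steps, rng):
    """Random walk whose stationary distribution is uniform over the nodes.

    neighbors maps each node to a list of adjacent nodes; here the nodes stand
    in for maximal-pattern candidates and the adjacency for the search space."""
    current = start
    for _ in range(steps):
        proposal = rng.choice(neighbors[current])
        # Metropolis correction: accept with probability min(1, deg(u)/deg(v)),
        # which removes the bias toward high-degree nodes.
        if rng.random() < len(neighbors[current]) / len(neighbors[proposal]):
            current = proposal
    return current

def sample_k_patterns(neighbors, k, steps=1000, seed=42):
    """Draw k distinct, approximately uniformly chosen nodes via independent walks."""
    rng = random.Random(seed)
    nodes = list(neighbors)
    chosen = []
    while len(chosen) < min(k, len(nodes)):
        node = metropolis_uniform_walk(neighbors, rng.choice(nodes), steps, rng)
        if node not in chosen:
            chosen.append(node)
    return chosen

if __name__ == "__main__":
    # Toy search space: five candidate "maximal patterns" with an arbitrary adjacency.
    space = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3]}
    print(sample_k_patterns(space, k=3))
```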
Margin [11] first organizes the graph data into a lattice and, during the search, keeps pruning the search space to reduce subgraph isomorphism computations, which makes it easier to obtain maximal frequent patterns. However, because the frequent subtrees of an uncertain graph are themselves uncertain, and the space of all deterministic subgraphs implied by an uncertain graph is enormous, they are hard to enumerate effectively even with some pruning, so neither of these methods can be applied directly to uncertain graphs. Randomized algorithms are widely used because they can run efficiently on large-scale data. On deterministic graphs, ORIGAMI [12] obtains a set of representative patterns through randomization, but its output is not consistent: even after many iterations, some important patterns may still be missing from the result. The MUSK [13] method obtains the set of maximal frequent patterns by a random walk. Recently, building on this earlier work, Hasan proposed a general random-walk approach for mining various kinds of constrained patterns [14]. However, these methods operate on deterministic graphs, do not take the uncertainty of edges and vertices into account, and are therefore not well suited to uncertain graphs. Research on uncertain data has also produced many results in recent years, such as work on modeling and managing uncertain data.
Dynamical Analysis of a Class of Three-Dimensional Hindmarsh-Rose Neuron Models

…the complex coding mechanisms of the neuron model and respond rapidly to stimuli. In 2008, M. Storace et al. [7] used organizing principles and bifurcation-diagram information to compare the dynamics of the neuron model with the dynamics of a piecewise-linear approximation, and customized a circuit implementation. In 2009, Hindmarsh and Rose [8] added a slow current term on the basis of the HH model and, after simplification, obtained a realistic HR neuron model.
$\dot{x} = y - a x^3 + b x^2 - z + I,$
$\dot{y} = c - d x^2 - y,$                    (1)
$\dot{z} = r\,[\,s(x - x_0) - z\,],$
where a, b, c, d, I, r, s, x0 are all parameters: x denotes the membrane potential; y denotes the gating variable; z denotes the current intensity; r is a parameter describing the rate of the slow ion channels on the membrane (0 < r < 1); and I denotes the external stimulus current that the neuron receives.
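A minimal numerical-simulation sketch of system (1): the equations are the standard three-dimensional Hindmarsh-Rose form assumed above, and the parameter values are common illustrative choices rather than values taken from this paper (scipy is required).

```python
import numpy as np
from scipy.integrate import solve_ivp

def hindmarsh_rose(t, state, a, b, c, d, r, s, x0, I):
    """Standard three-dimensional Hindmarsh-Rose equations, the assumed form of (1)."""
    x, y, z = state
    dx = y - a * x**3 + b * x**2 - z + I
    dy = c - d * x**2 - y
    dz = r * (s * (x - x0) - z)
    return [dx, dy, dz]

if __name__ == "__main__":
    # Illustrative parameter values; I selects the firing regime (spiking, bursting, chaos).
    params = dict(a=1.0, b=3.0, c=1.0, d=5.0, r=0.006, s=4.0, x0=-1.6, I=3.0)
    sol = solve_ivp(hindmarsh_rose, (0.0, 1000.0), [0.0, 0.0, 0.0],
                    args=tuple(params.values()), max_step=0.05)
    # Count membrane-potential spikes as upward crossings of a threshold.
    x = sol.y[0]
    spikes = int(np.sum((x[:-1] < 1.0) & (x[1:] >= 1.0)))
    print("spikes counted:", spikes)
```

Sweeping I (or a second parameter such as r) and recording the spike counts is the kind of numerical experiment behind the bifurcation diagrams discussed next.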
Building on the HR neuron model proposed in earlier work, this paper uses bifurcation theory to analyze the number and nature of the system's equilibrium points, applies numerical simulation to study the period-adding behavior of the periodic windows between periodic firing and chaotic firing, and uses two-parameter bifurcation diagrams to illustrate the system's rich firing patterns more vividly.
1 Model description
A Near-Duplicate Image Clustering Algorithm Using Greedy-Tree External Support Vector Machines

蔺博宇 李弼程 高毫林 胡文博
…an external support vector machine is used to cluster the data set into two classes; a greedy tree-growing algorithm then selects the "best" class to decompose, and this process is repeated until no further split is possible. In addition, to overcome the synonymy problem of image visual words, a probabilistic latent semantic analysis model is used to map co-occurring image visual words to the same direction in the latent semantic space. Experimental results show that, compared with the internal support vector clustering algorithm and the uniform-splitting-based external support vector …
…detection. To improve the performance of Uniform Splitting based Support Vector Machine External Clustering (US-SVMEC), a near-duplicate image clustering algorithm that combines a Greedy Tree with SVMEC (GT-SVMEC) is proposed.
An Effective MUSIC Algorithm for Mining Frequent Subgraph Patterns in Uncertain Graph Databases

An Effective MUSIC Algorithm for Mining Frequent Subgraph Patterns in Uncertain Graph Databases
王文龙; 李建中
[Abstract] In recent years, mining frequent subgraph patterns in uncertain graph databases has attracted more and more attention. The main difficulty of this problem is that not only is there a huge number of possible subgraph patterns to check, but a large number of subgraph isomorphism tests are also needed to decide whether a graph implies a given pattern. Traditional algorithms use approximation algorithms to compute the expected support of a subgraph pattern, but the computational cost is still very large. To this end, an algorithm based on an index built over the uncertain database is provided. The algorithm first enumerates all possible candidate subgraph patterns according to the apriori property, and then uses the index to prune the candidate subgraph pattern space, reducing the number of subgraph isomorphism tests and thereby the cost of computing expected support. Experiments on a real data set show that the algorithm can effectively mine frequent subgraph patterns in uncertain graph databases.
[Journal] Intelligent Computer and Applications
[Year (Volume), Issue] 2013(003)005
[Pages] 4 (P20-23)
[Keywords] uncertain graph; frequent subgraph pattern; expected support; uncertain graph index
[Authors] 王文龙; 李建中
[Affiliations] School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001; School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001
[Language] Chinese
[CLC number] TP311.13

0 Introduction
Mining frequent subgraph patterns in an uncertain graph database is a challenging problem.
Under the expected-support semantics, the test of whether a subgraph pattern is frequent is based on the expected value of its support over all graph databases implicated by the uncertain graph database; this quantity is called the expected support.
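The exact expected support sums over exponentially many implicated (possible-world) databases. As an illustration of this semantics only, not of the index-based MUSIC algorithm or of the DNF-counting approximation of [1], the sketch below estimates expected support by sampling possible worlds under independent edge probabilities; the subgraph containment test is passed in by the caller and is assumed, not implemented.

```python
import random

def sample_world(vertices, edge_probs, rng):
    """One possible world: keep each edge independently with its probability."""
    return vertices, {e for e, p in edge_probs.items() if rng.random() < p}

def estimated_expected_support(uncertain_db, pattern, contains, trials=200, seed=0):
    """Monte-Carlo estimate of the expected support of `pattern`.

    uncertain_db : list of (vertices, {edge: probability}) uncertain graphs
    contains     : callable(world, pattern) -> bool, a subgraph containment
                   test supplied by the caller (assumed, not implemented here)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        hits = sum(contains(sample_world(v, ep, rng), pattern) for v, ep in uncertain_db)
        total += hits / len(uncertain_db)
    return total / trials

if __name__ == "__main__":
    # Two tiny uncertain graphs and a single-edge "pattern".
    g1 = ({1, 2, 3}, {(1, 2): 0.9, (2, 3): 0.5})
    g2 = ({1, 2}, {(1, 2): 0.2})
    edge_present = lambda world, edge: edge in world[1]
    print(round(estimated_expected_support([g1, g2], (1, 2), edge_present), 3))
```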
Reference [1] gives an approximation algorithm for computing the expected support of a subgraph pattern by transforming the problem into an instance of the DNF counting problem.
Although that method reduces the computational cost for a single uncertain graph, the overall cost is still very large.
References [2,3] each give an efficient subgraph isomorphism testing algorithm, but because of the enormous number of subgraph isomorphism tests required by this problem, they cannot effectively reduce the time needed for mining.
This paper proposes the MUSIC algorithm to solve the problem of mining frequent subgraph patterns in uncertain graph databases.
The algorithm uses an index to reduce the computational cost of determining support.
The experiments in the paper demonstrate the effectiveness of the MUSIC algorithm.
1 Data model and problem definitions
Definition 1 (deterministic graph): A deterministic graph is a 4-tuple G = (V, E, Σ, L).
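A small data-structure sketch matching Definition 1, under the usual reading (our assumption, since the definition is cut off here) that V is the vertex set, E the edge set, Σ the label alphabet, and L the labeling function:

```python
from dataclasses import dataclass, field

@dataclass
class DeterministicGraph:
    """A deterministic labeled graph G = (V, E, Sigma, L)."""
    V: set                                    # vertex set
    E: set                                    # edge set, edges as (u, v) pairs
    Sigma: set                                # label alphabet
    L: dict = field(default_factory=dict)     # labeling function: vertex or edge -> label

    def is_consistent(self):
        """Check that edges connect known vertices and labels come from Sigma."""
        endpoints_ok = all(u in self.V and v in self.V for u, v in self.E)
        labels_ok = all(lbl in self.Sigma for lbl in self.L.values())
        return endpoints_ok and labels_ok

if __name__ == "__main__":
    g = DeterministicGraph(V={1, 2}, E={(1, 2)}, Sigma={"A", "B"},
                           L={1: "A", 2: "B", (1, 2): "B"})
    print(g.is_consistent())   # True
```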
Mining ArgoUML with Dynamic Analysis to Establish a Set of Key Classes for Program Comprehension

position paper

Andy Zaidman and Serge Demeyer
University of Antwerp
Department of Mathematics and Computer Science
Lab On Re-Engineering
Middelheimlaan 1, 2020 Antwerp, Belgium
{Andy.Zaidman,Serge.Demeyer}@ua.ac.be

Abstract

Initial program comprehension could benefit from knowing the most important classes in the system under study. Typically, in well-designed object-oriented programs, a few key classes work tightly together to provide the bulk of the functionality. Therefore, a viable strategy is to investigate highly coupled classes, as these are likely to be elements of the core design of a software system. Previous research has shown that using webmining principles in combination with dynamic coupling metrics can help in establishing a set of first-to-be-looked-at classes when familiarizing oneself with a software system. This paper describes the results of applying this technique on ArgoUML.

Keywords: Dynamic analysis, program comprehension, coupling metrics, webmining, ArgoUML.

1 Introduction

Reverse engineering is defined as the analysis of a system in order to identify its current components and their dependencies and to create abstractions of the system's design [2]. In practice, any reverse engineering operation almost always takes place in service of a specific purpose, such as re-engineering to add a specific feature, maintenance to improve the efficiency of a process, reuse of some of its modules in a new system, etc. [18, 20]. In order to perform any of these operations the software engineer must comprehend a given program sufficiently well to plan, design and implement modifications and/or additions. As such, program comprehension can be defined as the process the software engineer goes through when studying the software artifacts, with as ultimate goal the sufficient understanding of the system to perform the required operations [16, 15]. Such software artifacts could include the source code, documentation and/or abstractions from the reverse engineering process.

Estimates go as far as stating that a programmer spends 40% of the time budget of a maintenance operation on program comprehension [21, 17, 3]. Although gaining understanding is a daunting task and is often sped up to deliver the project in the shortest possible timeframe [5], it is absolutely necessary to get an adequate understanding of a software system before making changes.

Dynamic analysis can be very helpful for understanding the behavior of a large system. Understanding an object-oriented system, for example, can be very hard if one relies solely on the source code and static analysis. Due to the abundant use of polymorphism in object-oriented software, the relationships between the system artifacts tend to be obscure [11].

Dynamic data gathered from the execution of a scenario of the program is typically stored in so-called execution traces. These execution traces tend to explode in size, as the amount of dynamic data collected during the execution of a simple scenario is huge [11, 9].
Presenting this data to the user without any form of processing is, due to its size, quite useless. What is needed is a technique that improves the intelligibility of the trace. In other words: the huge trace has to be chunked into readable and understandable parts. Preferably, these chunks presented to the user are essential in building up the knowledge of the software project under study.

Existing solutions rely on visualization schemes for presenting the trace to the user in a readable and understandable way [4, 12, 13, 19]. The amount of data, however, remains the same. As such, it is up to the user to decide which information is appropriate and which information he/she actually needs. Another approach in this context is applying heuristics to the event trace in order to find either (1) what parts of the trace can be removed without losing information (e.g. duplicate parts) or (2) what parts of the trace are absolutely necessary in order to extract moderately sized high-level representations from it [11, 7, 6]. The research presented in this paper should be situated in the latter category.

Recently there have been a number of interesting advances within this branch of research [9, 11, 10, 22]. All techniques presented are targeted towards enabling program comprehension in a dynamic analysis context. However, in order to make an adequate comparison of these techniques, they should all be applied on a common case study. The research presented in this paper is a first step in that direction.

This paper presents the results of applying the HITS webmining algorithm [14] on dynamic coupling measures [22]. As a common case for comparing different program comprehension tools, ArgoUML was chosen. ArgoUML is a CASE (Computer Aided Software Engineering) tool that allows for graphical software design. Its concept is centered around open source and open standards such as XMI, SVG and PGML.

The structure of this paper can be summarized as follows: in Section 2 we very briefly explain our technique. Section 3 shows the results of applying our technique on ArgoUML, while Section 4 concludes with future work and the conclusion.

2 Webmining

Using coupling measures to determine which classes "distribute" functionality in an application seems to be a good starting point for determining which classes to examine first during early program comprehension. By using dynamic coupling metrics, one eliminates the drawback that polymorphism isn't taken into consideration. However, one of the drawbacks of using (dynamic coupling) measures is that each coupling measure is calculated between two classes. As such, a class A, which is weakly coupled to a class B, which itself is tightly coupled to several other classes, will remain a weakly coupled class. Nevertheless, A can have a great impact on B and all the classes that B is coupled to. We want to address this situation by looking at a measure of coupling that can best be described as being transitive. This section will explain how webmining algorithms, which are iterative-recursive in nature, can be used for expressing this transitiveness.

Figure 1. Example web-graph

2.1 Introduction to webmining

In datamining, many successful techniques have been developed to analyze the structure of the web [1, 8, 14].
Typically, these methods consider the Internet as a large graph in which, based solely on the hyperlink structure, important web pages can be identified. In this section we show how to apply these successful webmining techniques to a compacted call graph of a program trace, in order to uncover important classes. First we introduce the HITS webmining algorithm [14] to identify so-called hubs and authorities on the web. Then, the HITS algorithm is combined with the compacted call graph. Through our case studies we will show that the classes that are associated with good "hubs" in the compacted call graph are good candidates for early program understanding.

2.2 Identifying hubs in large webgraphs

In [14], the notions of hub and authority were introduced. Intuitively, on the one hand, hubs are pages that rather refer to pages containing information than being informative themselves. Standard examples include web directories, lists of personal pages, etc. On the other hand, a page is called an authority if it contains useful information. The HITS algorithm is based on this relation between hubs and authorities.

Example. Consider the webgraph given in Figure 1. In this graph, 2 and 3 will be good authorities, 4 and 5 will be good hubs, and 1 will be a less good hub. The authority of 2 will be larger than the authority of 3, because the only in-links that they do not have in common are 1→2 and 2→3, and 1 is a better hub than 2. 4 and 5 are better hubs than 1, as they point to better authorities.

The HITS algorithm works as follows. Every page i gets assigned to it two numbers: a_i denotes the authority of the page, while h_i denotes the hubiness. An edge like i→j denotes that there is a hyperlink from page i to page j. It is also possible to add weights to the edges in the graph. Adding weights to the graph can be interesting to capture the fact that some edges are more important than others. Let w[i,j] be the weight of the edge from page i to page j. The recursive relation between authority and hubiness is captured by the following formulas:

$h_i = \sum_{i \to j} w[i,j] \cdot a_j$    (1)

$a_j = \sum_{i \to j} w[i,j] \cdot h_i$    (2)

The algorithm starts with initializing all h's and a's to 1, and repeatedly updates the values for all pages using formulas (1) and (2). If after each update the values are normalized, this algorithm is known to converge to stable sets of authority and hub weights. The convergence criterion, i.e. iterating until the sum of absolute values of weight changes falls below a constant threshold, is shown to be reached around eleven iterations [14].

In the context of webmining, the identification of hubs and authorities by the HITS algorithm has turned out to be very useful. Because HITS only uses the links between webpages, and not the actual content, it can be used on arbitrary graphs to identify important hubs and authorities.

2.3 Applying webmining to execution traces

Within our problem domain, hubs can be considered coordinating classes, while authorities correspond to classes providing small functionalities that are used by many other classes. As such, hubs and authorities are conceptually similar to import and export coupling, respectively. Again we expect hub classes to play a pivotal role in a system's architecture. Therefore, hubs are excellent candidates for beginning the program comprehension process or for gaining quick and initial program understanding [22].
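Before turning to the construction of the compacted call graph, here is a minimal sketch of the weighted HITS iteration of formulas (1) and (2) with per-iteration normalization; the edge set in the example is our reading of the web-graph of Figure 1 and is therefore an assumption.

```python
def hits(edges, iterations=11):
    """Weighted HITS: `edges` maps (i, j) -> w[i, j]; returns (hub, authority) scores."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a_j = sum over i->j of w[i, j] * h_i   (formula (2))
        auth = {n: sum(w * hub[i] for (i, j), w in edges.items() if j == n) for n in nodes}
        # h_i = sum over i->j of w[i, j] * a_j   (formula (1))
        hub = {n: sum(w * auth[j] for (i, j), w in edges.items() if i == n) for n in nodes}
        # Normalize after each update so the scores converge to stable weights.
        for scores in (hub, auth):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

if __name__ == "__main__":
    # One reading of the example web-graph: 1->2, 2->3, and 4 and 5 pointing to 2 and 3.
    edges = {(1, 2): 1.0, (2, 3): 1.0, (4, 2): 1.0, (4, 3): 1.0, (5, 2): 1.0, (5, 3): 1.0}
    hub, auth = hits(edges)
    print(sorted(hub, key=hub.get, reverse=True))   # 4 and 5 come out as the best hubs
```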
In order to apply the HITS webmining algorithm to execution traces, an abstraction of the trace in the form of a graph has to be made. Because a trace can be seen as a large tree that contains the complete calling sequence of a program run, it is not so difficult to transform this large tree into a more compact call graph. Figure 2 shows an example of a so-called compacted call graph. The compacted call graph is derived from the dynamic call graph; it shows an edge between two classes A→B if an instance of class A sends a message to an instance of class B. As such, the compacted call graph does not have the complete calling-structure information of the program run, but it does contain the data for each class interaction. The weights on the edges give an indication of the tightness of the collaboration, as a weight is the number of distinct messages that are sent between instances of both classes. More formally:

$weight(A, B) = \left| \bigcup_{i,j} M(a_i, b_j) \right|$

where a_i and b_j are instances of respectively class A and class B, and M(a, b) is the set of messages sent from a to b. Also, to exclude cohesion: A ≠ B. In this context a message is defined by its signature and thus by the type(s) of its formal parameter(s).

Figure 2. A compacted call graph
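A sketch of how such a compacted call graph could be assembled from trace data; the trace format, a stream of (caller class, callee class, message signature) tuples, is our assumption and not the tracing infrastructure used for the experiments. The weight of an edge A→B counts the distinct message signatures sent from instances of A to instances of B, and self-calls are excluded, matching the definition above.

```python
from collections import defaultdict

def compacted_call_graph(trace_events):
    """Weighted edges class_A -> class_B from a trace.

    trace_events: iterable of (caller_class, callee_class, message_signature) tuples.
    weight(A, B) counts the *distinct* messages sent from instances of A to
    instances of B; self-calls (A == B, i.e. cohesion) are excluded."""
    messages = defaultdict(set)
    for caller, callee, signature in trace_events:
        if caller != callee:
            messages[(caller, callee)].add(signature)
    return {edge: len(signatures) for edge, signatures in messages.items()}

if __name__ == "__main__":
    # Hypothetical trace fragment; class names come from Table 1, signatures are made up.
    trace = [
        ("ui.ProjectBrowser", "kernel.Project", "save(File)"),
        ("ui.ProjectBrowser", "kernel.Project", "save(File)"),   # duplicate message: counted once
        ("ui.ProjectBrowser", "kernel.Project", "getDiagrams()"),
        ("kernel.Project", "diagram.ui.UMLDiagram", "addFig(Fig)"),
    ]
    print(compacted_call_graph(trace))
    # The resulting weighted edge map can be fed directly into the HITS sketch above.
```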
3 Applying the technique on ArgoUML

3.1 About ArgoUML

ArgoUML is an open source, Java-written CASE tool that embraces open standards such as XMI, SVG and PGML. It allows to design an application through a UML class diagram and has code-generation capabilities.

ArgoUML is centered around the concept of a project. A single project can be open at any time. One project corresponds to a model plus diagram information, i.e. everything you can edit within the ArgoUML window. This model can contain many model elements, which form the complete UML description of the system you are describing.

ArgoUML requires a number of libraries in order to function, namely an XML parser (Xerces), a parser generator (Antlr), a logging framework (log4j) and an internationalization library (i18n). For the purposes of our program comprehension task, we excluded these libraries and/or frameworks from our analysis. This leaves us with:

• 523 classes that form the core of ArgoUML
• 321 classes from external libraries (e.g. Sourceforge's Novosoft UML library and the OCL library from TU Dresden). We decided to include these external libraries because they form an integral part of the solution in the problem domain.

3.2 Execution scenario

Because our technique is based on dynamic analysis, a carefully chosen execution scenario is a necessity. The execution scenario we used for this small experiment was:

• Starting up ArgoUML
• Drawing a small class diagram which contains 6 classes
• Saving the project
• Quitting ArgoUML

3.3 Results

In our previous experiments with this technique, we have noticed that around 10% of the classes of a system should be considered as key classes for initial program understanding. For evaluation purposes, we considered the 15% highest ranked classes of our technique in those experiments, to give our heuristic a certain margin of error.

However, the software systems under study there were limited to between 100 and 150 classes, while in the case of ArgoUML we are considering more than 500 classes. Because of the size of the project, we have decided to limit the results of our technique to the 50 highest-ranked classes (or just below 10%). These results are shown in Table 1.

Table 1. 50 most important classes from ArgoUML

application.Main
ui.DetailsPane
ui.TabProps
diagram.ui.UMLDiagram
ui.TabStyle
ui.PropPanel
ui.ProjectBrowser
diagram.ui.FigNodeModelElement
diagram.static structure.ui.UMLClassDiagram
kernel.Project
ui.ActionRemoveFromModel
use case.ui.UMLUseCaseDiagram
diagram.static structure.ui.FigClass
ui.targetmanager.TargetManager
diagram.ui.FigEdgeModelElement
ui.explorer.ExplorerTree
ui.TabTaggedValues
foundation.core.UMLModelElementNamespaceComboBoxModel
foundation.core.UMLModelElementStereotypeComboBoxModel
ui.StylePanel
ui.UMLComboBoxModel2
ui.UMLTreeCellRenderer
ui.UMLPlainTextDocument
diagram.ui.FigGeneralization
diagram.UMLMutableGraphSupport
diagram.static structure.ui.ClassDiagramRenderer
persistence.AbstractFilePersister
ui.UMLCheckBox2
ui.explorer.rules.GoClassifierToSequenceDiagram
ui.explorer.rules.GoNamespaceToDiagram
ui.foundation.core.UMLGeneralizationPowertypeComboBoxModel
ui.foundation.core.ActionSetModelElementNamespace
ui.UMLModelElementListModel2
ui.TableModelTaggedValues
ui.foundation.core.UMLClassActiveCheckBox
ui.foundation.core.UMLGeneralizableElementAbstractCheckBox
ui.foundation.core.UMLGeneralizableElementLeafCheckBox
ui.foundation.core.UMLGeneralizableElementRootCheckBox
cognitive.ui.WizDescription
cognitive.ToDoList
foundation.core.UMLModelElementVisibilityRadioButtonPanel
ui.ActionCollaborationDiagram
ui.UMLRadioButtonPanel
explorer.rules.GoClassifierToBehavioralFeature
diagram.ui.ActionAddAttribute
diagram.ui.ActionAddOperation
ui.ActionGenerateOne
ui.ActionSequenceDiagram
ui.TabConstraints
explorer.rules.GoBehavioralFeatureToStateDiagram
explorer.rules.GoBehavioralFeatureToStateMachine

4 Conclusion and future work

The documentation of ArgoUML doesn't allow for an immediate benchmark as to how effective our technique is for detecting the key classes in ArgoUML for initial program comprehension. As such, no real conclusions can be drawn. This paper must be seen as a first step in a larger study that compares recent research that aims at developing program comprehension support tools based on dynamic analysis.

References

[1] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1-7):107–117, 1998.
[2] E. J. Chikofsky and J. H. Cross II. Reverse engineering and design recovery: A taxonomy. IEEE Software, pages 13–17, Jan. 1990.
[3] T. Corbi. Program understanding: Challenge for the 90s. IBM Systems Journal, 28(2):294–306, 1990.
[4] W. De Pauw, D. Lorenz, J. Vlissides, and M. Wegman. Execution patterns in object-oriented visualization. In Proceedings of the 4th USENIX Conference on Object-Oriented Technologies and Systems (COOTS), 1998.
[5] S. Demeyer, S. Ducasse, and O. Nierstrasz. Object-Oriented Reengineering Patterns. Morgan Kaufmann, 2003.
[6] T. Eisenbarth, R. Koschke, and D. Simon. Aiding program comprehension by static and dynamic feature analysis. In ICSM, pages 602–611, 2001.
[7] M. A. Foltz. Dr. Jones: A software archaeologist's magic lens, 2002. /457040.html.
[8] D. Gibson, J. M. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In UK Conference on Hypertext, pages 225–234, 1998.
[9] O. Greevy and S. Ducasse. Correlating features and code using a compact two-sided trace analysis approach. In Proceedings of the 9th European Conference on Software Maintenance and Reengineering (CSMR 2005), pages 314–323. IEEE Computer Society, 2005.
[10] A. Hamou-Lhadj, E. Braun, D. Amyot, and T. Lethbridge. Recovering behavioral design models from execution traces. In Proceedings of the 9th European Conference on Software Maintenance and Reengineering (CSMR 2005), pages 112–121. IEEE Computer Society, 2005.
[11] A. Hamou-Lhadj, T. C. Lethbridge, and L. Fu. Challenges and requirements for an effective trace exploration tool. In Proceedings of the 12th International Workshop on Program Comprehension (IWPC'04), pages 70–78. IEEE, 2004.
[12] D. Jerding and S. Rugaber. Using visualization for architectural localization and extraction. In Proceedings of the Fourth Working Conference on Reverse Engineering, 1997.
[13] D. F. Jerding and J. T. Stasko. The information mural: A technique for displaying and navigating large information spaces. IEEE Transactions on Visualization and Computer Graphics, 4(3):257–271, 1998.
[14] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[15] A. Lakhotia. Understanding someone else's code: Analysis of experiences. Journal of Systems and Software, pages 269–275, Dec. 1993.
[16] D. Ng, D. R. Kaeli, S. Kojarski, and D. H. Lorenz. Program comprehension using aspects. In ICSE 2004 Workshop WoDiSEE'2004, 2004.
[17] D. Spinellis. Code Reading: The Open Source Perspective. Addison-Wesley, 2003.
[18] E. Stroulia and T. Systä. Dynamic analysis for reverse engineering and program understanding. SIGAPP Appl. Comput. Rev., 10(1):8–17, 2002.
[19] T. Systä. Understanding the behavior of Java programs. In Proceedings of the Seventh Working Conference on Reverse Engineering, pages 214–223. IEEE, 2000.
[20] A. von Mayrhauser and A. Marie Vans. Program comprehension during software maintenance and evolution. Computer, 10(8):44–55, Aug. 1995.
[21] N. Wilde. Faster reuse and maintenance using software reconnaissance, 1994. Technical Report SERC-TR-75F, Software Engineering Research Center, CSE-301, University of Florida, CIS Department, Gainesville, FL.
[22] A. Zaidman, T. Calders, S. Demeyer, and J. Paredaens. Applying webmining techniques to execution traces to support the program comprehension process. In Proceedings of the 9th European Conference on Software Maintenance and Reengineering (CSMR 2005), pages 134–142. IEEE Computer Society, 2005.