Mining Spatial-temporal Clusters from Geo-databases
The clustering mineps value

The clustering mineps value is a metric for evaluating clustering results: it helps us judge how good a clustering is. In cluster analysis we usually partition a data set into clusters so that points within a cluster are highly similar to one another, while points in different clusters are dissimilar. The clustering mineps value is an index that quantifies this similarity.

First, the clustering mineps value can be obtained by computing, for each data point, the average distance to the other points of its cluster. If the average distance between the points of a cluster is small, the points are highly similar and the cluster is compact; correspondingly, the clustering mineps value is small. If the average distance between the points of a cluster is large, the points are less similar and the cluster is loose; correspondingly, the clustering mineps value is large.

Second, the clustering mineps value can also be used to assess the differences between alternative clustering results. When we try different clustering algorithms or tune their parameters, the clustering mineps value helps us compare the quality of the resulting partitions. A clustering with a small mineps value is of higher overall quality, since the points within each cluster are more similar; conversely, a clustering with a large mineps value may be problematic, because the points within its clusters are less similar.

Finally, the clustering mineps value is simple to compute. We first obtain a clustering with some algorithm, such as k-means or hierarchical clustering. Then, for each cluster, we compute the average distance between its points; summing these averages gives the clustering mineps value. In practice, different clustering algorithms and evaluation measures can be chosen according to the problem at hand, in order to obtain better clusterings.

In summary, the clustering mineps value is a metric for evaluating clustering results and judging how good a clustering is. By computing the average distance between the points of each cluster, it reflects the similarity of the points and the compactness of the clusters. It can also be used to compare alternative clustering results and thus to choose a suitable clustering algorithm and parameter setting. In practice, the algorithm and evaluation measure should be selected according to the situation, in order to obtain better clusterings.
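A minimal sketch of the computation just described, assuming the mineps value is the sum over all clusters of the mean pairwise distance within each cluster (this reading of the definition, and the NumPy-based implementation, are assumptions made for illustration):

```python
import numpy as np

def cluster_mineps(X, labels):
    """Sum over clusters of the mean pairwise distance within each cluster.

    X      : (n_samples, n_features) array of points
    labels : (n_samples,) array of cluster assignments
    Smaller values indicate tighter, more compact clusters.
    """
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        n = len(pts)
        if n < 2:
            continue  # a singleton cluster contributes no intra-cluster distance
        diff = pts[:, None, :] - pts[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
        total += d.sum() / (n * (n - 1))        # mean over distinct pairs
    return total

# Example: two well-separated blobs give a much smaller value than a random split
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 8])
good = np.repeat([0, 1], 50)
bad = np.tile([0, 1], 50)
print(cluster_mineps(X, good), cluster_mineps(X, bad))
```

The comparison at the end mirrors the usage described above: the partition with the smaller value keeps similar points together.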
A review on time series data mining

A review on time series data mining

Tak-chung Fu, Department of Computing, Hong Kong Polytechnic University, Hunghom, Kowloon, Hong Kong

Article history: Received 19 February 2008; received in revised form 14 March 2010; accepted 4 September 2010.

Keywords: Time series data mining; Representation; Similarity measure; Segmentation; Visualization

Abstract: Time series is an important class of temporal data objects and it can be easily obtained from scientific and financial applications. A time series is a collection of observations made chronologically. The nature of time series data includes: large data size, high dimensionality and the necessity to update continuously. Moreover, time series data, which is characterized by its numerical and continuous nature, is always considered as a whole instead of as individual numerical fields. The increasing use of time series data has initiated a great deal of research and development attempts in the field of data mining. The abundant research on time series data mining in the last decade could hamper the entry of interested researchers, due to its complexity. In this paper, a comprehensive review of the existing time series data mining research is given. These works are generally categorized into representation and indexing, similarity measure, segmentation, visualization and mining. Moreover, state-of-the-art research issues are also highlighted. The primary objective of this paper is to serve as a glossary for interested researchers to have an overall picture of the current time series data mining development and identify their potential research directions for further investigation.

1. Introduction

Recently, the increasing use of temporal data, in particular time series data, has initiated various research and development attempts in the field of data mining. Time series is an important class of temporal data objects, and it can be easily obtained from scientific and financial applications (e.g. electrocardiogram (ECG), daily temperature, weekly sales totals, and prices of mutual funds and stocks). A time series is a collection of observations made chronologically. The nature of time series data includes: large data size, high dimensionality and continuous updates. Moreover, time series data, which is characterized by its numerical and continuous nature, is always considered as a whole instead of as individual numerical fields. Therefore, unlike traditional databases where similarity search is exact match based, similarity search in time series data is typically carried out in an approximate manner.

There are various kinds of time series data related research, for example, finding similar time series (Agrawal et al., 1993a; Berndt and Clifford, 1996; Chan and Fu, 1999), subsequence searching in time series (Faloutsos et al., 1994), dimensionality reduction (Keogh, 1997b; Keogh et al., 2000) and segmentation (Abonyi et al., 2005). These topics have been studied in considerable detail by both the database and pattern recognition communities for different domains of time series data (Keogh and Kasetty, 2002).

In the context of time series data mining, the fundamental problem is how to represent the time series data. One of the common approaches is transforming the time series to another domain for dimensionality reduction, followed by an indexing mechanism. Moreover, similarity measure between time series or time series subsequences and segmentation are two core tasks for various time series mining tasks. Based on the time series representation, different mining tasks can be found in the literature and they can be roughly classified into
four fields: pattern discovery and clustering, classification, rule discovery and summarization. Some of the research concentrates on one of these fields, while other work may focus on more than one of the above processes. In this paper, a comprehensive review of the existing time series data mining research is given. Three state-of-the-art time series data mining issues, namely streaming, multi-attribute time series data and privacy, are also briefly introduced.

The remaining part of this paper is organized as follows: Section 2 contains a discussion of time series representation and indexing. The concept of similarity measure, which includes both whole time series and subsequence matching, based on the raw time series data or the transformed domain, is reviewed in Section 3. The research work on time series segmentation and visualization is discussed in Sections 4 and 5, respectively. In Section 6, various time series data mining tasks and recent time series data mining directions are reviewed, whereas the conclusion is made in Section 7.

2. Time series representation and indexing

One of the major reasons for time series representation is to reduce the dimension (i.e. the number of data points) of the original data. The simplest method perhaps is sampling (Astrom, 1969). In this method, a rate of m/n is used, where m is the length of a time series P and n is the dimension after dimensionality reduction (Fig. 1). However, the sampling method has the drawback of distorting the shape of the sampled/compressed time series if the sampling rate is too low.

An enhanced method is to use the average (mean) value of each segment to represent the corresponding set of data points. Again, with time series P = (p_1, ..., p_m) and n being the dimension after dimensionality reduction, the "compressed" time series P̂ = (p̂_1, ..., p̂_n) can be obtained by

$$\hat{p}_k = \frac{1}{e_k - s_k + 1} \sum_{i=s_k}^{e_k} p_i \qquad (1)$$

where s_k and e_k denote the starting and ending data points of the k-th segment in the time series P, respectively (Fig. 2). That is, the segmented means are used to represent the time series (Yi and Faloutsos, 2000). This method is also called piecewise aggregate approximation (PAA) by Keogh et al. (2000).[1] Keogh et al. (2001a) propose an extended version called adaptive piecewise constant approximation (APCA), in which the length of each segment is not fixed, but adaptive to the shape of the series. A signature technique is proposed by Faloutsos et al. (1997) with similar ideas. Besides using the mean to represent each segment, other methods are proposed. For example, Lee et al. (2003) propose to use the segmented sum of variation (SSV) to represent each segment of the time series. Furthermore, a bit level approximation is proposed by Ratanamahatana et al. (2005) and Bagnall et al. (2006), which uses a bit to represent each data point.

To reduce the dimension of time series data, another approach is to approximate a time series with straight lines. Two major categories are involved. The first one is linear interpolation. A common method is using piecewise linear representation (PLR)[2] (Keogh, 1997b; Keogh and Smyth, 1997; Smyth and Keogh, 1997).
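As a concrete illustration of Eq. (1), here is a minimal sketch of PAA with equal-length segments; choosing the segment boundaries s_k and e_k as an even split of the series is an assumption made for simplicity (APCA, mentioned above, would instead adapt them to the shape of the series):

```python
import numpy as np

def paa(p, n):
    """Piecewise aggregate approximation: represent a series p by n segment means.

    Assumes equal-length segments (the series is trimmed to a multiple of n).
    """
    p = np.asarray(p, dtype=float)
    seg = len(p) // n                       # points per segment
    # mean of each segment p[s_k .. e_k], as in Eq. (1)
    return p[:seg * n].reshape(n, seg).mean(axis=1)

# An 8-point series reduced to 4 segment means
print(paa([1, 3, 7, 5, 4, 6, 2, 0], 4))    # -> [2. 6. 5. 1.]
```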
The approximating line for the subsequence P(p_i, ..., p_j) is simply the line connecting the data points p_i and p_j. It tends to closely align the endpoints of consecutive segments, giving a piecewise approximation with connected lines. PLR is a bottom-up algorithm. It begins with creating a fine approximation of the time series, so that m/2 segments are used to approximate the m-length time series, and iteratively merges the lowest cost pair of segments until the required number of segments is met. When the pair of adjacent segments S_i and S_{i+1} are merged, the cost of merging the new segment with its right neighbor and the cost of merging the S_{i+1} segment with its new larger neighbor are calculated. Ge (1998) extends PLR to a hierarchical structure. Furthermore, Keogh and Pazzani enhance PLR by considering weights of the segments (Keogh and Pazzani, 1998) and relevance feedback from the user (Keogh and Pazzani, 1999). The second approach is linear regression, which represents the subsequences with the best fitting lines (Shatkay and Zdonik, 1996).

Furthermore, reducing the dimension by preserving the salient points is a promising method. These points are called perceptually important points (PIP). The PIP identification process is first introduced by Chung et al. (2001) and used for pattern matching of technical (analysis) patterns in financial applications. With the time series P, there are n data points: P_1, P_2, ..., P_n. All the data points in P can be reordered by their importance by going through the PIP identification process. The first data point P_1 and the last data point P_n in the time series are the first two PIPs. The next PIP that is found will be the point in P with maximum distance to the first two PIPs. The fourth PIP that is found will be the point in P with maximum vertical distance to the line joining its two adjacent PIPs, either in between the first and second PIPs or in between the second and the last PIPs. The PIP location process continues until all the points in P are attached to a reordered list L or the required number of PIPs is reached (i.e.
reduced to the required dimension). Seven PIPs are identified from the sample time series in Fig. 3. Detailed treatment can be found in Fu et al. (2008c). The idea is similar to a technique proposed about 30 years ago for reducing the number of points required to represent a line by Douglas and Peucker (1973) (see also Hershberger and Snoeyink, 1992). Perng et al. (2000) use a landmark model to identify the important points in the time series for similarity measure. Man and Wong (2001) propose a lattice structure to represent the identified peaks and troughs (called control points) in the time series. Pratt and Fink (2002) and Fink et al. (2003) define extrema as minima and maxima in a time series and compress the time series by selecting only certain important extrema and dropping the other points. The idea is to discard minor fluctuations and keep major minima and maxima. The compression is controlled by the compression ratio with a parameter R, which is always greater than one; an increase of R leads to the selection of fewer points. That is, given indices i and j, where i ≤ x ≤ j, a point p_x of a series P is an important minimum if p_x is the minimum among p_i, ..., p_j, and p_i/p_x ≥ R and p_j/p_x ≥ R. Similarly, p_x is an important maximum if p_x is the maximum among p_i, ..., p_j, and p_x/p_i ≥ R and p_x/p_j ≥ R. This algorithm takes linear time and constant memory. It outputs the values and indices of all important points, as well as the first and last point of the series. This algorithm can also process new points as they arrive, without storing the original series. It identifies important points based on local information of each segment (subsequence) of the time series. Recently, a critical point model (CPM) (Bao, 2008) and a high-level representation based on a sequence of critical points (Bao and Yang, 2008) have been proposed for financial data analysis. On the other hand, special points are introduced to restrict the error on PLR (Jia et al., 2008). Key points are suggested to represent time series in Leng et al. (2009) for anomaly detection.

Fig. 1. Time series dimensionality reduction by sampling. The time series on the left is sampled regularly (denoted by dotted lines) and displayed on the right with a large distortion.
Fig. 2. Time series dimensionality reduction by PAA. The horizontal dotted lines show the mean of each segment.
[1] This method is called piecewise constant approximation originally (Keogh and Pazzani, 2000a).
[2] It is also called piecewise linear approximation (PLA).

Another common family of time series representation approaches converts the numeric time series to symbolic form. That is, first discretizing the time series into segments, then converting each segment into a symbol (Yang and Zhao, 1998; Yang et al., 1999; Motoyoshi et al., 2002; Aref et al., 2004). Lin et al. (2003, 2007) propose a method called symbolic aggregate approximation (SAX) to convert the result from PAA to a symbol string. The distribution space (y-axis) is divided into equiprobable regions. Each region is represented by a symbol and each segment can then be mapped into the symbol corresponding to the region in which it resides. The transformed time series P̂ obtained using PAA is finally converted to a symbol string SS(s_1, ..., s_W). In between, two parameters must be specified for the conversion. They are the length of subsequence w and the alphabet size A (number of symbols used).

Besides using the means of the segments to build the alphabets, another method uses the volatility change to build the
alphabets. Jonsson and Badal (1997) use the "Shape Description Alphabet (SDA)". Example symbols like highly increasing transition, stable transition, and slightly decreasing transition are adopted. Qu et al. (1998) use gradient alphabets like upward, flat and downward as symbols. Huang and Yu (1999) suggest transforming the time series to a symbol string using the change ratio between contiguous data points. Megalooikonomou et al. (2004) propose to represent each segment by a codeword from a codebook of key-sequences. This work has been extended to a multi-resolution consideration (Megalooikonomou et al., 2005). Morchen and Ultsch (2005) propose an unsupervised discretization process based on quality score and persisting states. Instead of ignoring the temporal order of values like many other methods, the Persist algorithm incorporates temporal information. Furthermore, subsequence clustering is a common method to generate the symbols (Das et al., 1998; Li et al., 2000a; Hugueney and Meunier, 2001; Hebrail and Hugueney, 2001). A multiple abstraction level mining (MALM) approach is proposed by Li et al. (1998), which is based on the symbolic form of the time series. The symbols in this paper are determined by clustering the features of each segment, such as regression coefficients, mean square error and higher order statistics based on the histogram of the regression residuals.

Most of the methods described so far represent time series in the time domain directly. Representing time series in the transformation domain is another large family of approaches. One of the popular transformation techniques in time series data mining is the discrete Fourier transform (DFT), since first being proposed for use in this context by Agrawal et al. (1993a). Rafiei and Mendelzon (2000) develop similarity-based queries using DFT. Janacek et al. (2005) propose to use likelihood ratio statistics to test the hypothesis of difference between series instead of a Euclidean distance in the transformed domain. Recent research uses the wavelet transform to represent time series (Struzik and Siebes, 1998). In between, the discrete wavelet transform (DWT) has been found to be effective in replacing DFT (Chan and Fu, 1999) and the Haar transform is always selected (Struzik and Siebes, 1999; Wang and Wang, 2000). The Haar transform is a series of averaging and differencing operations on a time series (Chan and Fu, 1999). The average and difference between every two adjacent data points are computed. For example, given a time series P = (1, 3, 7, 5), a dimension of 4 data points is the full resolution (i.e. the original time series); in the dimension of two coefficients, the averages are (2, 6) with the coefficients (−1, 1), and in the dimension of 1 coefficient, the average is 4 with coefficient (−2). A multi-level representation of the wavelet transform is proposed by Shahabi et al. (2000). Popivanov and Miller (2002) show that a large class of wavelet transformations can be used for time series representation. Dasha et al. (2007) compare different wavelet feature vectors. On the other hand, a comparison between DFT and DWT can be found in Wu et al. (2000b) and Morchen (2003), and a combined use of Fourier and wavelet transforms is presented in Kawagoe and Ueda (2002). An ensemble index is proposed by Keogh et al. (2001b) and Vlachos et al. (2006), which ensembles two or more representations for indexing.

Principal component analysis (PCA) is a popular multivariate technique used for developing multivariate statistical process monitoring methods (Yang and Shahabi, 2005b; Yoon et al., 2005) and it is applied to analyze financial time series
by Lesch et al. (1999). In most of the related works, PCA is used to eliminate the less significant components or sensors, reduce the data representation only to the most significant ones and plot the data in two dimensions. The PCA model defines a linear hyperplane, so it can be considered as the multivariate extension of PLR. PCA maps the multivariate data into a lower dimensional space, which is useful in the analysis and visualization of correlated high-dimensional data. Singular value decomposition (SVD) (Korn et al., 1997) is another transformation-based approach. Other time series representation methods include modeling time series using hidden Markov models (HMMs) (Azzouzi and Nabney, 1998), and a compression technique for multiple streams is proposed by Deligiannakis et al. (2004). It is based on a base signal, which encodes piecewise linear correlations among the collected data values. In addition, a recent biased dimension reduction technique is proposed by Zhao and Zhang (2006) and Zhao et al. (2006).

Fig. 3. Time series compression by data point importance. The time series on the left is represented by seven PIPs on the right.

Moreover, many of the representation schemes described above are incorporated with different indexing methods. A common approach is to adopt an existing multidimensional indexing structure (e.g. the R-tree proposed by Guttman (1984)) for the representation. Agrawal et al. (1993a) propose an F-index, which adopts the R*-tree (Beckmann et al., 1990) to index the first few DFT coefficients. An ST-index is further proposed by Faloutsos et al. (1994), which extends the previous work for subsequence handling. Agrawal et al. (1995a) adopt both the R*- and R+-tree (Sellis et al., 1987) as the indexing structures. A multi-level distance based index structure is proposed (Yang and Shahabi, 2005a) for indexing time series represented by PCA. Vlachos et al. (2005a) propose a Multi-Metric (MM) tree, which is a hybrid indexing structure on Euclidean and periodic spaces. The minimum bounding rectangle (MBR) is also a common technique for time series indexing (Chu and Wong, 1999; Vlachos et al., 2003). An MBR is adopted in Rafiei (1999), where an MT-index is developed based on the Fourier transform, and in Kahveci and Singh (2004), where a multi-resolution index is proposed based on the wavelet transform. Chen et al. (2007a) propose an indexing mechanism for the PLR representation. On the other hand, Kim et al. (1996) propose an index structure called the TIP-index (TIme series Pattern index) for manipulating time series pattern databases. The TIP-index is developed by improving the extended multidimensional dynamic index file (EMDF) (Kim et al., 1994).
iSAX (Shieh and Keogh, 2009) is proposed to index massive time series and is developed based on SAX. A multi-resolution indexing structure is proposed by Li et al. (2004), which can be adapted to different representations.

To sum up, for a given index structure, the efficiency of indexing depends only on the precision of the approximation in the reduced dimensionality space. However, in choosing a dimensionality reduction technique, we cannot simply choose an arbitrary compression algorithm. It requires a technique that produces an indexable representation. For example, many time series can be efficiently compressed by delta encoding, but this representation does not lend itself to indexing. In contrast, SVD, DFT, DWT and PAA all lend themselves naturally to indexing, with each eigenwave, Fourier coefficient, wavelet coefficient or aggregate segment mapping onto one dimension of an index tree. Post-processing is then performed by computing the actual distance between sequences in the time domain and discarding any false matches.

3. Similarity measure

Similarity measure is of fundamental importance for a variety of time series analysis and data mining tasks. Most of the representation approaches discussed in Section 2 also propose a similarity measure method on the transformed representation scheme. In traditional databases, similarity measure is exact match based. However, in time series data, which is characterized by its numerical and continuous nature, similarity measure is typically carried out in an approximate manner. Considering stock time series, one may expect queries like:

Query 1: find all stocks which behave "similar" to stock A.
Query 2: find all "head and shoulders" patterns lasting for a month in the closing prices of all high-tech stocks.

The query results are expected to provide useful information for different stock analysis activities. Queries like Query 2 are in fact tightly coupled with the patterns frequently used in technical analysis, e.g. double top/bottom, ascending triangle, flag and rounded top/bottom.

In the time series domain, devising an appropriate similarity function is by no means trivial. There are essentially two ways the data might be organized and processed (Agrawal et al., 1993a). In whole sequence matching, the whole length of all time series is considered during the similarity search. It requires comparing the query sequence to each candidate series by evaluating the distance function and keeping track of the sequence with the smallest distance. In subsequence matching, where a query sequence Q and a longer sequence P are given, the task is to find the subsequences in P which match Q.
Subsequence matching requires that the query sequence Q be placed at every possible offset within the longer sequence P. With respect to Query 1 and Query 2 above, they can be considered as a whole sequence matching and a subsequence matching, respectively. Gavrilov et al. (2000) study the usefulness of different similarity measures for clustering similar stock time series.

3.1. Whole sequence matching

To measure the similarity/dissimilarity between two time series, the most popular approach is to evaluate the Euclidean distance on a transformed representation like the DFT coefficients (Agrawal et al., 1993a) or the DWT coefficients (Chan and Fu, 1999). Although most of these approaches guarantee a lower bound of the Euclidean distance to the original data, the Euclidean distance is not always the suitable distance function in specific domains (Keogh, 1997a; Perng et al., 2000; Megalooikonomou et al., 2005). For example, stock time series have their own characteristics compared with other time series data (e.g. data from scientific areas like ECG), in which the salient points are important.

Besides Euclidean-based distance measures, other distance measures can easily be found in the literature. A constraint-based similarity query is proposed by Goldin and Kanellakis (1995), which extends the work of Agrawal et al. (1993a). Das et al. (1997) apply computational geometry methods for similarity measure. Bozkaya et al. (1997) use a modified edit distance function for time series matching and retrieval. Chu et al. (1998) propose to measure the distance based on the slopes of the segments for handling amplitude and time scaling problems. A projection algorithm is proposed by Lam and Wong (1998). A pattern recognition method is proposed by Morrill (1998), which is based on the building blocks of the primitives of the time series.
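The following is a minimal sketch of the whole-sequence matching idea just described: two series are compared by the Euclidean distance evaluated on their first few DFT coefficients, which (by Parseval's theorem, with an orthonormal transform) never exceeds the Euclidean distance on the raw series, so no true matches are dismissed and false alarms can be removed by a post-check. It illustrates the general lower-bounding approach, not the exact indexing scheme of any of the cited papers; the number of coefficients k is an arbitrary choice.

```python
import numpy as np

def dft_features(x, k):
    """First k complex DFT coefficients of a series, orthonormally scaled."""
    return np.fft.fft(np.asarray(x, dtype=float), norm="ortho")[:k]

def dft_distance(x, y, k=4):
    """Euclidean distance evaluated on the first k DFT coefficients.

    With the orthonormal scaling, this lower-bounds the true Euclidean
    distance between x and y.
    """
    fx, fy = dft_features(x, k), dft_features(y, k)
    return float(np.sqrt(np.sum(np.abs(fx - fy) ** 2)))

# Toy query: rank candidate series by the approximate distance, then verify exactly
rng = np.random.default_rng(0)
query = np.sin(np.linspace(0, 4 * np.pi, 64))
candidates = [query + rng.normal(0, s, 64) for s in (0.1, 0.5, 1.0)]
approx = [dft_distance(query, c) for c in candidates]
exact = [float(np.linalg.norm(query - c)) for c in candidates]
print(approx)   # each value is no larger than the corresponding exact distance
print(exact)
```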
Ruspini and Zwir (1999) address the automated identification of significant qualitative features of complex objects. They propose a process for the discovery and representation of interesting relations between those features, and for the generation of structured indexes and textual annotations describing the features and their relations. The discovery of knowledge by analyzing collections of qualitative descriptions is then achieved. They focus on methods for the succinct description of interesting features lying on an effective frontier. Generalized clustering is used for extracting features which interest domain experts. Generalized Markov models are adopted for waveform matching in Ge and Smyth (2000). A content-based query-by-example retrieval model called FALCON is proposed by Wu et al. (2000a), which incorporates a feedback mechanism.

Indeed, one of the most popular and field-tested similarity measures is the "time warping" distance measure. Based on the dynamic time warping (DTW) technique, the method proposed in Berndt and Clifford (1994) predefines some patterns to serve as templates for the purpose of pattern detection. To align two time series, P and Q, using DTW, an n-by-m matrix M is first constructed. The (i-th, j-th) element of the matrix, m_ij, contains the distance d(q_i, p_j) between the two points q_i and p_j, and a Euclidean distance is typically used, i.e. d(q_i, p_j) = (q_i − p_j)². It corresponds to the alignment between the points q_i and p_j. A warping path, W, is a contiguous set of matrix elements that defines a mapping between Q and P. Its k-th element is defined as w_k = (i_k, j_k), and

$$W = w_1, w_2, \ldots, w_k, \ldots, w_K \qquad (2)$$

where $\max(m, n) \le K < m + n - 1$.

The warping path is typically subject to the following constraints: boundary conditions, continuity and monotonicity. The boundary conditions are w_1 = (1, 1) and w_K = (m, n). This requires the warping path to start and finish diagonally. The next constraint is continuity: given w_k = (a, b), then w_{k−1} = (a′, b′), where a − a′ ≤ 1 and b − b′ ≤ 1. This restricts the allowable steps in the warping path to adjacent cells, including the diagonally adjacent cell. Also, the constraints a − a′ ≥ 0 and b − b′ ≥ 0 force the points in W to be monotonically spaced in time. There is an exponential number of warping paths satisfying the above conditions. However, only the path that minimizes the warping cost is of interest. This path can be efficiently found by using dynamic programming (Berndt and Clifford, 1996) to evaluate the following recurrence, which defines the cumulative distance g(i, j) as the distance d(i, j) found in the current cell plus the minimum of the cumulative distances of the adjacent elements, i.e.

$$g(i, j) = d(q_i, p_j) + \min\{\, g(i-1, j-1),\; g(i-1, j),\; g(i, j-1) \,\} \qquad (3)$$

A warping path, W, such that the "distance" between the two series is minimized, can then be calculated by

$$DTW(Q, P) = \min_{W}\left[\sum_{k=1}^{K} d(w_k)\right] \qquad (4)$$

where d(w_k) can be defined as

$$d(w_k) = d(q_{i_k}, p_{j_k}) = (q_{i_k} - p_{j_k})^2 \qquad (5)$$

Detailed treatment can be found in Kruskall and Liberman (1983).

As DTW is computationally expensive, different methods have been proposed to speed up the DTW matching process. Different constraint (banding) methods, which control the subset of the matrix that the warping path is allowed to visit, are reviewed in Ratanamahatana and Keogh (2004). Yi et al. (1998) introduce a technique for approximate indexing of DTW that utilizes a FastMap technique, which filters the non-qualifying series. Kim et al. (2001) propose an indexing approach under the DTW similarity measure. Keogh and Pazzani (2000b) introduce a
modification of DTW which integrates with PAA and operates on a higher level abstraction of the time series. An exact indexing approach, which is based on representing the time series by PAA for the DTW similarity measure, is further proposed by Keogh (2002). An iterative deepening dynamic time warping (IDDTW) is suggested by Chu et al. (2002), which is based on a probabilistic model of the approximation errors for all levels of approximation prior to the query process. Chan et al. (2003) propose a filtering process based on the Haar wavelet transformation from a low resolution approximation of the real time warping distance. Shou et al. (2005) use an APCA approximation to compute the lower bounds for the DTW distance. They improve the global bound proposed by Kim et al. (2001), which can be used to index the segments, and propose a multi-step query processing technique. FastDTW is proposed by Salvador and Chan (2004). This method uses a multi-level approach that recursively projects a solution from a coarse resolution and refines the projected solution. Similarly, a fast DTW search method, FTW, is proposed by Sakurai et al. (2005) for efficiently pruning a significant number of search candidates. Ratanamahatana and Keogh (2005) clarify some points about DTW that are related to lower bounding and speed. Euachongprasit and Ratanamahatana (2008) also focus on this problem. A sequentially indexed structure (SIS) is proposed by Ruengronghirunya et al. (2009) to balance the tradeoff between indexing efficiency and I/O cost during DTW similarity measure; a lower bounding function for groups of time series, LBG, is adopted.

On the other hand, Keogh and Pazzani (2001) point out the potential problems of DTW: it can lead to unintuitive alignments, where a single point on one time series maps onto a large subsection of another time series. Also, DTW may fail to find obvious and natural alignments in two time series because of a single feature (i.e. peak, valley, inflection point, plateau, etc.).
One of the causes is the great difference between the lengths of the compared series. Therefore, besides improving the performance of DTW, methods have also been proposed to improve the accuracy of DTW. Keogh and Pazzani (2001) propose a modification of DTW that considers the higher level feature of shape for better alignment. Ratanamahatana and Keogh (2004) propose to learn arbitrary constraints on the warping path. Regression time warping (RTW) is proposed by Lei and Govindaraju (2004) to address the challenges of shifting, scaling and robustness. Latecki et al. (2005) propose a method called minimal variance matching (MVM) for elastic matching. It determines a subsequence of the time series that best matches a query series by finding the cheapest path in a directed acyclic graph. A segment-wise time warping distance (STW) is proposed by Zhou and Wong (2005) for time scaling search. Fu et al. (2008a) propose a scaled and warped matching (SWM) approach for handling both DTW and uniform scaling simultaneously. Different customized DTW techniques have been applied to the field of music research for query by humming (Zhu and Shasha, 2003; Arentz et al., 2005).

Focusing on similar problems as DTW, the Longest Common Subsequence (LCSS) model (Vlachos et al., 2002) is proposed. The LCSS is a variation of the edit distance and the basic idea is to match two sequences by allowing them to stretch, without rearranging the order of the elements, but allowing some elements to be unmatched. One of the important advantages of LCSS over DTW is its treatment of outliers. Chen et al. (2005a) further introduce a distance function based on the edit distance on real sequences (EDR), which is robust against data imperfection. Morse and Patel (2007) propose a Fast Time Series Evaluation (FTSE) method which can be used to evaluate the threshold value of these kinds of techniques in a faster way.

Threshold-based distance functions are proposed by Aßfalg et al. (2006). The proposed function considers intervals during which the time series exceeds a certain threshold for comparing time series, rather than using the exact time series values. A T-Time application is developed (Aßfalg et al., 2008) to demonstrate its usage. Fu et al. (2007) further suggest introducing rules to govern the pattern matching process if a priori knowledge exists in the given domain.

A parameter-light distance measure method based on Kolmogorov complexity theory is suggested in Keogh et al. (2007b). The compression-based dissimilarity measure (CDM)[3] is adopted in that paper. Chen et al. (2005b) present a histogram-based representation for similarity measure. Similarly, a histogram-based similarity measure, bag-of-patterns (BOP), is proposed by Lin and Li (2009), which is based on the frequency of occurrences of each pattern in the time series.

[3] CDM is proposed by Keogh et al. (2004), which is used to compare the co-compressibility between data sets.
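As a concrete illustration of the DTW recurrence in Eqs. (3)-(5) above, the following is a minimal dynamic-programming sketch of the quadratic baseline, without the banding, lower-bounding or multi-resolution speedups surveyed above:

```python
import numpy as np

def dtw_distance(q, p):
    """Classic O(n*m) DTW using the cumulative-distance recurrence g(i, j).

    q, p : 1-D sequences; squared point-wise differences (Eq. 5) are
    accumulated and the minimal total warping cost is returned.
    """
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    n, m = len(q), len(p)
    g = np.full((n + 1, m + 1), np.inf)
    g[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (q[i - 1] - p[j - 1]) ** 2
            # g(i, j) = d(q_i, p_j) + min of the three admissible predecessors
            g[i, j] = cost + min(g[i - 1, j - 1], g[i - 1, j], g[i, j - 1])
    return g[n, m]

# Two series with the same shape but shifted in time align with low cost,
# while a flat series of a different level does not.
a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]
print(dtw_distance(a, b), dtw_distance(a, [2, 2, 2, 2, 2, 2, 2]))
```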
Microbial Community Structure

Microbial communities are complex and dynamic entities that play a crucial role in various ecosystems. They are composed of a diverse array of microorganisms, including bacteria, archaea, fungi, and viruses, which interact with each other and their environment in intricate ways. Understanding the structure of these communities is essential for gaining insights into their function and the processes they mediate.

The composition of microbial communities can be influenced by a multitude of factors, such as environmental conditions, availability of nutrients, and interactions with other organisms. For instance, temperature and pH can significantly affect the types of microorganisms that thrive in a particular habitat. Similarly, the presence of certain chemicals or the absence of others can shape the community structure by favoring the growth of specific microbes.

One of the key aspects of microbial community structure is the concept of species richness, which refers to the number of different species present in a community. A high species richness is often associated with a more stable and resilient community, as it can buffer against environmental changes and disturbances. Conversely, a low species richness may indicate a less stable community that is more susceptible to fluctuations and perturbations.

Another important aspect of microbial community structure is the relative abundance of different species. Some species may be dominant and constitute a large proportion of the community, while others may be rare and present in only small numbers. The distribution of species abundances can provide valuable information about the ecological roles of different microbes and their interactions with each other.

The spatial distribution of microorganisms within a community is also a critical factor that influences its structure. Microbes can be distributed evenly throughout a habitat or they may form distinct clusters or patches. This spatial organization can be driven by various factors, such as the availability of resources, the presence of physical barriers, or the influence of other organisms.

The temporal dynamics of microbial communities are another important consideration. Over time, the composition and structure of a community can change in response to various factors, such as seasonal variations, successional processes, or disturbances. Monitoring these changes can provide valuable insights into the factors that drive community dynamics and the mechanisms that maintain stability and resilience.

Finally, it is important to recognize that microbial communities are not isolated entities but are interconnected with other components of the ecosystem. They interact with plants, animals, and other microbes in complex networks of relationships that can influence their structure and function. Understanding these interactions is essential for a comprehensive understanding of ecosystem processes and the role of microbes within them.

In conclusion, the study of microbial community structure is a multifaceted and complex endeavor that requires consideration of a wide range of factors and processes. By examining the composition, species richness, relative abundance, spatial distribution, temporal dynamics, and ecological interactions of microbial communities, we can gain a deeper understanding of their function and the roles they play in ecosystems. This knowledge is crucial for managing and conserving these vital components of our natural world.
Data Mining: Concepts and Techniques

Types of Outliers (I)
Three kinds: global, contextual and collective outliers

Global outlier (or point anomaly)
- An object is a global outlier Og if it significantly deviates from the rest of the data set
- Ex. intrusion detection in computer networks
- Issue: find an appropriate measurement of deviation

Contextual outlier (or conditional outlier)
- An object is a contextual outlier Oc if it deviates significantly based on a selected context
- Ex. 80 °F in Urbana: outlier? (depending on summer or winter?)
- Attributes of data objects should be divided into two groups:
  - Contextual attributes: define the context, e.g., time and location
  - Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
- Can be viewed as a generalization of local outliers, whose density significantly deviates from the local area
- Issue: how to define or formulate a meaningful context?
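A minimal sketch of the contextual-outlier idea from the slide: the same temperature reading is judged against the values observed in its own context (here, the month), not against the whole data set. The data, the z-score rule and the threshold are made up purely for illustration.

```python
import numpy as np

# (contextual attribute, behavioral attribute): month and temperature in °F
readings = [("Jan", 28), ("Jan", 31), ("Jan", 25), ("Jan", 30), ("Jan", 27), ("Jan", 80),
            ("Jul", 82), ("Jul", 85), ("Jul", 79), ("Jul", 88), ("Jul", 84), ("Jul", 80)]

def contextual_outliers(data, z_thresh=2.0):
    """Flag objects whose behavioral value deviates strongly within their own context."""
    by_ctx = {}
    for ctx, val in data:
        by_ctx.setdefault(ctx, []).append(val)
    stats = {ctx: (np.mean(v), np.std(v) or 1.0) for ctx, v in by_ctx.items()}
    return [(ctx, val) for ctx, val in data
            if abs(val - stats[ctx][0]) / stats[ctx][1] > z_thresh]

print(contextual_outliers(readings))   # [('Jan', 80)]: 80 °F is unusual in January, not in July
```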
Leveraging the power of local spatial autocorrelation in geophysical interpolative clustering

Data Mining and Knowledge Discovery, DOI 10.1007/s10618-014-0372-z

Leveraging the power of local spatial autocorrelation in geophysical interpolative clustering

Annalisa Appice · Donato Malerba
Dipartimento di Informatica, Università degli Studi di Bari "Aldo Moro", Via Orabona 4, 70125 Bari, Italy
E-mail: annalisa.appice@uniba.it; donato.malerba@uniba.it

Received: 16 December 2012 / Accepted: 22 June 2014
Responsible editors: Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen and Filip Železný.

Abstract: Nowadays ubiquitous sensor stations are deployed worldwide, in order to measure several geophysical variables (e.g. temperature, humidity, light) for a growing number of ecological and industrial processes. Although these variables are, in general, measured over large zones and long (potentially unbounded) periods of time, stations cannot cover every space location. On the other hand, due to their huge volume, the data produced cannot be entirely recorded for future analysis. In this scenario, summarization, i.e. the computation of aggregates of data, can be used to reduce the amount of produced data stored on disk, while interpolation, i.e. the estimation of unknown data at each location of interest, can be used to supplement station records. We illustrate a novel data mining solution, named interpolative clustering, that has the merit of addressing both these tasks in time-evolving, multivariate geophysical applications. It yields a time-evolving clustering model, in order to summarize geophysical data, and computes a weighted linear combination of cluster prototypes, in order to predict data. Clustering is done by accounting for the local presence of the spatial autocorrelation property in the geophysical data. Weights of the linear combination are defined in order to reflect the inverse distance of the unseen data to each cluster geometry. The cluster geometry is represented through shape-dependent sampling of the geographic coordinates of clustered stations. Experiments performed with several data collections investigate the trade-off between the summarization capability and the predictive accuracy of the presented interpolative clustering algorithm.

Keywords: Spatial autocorrelation · Clustering · Inverse distance weighting · Geophysical data stream

1 Introduction

The widespread use of sensor networks has paved the way for the explosive living ubiquity of geophysical data streams (i.e. streams of data that are measured repeatedly over a set of locations). Procedurally, remote sensors are installed worldwide. They gather information along a number of variables over large zones and long (potentially unbounded) periods of time. In this scenario, the spatial distribution of data sources, as well as the temporal distribution of measures, pose new challenges in the collection and querying of data. Much scientific and industrial interest has recently been focused on the deployment of data management systems that gather continuous multivariate data from several data sources, recognize and possibly adapt a behavioral model, and deal with queries that concern present and past data, as well as seen and unseen data. This poses specific issues that include storing entirely (unbounded) data on disks with finite memory (Chiky and Hébrail 2008), as well as looking for predictions (estimations) where no measured data are available (Li and Heap 2008).

Summarization is one solution for addressing storage limits, while interpolation is one solution for supplementing unseen data. So far, both these tasks, namely summarization and interpolation, have been extensively
investigated in the literature. However, most of the studies consider only one task at a time. Several summarization algorithms, e.g. Rodrigues et al. (2008), Chen et al. (2010), Appice et al. (2013a), have been defined in data mining, in order to compute fast, compact summaries of geophysical data as they arrive. Data storage systems store the computed summaries, while discarding the actual data. Various interpolation algorithms, e.g. Shepard (1968b), Krige (1951), have been defined in geostatistics, in order to predict unseen measures of a geophysical variable. They use predictive inferences based upon actual measures sampled at specific locations of space. In this paper, we investigate a holistic approach that links predictive inferences to data summaries. Therefore, we introduce a summarization pattern of geophysical data, which can be computed to save memory space and make predictive inferences easier. We use predictive inferences that exploit the knowledge in data summaries, in order to yield accurate predictions covering any (seen and unseen) space location.

We begin by observing that the common factor of several summarization and interpolation algorithms is that they accommodate the spatial autocorrelation analysis in the learned model. Spatial autocorrelation is the cross-correlation of values of a variable strictly due to their relatively close locations on a two-dimensional surface. Spatial autocorrelation exists when there is systematic spatial variation in the values of a given property. This variation can exist in two forms, called positive and negative spatial autocorrelation (Legendre 1993). In the positive case, the value of a variable at a given location tends to be similar to the values of that variable in nearby locations. This means that if the value of some variable is low at a given location, the presence of spatial autocorrelation indicates that nearby values are also low. Conversely, negative spatial autocorrelation is characterized by dissimilar values at nearby locations.
Goodchild (1986) remarks that positive autocorrelation is seen much more frequently in practice than negative autocorrelation in geophysical variables. This is justified by Tobler's first law of geography, according to which "everything is related to everything else, but near things are more related than distant things" (Tobler 1979).

As observed by LeSage and Pace (2001), the analysis of spatial autocorrelation is crucial and can be fundamental for building a reliable spatial component into any (statistical) model for geophysical data. With the same viewpoint, we propose to: (i) model the property of spatial autocorrelation when collecting the data records of a number of geophysical variables, (ii) use this model to compute compact summaries of actual data that are discarded and (iii) inject the computed summaries into predictive inferences, in order to yield accurate estimations of geophysical data at any space location.

The paper is organized as follows. The next section clarifies the motivation and the actual contribution of this paper. In Sect. 3, related works regarding spatial autocorrelation, spatial interpolators and clustering are reported. In Sect. 4, we report the basics of the presented algorithm, while in Sect. 5, we describe the proposed algorithm. An experimental study with several data collections is presented in Sect. 6 and conclusions are drawn.

2 Motivation and contributions

The analysis of the property of spatial autocorrelation in geophysical data poses specific issues. One issue is that most of the models that represent and learn data with spatial autocorrelation are based on the assumption of spatial stationarity. Thus, they assume a constant mean and a constant variance (no outliers) across space (Stojanova et al. 2012). This means that possible significant variabilities in autocorrelation dependencies throughout the space are overlooked. The variability could be caused by a different underlying latent structure of the space, which varies among its portions in terms of scale of values or density of measures. As pointed out by Angin and Neville (2008), when autocorrelation varies significantly throughout space, it may be more accurate to model the dependencies locally rather than globally.

Another issue is that the spatial autocorrelation analysis is frequently decoupled from the multivariate analysis. In this case, the learning process accounts for the spatial autocorrelation of univariate data, while dealing with distinct variables separately (Appice et al. 2013b). Bailey and Krzanowski (2012) observe that ignoring complex interactions among multiple variables may overlook interesting insights into the correlation of potentially related variables at any site. Based upon this idea, Dray and Jombart (2011) formulate a multivariate definition of the concept of spatial autocorrelation, which centers on the extent to which the values of a number of variables observed at a given location show a systematic (more than likely under spatial randomness), homogeneous association with the values observed at the "neighboring" locations.

In this paper, we develop an approach to modeling non-stationary spatial autocorrelation of multivariate geophysical data by using interpolative clustering. As in clustering, clusters of records that are similar to each other at nearby locations are identified, but a cluster description and a predictive model are associated with each cluster. Data records are aggregated through clusters based on the cluster descriptions.
The associated predictive models, which provide predictions for the variables, are stored as summaries of the clustered data. On any future demand, the predictive models queried from the database are processed according to the requests, in order to yield accurate estimates for the variables. Interpolative clustering is a form of conceptual clustering (Michalski and Stepp 1983) since, besides the clusters themselves, it also provides symbolic descriptions (in the form of conjunctions of conditions) of the constructed clusters. Thus, we can also plan to consider this description, in order to obtain clusters in different contexts of the same domain. However, in contrast to conceptual clustering, interpolative clustering is a form of supervised learning. On the other hand, interpolative clustering is similar to predictive clustering (Blockeel et al. 1998), since it is a form of supervised learning. However, unlike predictive clustering, where the predictive space (target variables) is typically distinguished from the descriptive one (explanatory variables),[1] the variables of interpolative clustering play, in principle, both target and explanatory roles.

Interpolative clustering trees (ICTs) are a class of tree structured models where a split node is associated with a cluster and a leaf node with a single predictive model for the target variables of interest. The top node of the ICT contains the entire sample of training records. This cluster is recursively partitioned along the target variables into smaller sub-clusters. A predictive model (the mean) is computed for each target variable and then associated with each leaf. All the variables are predicted independently. In the context of this paper, an ICT is built by integrating the consideration of a local indicator of spatial autocorrelation, in order to account for significant variabilities of autocorrelation dependencies in the training data. Spatial autocorrelation is coupled with a multivariate analysis by accounting for the spatial dependence of data and their multivariate variance simultaneously (Dray and Jombart 2011). This is done by maximizing the variance reduction of local indicators of spatial autocorrelation computed for multivariate data when evaluating the candidates for adding a new node to the tree. This solution has the merit of improving both the summarization and the predictive performance of the computed models. Memory space is saved by storing a single summary for the data of multiple variables. Predictive accuracy is increased by exploiting the autocorrelation of data clustered in space.

From the summarization perspective, an ICT is used to summarize geophysical data according to a hierarchical view of the spatial autocorrelation. We can browse the generated clusters at different levels of the hierarchy. Predictive models on leaf clusters model spatial autocorrelation dependencies as stationary over the local geometry of the clusters.

[1] The predictive clustering framework is originally defined in Blockeel et al. (1998), in order to combine clustering problems and classification/regression problems. The predictive inference is performed by distinguishing between target variables and explanatory variables. Target variables are considered when evaluating the similarity between training data, such that training examples with similar target values are grouped in the same cluster, while training examples with dissimilar target values are grouped in separate clusters.
Explanatory variables are used to generate a symbolic description of the clusters. Although the algorithm presented in Blockeel et al. (1998) can, in principle, be run by considering the same set of variables for both explanatory and target roles, this case is not investigated in the original study.

From the interpolation perspective, an ICT is used to compute knowledge to make accurate predictive inferences easier. We can use Inverse Distance Weighting[2] (Shepard 1968b) to predict variables at a specific location by a weighted linear combination of the predictive models on the leaf clusters. Weights are inverse functions of the distance of the query point from the clusters.

Finally, we can observe that an ICT provides a static model of a geophysical phenomenon. Nevertheless, inferences based on static models of spatial autocorrelation require temporal stationarity of the statistical properties of the variables. In the geophysical context, data are frequently subject to the temporal variation of such properties. This requires dynamic models that can be updated continuously as new fresh data arrive (Gama 2010). In this paper, we propose an incremental algorithm for the construction of a time-adaptive ICT. When a new sample of records is acquired through the stations of a sensor network, a past ICT is modified, in order to model new data of the process, which may change their properties over time. In theory, a distinct tree can be learned for each time point and several trees can be subsequently combined using some general framework (e.g. Spiliopoulou et al. 2006) for tracking cluster evolution over time. However, this solution is prohibitively time-consuming when data arrive at a high rate. By taking into account that (1) geophysical variables are often slowly time-varying, and (2) a change of the properties of the data distribution of the variables is often restricted to a delimited group of stations, more efficient learning algorithms can be derived. In this paper, we propose an algorithm that retains past clusters as long as they discriminate between surfaces of spatially autocorrelated data, while it mines novel clusters only if the past ones become inaccurate. In this way, we can save computation time and track the evolution of the cluster model by detecting changes in the data properties at no additional cost.

The specific contributions of this paper are: (1) the investigation of the property of spatial autocorrelation in interpolative clustering; (2) the development of an approach that uses a local indicator of the spatial autocorrelation property, in order to build an ICT by taking into account non-stationarity in autocorrelation and multivariate analysis; (3) the development of an incremental algorithm to yield a time-evolving ICT that accounts for the fact that the statistical properties of the geophysical data may change over time; (4) an extensive evaluation of the effectiveness of the proposed (incremental) approach on several real geophysical data collections.

3 Related works

This work has been motivated by the research literature on the property of spatial autocorrelation and its influence on interpolation theory and (predictive) clustering.
In the following subsections, we report related works from these research lines.

[2] Inverse distance weighting is a common interpolation algorithm. It has several advantages that endorse its widespread use in geostatistics (Li and Revesz 2002; Karydas et al. 2009; Li et al. 2011): simplicity of implementation, lack of tunable parameters, and the ability to interpolate scattered data and work on any grid without suffering from multicollinearity.

3.1 Spatial autocorrelation

Spatial autocorrelation analysis quantifies the degree of spatial clustering (positive autocorrelation) or dispersion (negative autocorrelation) in the values of a variable measured over a set of point locations. In the traditional approach to spatial autocorrelation, the "overall pattern" of spatial dependence in data is summarized into a single indicator, such as the familiar Global Moran's I, Global Geary's C or Gamma indicators of spatial association (see Legendre 1993, for details). These indicators allow us to establish whether nearby locations tend to have similar (i.e. spatial clustering), random or different values (i.e. spatial dispersion) of a variable on the entire sample. They are referred to as global indicators of spatial autocorrelation, in contrast to the local indicators that we will consider in this paper. Procedurally, global indicators (see Getis 2008) are computed in order to indicate the degree of regional clustering over the entire distribution of a geophysical variable. Stojanova et al. (2013), as well as Appice et al. (2012), have recently investigated the addition of these indicators to predictive inferences. They show how global measures of spatial autocorrelation can be computed at multiple levels of a hierarchical clustering, in order to improve predictions that are consistently clustered in space. Despite the useful results of these studies, a limitation of global indicators of spatial autocorrelation is that they assume spatial stationarity across space (Angin and Neville 2008). While this may be useful when spatial associations are studied for the small areas associated with the bottom levels of a hierarchical model, it is not very meaningful, or may even be highly misleading, in the analysis of spatial associations for large areas associated with the top levels of a hierarchical model.

Local indicators offer an alternative to global modeling by looking for "local patterns" of spatial dependence within the study region (see Boots 2002 for a survey). Unlike global indicators, which return one value for the entire sample of data, local indicators return one value for each sampled location of a variable; this value expresses the degree to which that location is part of a cluster. Widely speaking, a local indicator of spatial autocorrelation allows us to discover deviations from global patterns of spatial association, as well as hot spots like local clusters or local outliers. Several local indicators of spatial autocorrelation have been formulated in the literature.

Anselin's local Moran's I (Anselin 1995) is a local indicator of spatial autocorrelation that has gained wide acceptance in the literature. It is formulated as follows:

$$I(i) = \frac{(z(i) - \bar{z})}{m_2} \sum_{j=1, j \neq i}^{n} \lambda(ij)\,(z(j) - \bar{z}), \qquad (1)$$

where z(i) is the value of a variable Z measured at the location i, $\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z(i)$ is the mean of the data measured for Z, λ(ij) is a spatial (Gaussian or bi-square) weight between the locations i and j over a neighborhood structure of the training data, and $m_2 = \frac{1}{n}\sum_{j} (z(j) - \bar{z})^2$ is the second moment. Anselin's local Moran's I is related to the global Moran's I as
the average of I(i) is equal to the global I, up to a factor of proportionality. A positive value for I(i) indicates that z(i) is surrounded by similar values; therefore, this value is part of a cluster. A negative value for I(i) indicates that z(i) is surrounded by dissimilar values; thus, this value is an outlier.

The standardized Getis and Ord local GI* (Getis and Ord 1992) is a local indicator of spatial autocorrelation that is formulated as follows:

$$GI^*(i) = \frac{1}{S\sqrt{\dfrac{n\sum_{j=1, j\neq i}^{n} \lambda(ij)^2 - \Lambda(i)^2}{n-1}}} \left(\sum_{j=1, j\neq i}^{n} \lambda(ij)\, z(j) - \bar{z}\,\Lambda(i)\right), \qquad (2)$$

where $\Lambda(i) = \sum_{j=1, j\neq i}^{n} \lambda(ij)$ and $S^2 = \frac{1}{n}\sum_{j=1}^{n} (z(j) - \bar{z})^2$. A positive value for GI*(i) indicates clusters of high values around i, while a negative value for GI*(i) indicates clusters of low values around i. The interpretation of GI* is different from that of I: the former distinguishes clusters of high and low values, but does not capture the presence of negative spatial autocorrelation (dispersion); the latter is able to detect both positive and negative spatial autocorrelation, but does not distinguish clusters of high or low values. Getis and Ord (1992) recommend computing GI* to look for spatial clusters and I to detect spatial outliers. By following this advice, Holden and Evans (2010) apply fuzzy C-means to GI* values, in order to cluster satellite-inferred burn severity classes. Scrucca (2005) and Appice et al. (2013c) use k-means to cluster GI* values, to compute spatially aware clusters of the data measured for a geophysical variable.

Measures of both global and local spatial autocorrelation are principally defined for univariate data. However, the integration of multivariate and autocorrelation information has recently been advocated by Dray and Jombart (2011). The simplest approach considers a two-step procedure, where data are first summarized with a multivariate analysis such as PCA. In a second step, any univariate (either global or local) spatial measure can be applied to the PCA scores for each axis separately. The other approach finds coefficients to obtain a linear combination of variables which maximizes the product between the variance and the global Moran measure of the scores. Alternatively, Stojanova et al. (2013) propose computing the mean of global measures (Moran I and global Getis C), computed for the distinct variables of a vector, as a global indicator of spatial autocorrelation of the vector, thereby blurring cross-correlations between separate variables. Dray et al. (2006) explore the theory of the principal coordinates of neighbor matrices and develop the framework of Moran's eigenvector maps. They demonstrate that their framework can be linked to spatial autocorrelation structure functions also in multivariate domains. Blanchet et al. (2008) expand this framework by taking into account asymmetric directional spatial processes.

3.2 Interpolation theory

Studies on spatial interpolation were initially encouraged by the analysis of ore mining, water extraction or pumping, and rock inspection (Cressie 1993). In these fields, interpolation algorithms are required as the main resource to recover unknown information and account for problems like missing data, energy saving and sensor faults, as well as to support data summarization and the investigation of spatial correlation between observed data (Lam 1983). The interpolation algorithms estimate a geophysical quantity in any geographic location where the variable measure is not available. The interpolated value is derived by making use of the knowledge of the nearby observed data and, sometimes, of some hypotheses or
3.2 Interpolation theory

Studies on spatial interpolation were initially encouraged by the analysis of ore mining, water extraction or pumping, and rock inspection (Cressie 1993). In these fields, interpolation algorithms are required as the main resource to recover unknown information and to account for problems like missing data, energy saving and sensor faults, as well as to support data summarization and the investigation of the spatial correlation between observed data (Lam 1983). Interpolation algorithms estimate a geophysical quantity in any geographic location where a measurement of the variable is not available. The interpolated value is derived by making use of the knowledge of the nearby observed data and, sometimes, of some hypotheses or supplementary information on the data variable. The rationale behind this spatially aware estimate of a variable is the property of positive spatial autocorrelation: any spatial interpolator accounts for this property by including within its formulation the consideration of a stronger correlation between data which are closer than between data which are farther apart.

Regression (Burrough and McDonnell 1998), Inverse Distance Weighting (IDW) (Shepard 1968a), Radial Basis Functions (RBF) (Lin and Chen 2004) and Kriging (Krige 1951) are the most common interpolation algorithms. These algorithms are designed to deal with the irregular sampling of the investigated area (Isaaks and Srivastava 1989; Stein 1999) or with the difficulty of describing the area by the local atlas of larger and irregular manifolds. Regression algorithms, which are statistical interpolators, determine a functional relationship between the variable to be predicted and the geographic coordinates of the points where the variable is measured. IDW and RBF, which are both deterministic interpolators, use mathematical functions to calculate an unknown variable value in a geographic location, based either on the degree of similarity (IDW) or on the degree of smoothing (RBF) in relation to neighboring data points. Both algorithms share with Kriging the idea that the collection of variable observations can be considered as a realization of correlated spatial random data with specific statistical properties. In Kriging, specifically, this correlation is used to derive a second-order model of the variable (the variogram). The variogram represents an approximate measure of the spatial dissimilarity of the observed data. The IDW interpolation is based on a linear combination of nearby observations, with weights inversely proportional to a power of the distances. It is a heuristic but efficient approach, justified by the typical power law of the spatial correlation. In this sense, IDW accomplishes the same strategy adopted by the more rigorous formulation of Kriging (Li and Revesz 2002; Karydas et al. 2009; Li et al. 2011).

Several studies have arisen from these base interpolation algorithms. Ohashi and Torgo (2012) investigate a mapping of the spatial interpolation problem into a multiple regression task. They define a series of spatial indicators to better describe the spatial dynamics of the variable of interest. Umer et al. (2010) propose to recover the missing data of a dense network by a Kriging interpolator. Considering that the computational complexity of a variogram is cubic in the size of the observed data (Cressie 1993), the variogram calculus is, in this study, sped up by processing only the areas with information holes rather than the global data. Goovaerts (1997) extends Kriging in order to predict multiple variables (cokriging) measured at the same location. Cokriging uses direct and cross covariance functions that are computed on the sample of observed data. Teegavarapu et al. (2012) use IDW and 1-Nearest Neighbor in order to interpolate a grid of rainfall data and re-sample the data at multiple resolutions. Lu and Wong (2008) formulate IDW in an adaptive way, by accounting for the varying distance-decay relationship in the area under examination. The weighting parameters are varied according to the spatial pattern of the sampled points in the neighborhood. The algorithm proves more efficient than ordinary IDW and, in several cases, also better than Kriging.
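As a concrete illustration of the basic (non-adaptive) scheme discussed above, the following sketch interpolates a value at a query location as a distance-weighted average of nearby observations. It is a generic IDW sketch, not code from any of the cited studies; the power parameter p = 2 and the toy data are assumptions.

```python
import numpy as np

def idw_interpolate(query, coords, values, p=2.0, eps=1e-12):
    """Inverse distance weighting: weights are inversely proportional to the p-th power of the distance."""
    d = np.linalg.norm(coords - query, axis=1)
    if np.any(d < eps):                  # query coincides with an observation
        return values[np.argmin(d)]
    w = 1.0 / d**p
    return np.sum(w * values) / np.sum(w)

# toy data: observed values of a geophysical variable at scattered locations (assumed)
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([10.0, 12.0, 11.0, 15.0])
print(idw_interpolate(np.array([0.3, 0.4]), coords, values))
```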
Recently, Appice et al. (2013b) have started to link IDW to the spatio-temporal knowledge enclosed in specific summarization patterns, called trend clusters. Nevertheless, this investigation is restricted to univariate data.

There are several applications (e.g. Li and Revesz 2002; Karydas et al. 2009; Li et al. 2011) where IDW is used. This contributes to highlighting IDW as a deterministic, quick and simple interpolation algorithm that yields accurate predictions. On the other hand, Kriging is based on the statistical properties of a variable and, hence, is expected to be more accurate regarding the general characteristics of the recorded data and the efficacy of the model. In any case, the accuracy of Kriging depends on a reliable estimation of the variogram (Isaaks and Srivastava 1989; Şen and Şalhn 2001), and the variogram computation cost is proportional to the cube of the number of observed data (Cressie 1990). This cost can be prohibitive in time-evolving applications, where the statistical properties of a variable may change over time. In the data mining framework, the change of the underlying properties over time is usually called concept drift (Gama 2010). It is noteworthy that concept drift, which is expected in dynamic data, can be a serious complication for Kriging: it may impose the repetition of the costly computation of the variogram each time the statistical properties of the variable change significantly. These considerations motivate the broad use of an interpolator like IDW, which is accurate enough and whose learning phase can be run on-line when data are collected in a stream.

3.3 Cluster analysis

Cluster analysis is frequently used in geophysical data interpretation, in order to obtain meaningful and useful results (Song et al. 2010). In addition, it can be used as a summarization paradigm for data streams, since it offers the advantage of discovering summaries (clusters) that adjust well to the evolution of the data. The seminal work is that of Aggarwal et al. (2007), where a k-means algorithm is tailored to discover micro-clusters from multidimensional transactions that arrive in a stream. Micro-clusters are adjusted each time a transaction arrives, in order to preserve the temporal locality of the data along a time horizon. Another clustering algorithm to summarize data streams is presented in Nassar and Sander (2007). The main characteristic of this algorithm is that it allows us to summarize multi-source data streams. The multi-source stream is composed of sets of numeric values that are transmitted by a variable number of sources at consecutive time points. Timestamped values are modeled as 2D (time-domain) points of a Euclidean space. Hence, the source location is neither represented as a dimension of analysis nor processed as information-bearing. The stream is broken into windows.
Dense regions of 2D points are detected in these windows and represented by means of cluster feature vectors. Although a spatial clustering algorithm is employed, the spatial arrangement of the data sources is still neglected. Appice et al. (2013a) describe a clustering algorithm that accounts for the property of spatial autocorrelation when computing clusters that are compact and accurate summaries of univariate geophysical data streams. In all these studies clustering is addressed as an unsupervised task.

Predictive clustering is a supervised extension of cluster analysis which combines elements of prediction and clustering. The task, originally formulated in Blockeel et al. (1998), assumes: (1) a descriptive space of explanatory variables, (2) a predictive space of target variables, and (3) a set of training records defined on both the descriptive and the predictive space. Training records that are similar to each other are clustered, and a predictive model is associated to each cluster. This allows us to predict the unknown…
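A minimal sketch of the predictive clustering idea just described, under a simplifying assumption: clusters are built on the descriptive space only (with scikit-learn's KMeans) and the predictive model of each cluster is simply the mean of its target values. Blockeel et al. actually use predictive clustering trees, so this is only meant to illustrate the task setup.

```python
import numpy as np
from sklearn.cluster import KMeans

# descriptive space X and predictive (target) space y for the training records (assumed toy data)
X_train = np.random.rand(200, 2)
y_train = 3.0 * X_train[:, 0] + np.random.rand(200)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)
cluster_models = {c: y_train[km.labels_ == c].mean() for c in range(4)}   # one predictive model per cluster

def predict(x):
    """Assign a new record to its cluster, then apply that cluster's predictive model."""
    c = km.predict(x.reshape(1, -1))[0]
    return cluster_models[c]

print(predict(np.array([0.2, 0.7])))
```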
Principles, Algorithms and Applications of Data Mining — Chapter 8

Chapter 8: Mining Complex Types of Data

1) Take Arc/Info, a system based on the vector data model, as an example. To store spatial data in a computer, the spatial data are first logically abstracted into different themes or layers, such as land use, terrain, roads, residential areas, soil units and forest distribution; a thematic layer contains the locations and attribute data of the geographic features within the region. Next, the geographic features or entities of a thematic layer are decomposed into point, line and polygon objects, and the data of each object consist of spatial data, attribute data and topological data.
2. Spatial data describe the spatial and attribute characteristics of geographic entities. Spatial characteristics refer to the spatial position of a geographic entity and its relationships with other entities; attribute characteristics represent the entity's name, type, quantity and so on. Spatial objects are currently represented with the thematic-map approach: spatial objects are abstracted into points, lines and polygons, which are organized, stored, modified and displayed by layer according to the attributes of these geometric objects. Data representation uses one of two models: the vector data model and the raster data model.
Figure 8-5: Composite layers
Figure 8-4: The raster data model
3. Although spatial data querying and spatial data mining are different, querying is, as in other data mining techniques, the foundation and prerequisite of mining, so understanding spatial queries and their operators helps in mastering spatial mining techniques.
Owing to the special nature of spatial data, spatial operations are more complex than operations on non-spatial data. Traditional selection queries over non-spatial data use the standard comparison operators '>', '<', '≤', '≥' and '≠'. A spatial selection, in contrast, is a selection query over spatial data and requires spatial operators, including proximity, east, west, south, north, containment, overlap and intersection.
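To illustrate the difference between standard comparison operators and spatial operators, the sketch below defines containment and overlap predicates for axis-aligned rectangles and uses them as the filter of a simple spatial selection; the Rect class and the sample entities are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Rect:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, other: "Rect") -> bool:
        # spatial operator: self fully contains other
        return (self.xmin <= other.xmin and self.ymin <= other.ymin and
                self.xmax >= other.xmax and self.ymax >= other.ymax)

    def overlaps(self, other: "Rect") -> bool:
        # spatial operator: the two extents intersect
        return not (self.xmax < other.xmin or other.xmax < self.xmin or
                    self.ymax < other.ymin or other.ymax < self.ymin)

# spatial selection: keep the entities whose extent overlaps a query window
entities = {"road_1": Rect(0, 0, 5, 1), "lake_3": Rect(8, 8, 12, 12)}
window = Rect(4, 0, 9, 9)
print([name for name, r in entities.items() if window.overlaps(r)])
```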
When spatial operations are performed between different entities, conversions between attributes are often required. If the non-spatial attributes are stored in a relational database, one feasible storage strategy is to keep, among the attributes of each non-spatial tuple, a pointer to the corresponding spatial data structure. Each tuple of such a relation then represents one spatial entity.
spatio-temporall...
Spatio-Temporal LSTM with Trust Gates for3D Human Action Recognition817 respectively,and utilized a SVM classifier to classify the actions.A skeleton-based dictionary learning utilizing group sparsity and geometry constraint was also proposed by[8].An angular skeletal representation over the tree-structured set of joints was introduced in[9],which calculated the similarity of these fea-tures over temporal dimension to build the global representation of the action samples and fed them to SVM forfinal classification.Recurrent neural networks(RNNs)which are a variant of neural nets for handling sequential data with variable length,have been successfully applied to language modeling[10–12],image captioning[13,14],video analysis[15–24], human re-identification[25,26],and RGB-based action recognition[27–29].They also have achieved promising performance in3D action recognition[30–32].Existing RNN-based3D action recognition methods mainly model the long-term contextual information in the temporal domain to represent motion-based dynamics.However,there is also strong dependency between joints in the spatial domain.And the spatial configuration of joints in video frames can be highly discriminative for3D action recognition task.In this paper,we propose a spatio-temporal long short-term memory(ST-LSTM)network which extends the traditional LSTM-based learning to two con-current domains(temporal and spatial domains).Each joint receives contextual information from neighboring joints and also from previous frames to encode the spatio-temporal context.Human body joints are not naturally arranged in a chain,therefore feeding a simple chain of joints to a sequence learner can-not perform well.Instead,a tree-like graph can better represent the adjacency properties between the joints in the skeletal data.Hence,we also propose a tree structure based skeleton traversal method to explore the kinematic relationship between the joints for better spatial dependency modeling.In addition,since the acquisition of depth sensors is not always accurate,we further improve the design of the ST-LSTM by adding a new gating function, so called“trust gate”,to analyze the reliability of the input data at each spatio-temporal step and give better insight to the network about when to update, forget,or remember the contents of the internal memory cell as the representa-tion of long-term context information.The contributions of this paper are:(1)spatio-temporal design of LSTM networks for3D action recognition,(2)a skeleton-based tree traversal technique to feed the structure of the skeleton data into a sequential LSTM,(3)improving the design of the ST-LSTM by adding the trust gate,and(4)achieving state-of-the-art performance on all the evaluated datasets.2Related WorkHuman action recognition using3D skeleton information is explored in different aspects during recent years[33–50].In this section,we limit our review to more recent RNN-based and LSTM-based approaches.HBRNN[30]applied bidirectional RNNs in a novel hierarchical fashion.They divided the entire skeleton tofive major groups of joints and each group was fedSpatio-Temporal LSTM with Trust Gates for3D Human Action RecognitionJun Liu1,Amir Shahroudy1,Dong Xu2,and Gang Wang1(B)1School of Electrical and Electronic Engineering,Nanyang Technological University,Singapore,Singapore{jliu029,amir3,wanggang}@.sg2School of Electrical and Information Engineering,University of Sydney,Sydney,Australia******************.auAbstract.3D action recognition–analysis of human actions based on3D 
skeleton data–becomes popular recently due to its succinctness,robustness,and view-invariant representation.Recent attempts on thisproblem suggested to develop RNN-based learning methods to model thecontextual dependency in the temporal domain.In this paper,we extendthis idea to spatio-temporal domains to analyze the hidden sources ofaction-related information within the input data over both domains con-currently.Inspired by the graphical structure of the human skeleton,wefurther propose a more powerful tree-structure based traversal method.To handle the noise and occlusion in3D skeleton data,we introduce newgating mechanism within LSTM to learn the reliability of the sequentialinput data and accordingly adjust its effect on updating the long-termcontext information stored in the memory cell.Our method achievesstate-of-the-art performance on4challenging benchmark datasets for3D human action analysis.Keywords:3D action recognition·Recurrent neural networks·Longshort-term memory·Trust gate·Spatio-temporal analysis1IntroductionIn recent years,action recognition based on the locations of major joints of the body in3D space has attracted a lot of attention.Different feature extraction and classifier learning approaches are studied for3D action recognition[1–3].For example,Yang and Tian[4]represented the static postures and the dynamics of the motion patterns via eigenjoints and utilized a Na¨ıve-Bayes-Nearest-Neighbor classifier learning.A HMM was applied by[5]for modeling the temporal dynam-ics of the actions over a histogram-based representation of3D joint locations. Evangelidis et al.[6]learned a GMM over the Fisher kernel representation of a succinct skeletal feature,called skeletal quads.Vemulapalli et al.[7]represented the skeleton configurations and actions as points and curves in a Lie group c Springer International Publishing AG2016B.Leibe et al.(Eds.):ECCV2016,Part III,LNCS9907,pp.816–833,2016.DOI:10.1007/978-3-319-46487-950。
Fundamentals of Artificial Intelligence (Exercise Paper 9)
Part 1: single-choice questions, 53 questions in total; each question has exactly one correct answer, and selecting more or fewer options earns no credit.
1. [Single choice] The research school, arising from the psychological approach, which holds that artificial intelligence originated from mathematical logic is ( ). A) Connectionism B) Behaviorism C) Symbolism. Answer: C.
2. [Single choice] For a rule of the form: , the part to the right of "←" is called (___). A) rule length B) rule head C) Boolean expression D) rule body. Answer: D.
3. [Single choice] Which of the following statements about AI chips is incorrect? ( )
A) A chip designed specifically to handle the large volume of computation in AI applications. B) Better able to accommodate the large amount of matrix computation in AI. C) Currently in a mature stage of rapid development. D) Compared with traditional CPUs, AI chips offer good parallel computing performance. Answer: C.
4. [Single choice] Among the following image segmentation methods, which one is not a threshold method based on the image's gray-level distribution? ( )
A) maximum between-class distance method B) maximum between-class/within-class variance ratio method C) p-parameter method D) region growing. Answer: B.
5. [Single choice] Which of the following statements about inexact reasoning is incorrect? ( )
A) Inexact reasoning starts from uncertain facts. B) Inexact reasoning can ultimately derive definite conclusions. C) Inexact reasoning uses uncertain knowledge. D) Inexact reasoning ultimately derives uncertain conclusions. Answer: B.
6. [Single choice] Suppose you have trained a linear SVM and conclude that the model is underfitting. In the next round of training you should ( ). A) add data points D) reduce features. Answer: C. Explanation: underfitting means that the model does not fit the data well, that the data lie far from the fitted curve, or that the model has not captured the characteristics of the data and therefore cannot fit it well.
This can be addressed by adding features.
7. [Single choice] Which of the following concepts is used to compute the derivative of a composite function? A) the chain rule in calculus B) the hard tanh function C) the softplus function D) the radial basis function. Answer: A.
8. [Single choice] Interrelated data asset standards should ensure ( ).
When data asset standards conflict or their linkage is broken, later stages should follow and adapt to the requirements of earlier stages and revise the corresponding data asset standards.
A) connection B) coordination C) linkage and matching D) connection and coordination. Answer: C.
9. [Single choice] The solid-state imaging element used in a solid-state semiconductor camera is ( ).
Data Mining, Third Edition — Answers to the End-of-Chapter Exercises of Chapter 2
1.1 What is data mining? (a) Is it a form of advertising hype? (b) Is it a simple transformation or application of techniques developed from databases, statistics, machine learning and pattern recognition? (c) We put forward the view that data mining is the result of the evolution of database technology; do you think data mining is also the result of the evolution of machine learning research? Can you argue this view based on the historical development of the field? Do the same for statistics and pattern recognition. (d) When data mining is viewed as a knowledge discovery process, describe the steps involved in data mining.
Answer: A relatively simple definition of data mining is: data mining is the process of extracting implicit, previously unknown, but potentially useful information and knowledge from large volumes of real-world data that are incomplete, noisy, fuzzy and random.
Data mining is not advertising hype; rather, the availability of massive amounts of data and the urgent need to turn these data into useful information make data mining all the more necessary.
Therefore, data mining can be seen as the result of the natural evolution of information technology.
Data mining is not a simple transformation of techniques developed from databases, statistics and machine learning; rather, it is an integration of techniques from many disciplines, such as database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis.
Database technology began with the development of data collection and database creation mechanisms, which led to effective mechanisms for data management, including data storage and retrieval, query processing and transaction processing.
The large number of database systems offering query and transaction processing eventually and naturally led to the need for data analysis and understanding.
Thus, driven by this necessity, data mining began its development.
When data mining is viewed as a knowledge discovery process, the steps involved are as follows: data cleaning, a process that removes noise and inconsistent data; data integration, in which multiple data sources may be combined; data selection, in which the data relevant to the analysis task are retrieved from the database; data transformation, in which data are transformed or consolidated into forms appropriate for mining, for example through summary or aggregation operations; data mining, the essential step, in which intelligent methods are applied to extract data patterns; pattern evaluation, in which the truly interesting patterns representing knowledge are identified according to some interestingness measure; and knowledge presentation, in which visualization and knowledge representation techniques are used to present the mined knowledge to the user.
1.3 Define the following data mining functionalities: characterization, discrimination, association and correlation analysis, classification, regression, clustering, and outlier analysis.
Gradient Sparsification, Quantization, Low-rank Decomposition, and Knowledge Distillation
Gradient sparsification (gradient sparsity). Gradient sparsification is an optimization method that aims to improve computational efficiency and the model's generalization ability by reducing the number of parameters that must be updated in a neural network.
In deep learning, the parameters of a neural network are usually updated via the backpropagation algorithm, and gradient sparsification applies a sparsification step to the gradients during this process.
Gradient sparsification can be implemented by setting small gradient values to zero, which reduces the number of parameters that need to be updated in the network.
In this way, the amount of computation and memory consumption during training can be reduced, improving training speed and the model's generalization ability.
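A minimal sketch of the idea, assuming a simple magnitude criterion: all but the largest-magnitude gradient entries are zeroed before the update (the 10% keep ratio is an assumption, and practical schemes usually also accumulate the dropped residuals locally).

```python
import numpy as np

def sparsify_gradient(grad, keep_ratio=0.1):
    """Keep only the largest-magnitude entries of the gradient; zero the rest."""
    flat = np.abs(grad).ravel()
    k = max(1, int(keep_ratio * flat.size))
    threshold = np.partition(flat, -k)[-k]            # k-th largest magnitude
    return np.where(np.abs(grad) >= threshold, grad, 0.0)

grad = np.random.randn(4, 5)
sparse_grad = sparsify_gradient(grad, keep_ratio=0.1)
print(np.count_nonzero(sparse_grad), "of", grad.size, "entries kept")
```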
Quantization. Quantization is a technique that converts the floating-point parameters of a model into fixed-point parameters.
In deep learning, neural networks usually represent their parameters with floating-point numbers, but floating-point storage and computation are costly.
Quantization maps the floating-point parameters to fixed-point parameters, thereby reducing the storage and computation cost of the parameters.
Quantization can be implemented by dividing the value range of the parameters into a fixed number of intervals and then using one fixed-point number to represent the representative (average) value of each interval.
In this way, the number of bits per parameter can be greatly reduced, improving computational efficiency and model speed.
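A minimal sketch of uniform (affine) quantization along these lines: the value range of the weights is divided into 2^8 levels and each weight is stored as a small integer index plus a shared scale and offset. The bit width and the toy weights are assumptions.

```python
import numpy as np

def quantize(w, bits=8):
    """Map float weights to integer levels over [w.min(), w.max()]; bits <= 8 here so uint8 suffices."""
    lo, hi = float(w.min()), float(w.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((w - lo) / scale).astype(np.uint8)   # fixed-point representation
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

w = np.random.randn(3, 4).astype(np.float32)
q, scale, lo = quantize(w, bits=8)
print("max reconstruction error:", np.abs(w - dequantize(q, scale, lo)).max())
```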
Low-rank decomposition. Low-rank decomposition is a method that factorizes a high-dimensional parameter matrix into low-rank matrices.
In deep learning, the parameter matrices of a neural network are usually high-dimensional, and storing and computing with high-dimensional parameter matrices is costly.
Low-rank decomposition factorizes a high-dimensional parameter matrix into a product of several low-rank matrices, thereby reducing the storage and computation cost of the parameters.
Low-rank decomposition can be implemented with methods such as singular value decomposition (SVD).
In this way, the dimensionality of the parameters can be reduced, improving computational efficiency and model speed.
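A minimal SVD-based sketch: a weight matrix W is approximated by the product of two thin factors of rank r, cutting the parameter count from m·n to r·(m+n). The rank and the random matrix are assumptions.

```python
import numpy as np

def low_rank_factorize(W, r):
    """Approximate W (m x n) as A @ B with A: m x r and B: r x n, via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]          # absorb singular values into the left factor
    B = Vt[:r, :]
    return A, B

W = np.random.randn(256, 512)
A, B = low_rank_factorize(W, r=32)
print("relative error:", np.linalg.norm(W - A @ B) / np.linalg.norm(W))
print("params:", W.size, "->", A.size + B.size)
```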
Knowledge distillation. Knowledge distillation is a method for transferring the knowledge in a large, complex model to a small, simple model.
In deep learning, large, complex models often achieve better performance, but they have many parameters and high computational complexity.
Knowledge distillation transfers the knowledge of the large model (such as its output probability distributions and class information) to the small model, thereby improving the small model's performance.
Knowledge distillation can be implemented by introducing an additional loss term during training that measures the difference between the outputs of the target (student) model and the outputs of the large (teacher) model.
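A minimal sketch of such a combined loss, mixing the usual cross-entropy on hard labels with a temperature-softened divergence between teacher and student outputs (the temperature, the mixing weight and the toy logits are assumptions; this follows the classical formulation usually attributed to Hinton et al.).

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """alpha * cross-entropy(hard labels) + (1 - alpha) * T^2 * KL(teacher || student)."""
    p_student = softmax(student_logits)
    hard = -np.log(p_student[np.arange(len(labels)), labels] + 1e-12).mean()
    pt = softmax(teacher_logits, T)
    ps = softmax(student_logits, T)
    soft = (pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12))).sum(axis=-1).mean()
    return alpha * hard + (1.0 - alpha) * (T ** 2) * soft

student = np.random.randn(8, 10)   # toy logits: 8 samples, 10 classes (assumed)
teacher = np.random.randn(8, 10)
labels = np.random.randint(0, 10, size=8)
print(distillation_loss(student, teacher, labels))
```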
X. Li, O.R. Zaiane, and Z. Li (Eds.): ADMA 2006, LNAI 4093, pp. 263-270, 2006. © Springer-Verlag Berlin Heidelberg 2006

Mining Spatial-temporal Clusters from Geo-databases

Min Wang, Aiping Wang, and Anbo Li
College of Geography Science, Nanjing Normal University, Nanjing, 210097, China
sysj0918@

Abstract. In order to mine spatial-temporal clusters from geo-databases, two closely related clustering methods are proposed. Both are based on a neighborhood searching strategy and rely on the sorted k-dist graph to automatically specify their respective algorithm arguments. The most distinguishing advantage of these clustering methods is that they avoid calculating a spatial-temporal distance between patterns, which is a difficult task. The methods are validated by the successful extraction of seismic sequences, a typical example of spatial-temporal clusters, from seismic databases.

1 Introduction

Clustering is a primary data mining method for structure or knowledge discovery in spatial databases [1][2]. In many circumstances, spatial data also contain temporal information. Taking the temporal factor into account in spatial clustering can help us discover the real underlying distribution rules in many spatial mining problems. Seismic sequences are a typical and good example of spatial-temporal clusters. In seismology, a seismic sequence is defined as a group of seismic events which are closely related to each other and occur densely both in space and time. In this paper, we discuss and design spatial clustering methods to discover this kind of cluster in geo-databases.

Similar work can be found in [3][4]. In [3], Golfarelli et al. calculate the similarities of patterns in each feature dimension, multiply these similarities to obtain the total pattern similarity, and then group the patterns below a certain similarity threshold into a cluster. In [4], Galic et al. normalize each feature dimension and then cluster the patterns with K-Means. In seismic research, Wardlaw et al. [5] propose a spatial-temporal distance (D_ST) between two seismic events and a spatial-temporal converting index (C) to find spatial-temporal seismic clusters. In their methods, D_ST and C are defined as:

D_{ST}^2 = d^2 + (C \Delta t)^2,   (1)

C = \sqrt{D_{ST}^2 - d^2} / \Delta t,   (2)

where d is the spatial distance and Δt is the time distance between two seismic events. For different seismic regions or belts, they give different values of C, but they cannot give the physical meanings of these parameters, which is regrettable.

All the spatial-temporal clustering methods mentioned above in essence modify the scale relationships between the feature dimensions, or in other words, the contribution ratios of each feature dimension to the distance between patterns used in clustering. However, the selection of such relationships is rather subjective and difficult to endow with scientific meaning. In this paper, two clustering methods with close relationships and respective advantages and disadvantages are proposed. One is the spatial-temporal grid method (ST-GRID, in abbreviation); the other is ST-DBSCAN. The main idea of ST-GRID is to partition the spatial and temporal dimensions into a multi-dimensional grid with different precisions, allocate patterns into the grid cells, and then extract and merge spatial-temporal dense regions as the final clusters. ST-DBSCAN is the extension of DBSCAN [6] to spatial-temporal clustering problems.
Both methods are based on a neighborhood searching strategy and rely on the sorted k-dist graph [6] to find the algorithm arguments they need. Their advantages are that they need only one scan of the whole data set, and are therefore very efficient, and that they avoid calculating a spatial-temporal distance between patterns, which is very difficult. The methods are validated by the successful extraction of seismic sequences from geo-databases.

2 Spatial-temporal Clustering Methods

2.1 Sorted k-Dist Graph

In ST-GRID, because of the different metrics of space and time, it is not suitable to partition the spatial and temporal dimensions of the grid with the same precision or the same cell size. The crux is how to specify the two precisions automatically, which is done with the sorted k-dist graph.

The principle of the sorted k-dist graph can be described as follows. In spatial clustering, we often regard isolated patterns that lie away from the clusters as noise. Since the distance between noise points is relatively larger than the distance between cluster points, we can often remove most noise with some distance threshold. If we take samples in the areas where the clusters and the noise are located, calculate the distance from each sample point to its k-th nearest neighbor (called the k-dist value of that point), and sort the points by their k-dist values in descending order, we obtain a graph called the sorted k-dist graph. The non-noise points (sample points which are not noise) will have relatively smaller k-dist values, and what we want to do is find the 4-dist threshold separating the non-noise points from the noise points in the sorted 4-dist graph. As depicted in Figure 1, the 4-dist values drop rapidly to the left of the crossing point of the reticle, while the drop to the right is much smoother. Such a 4-dist value is an appropriate threshold for separating non-noise from noise. In ST-GRID, we can set the grid size to less than half of this 4-dist value, which basically satisfies the need for noise removal. The reason is that if we regard the round region around a point that contains k+1 points within radius R as a dense region, then we can specify the border length of each grid cell as 2R, so that the area of each cell is close to that of the circle with radius R. If a cell contains more than k+1 points, the cell is 'dense' and should be allocated to one cluster.

Fig. 1. A point set with its sorted 4-dist graph

2.2 ST-GRID Method

In ST-GRID, we specify the precisions of the spatial and temporal dimensions with the sorted k-dist graph: we draw the sorted k-dist graph for the spatial and for the temporal dimension and then find their respective distance thresholds between noise and non-noise. The dense threshold is the same in both cases: k+1.

The algorithm of ST-GRID:
Input: spatial and temporal cell border lengths of the grid, which are specified with the sorted k-dist graph; the data set.
Output: clusters.
Construct a multi-dimensional grid covering the whole spatial-temporal feature space;
Allocate every data point to a cell, and count the points of each cell;
Extract dense regions with threshold k+1;
Merge neighboring dense cells and mark them as one cluster;
Output these clusters.

In ST-GRID, the merging stage can be completed with a depth-first searching strategy. For an m-dimensional grid, if we regard a pair of neighbors as two cells that differ in only one dimension, then every cell has 2m neighbors, except for the boundary cells. ST-GRID begins its depth-first search for dense neighbors from an arbitrary dense cell, continues until no new dense cell can be included in this search, and marks these cells as one cluster. The next search begins from an unvisited dense cell, and the process continues until all dense cells have been visited.
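The sketch below is a minimal illustration of the ST-GRID procedure described above: points are binned into a spatial-temporal grid, dense cells (more than k+1 points) are extracted, and neighboring dense cells are merged by depth-first search. The cell sizes, k and the toy events are assumed values, not the parameters used in the paper's experiments.

```python
import numpy as np
from collections import defaultdict

def st_grid(points, cell_xy, cell_t, k):
    """points: array of (x, y, t) rows. Returns a dict mapping cluster id -> list of dense cells."""
    # allocate every point to a spatial-temporal cell and count cell occupancy
    cells = defaultdict(int)
    for x, y, t in points:
        cells[(int(x // cell_xy), int(y // cell_xy), int(t // cell_t))] += 1
    dense = {c for c, n in cells.items() if n > k + 1}       # dense-cell threshold k+1

    clusters, visited = {}, set()
    for seed in dense:
        if seed in visited:
            continue
        stack, members = [seed], []
        visited.add(seed)
        while stack:                                          # depth-first merge of dense neighbors
            c = stack.pop()
            members.append(c)
            for dim in range(3):
                for step in (-1, 1):
                    nb = list(c)
                    nb[dim] += step
                    nb = tuple(nb)
                    if nb in dense and nb not in visited:
                        visited.add(nb)
                        stack.append(nb)
        clusters[len(clusters)] = members
    return clusters

# toy spatial-temporal events (x, y, t); cell sizes and k are assumed values
rng = np.random.default_rng(0)
events = np.vstack([rng.normal([0, 0, 0], 1.0, (80, 3)), rng.normal([20, 20, 50], 1.0, (80, 3))])
print(st_grid(events, cell_xy=2.0, cell_t=5.0, k=4))
```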
2.3 ST-DBSCAN

DBSCAN is a good clustering method for finding clusters with non-spherical shapes. To extend DBSCAN to spatial-temporal clusters, we split its original argument, the neighborhood radius ε, into two: the spatial neighborhood radius εs and the temporal neighborhood radius εt. On this basis, only if a point p lies inside both the εs-neighborhood and the εt-neighborhood of a point q can p be called 'spatial-temporal directly density-reachable' from q. Similarly, the other concepts of DBSCAN are extended accordingly.

As in ST-GRID, εs and εt are calculated with the sorted k-dist graph: we draw the sorted k-dist graph for the spatial and for the temporal dimension and find the two distance thresholds between noise and non-noise, which are taken as εs and εt; the other argument of DBSCAN, MinPts, is set equal to k. With these extensions, the searched neighborhood is extended to the spatial-temporal feature space: the core points are those with more than MinPts neighbors within their spatial-temporal neighborhood (a round area with radius εs in space and εt in time).
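A minimal sketch of the extended neighborhood test and the core-point condition of ST-DBSCAN; the brute-force neighbor search is an illustrative simplification (as the conclusions note, a spatial index such as the R*-tree would be used for efficiency), and εs, εt, MinPts and the toy events are assumed values.

```python
import numpy as np

def st_neighbors(points, i, eps_s, eps_t):
    """Indices j != i lying in both the spatial eps_s- and the temporal eps_t-neighborhood of point i."""
    d_s = np.linalg.norm(points[:, :2] - points[i, :2], axis=1)   # spatial distance on (x, y)
    d_t = np.abs(points[:, 2] - points[i, 2])                     # temporal distance on t
    mask = (d_s <= eps_s) & (d_t <= eps_t)
    mask[i] = False
    return np.flatnonzero(mask)

def is_core(points, i, eps_s, eps_t, min_pts):
    return len(st_neighbors(points, i, eps_s, eps_t)) > min_pts

# toy events (x, y, t); in the paper, eps_s, eps_t and MinPts come from the sorted k-dist graphs
events = np.array([[0, 0, 0], [0.5, 0.2, 1], [0.1, 0.4, 2], [0.3, 0.1, 3], [0.2, 0.5, 4], [50, 50, 200]], float)
print(is_core(events, 0, eps_s=1.0, eps_t=5.0, min_pts=3))   # dense event -> True
print(is_core(events, 5, eps_s=1.0, eps_t=5.0, min_pts=3))   # isolated event -> False
```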
3 Experiments

The experimental data are extracted from the database 'integrating seismic catalog in China and neighbor countries', compiled by China's State Key Laboratory of Resources and Environmental Systems mainly from various seismic catalogs published by the Chinese national seismic bureau [7][8]. This database stores 620,000 seismic entries.

We first extract 6927 seismic events with magnitude ≥ 2.5 in the North of China (37°-41°N, 113°-121°E) from the year 1900 onwards, which include three seismic sequences: the Xingtai, Bohai and Tangsan sequences. In Figure 2, the three ellipses from top to bottom are the Tangsan, Bohai and Xingtai sequences. Each sequence not only is densely distributed in space, but also occupies a dense time range of its own length. We try to extract these sequences with ST-GRID and ST-DBSCAN and compare their respective performances.

We first take samples in the testing area and calculate the sorted k-dist graph to obtain the inputs needed by ST-GRID and ST-DBSCAN. The rectangles in Figure 2 are the sample areas, which include a dense area and a noise area, with 643 seismic events in total. Calculating the sorted 4-dist graph of the sample areas, we get a spatial distance threshold of 6000 m and a temporal distance threshold of 610 d. The inputs of ST-GRID are: spatial precision = 6000 × 2 = 12000 m, temporal precision = 610 × 2 = 1220 d. The spatial-temporal grid covering the whole testing area is separated into 57 × 40 × 28 cells (longitude, latitude, time), and the dense threshold is 5. The inputs of ST-DBSCAN are: εs = 6000 m, εt = 610 d, MinPts = 4.

ST-GRID extracts 17 clusters while ST-DBSCAN extracts 33. Both include the three sequences we care about, with small distribution differences in both space and time. The thick-border polygons in Figure 2 are the spatial areas extracted by our methods, while the thin-border polygons are the real spatial areas of the three sequences. We can see that they are very close to their counterparts, although our areas are a little smaller than the real ones. This may be because we only extract part of the seismic events from the whole data set.

Both methods give temporal clustering results consistent with the real time distribution (see Table 2). However, we find that both methods overrun the actual time ranges (for example, the end time of the Tangsan sequence). For ST-GRID, the time range of each cell is 1220 d (610 × 2), about 4 years. If clusters and noise occur within 4 years and fall into the same cell, ST-GRID cannot distinguish them. ST-DBSCAN has a similar disadvantage. Besides, some dense areas may be divided over several cells, which causes some cells covering the brim of clusters to be discarded. We call this shortcoming the roughness of ST-GRID, while ST-DBSCAN is free of it. The small differences in the division of spatial-temporal dense areas between the two methods are caused by their neighborhood searching strategies (rectangular cells versus round searching areas) and by the 'roughness' of ST-GRID.

We select more sequences, namely the Haicheng, Songpan and Yanyuan sequences, to validate our methods further. We again use the sorted 4-dist graph to calculate the parameters of both methods, which are listed in Table 1. See Table 2 for the output time ranges. All these results validate our methods once again.

Table 1. Parameters of our methods in extracting the Haicheng, Songpan and Yanyuan sequences
Sequence | Data numbers | Spatial distance threshold | Time distance threshold | Cell numbers of ST-GRID | Inputs of ST-DBSCAN
Haicheng sequence | 1311 | 4655 m | 447 d | 36×28×17 | εs = 4655 m, εt = 447 d
Songpan sequence | 996 | 6490 m | 798 d | 15×14×34 | εs = 6490 m, εt = 798 d
Yanyuan sequence | 939 | 12606 m | 239 d | 17×10×35 | εs = 12606 m, εt = 239 d

Fig. 2. Spatial boundaries of the Tangsan, Bohai and Xingtai sequences

Table 2. Time ranges of the sequences
Sequence | Time range of actual clusters | Time range of ST-GRID | Time range of ST-DBSCAN | Time range of sequences
Tangsan | 1976-1985 | 1974-1986 | 1975-1978 | 1976-1980
Bohai | 1969-1972 | 1969-1973 | 1969-1971 | 1969-1972
Xingtai | 1966-1971 | 1965-1973 | 1965-1971 | 1966-1985
Haicheng | 1975-1983 | 1975-1983 | 1975-1983 | 1975-1983
Songpan | 1973-1976 | 1973-1978 | 1973-1978 | 1973-1976
Yanyuan | 1976-1981 | 1976-1978 | 1976-1981 | 1972-1981

Fig. 3. Spatial boundary of the Songpan sequence
Fig. 4. Spatial boundary of the Haicheng sequence
Fig. 5. Spatial boundary of the Yanyuan sequence

Table 3. Comparison between the two clustering methods
Algorithm | Time efficiency | Space efficiency | Precision | Inputs
ST-GRID | High | Needs to store the grid | Relatively high | 3
ST-DBSCAN | Relatively high | Only needs to store the points | High | 3

4 Conclusions

In this paper, two clustering methods to extract spatial-temporal clusters from geo-databases are discussed and validated by the successful extraction of seismic sequences from seismic databases. From the experiments, we can see that the core idea of the two methods, neighborhood searching, is the same. The differences are that ST-GRID searches between neighboring cells while ST-DBSCAN searches the neighborhoods of points: one groups cells while the other groups points. Table 3 sums up the advantages and disadvantages of both methods. ST-GRID needs only one scan of the whole data set, so it has linear time complexity and a high running speed. Because ST-DBSCAN involves neighborhood searches over points, its time complexity reaches O(n²) without any spatial index.
ST-DBSCAN is therefore not very good in time efficiency; an improvement is to introduce the R*-tree index structure [9] into the method. On the other hand, ST-GRID needs additional disk space to store the grid structure, so its space efficiency is lower than that of ST-DBSCAN. Because of the 'roughness' shortcoming, the clustering precision of ST-GRID is a little lower than that of ST-DBSCAN. Users can select either method according to their needs in spatial-temporal analysis. Because the dense thresholds of both methods are global and unique, in many circumstances they will blur many actual spatial distribution patterns. In further studies we will pay attention to finding local-adaptive rules to specify these important parameters automatically.

Acknowledgment

This work is supported by the Chinese National Natural Science Foundation (No. 40401039) and the startup fund for introduced excellent scholars of Nanjing Normal University (No. 2006105XGQ0035).

References

1. Koperski K., Adhikary J., and Han J., 1996: Spatial Data Mining: Progress and Challenges Survey Paper. Proc. ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Montreal, Canada
2. Ester M., Kriegel H.-P., Sander J. and Xu X., 1998: Clustering for Mining in Large Spatial Databases. Special Issue on Data Mining, KI-Journal, 12, 18-24
3. Golfarelli M., Rizzi S., 2000. Spatial-Temporal Clustering of Tasks for Swap-Based Negotiation Protocols in Multi-Agent Systems. Proceedings 6th International Conference on Intelligent Autonomous Systems, 172-179
4. Galic S., Loncaric S. and Tesla E.N., 2001. Cardiac Image Segmentation Using Spatial-temporal Clustering. Proceedings of SPIE Medical Imaging, San Diego
5. Wardlaw R.L., Frohlich C. and Davis S.D., 1990. Evaluation of Precursory Seismic Quiescence in Sixteen Subduction Zones Using Single-link Cluster Analysis. PAGEOPH: 134
6. Ester M., Kriegel H.-P., Sander J., Xu X., 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD'96)
7. The Seismic Analysis and Forecasting Center, China Seismological Bureau, 1980. The Seismic Catalog in East of China. Beijing: The Earthquake Publishing House
8. The Seismic Analysis and Forecasting Center, China Seismological Bureau, 1989. The Seismic Catalog in West of China. Beijing: The Earthquake Publishing House
9. Beckmann N., Kriegel H.P., Schneider R. and Seeger B., 1990. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. Proc. ACM SIGMOD Int. Conf. on Management of Data, 322-331