Efficient mining of partial periodic patterns in time series database
Parallel Sequential Pattern Mining

1. The sequential pattern mining problem

R. Agrawal et al. first proposed the concept of sequential pattern mining in 1995 [1]. The problem is described as follows. In a transaction database consisting of many transactions, each transaction records the set of items purchased by a customer at a particular time. A set of items is called an itemset. The itemsets contained in the different transactions of the same customer, arranged in chronological order, form a sequence, and each itemset in the sequence is called an element. Given a user-specified minimum support threshold (min_support threshold), sequential pattern mining is the task of discovering all frequent subsequences, i.e., subsequences whose frequency of occurrence is no less than the given minimum support.
2. Overview of sequential pattern mining research

After the sequential pattern mining problem was posed, researchers continued to study and improve algorithms for mining sequential patterns and other frequent patterns in time series databases. Existing sequential pattern mining algorithms fall into two main classes. The first class consists of algorithms based on the Apriori property [2], proposed by R. Agrawal and R. Srikant in 1994; its basic idea is that every sub-pattern of a frequent pattern must itself be frequent. Based on this property, a series of Apriori-like sequential pattern mining algorithms have been proposed: some adopt the horizontal data format, such as AprioriAll [1], GSP [3], and PSP [4]; others adopt the vertical data format, such as SPADE [5] and SPAM [6]. The second class comprises algorithms based on the pattern-growth strategy proposed by J. Han et al., such as FreeSpan [7] and PrefixSpan [8], [9]. In addition, C. Antunes et al. proposed the SPaRSe algorithm [10]. In many sequential pattern mining tasks, the user does not need to find all sequential patterns that may exist in the database, but instead adds constraints to find only the patterns of interest [11], [12]. Agrawal et al. generalized the sequential pattern mining problem by introducing time constraints, a sliding time window, and taxonomy constraints, and proposed the GSP algorithm [3].
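To make the definitions above concrete, here is a minimal sketch (not taken from the cited papers; the function and variable names are illustrative) of subsequence containment and support counting, the two primitives on which Apriori-style algorithms such as AprioriAll and GSP rely:

```python
def contains(sequence, pattern):
    """Return True if `pattern` (a time-ordered list of itemsets) occurs
    as a subsequence of `sequence`, preserving element order."""
    i = 0
    for element in sequence:
        if i < len(pattern) and pattern[i] <= element:  # itemset containment
            i += 1
    return i == len(pattern)

def support(db, pattern):
    """Fraction of sequences in the database `db` that contain `pattern`."""
    return sum(contains(s, pattern) for s in db) / len(db)

# Toy database: each sequence is a time-ordered list of itemsets.
db = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a"}, {"c"}, {"d"}],
    [{"b"}, {"d"}],
]

# The Apriori property: support can only drop as a pattern grows, so any
# super-pattern of an infrequent pattern is itself infrequent.
assert support(db, [{"a"}, {"d"}]) <= support(db, [{"a"}])
```

Level-wise algorithms such as GSP exploit exactly this monotonicity: candidates of length k+1 are generated only from frequent length-k sequences, then verified against the database.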
SCI journals in the power systems field, with brief comments

Electric Power Systems Research (Switzerland): publishes original papers on power generation, transmission and distribution, and the application of electric power. High publication charges.

IEEE Transactions on Power Systems (USA): publishes papers on the technical condition, planning, analysis, reliability, operation, and economics of power systems, including generation, transmission, and distribution. Average review cycle of about three months.

IEEE Transactions on Smart Grid (USA).

IEE Proceedings - Generation, Transmission and Distribution (England).

International Journal of Electrical Power and Energy Systems (England): mainly publishes papers, reviews, and conference reports on theoretical and applied problems of electric power and energy systems, covering generation and grid planning, network theory, dynamics of large and small systems, system control centres, and on-line control.

European Transactions on Electrical Power (renamed International Transactions on Electrical Energy Systems in 2013): responses to submissions are rather slow, and the review period is unknown.

Electric Power Components and Systems (USA): publishes theoretical and applied research on power systems, covering solid-state control of electric machines, novel machines, electromagnetic fields and energy converters, power system planning and protection, and reliability and safety.

Electric Machines and Power Systems (USA).

IEE Proceedings - Electric Power Applications (England).

IEEE Computer Applications in Power (USA): publishes research on computer applications in power system design, operation, and control.
A review on time series data mining

Tak-chung Fu
Department of Computing, Hong Kong Polytechnic University, Hunghom, Kowloon, Hong Kong

Article history: received 19 February 2008; received in revised form 14 March 2010; accepted 4 September 2010.
Keywords: time series data mining; representation; similarity measure; segmentation; visualization.

Abstract: Time series is an important class of temporal data objects and it can be easily obtained from scientific and financial applications. A time series is a collection of observations made chronologically. The nature of time series data includes: large data size, high dimensionality and the need for continuous updating. Moreover, time series data, which is characterized by its numerical and continuous nature, is always considered as a whole instead of as individual numerical fields. The increasing use of time series data has initiated a great deal of research and development attempts in the field of data mining. The abundant research on time series data mining in the last decade could hamper the entry of interested researchers, due to its complexity. In this paper, a comprehensive review of the existing time series data mining research is given. The works are generally categorized into representation and indexing, similarity measure, segmentation, visualization and mining. Moreover, state-of-the-art research issues are also highlighted. The primary objective of this paper is to serve as a glossary for interested researchers to gain an overall picture of current time series data mining development and to identify potential research directions for further investigation.

1. Introduction

Recently, the increasing use of temporal data, in particular time series data, has initiated various research and development attempts in the field of data mining. Time series is an important class of temporal data objects, and it can be easily obtained from scientific and financial applications (e.g. electrocardiogram (ECG), daily temperature, weekly sales
totals, and prices of mutual funds and stocks). A time series is a collection of observations made chronologically. The nature of time series data includes: large data size, high dimensionality and continuous updating. Moreover, time series data, which is characterized by its numerical and continuous nature, is always considered as a whole instead of as individual numerical fields. Therefore, unlike traditional databases, where similarity search is exact-match based, similarity search in time series data is typically carried out in an approximate manner.

There are various kinds of time series related research, for example, finding similar time series (Agrawal et al., 1993a; Berndt and Clifford, 1996; Chan and Fu, 1999), subsequence searching in time series (Faloutsos et al., 1994), dimensionality reduction (Keogh, 1997b; Keogh et al., 2000) and segmentation (Abonyi et al., 2005). These problems have been studied in considerable detail by both the database and pattern recognition communities for different domains of time series data (Keogh and Kasetty, 2002).

In the context of time series data mining, the fundamental problem is how to represent the time series data. One of the common approaches is transforming the time series to another domain for dimensionality reduction, followed by an indexing mechanism. Moreover, similarity measure between time series or time series subsequences and segmentation are two core tasks for various time series mining tasks. Based on the time series representation, different mining tasks can be found in the literature and they can be roughly classified into four fields: pattern discovery and clustering, classification, rule discovery and summarization. Some of the research concentrates on one of these fields, while other work may focus on more than one of the above processes. In this paper, a comprehensive review of the existing time series data mining research is given. Three state-of-the-art time series data mining issues, streaming, multi-attribute time series data and privacy, are also briefly introduced. The
remaining part of this paper is organized as follows: Section 2 contains a discussion of time series representation and indexing. The concept of similarity measure, which includes both whole time series and subsequence matching, based on the raw time series data or the transformed domain, will be reviewed in Section 3. The research work on time series segmentation and visualization will be discussed in Sections 4 and 5, respectively. In Section 6, various time series data mining tasks and recent time series data mining directions will be reviewed, whereas the conclusion will be made in Section 7.

(Engineering Applications of Artificial Intelligence 24 (2011) 164-181)

2. Time series representation and indexing

One of the major reasons for time series representation is to reduce the dimension (i.e. the number of data points) of the original data. The simplest method perhaps is sampling (Astrom, 1969). In this method, a rate of m/n is used, where m is the length of the time series P and n is the dimension after dimensionality reduction (Fig. 1). However, the sampling method has the drawback of distorting the shape of the sampled/compressed time series if the sampling rate is too low. An enhanced method is to use the average (mean) value of each segment to represent the corresponding set of data points.
Again, with time series P = (p_1, ..., p_m) and n the dimension after dimensionality reduction, the "compressed" time series P^ = (p^_1, ..., p^_n) can be obtained by

p^_k = (1 / (e_k - s_k + 1)) * sum_{i = s_k}^{e_k} p_i    (1)

where s_k and e_k denote the starting and ending data points of the k-th segment in the time series P, respectively (Fig. 2). That is, the segmented means are used to represent the time series (Yi and Faloutsos, 2000). This method is also called piecewise aggregate approximation (PAA) by Keogh et al. (2000). Keogh et al. (2001a) propose an extended version called adaptive piecewise constant approximation (APCA), in which the length of each segment is not fixed, but adapts to the shape of the series. A signature technique is proposed by Faloutsos et al. (1997) with similar ideas. Besides using the mean to represent each segment, other methods have been proposed. For example, Lee et al. (2003) propose to use the segmented sum of variation (SSV) to represent each segment of the time series. Furthermore, a bit level approximation is proposed by Ratanamahatana et al. (2005) and Bagnall et al. (2006), which uses a bit to represent each data point.

To reduce the dimension of time series data, another approach is to approximate a time series with straight lines. Two major categories are involved. The first one is linear interpolation. A common method is using piecewise linear representation (PLR) (Keogh, 1997b; Keogh and Smyth, 1997; Smyth and Keogh, 1997).
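As a concrete illustration, Eq. (1) simply averages each segment; here is a minimal pure-Python sketch (equal-width segments, illustrative names only):

```python
def paa(series, n_segments):
    """Piecewise aggregate approximation: replace each of n_segments
    (nearly) equal-width segments by its mean, as in Eq. (1)."""
    m = len(series)
    means = []
    for k in range(n_segments):
        s = k * m // n_segments        # segment start s_k (0-based)
        e = (k + 1) * m // n_segments  # segment end e_k (exclusive)
        means.append(sum(series[s:e]) / (e - s))
    return means

# A 6-point series reduced to 3 segment means.
print(paa([1, 2, 3, 4, 10, 12], 3))  # → [1.5, 3.5, 11.0]
```

Because each mean summarizes a fixed region of the series, the PAA vector can be indexed directly, one dimension per segment.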
The approximating line for the subsequence P(p_i, ..., p_j) is simply the line connecting the data points p_i and p_j. It tends to closely align the endpoints of consecutive segments, giving a piecewise approximation with connected lines. PLR is a bottom-up algorithm. It begins with creating a fine approximation of the time series, so that m/2 segments are used to approximate the m-length time series, and iteratively merges the lowest-cost pair of segments until it meets the required number of segments. When a pair of adjacent segments S_i and S_{i+1} is merged, the cost of merging the new segment with its right neighbor and the cost of merging the S_{i+1} segment with its new larger neighbor are calculated. Ge (1998) extends PLR to a hierarchical structure. Furthermore, Keogh and Pazzani enhance PLR by considering weights of the segments (Keogh and Pazzani, 1998) and relevance feedback from the user (Keogh and Pazzani, 1999). The second approach is linear regression, which represents the subsequences with the best fitting lines (Shatkay and Zdonik, 1996).

Furthermore, reducing the dimension by preserving the salient points is a promising method. These points are called perceptually important points (PIP). The PIP identification process was first introduced by Chung et al. (2001) and used for pattern matching of technical (analysis) patterns in financial applications.
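As a preview of the identification process detailed next (first and last points are kept, then the point farthest from the line joining its neighbouring PIPs is added repeatedly), here is a minimal sketch using vertical distance to the chord; the names are my own, not those of the cited papers:

```python
def select_pips(series, n_pips):
    """Select n_pips perceptually important points: start with the two
    endpoints, then repeatedly add the interior point with maximum
    vertical distance to the chord joining its neighbouring PIPs."""
    n = len(series)
    chosen = [0, n - 1]  # the first and last points are the first two PIPs
    while len(chosen) < n_pips:
        best, best_d = None, -1.0
        pts = sorted(chosen)
        for a, b in zip(pts, pts[1:]):
            for i in range(a + 1, b):
                # vertical distance from point i to the chord (a, b)
                chord = series[a] + (series[b] - series[a]) * (i - a) / (b - a)
                d = abs(series[i] - chord)
                if d > best_d:
                    best, best_d = i, d
        if best is None:  # no interior points left to add
            break
        chosen.append(best)
    return sorted(chosen)

# The spike at index 2 is the third PIP of this 5-point series.
assert select_pips([0, 0, 5, 0, 0], 3) == [0, 2, 4]
```

Stopping once the required number of PIPs is reached gives the dimensionality reduction; running to completion yields the full importance-ordered list L described below.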
With the time series P, there are n data points: P_1, P_2, ..., P_n. All the data points in P can be reordered by their importance by going through the PIP identification process. The first data point P_1 and the last data point P_n in the time series are the first and second PIPs, respectively. The next PIP found will be the point in P with maximum distance to the first two PIPs. The fourth PIP found will be the point in P with maximum vertical distance to the line joining its two adjacent PIPs, either in between the first and second PIPs or in between the second and the last PIPs. The PIP location process continues until all the points in P are attached to a reordered list L or the required number of PIPs is reached (i.e. reduced to the required dimension). Seven PIPs are identified from the sample time series in Fig. 3. Detailed treatment can be found in Fu et al. (2008c).

[Fig. 1. Time series dimensionality reduction by sampling. The time series on the left is sampled regularly (denoted by dotted lines) and displayed on the right with a large distortion. Fig. 2. Time series dimensionality reduction by PAA. The horizontal dotted lines show the mean of each segment. Footnote 1: PAA was originally called piecewise constant approximation (Keogh and Pazzani, 2000a). Footnote 2: PLR is also called piecewise linear approximation (PLA).]

The idea is similar to a technique proposed about 30 years ago for reducing the number of points required to represent a line by Douglas and Peucker (1973) (see also Hershberger and Snoeyink, 1992). Perng et al. (2000) use a landmark model to identify the important points in the time series for similarity measure. Man and Wong (2001) propose a lattice structure to represent the identified peaks and troughs (called control points) in the time series. Pratt and Fink (2002) and Fink et al. (2003) define extrema as minima and maxima in a time series and compress the time series by selecting only certain important extrema and
dropping the other points. The idea is to discard minor fluctuations and keep major minima and maxima. The compression is controlled by the compression ratio with a parameter R, which is always greater than one; an increase of R leads to the selection of fewer points. That is, given indices i and j, where i <= x <= j, a point p_x of a series P is an important minimum if p_x is the minimum among p_i, ..., p_j, and p_i/p_x >= R and p_j/p_x >= R. Similarly, p_x is an important maximum if p_x is the maximum among p_i, ..., p_j, and p_x/p_i >= R and p_x/p_j >= R. This algorithm takes linear time and constant memory. It outputs the values and indices of all important points, as well as the first and last points of the series. This algorithm can also process new points as they arrive, without storing the original series. It identifies important points based on local information of each segment (subsequence) of the time series. Recently, a critical point model (CPM) (Bao, 2008) and a high-level representation based on a sequence of critical points (Bao and Yang, 2008) have been proposed for financial data analysis. On the other hand, special points are introduced to restrict the error of PLR (Jia et al., 2008). Key points are suggested to represent time series in (Leng et al., 2009) for anomaly detection.

Another common family of time series representation approaches converts the numeric time series to symbolic form. That is, the time series is first discretized into segments, then each segment is converted into a symbol (Yang and Zhao, 1998; Yang et al., 1999; Motoyoshi et al., 2002; Aref et al., 2004). Lin et al. (2003; 2007) propose a method called symbolic aggregate approximation (SAX) to convert the result from PAA to a symbol string. The distribution space (y-axis) is divided into equiprobable regions. Each region is represented by a symbol and each segment can then be mapped into the symbol corresponding to the region in which it resides. The transformed time series P^ obtained by PAA is finally converted to a symbol string SS = (s_1, ..., s_W). Two parameters must be specified for the conversion: the length of subsequence w and the alphabet size A (the number of symbols used).

Besides using the means of the segments to build the alphabet, another method uses volatility change to build the alphabet. Jonsson and Badal (1997) use the "Shape Description Alphabet (SDA)", with example symbols like highly increasing transition, stable transition, and slightly decreasing transition. Qu et al. (1998) use gradient alphabets like upward, flat and downward as symbols. Huang and Yu (1999) suggest transforming the time series to a symbol string using the change ratio between contiguous data points. Megalooikonomou et al. (2004) propose to represent each segment by a codeword from a codebook of key-sequences. This work has been extended to a multi-resolution setting (Megalooikonomou et al., 2005). Morchen and Ultsch (2005) propose an unsupervised discretization process based on quality score and persisting states. Instead of ignoring the temporal order of values like many other methods, their Persist algorithm incorporates temporal information. Furthermore, subsequence clustering is a common method to generate the symbols (Das et al., 1998; Li et al., 2000a; Hugueney and Meunier, 2001; Hebrail and Hugueney, 2001). A multiple abstraction level mining (MALM) approach is proposed by Li et al. (1998), which is based on the symbolic form of the time series. The symbols in this approach are determined by clustering the features of each segment, such as regression coefficients, mean square error and higher order statistics based on the histogram of the regression residuals.

Most of the methods described so far represent time series in the time domain directly. Representing time series in a transformation domain is another large family of approaches. One of the popular transformation techniques in time series data mining is the discrete Fourier transform (DFT), since first being proposed for use in this context by Agrawal et al. (1993a). Rafiei
and Mendelzon (2000) develop similarity-based queries using DFT. Janacek et al. (2005) propose to use a likelihood ratio statistic to test the hypothesis of difference between series instead of a Euclidean distance in the transformed domain. Recent research uses the wavelet transform to represent time series (Struzik and Siebes, 1998). Among these, the discrete wavelet transform (DWT) has been found to be effective in replacing DFT (Chan and Fu, 1999) and the Haar transform is usually selected (Struzik and Siebes, 1999; Wang and Wang, 2000). The Haar transform is a series of averaging and differencing operations on a time series (Chan and Fu, 1999). The average and difference between every two adjacent data points are computed. For example, given a time series P = (1, 3, 7, 5), a dimension of 4 data points is the full resolution (i.e. the original time series); in a dimension of two coefficients, the averages are (2, 6) with the coefficients (-1, 1), and in a dimension of 1 coefficient, the average is 4 with coefficient (-2). A multi-level representation of the wavelet transform is proposed by Shahabi et al. (2000). Popivanov and Miller (2002) show that a large class of wavelet transformations can be used for time series representation. Dasha et al. (2007) compare different wavelet feature vectors. On the other hand, a comparison between DFT and DWT can be found in Wu et al. (2000b) and Morchen (2003), and a combined use of Fourier and wavelet transforms is presented in Kawagoe and Ueda (2002). An ensemble index is proposed by Keogh et al. (2001b) and Vlachos et al. (2006), which ensembles two or more representations for indexing. Principal component analysis (PCA) is a popular multivariate technique used for developing multivariate statistical process monitoring methods (Yang and Shahabi, 2005b; Yoon et al., 2005) and it is applied to analyze financial time series by Lesch et al. (1999). In most of the related works, PCA is used to eliminate the less significant components or sensors and reduce the data representation to only the most
significant ones and to plot the data in two dimensions. The PCA model defines a linear hyperplane, so it can be considered as the multivariate extension of PLR. PCA maps the multivariate data into a lower dimensional space, which is useful in the analysis and visualization of correlated high-dimensional data. Singular value decomposition (SVD) (Korn et al., 1997) is another transformation-based approach. Other time series representation methods include modeling time series using hidden Markov models (HMMs) (Azzouzi and Nabney, 1998), and a compression technique for multiple streams is proposed by Deligiannakis et al. (2004). It is based on a base signal, which encodes piecewise linear correlations among the collected data values. In addition, a recent biased dimension reduction technique is proposed by Zhao and Zhang (2006) and Zhao et al. (2006).

[Fig. 3. Time series compression by data point importance. The time series on the left is represented by seven PIPs on the right.]

Moreover, many of the representation schemes described above are incorporated with different indexing methods. A common approach is to adopt an existing multidimensional indexing structure (e.g. the R-tree proposed by Guttman (1984)) for the representation. Agrawal et al. (1993a) propose an F-index, which adopts the R*-tree (Beckmann et al., 1990) to index the first few DFT coefficients. An ST-index is further proposed by Faloutsos et al. (1994), which extends the previous work for subsequence handling. Agrawal et al. (1995a) adopt both the R*- and R+-tree (Sellis et al., 1987) as the indexing structures. A multi-level distance based index structure is proposed (Yang and Shahabi, 2005a) for indexing time series represented by PCA. Vlachos et al. (2005a) propose a Multi-Metric (MM) tree, which is a hybrid indexing structure on Euclidean and periodic spaces. The minimum bounding rectangle (MBR) is also a common technique for time series indexing (Chu and Wong, 1999; Vlachos et
al., 2003). An MBR is adopted in (Rafiei, 1999), in which an MT-index is developed based on the Fourier transform, and in (Kahveci and Singh, 2004), in which a multi-resolution index is proposed based on the wavelet transform. Chen et al. (2007a) propose an indexing mechanism for the PLR representation. On the other hand, Kim et al. (1996) propose an index structure called the TIP-index (TIme series Pattern index) for manipulating time series pattern databases. The TIP-index is developed by improving the extended multidimensional dynamic index file (EMDF) (Kim et al., 1994). iSAX (Shieh and Keogh, 2009) is proposed to index massive time series and is developed based on SAX. A multi-resolution indexing structure is proposed by Li et al. (2004), which can be adapted to different representations.

To sum up, for a given index structure, the efficiency of indexing depends only on the precision of the approximation in the reduced dimensionality space. However, in choosing a dimensionality reduction technique, we cannot simply choose an arbitrary compression algorithm. It requires a technique that produces an indexable representation. For example, many time series can be efficiently compressed by delta encoding, but this representation does not lend itself to indexing. In contrast, SVD, DFT, DWT and PAA all lend themselves naturally to indexing, with each eigenwave, Fourier coefficient, wavelet coefficient or aggregate segment mapping onto one dimension of an index tree.
Post-processing is then performed by computing the actual distance between sequences in the time domain and discarding any false matches.

3. Similarity measure

Similarity measure is of fundamental importance for a variety of time series analysis and data mining tasks. Most of the representation approaches discussed in Section 2 also propose a similarity measure method on the transformed representation scheme. In traditional databases, similarity measure is exact-match based. However, in time series data, which is characterized by its numerical and continuous nature, similarity measure is typically carried out in an approximate manner. Consider stock time series: one may expect queries like:

Query 1: find all stocks which behave "similar" to stock A.
Query 2: find all "head and shoulders" patterns lasting for a month in the closing prices of all high-tech stocks.

The query results are expected to provide useful information for different stock analysis activities. Queries like Query 2 are in fact tightly coupled with the patterns frequently used in technical analysis, e.g. double top/bottom, ascending triangle, flag and rounded top/bottom.

In the time series domain, devising an appropriate similarity function is by no means trivial. There are essentially two ways the data might be organized and processed (Agrawal et al., 1993a). In whole sequence matching, the whole length of all time series is considered during the similarity search. It requires comparing the query sequence to each candidate series by evaluating the distance function and keeping track of the sequence with the smallest distance. In subsequence matching, where a query sequence Q and a longer sequence P are given, the task is to find the subsequences in P which match Q.
Subsequence matching requires that the query sequence Q be placed at every possible offset within the longer sequence P. With respect to Query 1 and Query 2 above, they can be considered as whole sequence matching and subsequence matching, respectively. Gavrilov et al. (2000) study the usefulness of different similarity measures for clustering similar stock time series.

3.1. Whole sequence matching

To measure the similarity/dissimilarity between two time series, the most popular approach is to evaluate the Euclidean distance on a transformed representation like the DFT coefficients (Agrawal et al., 1993a) and the DWT coefficients (Chan and Fu, 1999). Although most of these approaches guarantee a lower bound of the Euclidean distance to the original data, the Euclidean distance is not always the suitable distance function in certain domains (Keogh, 1997a; Perng et al., 2000; Megalooikonomou et al., 2005). For example, stock time series have their own characteristics compared with other time series data (e.g. data from scientific areas like ECG), in which the salient points are important.

Besides Euclidean-based distance measures, other distance measures can easily be found in the literature. A constraint-based similarity query is proposed by Goldin and Kanellakis (1995), which extends the work of Agrawal et al. (1993a). Das et al. (1997) apply computational geometry methods for similarity measure. Bozkaya et al. (1997) use a modified edit distance function for time series matching and retrieval. Chu et al. (1998) propose to measure the distance based on the slopes of the segments to handle amplitude and time scaling problems. A projection algorithm is proposed by Lam and Wong (1998). A pattern recognition method is proposed by Morrill (1998), which is based on the building blocks of the primitives of the time series.
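The brute-force subsequence matching described at the start of this section — sliding the query Q across every possible offset of P under the Euclidean distance — can be sketched as follows (illustrative only; real systems use the indexing schemes of Section 2 to prune offsets):

```python
import math

def best_match_offset(p, q):
    """Slide query q over every offset of the longer series p and return
    (offset, distance) of the closest subsequence under the Euclidean
    distance."""
    best_off, best_d = None, math.inf
    for off in range(len(p) - len(q) + 1):
        # Euclidean distance between q and the window of p starting at off.
        d = math.sqrt(sum((p[off + i] - q[i]) ** 2 for i in range(len(q))))
        if d < best_d:
            best_off, best_d = off, d
    return best_off, best_d

# The query [5, 6, 7] occurs exactly at offset 2, giving distance zero.
assert best_match_offset([1, 2, 5, 6, 7, 3], [5, 6, 7]) == (2, 0.0)
```

The quadratic cost of this scan is precisely what the dimensionality reduction plus indexing pipeline of Section 2 is designed to avoid.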
Ruspini and Zwir (1999) devote themselves to automated identification of significant qualitative features of complex objects. They propose the process of discovery and representation of interesting relations between those features, the generation of structured indexes and textual annotations describing features and their relations. The discovery of knowledge by an analysis of collections of qualitative descriptions is then achieved. They focus on methods for the succinct description of interesting features lying on an effective frontier. Generalized clustering is used for extracting features that interest domain experts. Generalized Markov models are adopted for waveform matching in Ge and Smyth (2000). A content-based query-by-example retrieval model called FALCON is proposed by Wu et al. (2000a), which incorporates a feedback mechanism.

Indeed, one of the most popular and field-tested similarity measures is the "time warping" distance measure. Based on the dynamic time warping (DTW) technique, the method proposed in (Berndt and Clifford, 1994) predefines some patterns to serve as templates for the purpose of pattern detection. To align two time series, P and Q, using DTW, an n-by-m matrix M is first constructed. The (i-th, j-th) element of the matrix, m_ij, contains the distance d(q_i, p_j) between the two points q_i and p_j, and a Euclidean distance is typically used, i.e. d(q_i, p_j) = (q_i - p_j)^2. It corresponds to the alignment between the points q_i and p_j. A warping path, W, is a contiguous set of matrix elements that defines a mapping between Q and P. Its k-th element is defined as w_k = (i_k, j_k) and

W = w_1, w_2, ..., w_k, ..., w_K    (2)

where max(m, n) <= K < m + n - 1.

The warping path is typically subjected to the following constraints: boundary conditions, continuity and monotonicity. The boundary conditions are w_1 = (1, 1) and w_K = (m, n). This requires the warping path to start and finish in diagonally opposite corner cells. The next constraint is continuity. Given w_k = (a, b) and w_{k-1} = (a', b'), it requires a - a' <= 1 and b - b' <= 1. This restricts the allowable steps in the warping path to adjacent cells, including the diagonally adjacent cell. Also, the constraints a - a' >= 0 and b - b' >= 0 force the points in W to be monotonically spaced in time. There is an exponential number of warping paths satisfying the above conditions. However, only the path that minimizes the warping cost is of interest. This path can be efficiently found by using dynamic programming (Berndt and Clifford, 1996) to evaluate the following recurrence, which defines the cumulative distance g(i, j) as the distance d(q_i, p_j) found in the current cell plus the minimum of the cumulative distances of the adjacent elements, i.e.

g(i, j) = d(q_i, p_j) + min{g(i-1, j-1), g(i-1, j), g(i, j-1)}    (3)

The warping path W that minimizes the "distance" between the two series can then be calculated by

DTW(Q, P) = min_W [ sum_{k=1}^{K} d(w_k) ]    (4)

where d(w_k) can be defined as

d(w_k) = d(q_{i_k}, p_{j_k}) = (q_{i_k} - p_{j_k})^2    (5)

Detailed treatment can be found in Kruskall and Liberman (1983).

As DTW is computationally expensive, different methods have been proposed to speed up the DTW matching process. Different constraint (banding) methods, which control the subset of the matrix that the warping path is allowed to visit, are reviewed in Ratanamahatana and Keogh (2004). Yi et al. (1998) introduce a technique for approximate indexing of DTW that utilizes a FastMap technique, which filters the non-qualifying series. Kim et al. (2001) propose an indexing approach under the DTW similarity measure. Keogh and Pazzani (2000b) introduce a modification of DTW, which integrates with PAA and operates on a higher level abstraction of the time series. An exact indexing approach, based on representing the time series by PAA for the DTW similarity measure, is further proposed by Keogh (2002). An iterative deepening dynamic time warping (IDDTW) is suggested by Chu et al. (2002), which is based on a probabilistic model of the approximate errors for all levels of approximation prior to the
query process. Chan et al. (2003) propose a filtering process based on the Haar wavelet transformation from low resolution approximation of the real-time warping distance. Shou et al. (2005) use an APCA approximation to compute the lower bounds for the DTW distance. They improve the global bound proposed by Kim et al. (2001), which can be used to index the segments, and propose a multi-step query processing technique. FastDTW is proposed by Salvador and Chan (2004). This method uses a multi-level approach that recursively projects a solution from a coarse resolution and refines the projected solution. Similarly, a fast DTW search method, FTW, is proposed by Sakurai et al. (2005) for efficiently pruning a significant number of search candidates. Ratanamahatana and Keogh (2005) clarify some points about DTW related to lower bounding and speed. Euachongprasit and Ratanamahatana (2008) also focus on this problem. A sequentially indexed structure (SIS) is proposed by Ruengronghirunya et al. (2009) to balance the tradeoff between indexing efficiency and I/O cost during DTW similarity measure; a lower bounding function for groups of time series, LBG, is adopted. On the other hand, Keogh and Pazzani (2001) point out the potential problem of DTW that it can lead to unintuitive alignments, where a single point on one time series maps onto a large subsection of another time series. Also, DTW may fail to find obvious and natural alignments in two time series because of a single feature (i.e. peak, valley, inflection point, plateau, etc.).
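Equations (3)-(5) translate directly into a dynamic-programming table; a minimal, unoptimized sketch (none of the speedup techniques above; names are illustrative):

```python
def dtw(q, p):
    """Cumulative-distance recurrence of Eq. (3) with the squared point
    distance d(q_i, p_j) = (q_i - p_j)**2 of Eq. (5); returns DTW(Q, P)."""
    n, m = len(q), len(p)
    INF = float("inf")
    # g[i][j] holds the cumulative distance g(i, j); borders are infinite
    # so that the path is forced to start at g(0, 0).
    g = [[INF] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - p[j - 1]) ** 2
            g[i][j] = d + min(g[i - 1][j - 1], g[i - 1][j], g[i][j - 1])
    return g[n][m]

# Warping absorbs the repeated point, so the distance is zero.
assert dtw([1, 2, 3], [1, 2, 2, 3]) == 0.0
```

The O(nm) cost of filling this table is exactly what the banding, lower-bounding and multi-resolution methods cited above try to reduce.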
One of the causes is the great difference between the lengths of the compared series. Therefore, besides improving the performance of DTW, methods have also been proposed to improve the accuracy of DTW. Keogh and Pazzani (2001) propose a modification of DTW that considers the higher level feature of shape for better alignment. Ratanamahatana and Keogh (2004) propose to learn arbitrary constraints on the warping path. Regression time warping (RTW) is proposed by Lei and Govindaraju (2004) to address the challenges of shifting, scaling, robustness and complexity. Latecki et al. (2005) propose a method called minimal variance matching (MVM) for elastic matching. It determines a subsequence of the time series that best matches a query series by finding the cheapest path in a directed acyclic graph. A segment-wise time warping distance (STW) is proposed by Zhou and Wong (2005) for time scaling search. Fu et al. (2008a) propose a scaled and warped matching (SWM) approach for handling both DTW and uniform scaling simultaneously. Different customized DTW techniques have been applied to the field of music research for query by humming (Zhu and Shasha, 2003; Arentz et al., 2005).

Focusing on similar problems as DTW, the Longest Common Subsequence (LCSS) model (Vlachos et al., 2002) is proposed. The LCSS is a variation of the edit distance and the basic idea is to match two sequences by allowing them to stretch, without rearranging the sequence of the elements, but allowing some elements to be unmatched. One of the important advantages of LCSS over DTW is its treatment of outliers. Chen et al. (2005a) further introduce a distance function based on edit distance on real sequence (EDR), which is robust against data imperfection. Morse and Patel (2007) propose a Fast Time Series Evaluation (FTSE) method which can be used to evaluate the threshold value of these kinds of techniques in a faster way. Threshold-based distance functions are proposed by Aßfalg et al.
(2006). The proposed function considers intervals during which the time series exceeds a certain threshold for comparing time series, rather than using the exact time series values. A T-Time application is developed (Aßfalg et al., 2008) to demonstrate its usage. Fu et al. (2007) further suggest introducing rules to govern the pattern matching process if a priori knowledge exists in the given domain. A parameter-light distance measure method based on Kolmogorov complexity theory is suggested in Keogh et al. (2007b); a compression-based dissimilarity measure (CDM)³ is adopted in that paper. Chen et al. (2005b) present a histogram-based representation for similarity measure. Similarly, a histogram-based similarity measure, bag-of-patterns (BOP), is proposed by Lin and Li (2009). The frequency of occurrences of each pattern in

³ CDM is proposed by Keogh et al. (2004), and is used to compare the co-compressibility between data sets.

Tak-chung Fu / Engineering Applications of Artificial Intelligence 24 (2011) 164-181, p. 168
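The LCSS idea described above (stretch both sequences, but simply skip elements, such as outliers, that cannot be matched within a tolerance eps) can be sketched as follows; the function name and the normalisation by the shorter length are assumptions made for this illustration:

```python
def lcss_similarity(a, b, eps=0.5):
    """Longest Common Subsequence score for numeric series.

    Elements match when they differ by at most eps; unmatched
    elements (e.g. outliers) are skipped rather than forced into
    the alignment, which is the key difference from DTW.
    """
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) <= eps:
                L[i][j] = L[i - 1][j - 1] + 1              # matched pair
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])    # skip one element
    return L[n][m] / min(n, m)  # normalised to [0, 1]
```

A series containing one wild outlier still scores a perfect match against its clean counterpart, since the outlier is simply left unmatched.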
Research Proposal: Association-Rule-Based Data Mining Algorithms and Their Applications

I. Background and significance: With the arrival of the Internet era, data volumes keep growing and the problem of information overload has become increasingly prominent.
Data mining techniques are needed to extract useful knowledge from such data.
Association rule mining is one of the important techniques in data mining; it is mainly used to discover associated items and frequent itemsets in a dataset in support of decision-making and prediction.
As data volumes and data types keep increasing, association rule algorithms face ever greater challenges.
This thesis takes association-rule-based data mining algorithms and their applications as its research subject, aiming to understand in depth the principles and characteristics of association rule mining algorithms and their typical application scenarios.
The study should help improve the efficiency and accuracy of data mining techniques in practical applications and provide enterprises and institutions with more accurate decision support.
II. Research content and methods: 1. Background and significance: introduce the application and development trends of data mining techniques in the Internet era, and analyze the importance and application scenarios of association rule mining algorithms in data mining.
2. Association rule mining algorithms: present the principles and characteristics of association rule mining algorithms such as Apriori and FP-Growth, and compare the strengths and weaknesses of each.
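As a minimal illustration of the level-wise idea behind Apriori that the proposal will review (a textbook sketch written for this document, not tuned production code):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Textbook Apriori: level-wise search in which every
    (k+1)-item candidate must have all of its k-item subsets
    frequent before its support is even counted."""
    transactions = [frozenset(t) for t in transactions]
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)
    frequent = {}
    # level 1: all single items meeting the support threshold
    items = {frozenset([i]) for t in transactions for i in t}
    level = {c for c in items if support(c) >= min_support}
    k = 1
    while level:
        frequent.update({c: support(c) for c in level})
        k += 1
        # join step: combine frequent k-1 sets into k-sets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune step + support counting
        level = {c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and support(c) >= min_support}
    return frequent
```

For example, with three baskets {a,b}, {a,b,c}, {a,c} and min_support = 0.6, the itemsets {a}, {b}, {c}, {a,b} and {a,c} come out frequent, while {b,c} does not.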
3. Case-study analysis: taking e-commerce as an example, use a real data mining case to explore how association rule mining algorithms are applied and how well they work, and evaluate their accuracy and efficiency.
4. Summary and outlook: summarize the characteristics and practical value of association rule mining algorithms and discuss their future directions and trends in the data mining field.
III. Expected outcomes: The expected outcomes of this study are: 1. An in-depth discussion of the principles and characteristics of association rule mining algorithms, with a comparison of the strengths and weaknesses of each.
2. An evaluation, through the application case study, of the accuracy and efficiency of association rule mining algorithms.
3. Practical guidance and support for applying data mining in real settings.
IV. Research plan: 1. Week 1: survey the literature and settle the research direction and content.
2. Week 2: study the principles and characteristics of association rule mining algorithms in depth.
3. Week 3: compare the various association rule mining algorithms and select suitable ones.
4. Week 4: evaluate the accuracy and efficiency of the algorithms on a real application case.
5. Week 5: summarize the research results and draft the proposal.
6. Week 6: revise and polish the report and finalize the proposal.
V. Difficulties and risks: the main difficulties of this study are: 1. Understanding and applying association rule mining algorithms requires a solid mathematical foundation and programming ability.
Evolution law of the water-conducting fracture zone in a large-mining-height working face under the Luohe Formation sandstone aquifer

DOI: 10.13347/j.cnki.mkaq.2021.03.006
Citation: YANG Yuliang, XU Zhuhe. Evolution law of water-conducting fault zone in large mining height working face under sandstone aquifer of Luohe Formation [J]. Safety in Coal Mines, 2021, 52(3): 30-35, 42.

YANG Yuliang (1,2), XU Zhuhe (1,2)
(1. School of Coal Engineering, Shanxi Datong University, Datong 037003, China; 2. School of Energy and Mining Engineering, China University of Mining and Technology (Beijing), Beijing 100083, China)

Abstract: To address the problem of safe coal mining under the Luohe Formation sandstone aquifer overlying the thick-seam, large-mining-height working faces of the Xunyao mining area, theoretical analysis, similar-material experiments and numerical simulation were used to study overburden breakage and fracture evolution at the 1109 large-mining-height working face of a mine. The study shows that the overlying strata pass through five stages: breakage of the immediate roof, first and periodic breakage of the basic roof, and first and periodic breakage of the sub-key stratum. Each breakage causes a jump in the density and aperture of fractures above the coal wall of the working face, and the overburden fractures in the goaf pass through five dynamic stages: incubation, initiation, opening, closure and compaction. From the open-off cut to full mining, fractures are most developed in the upper part of the fracture zone, above the coal wall of the working face and above the open-off cut, and the fractured region is approximately parabolic in shape; fractures in the middle of the goaf close, while those at the boundary do not close easily. After the working face is fully mined, the water-conducting fracture zone develops to a height of 82-85 m and does not reach the overlying Luohe Formation sandstone.

Keywords: Luohe Formation sandstone; large-mining-height working face; water-conducting fracture zone; similar-material experiment; numerical simulation
CLC number: P641; Document code: A; Article ID: 1003-496X(2021)03-0030-06
Funding: Datong Key R&D Program project (2019025); Shanxi Datong University Scientific Research project (2017K3)

Mining under buildings, water bodies and railways and above confined water has long been a major technical problem constraining the safe and efficient development of China's coal mines, among which mining under water bodies and aquifers is especially prominent. Areas covered by the fractured sandstone aquifer of the Cretaceous Luohe Formation face varying degrees of threat to safe coal mining under water bodies.
An MDL-based Pattern Mining Algorithm for Log Sequences

Computer Engineering, Vol. 47, No. 2, February 2021

An MDL-based Pattern Mining Algorithm for Log Sequences
DU Shiqing (1), WANG Peng (2), WANG Wei (2)
(1. Software School, Fudan University, Shanghai 201203, China; 2. School of Computer Science, Fudan University, Shanghai 201203, China)

Abstract: Log data are procedural event records produced by Internet systems. Mining high-quality sequential patterns from log data can help engineers carry out system operation and maintenance efficiently. To address the redundant results of traditional pattern mining algorithms, an algorithm for Discovering sequential patterns from Temporal log Sequences (DTS) is proposed. DTS heuristically mines a set of patterns that adequately represents the event relationships and temporal regularities of the original sequence, applies the Minimum Description Length (MDL) principle to pattern mining, and designs an encoding scheme that takes both event relationships and temporal relationships into account to curb pattern explosion. Experimental results on real log datasets show that, compared with sequential pattern mining algorithms such as SQS, CSC and ISM, the proposed algorithm efficiently mines sequential patterns that are rich in meaning and low in redundancy.

Keywords: data mining; log analysis; event relationships; Minimum Description Length (MDL) principle; sequential patterns
DOI: 10.19678/j.issn.1000-3428.0057181

0 Overview
Log data record important events during the operation of Internet systems, such as system state and the start and end of tasks. Being easy to obtain and rich in information, logs have become an important data source for system operation and maintenance.
Data Mining - Concepts and Techniques CH05
www.cs.sfu.ca/~hanj
Sunday, July 21, 2013. Data Mining: Concepts and Techniques
Chapter 5: Concept Description: Characterization and Comparison
What is concept description?
Data generalization and summarization-based characterization
Analytical characterization: Analysis of attribute relevance
Perform generalization by attribute removal or attribute generalization. Apply aggregation by merging identical generalized tuples and accumulating their respective counts. Interactive presentation with users.
Conceptual levels
Approaches: Data cube approach (OLAP approach); Attribute-oriented induction approach
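The attribute-oriented induction approach listed above (generalize attribute values through concept hierarchies or remove attributes, then merge identical generalized tuples while accumulating counts, as the earlier slide describes) can be sketched as follows; the attribute names and the one-level hierarchy dictionary are made-up examples:

```python
from collections import Counter

def attribute_oriented_induction(tuples, hierarchies, drop=()):
    """Minimal attribute-oriented induction sketch.

    Each attribute value is replaced by its higher-level concept from
    a one-level hierarchy (attributes in `drop` are removed outright),
    then identical generalized tuples are merged and their counts
    accumulated."""
    merged = Counter()
    for t in tuples:
        generalized = tuple(sorted(
            (attr, hierarchies.get(attr, {}).get(val, val))
            for attr, val in t.items() if attr not in drop))
        merged[generalized] += 1  # aggregate identical generalized tuples
    return merged
```

Two tuples whose cities generalize to the same province collapse into a single generalized tuple with count 2, which is exactly the "merge and count" step of the slide.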
Concept Description vs. OLAP
Concept description: can handle complex data types of the attributes and their aggregations a more automated process OLAP: restricted to a small number of dimension and measure types user-controlled process
Numerical simulation study of the influence of hydraulic fracturing of a hard working-face roof on mining-induced stress (Gao Fuqiang)
...and discusses the pressure-relief mechanism. The numerical simulation results show that hydraulic fracturing of the working-face roof mainly weakens the integrity of the roof so that it caves promptly once the face has passed. Although this local caving does not by itself significantly change the abutment-pressure distribution, the timely caving of the local immediate roof induces timely breakage and collapse of the overlying basic roof and key strata, which affects the large-scale overburden structure of the stope and, in turn, the abutment-pressure distribution. Continuous fracturing of the working-face roof can reduce the first caving interval and the average weighting interval.

Longwall coal mining frequently faces the problem of hard roofs that are difficult to cave. The strata overlying a coal seam may contain hard, thick layers of good integrity...
Received: 2020-11-14; Revised: 2021-02-01
Responsible editor: Xu Shuge
Funding: National Natural Science Foundation of China (51774185)
Author: Gao Fuqiang (1981- ), male, from Fugou, Henan; associate research fellow, Ph.D. E-mail: gaofuqiang@tdkcsj.com
1 Numerical simulation
1.1 Numerical model
As the working face advances, after the first weighting of the basic roof, the structure formed by the strata of the fractured zone continually goes through a cycle of "stable, unstable, restabilized", and this evolution is the basis of the mining-induced stress evolution. Continuum-based finite-element and finite-difference methods cannot realistically represent the breakage of strata and therefore cannot simulate this periodic failure of the goaf overburden well. Conventional discrete-element methods can represent breakage, but the fracture locations must be set in advance; although sufficiently many randomly distributed preset paths (e.g. UDEC-Voronoi or UDEC-Trigon) can make fracture positions largely independent of the preset mesh, such fine meshing sacrifices computational efficiency. Moreover, in UDEC, when a simulated block undergoes large deformation and detaches from the surrounding blocks, computational efficiency drops sharply, and hundreds of thousands of timesteps are needed before the broken blocks fall, compact and stabilize. In the author's experience, the coupled finite-element/discrete-element code ELFEN is an effective tool for simulating the periodic caving of goaf overburden [7-8]. In ELFEN, an element is elastic and obeys finite-element rules before failure and switches to discrete-element rules only after failure, which greatly improves computational efficiency; furthermore, cracks can propagate across element boundaries, which largely reduces the influence of the preset mesh on crack propagation. For these reasons, ELFEN was chosen for this study.
Measurement and analysis of the roof movement law of the 12103 fully mechanized coal face
Received: 2019-04-11. Author: Zhao Jian (1988- ), male, Han, from Fugu, Shaanxi; technical secondary school diploma; assistant engineer; research interest: mining engineering.
Measurement and Analysis of the Roof Movement Law of the 12103 Fully Mechanized Coal Face
Zhao Jian (Shanxi Xinzhou Shenda Jinshan Coal Industry Co., Ltd., Xinzhou 719400, Shanxi, China)

Abstract: During extraction of the 12103 fully mechanized face of Shenda Jinshan Coal Industry, strata pressure manifested intensely and the working resistance of the supports increased. To effectively mitigate the hazards of strata pressure during production, the roof movement law of the 12103 fully mechanized face was measured and analyzed, and methods for reducing the intensity of the first roof weighting and for roof control and management were put forward. Forced roof caving when the face had advanced 35 m from the open-off cut successfully reduced the impact of the first weighting, and support and roof management were carried out ahead of each periodic weighting, ensuring safe extraction.

Keywords: fully mechanized mining; strata pressure; law; analysis
CLC number: TD327.2; Document code: B; doi: 10.3969/j.issn.1005-2801.2019.10.002

The No. 12 seam lies in the middle-lower part of the Shanxi Formation; the seam is 1.09-2.76 m thick, 2.2 m on average, with a dip of 7°-9°.
Ekip UP+ low-voltage power-station protection relay: white paper
White paper
Ekip UP+ relay, designed for low-voltage power stations as well as mining and oil & gas plants
— Contents
002-003 Applications
004-009 Solutions
005-007 Safety protections
008-009 Connectivity and asset
They protect feeders (mains, incomings, departures), generators (diesel gensets, co-generators, wind and mini-hydro turbines), motors and transformer/busbars. Feeder and generator relays represent more than 65% of low voltage share.
For example, with a color touch-screen display and a graphics-friendly Ekip Connect commissioning tool, Ekip UP+ does not require skilled engineers to manage protection relays.
These digital units do not require voltage transformers1 nor use traditional current transformers, so they use less wiring and fewer components than conventional relays developed for medium- voltage applications, saving substantial space.
Efficient Mining of Partial Periodic Patterns in Time Series Database (in ICDE99)

Jiawei Han, School of Computing Science, Simon Fraser University, han@cs.sfu.ca
Guozhu Dong, Department of Computer Science and Engineering, Wright State University, gdong@
Yiwen Yin, School of Computing Science, Simon Fraser University, yiweny@

Abstract

Partial periodicity search, i.e., search for partial periodic patterns in time-series databases, is an interesting data mining problem. Previous studies on periodicity search mainly consider finding full periodic patterns, where every point in time contributes (precisely or approximately) to the periodicity. However, partial periodicity is very common in practice since it is more likely that only some of the time episodes may exhibit periodic patterns. We present several algorithms for efficient mining of partial periodic patterns, by exploring some interesting properties related to partial periodicity, such as the Apriori property and the max-subpattern hit set property, and by shared mining of multiple periods. The max-subpattern hit set property is a vital new property which allows us to derive the counts of all frequent patterns from a relatively small subset of patterns existing in the time series. We show that mining partial periodicity needs only two scans over the time series database, even for mining multiple periods. The performance study shows our proposed methods are very efficient in mining long periodic patterns.

Keywords: periodicity search, partial periodicity, time-series analysis, data mining algorithms.

1. Introduction

Finding periodic patterns in time series databases is an important data mining task with many applications. There are many studies on time series data mining: most concentrate on symbolic patterns, although some consider numerical curve patterns in time series. Agrawal and Srikant [3] developed an Apriori-like technique [2] for mining sequential patterns. Mannila et al. [10] consider frequent episodes in sequences, where episodes are essentially acyclic graphs of
events whose edges specify the temporal before-and-after relationship but without timing-interval restrictions. Inter-transaction association rules proposed by Lu et al. [9] are implication rules whose two sides are totally-ordered episodes with timing-interval restrictions (on the events in the episodes and on the two sides). Bettini et al. [5] consider a generalization of inter-transaction association rules: these are essentially rules whose left-hand and right-hand sides are episodes with time-interval restrictions. However, unlike ours, periodicity is not considered in these studies. Similar to our problem, the mining of cyclic association rules by Özden et al. [12] also considers the mining of some patterns of a range of possible periods. Observe that cyclic association rules are partial periodic patterns with perfect periodicity, in the sense that each pattern reoccurs in every cycle, with 100% confidence. The perfectness in periodicity leads to a key idea used in designing efficient cyclic association rule mining algorithms: as soon as it is known that an association rule does not hold at a particular instant of time, we can infer that the rule cannot have periods which include this time instant. For example, if the maximum period of interest is p_max and it is discovered that the rule does not hold in the first p_max time instants, then the rule cannot have any periods. This idea leads to the useful "cycle-elimination" strategy explored in that paper. Since real-life patterns are usually imperfect, our goal is not to mine perfect periodicity, and thus "cycle-elimination"-based optimization will not be considered here. An Apriori-like algorithm has been proposed for mining imperfect partial periodic patterns with a given (single) period in a recent study by two of the current authors [7]. It is an interesting algorithm for mining imperfect partial periodicity. However, with a detailed examination of the data characteristics of partial periodicity, we found that Apriori pruning in mining partial periodicity may not be as effective as in mining association
rules. Our study has revealed the following new characteristics of partial periodic patterns in time series: the Apriori-like property among partial periodic patterns still holds for any fixed period, but it does not hold for patterns between different periods. Furthermore, there is a strong correlation

(Footnote: if the letter set at a position is a singleton, we omit the brackets; e.g., we write {a} as a.)

A pattern with i non-* positions is called an i-pattern, and any pattern obtained from it by replacing some non-* positions with * is one of its subpatterns. The frequency count of a pattern s of period p in S is the number of period segments in which s is true, frequency_count(s) = |{ i | 0 <= i < m and s is true in the period segment S_i }|, and confidence(s) = frequency_count(s) / m, where m is the maximum number of periods of length p contained in the time series (i.e., m is the positive integer such that m·p <= |S| < (m+1)·p). Each segment of the form S_i = D_{ip+1} ... D_{ip+p}, where 0 <= i < m, is called a period segment. We say a pattern s is true in the period segment (or the period segment matches s) if, for each position j, either s_j is * or all the letters in s_j occur in the set of features of the segment at position j. Thus, if s' is a subpattern of s, then the set of segments that can match s is a subset of the segments that can match s'. Example 2.1: a pattern of the given period whose frequency count in the feature series is 2 has confidence 2/m.

The mining of frequent partial periodic patterns in a time series S is to discover, possibly with some restrictions, all the frequent patterns of the series for one period or a range of specified periods. More specifically, the input to mining includes: a time series S; a specified period p, or a range of periods specified by two integers p_low and p_high; and an integer indicating that the ratio of the length of S to the length of the patterns must be at least that integer. This will ensure that the patterns mined would be of value to the application at hand.

Remark: sometimes the derivation of the feature series from the original data series is quite involved, and the interaction of the periodic patterns with the derivation of features may lead to improved performance. Hence it is worthwhile to combine the mining of the features from the datasets with the mining of the patterns, as is the case for the mining of cyclic association rules [12]. For our work on the mining of frequent partial periodic patterns, though, this
interaction is not useful for achieving computational advantage, and thus we will assume that we are dealing with the feature time series in our study.

3. Methods for mining partial periodicity in time series

In this section, we explore methods for mining partial periodicity in a time series, proceeding from mining partial periodicity for a single given period to mining partial periodicity for a specified range of periods (i.e., multiple periods).

3.1 Mining partial periodicity for a single period

3.1.1 Single-period Apriori method

A popular key idea used in the efficient mining of association rules is the Apriori property discovered in [2]: if one subset of an itemset is not frequent, then the itemset itself cannot be frequent. This allows us to use frequent itemsets of size i as filters for candidate itemsets of size i+1. Interestingly, for each period, the property supporting the Apriori "trick" still holds:

Property 3.1 [Apriori on periodicity]. Each subpattern of a frequent pattern of period p is itself a frequent pattern of period p.

The proof is based on the fact that patterns are more restrictive than their subpatterns. Suppose s' is a subpattern of a frequent pattern s. Then s' is obtained from s by changing the letter sets at some positions to subsets or to *. Hence s is more restrictive than s', and thus the frequency count of s' is greater than or equal to that of s; s' is frequent as well.

An algorithm for mining partial periodic patterns for a given fixed period based on this Apriori "trick" was presented in [7]. We include a simplified version here for the sake of completeness.

Algorithm 3.1 [Single-period Apriori]. Find all partial periodic patterns for a given period p satisfying a given confidence threshold min_conf, where m is the maximum number of periods.
1. Scan the time series once and find F1, the set of frequent 1-patterns of period p.
2. Find all frequent i-patterns of period p, for i from 2 up to p, based on the idea of Apriori, and terminate immediately when the candidate frequent i-pattern set is empty.
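Step 1 of the single-period method, counting each (position, feature) pair over the m period segments and keeping those whose confidence reaches min_conf, can be sketched as follows; representing the series as a list of feature sets and the (position, feature) encoding of 1-patterns are assumptions made for this sketch:

```python
from collections import Counter

def frequent_1_patterns(series, p, min_conf):
    """One scan over the series: for each of the m = len(series) // p
    period segments, count every (in-period position, feature) pair,
    then keep the pairs whose confidence count / m is at least
    min_conf.  Each element of `series` is the set of features
    observed at one time instant."""
    m = len(series) // p
    counts = Counter()
    for i in range(m):
        segment = series[i * p:(i + 1) * p]
        for pos, features in enumerate(segment):
            for f in features:
                counts[(pos, f)] += 1
    return {pf: c / m for pf, c in counts.items() if c / m >= min_conf}
```

The returned dictionary plays the role of F1; the frequent i-patterns of step 2 are assembled from these position/feature pairs.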
Analysis. Number of scans over the time series: Step 1 of the algorithm needs to scan the time series once. Step 2 needs to scan it up to p-1 times in the worst case. Thus the total number of scans is no more than the period p. Space needed: (1) At Step 1, suppose there exist a total of n distinct features at the positions within a period of S; we need n units of space to hold the counts. In the worst case, when every feature is distinct in the entire time series, we need |S| units of space. After Step 1, we only need |F1| units of space to keep F1, the set of frequent 1-patterns in S. (2) At Step 2, the maximum number of candidate subpatterns that we may generate grows combinatorially in |F1|. Considering that we still need space to keep the set of frequent 1-patterns, the total amount of space needed is large in the worst case. However, the average case should be much smaller than the worst case, since if every feature were distinct in the time series there would be no need to find periodic patterns; the existence of any periodicity in the time series reduces the memory needed. The slow reduction of the set of candidate frequent i-patterns as i grows makes the Apriori pruning of Algorithm 3.1 less attractive. Is there a better way?

The size of the hit set is bounded by min(m, 2^|F1|), where m is the total number of periods in S and F1 is the set of frequent 1-patterns. Using this formula, we can calculate the bound of the maximal buffer size needed in the processing: given the set of frequent 1-patterns F1, the maximal (additional) buffer size needed for registering the counts of all the maximal subpatterns is min(m, 2^|F1|). This property is very useful in practice. For example, if we found 500 frequent 1-patterns when calculating yearly periodic patterns for 100 years, the buffer size needed is at most 100 (the number of periods); on the other hand, if we found 8 frequent 1-patterns for calculating weekly periodic patterns for 100 years, the buffer size needed is at most 2^8 = 256. We can always select the smaller of the two in estimating the maximal buffer size needed in computation.

Before turning to our hit-set based algorithm, we examine the probability distribution of the maximal subpatterns of the candidate max-pattern C_max.

Heuristic 3.1 [Popularity of longer subpatterns]. The probability distribution of the maximal subpatterns of C_max is usually denser for longer subpatterns (i.e., those with L-length closer to that of C_max) than for the shorter ones.

This heuristic can be observed in Example 3.1, where the longer max-subpatterns receive most of the hits. In most cases, the existence of a short max-subpattern indicates the nonexistence of some non-* letters, which reduces the chance for the corresponding non-* patterns to reach high confidence; thus we have the heuristic. The heuristic implies that the number of nodes in the tree data structure of the next section is usually small. It is also useful for efficient buffer management: in order to reduce the overall cost of access, the longer subpatterns should be arranged to be more easily accessible (such as kept in main memory) than the shorter ones.

We now present a main algorithm for mining partial periodic patterns for a given period, based on the discussions above.

Algorithm 3.2 [Max-subpattern hit set]. Find all the partial periodic patterns for a given period p in a time series S, based on the max-subpattern hit set, for a given min_conf:
1. Scan S once; find F1, the set of frequent 1-patterns, and form the candidate max-pattern C_max.
2. Scan S a second time; for each period segment, insert its max-subpattern (with respect to C_max) into the max-subpattern tree, and finally derive the set of frequent patterns from the tree.

In comparison with Algorithm 3.1, Algorithm 3.2 reduces the total number of scans of the time series from p (the length of the period) to 2, and it also uses much less buffer space in the computation in most cases. This can also be seen from the following observation: suppose the hit subpattern for a period segment is a 4-pattern that is not in the hit set yet. We need only one unit of space to register the string and its count of 1. For the Apriori technique, however, the candidate 2-patterns, 3-patterns and 4-patterns would all have to be generated from it, and the count associated with each of them updated. Thus, it is expected that the max-subpattern hit-
set method may have better performance in most cases. We will compare the performance of the two algorithms in Section 5.

3.2 Mining partial periodicity with multiple periods

Mining partial periodicity for a given period covers a good set of applications, since people often like to mine periodic patterns for natural periods such as annually, quarterly, monthly, weekly, daily, or hourly. However, certain patterns may appear at some unexpected periods, such as every 11 years or every 14 hours. It is interesting to provide facilities to mine periodicity for a range of periods. To extend partial periodicity mining from one period to multiple periods, one might wish to extend the idea of Apriori to computing partial periodicity among different periods, that is, to use the patterns of a small period p as filters for candidate patterns of periods of the form k·p for an integer k. This would work if all frequent patterns of period p were frequent patterns of period k·p. Unfortunately, this is not the case: for a suitable time series and confidence threshold, using the frequent patterns of period p as a filter for candidate partial periodic patterns of period k·p can miss partial periodic patterns of the larger period. Given that we cannot extend the Apriori "trick" to multiple periods, one obvious way to mine partial periodic patterns for a range of periods is to repeatedly apply the single-period algorithm for each period in the range.

Algorithm 3.3 [Looping over single-period computation]. Find all the partial periodic patterns for the set of periods in a given range of interest, [p_low, p_high], in the time series S, with the given min_conf, by applying the single-period algorithm once for each period in the range.

Algorithm 3.3 provides an iterative method for mining partial periodicity for multiple periods. However, when the number of periods is large, we still need a good number of scans. An improvement to the above method is to maximally explore the mining of periodicity for multiple periods in the same scan, which leads to the shared mining of periodicity for multiple periods, as illustrated below.

Algorithm 3.4 [Shared mining of multiple periods]. Shared mining of all the partial periodic patterns for the set of periods in a given range of interest, [p_low, p_high], in the time series S, with the given min_conf: perform the two scans of Algorithm 3.2 for all the periods at the same time, accumulating the 1-pattern counts for every period in the first scan and the max-subpattern counts for every period in the second.

Algorithm 3.4 explores shared processing in mining partial periodicity for multiple periods. The advantage of the method is that we only need two scans of the time series for mining partial periodicity for multiple periods. The overhead of the method is that, although it reduces the number of scans to 2, it requires more space in the processing of each scan than the multiple-scan method, because it needs to register the corresponding counts for each period. However, since the shared features will share the space as well (with counts incremented), and there should be many shared features in periodicity search (otherwise, why mine periodicity?), the space required will hardly approach the worst case. Therefore, it should still be an efficient method in many cases for mining partial periodicity with multiple periods.

4. Derivation of all partial patterns

In this section, we examine the implementation considerations of our proposed algorithms. Algorithm 3.1 is an Apriori-like algorithm which can be implemented similarly to other Apriori-like algorithms for mining association rules (e.g. [2]). Algorithm 3.2 forms the basis for all three remaining algorithms and requires new tricks to achieve efficiency, and thus our discussion is focused on its efficient implementation. Algorithm 3.2 consists of two steps: Step 1, scan the time series once and find the frequent 1-pattern set F1; and Step 2, scan the time series one more time, collect the set of max-subpatterns hit in S, and derive the set of frequent patterns. The implementation of Step 1 is straightforward and has been discussed in the presentation of Algorithm 3.1. However, Step 2 is nontrivial and needs a good data structure to facilitate the storage of the set of max-subpatterns hit in S and the derivation of the set of
frequent patterns.

A new data structure, called the max-subpattern tree, is designed to facilitate the registration of the hit count of each max-subpattern and the derivation of the set of frequent patterns, as illustrated in Figure 1. Its design is now outlined. The max-subpattern tree takes the candidate max-pattern C_max as the root node, where each subpattern of C_max with one non-* letter missing is a direct child node of the root. The tree expands recursively, according to the following rules: a node, if containing more than two non-* letters, may have a set of children, each of which is a subpattern of it with one more non-* letter missing. Notice that a node containing only two non-* letters will not have any children, since every frequent 1-pattern is already in F1. Importantly, we do not create a node if neither the node nor any of its descendants containing more than one non-* letter is hit in S. Each node has a "count" field (which registers the number of hits of the current node), a parent link (which is nil for the root), and a set of child links; each child link points to a child and is associated with the corresponding missing letter. A link can be nil when the corresponding child has not been hit. Notice that a non-* position of a max-subpattern in a max-subpattern tree may contain a set of letters, which matches the set of letters at that position in a period segment.
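The operation that feeds the tree during the second scan (reducing a period segment to the maximal subpattern it can match: at every position keep only the features that are frequent there, and write * where nothing survives) can be sketched as follows; the set-of-(position, feature) encoding of F1 is an assumption of this sketch:

```python
def max_subpattern(segment, f1):
    """Reduce one period segment to its max-subpattern.

    `segment` is a list of feature sets, one per in-period position;
    `f1` is the frequent 1-pattern set, encoded as a set of
    (position, feature) pairs.  Positions where no frequent feature
    occurs become the don't-care letter '*'.
    """
    pattern = []
    for pos, features in enumerate(segment):
        kept = frozenset(f for f in features if (pos, f) in f1)
        pattern.append(kept if kept else "*")
    return tuple(pattern)
```

Each distinct tuple produced this way corresponds to one node of the max-subpattern tree, and each segment contributes one count to its node.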
For example, the max-subpattern of a period segment keeps, at each position, those letters of the segment that appear in the candidate max-pattern, and the segment contributes one count to the corresponding node. The update of the max-subpattern tree is performed as follows.

Algorithm 4.1 [Insertion in the max-subpattern tree]. Insert a max-subpattern found during the scan of S into the max-subpattern tree.
Method:
1. Starting from the root of the tree, find the corresponding node by checking the missing non-* letters in order. For example, for a max-subpattern node with two letters missing from the root pattern, the node can be found by (1) following the link of the first missing letter (marked on the edge in Figure 1) and then (2) following the link of the second missing letter, as shown in Figure 1.
2. If the node is found, increase its count by 1. Otherwise, create a new node (with count 1) together with the links for its missing letters.

In general, to insert a subpattern we need both to locate the node and update its count if it is found, or otherwise insert one or several new nodes.

Example 4.1. Let Figure 1 be the current max-subpattern tree. To insert a max-subpattern with two letters missing, we search the tree starting with the root, following the branch of the first missing non-* letter and then the branch of the second. Since the node is located, its count is incremented by 1.

A node may not be directly linked to all of its parents. For example, in Figure 1, one node is linked to only one parent but not the other (this missing link is marked by a dashed line in the figure). In general, the set of reachable ancestors of a node w in a max-subpattern tree is the set of all the nodes in the tree which are proper superpatterns of w. It can be computed as follows: (1) derive the list of missing letters of w with respect to the root, which is roughly the position-wise difference; (2) the set of linked ancestors consists of those patterns whose missing letters form a proper prefix of this list; and (3) the set of not-linked ancestors consists of those patterns whose missing letters form a proper sublist (but not a prefix) of it.
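The counting that this reachable-ancestor machinery supports (the frequency of a pattern is the total count of all hit max-subpatterns that are superpatterns of it, i.e. the node itself plus its reachable ancestors) can be mimicked with a flat hash map; this stand-in deliberately ignores the tree's link structure and is only meant to show the counting rule:

```python
from collections import Counter

def is_subpattern(s, q):
    """s is a subpattern of q: every non-* position of s must be
    covered by the letter set of q at that position.  Positions are
    frozensets of letters or the don't-care letter '*'."""
    return all(x == "*" or (y != "*" and x <= y) for x, y in zip(s, q))

class MaxSubpatternHitSet:
    """Flat stand-in for the max-subpattern tree: register one count
    per distinct max-subpattern hit during the second scan, then
    obtain any pattern's frequency by summing over its superpatterns."""
    def __init__(self):
        self.hits = Counter()

    def insert(self, pattern):          # Algorithm 4.1, minus the tree
        self.hits[pattern] += 1

    def frequency(self, pattern):
        return sum(c for q, c in self.hits.items()
                   if is_subpattern(pattern, q))
```

The real tree exists to make this summation cheap: following prefix links (and the extra not-linked-ancestor traversal) visits exactly the superpatterns that the linear scan above must test one by one.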
Example 4.2. We compute the set of reachable ancestors for a node in a max-subpattern tree. From the node's list of missing non-* letters, the linked ancestors are the patterns missing a proper prefix of the list (including the root, which misses nothing), and the not-linked ancestors are the patterns missing any other proper sublist of the list. In other words, one can follow the links whose marks form sublists of the missing-letter list in an ordered way (to avoid visiting the same node more than once) and collect all the nodes reached.

We illustrate how to derive the frequent i-patterns for each i from the max-subpattern tree.

Example 4.3. Let Figure 1 be the derived max-subpattern tree. The frequency count of each pattern is obtained by summing the count of its node and the counts of its reachable ancestors, and the patterns whose confidence reaches min_conf are output.

From the above example, one can see that there are many frequent i-patterns with small i that can be generated from a max-subpattern tree. In practical applications, people may only be interested in the set of maximal frequent patterns instead of all frequent patterns, where a set of maximal frequent patterns is a subset of the frequent pattern set such that every other frequent pattern is a subpattern of an element in the set. If a user is interested in deriving the set of maximal frequent patterns, the MaxMiner algorithm developed by Bayardo [4] is a good candidate. The success of this algorithm stems from generating new candidates by joining frequent itemsets and looking ahead. However, it still requires scanning the series up to period-many times in the worst case. A mixture of the max-subpattern hit set method and MaxMiner can get rid of this problem and will be more efficient than pure MaxMiner. The details of the new method will be examined in future research.

5. Performance study

In this section we report a performance study which compares the performance of the periodicity mining algorithms proposed in this paper. In particular, we give a performance comparison between the
single-period Apriori algorithm (Algorithm 3.1), or simply Apriori, and the max-subpattern hit-set algorithm (Algorithm 3.2), or simply hit-set, applied to a single period. This comparison indicates that there is a significant gain in efficiency by max-subpattern hit-set over Apriori. Since there is even more to gain when it is applied to multiple periods, it is clear that max-subpattern hit-set is the winner.

The performance study was conducted on a Pentium 166 machine with 64 megabytes of main memory, running Windows/NT. The program is written in Microsoft Visual C++.

5.1 Testing databases

Each test time series is a synthetic time-series database generated using a randomized periodicity data generation algorithm. Potentially frequent 1-patterns are composed from a set of features. The size of each potentially frequent 1-pattern is determined based on a Poisson distribution. These patterns are generated and put into the time series according to an exponential distribution.

Table 1. Parameters of synthetic time series: LENGTH (the length of the time series); a period; MAX-PAT-LENGTH (the maximal length of frequent patterns); and the number of frequent 1-patterns.

The basic parameters used to generate the synthetic databases are listed in Table 1. The parameters LENGTH (the length of the time series) and the period are independently chosen. The parameters MAX-PAT-LENGTH (the maximal length of frequent patterns) and the number of frequent 1-patterns are for a fixed period, and they are controlled by the choice of an appropriate confidence threshold. We found that other parameters, such as the number of features occurring at a fixed position and the number of features in the time series, do not have much impact on the performance results, and thus they are not considered in the tests.

5.2 Performance comparison of the algorithms

Figure 2 shows that there is a significant efficiency gain by max-subpattern hit-set over Apriori. In this figure, the maximal pattern length (the maximal length of frequent partial periodic patterns) grows from 2 to 10. The other parameters are kept
constant. We ran two sets of tests, one with the length of the time series being 100k and the other 500k. As we can see, the running time of max-subpattern hit-set is almost constant in both cases, while that of Apriori grows almost linearly. At the largest MAX-PAT-LENGTH tested, the gain by max-subpattern hit-set over Apriori is about double. We expect this gain to increase for larger MAX-PAT-LENGTH.

Figure 2. Performance gain when MAX-PAT-LENGTH increases (running time in seconds versus MAX-PAT-LENGTH from 2 to 10, for time series of length 100k and 500k).

It is important to note that the gain shown in Figure 2 is measured with everything kept in memory, and with only one period considered. In general this will be unlikely to be the case, and max-subpattern hit-set will perform even better than Apriori, for the following reasons. First, the time series of features may need to be stored on disk, due to factors such as each element containing thousands of features and the time series being long. When the time series is stored on disk, there is a large amount of extra disk I/O associated with Apriori, but not with max-subpattern hit-set, since the latter requires only two scans. Second, even when the time series is not stored on disk, Apriori needs to go over this huge sequence many more times than max-subpattern hit-set; thus max-subpattern hit-set will be far better than Apriori. Third, when there is a range of periods to consider, max-subpattern hit-set can find all frequent patterns in two scans, whereas Apriori requires many more scans, depending on the number of periods and the length of the maximal frequent patterns. Hence max-subpattern hit-set will again be far better than Apriori.

6 Conclusions

We have studied efficient methods for mining partial periodicity in time series databases. Partial periodicity, which associates periodic behavior with only a subset of all the time points, is less restrictive than full periodicity and thus covers a broad class of applications. By exploring several interesting properties
related to partial periodicity, including the Apriori property, the max-subpattern hit-set property, and shared mining of multiple periods, a set of partial periodicity mining algorithms has been proposed, and their relative performance compared. Our study shows that the max-subpattern hit-set method, which needs only two scans of the time series database even for mining multiple periods, offers excellent performance.

Our study has been confined to mining partial periodic patterns in one time series for categorical data with a single level of abstraction. However, the method developed here can be extended to mining multiple-level, multiple-dimensional partial periodicity and to mining partial periodicity with perturbation and evolution.

For mining numerical data, such as stock or power consumption fluctuations, one can examine the distribution of numerical values in the time-series data and discretize them into single- or multiple-level categorical data. For mining multiple-level partial periodicity, one can explore level-shared mining by first mining the periodicity at a high level, and then progressively drilling down with the discovered periodic patterns to see whether they are still periodic at a lower level.

Perturbation may happen from period to period, which may make it difficult to discover partial periodicity in many applications. For mining partial periodicity with perturbation, one method is to slightly enlarge the time slot to be examined; partial periodic patterns with minor perturbation are likely to be caught in the generalized time slot. Another method is to include the features happening in the time slots surrounding the one being analyzed. We can further employ regression techniques to reduce the noise of perturbation.

There are still many issues regarding partial periodicity mining which deserve further study, such as further exploration of shared mining for periodicity with multiple periods, mining periodic association rules based on partial periodicity, and query- and
constraint-based mining of partial periodicity [11]. We are studying these problems and implementing our algorithms for mining partial periodicity in a data mining system, and will report our progress in the future.

References

[1] R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zait. Querying shapes of histories. In Proc. 21st Int. Conf. Very Large Data Bases, pages 502–514, Zurich, Switzerland, Sept. 1995.

[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases, pages 487–499, Santiago, Chile, Sept. 1994.

[3] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 1995 Int. Conf. Data Engineering, pages 3–14, Taipei, Taiwan, March 1995.

[4] R. J. Bayardo. Efficiently mining long patterns from databases. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 85–93, Seattle, Washington, June 1998.

[5] C. Bettini, X. Sean Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32–38, 1998.

[6] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proc. 1995 Int. Conf. Very Large Data Bases, pages 420–431, Zurich, Switzerland, Sept. 1995.

[7] J. Han, W. Gong, and Y. Yin. Mining segment-wise periodic patterns in time-related databases. In Proc. 1998 Int. Conf. on Knowledge Discovery and Data Mining (KDD'98), New York City, NY, Aug. 1998.

[8] H. J. Loether and D. G. McTavish. Descriptive and Inferential Statistics: An Introduction. Allyn and Bacon, 1993.
[9] H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. In Proc. 1998 SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'98), pages 12:1–12:7, Seattle, Washington, June 1998.

[10] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episodes in sequences. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining, pages 210–215, Montreal, Canada, Aug. 1995.

[11] R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 13–24, Seattle, Washington, June 1998.

[12] B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. In Proc. 1998 Int. Conf. Data Engineering (ICDE'98), pages 412–421, Orlando, FL, Feb. 1998.