EFFICIENT WEB USAGE MINING BASED ON FORMAL CONCEPT ANALYSIS
A Client-Side Data Collection Method Based on User Browsing Behavior

Authors: Wu Guofang; Zhang Wanli
Journal: Journal of Luoyang Normal University, 2014(8), pp. 76-79
Affiliations: School of Information Engineering, Shaoxing Vocational & Technical College, Shaoxing, Zhejiang 312000; School of Information Engineering, Suzhou University, Suzhou, Anhui 234000
CLC Classification: TP393

Abstract: User behavior data collection, the basis for discovering and mining user behavior patterns, is the most important yet most difficult part of Web usage mining. Server-based data acquisition is limited because it affects server performance. This paper proposes a client-side data collection method based on user browsing behaviors. When a visitor views a web page served by the library network monitoring system, the browser automatically runs JavaScript code embedded in the page. Through this code, the visitor connects to the data collection center and transfers data to it, so that user behavior data are collected and the website's traffic statistics are updated in real time. The method has no effect on the performance of the monitored web server, and its advantages have been verified in the Shaoxing Library network monitoring system.
Design of a Distributed Intelligent Recommendation System (Tao Jianwen)

// Set the cluster identifier to h
clust = h; F = ∅;
while h <> NULL do
    for all (i ∈ C s.t. h <> i and W_hi > minfreq) do
        remove(C, i); push(F, i);   // remove node i from C and push it onto F
        ret_val[i] = clust;
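The fragment above is truncated, so the following is only a minimal runnable sketch of the loop it describes. It assumes W is a symmetric co-occurrence weight map keyed by node pairs, C is the set of not-yet-clustered nodes, and minfreq is the edge-weight threshold; the breadth-first expansion order is an assumption, not something the fragment states.

from collections import deque

def extract_cluster(h, C, W, min_freq):
    """Assign every node reachable from h through edges whose weight
    exceeds min_freq to the cluster identified by h (a sketch; the
    original pseudocode is truncated)."""
    clust = h
    ret_val = {h: clust}
    frontier = deque([h])          # plays the role of F in the pseudocode
    C.discard(h)
    while frontier:
        node = frontier.popleft()
        # Pull in every remaining node whose edge weight to `node`
        # exceeds the frequency threshold.
        for i in list(C):
            if W.get((node, i), 0) > min_freq:
                C.remove(i)        # remove(C, i)
                frontier.append(i) # push(F, i)
                ret_val[i] = clust
    return ret_val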
Figure 1. MASWSIRS system architecture. The figure shows the resource database, data mining agent, Web server, database, knowledge base and text files, divided into an online processing part and an offline processing part.
The offline part applies WUM (Web usage mining) techniques to the Web server's historical data (log files, user registration information, other text data, etc.) to discover user access patterns and to build a user-oriented online knowledge base for the online part to query [4].
The system identifies the URL u requested by a user and the session the user belongs to; through the session identifier, the system recognizes the identifier of the URL v the user came from, and, based on the features of the current session, it automatically updates the knowledge base and generates recommendations. Throughout a session, the identifiers of the pages visited are stored in a simple map array keyed by the session identifier. The mapping from URL identifiers to URLs is stored in a string data structure. minclustersize is a system threshold defined to reduce graph-traversal time; it can also be specified externally. The system regards only clusters larger than minclustersize as meaningful traversal clusters; clusters with fewer vertices than minclustersize are ignored.
3.1 Description of the MASWSIRS implementation algorithm
When a new user request arrives at the Web server, the knowledge base in the background is updated.
MODES v1.1 User Manual

----------------------- MODES v1.1 User Manual -----------------------

References
Mining Coherent Dense Subgraphs Across Massive Biological Networks for Functional Discovery
Haiyan Hu(1), Xifeng Yan(2), Yu Huang(1), Jiawei Han(2), and Xianghong Jasmine Zhou(1)
(1) Program in Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
(2) Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801

Introduction
MODES stands for Mining Overlapping DENSE Subgraphs. The input graph for MODES is an unweighted graph Ĝ=(V, Ê) where an edge e(u,v) connects vertices u and v (u, v ∈ V). MODES is developed based on HCS (Mining Highly Connected Subgraphs) (Hartuv & Shamir, 2000), with two new features: (1) MODES is efficient in identifying dense subgraphs; and, more importantly, (2) MODES can discover overlapping subgraphs. The underlying algorithm is described in the related paper (see References).

Platforms
MODES was developed and tested on Linux (Debian and Red Hat) using gcc 2.95, and should run on most UNIX systems.

Usage
modes [command-line options] <input-files>

Command-Line Options
-m k (run_mode)
    There are 2 running modes available for MODES. Valid k values are:
    (1) k=0: find all clusters
    (2) k=1: find all clusters containing gene x
-i str (inputfile)
    The path and name of the input overlapped frequency graph file, currently in the matrix format.
-n k (gene_num)
    The number of genes in the input file, i.e. the dimension of the input matrix file.
-o str (outputfile_name_prefix)
    The prefix of the output cluster files. The file containing the first-order clusters is named outputfile_name_prefixFO, and the final file containing the second-order clusters is named outputfile_name_prefixSO.
-g k (min_graph_size)
    The minimum number of nodes required in an output subgraph. Default: 5.
-e k (bottom_edge_freq)
    The minimum edge weight required for an edge to be kept in the input graph. Default: 6.
-d f (density_cutoff_order1)
    The minimum density required for a generated dense subgraph. Default: 0.5.
-s k (maximum node number to apply min-cut)
    The maximum number of nodes in a graph for which the min-cut algorithm is used instead of the normal-cut algorithm. Default: 80.
-c f (connectivity percentage for restoring a condensed cluster)
    The connectivity percentage required for keeping a node when restoring a subgraph from a condensed cluster node. Default: 0.6.
-x k (genex)
    The index of the gene whose clusters are to be discovered when running MODES with run_mode 1.

Note: the maximum gene number MODES v1.1 can handle is 65535.

Input Files
The input graph can be in three formats: matrix format, edge format, and edge list format.
Note: in the examples below, "|" represents a Tab separator and "|_|" represents a space separator.

(a) Matrix format
    The input graph is an integer symmetric matrix with dimension gene number × gene number. The entry at the ith gene row and jth gene column is the number of datasets in which this gene pair is significantly correlated in terms of Jackknife correlation, or any other relation frequency of interest defined by the user. If your input summary graph is in matrix format, you need to specify -n gene_number on the command line. An example of this file is in ~/MODES/data/input/summaryG500.txt.

(b) Edge format
    The input graph is a set of weighted edges, one per line:
    Node I1 | Node J1 | Weight
    Node I2 | Node J2 | Weight
    Node I3 | Node J3 | Weight
    ...
    Since MODES is applied to unweighted graphs, the weight value is not actually used. If your input graph is in edge format, you need to specify -y edge_number on the command line.

(c) Edge list format
    The input graph is a set of edges, one per line:
    Node I1 |_| Node J1
    Node I2 |_| Node J2
    Node I3 |_| Node J3
    ...
    If your input graph is in edge list format, you need to specify -y edge_number on the command line.

Output Files
The clustering results are written to the user-specified output file. The format of each line is:
Cluster index | node number n in this cluster | edge number m in this cluster | gene 1's index | gene 2's index | ... | gene n's index

Examples
modes -m run_mode -i myinputfile -n genenum -o outputfile -g min_graph_size -e bottom_edge_freq -d density_cutoff_order1 -s max_min-cut_node_number -c connect_perc -x genex

An initial try could be the following command:
./modes -m 0 -i ../data/input/g1.matrix -n 10 -o ../data/output/g1.matrix.out4 -g 4 -e 1 -d 0.9

Example for running mode 0:
modes -m 0 -i myinputfile -n genenum -o myoutputfile -g 5 -e 6 -d 0.4 -s 80 -c 0.6
This sets the minimum output graph size to 5, the edge support threshold to >=6, the dense-subgraph density cutoff to 0.4, and the maximum number of nodes for which min-cut is used instead of normal-cut to 80. It generates the dense subgraph file myoutputfile.

Example for running mode 1:
modes -m 1 -i myinputfile -n genenum -o myoutputfileprefix -g 5 -e 6 -d 0.4 -s 80 -c 0.6 -x 21
This uses the same thresholds as above and generates the file of subgraphs containing gene 21, with prefix myoutputfileprefix.

Note
This is MODES version 1.1. Testing hasn't been exhaustive. Feedback and application descriptions are always welcome. Contact ************** for bugs and questions about MODES.

Contacts
Xianghong Jasmine Zhou
Assistant Professor
Program in Molecular and Computational Biology
University of Southern California
Office: DRB291  Phone: 213-740-7055  Fax: 213-740-2437
Email: **************
Web Usage Mining for E-Business Applications

Data collection
Web server
Client (Browser)
Proxy
What's in a typical Web server log …
What Web pages answer my information need?
What Web pages are "good" (better than others)?
What should I buy?
CRM questions example: Why go to a shop ...
Web Usage Mining: Basics and data sources
Definition of Web usage mining:
discovery of meaningful patterns from data generated by client-server transactions on one or more Web servers. Typical data sources:
- automatically generated data stored in server access logs, referrer logs, agent logs, and client-side cookies
- e-commerce and product-oriented user events (e.g., shopping cart changes, ad or product click-throughs, purchases)
- user profiles and/or user ratings
- meta-data, page attributes, page content, site structure
10-web mining

Web Usage Mining (2)
Outlier analysis
Search Engine Log Mining
User query log:  Mon Sep 1 00:00:00 2003  202.206.xxx.xxx  成人高考  cache  6
(fields: query time // user IP // query term // whether the result was a cache hit // index of the viewed page)
User click log:  Mon Sep 1 00:00:00 2003  202.206.xxx.xxx  成人高考  16
(fields: click time // user IP // query term // rank of the clicked page)
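A small illustrative parser for the two record layouts above, assuming whitespace-separated fields in the order just listed; the field names are assumptions made for the sketch, not part of any original log specification.

def parse_query_log(line: str) -> dict:
    """Parse a query-log record: time (5 tokens), user IP, query term,
    cache flag, index of the viewed page (layout assumed from the slide)."""
    tokens = line.split()
    return {
        "time": " ".join(tokens[:5]),   # e.g. "Mon Sep 1 00:00:00 2003"
        "ip": tokens[5],
        "query": tokens[6],
        "cache_hit": tokens[7] == "cache",
        "viewed_rank": int(tokens[8]),
    }

def parse_click_log(line: str) -> dict:
    """Parse a click-log record: time, user IP, query term, rank of the
    clicked page."""
    tokens = line.split()
    return {
        "time": " ".join(tokens[:5]),
        "ip": tokens[5],
        "query": tokens[6],
        "clicked_rank": int(tokens[7]),
    }

print(parse_query_log("Mon Sep 1 00:00:00 2003 202.206.xxx.xxx 成人高考 cache 6"))
print(parse_click_log("Mon Sep 1 00:00:00 2003 202.206.xxx.xxx 成人高考 16"))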
General Access Pattern Tracking (Impersonalized)
English for Management (4th Edition, Vol. 2), Unit 6: Big Data: The Management Revolution

English Course for Management (4th Edition, Volume 2)
Outline
1 Introduction
2 Dimensions of Big Data
3 Five Management Challenges
SAS added two additional dimensions to big data: variability and complexity. Variability refers to the variation in data flow rates. Complexity refers to the number of data sources.
❖ Better Pricing Harnessing big data collected from customer interactions allows firms to price appropriately and reap the rewards.
❖ Cost Reduction Big data analytics leads to better demand forecasts, more efficient routing with visualization and real-time tracking during shipments, and highly optimized distribution network management.
Popular Big Data Techniques (1)
❖ As data become cheaper, the complements to data become more valuable.
Recent and Frequent Informative Pages from Web Logs by Weighted Association Rule Mining (IJMECS-V11-N10-5)
I.J. Modern Education and Computer Science, 2019, 10, 41-46
Published Online October 2019 in MECS (/)
DOI: 10.5815/ijmecs.2019.10.05

Recent and Frequent Informative Pages from Web Logs by Weighted Association Rule Mining

Dr. SP. Malarvizhi
Associate Professor, Sri Vasavi Engineering College, Tadepalligudem, Andhra Pradesh, India.
Email: spmalarvizhi1973@srivasaviengg.ac.in

Received: 17 July 2019; Accepted: 26 August 2019; Published: 08 October 2019

Abstract—Web Usage Mining provides efficient ways of mining web logs to learn users' behavioral patterns. The existing literature discusses mining the frequent pages of web logs by various means. If, instead of mining all frequently visited pages, the criterion for mining frequent pages is based on a weighted setting, then compilation time and storage space are reduced. Hence, in the proposed work, mining is performed by assigning weights to web pages based on two criteria: the time a visitor dwells on a particular page, and how recently the page was accessed. The proposed Weighted Window Tree (WWT) method performs Weighted Association Rule Mining (WARM) to discover the recently accessed frequent pages from web logs on which users have dwelled longer, and hence shows that these pages are more informative. WARM's significance lies in page-weight assignment for targeting essential pages, which has the advantage of mining fewer, higher-quality rules.

Index Terms—Web logs, Web Mining, Page Weight Estimation, Weighted Minimum Support, WARM, WWT.

I. INTRODUCTION

The literature shows that Data Mining is a field that has grown rapidly in recent years, and Association Rule Mining (ARM) plays a vital role in its research [1]. Frequent itemset mining uses ARM algorithms to find associations among items based on user-defined support and confidence [2]. Starting from the foremost frequent itemset mining algorithms, Apriori and FP-growth, many algorithms have since evolved. ARM finds application in business management and marketing.

In WARM, every individual item is assigned a weight based on its importance, so that priority is given to target itemsets rather than to occurrence rate alone [3,4,5,6]. The motive behind WARM is to mine a smaller number of quality rules that are more informative.

Web log mining, also known as web usage mining, a category of web mining, is the most useful way of mining the textual logs of web servers to enhance website services. These logs record the users' interactions with the web [7,8].

This paper provides a WARM method for mining the more informative pages from web logs using a technique called WWT, where weight is assigned based on the time a visitor dwells on a page and on recency of access. The log is divided into n windows and a weight is assigned to each window; the last window is the most recently accessed and carries the highest weight. Along with the window weight, the dwelling time on a page also adds to the priority of a target page to be mined.

The rest of the paper is arranged as follows. Section 2 reviews related work on frequent patterns and WARM for web logs. Section 3 explains the proposed system: how to preprocess the web logs, the weight-assignment techniques, the WWT structure, and WARM. Section 4 presents the experimental evaluation.
Section 5 offers the conclusion.

II. RELATED WORK

An FP-growth algorithm is used to obtain the frequent pages from web logs and to provide worthy information about the users [9]. Web site structure and web server performance can be improved by mining the frequent pages of the web logs to cater to the needs of web users [10]. A measure called w-support uses link-based models to consider the quality of transactions rather than preassigned weights [6].

ARM does not take the weights of items into consideration and assumes all items are equally important, whereas WARM reveals the importance of items to users by assigning a weight value to each item [11]. An efficient method mines weighted association rules from large datasets in a single scan by means of a data structure called a weighted tree [3].

Wei Wang et al. [4] proposed an effective method for WARM in which a numerical value is assigned to every item and mining is performed on a particular weight domain. F. Tao et al. [5] discuss a weighted setting for mining a transactional dataset and how to discover important binary relationships using WARM.

Frequently visited pages that were recently used reveal a user's habits and current interests. Such pages may be mined by WARM techniques and placed in the server's cache to make web access speedy [12]. There, the web log is divided into several windows and a weight is assigned to each window; the latest window, which carries the recently accessed pages, possesses the highest weight.

To deliver frequent pages that capture the interests of users, an enhanced algorithm is proposed by Yiling Yang et al.; the content-link ratio and the inter-linked degree of the page group are the two arguments used in the support calculation [13]. In another approach, the weight of each item is derived using the HITS model and WARM is performed [14]; this model produces a smaller number of quality-oriented rules to improve classification accuracy.

Vinod Kumar et al. [15] contributed a strategy that aims at discovering frequent pagesets from weblogs. It focuses on fuzzy-utility-based webpage-set mining from weblog databases and involves the downward closure property. K. Dharmarajan et al. [16] provided valuable information about users' interests by obtaining frequent access patterns from weblog data using the FP-growth algorithm.

The proposed system performs enhanced WARM for obtaining the frequent informative pages from the textual web logs of servers. The WWT method used in the system involves a tree data structure: the database is divided into several windows, weights are assigned to the windows with the latest window having the highest weight, and then the WWT is constructed.

III. PROPOSED SYSTEM

The existing literature discusses mining frequent pagesets and association rules based on user-defined minimum support and confidence. The proposed system aims at achieving quality rules from web logs using WARM. Weights for the web pages visited by users are assigned based on factors such as frequency of access, the time visitors dwell on the pages, and how recently the pages were used. The Weighted Window Tree proposed here uses WARM to discover such recently used frequent pages from the web log. Fig. 1 shows the work flow of the system.

Fig. 1. Work Flow

A. Web Log Preprocessing

The web log is preprocessed by cleaning to remove duplicates, images, and invalid and irrelevant data. Data preprocessing is a difficult task, but it provides the reliability and data integrity necessary for frequent pattern discovery.
Preprocessing takes about 80% of the total effort spent on mining [17]. The preprocessed web log then consists of attributes such as IP address, session ID, the URLs of the pages visited by the user in that session, the requested date and time of each page, and the time dwelled on each page.

The arrangement of data in the web log is like a relational database model. The details of the pages requested by the users are stored consecutively in the web log, and for each requested page all the attributes are updated. The log stored during a particular period, from which the frequent informative pages are to be mined, forms the back-end relational database. Table 1 shows a part of the web log of an educational institution (EDI); every requested page is recorded as a separate entry with a distinct session ID. In the table, session ID 3256789 under IP address 71.82.20.69 has recorded three pages.

Table 1. Partial web log of the EDI data set

B. Windows and Weights

Frequent pages may have occurred regularly in the earlier part of a web log of a particular duration rather than in the later part. In order to distinguish the most recently from the least recently accessed frequent web pages, the WWT arrangement is introduced.

The log for a particular duration is first divided into N windows, preferably of equal size. While splitting windows, care should be taken that pages of the same session ID are not divided across two different windows. To ensure this, the division is based on the total number of sessions in the log divided by the number of windows needed. This gives an equal number of sessions S per window, except sometimes for the latest window, which carries the remainder of the division.

Windows are given index numbers from bottom to top as 1 to N. The last window at the bottom of the log, consisting of the most recently accessed web pages, has index 1 and is given the highest weight (hWeight), lying between 0 and 1 (0 < hWeight < 1). Window 2, just above window 1 from the bottom, has a weight less than that of window 1. Equation (1) provides the weight of each window. It follows that if window index i < window index j then weight_i > weight_j, i.e. as the window index rises, the window weight falls [12]. The sessions of each window are renumbered separately from 1 to S, as shown in Table 3.

    weight_i = hWeight^i                                           (1)

where i is the window index from 1 to N. After assigning weights to individual windows, each individual page URL in a window is assigned a weight equal to the weight of the window. Hence a particular URL lying in the last window gains more weight than the same URL lying in other windows, which reflects the importance of recently accessed pages.

The same URL appearing in different windows may have different page dwelling times. The total weight of each individual URL is then found, based on the occurrence rate of the page, using equation (2), which sums the products of the window weight and the dwelling time T of the particular URL over all windows and sessions in which it occurs. If the URL lies in the jth session of the ith window, then

    W_URL = sum_{i=1}^{N} sum_{j=1}^{S} (weight_i * T_ij)          (2)

where N is the total number of windows, S is the total number of sessions in the ith window, weight_i is the weight of the ith window from equation (1), which assigns the same weight to all URLs in that window irrespective of session, and T_ij is the total dwelling time of the page in the jth session of the ith window.
The number of (weight_i * T_ij) products summed is the number of occurrences of the particular page URL. To calculate the average page weight, or pageset (set of pages) weight, equation (3) is used:

    W_ps = (1 / n_ps) * sum_{k=1}^{m} W_URL_k                      (3)

where W_ps is the weight of a pageset ps, namely the ratio of the sum of the total weights of the individual pages belonging to that pageset to the number of visitors of that pageset n_ps (the number of session IDs in which the pageset occurs), and m is the number of pages in the pageset. W_URL_k is found using equation (2) over those records of ps having non-zero entries for all pages of the pageset. For a 1-pageset, n_ps = n_p and W_ps = W_p, i.e. the individual page weight, and hence m = 1.

C. Weighted Window Tree (WWT) and WARM

The WWT method consists of the following steps: (i) constructing the WWT; (ii) compressing the tree; (iii) mining recently and frequently visited pagesets; (iv) WARM.

Table 2 shows a partial sample data log with the dwelling times of the pages for a few session IDs. The log is divided into windows and window weights are allotted. A relational database, as shown in Table 3, is constructed for the data in Table 2, which aids in constructing the tree. Pages are numbered p1, p2, p3, ... starting from the last record of the last window (index 1) of the web log towards the top window. When a URL is repeated, it is given the same page number.

Table 2. Partial sample data log

Table 3. Sample relational database

The entries under the pages of the sample relation are the product of the dwelling time of the particular page by the particular visitor and the weight of the window in which the page lies. The lowest window is the latest window and is given index 1. The weight of the first window is assumed to be 0.5 (between 0 and 1); using equation (1), the weight of the 2nd window is calculated as 0.25.

1) Constructing the Weighted Window Tree

With one scan of the database, with its divided windows and weights, the Weighted Window Tree shown in Fig. 2 can be constructed. The ellipses of the tree are called page nodes and consist of page URLs; the rectangles are called SID (session ID) nodes and consist of an SID and a weight. There are two pointers for every page node: one directed towards the next page node, and the other towards the successive SID nodes containing that page URL [3].

Fig. 2. WWT for the sample database

Once the tree is constructed, it is compressed to prepare it for mining with the user-defined weighted minimum support (W_ms), to obtain the recent and frequent informative pages.

2) Tree Compression

Frequent pagesets are mined based on the user-defined weighted minimum support W_ms. Let W_ms = 1.5 for the sample relation. If an individual page weight W_p >= W_ms, the page is said to be frequent. By equations (2) and (3) the average page weights of the individual pages are calculated as shown below:

W_p1 = (2 + 1 + 3)/3 = 2
W_p2 = (0.75 + 0.5 + 1 + 3.5)/4 = 1.44
W_p3 = (1 + 1 + 2.5)/3 = 1.5
W_p4 = (7.5)/1 = 7.5

Pages p1, p3 and p4 alone are frequent 1-pages, as their weights are no less than W_ms. Hence the page node p2 and its attached SID nodes are removed from the tree, since it is infrequent, and the pointer is made to point directly from p1 to p3. The tree obtained after removing the infrequent page nodes and their relevant branches of weight nodes is called the compressed tree.
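A small sketch of the weight computations in equations (1)-(3), using hypothetical data shaped like Table 3: each entry is already the product weight_i * T_ij described above. The numbers are chosen to reproduce the paper's worked values for W_p1 ... W_p4; the exact Table 2/3 contents are not fully recoverable from the text, so treat the data (in particular p2's sessions) as illustrative.

# Weighted entries: table[page][session_id] = weight_i * T_ij for that
# page in that session (non-zero entries only). Illustrative data
# matching the paper's worked example.
table = {
    "p1": {1: 2.0, 3: 1.0, 4: 3.0},
    "p2": {1: 0.75, 2: 0.5, 3: 1.0, 5: 3.5},
    "p3": {2: 1.0, 3: 1.0, 4: 2.5},
    "p4": {5: 7.5},
}

def window_weight(i, h_weight=0.5):
    """Equation (1): weight_i = hWeight ** i."""
    return h_weight ** i

def pageset_weight(pages):
    """Equations (2)+(3): average weight of a pageset, computed over the
    sessions in which *all* pages of the pageset occur."""
    common = set.intersection(*(set(table[p]) for p in pages))
    if not common:
        return 0.0
    total = sum(table[p][s] for p in pages for s in common)
    return total / len(common)          # n_ps = number of common sessions

for p in table:
    print(p, round(pageset_weight([p]), 2))    # p1 2.0, p2 1.44, p3 1.5, p4 7.5
print("p1,p3:", pageset_weight(["p1", "p3"]))  # (1+3+1+2.5)/2 = 3.75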
3) Mining Recently and Frequently Visited Pagesets

The WWT method covers both the users' habits and interests. Habit is what is regularly used (frequently used pages) and interest is what changes over time (recently used pages). A high W_ms leads to mining a smaller number of more recently used frequent pages, which reflect the users' timely interests; a low W_ms leads to mining a larger number of less recently used frequent pages, which reflect the users' regular habits. The more recently used pages can be kept in the server's cache to speed up web access.

In the literature, WARM does not satisfy the downward closure property: it is not necessary that all subsets of a frequent pageset be frequent, because the notion of a frequent pageset is based on weighted support. The proposed WWT method overrides this. For an m-pageset to be frequent, all the individual pages in the non-zero records of the m-pageset have to be individually frequent. All the frequent 1-pages are put in a set named {F1} and its non-empty subsets are obtained, excluding the 1-element subsets and the full set. For our example, {F1} = {p1, p3, p4} and the relevant subsets are {p1, p3}, {p3, p4} and {p1, p4}. The pageset weight W_ps of each subset obtained from {F1} is calculated using equation (3) and checked for frequency. For example, for ps = {p1, p3}, W_p1p3 = (1/n_p1p3)(W_p1 + W_p3), where n_p1p3 is the number of visitors who have visited both p1 and p3. It is found by intersecting the session IDs of the visitors of p1 and p3: {1,3,4} ∩ {2,3,4} gives {3,4}, i.e. only the records with non-zero entries for all pages of the pageset {p1, p3}. Two visitors (the 3rd and 4th) have visited both p1 and p3, out of a total of five visitors. Therefore W_p1p3 = (1 + 1 + 3 + 2.5)/2 = 3.75 > 1.5. The individual weights of p1 and p3 over the 3rd and 4th visitors' records are W_p1 = (1 + 3)/2 = 2 > 1.5 and W_p3 = (1 + 2.5)/2 = 1.75 > 1.5. Hence the pageset {p1, p3} is a frequent pageset according to the downward closure property. Similarly, all other subsets are checked.

4) WARM

Weighted association rules are the strong rules obtained from the frequent pagesets by prescribing a user-defined minimum weighted confidence C_min. Let x be one of the frequent pagesets and s a subset of x; if W_x / W_s >= C_min, then a rule of the form s => x - s is obtained. Consider for example the frequent pageset x = {p1, p3} and its subset s = {p1}. If C_min = 1.5, then since W_p1p3 / W_p1 = 3.75/2 = 1.875 > 1.5, p1 => p3 is a strong association rule.
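Continuing the sketch above, a hedged check of the WARM step: for each frequent pageset x and subset s, emit s => x - s when W_x / W_s >= C_min. The helper reuses pageset_weight and the illustrative table from the previous sketch; the thresholds follow the paper's worked example (W_ms = 1.5, C_min = 1.5), and only 2-pagesets are considered for brevity.

from itertools import combinations

W_MS, C_MIN = 1.5, 1.5   # thresholds from the paper's worked example

def warm_rules(pages):
    """Sketch of the WARM step: from the frequent 1-pages, form candidate
    2-pagesets and emit strong rules s => x - s with confidence
    W_x / W_s >= C_MIN (reuses pageset_weight from the sketch above)."""
    f1 = [p for p in pages if pageset_weight([p]) >= W_MS]
    rules = []
    for x in combinations(f1, 2):
        w_x = pageset_weight(list(x))
        if w_x < W_MS:
            continue                        # pageset not frequent
        for s in x:
            conf = w_x / pageset_weight([s])
            if conf >= C_MIN:
                other = x[0] if s == x[1] else x[1]
                rules.append((s, other, round(conf, 3)))
    return rules

print(warm_rules(table))   # includes ('p1', 'p3', 1.875), i.e. p1 => p3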
IV. EXPERIMENTAL EVALUATION

To evaluate the performance of the proposed method, experiments were performed on two different datasets: an EDI (educational institution) dataset, and the msnbc dataset from Internet Information Server (IIS) logs, available in the UCI machine learning repository. Comparison is made between the Weighted Tree (WT) [3] and the proposed WWT methods in terms of speed and space for various W_ms. Both datasets were extracted for one full day. Speed is measured as CPU execution time by including stubs in the program. Experiments were performed on an Intel Core i5 3.2 GHz machine with 2 GB RAM and a 500 GB hard disk running Windows XP. The WWT algorithm is implemented in Java.

The proposed WWT method shows enhanced performance, with less execution time and less space than the Weighted Tree method; the experimental results are presented below, in Fig. 3 to Fig. 6, for both datasets. The method produces fewer frequent pagesets and hence takes less time (speed) and space.

Comparisons of the execution time (speed) of the WT and WWT methods are shown in Fig. 3 and Fig. 4, and comparisons of the number of pagesets (space) generated by both methods are given in Fig. 5 and Fig. 6. As W_ms increases, the execution time and space reduce more for WWT than for WT. WWT also scales well as the input data set size increases.

Fig. 3. Comparison of execution time for WT and WWT for the msnbc dataset
Fig. 4. Comparison of execution time for WT and WWT for the EDI dataset
Fig. 5. Comparison of number of pagesets generated by WT and WWT for the msnbc dataset
Fig. 6. Comparison of number of pagesets generated by WT and WWT for the EDI dataset

Fig. 3 implies that WWT has its execution time reduced by nearly 50% compared with WT for the msnbc dataset. Fig. 4 implies that the execution time of WWT for the EDI dataset is reduced to a quarter of that of WT. This clearly reflects that the removal of insignificant pagesets by WWT reduces the execution time. Fig. 5 and Fig. 6 reveal that the number of frequent pagesets mined using WWT reduces considerably because of the window-weighting technique.

V. CONCLUSION

A method for mining recent and frequent informative pages from web logs based on window weights and the dwelling time of pages has been discussed. It uses the Weighted Window Tree arrangement for Weighted Association Rule Mining and is seen, from the experimental evaluation, to be more efficient than the Weighted Tree in terms of speed and space. The system covers less recently used frequent pages from the earlier stages of the log and more recently used frequent pages from the later stages.

The method finds its main application in mining the web logs of educational institutions to observe the surfing behaviour of students. By mining the frequent pages using WWT, the most significant pages and websites can be identified, and useful informative pages can be accounted for in future decisions. The technique can be used to mine any organizational weblog to learn the employees' browsing behaviour. The limitation of WWT lies in arriving at a better weight-allotment scheme that can further reduce the execution time and the number of pages mined.

REFERENCES

[1] Qiankun Zhao and Sourav S. Bhowmick, "Association Rule Mining: A Survey", Technical Report, CAIS, Nanyang Technological University, Singapore, No. 2003116, 2003.
[2] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", San Francisco, CA: Morgan Kaufmann Publishers, 2004.
[3] Preetham Kumar and Ananthanarayana V. S., "Discovery of Weighted Association Rules Mining", 2010 IEEE, volume 5, pp. 718-722.
[4] W. Wang, J. Yang and P. Yu, "Efficient Mining of Weighted Association Rules (WAR)", Proc. of the ACM SIGKDD Conf. on Knowledge Discovery and Data Mining, pp. 270-274, 2000.
[5] F. Tao, F. Murtagh and M. Farid, "Weighted Association Rule Mining Using Weighted Support and Significance Framework", SIGKDD 2003.
[6] Ke Sun and Fengshan Bai, "Mining Weighted Association Rules without Preassigned Weights", IEEE Transactions on Knowledge and Data Engineering, Vol. 20, No. 4, pp. 489-495, April 2008.
[7] Hengshan Wang, Cheng Yang and Hua Zeng, "Design and Implementation of a Web Usage Mining Model Based on FP-growth and PrefixSpan", Communications of the IIMA, Volume 6, Issue 2, pp. 71-86, 2006.
[8] V. Chitraa and Antony Selvadoss Davamani, "A Survey on Preprocessing Methods for Web Usage Data", (IJCSIS) International Journal of Computer Science and Information Security, Vol. 7, No. 3, pp. 78-83, 2010.
[9] Rahul Mishra and Abha Choubey, "Discovery of Frequent Patterns from Web Log Data by Using FP-Growth Algorithm for Web Usage Mining", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 2, Issue 9, pp. 311-318, Sep 2012.
[10] Renáta Iváncsy and István Vajk, "Frequent Pattern Mining in Web Log Data", Acta Polytechnica Hungarica, Vol. 3, No. 1, pp. 77-90, 2006.
[11] P. Velvadivu and K. Duraisamy, "An Optimized Weighted Association Rule Mining on Dynamic Content", IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 2, No. 5, pp. 16-19, March 2010.
[12] Abhinav Srivastava, Abhijit Bhosale and Shamik Sural, "Speeding Up Web Access Using Weighted Association Rules", S.K. Pal et al. (Eds.): PReMI 2005, LNCS 3776, pp. 660-665, Springer-Verlag Berlin Heidelberg, 2005.
[13] Yiling Yang, Xudong Guan and Jinyuan You, "Enhanced Algorithm for Mining the Frequently Visited Page Groups", Shanghai Jiaotong University, China.
[14] S. P. Syed Ibrahim and K. R. Chandran, "Compact Weighted Class Association Rule Mining Using Information Gain", International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol. 1, No. 6, November 2011.
[15] Vinod Kumar and Ramjeevan Singh Thakur, "High Fuzzy Utility Strategy Based Webpage Sets Mining from Weblog Database", International Journal of Intelligent Engineering and Systems, Vol. 11, No. 1, pp. 191-200, 2018.
[16] K. Dharmarajan and M. A. Dorairangaswamy, "Web Usage Mining: Improve the User Navigation Pattern Using FP-Growth Algorithm", Elysium Journal of Engineering Research and Management, Vol. 3, Issue 4, August 2016.
[17] Liu Kewen, "Analysis of Preprocessing Methods for Web Usage Data", 2012 International Conference on Measurement, Information and Control (MIC), School of Computer and Information Engineering, Harbin University of Commerce, China.

Author's Profile

Dr. SP. Malarvizhi received the BE degree in Electrical and Electronics Engineering from Annamalai University, India in 1994, the ME degree in Computer Science and Engineering from Anna University of Technology, Coimbatore, India in 2009, and the Ph.D. degree in Data Mining from Anna University, Chennai in 2016. She has been working as an Associate Professor in the CSE department at Sri Vasavi Engineering College, Andhra Pradesh since 2017. She has participated in and published papers at many national and international conferences and has also published 7 papers in national and international journals. Her research interests are Data Mining, Big Data and Machine Learning.

How to cite this paper: SP. Malarvizhi, "Recent and Frequent Informative Pages from Web Logs by Weighted Association Rule Mining", International Journal of Modern Education and Computer Science (IJMECS), Vol. 11, No. 10, pp. 41-46, 2019. DOI: 10.5815/ijmecs.2019.10.05
Research on the Application of Web Usage Mining in Distance Education
... rules to help identify users: different IP addresses represent different users; when the IP address is the same and the operating system or browser used is also the same ...

Since the 1990s, with the growing popularity of the Internet, the network has ...

... combined with the server log files; the process can be divided into the following four stages: data collection, data preprocessing, pattern discovery and pattern analysis.

2.1 Data Collection

It is against this historical background that research on, and application of, modern distance education based on Web mining technology has come to play an irreplaceable role in supporting daily teaching; yet a common problem remains: the lack of personalized services. Schools only ...

... the quantity and quality of the collected data directly determine the success of the subsequent stages ...

... incomplete, redundant and erroneous data, so if the raw Web log files are not corrected, the quality of the pattern discovery and pattern analysis stages will suffer. The data are extracted, decomposed, merged and modified, mainly in the following stages: data cleaning, user identification, session identification, path completion and event identification. The purpose of data cleaning is to delete irrelevant data as required, merge certain records, and delete ... records.

... the process of discovering and abstracting information from Web document servers. By analyzing the large amount of information in Web documents, users' browsing behaviors can be recorded and tracked, for example: which web page is the most popular, and which kind of browser (e.g., IE or Maxthon) and operating system (Windows, Linux, etc.) users prefer, in order to understand students' learning ...
Design and Implementation of the DPFP-Stream Algorithm for Streaming Data
Authors: Sun Dujing; Li Lingjuan; Ma Ke
Journal: Computer Technology and Development, 2017, 27(7), pp. 29-33
Affiliation: College of Computer, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003
CLC Classification: TP301.6

Abstract: Finding frequent patterns in a continuous stream of transactions is critical for many applications such as retail market data analysis, network monitoring, web usage mining and stock market prediction. Even though numerous frequent pattern mining algorithms have been developed over the past decade, new solutions for handling stream data are still required, because a data stream is a continuous, unbounded and ordered sequence of data elements generated at a rapid rate. As a result, the knowledge embedded in a data stream is likely to change as time goes by. Therefore, extracting frequent patterns at multiple time granularities and monitoring their gradual changes can enhance the analysis of online data streams. Based on the efficient FP-tree structure and the ideas of tilted-time windows and MapReduce, the DPFP-Stream algorithm is proposed and implemented on Storm; it uses Kafka as the data source and stores intermediate results in the in-memory database Redis. Extensive experiments show that the algorithm is highly efficient and stable when finding recent frequent patterns in a high-speed data stream. Applied in real-time computing, it can not only process high-speed streams but also monitor the change of frequent patterns across tilted-time windows.
Six English Reading Notes
英语读书笔记6篇读书笔记,是指人们在阅读书籍或文章时,遇到值得记录的东西和自己的心得、体会,随时随地把它写下来的一种文体。
以下是英语读书笔记,欢迎阅读。
English Reading Note 1: This paper is a review of web usage mining. It introduces web usage mining in detail. Although it is a little old, having been published in XX, its contents are still very useful today. It is organized according to the sequence of web usage mining, and its six main parts are: an introduction that explains what web usage mining is; the sources and abstraction of web data; the three steps of web usage mining; a taxonomy and project survey; a WebSIFT overview; and privacy issues.

English Reading Note 2: Peter and Susan arrived at their hotel in Lea-on-Sea. They visit a beautiful island every year, but this time they meet a man who pretends to be Peter: he has the same face as Peter thanks to a mask. He is Stephen Griggs. He kills Susan and takes off his mask, and then gives Peter the gun! In this way, Peter is caught by the police.

In my opinion, the success of this paper is due to three reasons. The first reason is the profound computer knowledge owned by the authors. The third and fourth parts are the most important: they contain a list of existing projects on web usage mining, which I have seen many times in other papers, but this paper is the one that created the list. Besides, it has been cited more than twenty times. As we all know, the more a paper is cited, the more important it is, so I consider this paper to be an important and successful one in this area.

English Reading Note 3: "To Wang Lun" is written by Li Bai, who among other poets stands out in the halls of glory. One day, Li Bai goes abroad. He is about to sail when there is stamping and singing on the shore. Here comes Wang Lun to see him off, who is Li Bai's best friend. Li Bai is very excited to see his best friend at this moment of leaving, but he is sad as well, so he cannot say a simple sentence. He knows that words cannot express their friendship. Although the Peach Blossom Pool is one thousand feet deep, it cannot match Wang Lun's love for him.

English Reading Note 4: Sense and Sensibility was the first novel Jane Austen published. Though she initially called it Elinor and Marianne, Austen jettisoned both the title and the epistolary mode in which it was originally written, but kept the essential theme: the necessity of finding a workable middle ground between passion and reason. The story revolves around the Dashwood sisters, Elinor and Marianne. Whereas the former is a sensible, rational creature, her younger sister is wildly romantic, a characteristic that offers Austen plenty of scope for both satire and compassion. Commenting on Edward Ferrars, a potential suitor for Elinor's hand, Marianne admits that while she "loves him tenderly," she finds him disappointing as a possible lover for her sister.

English Reading Note 5: During this year's Spring Festival, I read "The Old Man and the Sea", written by the famous American writer Hemingway. I admire the old fisherman in the novel very much; he has made me understand that a person must have an unremitting spirit in order to succeed. The novel describes an old man of nearly sixty who, fishing alone, catches a big fish but cannot pull it in. The old fisherman contends with the fish for several days and finds it is a marlin several times larger than his own boat; while knowing it is difficult to win, he still does not give up.

English Reading Note 6: The book I read this week is entitled The Red and the Black, which is undoubtedly among the best works of literature. As a person who does not have much knowledge of literature, any comments I make on this work seem rather naive and even somewhat reckless, but this is the only book I have had access to recently. After much deliberation, I would like to write down the following words.

The book mainly tells the story of a man named Julien, who came from a carpenter's family and had a shrewd father and two elder brothers characterized by huge size and rudeness. Though he was not very athletic, he was gifted with his memory: he could recite the whole contents of the Bible, which made him admired by the people around him and won him a job as a family tutor. And this is where our story begins.

Under his gorgeous face there exists a relentless lust for fortune, power and status, all of which derive from his pursuit of happiness, dignity and freedom, and which finally contributed to his downfall. He worshipped Napoleon the most, and he was willing to die a thousand times to win his success. He desired love, yet sometimes he used it as a tool to obtain status and power. However, he never gave up showing his sympathy for the poor and his contempt for the nobility. His pureness blended with ambition, and he felt self-abased and conceited at the same time. As a young man, he ignored what was really important to him: being loved. And he only realized this when death was around the corner.

The story is a tragedy, for he did not get what he wanted in the end, and he sacrificed too much for his contemporary fame and fortune. In the last few hours of his life, the happy times he had spent with his loved ones gave him the courage to face the death penalty with calm. When he realized that he no longer had to tolerate a world filled with fakeness or put a mask on his face, he died with ease.

Julien stands for the group of people who come from the lower classes of society and hope to achieve success through their own efforts. These people usually lack the conditions necessary to become successful, but they will by no means stop trying or let God tell them what their lives should be like; instead, they are willing to sacrifice anything to realize their dreams and earn their dignity. For these people, no matter what the consequences of their struggle may be, I shall forever pay my tribute to them. To some extent, they are the makers of the world.
EFFICIENT WEB USAGE MINING BASED ON FORMAL CONCEPT ANALYSIS

D. VASUMATHI (1) & Dr. A GOVARDHAN (2)
(1) Associate Professor, Department of Computer Science and Engineering, JNTU College of Engineering, Kukatpally, Hyderabad-500 085, Andhra Pradesh, India.
(2) Principal, JNTU College of Engineering, Jagityala, Karimnagar (Dist), Andhra Pradesh, India.
E-mail: vasukumar_devara@yahoo.co.in

ABSTRACT

Web usage mining attempts to discover useful knowledge from the secondary data obtained from the interactions of users with the web. It has become very critical for effective web site management, creating adaptive web sites, business and support services, personalization, and so on. Web usage mining aims to discover interesting user access patterns from web logs. Formal Based Concept Analysis (FBCA) is an effective data analysis technique based on ordered lattice theory. Formal concepts can be generated and interpreted from the concept lattice using FBCA. FBCA has been applied to a wide range of domains including conceptual clustering, information retrieval and knowledge discovery.

In this paper, we propose a novel FBCA approach for web usage mining. In our approach, the FBCA technique is applied to mine association rules from a web usage lattice constructed from web logs. The discovered knowledge (association rules) can then be used for practical web applications such as web recommendation and personalization. We apply the FBCA-mined association rules to web recommendation and compare their performance with that of classical Apriori-mined rules. The results indicate that the proposed FBCA approach not only generates far fewer rules than Apriori-based algorithms, but the generated rules are also of comparable quality with respect to three objective performance measures.

Key words: Web intelligence, Knowledge discovery, Data mining, Association rules, Formal Based Concept Analysis, Web usage mining, Knowledge-based systems.

1. INTRODUCTION

The WWW continues to grow at an amazing rate as an information gateway and as a medium for conducting business. Web mining is the extraction of interesting and useful knowledge and implicit information from artifacts or activity related to the WWW. Based on several research studies, we can broadly classify web mining into three domains: content, structure and usage mining. Web content mining is the process of extracting knowledge from the content of the actual web documents (text content, multimedia, etc.). Web structure mining targets useful knowledge in the web structure, hyperlink references, and so on. Web usage mining attempts to discover useful knowledge from the secondary data obtained from the interactions of users with the web. Web usage mining has become very critical for effective web site management, creating adaptive web sites, business and support services, personalization and network traffic flow analysis.

Web usage mining, also known as web log mining, aims to discover interesting and frequent user access patterns from web browsing data stored in the log files of web/proxy servers or browsers. The mined patterns can facilitate web recommendations, adaptive web sites, and personalized web search and surfing. Various data mining techniques such as statistical analysis, association rules, clustering, classification and sequential pattern mining have been used for mining web usage logs.
Statistical techniques are the most prevalent; typical extracted statistics include the most frequently accessed pages, average page viewing time, and average navigational path length. Association rule mining can be used to find related pages that are most often accessed together in a single session. Clustering is commonly used to group users with similar browsing preferences, or web pages with semantically related content. Classification is similar to clustering, except that a new user (page) is classified into a pre-existing class/category of users (pages) based on profile (content). Sequential pattern mining involves identifying access sequences that are frequently exhibited across different users. All of the aforementioned techniques have been successfully deployed in various web-mining applications such as web recommendation systems, whereby web pages likely to be visited by users in the near future are recommended or pre-fetched.

Formal Based Concept Analysis is a data analysis technique based on ordered lattice theory. It defines a formal context in the form of a concept lattice, which is a conceptual hierarchical structure representing relationships and attributes in a particular domain. Formal concepts can then be generated and interpreted from the concept lattice using FBCA. FBCA has been applied to a wide range of domains including conceptual clustering, information retrieval, and knowledge discovery [15,16].

In this paper, we propose a novel web usage mining approach using FBCA. In particular, we are interested in mining FBCA-based association rules from web logs, as FBCA can be used to compute frequent patterns efficiently and accurately. In fact, the number of FBCA-generated association rules is far fewer than that generated using classical Apriori-based algorithms. Moreover, the FBCA rules are comparable in quality. A side benefit is that association rules can be visualized directly from the concept lattice. These features will be illustrated and elaborated in more detail in the section on performance evaluation [17].

2 FORMAL BASED CONCEPT ANALYSIS

Formal Based Concept Analysis (FBCA) [1] is a formal data analysis technique based on ordered lattice theory. It defines a formal context, which is a structure that represents relationships and attributes in a domain. From the formal context, FBCA is then able to generate formal concepts and interpret the corresponding concept lattice that represents the concept hierarchy. FBCA has been applied to mining association rules and to the conceptual clustering of data in knowledge discovery in databases [2]. In addition, FBCA has also been applied to information retrieval [3,4]. This section discusses the basic concepts of Formal Based Concept Analysis (FBCA) [5].

Definition 2.1
A formal context is a triple K = (G, M, I) where G is a set of objects, M is a set of attributes, and I ⊆ G × M is a binary relation between G and M. An object g in a relation I with attribute m is denoted as gIm or (g, m) ∈ I and read as "the object g has the attribute m".

Table 1. A cross table of a formal context (the placement of D3's cross is a reconstruction consistent with the lattice described below).

       | Data Mining | Clustering | Fuzzy Logic
    D1 |      X      |            |
    D2 |      X      |     X      |
    D3 |             |            |      X

A formal context can be represented by a cross table with rows labeled by the object names and columns labeled by the attribute names. A cross in row g and column m means that the object g has the attribute m. In Table 1, the formal context has three objects representing three documents, namely D1, D2 and D3, and three attributes representing three research topics, namely "Data Mining", "Clustering" and "Fuzzy Logic". The symbol "X" indicates that the object has the corresponding attribute. For example, document D1 has the attribute "Data Mining", which implies that the document D1 belongs to the research topic "Data Mining".

Definition 2.2
Given a formal context (G, M, I), we define A' = {m ∈ M | ∀ g ∈ A: (g, m) ∈ I} for a set A ⊆ G (the set of attributes common to the objects in A) and B' = {g ∈ G | ∀ m ∈ B: (g, m) ∈ I}
The symbol “X” is used to indicate that the object has the corresponding attribute.For example, document D1 has the attribute “Data Mining”, which implies that the document D1 belongs to the research topic “Data Mining”.Definition 2.2Given a formal context (G, M, I), we define A’ = {m ∈ M | ∀ g ∈ A: (g,m) ∈Ι} for a set A⊆ G (the set of attributes common to the objects in A) and B’ = {g ∈G | ∀ m ∈ B: (g, m) ∈ I }DataMiningClusteringFuzzyLogicD1 XD2 X XD3 Xfor a set B ⊆ M (the set of objects which have all attributes in B ).Definition 2.3A formal concept of a formal context (G , M , I ) is a pair (A ,B ) with A ⊆ G , B ⊆ M , A’ = B and B’ = A . The sets A and B are called the extent and intent of the formal concept (A , B ) respectively.Definition 2.4 Let (A 1, B 1) and (A 2, B 2) be two formal concepts of a formal context (G , M , I ), (A 1, B 1) is called the sub concept of (A 2, B 2) and denoted as (A 1, B 1) £ (A 2, B 2), if and only if A 1 ⊆ A 2 (⇔ B 2 ⊆ B 1). Equivalently, (A 2, B 2) is called the super concept of (A 1,B 1). The relation £ is called the hierarchical order (or simply order ) of the formal concepts. The set of all formal concepts of (G , M , I ) ordered in this way is called the concept lattice of the formal context (G , M , I ) which is denoted as ℜ(G,M,I).The concept lattice of the formal context in Table 1 is given in Figure 1. This concept lattice generated from the given formal context is a complete lattice, with one concept as its lower bound called minimum and one concept as its upper bound called super mum . Figure 1 also shows the intent (given at the left of each node) and the extent (given at the right of each node) of every concept. For example, the intent andextent of the concept at the top left node are {Data Mining} and {D1, D2} respectively.“Figure 1” The concept lattice of the formal context in Table 1.“Figure 2” Overview of the FBCA-based web usage mining approach.3 FBCA-BASED WEB USAGE MININGFigure 2 gives an overview of the proposed FBCA-based web usage mining technique for association access patterns. The proposed approach consists of the following steps:(1) Preprocessing;(2) Constructing Web Usage Context;(3) Constructing Web Usage Lattice; and(4) Mining Association Access Pattern Rules from the Web Usage Lattice.3.1 PreprocessingThe preprocessing step aims to preprocess the original web logs to identify all web access sessions. For web server logs, all users’ access activities of a website are recorded by the Web server of the website. Each user access record contains the client IP address, request time, requested URL, HTTP status code, etc. Users are treated as anonymous since the IP addresses are not mapped to any user-identifiable profile database.As discussed before, web logs can be regarded as a collection of sequences of access events from one user or session in timestamp ascending order. Preprocessing tasks [6] including data cleaning, user identification, and session identification can be applied to the original webLog files to obtain all web access sessions.3.2 Constructing Web Usage ContextThis step will construct the web usage context (the formal context of web logs) based on all web access sessions obtained from the preprocessing step. Let M be a set of unique access events, which represents web resources accessed by users, i.e. web pages, URLs, or topics. A web access session can be denoted as S = {(e1, t1), (e2, t2),….., (en , tn)}, Where e ∈M and ti is the request time of ei for 1 ≤I≤n. 
3 FBCA-BASED WEB USAGE MINING

Figure 2 gives an overview of the proposed FBCA-based web usage mining technique for association access patterns. The approach consists of the following steps: (1) preprocessing; (2) constructing the web usage context; (3) constructing the web usage lattice; and (4) mining association access pattern rules from the web usage lattice.

Figure 2. Overview of the FBCA-based web usage mining approach.

3.1 Preprocessing

The preprocessing step aims to preprocess the original web logs to identify all web access sessions. For web server logs, all user access activities on a website are recorded by the web server of the website. Each user access record contains the client IP address, request time, requested URL, HTTP status code, etc. Users are treated as anonymous, since the IP addresses are not mapped to any user-identifiable profile database. As discussed before, web logs can be regarded as a collection of sequences of access events from one user or session, in ascending timestamp order. Preprocessing tasks [6] including data cleaning, user identification, and session identification can be applied to the original web log files to obtain all web access sessions.

3.2 Constructing Web Usage Context

This step constructs the web usage context (the formal context of the web logs) based on all web access sessions obtained from the preprocessing step. Let M be a set of unique access events representing the web resources accessed by users, i.e. web pages, URLs, or topics. A web access session can be denoted as S = {(e1, t1), (e2, t2), …, (en, tn)}, where ei ∈ M and ti is the request time of ei for 1 ≤ i ≤ n. Note that it is not necessary that ei ≠ ej for i ≠ j in S; that is, repeated items are allowed.

For the web usage context (G, M, I), the set of objects G consists of all web access sessions in the web logs of a website, and the attribute set M consists of all web resources of the website. The relation I indicates the web resources of the website that are accessed by the sessions. If (g, m) ∈ I, the user in web access session g has accessed the content of web resource m. However, this does not necessarily mean that all web pages accessed by the user are of interest to him, as some intermediate pages might need to be accessed first before reaching the target pages. As such, we set a duration threshold dmin as a constraint to filter out non-targeted access events. The duration di of each access event ei is simply estimated as di = ti+1 − ti. Obviously, the last access event en in each session has no tn+1 for estimating the duration. There are two ways to tackle this problem. The first is to discard the event, which is unacceptable, as useful data would be lost. The second is to estimate the duration; here, we use the average duration in the relevant session as the estimated duration of the last access event, i.e. dn = (d1 + d2 + … + dn-1)/(n − 1). Then, all access events whose durations are less than the predefined duration threshold dmin are regarded as not useful and are discarded.

After this step, the web usage context is constructed. Table 2 shows an example of a web usage context. Each row represents a web access session (S), and each column represents a web page access (PA).

Table 2. An example web usage context (the placement of the crosses is one arrangement consistent with the supports and confidences quoted in Section 4).

       | PA1 | PA2 | PA3 | PA4 | PA5 | PA6
    S1 |  X  |  X  |  X  |     |     |  X
    S2 |     |  X  |     |     |  X  |  X
    S3 |  X  |     |  X  |  X  |     |
    S4 |  X  |  X  |  X  |  X  |     |  X
    S5 |     |     |     |     |  X  |  X
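A brief sketch of the duration-filtering step just described, under an assumed session representation: each session is a list of (event, request time in seconds) pairs in ascending order, and the last event receives the session's average duration.

def session_to_context_row(session, d_min):
    """session: list of (event, request_time_seconds) pairs in ascending
    time order. Returns the set of events kept after duration filtering,
    i.e. one object (row) of the web usage context."""
    if len(session) < 2:
        return {e for e, _ in session}
    durations = [session[i + 1][1] - session[i][1]
                 for i in range(len(session) - 1)]
    # The last event has no t_{n+1}; use the session's average duration.
    durations.append(sum(durations) / len(durations))
    return {e for (e, _), d in zip(session, durations) if d >= d_min}

s = [("PA1", 0), ("PA2", 40), ("PA3", 45), ("PA6", 160)]
print(session_to_context_row(s, d_min=10))  # {'PA1', 'PA3', 'PA6'}: PA2 (5 s) is dropped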
3.3 Constructing the Web Usage Lattice

Based on the web usage context, a concept lattice can be constructed. We refer to this concept lattice as the Web Usage Lattice of the web logs. Efficient approaches [7, 8] for computing a concept lattice from a formal context have been proposed and can be used for constructing the web usage lattice from the web usage context. In addition, incremental approaches [9, 10] can be used for building the web usage lattice incrementally. In this research, the TITANIC algorithm [7] is adopted for constructing the web usage lattice. It is based on data mining techniques with a level-wise approach and is much more efficient than other lattice construction algorithms, especially for weakly correlated data with a large number of objects.

Figure 3 shows the web usage lattice of the web usage context given in Table 2. The web usage lattice can be treated as a conceptual model of the web logs, which can then be processed by the FBCA-based technique for discovering interesting and frequent user access patterns from web usage data. The next section discusses the mining of association rules based on the web usage lattice.

“Figure 3” The web usage lattice of the web usage context in Table 2.

4 MINING FBCA-BASED ASSOCIATION ACCESS PATTERN RULES

Association rule mining [11] searches for interesting relationships among items in a given data set. Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn}, where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, an association rule is an implication of the form X => Y, where X, Y ⊂ I are sets of items called itemsets and X ∩ Y = ∅. X is called the antecedent and Y the consequent. We are generally not interested in all implications, but only in those that are important. Here, two measures called support and confidence are commonly used to measure the importance of association rules. The support of an association rule X => Y is the percentage of transactions in the database that contain X ∪ Y. The confidence of an association rule X => Y is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X. Rules that satisfy both a minimum support threshold (MinSup) and a minimum confidence threshold (MinConf) are called strong rules.

Association access pattern rules can be discovered based on Formal Concept Analysis. They are defined as follows.

Definition 4.1 Let M be the set of attributes of a formal context K = (G, M, I). An association rule is a pair X => Y with X, Y ⊆ M. Its support is defined by

sup(X => Y) = |(X ∪ Y)′| / |G|

and its confidence by

conf(X => Y) = |(X ∪ Y)′| / |X′|

Let B ⊆ M and MinSup ∈ [0, 1]. The support of the attribute set (also called itemset) B in the formal context K = (G, M, I) is

sup(B) = |B′| / |G|

B is said to be a frequent attribute set if sup(B) ≥ MinSup. A concept is called a frequent concept if its intent is frequent.
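As an illustration of Definition 4.1, the following Python sketch computes support and confidence over the web usage context of Table 2. The encoding of sessions as attribute sets and all function names are illustrative, not part of the proposed system.

# Sessions of Table 2 encoded as attribute sets (G is the set of keys).
usage_context = {
    "S1": {"PA1", "PA2", "PA3", "PA6"},
    "S2": {"PA3", "PA4", "PA6"},
    "S3": {"PA2", "PA5", "PA6"},
    "S4": {"PA1", "PA2", "PA3", "PA4", "PA6"},
    "S5": {"PA3", "PA4"},
}

def extent(B):
    # B' : the sessions that contain every attribute in B
    return {g for g, attrs in usage_context.items() if B <= attrs}

def support(X, Y=frozenset()):
    # sup(X => Y) = |(X ∪ Y)'| / |G|; with Y empty this is sup(B) for B = X
    return len(extent(X | Y)) / len(usage_context)

def confidence(X, Y):
    # conf(X => Y) = |(X ∪ Y)'| / |X'|
    return len(extent(X | Y)) / len(extent(X))

# The example of Section 4: PA2 => PA6 is exact, PA6 => PA2 is approximate.
print(support({"PA2"}, {"PA6"}), confidence({"PA2"}, {"PA6"}))  # 0.6 1.0
print(support({"PA6"}, {"PA2"}), confidence({"PA6"}, {"PA2"}))  # 0.6 0.75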
Most traditional association rule mining algorithms employ a support-confidence framework. However, such approaches suffer from the problem that a large number of rules is usually returned, with many redundancies among them. In other words, despite using the minimum support and confidence thresholds to weed out or exclude the exploration of uninteresting rules, many rules that are not interesting may still be produced. Mining association rules using Formal Concept Analysis can significantly reduce the number of rules without compromising their quality. This approach extracts only a small subset of all association rules, called a basis, from which all other rules can be derived.

In [12, 13], different bases for association rules were introduced, which prune redundant rules but from which all valid rules can still be derived. The computation of the bases does not require all frequent itemsets, but only the frequent concept intents. In this research, the FBCA-based association access pattern rules generated for web usage mining are based on the Duquenne-Guigues basis [12] for exact association rules (i.e., association rules with 100% confidence) and the Luxenburger basis [12] for approximate association rules (i.e., association rules with less than 100% confidence). We define FBCA-based association access pattern rules as follows.

Definition 4.3 Given a formal context K = (G, M, I), FBCA-based association access pattern rules consist of two kinds of rules:
(1) FBCA-based exact rules B1 => B2, where B1 and B2 are frequent nonempty concept intents, and the concept (B1′, B1) has the concept (B2′, B2) as its only immediate superconcept;
(2) FBCA-based approximate rules B1 => B2, where B1 and B2 are frequent nonempty concept intents, and the concept (B1′, B1) is an immediate subconcept of (B2′, B2).
Given the minimum support MinSup and minimum confidence MinConf, all rules B1 => B2 (B1, B2 ≠ ∅) must satisfy sup(B1 => B2) ≥ MinSup and conf(B1 => B2) ≥ MinConf.

From the above definition, it is obvious that each FBCA-based exact rule corresponds exactly to one edge that connects a subconcept with its only superconcept in the concept lattice, and each FBCA-based approximate rule corresponds exactly to one edge that connects a superconcept with one of its subconcepts in the concept lattice. For example, in the web usage lattice shown in Figure 3, the edge from the concept node {PA2, PA6} to {PA6} represents an exact rule PA2 => PA6 with support = 60% and confidence = 100%, and the edge from the concept node {PA6} to {PA2, PA6} represents an approximate rule PA6 => PA2 with support = 60% and confidence = 75%. The computation of the support and confidence is given in the algorithm in Figure 4.

Several approaches [12, 13] have been proposed for mining association rules based on Formal Concept Analysis. Most of them compute association rules directly from a formal context. Since the web usage lattice has already been constructed as a model of the web logs, we propose an approach for mining association access pattern rules directly from the web usage lattice. The algorithm for mining FBCA-based association access pattern rules from the web usage lattice is given in Figure 4. Using this algorithm, all FBCA-based association access pattern rules can be mined from the web usage lattice in Figure 3. Table 3 shows the mining results with MinSup = 40% and MinConf = 50%. A total of 12 FBCA-based association access pattern rules, including 3 exact rules and 9 approximate rules, are generated.

_______________________________________
Algorithm: FBCA-based Association Access Pattern Rule Mining
_______________________________________
Input:
1: WUL – the Web Usage Lattice, based on the formal context K = (G, M, I), where the object set G consists of all user access sessions in the web logs, the attribute set M consists of all web pages, and the relation I indicates the web pages accessed by the sessions.
2: NL = {N1, N2, …, Nm} – the set of concept nodes in WUL, where Ni = 〈Ai, Bi, Pi〉, Ai ⊆ G is the extent of Ni, Bi ⊆ M is the intent of Ni, and Pi = {Ni1, Ni2, …, Nip} ⊆ NL is the set of immediate parent nodes of Ni.
3: MinSup – the minimum support threshold.
4: MinConf – the minimum confidence threshold.
Output:
1: ARS = {AR1, AR2, …, ARn} – a set of FBCA-based association access pattern rules, where ARi = (Xi => Yi, Support, Confidence), Xi, Yi ⊂ M, and Xi ∩ Yi = ∅.
Process:
1: Initialize ARS = ∅.
2: For each Ni ∈ NL, if Pi ≠ ∅ and Sup = |Ai| / |G| ≥ MinSup, do
   a) If |Pi| = 1 and Bi1 ≠ ∅, do
      Insert ((Bi − Bi1) => Bi1, Sup, 100%) into ARS as an FBCA-based exact rule.
   b) For each Nij ∈ Pi, if Bij ≠ ∅ and Conf = |Ai| / |Aij| ≥ MinConf, do
      Insert (Bij => (Bi − Bij), Sup, Conf) into ARS as an FBCA-based approximate rule.
3: Return ARS.
_______________________________________

“Figure 4” The algorithm for mining FBCA-based association access pattern rules.
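For concreteness, the following Python sketch is one possible implementation of the Figure 4 algorithm. It assumes lattice nodes are supplied as (extent, intent, parents) triples, e.g. by a lattice constructor such as TITANIC; the hand-built two-node fragment in the usage example is illustrative only.

def mine_rules(nodes, num_sessions, min_sup, min_conf):
    # nodes: list of (extent, intent, parents) triples; parents holds the
    # triples of the node's immediate superconcepts (step 2 of Figure 4).
    rules = []
    for ext, intent, parents in nodes:
        sup = len(ext) / num_sessions
        if not parents or sup < min_sup:
            continue
        # Step 2a: a single immediate superconcept yields an exact rule.
        if len(parents) == 1 and parents[0][1]:
            p_intent = parents[0][1]
            rules.append((intent - p_intent, p_intent, sup, 1.0))
        # Step 2b: every immediate superconcept yields an approximate rule.
        for p_ext, p_intent, _ in parents:
            conf = len(ext) / len(p_ext)
            if p_intent and conf >= min_conf:
                rules.append((p_intent, intent - p_intent, sup, conf))
    return rules

# A two-node fragment of the lattice in Figure 3: {PA6} and {PA2, PA6}.
top = (frozenset({"S1", "S2", "S3", "S4"}), frozenset({"PA6"}), [])
node = (frozenset({"S1", "S3", "S4"}), frozenset({"PA2", "PA6"}), [top])
for lhs, rhs, sup, conf in mine_rules([top, node], 5, 0.4, 0.5):
    print(set(lhs), "=>", set(rhs), f"sup={sup:.0%}", f"conf={conf:.0%}")
# {'PA2'} => {'PA6'} sup=60% conf=100%
# {'PA6'} => {'PA2'} sup=60% conf=75%

The printed rules match rows 1 and 4 of Table 3, which provides a quick sanity check of the implementation against the worked example.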
For comparison, we also mine all conventional association rules from the web usage context given in Table 2 using the traditional Apriori-based algorithm [14]. The results are listed in Table 4. A total of 32 rules are generated; among them, there are 9 exact rules (with 100% confidence) and 23 approximate rules (with less than 100% confidence). It is important to note that the number of FBCA-based association access pattern rules is small compared with the number of Apriori-based association rules.

“Table 3” FBCA-based association access pattern rules mined from the web usage lattice in Figure 3 (MinSup = 40%, MinConf = 50%).

No   FBCA-based Association Rule   Support   Confidence
1    PA2 => PA6                    60%       100%
2    PA4 => PA3                    60%       100%
3    PA1 => PA2 ^ PA6              40%       100%
4    PA6 => PA2                    60%       75%
5    PA6 => PA3                    60%       75%
6    PA3 => PA6                    60%       75%
7    PA3 => PA4                    60%       75%
8    PA2 ^ PA6 => PA1              40%       67%
9    PA2 ^ PA6 => PA3              40%       67%
10   PA3 ^ PA6 => PA2              40%       67%
11   PA3 ^ PA6 => PA4              40%       67%
12   PA3 ^ PA4 => PA6              40%       67%

5 PERFORMANCE EVALUATION

In this section, we compare the performance of the FCA-mined rules with that of the Apriori-based algorithm on web recommendation.

5.1 Web Recommendation

Web recommendation systems are typically built based on web usage mining techniques. The goal of web recommendation is to determine which web pages are most likely to be accessed next by the current user in the near future. One of the motivations for building the web recommendation system is that it can be used to study the performance of different web usage mining techniques. Association rules are well suited for web recommendation: the recommendation engine generates recommendations (links) by matching the user's recent browsing history against the discovered association rules. Here, we define three measures to evaluate the performance of FCA versus Apriori objectively.

5.2 Performance Evaluation

In this section, we discuss the performance evaluation of the proposed SWABRS system.

5.2.1 Evaluation Measures

The following measures are used to evaluate the performance of the proposed approach. Let WAS = a1a2…ak ak+1…an be a web access sequence. For the prefix sequence WASprefix = a1a2…ak (k ≥ MinLength), we generate a recommendation rule RR = {e1, e2, …, em} using the Pattern-tree, where all events are ordered by their support.
- If ak+1 ∈ RR, we label the recommendation rule as correct, and incorrect otherwise.
- If there exists ai ∈ RR (k+1 ≤ i ≤ k+1+m, m > 0), we deem the recommendation rule m-step satisfactory; otherwise, we label it as m-step unsatisfactory.

Let R = {RR1, RR2, …, RRn} be a set of recommendation rules, where RRi (1 ≤ i ≤ n) is a recommendation rule, and |R| = n is the total number of recommendation rules in R.

Definition 5.2.2 Let Rc be the subset of R that consists of all correct recommendation rules. The overall web recommendation precision is defined as

Precision = |Rc| / |R|

This precision measures how probable it is that a user will access one of the recommended pages.

Definition 5.2.3 Let Rs(m) be the subset of R that consists of all m-step satisfactory recommendation rules. The m-step satisfaction for web recommendation is defined as

Satisfaction(m) = |Rs(m)| / |R|

The m-step satisfaction is an important evaluation measure for web recommendation. The next web page accessed by a user may not be the target page that the user wants; in many cases, a user has to access some intermediate pages before reaching the target page. Therefore, it is not appropriate to use only the precision measure to evaluate a web recommendation system. The m-step satisfaction gives the precision with which the recommended pages will be accessed in the near future (within m steps). Clearly, the satisfaction and precision measures are equivalent for m = 1. In order to realistically evaluate our web recommendation system, we set m = 5 to indicate that a recommended page should be accessed in the near future.

Definition 5.2.4 Let Rn be the subset of R that consists of all nonempty recommendation rules. The applicability of web recommendation is defined as

Applicability = |Rn| / |R|

As the Pattern-tree only stores web access sub-sequences accessed frequently by users (with a support of at least MinSup), the recommendation rule generation approach is unable to find recommended pages if the current access sequence does not include a frequent suffix sequence, in which case the generated recommendation rule is empty. Therefore, the applicability measure gives a rough idea of how often recommendations will be generated. Parameters such as MinSup can influence the applicability of web recommendation; we discuss these parameters in the next section. Generally, the smaller MinSup is, the more applicable the web recommendation becomes, but this comes at the expense of increased sequential pattern mining and Pattern-tree construction cost.
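The three measures can be computed in a few lines. The following Python sketch assumes a hypothetical test harness in which each test case pairs the generated recommendation rule (a possibly empty set of events) with the user's actual continuation of the access sequence; it illustrates Definitions 5.2.2-5.2.4 and is not the evaluation code used in the experiments.

def evaluate(cases, m=5):
    # cases: list of (recommended, next_events) pairs, where recommended is the
    # (possibly empty) event set of the rule and next_events is the user's
    # actual continuation of the access sequence.
    n = len(cases)
    correct = sum(1 for rec, nxt in cases if nxt and nxt[0] in rec)  # Def. 5.2.2
    satisfied = sum(1 for rec, nxt in cases if rec & set(nxt[:m]))   # Def. 5.2.3
    nonempty = sum(1 for rec, _ in cases if rec)                     # Def. 5.2.4
    return correct / n, satisfied / n, nonempty / n

cases = [
    ({"PA3", "PA6"}, ["PA6", "PA1"]),  # correct: the next page was recommended
    ({"PA2"}, ["PA4", "PA2", "PA5"]),  # incorrect, but 5-step satisfactory
    (set(), ["PA1"]),                  # empty rule: reduces applicability
]
precision, satisfaction, applicability = evaluate(cases)
print(round(precision, 2), round(satisfaction, 2), round(applicability, 2))
# 0.33 0.67 0.67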
5.3 Performance Results

We use the same experimental environment that was used in SWABRS for measuring the performance of the AWABRS system. As before, we use the two datasets from the Microsoft Anonymous Web Data for training and testing the web recommendation application. The training dataset for constructing the web usage lattice has a total of 5,000 sessions, with each session containing from 1 up to 35 page references drawn from a total of 294 pages. Note that we only use the 2,213 valid web access sequences that have more than two items. The test dataset has a total of 32,711 sessions and includes 8,969 valid web access sequences, which are used as the inputs to web recommendation. In the experiment, all Apriori-based association rules and FBCA-based association access pattern rules, mined from the training dataset with a minimum support count of 10, are used to generate recommendation rules for the test dataset.

“Figure 5 (a, b, c)” Performance of Apriori-based association rules vs. FBCA-based association access pattern rules for web recommendation.

6 CONCLUSIONS

We proposed a Formal Concept Analysis based (FBCA) approach for web usage mining and evaluated it against the classical Apriori algorithm. We found the set of FBCA-mined rules to be 70% smaller, with no noticeable performance penalty. The FBCA approach is therefore an effective and efficient tool for generating web recommendations.

REFERENCES

[1] Wille R. (1982). Restructuring Lattice Theory: An Approach Based on Hierarchies of Concepts. In I. Rival, editor, Ordered Sets. Boston-Dordrecht: Reidel, pp. 455-470.
[2] Hereth J., Stumme G., Wille U., and Wille R. (2000). Conceptual Knowledge Discovery and Data Analysis. In Proceedings of the International Conference on Conceptual Structures, pp. 421-437.
[3] Kosala R., and Blockeel H. (2000). Web Mining Research: A Survey. In ACM SIGKDD Explorations, Vol. 2, pp. 1-15.
[4] Kim M., and Compton P. (2001). Formal Concept Analysis for Domain-Specific Document Retrieval Systems. In Proceedings of the 13th Australian Joint Conference on Artificial Intelligence (AI'01), Adelaide, Australia, pp. 237-248.
[5] Ganter B., and Wille R. (1999). Formal Concept Analysis: Mathematical Foundations. Springer, Heidelberg.
[6] Cooley R., Mobasher B., and Srivastava J. (1999). Data Preparation for Mining World Wide Web Browsing Patterns. In Journal of Knowledge and Information Systems, Vol. 1, No. 1.
[7] Stumme G., Taouil R., Bastide Y., Pasquier N., and Lakhal L. (2002a). Computing Iceberg Concept Lattices with TITANIC. In Data & Knowledge Engineering (DKE), Vol. 42, No. 2, pp. 189-222.
[8] Nourine L., and Raynaud O. (1999). A Fast Algorithm for Building Lattices. In Information Processing Letters, Vol. 71, pp. 199-204.
[9] Godin R., and Missaoui R. (1994). An Incremental Concept Formation Approach for Learning from Databases. In Theoretical Computer Science, Special Issue on Formal Methods in Databases and Software Engineering, Vol. 133, No. 2, pp. 387-419.
[10] Valtchev P., and Missaoui R. (2001). Building Concept (Galois) Lattices from Parts: Generalizing the Incremental Methods. In Proceedings of the 9th International Conference on Conceptual Structures (ICCS 2001), USA, pp. 290-303.
[11] Agrawal R., Imielinski T., and Swami A. (1993). Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the ACM SIGMOD International Conference on the Management of Data, Washington, D.C., pp. 207-216.
[12] Stumme G., Taouil R., Bastide Y., Pasquier N., and Lakhal L. (2001a). Intelligent Structuring and Reducing of Association Rules with Formal Concept Analysis. In Proceedings of KI 2001: Advances in Artificial Intelligence, pp. 335-350.
[13] Pasquier N. (2000). Mining Association Rules Using Formal Concept Analysis. In Working with Conceptual Structures, Contributions to ICCS 2000, pp. 259-264.
[14] Agrawal R., and Srikant R. (1994). Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile, pp. 487-499.
[15] Stumme G. (2002). Efficient Data Mining Based on Formal Concept Analysis. In Proceedings of Database and Expert Systems Applications (DEXA 2002), pp. 534-546.
[16] Stumme G., Taouil R., Bastide Y., and Lakhal L. (2001b). Conceptual Clustering with Iceberg Concept Lattices. In Proceedings of GI-Fachgruppentreffen Maschinelles Lernen'01, Universität Dortmund, Vol. 763.
[17] Srivastava J., Cooley R., Deshpande M., and Tan P.-N. (2000). Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. In ACM SIGKDD Explorations, Vol. 1, No. 2, pp. 12-23.