Mining data records in web pages
An Introduction to Data Mining Techniques

An Introduction to Data Mining Techniques* — 鄧家駒. In recent years there has been a new trend in the design of commercial statistical software: for the now quite mature database and data warehouse technologies, vendors are developing suites of numerical analysis software for classifying and analyzing the enormous volumes of electronic information stored with these technologies.
Generally speaking, the data analyzed in data mining — taking loan-application data in finance as an example — falls into the following types: (1) Individual data: for a person, items such as age, gender, address, income, education level, and marital status; for a company, items such as industry sector, financial statements, operating performance, and market share. (2) Behavioral data: for an account, the loan amount, interest rate, drawdown activity, repayment status, outstanding balance, accumulated interest, and so on. (3) Background data: the individual's or company's current total debt, credit limit, frequency of credit-inquiry applications, credit standing, bad-debt record, and so on. (4) Economic data: the interest-rate level at the time of application, price indices, the price level of collateral such as real estate, business-cycle indices, and other economic indicators. (5) Other data: other information related to the transaction, such as information on collateral, guarantors, and syndicated loans.
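To make the five categories above concrete, the following is a minimal, illustrative sketch in Python of how one loan-application record might be represented; every field name here is hypothetical and chosen only to mirror the groupings in the text, not taken from any actual credit-bureau schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LoanApplication:
    # (1) Individual data
    age: int
    gender: str
    income: float
    education: str
    marital_status: str
    # (2) Behavioral data
    loan_amount: float
    interest_rate: float
    repayment_status: str
    outstanding_balance: float
    # (3) Background data
    total_debt: float
    credit_limit: float
    credit_inquiries_last_year: int
    bad_debt_records: int
    # (4) Economic data (conditions at application time)
    market_interest_rate: float
    price_index: float
    business_cycle_index: float
    # (5) Other data
    collateral: Optional[str] = None
    guarantors: List[str] = field(default_factory=list)

# A single illustrative record with made-up values.
app = LoanApplication(
    age=35, gender="F", income=720_000, education="bachelor", marital_status="married",
    loan_amount=1_500_000, interest_rate=0.031, repayment_status="current",
    outstanding_balance=1_200_000, total_debt=1_800_000, credit_limit=2_000_000,
    credit_inquiries_last_year=2, bad_debt_records=0,
    market_interest_rate=0.025, price_index=104.2, business_cycle_index=98.7,
    collateral="apartment", guarantors=["co-signer A"],
)
```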
Why not simply analyze these databases with ordinary statistical software? There are two main reasons.
One shortcoming of traditional statistical software is that its designers never anticipated having to work with information on this scale. Once the volume of data grows beyond a certain point, such packages running on an ordinary PC run into serious problems with both memory capacity and computation speed. For example, the JCIC (Joint Credit Information Center) receives large amounts of information from every financial institution each year; the data transmitted every year or quarter is not only extremely voluminous, it also keeps accumulating as time goes on. As a result, the total volume of data is far larger than most people can imagine. Similarly, consider national health insurance data, which keeps accumulating over time as people visit hospitals and clinics of every kind for all sorts of illnesses. Data on current illnesses is obviously important, but the historical record of past illnesses and medications cannot be ignored either, so the longer the time span, the more data of every kind accumulates. In addition, the more detailed the information we wish to store, the more data items (variables) there are, the denser the information becomes, and the larger the total volume of data naturally grows.
Data Collection and Marketing Tools (English edition)

Motivations for DM
Abundance of business and industry data
Competitive focus; powerful computing engines
Strong theoretical/mathematical foundations
1. Decision Trees and Fraud Detection
2. Association Rules and Market Basket Analysis
3. Clustering and Customer Segmentation
Trends in technology:
1. Knowledge Discovery Support Environment
2. Tools, Languages and Systems
Goal: provide a systematization of the many concepts around this area, along the following lines: the process, the methods applied to paradigmatic cases, the support environment, and the research challenges.
1970s:
Relational data model, relational DBMS implementation.
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).
Data Mining in Forensic Image Databases

Data Mining in Forensic Image Databases
Zeno Geradts*, Jurrien Bijhold
Netherlands Forensic Institute

ABSTRACT
Forensic image databases appear in a wide variety. The oldest computerized database is that of fingerprints. Other examples are databases of shoeprints, handwriting, cartridge cases, toolmarks, drugs tablets, and faces. In these databases, searches are conducted on shape, color, and other forensic features. A wide variety of methods exists for searching images in these databases. The result is a list of candidates that should then be compared manually. The challenge in forensic science is to combine the information acquired. Combining the shape of a partial shoe print with information on a cartridge case can result in stronger evidence. It is expected that by searching these databases in combination with other databases (e.g., network traffic information), more crimes will be solved. Searching in image databases is still difficult, as can be seen in databases of faces. Due to lighting conditions and the alteration of the face by aging, it is nearly impossible for an image-searching method to place the right face from a database of one million faces in top position without using other information. The methods for data mining in image databases (e.g., the MPEG-7 framework) are discussed, and expectations for future developments are presented in this study.
Keywords: image databases, data mining, forensic science, biometrics

1. INTRODUCTION
The importance of image databases in forensic science has long been recognized. For example, the utility of databases of fingerprints1 is well known. Over the past four years, DNA databases have received particular attention and have been featured on the front pages of newspapers. These databases have proven extremely useful in verifying or falsifying the involvement of a suspect in a crime, and have led to the resolution of old cases. DNA databases are also playing an increasingly prominent role in the forensic literature2,3. However, a variety of other databases are also crucial to forensic casework, such as databases of fingerprints, faces, and bullets and cartridge cases of firearms4. Research into forensic image databases is a rapidly expanding field of scientific endeavor that has a direct impact on the number of criminal cases solved.
Throughout the 20th century, many databases were available in the form of paper files or photographs (e.g., cartridge cases, fingerprints, shoeprints). Fingerprint databases were computerized in the 1980s, and became the first databases to be widely used in networks. In addition, the Bundes Kriminal Amt attempted to store images of handwriting in a database. These databases were all in binary image format. At the beginning of the 1990s, computer databases of shoeprints, tool marks and striation marks on cartridge cases and bullets became available. Improvements in image acquisition and storage made it economically feasible to compile these databases in gray-value or color format using off-the-shelf computers.
Some forensic databases contain several million images, as is the case with fingerprints. If databases are large, the forensic examiner needs a method for selecting a small number of relevant items from the database, because if this cannot be achieved the investigation becomes time-consuming and therefore either expensive or impossible.
The retrieval of similar images from a database based on the contents of each image requires an automatic comparison algorithm that is *zeno@forensic.to;phone +31704135681;fax +31704135454;Netherlands Forensic Institute;Volmerlaan 17;2288 GD Rijswijk;Netherlands;fast, accurate, and reliable. To formulate such an algorithm, one must first identify which parts or features of the images are both crucial and suitable for finding correspondences. The development of the retrieval system then requires a multidisciplinary approach with knowledge of multimedia database organization, pattern recognition, image analysis and user interfaces.Forensic image databases often contain one or more sub-databases:• Images that are collected from the scene of crime (e.g., shoeprints recovered from the crime scene) à with this database it is possible to link cases to each other• Images that are collected from the suspect (e.g., shoeprints that are collected from a suspect) à with this database in combination with the database of images that are collected from the scene of crime, it is possible to link suspects with cases• Reference images (e.g., shoeprints from shoes that are commercially available, that can be used to determine which brand and make of shoe a certain shoe print is from)The compiling of large-scale forensic image databases has made available statistical information regarding the uniqueness of certain features. For example, at the beginning of the 20th century there have been arguments about the number of matching points that are needed for concluding that a fingerprint matches. (depending on the country5 this could be 8-16 points). Nowadays, however, statistical ranking in fingerprint databases provides more information regarding the uniqueness of fingerprints and the number of points versus the statistical relevancy. U p until very recently, most forensic conclusions were drawn more on experience of the investigator than on real statistics. The statistical information now available from databases should result in forensic investigation that is more objective. This is necessary since courts and lawyers are asking questions that are more critical about forensic investigation and conclusions that are based on experience of the investigator instead of real statistics*.2. Visual ContentPrevious studies on information retrieval from image databases based on visual contents have used the following features6,7 such as color, texture, shape, structure, and motion, either alone or in combination.ColorColor reflects the chromatic attributes of the image as it is captured with a sensor. A range of geometric color models8 (e.g., HSV, RGB, Luv) for discriminating between colors are available. Color histograms are the most traditional technique for describing the low-level properties of an image.TextureTexture has proved to be an important characteristic for the classification and recognition of objects and scenes Haralick and Shapiro9 defined texture as the uniformity, density, coarseness, roughness, regularity, intensity, and direc tionality of discrete tonal features and their spatial relationships. Haralick10 reviewed the two main approaches to characterizing and measuring texture: statistical approaches and structural approaches. Tuceryan and Jain11 carried out a survey of textures in which texture models were classified into statistical models, geometrical models, model-based methods, and signal processing methods.*On January 9th 2002 U.S. District Court Judge Louis H. 
Pollak in Philadelphia, ruled that finger print evidence does not meet standards of scientific scrutiny established by the U.S. Supreme Court, and said fingerprint examiners cannot testify at trial that a suspect's fingerprints "match" those found at a crime scene. It is expected that other forensic fields (e.g., handwriting, toolmarks, hairs) will also be challenged for court..ShapeShape features are expressed in text, for example squares, rectangles, and circles. However, complex forms are more difficult to express in text.Figure 1: Flowchart for a visual information retrieval systemTraditionally, shapes are expressed through a set of features that are extracted using image-processing tools. Features can characterize the global form of the shape such as the area, local elements of its boundary or corners, characteristic points, etc. In this approach, shapes are viewed as points in the shape feature space. Standard mathematical distances are used to calculate the degree of similarity of two shapes.Preprocessing is often needed to find the shapes in an image. Multiple scale techniques12 are often used as filters to elucidate the shapes in an image. In many cases, shapes must be extracted by human interaction, because it is notalways known beforehand which are the important shapes in an image.The property of invariance (e.g.; the invariance of a shape representation in a database to geometric transformations such as scaling, rotating, and translation) is important in the comparison of shapesStructureThe structure of an image is defined as the set of features that provide the “gestalt” impression of an image. The distribution of visual features can be used to classify and retrieve an image. Image structure can be important for achieving fast pre-selection of a database (i.e., the selection of a part of the database) based on the image contents. A simple example is distinguishing line drawings from pictures gray values and location in the image.MotionMotion is used in video databases and is analyzed in a sequence of frames. There are several models for calculating the motion vectors from a sequence of images. These methods range from simply determining the motion from the difference between two images to more complex approaches using optical flows for different objects.3. Similarity MeasuresSimilarity measures are used to identify the similarities in images. These measures are usually based on histograms of a feature in the image; however, more sophisticated implementations taking into account other measurements from the images are also possible. A metric model is frequently used to compare features that can be expressed within a metric system. The implementation of this method is particularly efficient in terms of the computing power required. Several distance functions are commonly used:13 the Euclidean distance, the city-block-distance, and the Minkowsky distance for histograms. However, many other metric models can be implemented. Another approach to measuring similarity is the use of transformational distances.14 This approach requires the definition of a deformation process that transforms one shape into another, where the amount of deformation is a measure of the similarity. One such group of models are the elastic models, which use a discrete set of parameters to evaluate the similarity of shapes. 
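Before the discussion of deformation-based models continues below, here is a rough illustration of the metric distance functions just mentioned (city-block, Euclidean, and the general Minkowski form) applied to feature histograms. The histograms are made up for the example; this is not code from the paper.

```python
def minkowski(h1, h2, p):
    """Minkowski distance of order p between two histograms; p=1 is city-block, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(h1, h2)) ** (1 / p)

def normalize(hist):
    total = sum(hist)
    return [v / total for v in hist]

# Two hypothetical 8-bin color histograms (e.g., pixel counts per hue range).
query_img = normalize([12, 40, 33, 8, 2, 0, 1, 4])
db_img    = normalize([10, 38, 35, 9, 3, 1, 0, 4])

print("city-block:", round(minkowski(query_img, db_img, 1), 4))
print("euclidean: ", round(minkowski(query_img, db_img, 2), 4))
```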
Evolutionary models, on the other hand, consider the shapes that result from a process in which at every step forces are applied at specific points of the contours.Several studies in psychology have pointed out that the human visual system has a number of inadequacies compared to algorithm-based systems. The Earthmover’s distance15 is an example of a new kind of implementation that, according to the developers, seems to be equivalent to the human visual system. The developers of the Earthermover’s distance claim that it is easy to implement and that efficient indexing methods are possible.4. Biometric databasesBiometric systems are an important area of research into visual information retrieval systems. In this section, examples of practical implementations are discussed in relation to the methods described earlier.Biometric systems serve two potential goals: identification or recognition.16 Recognition is often used to distinguish a particular individual from a limited number of people whose biometric data is known. For instance, a company that has 40 employees uses face recognition access control, in which the 40 employees have to be distinguished from each other. If an error (false positive or false negative) is made with this access control system, this is acceptable. Identification is much more difficult to achieve than recognition, because false positives are unacceptable.For biometric systems, a group of measures are used to identify or recognize a person. For a measure to be suitable for a biometric system, it must satisfy several requirements17:• Universality: the biometric measure must be a characteristic possessed by all people.• Uniqueness: no two individuals should have the same biometric measure• Permanence: the biometric measure should not change with time.Examples of biometric measures are fingerprints, palm prints, DNA, iris, face, handwriting, gait, and speech. In the next section, some implementations of databases of fingerprints, faces, and handwriting are described, which meet most of the requirements mentioned above. A study to gait comparison is also included in this chapter, which does not meet all requirements.FingerprintsThe most commonly used system for classifying fingerprints is the Henry Classification scheme18. Although this classification system dates back to the early 1900s, it is still used in commercial systems where it has been used since the era in which fingerprints were classified manually. Commercial Automatic Fingerprint Identification Systems (AFIS) usually employ this classification scheme; however, other approaches have been developed.FacesImages of faces are easy to obtain with a camera, and are important for the surveillance industry. The problem with recognizing faces in images from a camera system is that the faces are not acquired in a standardized fashion. The face can be in any position and the lighting and magnification can vary. Furthermore, hairstyle, facial hair, makeup, jewels and glasses all influence the appearance of a face. Other longer-term effects on faces are aging, weight change, and facial changes such as scars or a face-lift. Images of faces are generally taken under more standardized conditions for police investigations.19.Before a face can be analyzed, it must be located in the image20. For face recognition systems to perform effectively, it is important to isolate and extract the main features in the input data in the most efficient way. 
One of the main problems that must be overcome in face recognition systems is removing redundant sampling so as to reduce the dimensionality21 Sophisticated pre-processing techniques are required to attain the best results.Some face recognition systems represent the primary facial features (e.g., eyes, nose, and mouth) in a space based on their relative positions and sizes. However, important facial data may be lost using this approach22, especially if variations in shape and texture are considered to be an important aspect of the facial features.Template matching involves the use of pixel intensity information, corresponding to the largest eigenvalues. This can be in the original gray-level dataset or in pre-processed datasets. The templates can be the entire face or regions corresponding to general feature locations such as the mouth and eyes. Cross correlation of test images with all training images is used to identify the best matches.Statistical approaches can also be used in the analysis of faces. For example, Principal Components Analysis (PCA23) is a simple statistical dimensionality reducing technique that is frequently employed for face recognition. PCAs extract the most statistically significant information for a set of images as a set of eigenvectors (often referred to as ‘eigenfaces’). Once the faces have been normalized for their eye position they can be treated as a one-dimensional array of pixel values, which are called eigenvectors.For forensic investigation based on facial comparison that is focused on identification, there exist different approaches for positioning. It is necessary to get a reference image in which the suspect is positioned exactly as in the questioned image.. Approaches for positioning a suspect are described by van den Heuvel24 and Maat25. Positioning takes much time, and this is often not available in commercial systems.In practice it appears that it is not possible to identify persons with the regular CCTV-systems. The expectations of face recognition software is sometimes too high, as can be seen in a trial in Palm Beach for searching terrorists26. It appeared that the system falsely identified several people as suspects and has at times been unable to distinguish between men and womenHandwritingVarious handwriting comparison systems27exist on the market. The oldest system is the Fish system, which was developed by the Bundes Kriminal Amt in Germany. Another well-known system is Script, developed by TNO in the Netherlands.In both systems, handwriting is digitized using a flatbed scanner and the strokes of certain letters are analyzed with user interaction. The features of the handwriting are represented as content semantics.GaitGait is a new biometric28 aimed to recognize a subject by the manner in which they walk. Gait has several advantages over other biometrics, most notably that it is non-invasive and perceivable at a distance when other biometrics are obscured.At the present time, many crimes, including bank robberies, are captured using CCTV surveillance systems sited at stores, banks and other public places. These recordings are often passed to our institute for the purpose of identification. If a criminal has covered his face, the recognition is much more difficult. The question then asked is if it is possible to compare the gait of the perpetrator with the gait of the suspect. 
For this purpose it is necessary that some of the gait parameters have subject characteristic features.Since there is not much known on the characteristics of gait in literature and use in forensic analysis, we have started a research project on gait analysis. Human gait contains numerous parameters. These parameters can be categorized into spatial-temporal and kinematic parameters. Because it was impossible to investigate all gait parameters in our study29, a selection has been made on the criteria that the gait parameters could probably also be obtained in non-experimental settings, and could be characteristic of a person. In our experiments markers were used as is shown in. The subjects wore only their underwear and shoes.From this research it appeared that with this measurement the gait was not unique. Some non-characteristic differences were measured between different persons, so it could be used for distinguishing people from each other. For forensic analysis of CCTV-images, gait analysis is even more difficult, since there are most often a few images available and the subjects wear clothes.5. Databases in the forensic laboratoryAn overview is shown on methods for automatic comparison of images. For databases of tool marks, cartridge cases, shoe prints and drugs tablets a wide variety of implementations have been accomplished at our laboratory in the last ten years.Toolmarks30From the methods available for image matching, texture is an option for making a smaller set. Spatial relationships in the tool mark are in the end necessary for the comparison. The similarity model that has been used is based on registration of two line patterns, in view of the fact that there might be a shift and a zoom. The main contribution of this research was to develop a method for extracting relevant forensic information from the striation marks (Figure 2). With a 3D-approach that is implemented in 1999, this resulted in higher correlation factors for matching striation marks.For the toolmarks, the practical use of these databases is still limited. In the Netherlands, there exist some databases of tool marks, however automatic comparison has not been implemented. Several efforts have been made to implement automatic comparison of striation marks. The results of these algorithms are promising depending on the amount of time that is available, as also can be seen in the bullet systems (IBIS).Cartridge Cases31This research is focused on pre-selection of features. For improving the features that are selected, a manual intervention might improve the results. Registration methods are used for the comparison. Shape for the firing pin, could be a good option for making a faster pre-selection. A faster pre-selection is also possible with texture and or structure of the impression marks. Spatial relationships have been implemented in our research. The cartridge case systems are widely used compared to toolmarks databases. These systems have correlation engines, and modification to a 3D-system will result in better correlation ranks. In the Netherlands we use the system Drugfire,as is used by many agencies in the United States. The company that produces this product is phasing out the software, since there appeared to be patent infringement problems with the other company IBIS. In the United States, the IBIS system will be the standard. This system has a more reproducible way of imaging the cartridge cases, and better results are possible with this system. 
In practice, the system results in “cold” hitsFigure 2: Example of striation mark from screw driver.Shoe Prints32This research has been focused on shape recognition. At the time of the study (1995), computing power and memory was limited which caused that more sophisticated implementations have not been tested. In the mean time other methods for shape recognition have been implemented (as implemented with the drugs tablets study), that could resultin a better matching. The spatial relationships between the different shapes of the shoe profile have not been implemented. This will improve the results.Digital shoe print systems are often used in the United Kingdom and Japan. Many shoe print systems could not survive in the market, since government forensic laboratories that develop these systems, develop them for themselves, and this means that they do not really want to market them. In the Netherlands, we have seen some results with these systems, when used manually. Automatic classification and comparison is possible for clear shoeprints. In practice, the problem with shoeprints is that they are often vague, and for that reason a human being should classify. Shoe prints are valuable in forensic science. They are time-consuming for comparison and collection, and it depends on the police region if they are used. In regions with much violent crimes, we see that this kind of evidence is used less. Shoe prints in blood are an important part of evidence for a homicide that is sometimes skipped due to limitation in time per case. For this reason, the use of shoe print databases should be promoted, since more crimes can be solved, and more forensic information is available.Drugs Tablets33This research is focused on shape recognition. The different methods for shape recognition should be refined for the approach of the 3D shape. The other approach is by making better images, as this will improve the results importantly. Registration methods have been implemented for the comparison. The combination with text (description of logo) of the drugs tablet will greatly reduce the effort, and for larger databases (> 1000) this research should be implemented.In future, if more images are in the database, it could result in more need for such a logo comparison. The problem with shape comparison of these logos is to filter out the images of the logo. It appeared that the acquisition with a regular camera system, the quality of those images is limited, and this results in huge problems of splitting the logos from the drugs tablets with an image processing method. With 3D-techniques, the real shape of the stamp is available, which is under current research.Fig. 3: Shoe print comparison of shoeprint found at the crime scene (left) with a test shoeprint (right). The characteristic marks are pointed with arrows.6. ConclusionsIn this study it appeared that general approaches of searching in forensic image databases are applicable to forensic science databases. The human being is in general still much better in interpreting the separate images, compared to an algorithm. For larger databases, it is not feasible to ask an investigator to compare all forensic evidence, and for this reason the use of image databases in forensic science is very useful, even for crimes in “cold” cases.Pre-processing of the images is important in all forensic applications we have evaluated. The data acquisition should be handled with a standardized approach for optimal results.7. 
Discussion and future researchFor getting more results out of the separate databases, it is necessary to combine the information of the forensic database. This will result in relations between crimes, which were not considered before. Relations can be found that afterwards appear to be a coincidence and misleading, and sometimes they will really result that the case is solved.It is also possible to use data mining techniques in combination with forensic image databases. Biometric databases for access control (which are expected to be more widely implemented) can also be used. Also information of (cellular) phones and other network systems can be used to track and trace criminals, which will result in an efficient way of reducing data. Furthermore, credit card and banking information combined with the loyalty programs of shops are a way of reducing the number of possibilities between a crime scene and suspects. These methods can not be used for all cases, since privacy issues are involved. For this reason the law in many countries will protect us against un-proportional use of data for criminal investigation.All these methods will result that a criminal is more aware that traces can be found. In court, they are confronted with the evidence. This evidence is public, and the networks of criminals will be informed about new methods for solvingcrimes. Another issue is that they will try to alter the evidence in a way that another person is charged for the crime, as can be seen with DNA evidence.The developments in image acquisition (3D) and image matching algorithms (e.g. MPEG-7) combined with the computing power of future systems, will result in systems that can be used effectively for solving more crimes.It remains important to analyze the results of the search and to have user feed back about the position that a certain image is found on the hit list. The conclusion on matching of two marks or prints should be done by a forensically qualified human being. The forensic examiner should also learn from the image databases and more statistical information of these databases should be used in their investigation.ACKNOWLEDGEMENTSFor this work we would like to acknowledge the Netherlands Forensic Institute, especially Hage Postema for his contribution to this work.REFERENCES1. J.H. Wegstein, J.F. Rafferty, “The LX39 Latent Fingerprint Matcher”, NBS Special Publication, 500(36), pp.15-30, 1978.2. T.D. Kupferschmid,, T. Calicchio, B. Budowle; “Maine Caucasian population DNA database using twelve short tandem repeat.”, Journal of Forensic Sciences, 44(2), pp. 392-395, 1999.3. D.J. Werrett, “The National DNA Database” Forensic Sciences International, 88(1) , pp. 33 42. 1997.4. R.W. Sibert, “DRUGFIRE Revolutionizing Forensic Firearms Identification and Providing the Foundation for a National Firearms Identification Network”, USA Crime Lab Digest, 21(4), pp. 63-68, 1994.5. I.W. Evett; R.L. Williams; “A Review of the Sixteen Points Fingerprint Standard in England and Wales”, UK Fingerprint World, 21(82) P125-143, 1995.6. S.K. Chang, A. Hsu, “Image Information systems: Where do we go from here?”, IEEE Transactions on Knowledge and Information Engineering, 4(5), pp.644-658, 1994.7. P. Agrain; H. Zhang; D. Petkovic; “Content-based representation and retrieval of visual media: a state of the art review”, Multimedia Tools and Applications, 3(3), pp. 179-202, 1997.8. M.J. Swain; D.H. Ballard; “Colour Indexing”,International Journal of Computer Vision, 7(1), pp. 11-32, 1995.9. R. Haralick, L. 
Shapiro; “Glossary of computer vision terms”, Pattern Recognition, 24(1), pp. 69-93, 1991.10. R. Haralick, “Statistical and structural approaches to texture”, Proceedings of IEEE, 67(5), pp. 786-804, 1977.11. N. Tuceryan, A.K. Jain; Handbook of Pattern Recognition and Computer Vision, World Scientific Publishing Company, pp. 235-276, 1993.12. A. Witkin; “Scale Space filtering”, Proceedings of 7th International Conference on Artificial Intelligence, pp. 1010-1021. 1983.13. S. Aksoy; R.M. Haralick; “Probabilistic vs. Geometric Similarity Measures for Image Retrieval”, Proceedings of Computer Vision and Pattern Recognition (CPRV’00), pp. 112-128, 2001.14. A.L. Youille, K. Honda, C. Peterson; “Particle tracking by deformable templates”, Proceedings International Joint Conference On Neural Networks, pp. 7-11, 1991.15. R. Yossi R; C. Tomasi; Perceptual metrics for Image Database Navigation., Kluwer Academic Publishers, 2001.16. P.J. Phillips PJ, “An Introduction to Evaluating Biometric Systems”, Computer, 33(2), pp. 56-63, 2000.17. B. Miller; “Vital Signs of Identity”, IEEE Spectrum, 31(2), pp. 22-30, 1994.18. E.R. Henry, Classification and uses of Finger Prints, Routledge, London, 1900.19. J. Atick, P.M. Griffin, A.N. Redlich, “FaceIt: face recognition from static and live video for law enforcement”Publication: Proc. SPIE 2932, p. 176-187, Human Detection and Positive Identification: Methods and Technologies, 1997.20. H.A. Rowley, S. Baluja, T. Kanade, “Human Face detection in visual scenes’, in D.S. Toretzky, M.C. Mozer, M.E. Hasselmo (Eds), Advances in Neural Information Processing, Volume 8, pp. 875-881, Cambridge, MIT Press, 1996 21. A.J. Howell, “Introduction to Face Recognition”, CRC Press International Series on Computational Intelligence, pp. 219-283, 1999.。
Data Mining PPT

The amount of raw data stored in corporate databases is exploding.
For instance, every day, Wal-Mart uploads 20 million point-of-sale transactions to an AT&T massively parallel system with 483 processors running a centralized database. Raw data by itself, however, does not provide much information.
based on statistical significance.
Genetic algorithms (遗传演算法): optimization techniques based on the concepts of genetic combination, mutation, and natural selection.
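As an illustration of the idea in the definition above, the following self-contained sketch (not from the original slides) evolves a population of bit strings toward a toy target, combining selection, single-point crossover (genetic combination), and random mutation.

```python
import random

TARGET = [1] * 20                      # toy goal: a string of twenty 1s
POP_SIZE, GENERATIONS, MUT_RATE = 30, 60, 0.02

def fitness(ind):
    # Number of bits that match the target (higher is better).
    return sum(a == b for a, b in zip(ind, TARGET))

def crossover(p1, p2):
    # Single-point genetic combination of two parents.
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def mutate(ind):
    # Flip each bit with a small probability.
    return [1 - bit if random.random() < MUT_RATE else bit for bit in ind]

population = [[random.randint(0, 1) for _ in TARGET] for _ in range(POP_SIZE)]
for gen in range(GENERATIONS):
    # Natural selection: keep the fitter half as parents.
    population.sort(key=fitness, reverse=True)
    parents = population[: POP_SIZE // 2]
    children = [mutate(crossover(*random.sample(parents, 2)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children
    if fitness(population[0]) == len(TARGET):
        break

print("best individual:", population[0], "fitness:", fitness(population[0]))
```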
Data mining discovers information within the data that queries and reports can't effectively reveal.
Data Warehouses
The drop in the price of data storage has given companies willing to make the investment a tremendous resource: data about their customers and potential customers stored in "data warehouses." Data warehouses are becoming a standard part of corporate information technology. They are used to consolidate data located in disparate databases. A data warehouse stores large quantities of data by specific categories so the data can be more easily retrieved, interpreted, and sorted by users. Warehouses enable executives and managers to work with vast stores of transactional or other data to respond faster to markets and make more informed business decisions. It has been predicted that every business will have a data warehouse within ten years. But merely storing data in a data warehouse does a company little good. Companies will want to learn more about that data to improve their knowledge of customers and markets. The company benefits when meaningful trends and patterns are extracted from the data.
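As a minimal illustration of the consolidation idea described above — not tied to any particular warehouse product — the sketch below merges sales records from two hypothetical source systems into one table and summarizes them by category so they are easier to retrieve and interpret.

```python
from collections import defaultdict

# Hypothetical extracts from two separate operational databases.
store_sales = [
    {"customer": "C001", "category": "notebooks", "amount": 1299.0},
    {"customer": "C002", "category": "printers",  "amount": 199.0},
]
web_sales = [
    {"customer": "C001", "category": "notebooks", "amount": 1499.0},
    {"customer": "C003", "category": "software",  "amount": 89.0},
]

# "Load" both sources into one consolidated fact table.
warehouse = store_sales + web_sales

# Summarize by category so the data is easy to retrieve and interpret.
totals = defaultdict(float)
for row in warehouse:
    totals[row["category"]] += row["amount"]

for category, amount in sorted(totals.items()):
    print(f"{category:10s} {amount:10.2f}")
```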
A Bit of What I Know About Data Mining — Email System

◎ A Bit of What I Know About Data Mining: 1. Foreword 2. Definition 3. Methods 4. Tools 5. Applications 6. Conclusions ◎ The above content was provided by 趙民德, Institute of Statistical Science, Academia Sinica. ◎◎ Data Mining, installment one: What is data mining? How does data mining differ from statistical analysis? Why is data mining needed?
What is data mining? Data mining has become a very popular topic in the field of database applications in recent years. It is a marvelous and fashionable technology, yet it is not really anything new, because the analytical methods it uses — prediction models (regression, time series), database segmentation, link analysis, deviation detection, and so on — were used by the U.S. government even before the Second World War, in census work and for military purposes. What has changed is that progress in information technology has exceeded all expectations: the appearance of new tools such as relational databases, object-oriented databases, soft-computing theory (including neural networks, fuzzy theory, genetic algorithms, and rough sets), applications of artificial intelligence (such as knowledge engineering and expert systems), and the development of network communication technology now make it possible to dig treasure out of mountains of data and to uncover relationships that often go beyond what simple induction can reach, making data mining part of business intelligence.
Data mining is an emerging field. There are some differences in how its scope and definition, and its reasoning and expectations, are understood. The information and knowledge being mined come from enormous databases; many researchers treat this as a key research topic in database systems and machine learning, and businesses regard it as a major source of competitive advantage. Experts from many different fields have shown great interest in data mining; in the information services industry, for example, applications have emerged such as data warehousing and online services on the Internet, bringing new vitality to many enterprises.
With the progress of information technology and the arrival of the electronic age, today's enterprises face a competitive environment utterly different from that of the past. Driven by information technology, not only are the intensity and speed of business competition many times greater than before, but the surge in market transactions also means that the volume of data each enterprise must store and process keeps growing larger and larger.
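To make one of the methods listed earlier concrete — the prediction model based on regression — here is a minimal sketch using only the Python standard library and made-up monthly figures; it fits a straight line by least squares, predicts the next value, and flags large deviations from the trend, a very rough stand-in for the deviation-detection idea as well.

```python
# Illustrative monthly transaction counts (hypothetical data).
months = [1, 2, 3, 4, 5, 6]
transactions = [120, 135, 160, 172, 190, 210]

n = len(months)
mean_x = sum(months) / n
mean_y = sum(transactions) / n

# Ordinary least-squares slope and intercept for y = a + b*x.
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, transactions))
den = sum((x - mean_x) ** 2 for x in months)
b = num / den
a = mean_y - b * mean_x

print(f"fitted model: y = {a:.1f} + {b:.1f} * month")
print("prediction for month 7:", round(a + b * 7))

# Very simple deviation detection: flag points far from the fitted trend
# (with this toy data, none are flagged).
for x, y in zip(months, transactions):
    residual = y - (a + b * x)
    if abs(residual) > 10:          # arbitrary threshold for illustration
        print(f"month {x}: unusually far from trend (residual {residual:+.1f})")
```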
100 Information Engineering Terms in English and Chinese

100个信息工程专业术语中英文全文共3篇示例,供读者参考篇1Information Engineering is a rapidly growing field that encompasses a wide range of technologies and concepts. With the increasing importance of digital data and communications in today’s world, professionals in the field of Information Engineering need to be well-versed in a variety of technical terms and concepts. In this article, we will explore 100 key Information Engineering terms in both English and Chinese.1. Algorithm 算法2. ASCII American Standard Code for Information Interchange 美国信息交换标准代码3. Back End Development 后端开发4. Binary Code 二进制代码5. Big Data 大数据6. Blockchain 区块链7. Cloud Computing 云计算8. Code 代码9. Compiler 编译器10. Cybersecurity 网络安全11. Data Mining 数据挖掘12. Database 数据库13. Debugging 调试14. Deep Learning 深度学习15. DNS Domain Name System 域名系统16. Encryption 加密17. Firewall 防火墙18. Front End Development 前端开发19. GUI Graphic User Interface 图形用户界面20. HTML Hypertext Markup Language 超文本标记语言21. HTTP Hypertext Transfer Protocol 超文本传输协议22. HTTPS Hypertext Transfer Protocol Secure 安全超文本传输协议23. Information System 信息系统24. IoT Internet of Things 物联网25. IP Internet Protocol 互联网协议26. Java Script 脚本语言27. LAN Local Area Network 局域网28. Machine Learning 机器学习29. Malware 恶意软件30. Meta Data 元数据31. Network 网络32. Object-Oriented Programming 面向对象编程33. Operating System 操作系统34. PHP Hypertext Preprocessor 超文本预处理器35. Python 编程语言36. Query 查询37. Responsive Design 响应式设计38. Ruby on Rails 开发框架39. Server 服务器40. SQL Structured Query Language 结构化查询语言41. SSL Secure Sockets Layer 安全套接字层42. UI User Interface 用户界面43. URL Uniform Resource Locator 统一资源定位符44. UX User Experience 用户体验45. Virtual Reality 虚拟现实46. VPN Virtual Private Network 虚拟私人网络47. WAN Wide Area Network 广域网48. Web Development 网页开发49. XML Extensible Markup Language 可扩展标记语言50. Agile Development 敏捷开发51. API Application Programming Interface 应用程序接口52. ASP Active Server Pages53. Cache 缓存54. CLI Command Line Interface 命令行界面55. CSS Cascading Style Sheets 层叠样式表56. DHCP Dynamic Host Configuration Protocol 动态主机配置协议57. Document Object Model 文档对象模型58. Ethernet 以太网59. FTP File Transfer Protocol 文件传输协议60. Gigabyte 千兆字节61. Hacking 黑客攻击62. IDE Integrated Development Environment 集成开发环境63. IP Address Internet Protocol Address 互联网协议地址64. Java 编程语言65. LAN Local Area Network 局域网66. Megabyte 百万字节67. Malware 恶意软件68. NAT Network Address Translation 网络地址转换69. P2P Peer to Peer 对等网络70. Repository 存储库71. Serialization 序列化72. SDK Software Development Kit 软件开发工具包73. TCP/IP Transmission Control Protocol/Internet Protocol 传输控制协议/互联网协议74. URL Uniform Resource Locator 统一资源定位符75. Virtual Machine 虚拟机76. VPN Virtual Private Network 虚拟私人网络77. Wireless 网络78. XML Extensible Markup Language 可扩展标记语言79. Algorithm 算法80. API Application Programming Interface 应用程序接口81. Agile Development 敏捷开发82. Back End Development 后端开发83. Big Data 大数据84. Cloud Computing 云计算85. HTML Hypertext Markup Language 超文本标记语言86. HTTP Hypertext Transfer Protocol 超文本传输协议87. JavaScript 脚本语言88. LAN Local Area Network 局域网89. Machine Learning 机器学习90. Malware 恶意软件91. Meta Data 元数据92. Network 网络93. Operating System 操作系统94. PHP Hypertext Preprocessor 超文本预处理器95. Query 查询96. SQL Structured Query Language 结构化查询语言97. UI User Interface 用户界面98. URL Uniform Resource Locator 统一资源定位符99. UX User Experience 用户体验100. WAN Wide Area Network 广域网These terms are just a starting point for those interested in Information Engineering, and there are many more concepts and technologies to explore in this vast and ever-changing field. 
By familiarizing yourself with these key terms in both English and Chinese, you will be better equipped to navigate the world of Information Engineering and stay up-to-date with the latest developments in technology.篇2Information technology is a constantly evolving field, with new technologies and terms emerging all the time. As professionals in the industry, it is important to stay up-to-date with the latest terms and concepts. In this article, we will explore 100 key information engineering terms in both English and Chinese.1. Artificial Intelligence (AI) - 人工智能2. Big Data - 大数据3. Cloud Computing - 云计算4. Cybersecurity - 网络安全5. Data Analytics - 数据分析6. Database Management System (DBMS) - 数据库管理系统7. DevOps - 开发运营8. Internet of Things (IoT) - 物联网9. Machine Learning - 机器学习10. Mobile App Development - 移动应用开发11. Agile - 敏捷12. API (Application Programming Interface) - 应用程序接口13. Backend - 后端14. Frontend - 前端15. HTML (Hypertext Markup Language) - 超文本标记语言16. CSS (Cascading Style Sheets) - 层叠样式表17. JavaScript - JavaScript18. Python - Python19. Java - Java20. C++ - C++21. Blockchain - 区块链22. Cryptocurrency - 加密货币23. Digital Transformation - 数字转型24. E-commerce - 电子商务25. Firewall - 防火墙26. GUI (Graphical User Interface) - 图形用户界面27. Hadoop - Hadoop28. Virtual Reality (VR) - 虚拟现实29. Augmented Reality (AR) - 增强现实30. Wearable Technology - 可穿戴技术31. API Economy - API经济32. Agile Development - 敏捷开发33. Back-End Development - 后端开发34. Cloud-Native - 云本地35. Data Mining - 数据挖掘36. Deep Learning - 深度学习37. DevOps Engineer - 开发运营工程师38. Docker - Docker39. Full-Stack Developer - 全栈开发者40. Git - Git41. Internet Protocol (IP) - 互联网协议42. Load Balancer - 负载均衡器43. Microservices - 微服务44. Network Security - 网络安全45. Open Source - 开源46. Responsive Design - 响应式设计47. SAAS (Software as a Service) - 作为服务的软件48. Scrum - Scrum49. Serverless - 无服务器50. SQL (Structured Query Language) - 结构化查询语言51. Virtual Machine - 虚拟机52. Agile Methodology - 敏捷方法论53. API Integration - API集成54. Big Data Analytics - 大数据分析55. Blockchain Technology - 区块链技术56. Cloud Storage - 云存储57. Cyber Attack - 网络攻击58. Data Security - 数据安全59. DevOps Tools - 开发运营工具60. Front-End Development - 前端开发61. GitLab - GitLab62. IoT Devices - 物联网设备63. JavaScript Frameworks - JavaScript框架64. Network Administrator - 网络管理员65. Open Source Software - 开源软件66. Responsive Web Design - 响应式网页设计67. SAAS Provider - SAAS供应商68. Scrum Master - Scrum主管69. Software Development Life Cycle (SDLC) - 软件开发生命周期70. SQL Database - SQL数据库71. VPN (Virtual Private Network) - 虚拟专用网72. Artificial Neural Network - 人工神经网络73. BI (Business Intelligence) - 商业智能74. Cloud Migration - 云迁移75. Data Science - 数据科学76. Docker Container - Docker容器77. Full-Stack Development - 全栈开发78. GitHub - GitHub79. IT Infrastructure - IT基础设施80. Machine Learning Algorithm - 机器学习算法81. Agile Scrum - 敏捷Scrum82. API Development - API开发83. Application Development - 应用程序开发84. Blockchain Developer - 区块链开发者85. Cloud Architecture - 云架构86. Cybersecurity Threat - 网络安全威胁87. Data Encryption - 数据加密88. DevOps Culture - 开发运营文化89. Front-End Frameworks - 前端框架90. Graphical User Interface (GUI) Design - 图形用户界面设计91. Internet Security - 互联网安全92. Machine Learning Model - 机器学习模型93. Network Protocol - 网络协议94. Open Source Development - 开源开发95. Responsive Web Development - 响应式网页开发96. SAAS Application - SAAS应用97. Scrum Team - Scrum团队98. Software Development Tools - 软件开发工具99. SQL Query - SQL查询100. 
Virtualization Technology - 虚拟化技术These 100 terms cover a wide range of topics in the information engineering field, from programming languages and development methodologies to cybersecurity and data analytics. By understanding and familiarizing yourself with these terms, you can better navigate the rapidly evolving landscape of information technology and stay ahead of the curve in this dynamic industry.篇3Information technology is a fast-growing field that encompasses a wide range of specialized terms and concepts. Having a good understanding of these terms can greatly benefit professionals in the industry. In this article, we will introduce 100 common information engineering terms in English.1. Algorithm - a step-by-step procedure for solving a problem2. API (Application Programming Interface) - a set of rules and protocols for building software applications3. Bandwidth - the amount of data that can be transmitted ina fixed amount of time4. Big Data - extremely large data sets that may be analyzed to reveal patterns and trends5. Blockchain - a decentralized digital ledger that records transactions across multiple computers6. Cloud Computing - the delivery of computing services over the internet7. Cybersecurity - the practice of protecting systems, networks, and data from digital attacks8. Data Mining - the process of analyzing large data sets to discover patterns and trends9. Database - a collection of structured data that is stored and accessed electronically10. Debugging - the process of identifying and fixing errors in software code11. Deep Learning - a subset of machine learning that uses neural networks with multiple layers12. DNS (Domain Name System) - a system that translates domain names into IP addresses13. Encryption - the process of converting data into a code to prevent unauthorized access14. Firewall - a network security system that monitors and controls incoming and outgoing network traffic15. HTML (Hypertext Markup Language) - the standard markup language for creating web pages16. IoT (Internet of Things) - the network of physical devices embedded with sensors and software17. Machine Learning - a type of artificial intelligence that enables computers to learn and improve from experience18. Metadata - data that provides information about other data19. Network - a collection of interconnected computers and other devices20. Operating System - software that manages computer hardware and software resources21. Open Source - software that is freely available for use, modification, and distribution22. PHP (Hypertext Preprocessor) - a server-side scripting language used for web development23. Python - a high-level programming language known for its simplicity and readability24. RDBMS (Relational Database Management System) - a type of database management system that stores data in tables25. Scrum - an agile project management framework for managing software development projects26. SDLC (Software Development Life Cycle) - a series of phases that software goes through from concept to delivery27. SEO (Search Engine Optimization) - the process of improving a website's visibility in search engine results28. Server - a computer or software that provides services to other computers over a network29. Software - a collection of programs and data that tell a computer how to operate30. SQL (Structured Query Language) - a programming language used for managing data in relational databases31. 
SSL (Secure Sockets Layer) - a standard security technology for establishing an encrypted link between a web server and a browser32. TCP/IP (Transmission Control Protocol/Internet Protocol) - the suite of protocols that governs how data is transmitted on the internet33. Virtualization - the creation of a virtual version of a device or resource34. VPN (Virtual Private Network) - a secure network that enables users to access the internet privately and securely35. Web Development - the process of creating websites and web applications36. XML (eXtensible Markup Language) - a markup language that defines a set of rules for encoding documents in a format that is readable by humans and machinesThis is just a small sampling of the many terms and concepts that are important in the field of information engineering. By familiarizing yourself with these terms and understanding their meanings, you can improve your knowledge and expertise in the industry. Whether you are a student studying information technology or a professional working in the field, having a solidunderstanding of these terms can help you succeed in your career.。
6-data mining(1)
Part II Data MiningOutlineThe Concept of Data Mining(数据挖掘概念) Architecture of a Typical Data Mining System (数据挖掘系统结构)What can be Mined? (能挖掘什么?)Major Issues(主要问题)in Data MiningData Cleaning(数据清理)3What Is Data Mining?Data mining is the process of discovering interesting knowledge from large amounts of data. (数据挖掘是从大量数据中发现有趣知识的过程) The main difference that separates information retrieval apart from data mining is their goals. (数据挖掘和信息检索的主要差别在于他们的目标) Information retrieval is to help users search for documents or data that satisfy their information needs(信息检索帮用户寻找他们需要的文档/数据)e.g. Find customers who have purchased more than $10,000 in the last month .(查找上个月购物量超过1万美元的客户)Data mining discovers useful knowledge by analyzing data correlations using sophisticated data mining techniques(数据挖掘用复杂技术分析…)e.g. Find all items which are frequently purchased with milk .(查找经常和牛奶被购买的商品)A KDD Process (1) Some people view data mining as synonymous5A KDD Process (2)Learning the application domain (学习应用领域相关知识):Relevant knowledge & goals of application (相关知识和目标) Creating a target data set (建立目标数据集) Data selection, Data cleaning and preprocessing (预处理)Choosing functions of data mining (选择数据挖掘功能)Summarization, classification, association, clustering , etc.Choosing the mining algorithm(s) (选择挖掘算法)Data mining (进行数据挖掘): search for patterns of interest Pattern evaluation and knowledge presentation (模式评估和知识表示)Removing redundant patterns, visualization, transformation, etc.Present results to user in meaningful manner.Use of discovered knowledge (使用所发现的知识)7Concept/class description (概念/类描述)Characterization(特征): provide a summarization of the given data set Comparison(区分): mine distinguishing characteristics(挖掘区别特征)that differentiate a target class from comparable contrasting classes. Association rules (correlation and causality)(关联规则)Association rules are of the form(这种形式的规则): X ⇒Y,Examples: contains(T, “computer”) ⇒contains(T, “software”)[support = 1%, confidence = 50%]age(X, “20..29”) ∧income(X, “20..29K ”) ⇒buys(X, “PC ”)[support = 2%, confidence = 60%]Classification and Prediction (分类和预测)Find models that describe and distinguish classes for future prediction.What kinds of patterns can be mined?(1)What kinds of patterns can be mined?(2)Cluster(聚类)Group data to form some classes(将数据聚合成一些类)Principle: maximizing the intra-class similarity and minimizing the interclass similarity (原则: 最大化类内相似度,最小化类间相似度) Outlier analysis: objects that do not comply with the general behavior / data model. 
(局外者分析: 发现与一般行为或数据模型不一致的对象) Trend and evolution analysis (趋势和演变分析)Sequential pattern mining(序列模式挖掘)Regression analysis(回归分析)Periodicity analysis(周期分析)Similarity-based analysis(基于相似度分析)What kinds of patterns can be mined?(3)In the context of text and Web mining, the knowledge also includes: (在文本挖掘或web挖掘中还可以发现)Word association (术语关联)Web resource discovery (WEB资源发现)News Event (新闻事件)Browsing behavior (浏览行为)Online communities (网上社团)Mining Web link structures to identify authoritative Web pages finding spam sites (发现垃圾网站)Opinion Mining (观点挖掘)…10Major Issues in Data Mining (1)Mining methodology(挖掘方法)and user interactionMining different kinds of knowledge in DBs (从DB 挖掘不同类型知识) Interactive mining of knowledge at multiple levels of abstraction (在多个抽象层上交互挖掘知识)Incorporation of background knowledge (结合背景知识)Data mining query languages (数据挖掘查询语言)Presentation and visualization of data mining results(结果可视化表示) Handling noise and incomplete data (处理噪音和不完全数据) Pattern evaluation (模式评估)Performance and scalability (性能和可伸缩性) Efficiency(有效性)and scalability(可伸缩性)of data mining algorithmsParallel(并行), distributed(分布) & incremental(增量)mining methods©Wu Yangyang 11Major Issues in Data Mining (2)Issues relating to the diversity of data types (数据多样性相关问题)Handling relational and complex types of data (关系和复杂类型数据) Mining information from heterogeneous databases and www(异质异构) Issues related to applications (应用相关的问题) Application of discovered knowledge (所发现知识的应用)Domain-specific data mining tools (面向特定领域的挖掘工具)Intelligent query answering (智能问答) Process control(过程控制)and decision making(决策制定)Integration of the discovered knowledge with existing knowledge:A knowledge fusion problem (知识融合)Protection of data security(数据安全), integrity(完整性), and privacy12CulturesDatabases: concentrate on large-scale (non-main-memory) data.(数据库:关注大规模数据)To a database person, data-mining is an extreme form of analytic processing. Result is the data that answers the query.(对数据库工作者而言数据挖掘是一种分析处理, 其结果就是问题答案) AI (machine-learning): concentrate on complex methods, small data.(人工智能(机器学习):关注复杂方法,小数据)Statistics: concentrate on models. (统计:关注模型.)To a statistician, data-mining is the inference of models. Result is the parameters of the model (数据挖掘是模型推论, 其结果是一些模型参数)e.g. Given a billion numbers, a statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation.©Wu Yangyang 13Data Cleaning (1)Data Preprocessing (数据预处理):Cleaning, integration, transformation, reduction, discretization (离散化) Why data cleaning? (为什么要清理数据?)--No quality data, no quality mining results! Garbage in, Garbage out! 
Measure of data quality (数据质量的度量标准)Accuracy (正确性)Completeness (完整性)Consistency(一致)Timeliness(适时)Believability(可信)Interpretability(可解释性) Accessibility(可存取性)14Data Cleaning (2)Data in the real world is dirtyIncomplete (不完全):Lacking some attribute values (缺少一些属性值)Lacking certain interest attributes /containing only aggregate data(缺少某些有用属性或只包含聚集数据)Noisy(有噪音): containing errors or outliers(包含错误或异常) Inconsistent: containing discrepancies in codes or names(不一致: 编码或名称存在差异)Major tasks in data cleaning (数据清理的主要任务)Fill in missing values (补上缺少的值)Identify outliers(识别出异常值)and smooth out noisy data(消除噪音)Correct inconsistent data(校正不一致数据) Resolve redundancy caused by data integration (消除集成产生的冗余)15Data Cleaning (3)Handle missing values (处理缺值问题) Ignore the tuple (忽略该元组) Fill in the missing value manually (人工填补) Use a global constant to fill in the missing value (用全局常量填补) Use the attribute mean to fill in the missing value (该属性平均值填补) Use the attribute mean for all samples belonging to the same class to fill in the missing value (用同类的属性平均值填补) Use the most probable value(最大可能的值)to fill in the missing value Identify outliers and smooth out noisy data(识别异常值和消除噪音)Binning method (分箱方法):First sort data and partition into bins (先排序、分箱)Then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.(然后用平均值、中值、边界值平滑)©Wu Yangyang 16Data Cleaning (4)Example: Sorted data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 Partition into (equi-depth) bins (分成等深的箱):-Bin 1: 4, 8, 9, 15-Bin 2: 21, 21, 24, 25-Bin 3: 26, 28, 29, 34Smoothing by bin means (用平均值平滑):-Bin 1: 9, 9, 9, 9-Bin 2: 23, 23, 23, 23-Bin 3: 29, 29, 29, 29Smoothing by bin boundaries (用边界值平滑):-Bin 1: 4, 4, 4, 15-Bin 2: 21, 21, 25, 25-Bin 3: 26, 26, 26, 34Clustering (。
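The binning example above can be reproduced directly in code. The sketch below (a minimal illustration, not taken from the original lecture notes) sorts the twelve values, partitions them into three equi-depth bins, and smooths by bin means and by bin boundaries, reproducing the numbers shown.

```python
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equi_depth_bins(values, n_bins):
    # Sort and split into bins of equal depth (equal number of values).
    values = sorted(values)
    depth = len(values) // n_bins
    return [values[i * depth:(i + 1) * depth] for i in range(n_bins)]

def smooth_by_means(bins):
    # Replace every value in a bin by the (rounded) bin mean.
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value by whichever bin boundary is closer.
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

bins = equi_depth_bins(data, 3)
print("bins:              ", bins)
print("smoothed by means: ", smooth_by_means(bins))
print("smoothed by bounds:", smooth_by_boundaries(bins))
```

Running it prints the same three equi-depth bins and the same mean- and boundary-smoothed bins listed above.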
WEB MINING
A KDD Process
Steps of a KDD Process (1): Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies (a minimal sketch follows this list)
• Data integration
– Integration of multiple databases, data cubes
• Data transformation
• Data reduction
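The sketch referenced in the data-cleaning item above is given here: a minimal, plain-Python illustration (with hypothetical income values) that fills a missing value with the attribute mean and flags outliers using a simple two-standard-deviation rule.

```python
from statistics import mean, pstdev

incomes = [75_000, 50_000, None, 90_000, 40_000, 80_000, 1_500_000]

# 1) Fill in the missing value with the attribute mean of the observed values.
observed = [v for v in incomes if v is not None]
fill = mean(observed)
cleaned = [v if v is not None else fill for v in incomes]

# 2) Identify outliers: values more than 2 standard deviations from the mean.
mu, sigma = mean(cleaned), pstdev(cleaned)
outliers = [v for v in cleaned if abs(v - mu) > 2 * sigma]

print("filled value:", round(fill))
print("outliers:    ", outliers)
```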
Why Data Preprocessing?
• No quality data, no quality mining results!
[Slide table residue: example training records with columns for Marital Status (Single/Married/Divorced), Taxable Income (40K–150K), a Yes/No attribute, and a class label "Cheat"; the original column layout could not be recovered.]
Data Mining Models and Tasks
Model / Classifier
Patterns
Two Types of Data Mining
• Supervised
– Know specifically what we are looking for
• Who is likely to respond to our offer?
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
extracted
extractedExtractedIntroductionIn today's digital age, the extraction of important information from various sources has become a vital aspect of many industries. Whether it is the analysis of large datasets, the extraction of key insights from research papers, or the retrieval of specific information from text documents, the process of extraction plays a crucial role in shaping decision-making processes and driving innovation. This document aims to explore the concept of extraction, its significance, and the methods used to extract relevant information.Understanding ExtractionExtraction, in simple terms, refers to the process of isolating and obtaining specific information or data from a given source. This can involve extracting text, numerical values, or other structured data from databases, web pages, documents, or any other source that contains relevant information. Theextracted data can then be analyzed, processed, and utilized for various purposes.Significance of ExtractionThe extraction of relevant information holds immense significance in numerous industries and domains. Let's explore a few areas where extraction plays a crucial role:1. Business Intelligence:Extraction is a fundamental step in business intelligence, where data analysts extract, transform, and load (ETL) data from multiple sources into a centralized database. This enables organizations to gain valuable insights into their operations, customers, and markets, helping them make informed decisions.2. Scientific Research:Researchers often rely on extraction techniques to extract relevant data from a vast array of research papers, articles, and publications. This allows them to analyze, compare, and synthesize information, aiding in the development of new theories or advancements in various scientific fields.3. Information Retrieval:In the digital age, the extraction of specific information from vast amounts of unstructured text is crucial for efficient information retrieval. Search engines, for example, use extraction techniques to identify and retrieve relevant web pages or documents based on user queries.Methods of ExtractionExtracting information from diverse sources requires the use of various methods and techniques. Here are some commonly used methods:1. Text Extraction:Text extraction utilizes natural language processing (NLP) techniques to extract relevant information from textual sources. This can involve identifying key phrases, entities, or concepts, extracting specific data from structured documents, or even summarizing lengthy texts.2. Web Scraping:Web scraping involves automatically extracting data from websites. This method uses specialized tools and techniquesto scrape and extract specific information, such as product details, customer reviews, or financial data, from web pages.3. Data Mining:Data mining techniques aim to extract patterns and insights from large datasets. By using advanced algorithms and statistical analysis, data mining helps identify hidden relationships or trends within the data, enabling businesses to make data-driven decisions.4. Image and Video Processing:Extraction is not limited to text-based information; it can also involve analyzing images or videos. Image and video processing techniques, such as object recognition or motion detection, allow for the extraction of valuable data from multimedia sources.Challenges and ConsiderationsWhile extraction techniques offer immense opportunities, there are several challenges and considerations that need to be addressed:1. 
Data Quality:Ensuring the accuracy and reliability of extracted data is crucial. Data inconsistencies, noise, or missing values can significantly impact the validity of insights drawn from extracted data. Therefore, implementing quality control measures is essential.2. Legal and Ethical Considerations:Extraction must comply with legal and ethical guidelines. Individuals' privacy, intellectual property rights, and data usage restrictions must be respected, and appropriate consent must be obtained when extracting data-related information.3. Technical Limitations:Extraction processes can be computationally intensive, requiring powerful hardware and storage capabilities. Additionally, handling large datasets may pose challenges in terms of processing time and storage requirements.ConclusionExtraction plays a crucial role in various domains, enabling organizations to gain valuable insights, make informeddecisions, and drive innovation. By employing different extraction techniques, such as text extraction, web scraping, data mining, and image processing, businesses and researchers can obtain specific information from diverse sources. However, it is essential to address challenges related to data quality, legal and ethical considerations, and technical limitations to ensure the reliability and validity of extracted information. Overall, extraction has become an indispensable tool in the modern information-driven era.。
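As a small illustration of the "Web Scraping" and "Text Extraction" methods described above, the sketch below fetches a page and pulls out its title and paragraph text. It assumes the third-party requests and BeautifulSoup (bs4) libraries are installed, and the URL is a placeholder rather than a real data source.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/products"   # placeholder URL, not a real data source

def extract_page_text(url):
    """Fetch a page and extract its title and visible paragraph text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return {"title": title, "paragraphs": paragraphs}

if __name__ == "__main__":
    data = extract_page_text(URL)
    print("Title:", data["title"])
    for text in data["paragraphs"][:5]:
        print("-", text)
```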
Mining Data Records in Web Pages

Bing Liu, Department of Computer Science, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607-7053, liub@
Robert Grossman, Dept. of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, 851 S. Morgan Street, IL 60607, grossman@
Yanhong Zhai, Department of Computer Science, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607-7053, yzhai@

ABSTRACT

A large amount of information on the Web is contained in regularly structured objects, which we call data records. Such data records are important because they often present the essential information of their host pages, e.g., lists of products and services. It is useful to mine such data records in order to extract information from them to provide value-added services. Existing approaches to solving this problem mainly include the manual approach, supervised learning, and automatic techniques. The manual method is not scalable to a large number of pages. Supervised learning needs manually prepared positive and negative training data and thus also requires substantial human effort. Current automatic techniques are still unsatisfactory because of their poor performance. In this paper, we propose a much more effective automatic technique to perform the task. This technique is based on two important observations about data records on the Web and a string matching algorithm. The proposed technique is able to mine both contiguous and non-contiguous data records. By non-contiguous data records, we mean that two or more data records intertwine in terms of their HTML code; when they are displayed in a browser, each of them appears contiguous. Existing techniques are unable to mine such records. Our experimental results show that the proposed technique outperforms existing techniques substantially.

Categories and Subject Descriptors
I.5 [Pattern Recognition]: statistical and structural.
H.2.8 [Database Applications]: data mining

Keywords
Web data records, Web mining

1. INTRODUCTION

A large amount of information on the Web is presented in regularly structured objects. A list of such objects in a Web page often describes a list of similar items, e.g., a list of products or services. They can be regarded as database records displayed in Web pages with regular patterns. In this paper, we also call them data records. Mining data records in Web pages is useful because it allows us to extract and integrate information from multiple sources to provide value-added services, e.g., customizable Web information gathering, comparison shopping, meta-search, etc. Figure 1 gives an example, which is a segment of a Web page that lists two Apple notebooks. The full description of each notebook is a data record. The objective of the proposed technique is to automatically mine all the data records in a given Web page.

Figure 1. An example: two data records

Several approaches have been reported in the literature for mining data records (or their boundaries) from Web pages. The first approach is the manual approach: by observing a Web page and its source code, a programmer can find some patterns and then write a program to identify each data record. This approach is not scalable to a large number of pages. Other approaches [2][4][6][7][8][9][10][12][14][16][17][18][19][21][22][23] all have some degree of automation. They rely on some specific HTML tags and/or machine learning techniques to separate objects.
These methods either require prior syntactic knowledge or human labeling of specific regions in the Web page to mark them as interesting. [10] presents an automatic method which uses a set of heuristics and domain ontology to perform the task. Domain ontology is costly to build (about 2 man-weeks for a given Web site) [10]. [2] extends this approach by designing some additional heuristics without using any domain knowledge. We will show in the experiment section that the performance of this approach is poor. [4] proposes another automatic method, which uses a Patricia tree [11] and approximate sequence alignment to find patterns (which represent a set of data records) in a Web page. Due to the inherent limitations of the Patricia tree and inexact sequence matching, it often produces many patterns, most of them spurious. In many cases, none of the actual data records is found. Again, this method performs poorly. [19] proposes a method using clustering and grammar induction of regular languages. As shown in [19], the results are not satisfactory. More detailed discussions of these and other related works are given in the next section.

Another problem with existing automatic approaches is that they assume that the relevant information of a data record is contained in a contiguous segment of the HTML code. This model is insufficient because in some Web pages the description of one object (or data record) may intertwine with the descriptions of other objects in the HTML source code. For example, the descriptions of two objects in the HTML source may follow the sequence: part 1 of object 1, part 1 of object 2, part 2 of object 1, part 2 of object 2. Thus, the descriptions of both object 1 and object 2 are not contiguous. However, when they are displayed in a Web browser, they appear contiguous to human viewers.

In this paper, we propose a novel and more effective method to mine data records in a Web page automatically. The algorithm is called MDR (Mining Data Records in Web pages). It currently finds all data records formed by table- and form-related tags, i.e., table, form, tr, td, etc. A large majority of Web data records are formed by them. Our method is based on two observations:

1. A group of data records that contains descriptions of a set of similar objects is typically presented in a contiguous region of a page and is formatted using similar HTML tags. Such a region is called a data record region (or data region for short). For example, in Figure 1 the two notebooks are presented in one contiguous region, and they are formatted using almost the same sequence of HTML tags. If we regard the HTML formatting tags of a page as a long string, we can use string matching to compare different sub-strings and find similar ones, which may represent similar objects or data records. The problem with this approach alone is that the computation is prohibitive, because a data record can start anywhere and end anywhere.
A set of data records typically do not have the same length in terms of their tag strings, because they may not contain exactly the same pieces of information (see Figure 1). The next observation helps us deal with this problem.

2. The nested structure of HTML tags in a Web page naturally forms a tag tree [3]. Our second observation is that a group of similar data records placed in a specific region is reflected in the tag tree by the fact that they are under one parent node, although we do not know which parent (our algorithm will find out). For example, the tag tree for the page in Figure 1 is given in Figure 2 (some details are omitted). Each notebook (a data record) in Figure 1 is wrapped in 5 TR nodes with their sub-trees under the same parent node TBODY (Figure 2). The two data records are in the two dash-lined boxes. In other words, a set of similar data records is formed by some child sub-trees of the same parent node. A further note is that it is very unlikely that a data record starts inside one child sub-tree and ends inside another child sub-tree. Instead, it starts from the beginning of a child sub-tree and ends at the same or a later child sub-tree. For example, it is unlikely that a data record starts at TD* and ends at TD# (Figure 2). This observation makes it possible to design a very efficient algorithm to mine data records.

Our experiments show that these observations hold. So far, we have not seen any Web page containing a list of data records that violates them. Note that we do not assume that a Web page has only one data region that contains data records; in fact, a Web page may contain several data regions, and different regions may have different data records. Our method only requires that a data region have two or more data records.

Given a Web page, the proposed technique works in three steps:
Step 1: Building an HTML tag tree of the page.
Step 2: Mining data regions in the page using the tag tree and string comparison. Note that instead of mining data records directly, which is hard, our method mines data regions first and then finds the data records within them. For example, in Figure 2, we first find the single data region below node TBODY.
Step 3: Identifying data records from each data region. For example, in Figure 2, this step finds data record 1 and data record 2 in the data region below node TBODY.

Figure 2. Tag tree of the page in Figure 1

Our contributions
1. A novel and effective technique is proposed to automatically mine data records in a Web page. Extensive evaluation using a large number of Web pages from diverse domains shows that the proposed technique produces dramatically better results than the state-of-the-art automatic systems in [2][4] (even without considering non-contiguous data records).
2. Our new method is able to discover non-contiguous data records, which cannot be handled by existing methods. Our technique is able to handle such cases because it explores the nested structure and presentation features of Web pages.

2. RELATED WORK

Web information extraction from regularly structured data records is an important problem, and identifying individual data records is often the first step. So far, several attempts have been made to deal with the problem; we discuss them below. Related works to ours are mainly in the area of wrapper generation. A wrapper is a program that extracts data from a website and puts it in a database [3][12][13][16]. There are several approaches to wrapper generation.
The first approach is to manually write a wrapper for each Web page based on observed format patterns of the page. This approach is labor intensive and very time consuming, and it cannot scale to a large number of pages. The second approach is wrapper induction, which uses supervised learning to learn data extraction rules. A wrapper induction system is typically trained using manually labeled positive and negative data. Thus, it still needs substantial manual effort, as labeling data is also labor intensive and time consuming. Additionally, for different sites, or even different pages in the same site, the manual labeling process may need to be repeated. Example wrapper induction systems include WIEN [17], Softmealy [14], Stalker [21], WL [6], etc. Our technique requires no human involvement; it mines data records in a page automatically. [19] reports an unsupervised method based on clustering and grammar induction. However, the results are unsatisfactory due to the deficiencies of the techniques, as noted in [19].

[9] reports a comparison-shopping agent, which also tries to identify each product from search results. A number of heuristic rules are used to find individual products, e.g., price information and required attributes from the application domain.

In [10], a more general study is made to automatically identify data record boundaries. The method is based on 5 heuristic rules:
(1) Highest-count tags: the highest-count tags are more likely to be record boundary tags.
(2) Identifiable "separator" tags: tags such as hr, td, tr, table, p, h1, etc., are more likely to be boundary tags.
(3) Standard deviation: this rule computes the number of characters between each occurrence of a candidate separator tag; those with the smallest standard deviation are assumed to be more likely boundary tags.
(4) Repeating-tag pattern: boundaries often have consistent patterns of two or more adjacent tags.
(5) Ontology-matching: this rule uses domain knowledge to find those domain-specific terms that appear once and only once in each data record.
A combination technique (based on the certainty factor [10] in AI) is used to combine the heuristics and make the final decision.

[2] proposes some more heuristics to perform the task, e.g., a sibling-tag heuristic, which counts the pairs of tags that are immediate siblings in a tag tree, and a partial-path heuristic, which lists the paths from a node to all other reachable nodes and counts the number of occurrences of each path. It is shown in [2] that the new method (OMINI) performs better than the system in [10], even without using domain ontology. Our technique is very different from these tag-based heuristic techniques, and we will show that our method outperforms these existing methods dramatically.

[4] proposes a method (IEPAD) to find patterns from the HTML tag string and then use the patterns to extract objects. The method uses a PAT tree (a Patricia tree [11]) to find patterns. The problem with the PAT tree is that it can only find exact matches of patterns, whereas in the context of the Web data records are seldom exactly the same. Thus, [4] also proposes a heuristic method based on string alignment to find inexact matches. However, this method produces many patterns, most of which are spurious; for example, for the same segment of a tag string many patterns may be found, and they intersect one another. In many cases the right patterns in the page are not found.
As we will see in the experiment section, the result of this method is also poor. Finally, all the above automatic methods assume that each record is contiguous in the HTML source. This is not true in some Web pages, as we will see in Sections 3.3 and 4. Our method does not make this assumption and is thus able to deal with the problem.

3. THE PROPOSED TECHNIQUE

As mentioned earlier, the proposed method has three main steps. This section presents them in turn.

3.1 Building the HTML Tag Tree

Web pages are hypertext documents written in HTML that consist of plain text, tags, and links to image, audio and video files, etc. In this work, we only use tags in string comparison to find data records. Most HTML tags work in pairs. Each pair consists of an opening tag and a closing tag (indicated by <> and </> respectively). Within each corresponding tag pair, there can be other pairs of tags, resulting in nested blocks of HTML code. Building a tag tree from a Web page using its HTML code is thus natural. In our tag tree, each pair of tags is considered one node, and the nested pairs of tags within it are the children of the node. This step performs two tasks:
1. Preprocessing of HTML code: some tags do not require closing tags (e.g., <li> and <hr>), so additional closing tags are inserted to ensure all tags are balanced. Next, useless or redundant tags are removed; examples include tags related to HTML comments (<!--), <script>, and others.
2. Building a tag tree: it follows the nested blocks of the HTML tags in the page to build a tag tree. This is fairly easy and we will not discuss it further. Figure 2 shows an example.

3.2 Mining Data Regions

This step mines every data region in a Web page that contains similar data records. Instead of mining data records directly, which is hard, we first mine generalized nodes (defined below) in a page. A sequence of adjacent generalized nodes forms a data region. From each data region, we will identify the actual data records (discussed in Section 3.3). Below, we define generalized nodes and data regions using the HTML tag tree:

Definition: A generalized node (or a node combination) of length r consists of r (r ≥ 1) nodes in the HTML tag tree with the following two properties:
1) the nodes all have the same parent;
2) the nodes are adjacent.

The reason that we introduce the generalized node is to capture the situation where an object (or a data record) may be contained in a few sibling tag nodes rather than one. For example, in Figures 1 and 2, we can see that each notebook is contained in five table rows (or 5 TR nodes). Note that we call each node in the HTML tag tree a tag node to distinguish it from a generalized node.

Definition: A data region is a collection of two or more generalized nodes with the following properties:
1) the generalized nodes all have the same parent;
2) the generalized nodes all have the same length;
3) the generalized nodes are all adjacent;
4) the normalized edit distance (string comparison) between adjacent generalized nodes is less than a fixed threshold.

For example, in Figure 2, we can form two generalized nodes: the first consists of the first 5 children TR nodes of TBODY, and the second consists of the next 5 children TR nodes of TBODY. It is important to notice that although the generalized nodes in a data region have the same length (the same number of children nodes of a parent node in the tag tree), their lower-level nodes in their sub-trees can be quite different.
Thus, they can capture a wide variety of regularly structured objects.

To further explain different kinds of generalized nodes and data regions, we make use of an artificial tag tree in Figure 3. For notational convenience, we do not use actual HTML tag names but ID numbers to denote tag nodes in the tag tree. The shaded areas are generalized nodes. Nodes 5 and 6 are generalized nodes of length 1, and together they define the data region labeled 1 if the edit distance condition 4) is satisfied. Nodes 8, 9 and 10 are also generalized nodes of length 1, and together they define the data region labeled 2 if the edit distance condition 4) is satisfied. The pairs of nodes (14, 15), (16, 17) and (18, 19) are generalized nodes of length 2; together they define the data region labeled 3 if the edit distance condition 4) is satisfied. It should be emphasized that a data region includes the sub-trees of the component nodes, not just the component nodes alone.

Figure 3: An illustration of generalized nodes and data regions

We end this part with some important notes:
1. In practice, the above definitions are very robust, as our experiments show. The key assumption here is that nodes forming a data region are from the same parent, which is reasonable. For example, it is unlikely that a data region starts at node 7 and ends at node 14 (see also Figure 2).
2. A generalized node may not represent a final data record (see Section 3.3). It will be used to find the final data records.
3. It is possible for a single parent node to cover more than one data region, e.g., node 2 in Figure 3. Nodes adjacent to a data region may or may not be part of that region. For example, in Figure 3, nodes 7, 13, and 20 are not part of any data region.

3.2.1 Comparing Generalized Nodes

In order to find each data region in a Web page, the mining algorithm needs to determine the following. (1) Where does the first generalized node of a data region start? For example, in Region 2 of Figure 3, it starts at node 8. (2) How many tag nodes or components does a generalized node in each data region have? For example, in Region 2 of Figure 3, each generalized node has one tag node (or one component).

Let the maximum number of tag nodes that a generalized node can have be K, which is normally a small number (< 10). To answer (1), we can try to find a data region starting from each node sequentially. To answer (2), we can try one node, a two-node combination, …, a K-node combination. That is, we start from each node and perform all 1-node string comparisons, all 2-node string comparisons, and so on (see the example below). We then use the comparison results to identify each data region.

The number of comparisons is actually not very large because:
• Due to our assumption, we only perform comparisons among the children nodes of a parent node. For example, in Figure 3, we do not compare node 8 with node 13.
• Some comparisons done for earlier nodes are the same as for later nodes (see the example below).

We use Figure 4 to illustrate the comparison process. Figure 4 has 10 nodes, which are below a parent node p. We start from each node and perform string comparisons of all possible combinations of component nodes. Let the maximum number of components that a generalized node can have be 3 in this example.

Figure 4: combination and comparison

Start from node 1: We compute the following string comparisons:
• (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10)
• (1-2, 3-4), (3-4, 5-6), (5-6, 7-8), (7-8, 9-10)
• (1-2-3, 4-5-6), (4-5-6, 7-8-9)

(1, 2) means that the tag string of node 1 is compared with the tag string of node 2. The tag string of a node includes all the tags of the sub-tree of the node. For example, in Figure 2, the tag string for the second TR node below TBODY is <TR TD TD … TD TD>, where "…" denotes the sub-string of the sub-tree below the second TD node. The tag string for the third TR node below TBODY is <TR TD TD>. (1-2, 3-4) means that the combined tag string of nodes 1 and 2 is compared with the combined tag string of nodes 3 and 4.

Start from node 2: We only compute:
• (2-3, 4-5), (4-5, 6-7), (6-7, 8-9)
• (2-3-4, 5-6-7), (5-6-7, 8-9-10)
We do not need to do 1-node comparisons because they were done when we started from node 1 above.

Start from node 3: We only need to compute:
• (3-4-5, 6-7-8)
Again, we do not need to do 1-node comparisons. Here, we also do not need to do 2-node comparisons because they were done when we started from node 1.

We do not need to start from any other nodes after node 3 because all the computations have been done. It is fairly easy to prove that the process is complete; the proof is omitted here due to space limitations. The overall algorithm (MDR) for computing all the comparisons at each node of a tag tree is given in Figure 5. It traverses the tag tree from the root downward in a depth-first fashion (lines 3 and 4). At each internal node, procedure CombComp (Figure 6) performs string comparisons of various combinations of the children sub-trees. Line 1 says that the algorithm will not search for data regions if the depth of the sub-tree from Node is 2 or 1, as it is unlikely that a data region is formed with only a single level of tag(s) (data regions are formed by the children of Node).

Algorithm MDR(Node, K)
1 if TreeDepth(Node) >= 3 then
2   CombComp(Node.Children, K);
3 for each ChildNode ∈ Node.Children
4   MDR(ChildNode, K);
5 end
Figure 5: The overall algorithm

The main idea of CombComp has been discussed above. In line 1 of Figure 6, it starts from each node of NodeList; it only needs to try up to the K-th node. In line 2, it compares different combinations of nodes, beginning from the i-component combination up to the K-component combination. Line 3 tests whether there is at least one pair of combinations; if not, no comparison is needed. Lines 4-8 perform string comparisons of various combinations by calling procedure EditDist, which compares two strings using edit distance [1][11].

CombComp(NodeList, K)
1 for (i = 1; i <= K; i++)          /* start from each node */
2   for (j = i; j <= K; j++)        /* comparing different combinations */
3     if NodeList[i+2*j-1] exists then
4       St = i;
5       for (k = i+j; k < Size(NodeList); k = k+j)
6         if NodeList[k+j-1] exists then
7           EditDist(NodeList[St..(k-1)], NodeList[k..(k+j-1)]);
8           St = k;
9 end
Figure 6: The structure comparison algorithm

Assume that the number of elements in NodeList is n. Without considering the edit distance computation, the time complexity of CombComp is O(nK), which is the number of times that we need to run EditDist. To see this, let us do the following analysis (we assume that n is much larger than K, or n/K ≥ 2). Starting at node 1, we need at most the following number of comparisons (runs of EditDist):
(n − 1) + (n/2 − 1) + (n/3 − 1) + … + (n/K − 1).
Starting from node 2, we have at most:
(n/2 − 1) + (n/3 − 1) + … + (n/K − 1).
……
Starting from node K, we have at most:
(n/K − 1).
Adding all these together, we have nK − K(K+1)/2, which gives O(nK) (we assume that n is much larger than K).
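To make the comparison procedure concrete, the following small Python sketch mirrors CombComp as described in Figure 6 above. It is only an illustration under our own assumptions: the children of a parent are represented simply as a list of tag strings (index 0 unused so that indices match the 1-based pseudocode), and difflib is used as a stand-in for the normalized edit distance of Section 3.2.2.

from difflib import SequenceMatcher

def normalized_distance(s1, s2):
    # Stand-in for EditDist: 1 minus a similarity ratio, not a true edit distance.
    return 1.0 - SequenceMatcher(None, s1, s2).ratio()

def comb_comp(tag_strings, K):
    """Compare combinations of up to K adjacent child sub-trees (Figure 6).
    tag_strings[1..n] are the tag strings of the children of one parent node;
    returns a dict mapping (start of left combo, start of right combo, length) -> distance."""
    n = len(tag_strings) - 1                      # index 0 is unused
    results = {}
    for i in range(1, K + 1):                     # start node of the first combination
        for j in range(i, K + 1):                 # combination length
            if i + 2 * j - 1 <= n:                # at least one pair of j-combinations exists
                st = i
                k = i + j
                while k + j - 1 <= n:             # the right combination must exist
                    left = "".join(tag_strings[st:k])
                    right = "".join(tag_strings[k:k + j])
                    results[(st, k, j)] = normalized_distance(left, right)
                    st = k
                    k += j
    return results

# Example with 10 children, as in Figure 4 (K = 3); the tag strings are invented.
children = [None] + ["<TR><TD></TD><TD></TD></TR>"] * 10
for key, dist in sorted(comb_comp(children, 3).items()):
    print(key, round(dist, 2))

With 10 identical children and K = 3, the keys printed reproduce exactly the comparisons listed above for starting nodes 1, 2 and 3, e.g., (1, 2, 1), (1, 3, 2), (3, 5, 2), (1, 4, 3), and so on.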
Since K is normally small (< 10), the algorithm can be considered linear in n. Assuming that the total number of nodes in the tag tree is N, the complexity of MDR is O(NK) without considering the string comparisons.

3.2.2 String Comparison Using Edit Distance

The string comparison method that we use is based on edit distance (also known as Levenshtein distance) [1][11], which is a widely used string similarity measure. In this work, we use a normalized version of edit distance to compare the similarity between two strings. The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of: (1) change a letter, (2) insert a letter, and (3) delete a letter. As edit distance is a well-known technique, we will not discuss it further in this paper. The normalized edit distance ND(s1, s2) is obtained by normalizing the edit distance of s1 and s2 by their lengths (|s1| and |s2|) [1]. In our application, the computation can be substantially reduced because we are only interested in very similar strings, and the computation is only large when the strings are long. If we want the strings to have a similarity of more than 50%, we can use the following rule to reduce the computation (a small illustrative sketch of this comparison is given at the end of this section):
• If |s1| > 2|s2| or |s2| > 2|s1|, no comparison is needed because they are obviously too dissimilar.

3.2.3 Determining Data Regions

After all string comparisons have been done, we are ready to identify each data region by finding its generalized nodes. We use Figure 7 to illustrate the main issues. There are 8 data records (1-8) in this page. Our algorithm reports each row as a generalized node, and the whole area (the dash-lined box) as a data region.

Figure 7. A possible configuration of data records

The algorithm basically uses the string comparison results at each parent node to find similar children node combinations, which become candidate generalized nodes and data regions of the parent node. Three main issues are important for making the final decisions:
1. If a higher-level data region covers a lower-level data region, we report the higher-level data region and its generalized nodes. Cover here means that a lower-level data region is within a higher-level data region. For example, in Figure 7, at a low level we find that cell 1 and cell 2 are candidate generalized nodes that together form a candidate data region, row 1. However, they are covered by the data region including all 4 rows at a higher level. In this case, we only report each row as a generalized node. The reason for taking this approach is to avoid situations where many very low-level nodes (with very small sub-trees) are very similar but do not represent true data records.
2. A property of similar strings is that if a set of strings s1, s2, s3, …, sn are similar to one another, then a combination of any number of them is also similar to another combination of the same number of them. Thus, we only report generalized nodes of the smallest length that cover a data region, which helps us find the final data records later. In Figure 7, we only report each row as a generalized node rather than a combination of two rows (rows 1-2 and rows 3-4).
3. An edit distance threshold is needed to decide whether two strings are similar. We used a set of training pages to set it to 0.3, which performs very well in general (see Section 4).
The algorithm for this step is given in Figure 8. It finds every data region and its generalized nodes in a page. T is the edit distance threshold and Node is any node.
K is the maximum number of tag nodes in a generalized node (we use 10 in our experiments, which is sufficient). Node.DRs is the set of data regions under Node, and tempDRs is a temporary variable storing the data regions passed up from every Child of Node. Line 1 is the same as line 1 in Figure 5.
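For illustration, the following Python sketch shows one way to implement the string comparison of Section 3.2.2: a standard dynamic-programming Levenshtein distance, a normalized variant, and the length-ratio shortcut used to skip obviously dissimilar strings. The exact normalization used in the paper is not fully legible in this copy, so dividing by the mean of the two lengths is our assumption; the 0.3 threshold is the one reported in Section 3.2.3.

def edit_distance(s1, s2):
    """Classic dynamic-programming Levenshtein distance."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # delete a letter
                         cur[j - 1] + 1,      # insert a letter
                         prev[j - 1] + cost)  # change (or keep) a letter
        prev = cur
    return prev[n]

def normalized_edit_distance(s1, s2):
    """Edit distance divided by the mean length of the two strings (assumed normalization)."""
    if not s1 and not s2:
        return 0.0
    return edit_distance(s1, s2) / ((len(s1) + len(s2)) / 2.0)

def similar(s1, s2, threshold=0.3):
    """True if two tag strings are considered similar (condition 4 of a data region)."""
    # Length-ratio shortcut from Section 3.2.2: strings that differ this much in
    # length cannot be similar enough, so the expensive comparison is skipped.
    if len(s1) > 2 * len(s2) or len(s2) > 2 * len(s1):
        return False
    return normalized_edit_distance(s1, s2) < threshold

# Example with two invented tag strings of adjacent generalized nodes.
a = "<TR><TD><IMG></TD><TD></TD></TR>"
b = "<TR><TD><IMG></TD><TD><B></B></TD></TR>"
print(similar(a, b))   # True: the two tag strings differ only by a small insertion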