数据争用的系统研究(IJITCS-V10-N1-4)

I.J. Information Technology and Computer Science, 2018, 1, 32-39

Published Online January 2018 in MECS (https://www.360docs.net/doc/852385782.html,/)

DOI: 10.5815/ijitcs.2018.01.04

A Systematic Study of Data Wrangling

Malini M. Patil

Associate Professor, Dept. of Information Science and Engineering.

J.S.S Academy of Technical Education, Bengaluru, Karnataka

E-mail: drmalinimpatil@https://www.360docs.net/doc/852385782.html,

Basavaraj N. Hiremath

Research Scholar, Dept. of Computer Science and Engineering,

JSSATE Research Centre, J.S.S Academy of Technical Education, Bengaluru, Karnataka

E-mail: basavaraj@https://www.360docs.net/doc/852385782.html,

Received: 26 September 2017; Accepted: 07 November 2017; Published: 08 January 2018

Abstract—The paper presents the theory, design, usage aspects of data wrangling process used in data ware hous-ing and business intelligence. Data wrangling is defined as an art of data transformation or data preparation. It is a method adapted for basic data management which is to be properly processed, shaped, and is made available for most convenient consumption of data by the potential future users. A large historical data is either aggregated or stored as facts or dimensions in data warehouses to ac-commodate large adhoc queries. Data wrangling enables fast processing of business queries with right solutions to both analysts and end users. The wrangler provides inter-active language and recommends predictive transfor-mation scripts. This helps the user to have an insight of reduction of manual iterative processes. Decision support systems are the best examples here. The methodologies associated in preparing data for mining insights are high-ly influenced by the impact of big data concepts in the data source layer to self-service analytics and visualiza-tion tools.

Index Terms—Business Intelligence, wrangler, prescrip-tive analytics, data integration, predictive transformation.

I.I NTRODUCTION

The evolution of data warehouse (DWH) and business intelligence (BI) started with basic framework of maintaining the wide variety of data sources. In traditional systems, the data warehouse is built to achieve compliance auditing, data analysis, reporting and data mining. A large historical data is either aggregated or stored in facts to accommodate ad hoc queries. In building these dimensional models the basic feature foc used is …clean data? and …integrate data? to have an interaction, when a query is requested from the downstream applications, to envisage meaningful analysis and decision making. This process of cleansing not only relieves the computational related complexities at the business intelligence layer but also on the context of performance. The two key processes involved are detecting discrepancy and transforming them to standard content to carry out to the next level of data warehouse architecture. The wrangler provides interactive transformative language with power of learning and recommending predictive transformational scripts to the users to have an insight into data by reduction in manual iterative processes. There are also tools for learning methodologies to give predictive scripts. Finally, publishing the results of summary data set in a compatible format for a data visualization tool. In metadata management, data integration and cleansing play a vital role. Their role is to understand how to utilize the automated suggestions in changing patterns, be it data type or mismatched values in standardizing data or in identifying missing values.

Data wrangling term is derived and defined as a process to prepare the data for analysis with data visualization aids that accelerates the faster process [1]. It allows reformatting, validating, standardizing, enriching and integration with varieties of data sources which also provides space for self-service by allowing iterative discovery of patterns in the datasets. Wrangling is not dynamic data cleaning [2]. In this process it manages to reduce inconsistency in cleaning incomplete data objects, deleting outliers by identifying abnormal objects. All these methods involve distance metrics to make a consistent dataset. Techniques of data profiling [3] are involved in few of the data quality tools, which consists of iterative processes to clean the corrupted data. These techniques have limitations in profiling the data of multiple relations with dependencies. The optimization techniques and algorithms need to be defined in the tools. Data wrangling refers to …Data preparation? which is associated with business user savvy i.e. self-service capability and enables, faster time to business insights and faster time to action into business solutions to the business end users and analysts in today?s analytics space. As per the recent best practices report from Transforming Data with Intelligence (TDWI) [4], the percentage of time spent in preparing data compared to the time spent in performing analysis is considerable to the tune of 61 per-cent to 80 percentage. The report emphasizes on the challenges like limited data access, poor data quality and delayed data preparation tasks. The best practices fol

lowed are as follows:

1. Evolve as an independent entity without the involvement of information technology experts, as it?s not efficient and time consuming.

2. Data preparation should involve all types of corporate data sets like data warehouses/data lakes, BI data, log data, web data and historical data in documents and reports.

3. To create a hub of data community which eases collaboration for individuals and organization, more informed as agile and productive.

The advent of more statistical and analytical methods to arrive at a decision-making solution of business needs, prompted the use of intensive tools and graphical user interface based reports. This advancement paved the way for in-memory and …data visualization? tools in the indu s-try, like Qlik Sense, Tableau, SiSense, and SAS. But their growth was very fast and limited to …BI reporting area?. On the other hand, managing data quality, integration of multiple data sources in the upstream side of data ware-house was inevitable. Because of the generous sized data, as described by authors [5], a framework of bigdata eco-system has emerged to cater the challenges posed by five dimensions of big data viz., volume, velocity, veracity, value and variety. The trend has triggered growth in tech-nology to move towards, designing of self-service, high throughput systems with reduction in deployment dura-tion in all units of building end to end solution ecosystem. The data preparation layer which focuses on “WRANGLING” of data by data wrangler tools, A data management processes to increase the quality and completeness of overall data, to keep as simple and flexible as possible. The features of data wrangler tools involve natural language based suggestive scripts, transformation predictions, reuse of history scripts, formatting, data lineage and quality, profiling and cleaning of data. The Big data stack includes the following processes to be applied for data [6], data platforms integration, data preparation, analytics, advanced analytics.

In all these activities data preparation consumes more time and effort when compared to other tasks. So, the data preparation work requires self-service tools for data transformation.The basic process of building and working with the data warehouse is managed by extract, transform and loading (ETL) tools. They have focused on primary tasks of removing inconsistencies and errors of data termed as data cleansing [7] in decision support systems. Authors specify need of …potters wheel? approach on cleansing processes in parallel with detecting discrepancy and transforming them to an optimized program in [8].

II.R ELATED W ORK

Wrangle the data into a dataset that provides meaning-ful insights to carryout cleansing process, it requires writ-ing codes in idiosyncratic characters in languages of Perl, R and editing manually with tools like MS-Excel [9]. There are processes, that user might use iterative frame-work to cleanse the data at all stages like analysis in visu-alization layer and on downstream side. The understand-ing of sources of data problems can give us the level of quality of data for example sensor data which is collected has automated edits in a standard format, whereas manu-ally edited data can give relatively higher errors which follows different formats.

In the later years, many research works are carried out in developing query and transformational languages. The authors [10] suggests to focus future of research should be on data source formats, split transforms, complex transforms and format transforms which extracts semi automatically restricted text documents. The research works continued along with advancement in structure and architecture of functioning of ETL tools. These structures helped in building warehousing for data mining [11]. The team of authors [12] from Berkeley and Stanford university made a breakthrough work in discovering the methodology and processes of wrangler, an interactive visual specification of data transformation scripts by embedding learning algorithms and self-service recommending scripts which aid validation of data i.e. created by automatic inference of relevant transforms. According to description of data transformation in [13], specify exploratory way of data cleansing and mining which evolved into interesting statistical results. The results were published in economic education commission report, as this leads to statutory regulations of any organization in Europe. The authors [14] provide an insight of specification for cleansing data for large databases. In publication [15], authors stipulated the need of intelligently creating and recommending rules for reformatting. The algorithm automatically generates reformatting rules. This deals with expert system called …tope? that can recognize and reformat. The concepts of Stanford team [7] became a visualized tool and it is patented for future development of enterprise level, scalable and commercial tools The paper discusses about the overview of evolution of data integrator and data preparation tools used in the process of data wrangling.

A broad description of what strategy is made to evolve from basic data cleansing tools to automation and self-service methods in data warehousing and business intelligence space is discussed. To understand this the desktop version of Trifacta software, the global leader in wrangling technology is used as an example [16]. The tool creates a unique experience and partnership with user and machine. A data flow is also shown with a case study of a sample insurance data, in the later part of the paper. The design and usage of data wrangling has been dealt in this paper. The paper is organized as, in section III, detailed and the latest wrangling processes are defined which made Datawarehousing space to identify self-service methods to be carried out by Business analysts, in order to understand how their own data gives insights to be derived from this dataset. The section III A, illustrates a case study for data wrangling process with a sample dataset followed with conclusions.

A. Overview of Data Wrangling Process

For building and designing different DWH and BI roadmap for any business domain the following factors are followed. namely,

?Advances in technology landscape

?Performance of complete process

?Cost of resources

?Ease of design and deployment

?Complexity of data and business needs

?Challenges in Integration and data visualization. ?Rapid development of Data Analytics solution frameworks

Fig.1. Data Flow diagram for data preparation

This trend has fuelled the fast growth of ETL process of enterprise [8] or data integration tools i.e. building the methodologies, processing frameworks and layers of data storages. In the process of evolution of transformation rules, for a specific raw data, various data profiling and data quality tools are plugged in. In handling, various data sources to build an enterprise data warehouse relationship, a “Data Integrator” tool is used which forms the main framework of ETL tools. At the other end, the robust reporting tools, the front end of DWH, built on the downstream of ETL framework also transformed to cater both canned and real-time reports under the umbrella of business intelligence, which helps in decision making. Fig. 1., is a typical data flow which needs to be carried out for preparing data, subjected to exploratory analytics. In this data flow, the discussion on big data is carried out in two cases one, cluster with Hadoop platform and other with multiple application and multi structured environment called as “data lake”. In both the cases the data preparation happens with data wrangling tools which has all the transformation language features. The data obtained from data lake platform has lot of challenging transformations i.e. not only reshaping or resizing but data governance tasks [17] which is like “Governance on need” [6] and the schema of the data is “Schema on read? which is beyond the predefined silos structure. This situation demands self-service tasks in transformations beyond IT driven tasks. So, on hand prescriptive scripts, visual and interactive data profiling methods, supports guided data wrangling tasks steps to reshape and load for the consumption of exploratory analytics. The analysis of evolution of enterprise analytics is shown in Fig. 2. Earlier, business intelligence was giving solutions for business analysis, now predictive and prescriptive analytics guides and influence the business for …what happens? scenarios. These principles guide the data management decisions about the data formats and data interpretation, data presentation and documentation. Some of the Jargons associated are …Data munging? and …Janitor works?.

The purpose and usage of Data Lake will not yield results as expected unless the information processing is fast and the underlying data governance is built with a well-structured form. The business analysis team could not deal with data stored in Hadoop systems and they could not patch with techno functional communication to get what business team wants to accomplish as most of the data is behind the frame, built and designed by IT team. These situations end in the need of self-service data preparation processes.

Fig.2. Evolution of Enterprise Analytics

The self-service tool will help to identify typical transformation data patterns in …easy to view visualize forms. Four years back there were few frameworks with open source tools released as a data quality tools to execute by IT teams. For example, TALEND DQ is emerged as an open source tool for building solutions to analyze and design transformations by IT teams, but they lack automation scripts and predictions for probable transformations. on the other side, advancement of learning algorithms and pattern classifications, fuzzy algorithms helped to design transformational language for enterprise level solution providers. The complexity of data size for identifying business analytical solutions is one of the important challenges of five dimensions of big data.

The authors [18] have the opinion that, wrangling is an opportunity and a problem with context to big data. The unstructured form of big data makes the approaches of manual ETL

processes a problem. If the methods and techniques of wrangling becomes cost effective, the

impractical tasks are made practical. Therefore, the traditional ETL tools, struggled to support for the analytical requirements and adopted the processes by framing tightly governed mappings and released as builds, with a considerable usage of BI and DWH methodologies. The data wrangling solutions were delivered fast by processing the …raw? data into the datasets required for analytical solutions. Traditionally the ETL technologies and data wrangling solutions are data preparation tools. but for gearing up any analytics initiatives they require cross functional and spanning various business verticals of organizations.

There is a need of analysis of data at the beginning, which ETL tools lacks, this can be facilitated by wrangling processes. The combination of visualization, pattern classification of data using algorithms of learning and the interaction of analysts makes process easy and faster, which will be self-serviced data preparation methods. This transformation language is [19] built with visualization interfaces to display profile and to interact with user by recommending predictive transformation for the specific data in wrangling processes. Earlier the programming [11] was used for automating transformations. It?s difficult to program the scripts for complex transformations using a programming language, spread sheets and schema mapping jobs. So, an outcome of research was design of a transformation language, the following basic transformations were considered in this regard.

Sizing: Data type and length of data value matters a lot in cleansing, rounding, truncate the integer and text values respectively.

Reshape: The data to consider different shapes like pivot/un-pivot makes rows and columns to have relevant grid frame.

Look up: To enrich data with more information like addition of attributes from different data sources provided Key ID present.

Joins: Use of joins helps to join two different datasets with join keys.

Semantics: To create semantic mapping of meanings and relationships with a learning tools.

Map positional transforms: To enrich with time and geographical reference maps for the available a data. Aggregations and sorting: The aggregations (count, mean, max, list, min etc.) by group of attributes to facilitate drill down feature and sorting on specific ranking for attribute values.

These transformation activities can be done by framework of architecture which includes connectivity to all the data sources of all sizes of data. This process supports task of metadata management as it is more effective for enrichment of data. An embedded knowledge learning system i.e. using machine learning algorithms are built to give user, a data lineage and matching data sets. The algorithm efficiently identifies categories of attributes like geospatial [20]. A frame of transformational language in Fig. 3. depicts sample script) is used by few wrangling tools which uses their own enterprise level …wrangling languages? [15] like Refine tool uses Google refine expression language [14]. This framework must have an in-built structure to deal with data governance and security, whereas, ETL tool frames an extra module built in enterprise level. Though the current self-service BI tools give limitless visual analytics to showcase and understand data, but they are not managed to give us end to end data governance lineage among data sets. This enables expert users to perform expressive transformations with less difficulty and tedium. They can concentrate on analysis and not being bogged down in preparing data. The design is built to suggest transforms without programming. The wrangler language uses input data type as semantic role to reduce tasks arise out of execution and evaluation by providing natural language scripts and virtual transform previews [8].

Fig.3. Sample script of transformation language

B. Tools for Data Wrangling

The data wrangling technology is evolving at a very faster rate to prove and get placed as enterprise level industry standard software (self-service data preparation tool) [14] To quote few, clear story data, Trifacta, Tamr, Paxata and Google refine, other tools Global ID?s, IBM Data works, Informatica Springbok. These tools are basically used in industry for data transformation, data harmonization, data preparation, data integration, data refinery and data governance tools [15] Currently Trifacta and Paxata are the strong performer tools [16]. In some organizations, the deployment is done both with ETL solutions and Wrangling tools, as ETL solutions do primary data integration and load into enterprise data warehouse. From this system, the business users can experience data analysis by exploring wrangling solutions. Most of data wrangling tools uses predictive transformation scripts with underlying machine learning algorithms, visualizations methods and scalable data technologies, namely., Hadoop based infrastructure framed by Cloudera, Hortonworks and IBM. In earlier phase, most of the “traditional” enterprise ETL / BI tools were doing cleansing of structured data and semi structured data to fit in RDBMS (Relational database management system) frame. But the advent of big data challenged the growth of ETL tools to be reshaped with various add-ons as plugins into the existing framework. So, the enterprise ETL tools were reformed as individual bundled tools to carry out the specific purpose of action as per the industry needs. They are data quality,

data

integrator, big data plugins, cloud integrators. In parallel, the BI space, as discussed earlier, because of fast growth of data visualization and in-memory tools and their ease of use, they evolved into self-service BI tools to leverage purpose of saving ease of design and development i.e. lesser intervention of IT resources, more ease of use of domain expertise to be called as …self-service tools?.

III. D ATA W RANGLING P ROCESS U SING T RIFACTA This section presents the data wrangling process using the desktop version of Trifacta tool. The description of the dataset used for the wrangling process is also dis-cussed.

A benchmark dataset is used to investigate the features of the wrangling tool is taken from catalog USA govern-ment dataset [21]. The data set is from insurance domain and is publicly available, which emphasizes economic growth by the USA federal forum on open data. This case study has three different files to be imported with scenarios of cleansing and standardizing [22]. The dataset has variations of data values with different data types like integer for policy number, date format for policy date, travel time in minutes, education profile in text and has a primary key to process any table joins. The dataset has the scope to enrich, merge, identify missing values, rounding to decimals and to visualize in the wrangling tool with perfect variations in attribute histogram. (Ob-served in Fig.4).

Fig.4. Visualization of distribution of each data attributes

An example of insurance company is considered to in-vestigate the business insights of it. A sample data is se-lected to wrangle. This case study has three different files to be imported with scenarios of cleansing and standard-izing [22]. They are main transaction file, enrich the data file with merged customer profile and the add on data from different branches. Trifacta enables the connectivity to various data sources. The data set is imported using drag and drop features available in Trifacta and saved as a separate file. The data can be previewed to have a first look of attributes and to know the relevance after import-ing. The data analysis is done in grid view displayed each attribute, data type patterns, in order to give a clear data format quality tips as shown in Fig. 4.

The user can get a predictive transformation tips on the processed data values for standardization, cleaning and formatting that will be displayed at the lower pane of the grid like rounding of integer values for e.g., travel time of each insured vehicle is represented in minutes as whole number and not as decimals. If a missing value is found, Trifacta intelligently suggests the profile of data values in each attribute, that is easy for analysis. To have an insight on valuable customers, the user can concentrate on data enrichment, that must be a formatting suggestion for date format.

All the features can be viewed in Fig. 5. The data en-richment is done by bringing profile data of the nature of personal information to be appended with lookup attrib-utes. The lookup parameters are set and mapped to the other files as shown in Fig. 6.

Fig.5. Data format script of transformation language Trifacta provides an option of retaining or deleting the information by selecting number of attributes in the file as shown in Fig. 6. Trifacta also supports editing the user defined functions and can be programmed using other languages like Java and Python. This feature can be used for doing sentiment analysis using social media data as Trifacta is able to import Json data and other text data. Finally, merging of the data files as received from other branches with existing records is done by opening tool menu using union tool which prompts for right keys and attributes. All the modified or processed data transfor-mation steps are stored as stepwise procedures which can be reused and modified for future use. A graphical repre-sentation of flow diagram is shown in Fig. 7. Once the job is run, the result summary parameters can be seen in Fig. 8. Data enrich, data cleansing and data merge is done with simple steps of visualizing and analyzing the data, finally the summarized result data set is stored in

the

native format of data visualization tool like tableau or QlikView or in csv and Json format.

Fig.6. Display of Lookup attributes

Thus, it is found that the wrangler tool allows human machine interactive transformation of real world data. It

enables business analysts to iteratively explore predictive transformation scripts with the help of highly trained learning algorithms. Its inefficient to wrangle data with Excel which limits the data volume. The wrangling tools come with seamless cloud deployment, big data and with scalable capabilities with wide variety of data source connectivity which has evolved from preliminary version displayed in Fig. 9 [19].

Fig.7. Graphical representation of flow diagram

The future of Data science leverages data sources re-served for data scientists which are now made available for business analysis, interactive exploration, predictive transformation, intelligent execution, collaborative data governance are the drivers of the future to build advanced analytics framework with prescriptive analytics to beat the race of analytical agility. The future insights of the data science need a scalable, faster and transparent pro-cesses. The [16] (2017 Q1) report cites that the future data preparation tools must evolve urgently to identify customer insights and must be a self-service tool equipped with machine learning approaches and move towards independence from technology management.

Fig.8. Results summary

Fig.9. Preliminary version of wrangler tool

IV.C ONCLUSION

It is concluded that data processing is a process of more sophisticated. The future …wrangling? tools should be aggressive in high throughput and reduction in time to carry out wrangling tasks in achieving the goal of making data more accessible and informative and ready to ex-plore and mine business insights. The data wrangling solutions are exploratory in nature in arriving at analytics initiative. The key element in encouraging technology is to develop strategy to skill enablement of considerable number of business users, analysts and to create large corpus of data on machine learning models. The

technology revolves around to design and support data integration, quality, governance, collaboration and en-richment [23]. The paper emphasizes on the usage of Trifacta tool for data warehouse process, which is one of its unique kind. It is understood that the data preparation, data visualization, validation, standardization, data en-richment, data integration encompasses a data wrangling process.

R EFERENCES

[1]Cline Don, Yueh Simon and Chapman Bruce, Stankov

Boba, Al Gasiewski, and Masters Dallas, Elder Kelly, Richard Kelly, Painter Thomas H., Miller Steve, Katzberg Steve, Mahrt Larry, (2009), NASA Cold Land Processes Experiment (CLPX 2002/03): Airborne Remote Sensing.

[2]S. K. S and M. S. S, “A New Dynamic Data Cleaning

Technique for Improving Incomplete Dataset Consisten-

cy,” Int. J. Inf. Technol. Comput. Sci., vol. 9, no. 9, pp.

60–68, 2017.

[3] A. Fatima, N. Nazir, an d M. G. Khan, “Data Cleaning In

Data Warehouse: A Survey of Data Pre-processing Tech-

niques and Tools,” Int. J. Inf. Technol. Comput. Sci., vol.

9, no. 3, pp. 50–61, 2017.

[4]Stodder. David (2016), WP 219 - EN, TDWI Best practic-

es report: Improving data preparation for business analyt-

ics Q3 2016. ? 2016 by TDWI, a division of 1105 Media, Inc, [Accessed on: May 3rd, 2017].

[5]Richard Wray, “Internet d ata heads for 500bn gigabytes |

Business the Guardian,” https://www.360docs.net/doc/852385782.html,. [Online] Availa-

ble:https://https://www.360docs.net/doc/852385782.html,/business/2009/may/18/d

igital-content-expansion. [Accessed: 24-Oct-2017].

[6]Aslett Matt Research analyst Trifacta of 451 Research and

Davis Will head of marketing, Trifacta, “Trifacta mai n-

tains data preparation” [July ,7, 2017], [Online] A vailable: https://https://www.360docs.net/doc/852385782.html, [Accessed on: 01 August 2017].

[7]Kandel Sean, Paepcke Andreas, Hellersteiny Joseph and

Heer Jeffrey (2011), Wrangler: Interactive Visual Specifi-

cation of Data Transformation Scripts, ACM Human Fac-

tors in Computing Systems (CHI) ACM 978-1-4503-

0267-8/11/05.

[8]Chaudhuri. S and Dayal. U (1997), An overview of data

warehousing and OLAP technology. In SIGMOD Record.

[9]S. Kandel et al., “Research directions in data wra ngling:

Visualizations and transformations for usable and credible da ta,” Inf. Vis., vol. 10, no. 4, pp. 271–288, 2011.

[10]Chen W, Kifer.M, and Warren D.S, (1993), “HiLog: A

foundation for higher-order logic programming”. In Jour-

nal of Logic Programming, volume 15, pages 187-230. [11]Raman Vijayshankar and Hellerstein Joseph M, frshankar,

(2001) “Potter's Wheel: An Interactive Data Cleaning Sys-

tem”, Proceedings of the 27th VLDB Conference.

[12]Norman D.A, (2013), Text book on “The Design of Eve-

ryday Things, Basic Books”, [Accessed on:12 April 2017].

[13]Car ey Lucy, “Self-Service Data Governance & Prepara-

tion on Hadoop”, https://www.360docs.net/doc/852385782.html,. [Online] (May 29, 2014) Available: https://https://www.360docs.net/doc/852385782.html,/trifacta-ceo-the-

evolution-of-data-transformation-and-its-impact-on-the-

bottom-line-107826.html [Accessed on01 April 2017]. [14]Google code online publication (n.d),

https://www.360docs.net/doc/852385782.html,. [Online] Available: https://https://www.360docs.net/doc/852385782.html,/archive/p/google-refine

https://https://www.360docs.net/doc/852385782.html,/OpenRefine/OpenRefine [Accessed on;

28 March 2017]. [15]Data wrangling platform (2017) publication,

https://www.360docs.net/doc/852385782.html,. [Online] Available: https://https://www.360docs.net/doc/852385782.html,/products/architecture//, [Ac-

cessed on: 01 May 2017].

[16]Little Cinny, The Forrester Wave?: Data Prepar ation

Tools, Q1 2017 “The Seven Providers That Matter Most and How They Stack Up”, [Accessed On:March 13, 2017] [17]Parsons Mark A, Brodzik Mary J, Rutter Nick J. (2004),

“Data management for the Cold Land Processes Experi-

ment: improving hydrological science HYDROLOGICAL PROCESSES” Hydrol. Process. 18, 3637-3653.

[18]T. Furche, G. Gottlob, L. Libkin, G. Orsi, and N. W. Pa-

ton, “Data Wrangling for Big Data: Challenges and O p-

portunities,” EDBT, pp. 473–478, 2016.

[19]Kandel Sean, Paepcke Andreas, Hellersteiny Joseph and

Heer Jeffrey (2011), published Image on Papers tab, https://www.360docs.net/doc/852385782.html,. [Online] Available: https://www.360docs.net/doc/852385782.html,/papers/wrangler%20paper [Ac-

cessed on: 25 May 2017].

[20]Ahuja.S, Roth.M, Gangadharaiah R, Schwarz.P and Bas-

tidas.R, (2016), “Using Machine Learning to Accelerate Data Wrangling”, IEEE 16th International Conference on Data Mining Workshops (ICDMW), 2016, Barcelona, Spain, pp. 343-349.doi:10.1109/ICDMW.2016.0055. [21]“Data catalog” Insurance dataset, [Online]https://www.360docs.net/doc/852385782.html,.

Available: https://https://www.360docs.net/doc/852385782.html,/dataset. [Accessed: 24-

Oct-2017].

[22]Piringer Florian Endel Harald, florian. (2015), Data

Wrangling: Making data useful again, IFAC-

PapersOnLine 48-1 (2015) 111-112.

[23]V. kumar, Tan Pang Ning, Steinbach micheal, Introduc-

tion to Data Mining. Dorling Kindersley (India) Pvt. Ltd: Pearson education publisher, 2012.

Authors’ Profiles

Malini M. Patil is presently working as

Associate Professor in the Department of

Information Science and Engineering at

J.S.S. Academy of Technical Education,

Bangalore, Karnataka, INDIA. She

received her Ph.D. degree from Bharathiar

University in the year 2015.

Her research interests are big data analytics, bioinformatics, cloud computing, image processing. She has published more than 20 research papers in many reputed international journals. Published article, Malini M patil, Prof P K Srimani, "Performance analysis of Hoeffding trees in data streams by using massive online analysis framework", 2015, vol 12, International Journal of Data Mining, Modelling and Management (IJDMMM),7,4pg293-313,Inderscience Publishers. Malini M Patil, and Srimani P K, "Mining Data streams with concept drift in massive online analysis frame work", WSES Transactions on computers, 2016/3, Volume 15 Pages 133-142.

Dr. M Patil is a member of IEEE, Institution of Engineers (India), Indian Society for Technical Education, Computer Society of India. She is guiding four students. She has attended and presented papers in many international conferences in India and Abroad. She is a recipient of distinguished woman in Science Award for the year 2017 from Venus International

Foundation. Contact email: drmalinimpatil@https://www.360docs.net/doc/852385782.html,

Basavaraj N Hiremath is a research

scholar working in the field of artificial

intelligence at JSSATE Research Centre,

Dept. of CSE, JSSATE, affiliated to VTU

Belagavi, INDIA. He is pursuing research

under Dr. Malini M Patil, Associate

Professor, Dept., of ISE, JSSATE.

Completed M. S. in Computer Cognition Technology in 2003 from Department of studies in Computer science, University of Mysore, Karnataka INDIA.

He has also worked in information technology industry as SOLUTION ARCHITECT in data warehouse, business Intelligence and analytics space in various business domains of Airlines, Retail, Logistics and FMCG. Published article, Basavaraj N Hiremath, and Malini M Patil, "A Comprehensive Study of Text Analytics", CiiT International Journal of Artificial Intelligent Systems and Machine Learning, 2017/4, vol 9, no 4,70-77

Mr Hiremath is a Fellow of Institution of Engineers (India), member IEEE, member Computer Society of India and Life member Indian Society for Technical Education, Member Association for the Advancement of Artificial Intelligence. How to cite this paper: Malini M. Patil, Basavaraj N. Hiremath, "A Systematic Study of Data Wrangling", International Journal of Information Technology and Computer Science(IJITCS), Vol.10, No.1, pp.32-39, 2018. DOI: 10.5815/ijitcs.2018.01.04

分布式数据库系统及其一致性方法研究

２００７年第２４卷第１０期微电子学与计算机１引言分布式数据库系统在系统结构上的真正含义是指物理上分布、逻辑上集中的分布式数据库结构。数据在物理上分布后，由系统统一管理，用户看到的似乎不是一个分布式数据库，而是一个数据模式为全局数据模式的集中式数据库［１￣５］。分布式数据库系统包括两个重要组成部分：分布式数据库和分布式数据库管理系统。分布式数据库系统具有位置透明性和复制透明性，使用户看到的系统如同一个集中式系统。分布式数据库系统分为三类：同构同质型ＤＤＢＳ、同构异质型ＤＤＢＳ和异构ＤＤＢＳ。同构同质型ＤＤＢＳ是指各个场地都采用同一类型的数据模型，并且是同一型号数据库管理系统；同构异质型ＤＤＢＳ是指各个场地都采用同一类型的数据模型，但是数据库管理系统是不同型号的；异构型ＤＤＢＳ是指各个场地的数据模型是不同的类型。分布式结构是相对于集中式结构而言的。从数据处理的角度来说，典型的集中式结构是数据集中存放和处理，用户通过远程终端或通过网络连接来共享集中存放的数据。分布式结构则是将数据及其处理分散在不同场地，各场地各自管理一部分数据，同时又通过网络系统相互连接。各场地的用户除可以访问和处理本地数据外，也可以访问和处理别的场地的数据。分布式数据库是典型的分布式结构。它包括对数据的分布存储和对事务的分布处理。设计一个分布式数据库系统会遇到许多集中式数据库设计中所没有的问题，一致性是其中必须认真对待和解决的主要问题。２ＤＤＢＳ的体系结构２．１综合型体系结构综合型体系结构是指在综合权衡用户需求之后，设计出分布的数据库，然后再设计出一个完整的ＤＢＭＳ，把ＤＢＭＳ的功能按照一定的决策分散配置在一个分布的环境中。每个结点的ＤＢＭＳ均熟知整个网络的情况，也了解其它结点的情况。从整体上，各结点组成一个完整的系统，它们之间是靠进程通讯的手段来维持互访连接，如图１所示。２．２联合型体系结构联合型体系结构是指每个结点上先有ＤＢＭＳ，以此为基础，再建立分布式环境以实现互访连接。若各个结点的局部ＤＢＭＳ支持同一种数据模式和分布式数据库系统及其一致性方法研究刘萍芬，马瑞芳，王军（西安交通大学电信学院，陕西西安７１００４９）摘要：分布式数据库系统是数据库领域中的一个主要研究方向，数据一致性维护是分布式数据库系统中的一个非常关键的技术问题。在分析分布式数据库系统体系结构的基础上，讨论了两种一致性方法：两阶段提交和复制服务器，并提出一种具有复制服务器的分布式数据库系统的结构框架，它具有有效性和实用性。关键词：分布式数据库系统；一致性；两阶段提交；复制服务器中图分类号：ＴＰ３１文献标识码：Ａ文章编号：１０００－７１８０（２００７）１０－０１３７－０３ＲｅｓｅａｒｃｈｏｆＤｉｓｔｒｉｂｕｔｅｄＤａｔａｂａｓｅＳｙｓｔｅｍａｎｄＤａｔａＣｏｎｓｉｓｔｅｎｃｙＬＩＵＰｉｎｇ－ｆｅｎ，ＭＡＲｕｉ－ｆａｎｇ，ＷＡＮＧＪｕｎ（ＣｏｌｌｅｇｅｏｆＥｌｅｃｔｒｏｎｉｃｓａｎｄＩｎｆｏｒｍａｔｉｏｎＥｎｇｉｎｅｅｔｉｎｇ，Ｘｉ′ａｎＪｉａｏｔｏｎｇＵｎｉｖｅｒｓｉｔｙ，Ｘｉ′ａｎ７１００４９，Ｃｈｉｎａ）Ａｂｓｔｒａｃｔ：Ｄｉｓｔｒｉｂｕｔｅｄｄａｔａｂａｓｅｓｙｓｔｅｍｉｓａｍａｉｎｒｅｓｅａｒｃｈｄｉｒｅｃｔｉｏｎｉｎｔｈｅｄａｔａｂａｓｅｆｉｅｌｄ．Ｍａｉｎｔａｉｎｉｎｇｔｈｅｄａｔａｃｏｎｓｉｓ－ｔｅｎｃｙｉｓａｃｒｉｔｉｃａｌｔｅｃｈｎｉｃａｌｐｒｏｂｌｅｍｉｎｔｈｅｄｉｓｔｒｉｂｕｔｅｄｄａｔａｂａｓｅｓｙｓｔｅｍ．Ｔｈｉｓｐａｐｅｒｄｉｓｃｕｓｓｅｓｔｗｏｍｅｔｈｏｄｓｏｆｍａｉｎｔａｉｎｉｎｇｄａｔａｃｏｎｓｉｓｔｅｎｃｙｂａｓｅｄｏｎａｎａｌｙｚｉｎｇｔｈｅｓｔｒｕｃｔｕｒｅｏｆｔｈｅｄｉｓｔｒｉｂｕｔｅｄｄａｔａｂａｓｅｓｙｓｔｅｍ，ｗｈｉｃｈａｒｅ２ＰＣａｎｄｒｅｐｌｉｃａｔｉｏｎｓｅｒｖ－ｅｒ．Ｔｈｅｎｔｈｅｐａｐｅｒｐｕｔｓｆｏｒｗａｒｄａｄｉｓｔｒｉｂｕｔｅｄｄａｔａｂａｓｅｆｒａｍｅｗｏｒｋｗｈｉｃｈｈａｖｅｒｅｐｌｉｃａｔｉｏｎｓｅｒｖｅｒｓｔｒｕｃｔｕｒｅ．Ａｎｄｉｔｉｓｅｆｆｅｃ－ｔｉｖｅａｎｄａｐｐｌｉｅｄ．Ｋｅｙｗｏｒｄｓ：ｄｉｓｔｒｉｂｕｔｅｄｄａｔａｂａｓｅｓｙｓｔｅｍ；ｄａｔａｃｏｎｓｉｓｔｅｎｃｙ；２ＰＣ；ｒｅｐｌｉｃａｔｉｏｎｓｅｒｖｅｒ收稿日期：２００６－１０－２７１３７

数据挖掘可视化系统研究与实现

数据挖掘可视化系统设计与实现摘要：针对当前数据可视化工具的种类、质量和灵活性的存在的不足，构建一个数据挖掘可视化平台。将获取的数据集上传到系统中，对数据集进行预处理，利用Mahout提供的分类、聚类等挖掘算法对数据集进行挖掘，使用ECharts将挖掘产生的结果进行可视化展示。关键词：数据挖掘；可视化展示；数据预处理；挖掘算法 1引言大数据时代，通过数据挖掘，可以对数据库中的大量业务数据进行抽取、转换、分析和其他模型化处理，从而提取辅助商业决策的关键性信息。丰富而灵活的数据挖掘结果可视化技术使抽象的信息以简明的形式呈现出来，加深用户对数据含义的理解，更好地了解数据之间的相互关系和发展趋势。然而当前数据可视化工具的种类、质量和灵活性较大的影响数据挖掘系统的使用、解释能力和吸引力。为此，本系统使用分布式大数据处理技术进行数据的存储和计算，构建一个数据挖掘可视化平台，以多种挖掘算法的实现对原始数据集进行挖掘，从而发现数据中有用的信息。 2.关键技术 (1)MapReduce离线计算框架一种在YARN系统之上的大数集离线计算框架，使用MapReduce可以并行的对原始数据集进行计算处理，从而高效的得出结果。 (2)HBase分布式数据库 HBase是一个构建在Hadoop之上分布式的、面向列的开源数据库。HBase不同于一般的关系数据库，他是一个适合于非结构化数据存储的数据库。 (3)Mahout Mahout是Apache Software Foundation旗下的一个开源项目，提供一些可扩展的机器学习领域经典算法的实现。包括聚类、分类、推荐过滤、频繁子项挖掘等算法的实现。 (4)ECharts Echarts是百度团队对ZRender做了一次大规模重构的产物。他被定义为商业级报表，创建了坐标系，图例，提示，工具箱等基础组件，并在此上构建出折线图、柱状图、散点图、K线图、饼图、雷达图、地图、和弦图、力导向布局图、仪表盘以及漏斗图，同时支持任意纬度的堆积和多图表混合实现。 3.研究思路数据挖掘可视化系统包括以下模块： (1)前台展示通过对上传的数据集处理、挖掘、分析，将有价值的信息结果以图形化的形式展现给用户。 (2)数据集的存储将要处理的数据集存储到HBase数据库中。HBase数据库能够对大数据提供随机、实时的读写访问功能。 (3)后台数据处理通过使用Mahout数据挖掘包，对挖掘算法进行相关参数的设定，对从数据库中提取的数据集进行挖掘，从而提取出有用的信息。具体如图1所示：

数据中心容灾备份方案完整版

数据中心容灾备份方案 HEN system office room 【HEN16H-HENS2AHENS8Q8-HENH1688】

数据保护系统医院备份、容灾及归档数据容灾解决方案 1、前言在医院信息化建设中，HIS、PACS、RIS、LIS 等临床信息系统得到广泛应用。医院信息化 HIS、LIS 和 PACS 等系统是目前各个医院的核心业务系统，承担了病人诊疗信息、行政管理信息、检验信息的录入、查询及监控等工作，任何的系统停机或数据丢失轻则降低患者的满意度、医院的信誉丢失，重则引起医患纠纷、法律问题或社会问题。为了保证各业务系统的高可用性，必须针对核心系统建立数据安全保护，做到“不停、不丢、可追查”，以确保核心业务系统得到全面保护。随着电子病历新规在 4 月 1 日的正式施行，《电子病历应用管理规范（试行）》要求电子病历的书写、存储、使用和封存等均需按相关规定进行，根据规范，门(急)诊电子病历由医疗机构保管的，保存时间自患者最后一次就诊之日起不少于15 年；住院电子病历保存时间自患者最后一次出院之日起不少于 30 年。

2、医院备份、容灾及归档解决方案针对医疗卫生行业的特点和医院信息化建设中的主要应用，包括：HIS、PACS、RIS、LIS 等，本公司推出基于数据保护系统的多种解决方案，以达到对医院信息化系统提供全面的保护以及核心应用系统的异地备份容灾数据备份解决方案针对于医院的 HIS、PACS、LIS 等服务器进行数据备份时，数据保护系统的备份架构采用三层构架。备份软件主控层（内置一体机）：负责管理制定全域内的备份策略和跟踪客户端的备份，能够管理磁盘空间和磁带库库及光盘库，实现多个客户端的数据备份。备份软件主服务器是备份域内集中管理的核心。客户端层（数据库和操作系统客户端）：其他应用服务器和数据库服务器安装备份软件标准客户端,通过这个客户端完成每台服务器的 LAN 或 LAN-FREE 备份工作。另外，为包含数据库的客户端安装数据库代理程序，从而保证数据库的在线热备份。备份介质层（内置虚拟带库）：主流备份介质有备份存储或虚拟带库等磁盘介质、物理磁带库等，一般建议将备份存储或虚拟带库等磁盘介质作为一级备份介质，用于近期的备份数据存放，将物理磁带库或者光盘库作为二级备份介质，用于长期的备份数据存放。

数据挖掘研究现状及发展趋势

数据挖掘研究现状及发展趋势摘要：从数据挖掘的定义出发，介绍了数据挖掘的神经网络法、决策树法、遗传算法、粗糙集法、模糊集法和关联规则法等概念及其各自的优缺点；详细总结了国内外数据挖掘的研究现状及研究热点，指出了数据挖掘的发展趋势。关键词：数据挖掘；挖掘算法；神经网络；决策树；粗糙集；模糊集；研究现状；发展趋势 Abstract：From the definition of data mining，the paper introduced concepts and advantages and disadvantages of neural network algorithm，decision tree algorithm，genetic algorithm，rough set method，fuzzy set method and association rule method of data mining，summarized domestic and international research situation and focus of data mining in details，and pointed out the development trend of data mining. Key words：data mining，algorithm of data mining，neural network，decision tree，rough set，fuzzy set，research situation，development tendency 1引言随着信息技术的迅猛发展，许多行业如商业、企业、科研机构和政府部门等都积累了海量的、不同形式存储的数据资料[1]。这些海量数据中往往隐含着各种各样有用的信息，仅仅依靠数据库的查询检索机制和统计学方法很难获得这些信息，迫切需要能自动地、智能地将待处理的数据转化为有价值的信息，从而达到为决策服务的目的。在这种情况下，一个新的技术———数据挖掘(Data Mining，DM)技术应运而生[2]。数据挖掘是一个多学科领域，它融合了数据库技术、人工智能、机器学习、统计学、知识工程、信息检索等最新技术的研究成果，其应用非常广泛。只要是有分析价值的数据库，都可以利用数据挖掘工具来挖掘有用的信息。数据挖掘典型的应用领域包括市场、工业生产、金融、医学、科学研究、工程诊断等。本文主要介绍数据挖掘的主要算法及其各自的优缺点，并对国内外的研究现状及研究热点进行了详细的总结，最后指出其发展趋势及问题所在。江西理工大学

分布式数据库系统的研究—张晓丽

论文论文题目：分布式数据库系统的研究所在单位：太原南瑞继保电力有限公司姓名：张晓丽二〇一六年九月分布式数据库系统的研究摘要随着智能终端的快速发展，当今对于数据库的访问请求通过网络高速增长，一些企业关键业务内容的数据平均每秒都要处理几千乃至于上万次的请求，对于企业数据库的响应速度提出了很高的要求。本文介绍了分布式数据库的定义及其特点，阐述分析了分布式数据库系统的关键技术。关键词：分布式数据库系统；同步技术；加密技术 1分布式数据库系统的定义计算机网络的发展为用户从网络中获取数据信息提供了便利，由于网络用户的逐年增长，网络信息量越来越大，因此信息查询、流通的效率成为制约网络发展的因素。数据库系统是由数据库和数据管理软件一同构成的一体的管理系统，为当今信息时代网络上海量数据信息的传输、存储、访问以及共享提供了保障。分布式数据库系统（Distributed Database System，DDBS）是一种数据集合，由多个小型计算机系统和相应的配套数据库，以网络的形式实现之间连接构成了统一的数据库。分布式数据库系统是一种能够帮助数据库实现分布处理的系统，能够辅助多台计算机体系的整体结构任务处理。分布式数据库系统可按其分布组成分为两种类型：一种是物理分布逻辑集中，即逻辑上数据集合属于同一系统，而在物理上这些数据集合分布在多台联网计算机上。此类数据库系统适用于用途单一、专业性强的中小企业或部门；另外一种是逻辑上或是物理上都是分布的，这种分布式数据库系统类型主要用于集成大范围数据库。 2分布式数据库系统的特点 2.1数据分布的透明性在分布式数据库系统中，数据的独立性是系统的核心，由于分布性的存在使得数据独立性的要求更加复杂，同时也更加丰富。数据的独立性用数据分布的透明性来描述，分布的透明性表现在用户在调用应用程序中的数据库是时，不必具体了解数据存储的物理位置，也不必关心局部场地上数据库支持哪种数据模型。增加了数据的重复利用率。 2.2自治性与共享性每个局部数据库管理系统可以对本地数据库进行独立管理，选择该站点数据是否共享到全局数据库，对于无需进行全局共享的数据，分布式数据库系统会将其保留在分站点中，从而节省数据流量。在普通用户使用分布式数据库系统时，如需要查询或者修改某一分站点数据，无论该数据位于任何站点，用户可以直接进行查询工作，称作全局共享。即在各个分布数据库站点，能够支持网络上其他站点及用户对于数据库系统的使用，能够提供本地数据库中数据的全局共享。 2.3可靠性分布式数据库系统具有更高的可靠性和灵活性，与集中式数据库系统相比，分布式数据库系

海量数据下分布式数据库系统的探索与研究

海量数据下分布式数据库系统的探索与研究摘要：当前，互联网用户规模不断扩大，这些都与互联网的快速发展有关。现在传统的数据库已经不能满足用户的需求了。随着云计算技术的飞速发展，我国海量数据快速增长，数据量年均增速超过50％，预计到2020年，数据总量全球占比将达到20％，成为数据量最大、数据类型最丰富的国家之一。采用分布式数据库可以显著提高系统的可靠性和处理效率，同时也可以提高用户的访问速度和可用性。本文主要介绍了分布式数据库的探索与研究。关键词：海量数据；数据库系统 1.传统数据库： 1.1 层次数据库系统。层次模型是描述实体及其与树结构关系的数据模型。在这个结构中，每种记录类型都由一个节点表示，并且记录类型之间的关系由节点之间的一个有向直线段表示。每个父节点可以有多个子节点，但每个子节点只能有一个父节点。这种结构决定了采用层次模型作为数据组织方式的层次数据库系统只能处理一对多的实体关系。 1.2 网状数据库系统。网状模型允许一个节点同时具有多个父节点和子节点。因此，与层次模型相比，网格结构更具通用性，可以直接描述现实世界中的实体。也可以认为层次模型是网格模型的特例。 1.3 关系数据库系统。关系模型是一种使用二维表结构来表示实体类型及其关系的数据模型。它的基本假设是所有数据都表示为数学关系。关系模型数据结构简单、清晰、高度独立，是目前主流的数据库数据模型。随着电子银行和网上银行业务的创新和扩展，数据存储层缺乏良好的可扩展性，难以应对应用层的高并发数据访问。过去，银行使用小型计算机和大型存储等高端设备来确保数据库的可用性。在可扩展性方面，主要通过增加CPU、内存、磁盘等来提高处理能力。这种集中式的体系结构使数据库逐渐成为整个系统的瓶颈，越来越不适应海量数据对计算能力的巨大需求。互联网金融给金融业带来了新的技术和业务挑战。大数据平台和分布式数据库解决方案的高可用性、高可靠性和可扩展性是金融业的新技术选择。它们不仅有利于提高金融行业的业务创新能力和用户体验，而且有利于增强自身的技术储备，以满足互联网时代的市场竞争。因此，对于银行业来说，以分布式数据库解决方案来逐步替代现有关系型数据库成为最佳选择。 2.分布式数据库的概念：分布式数据库系统：分布式数据库由一组数据组成，这些数据物理上分布在计算机网络的不同节点上（也称为站点），逻辑上属于同一个系统。（1）分布性：数据库中的数据不是存储在同一个地方，更准确地说，它不是存储在同一台计算机存储设备中，这可以与集中数据库区别开来。（2）逻辑整体性：这些数据在逻辑上是相互连接和集成的（逻辑上就像一个集中的数据库）。分布式数据库的精确定义：分布式数据库由分布在计算机网络中不同计算机

数据中心容灾备份方案

数据保护系统医院备份、容灾及归档数据容灾解决方案

1、前言在医院信息化建设中，HIS、PACS、RIS、LIS 等临床信息系统得到广泛应用。医院信息化HIS、LIS 和PACS 等系统是目前各个医院的核心业务系统，承担了病人诊疗信息、行政管理信息、检验信息的录入、查询及监控等工作，任何的系统停机或数据丢失轻则降低患者的满意度、医院的信誉丢失，重则引起医患纠纷、法律问题或社会问题。为了保证各业务系统的高可用性，必须针对核心系统建立数据安全保护，做到“不停、不丢、可追查”，以确保核心业务系统得到全面保护。随着电子病历新规在 4 月 1 日的正式施行，《电子病历应用管理规范（试行）》要求电子病历的书写、存储、使用和封存等均需按相关规定进行，根据规范，门(急)诊电子病历由医疗机构保管的，保存时间自患者最后一次就诊之日起不少于15 年；住院电子病历保存时间自患者最后一次出院之日起不少于30 年。

2、医院备份、容灾及归档解决方案针对医疗卫生行业的特点和医院信息化建设中的主要应用，包括：HIS、PACS、RIS、LIS 等，本公司推出基于数据保护系统的多种解决方案，以达到对医院信息化系统提供全面的保护以及核心应用系统的异地备份容灾 2.1 数据备份解决方案针对于医院的HIS、PACS、LIS 等服务器进行数据备份时，数据保护系统的备份架构采用三层构架。备份软件主控层（内置一体机）：负责管理制定全域内的备份策略和跟踪客户端的备份，能够管理磁盘空间和磁带库库及光盘库，实现多个客户端的数据备份。备份软件主服务器是备份域内集中管理的核心。客户端层（数据库和操作系统客户端）：其他应用服务器和数据库服务器安装备份软件标准客户端,通过这个客户端完成每台服务器的LAN 或LAN-FREE 备份工作。另外，为包含数据库的客户端安装数据库代理程序，从而保证数据库的在线热备份。

分布式数据库系统复习题

一、何为分布式数据库系统？一个分布式数据库系统有哪些特点？答案：分布式数据库系统通俗地说，是物理上分散而逻辑上集中的数据库系统。分布式数据库系统使用计算机网络将地理位置分散而管理和控制又需要不同程度集中的多个逻辑单位连接起来，共同组成一个统一的数据库系统。因此，分布式数据库系统可以看成是计算机网络与数据库系统的有机结合。一个分布式数据库系统具有如下特点：物理分布性，即分布式数据库系统中的数据不是存储在一个站点上，而是分散存储在由计算机网络连接起来的多个站点上，而且这种分散存储对用户来说是感觉不到的。逻辑整体性，分布式数据库系统中的数据物理上是分散在各个站点中，但这些分散的数据逻辑上却构成一个整体，它们被分布式数据库系统的所有用户共享，并由一个分布式数据库管理系统统一管理，它使得“分布”对用户来说是透明的。站点自治性，也称为场地自治性，各站点上的数据由本地的DBMS管理，具有自治处理能力，完成本站点的应用，这是分布式数据库系统与多处理机系统的区别。另外，由以上三个分布式数据库系统的基本特点还可以导出它的其它特点，即：数据分布透明性、集中与自治相结合的控制机制、存在适当的数据冗余度、事务管理的分布性。二、简述分布式数据库的模式结构和各层模式的概念。分布式数据库是多层的，国内分为四层：全局外层：全局外模式，是全局应用的用户视图，所以也称全局试图。它为全局概念模式的子集，表示全局应用所涉及的数据库部分。全局概念层：全局概念模式、分片模式和分配模式全局概念模式描述分布式数据库中全局数据的逻辑结构和数据特性，与集中式数据库中的概念模式是集中式数据库的概念视图一样，全局概念模式是分布式数据库的全局概念视图。分片模式用于说明如何放置数据库的分片部分。分布式数据库可划分为许多逻辑片，定义片段、片段与概念模式之间的映射关系。分配模式是根据选定的数据分布策略，定义各片段的物理存放站点。局部概念层：局部概念模式是全局概念模式的子集。局部内层：局部内模式局部内模式是分布式数据库中关于物理数据库的描述，类同集中式数据库中的内模式，但其描述的内容不仅包含只局部于本站点的数据的存储描述，还包括全局数据在本站点的存储描述。三、简述分布式数据库系统中的分布透明性，举例说明分布式数据库简单查询的各级分布透明性问题。分布式数据库中的分布透明性即分布独立性，指用户或用户程序使用分布式数据库如同使用集中式数据库那样，不必关心全局数据的分布情况，包括全局数据的逻辑分片情况、逻辑片段的站点位置分配情况，以及各站点上数据库的数据模型等。即全局数据的逻辑分片、片段的物理位置分配，各站点数据库的数据模型等情况对用户和用户程序透明。

(最新整理)分布式数据库研究现状及发展趋势

(完整)分布式数据库研究现状及发展趋势编辑整理：尊敬的读者朋友们：这里是精品文档编辑中心，本文档内容是由我和我的同事精心编辑整理后发布的，发布之前我们对文中内容进行仔细校对，但是难免会有疏漏的地方，但是任然希望（(完整)分布式数据库研究现状及发展趋势）的内容能够给您的工作和学习带来便利。同时也真诚的希望收到您的建议和反馈，这将是我们进步的源泉，前进的动力。本文可编辑可修改，如果觉得对您有帮助请收藏以便随时查阅，最后祝您生活愉快业绩进步，以下为(完整)分布式数据库研究现状及发展趋势的全部内容。

山西大学研究生学位课程论文（2014 —--— 2015 学年第 2 学期) 学院（中心、所）：计算机与信息技术学院专业名称：计算机应用技术课程名称：分布式数据库技术论文题目：分布式数据库研究现状及发展趋势授课教师（职称）: 曹峰（) 研究生姓名: 刘杰飞年级： 2014级学号： 201422403003 成绩: 评阅日期: 山西大学研究生学院 2015年 6 月 17日

分布式数据库研究现状及发展趋势摘要随着大数据、云时代的到来，数据库应用需求的拓展和计算机硬件环境的变化,特别是计算机网络与数字通信技术的飞速发展，卫星通信、蜂窝通信、计算机局域网、广域网和激增的Intranet及Internet得到了广泛应用,使分布式数据库系统应运而生。为了符合当今信息系统的应用需求和企业组织的管理思想和管理模式。分布式数据库提供了解决整个信息资产被分裂所成的信息孤岛，为孤岛联系在一起提供桥梁.本文主要介绍分布式数据库的研究现状，存在的一些问题以及未来的发展趋势。关键词分布式数据库；发展趋势；现状及问题 1.引言随着信息技术的飞速发展，社会经济结构、生产方式和消费结构已经发生了重大变化，这些变化深刻地影响着人民生活的方方面面。尤其是近十年来人们对计算机的依赖性越来越强，同时也对计算机提出了更高的要求。随着数据库在各个行业中的不断发展,各行业也对数据库提出了更高的要求，数据量也急剧增加，同时有关大数据分析的讨论正在愈演愈烈.甚至出现了爆炸性增长的趋势，一方面是由于移动互联网和移动智能终端的普及发展，数据信息正以每年40%的速度增长，造成数据量庞大；同时,数据种类呈多样性，文本、图片、视频等结构化和非结构化数据共存；另一方面也要求实时交互性强；最重要的是大数据蕴含了巨大的商业价值。相应的对于管理这些数据的复杂度也随之增加。同时各行业部门或企业所使用的软硬件之间的差异，这给开发企业管理数据库管理软件带来了巨大的工作量，如果能够有效解决这个问题,即使用同一模块管理操作不同的数据表格，对不同的数据表格进行查询、插入、删除、修改等操作，也即对企业简单的应用实现即插即用的功能，那么就能大大地减少软件开发的维护和更新费用,缩短软件的开发周期。分布式数据库系统的开发，降低了企业开发的成本,提高了软件使用的回报率。当今社会已进入了信息时代，人们将越来越多的信息存储在网络中的计算机上。如何更有

数据挖掘系统设计技术分析

数据挖掘系统设计技术分析【摘要】数据挖掘技术则是商业智能（Business Intelligence）中最高端的，最具商业价值的技术。数据挖掘是统计学、机器学习、数据库、模式识别、人工智能等学科的交叉，随着海量数据搜集、强大的多处理器计算机和数据挖掘算法等基础技术的成熟，数据挖掘技术高速发展，成为21世纪商业领域最核心竞争力之一。本文从设计思路、系统架构、模块规划等方面分析了数据挖掘系统设计技术。【关键词】数据挖掘；商业智能；技术分析引言数据挖掘是适应信息社会从海量的数据库中提取信息的需要而产生的新学科。它可广泛应用于电信、金融、银行、零售与批发、制造、保险、公共设施、政府、教育、远程通讯、软件开发、运输等各个企事业单位及国防科研上。数据挖掘应用的领域非常广阔，广阔的应用领域使用数据挖掘的应用前景相当光明。我们相信，随着数据挖掘技术的不断改进和日益成熟，它必将被更多的用户采用，使企业管理者得到更多的商务智能。 1、参考标准 1.1挖掘过程标准：CRISP-DM CRISP-DM全称是跨行业数据挖掘过程标准。它由SPSS、NCR、以及DaimlerChrysler三个公司在1996开始提出，是数据挖掘公司和使用数据挖掘软件的企业一起制定的数据挖掘过程的标准。这套标准被各个数据挖掘软件商用来指导其开发数据挖掘软件，同时也是开发数据挖掘项目的过程的标准方法。挖掘系统应符合CRISP-DM的概念和过程。 1.2ole for dm ole for dm是微软于2000年提出的数据挖掘标准，主要是在微软的SQL SERVER软件中实现。这个标准主要是定义了一种SQL扩展语言：DMX。也就是挖掘系统使用的语言。标准定义了许多重要的数据挖掘模型定义和使用的操作原语。相当于为软件提供商和开发人员之间提供了一个接口，使得数据挖掘系统能与现有的技术和商业应用有效的集成。我们在实现过程中发现这个标准有很多很好的概念，但也有一些是勉为其难的，原因主要是挖掘系统的整体概念并不是非常单纯，而是像一个发掘信息的方法集，所以任何概念并不一定符合所有的情况，也有一些需要不断完善和发展中的东西。 1.3PMML

分布式数据库管理系统简介

分布式数据库管理系统简介一、什么是分布式数据库：分布式数据库系统是在集中式数据库系统的基础上发展来的。是数据库技术与网络技术结合的产物。分布式数据库系统有两种：一种是物理上分布的，但逻辑上却是集中的。这种分布式数据库只适宜用途比较单一的、不大的单位或部门。另一种分布式数据库系统在物理上和逻辑上都是分布的，也就是所谓联邦式分布数据库系统。由于组成联邦的各个子数据库系统是相对“自治”的，这种系统可以容纳多种不同用途的、差异较大的数据库，比较适宜于大范围内数据库的集成。分布式数据库系统(DDBS)包含分布式数据库管理系统(DDBMS)和分布式数据库(DDB)。在分布式数据库系统中，一个应用程序可以对数据库进行透明操作，数据库中的数据分别在不同的局部数据库中存储、由不同的DBMS进行管理、在不同的机器上运行、由不同的操作系统支持、被不同的通信网络连接在一起。一个分布式数据库在逻辑上是一个统一的整体：即在用户面前为单个逻辑数据库，在物理上则是分别存储在不同的物理节点上。一个应用程序通过网络的连接可以访问分布在不同地理位置的数据库。它的分布性表现在数据库中的数据不是存储在同一场地。更确切地讲，不存储在同一计算机的存储设备上。这就是与集中式数据库的区别。从用户的角度看，一个分布式数据库系统在逻辑上和集中式数据库系统一样，用户可以在任何一个场地执行全局应用。就好那些数据是存储在同一台计算机上，有单个数据库管理系统(DBMS)管理一样，用户并没有什么感觉不一样。分布式数据库中每一个数据库服务器合作地维护全局数据库的一致性。分布式数据库系统是一个客户/服务器体系结构。在系统中的每一台计算机称为结点。如果一结点具有管理数据库软件，该结点称为数据库服务器。如果一个结点为请求服务器的信息的一应用，该结点称为客户。在ORACLE客户，执行数据库应用，可存取数据信息和与用户交互。在服务器，执行ORACLE软件，处理对ORACLE 数据库并发、共享数据存取。ORACLE允许上述两部分在同一台计算机上，但当客户部分和服务器部分是由网连接的不同计算机上时，更有效。分布处理是由多台处理机分担单个任务的处理。在ORACLE数据库系统中分布处理的例子如：客户和服务器是位于网络连接的不同计算机上。单台计算机上有多个处理器，不同处理器分别执行客户应用。

XXX数据中心升级及容灾改造项目招标文件(原方案)

XXX数据中心升级及容灾项目采购招标文件. 招标内容及技术要求一、本项目工程建设的背景和现状随着我院信息系统的不断发展，业务系统的数据量、数据处理量和数据存储量越来越大。因此，业务系统的稳定与否，系统的保护和数据的保护是否健全，已成为本系统是否正常运行的关键。随着数据集中处理的实施，可以预计，我院信息系统的业务运作、管理模式将越来越依赖于计算机系统的可靠运行。我院所提供服务的连续性以及业务数据的完整性、正确性、有效性，会直接关系到业务的生产、管理与决策活动。这就要求我们对网络、通信线路、服务器主机等关键硬件设备以及数据库，应用服务器等软硬件进行相应的故障保护和容灾备份部署。一旦某一环节出现异常情况，如火灾、爆炸、地震、水灾、雷击或某个方向通信线路故障等自然原因以及电源机器故障、人为破坏等非自然原因引起的灾难，我们可以快速及时的进行灾难恢复，将损失降到最低点。如果没有全面的考虑容错、容灾设计，那么在任何一个环节上发生故障和灾难，都会导致业务无法正常进行，造成重要数据的丢失、破坏，造成相关的部门的系统中断，不仅不能社会大众提供正常的医疗服务，甚至在极大程度上会影响医院的形象和声誉，使日后的工作无法正常开展下去。因此，根据本系统的特点，必须充分考虑各种灾难情况，建立灾难备份系统。另外我院于2008年对整个信息系统平台的软硬件设备进行升级和改造，至目前为止已经使用将近7年时间，目前的系统平台设备已经慢慢出现故障增多、性能下降等问题，也急需要对整个系统平台进行升级。

二、采购内容及招标需求 1. 采购原则及规范：本次公开招标采购XXX数据中心升级及容灾项目的有关设备，投标人所投设备必须按招标文件规定的配置要求提供，并满足招标文件中提出的相关性能指标参数，同时应能满足XXX局信息化系统的当前以及今后3-5年内业务发展的需求。投标人应对所提供的设备性能、质量负责，并提供相应的安装、服务、质保及技术培训。采购的设备所涉及的产品标准、规范，验收标准等，应符合国家有关条例及标准的规定。 2. 采购设备清单（预算：140万元）

数据中心灾备系统的分类

数据中心灾备系统的分类根据数据中心的安全要求，应对灾难恢复系统采用的技术路线做出全面的考虑。 1.数据级容灾和应用级容灾按照容灾系统对应用系统的保护程度可以分为数据级容灾和应用级容灾，业务级容灾的大部分内容是非IT系统。数据级容灾系统只保证数据的完整性、可靠性和安全性，但提供实时服务的请求在灾难中会中断。应用级容灾系统能够提供不间断的应用服务，让服务请求能够透明(在灾难发生时毫无觉察)地继续运行，保证数据中心提供的服务完整、可靠、安全。因此对服务中断不太敏感的部分可以选择数据级容灾，以便节省成本，在数据级容灾的基础上构建应用级容灾系统，保证实时服务不间断运行，为用户提供更好的服务。 (1)数据级容灾。通过在异地建立一份数据复制的方式保证数据的安全性，当本地工作系统出现不可恢复的物理故障时，容灾系统提供可用的数据。数据级容灾是容灾的基础形式，由于只需要考虑数据的复制和存放，不需要考虑备用系统，实现起来相对简单，投资也较少。数据级容灾需要考虑三方面问题:在线模式与离线模式问题;远程数据复制技术问题;同步与异步容灾问题。 (2)应用级容灾。应用级容灾能保证业务的连续性。在数据级容灾的基础上，建立备份的应用系统环境，当本地工作系统出现不可恢复的物理故障时，容灾系统提供可用的数据和应用系统。应用级容灾系统是建立在数据级容灾系统基础上的，同时能完成数据和应用系统环境的复制存放和管理。为实现发生灾难时的应用切换，容灾中心需要配置与工作系统同构和相同功能的业务网络、应用服务器、应用软件等。应用级容灾还需要考虑数据复制的完全性、数据的一致性、数据的完整性、网络的通畅性、容灾切换的性能影响、应用软件的适应性改造等问题，以及为保证业务运行的所需设备、环境、人员及其相应的管理。 2.灾难恢复系统的在线/离线模式 (l)在线模式。在线灾难恢复系统要求工作系统与灾难备份系统通过网络线路连接，数据通过网络实时或定时从工作系统传输到灾难备份系统。对数据保护的实时性高，对业务连续性要求高，就需要采用在线模式。 (2)离线模式。离线灾难备份系统的数据通过存储介质(磁带、光盘等，搬运到异地保存起来实现数据的保护。离线模式适合于对数据保护的实时性要求不高的场合，离线模式设备比较简单，投资较少。 3.数据备份技术正常情况下系统的各种应用在数据中心运行，数据存放在数据中心和灾难备份中心两地保存。当灾难发生时，使用备份数据对工作系统进行恢复或将应用切换到备份中心。灾难备份系统中数据备份技术的选择应符合数据恢复时间或系统切换时间满足业务连续性的要求。目前数据备份技术主要有如下几种: (1)磁带备份。 (2)基于应用程序的备份。通过应用程序或者中间件产品，将数据中心的数据复制到灾难备份中心。在正常情况下，数据中心的应用程序在将数据写入本地存储系统的同时将数据发送到灾难备份中心，灾难备份中心只在后台处理数据，当数据中心瘫痪时，由于灾难备份中心也存有生产数据，所以可以迅速接管业务。这种备份方式往往需要应用程序的修改，工作量比较大。另外，

数据挖掘技术的研究现状及发展方向_陈娜

数据挖掘技术的研究现状及发展方向陈娜1.2 （1.北京交通大学计算机学院，北京100044；2.石家庄铁路运输学校，河北石家庄050021）第 !" 电脑与信息技术卷（ ! ）可视化技术［ " ］通过直观的图形方式将信息数据、关联关系以及发展趋势呈现给决策者，使用最多的方法是直方图、数据立方体、散点图。其中数据立方体可以通过 #$%& 操作将更多用户关心的信息反映给用户。（ ’ ）遗传算法［ ( ］是一种模拟生物进化过程的算法，最早由 )*++,-. 于 /0 世纪 (0 年代提出。它是基于群体的、具有随机和定向搜索特征的迭代过程，包括 ! 种典型的算子：遗传、交叉、变异和自然选择。遗传算法作用于一个由问题的多个潜

在解（个体）组成的群体上，并且群体中的每个个体都由一个编码表示，同时个体均需依据问题的目标函数而被赋予一个适应值。另外，为了应用遗传算法，还需要把数据挖掘任务表达为一种搜索的问题，以便发挥遗传算法的优势搜索能力。同时可以用遗传算法中的交叉、变异完成数据挖掘中用于异常数据的处理。（ "）统计学方法［ 1 ］在数据库字段项之间存在着两种关系：函数关系（能用函数公式表示的确定性关系）和相关关系（不能用函数公式表示，但仍是相关确定关系）。对它们的分析采用如下方法：回归分析、相关分析、主成分分析。主要用于数据挖据的聚类方法中。（ (）模糊集（23445 678）方法利用模糊集理论对实际问题进行模糊评判、模糊决策、模糊模式识别和模糊聚类分析。模糊性是客观存在的。系统的复杂性越高，精确化能力就越低，即模糊性就越强，这是 9,.7: 总结出的互克性原理。 / 数据挖掘的算法（ ;）关联规则中的算法 %<=>*=>算法是一种最具有影响力的挖掘布尔关联规则频繁项集的算法，该算法是一种称为主层搜索的迭代方法，它分为两个步骤： ,?通过多趟扫描数据库求解出频繁;@项集的集合 $ ; ； A?不断的寻找到/@项集$ / … -@项集$ - ，最后利用频繁项集生成规则。随后的许多算法都沿用

数据挖掘研究及发展现状

数据挖掘技术的研究现状及发展方向摘要：数据挖掘技术是当前数据库和人工智能领域研究的热点。从数据挖掘的定义出发，介绍了数据挖掘的神经网络法、决策树法、遗传算法、粗糙集法、模糊集法和关联规则法等概念及其各自的优缺点；详细总结了国内外数据挖掘的研究现状及研究热点，指出了数据挖掘的发展方向。关键词：数据挖掘；神经网络；决策树；粗糙集；模糊集；研究现状；发展方向 The present situation and future direction of the data mining technology research Abstract: Data mining technology is hot spot in the field of current database and artificial intelligence. From the definition of data mining, the paper introduced concepts and advantages and disadvantages of neural network algorithm, decision tree algorithm, genetic algorithm, rough set method, fuzzy set method and association rule method of data mining, summarized domestic and international research situation and focus of data mining in details, and pointed out the development trend of data mining. Key words: data mining, neural network, decision tree, rough set, fuzzy set, research situation, development direction 0 引言随着信息技术的迅猛发展，许多行业如商业、企业、科研机构和政府部门等都积累了海量的、不同形式存储的数据资料[1]。这些海量数据中往往隐含着各种各样有用的信息，仅仅依靠数据库的查询检索机制和统计学方法很难获得这些信息，数据和信息之间的鸿沟要求系统地开发数据挖掘工具，将数据坟墓转换成知识金砖，从而达到为决策服务的目的。在这种情况下，一个新的技术——数据挖掘(Data Mining，DM)技术应运而生[2]。数据挖掘正是为了迎合这种需要而产生并迅速发展起来的、用于开发信息资源的、一种新的数据处理技术。数据挖掘通常又称数据库中的知识发现（Knowledge Discovery in Databases），是一个多学科领域，它融合了数据库技术、人工智能、机器学习、统计学、知识工程、信息检索等最新技术的研究成果，其应用非常广泛。只要是有分析价值的数据库，都可以利用数据挖掘工具来挖掘有用的信息。数据挖掘典型的应用领域包括市场、工业生产、金融、医学、科学研究、工程诊断等。本文主要介绍数据挖掘的主要算法及其各自的优缺点，并对国内外的研究现状及研究热点进行了详细的总结，最后指出其发展趋势及问题所在。 1 数据挖掘算法数据挖掘就是从大量的、有噪声的、不完全的、模糊的、随机的实际应用数据中提取有效的、新颖的、潜在有用的知识的非平凡过程[3]。所得到的信息应具有先前未知、有效和实用三个特征。数据挖掘过程如图1所示。这些数据的类型可以是结构化的、半结构化的、甚至是异构型的。发现知识的方法可以是数学的、非数学的、也可以是归纳的。最终被发现了的知识可以用于信息管理、查询优化、决策支持及数据自身的维护等[4]。数据选择：确定发现任务的操作对象,即目标对象；预处理：包括消除噪声、推导计算缺值数据、消除重复记录、完成数据类型转换等；转换：消减数据维数或降维；数据开采：确定开采的任务，如数据总结、分类、聚类、关联规则发现或序列模式发现等，并确定使用什么样的开采算法；解释和评价：数据挖掘阶段发现的模式，经过用户和机器的评价，可能存在冗余或无关的模式，这时需要剔除，使用户更容易理解和应用。十大经典算法如图2：目前，数据挖掘的算法主要包括神经网络法、决策树法、遗传算法、粗糙集法、模糊集法、关联规则法等。

分布式数据库系统(DDBS)概述.

分布式数据库系统(DDBS概述一个远程事务为一个事务,包含一人或多个远程语句,它所引用的全部是在同一个远程结点上.一个分布式事务中一个事务,包含一个或多个语句修改分布式数据库的两个或多个不同结点的数据. 在分布式数据库中,事务控制必须在网络上直辖市,保证数据一致性.两阶段提交机制保证参与分布式事务的全部数据库服务器是全部提交或全部回滚事务中的语句. ORACLE分布式数据库系统结构可由ORACLE数据库管理员为终端用户和应用提供位置透明性,利用视图、同义词、过程可提供ORACLE分布式数据库系统中的位置透明性. ORACLE提供两种机制实现分布式数据库中表重复的透明性：表快照提供异步的表重复;触发器实现同步的表的重复。在两种情况下，都实现了对表重复的透明性。在单场地或分布式数据库中，所有事务都是用COMMIT或ROLLBACK语句中止。二、分布式数据库系统的分类： (1 同构同质型DDBS：各个场地都采用同一类型的数据模型(譬如都是关系型，并且是同一型号的DBMS。 (2同构异质型DDBS：各个场地采用同一类型的数据模型，但是DBMS的型号不同，譬如DB2、ORACLE、SYBASE、SQL Server等。 (3异构型DDBS：各个场地的数据模型的型号不同，甚至类型也不同。随着计算机网络技术的发展，异种机联网问题已经得到较好的解决，此时依靠异构型DDBS就能存取全网中各种异构局部库中的数据。三、分布式数据库系统主要特点： DDBS的基本特点： (1物理分布性：数据不是存储在一个场地上，而是存储在计算机网络的多个场地上。逻辑整体性：数据物理分布在各个场地，但逻辑上是一个整体，它们被所有用户(全局用户共享，并由一个DDBMS统一管理。 (2场地自治性：各场地上的数据由本地的DBMS管理，具有自治处理能力，完成本场地的应用(局部应用。 (3场地之间协作性：各场地虽然具有高度的自治性，但是又相互协作构成一个整体。 DDBS的其他特点 (1数据独立性 (2集中与自治相结合的控制机制 (3适当增加数据冗余度