Translated Foreign Literature on Cloud Computing and Big Data


A View of Cloud Computing

Communications of the ACM, April 2010, Vol. 53, No. 4
hours. This elasticity of resources, without paying a premium for large scale, is unprecedented in the history of IT. As a result, cloud computing is a popular topic for blogging and white papers and has been featured in the title of workshops, conferences, and even magazines. Nevertheless, confusion remains about exactly what it is and when it's useful, causing Oracle's CEO Larry Ellison to vent his frustration: "The interesting thing about cloud computing is that we've redefined cloud computing to include everything that we already do…. I don't understand what we would do differently in the light of cloud computing other than change the wording of some of our ads." Our goal in this article is to reduce that confusion by clarifying terms, providing simple figures to quantify comparisons between cloud and conventional computing, and identifying the top technical and non-technical obstacles and opportunities of cloud computing. (Armbrust et al.4 is a more detailed version of this article.)

Defining Cloud Computing
Cloud computing refers to both the applications delivered as services over the Internet and the hardware and systems software in the data centers that provide those services. The services themselves have long been referred to as Software as a Service (SaaS). Some vendors use terms such as IaaS (Infrastructure as a Service) and PaaS (Platform as a Service) to describe their products, but we eschew these because accepted definitions for them still vary widely. The line between "low-level" infrastructure and a higher-level "platform" is not crisp. We believe the two are more alike than different, and we consider them together. Similarly, the [...]
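As a rough illustration of the kind of simple comparison the authors have in mind, the following back-of-the-envelope Java sketch contrasts pay-as-you-go cloud cost with provisioning an owned data center for peak load. Every price, utilization figure, and workload in it is an assumed value chosen for illustration, not a figure from the article.

// Illustrative only: pay-as-you-go cloud cost versus fixed provisioning for peak.
// All numbers below are assumptions made for this sketch, not data from the article.
public class ElasticityExample {
    public static void main(String[] args) {
        double cloudPricePerServerHour = 0.10;   // assumed on-demand price (USD)
        double ownedServerCostPerHour  = 0.05;   // assumed amortized cost of an owned server (USD)
        double peakServers    = 1000;            // servers needed at peak demand
        double averageServers = 100;             // servers actually busy on average
        double hours = 24 * 365;

        // Cloud: pay only for the server-hours actually used (elasticity).
        double cloudCost = cloudPricePerServerHour * averageServers * hours;

        // Owned data center: must provision for peak, regardless of average utilization.
        double ownedCost = ownedServerCostPerHour * peakServers * hours;

        System.out.printf("Cloud (pay-as-you-go): $%,.0f%n", cloudCost);
        System.out.printf("Owned (provisioned for peak): $%,.0f%n", ownedCost);
        System.out.printf("Under these assumptions the elastic option is %.1fx cheaper%n",
                ownedCost / cloudCost);
    }
}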

Survey of Translated Foreign References on Big Data


(This document contains the English original and its Chinese translation in parallel.) Original text:

Data Mining and Data Publishing

Data mining is the extraction of interesting patterns or knowledge from huge amounts of data. The initial idea of privacy-preserving data mining (PPDM) was to extend traditional data mining techniques to work with data modified to mask sensitive information. The key issues were how to modify the data and how to recover the data mining result from the modified data. Privacy-preserving data mining considers the problem of running data mining algorithms on confidential data that is not supposed to be revealed even to the party running the algorithm. In contrast, privacy-preserving data publishing (PPDP) may not necessarily be tied to a specific data mining task, and the data mining task may be unknown at the time of data publishing. PPDP studies how to transform raw data into a version that is immunized against privacy attacks but that still supports effective data mining tasks. Privacy preservation for both data mining (PPDM) and data publishing (PPDP) has become increasingly popular because it allows sharing of privacy-sensitive data for analysis purposes. One well-studied approach is the k-anonymity model [1], which in turn led to other models such as confidence bounding, l-diversity, t-closeness, (α,k)-anonymity, etc. In particular, all known mechanisms try to minimize information loss, and such an attempt provides a loophole for attacks. The aim of this paper is to present a survey of the most common attack techniques against anonymization-based PPDM and PPDP and to explain their effects on data privacy. Although data mining is potentially useful, many data holders are reluctant to provide their data for data mining for fear of violating individual privacy. In recent years, studies have been made to ensure that the sensitive information of individuals cannot be identified easily.

Anonymity Models. k-anonymization techniques have been the focus of intense research in the last few years. In order to ensure anonymization of data while at the same time minimizing the information loss resulting from data modifications, several extended models have been proposed, which are discussed as follows.

1. k-Anonymity
k-anonymity is one of the most classic models; it is a technique that prevents joining attacks by generalizing and/or suppressing portions of the released microdata so that no individual can be uniquely distinguished from a group of size k. In k-anonymous tables, a data set is k-anonymous (k ≥ 1) if each record in the data set is indistinguishable from at least (k − 1) other records within the same data set. The larger the value of k, the better the privacy is protected. k-anonymity can ensure that individuals cannot be uniquely identified by linking attacks.
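The k-anonymity condition just described can be checked mechanically. The following Java sketch (the record layout, attribute names, and generalized values are assumed for illustration and are not taken from the paper) groups records by their quasi-identifier values and verifies that every resulting equivalence class contains at least k records:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Checks the k-anonymity condition: group rows by their quasi-identifier values
// and require every group (equivalence class) to contain at least k rows.
public class KAnonymityCheck {
    // A record is simply a map from attribute name to (possibly generalized) value.
    static boolean isKAnonymous(List<Map<String, String>> table,
                                List<String> quasiIdentifiers, int k) {
        Map<String, Integer> groupSizes = new HashMap<>();
        for (Map<String, String> row : table) {
            StringBuilder key = new StringBuilder();
            for (String qi : quasiIdentifiers) {
                key.append(row.get(qi)).append('|');   // combined quasi-identifier value
            }
            groupSizes.merge(key.toString(), 1, Integer::sum);
        }
        // Every equivalence class must contain at least k indistinguishable rows.
        return groupSizes.values().stream().allMatch(size -> size >= k);
    }

    public static void main(String[] args) {
        List<Map<String, String>> table = List.of(
            Map.of("age", "2*", "zip", "479**", "disease", "flu"),
            Map.of("age", "2*", "zip", "479**", "disease", "cancer"),
            Map.of("age", "3*", "zip", "130**", "disease", "flu"),
            Map.of("age", "3*", "zip", "130**", "disease", "flu"));
        System.out.println(isKAnonymous(table, List.of("age", "zip"), 2)); // prints true
    }
}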
2. Extending Models
Since k-anonymity does not provide sufficient protection against attribute disclosure, the notion of l-diversity attempts to solve this problem by requiring that each equivalence class has at least l well-represented values for each sensitive attribute. l-diversity has some advantages over k-anonymity, because a k-anonymous dataset still permits strong attacks due to a lack of diversity in the sensitive attributes. In this model, an equivalence class is said to have l-diversity if there are at least l well-represented values for the sensitive attribute. However, there are semantic relationships among the attribute values, and different values have very different levels of sensitivity. In the (α,k)-anonymity model, after anonymization, in any equivalence class the frequency (as a fraction) of a sensitive value is no more than α.

3. Related Research Areas
Several polls show that the public has an increased sense of privacy loss. Since data mining is often a key component of information systems, homeland security systems, and monitoring and surveillance systems, it gives the wrong impression that data mining is a technique for privacy intrusion. This lack of trust has become an obstacle to the benefit of the technology. For example, the potentially beneficial data mining research project Terrorism Information Awareness (TIA) was terminated by the US Congress due to its controversial procedures of collecting, sharing, and analyzing the trails left by individuals. Motivated by the privacy concerns around data mining tools, a research area called privacy-preserving data mining (PPDM) emerged in 2000. The initial idea of PPDM was to extend traditional data mining techniques to work with data modified to mask sensitive information. The key issues were how to modify the data and how to recover the data mining result from the modified data. The solutions were often tightly coupled with the data mining algorithms under consideration. In contrast, privacy-preserving data publishing (PPDP) may not necessarily be tied to a specific data mining task, and the data mining task is sometimes unknown at the time of data publishing. Furthermore, some PPDP solutions emphasize preserving the data truthfulness at the record level, but PPDM solutions often do not preserve such a property. PPDP differs from PPDM in several major ways, as follows:
1) PPDP focuses on techniques for publishing data, not techniques for data mining. In fact, it is expected that standard data mining techniques are applied on the published data. In contrast, the data holder in PPDM needs to randomize the data in such a way that data mining results can be recovered from the randomized data. To do so, the data holder must understand the data mining tasks and algorithms involved. This level of involvement is not expected of the data holder in PPDP, who usually is not an expert in data mining.
2) Neither randomization nor encryption preserves the truthfulness of values at the record level; therefore, the released data are basically meaningless to the recipients. In such a case, the data holder in PPDM may consider releasing the data mining results rather than the scrambled data.
3) PPDP primarily "anonymizes" the data by hiding the identity of record owners, whereas PPDM seeks to directly hide the sensitive data. Excellent surveys and books on randomization and cryptographic techniques for PPDM can be found in the existing literature.
A family of research work called privacy-preserving distributed data mining (PPDDM) aims at performing some data mining task on a set of private databases owned by different parties. It follows the principle of Secure Multiparty Computation (SMC) and prohibits any data sharing other than the final data mining result. Clifton et al. present a suite of SMC operations, like secure sum, secure set union, secure size of set intersection, and scalar product, that are useful for many data mining tasks. In contrast, PPDP does not perform the actual data mining task, but is concerned with how to publish the data so that the anonymized data are useful for data mining. We can say that PPDP protects privacy at the data level while PPDDM protects privacy at the process level. They address different privacy models and data mining scenarios.
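The secure sum operation mentioned above can be illustrated with a deliberately simplified sketch of the classic ring protocol: the initiating party adds a random mask, every party adds its private input in turn, and the initiator removes the mask at the end, so no individual input is ever revealed. The inputs below are assumed, and the sketch ignores collusion and the stronger guarantees that real SMC protocols provide:

import java.security.SecureRandom;

// Simplified ring-based secure sum: party 0 injects a random mask, each party adds
// its private value, and party 0 removes the mask from the final total.
public class SecureSum {
    public static void main(String[] args) {
        long[] privateValues = {12, 7, 30, 5};       // one private input per party (assumed)
        long modulus = 1_000_000L;                   // values are summed modulo a public bound
        SecureRandom rnd = new SecureRandom();

        long mask = Math.floorMod(rnd.nextLong(), modulus);
        long running = mask;                                    // party 0 starts the ring with the mask
        for (long v : privateValues) {
            running = Math.floorMod(running + v, modulus);      // each party adds only its own input
        }
        long sum = Math.floorMod(running - mask, modulus);      // party 0 removes the mask
        System.out.println("Joint sum = " + sum);               // 54; no individual value was revealed
    }
}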
In the field of statistical disclosure control (SDC), the research work focuses on privacy-preserving publishing methods for statistical tables. SDC focuses on three types of disclosures, namely identity disclosure, attribute disclosure, and inferential disclosure. Identity disclosure occurs if an adversary can identify a respondent from the published data. Revealing that an individual is a respondent of a data collection may or may not violate confidentiality requirements. Attribute disclosure occurs when confidential information about a respondent is revealed and can be attributed to the respondent. Attribute disclosure is the primary concern of most statistical agencies in deciding whether to publish tabular data. Inferential disclosure occurs when individual information can be inferred with high confidence from statistical information in the published data. Some other works in SDC focus on the study of the non-interactive query model, in which the data recipients can submit one query to the system. This type of non-interactive query model may not fully address the information needs of data recipients because, in some cases, it is very difficult for a data recipient to accurately construct a query for a data mining task in one shot. Consequently, there is a series of studies on the interactive query model, in which the data recipients, including adversaries, can submit a sequence of queries based on previously received query results. The database server is responsible for keeping track of all queries of each user and determining whether or not the currently received query has violated the privacy requirement with respect to all previous queries. One limitation of any interactive privacy-preserving query system is that it can only answer a sublinear number of queries in total; otherwise, an adversary (or a group of corrupted data recipients) will be able to reconstruct all but a 1 − o(1) fraction of the original data, which is a very strong violation of privacy. When the maximum number of queries is reached, the query service must be closed to avoid privacy leaks. In the case of the non-interactive query model, the adversary can issue only one query and, therefore, the non-interactive query model cannot achieve the same degree of privacy defined by the interactive model. One may consider that privacy-preserving data publishing is a special case of the non-interactive query model.
This paper presents a survey of the most common attack techniques against anonymization-based PPDM and PPDP and explains their effects on data privacy. k-anonymity is used to protect respondents' identity and decreases the risk of linking attacks. In the case of a homogeneity attack, a simple k-anonymity model fails, and we need a concept that prevents this attack; the solution is l-diversity. All tuples are arranged in a well-represented form, and the adversary is diverted across l places, or l sensitive attribute values. l-diversity is limited in the case of a background-knowledge attack, because no one can predict the knowledge level of an adversary. It is observed that, using generalization and suppression, we also apply these techniques to attributes which do not need this extent of privacy, and this reduces the precision of the published table.
e-NSTAM (extended Sensitive Tuples Anonymity Method) is applied to sensitive tuples only and reduces information loss, but this method also fails in the case of multiple sensitive tuples. Generalization with suppression is also a cause of data loss, because suppression emphasizes not releasing values which are not suited to the k factor. Future work in this area can include defining a new privacy measure along with l-diversity for multiple sensitive attributes, and focusing on generalizing attributes without suppression, using other techniques that achieve k-anonymity, because suppression reduces the precision of the published table.

Translation (excerpt): Data Mining and Data Publishing. Data mining is the extraction of interesting patterns or knowledge from large amounts of data.

An English Essay on Big Data


Title: The Impact of Big Data on the Modern World

Introduction

In the age of information, the concept of big data has become increasingly significant. It refers to the vast amount of structured and unstructured data that inundates individuals and organizations on a daily basis. This essay explores the profound impact of big data on various aspects of our modern world.

Big Data in Business

In the realm of business, big data has revolutionized decision-making processes. Companies now harness data analytics tools to extract valuable insights from massive datasets, enabling them to make informed strategic decisions. For instance, retailers analyze customer purchasing patterns to optimize product placement and marketing strategies, thereby enhancing profitability.

Moreover, big data has transformed marketing practices. Through advanced analytics, businesses can personalize marketing campaigns based on individual preferences and behavior, leading to higher customer engagement and conversion rates. Additionally, data-driven insights enable companies to anticipate market trends and stay ahead of competitors in today's dynamic business environment.

Big Data in Healthcare

Translated Foreign References on Cloud Computing


(This document contains the English original and its Chinese translation in parallel.) Original text:

Technical Issues of Forensic Investigations in Cloud Computing Environments
Dominik Birk
Ruhr-University Bochum, Horst Goertz Institute for IT Security, Bochum, Germany

Abstract—Cloud Computing is arguably one of the most discussed information technologies today. It presents many promising technological and economical opportunities. However, many customers remain reluctant to move their business IT infrastructure completely to the cloud. One of their main concerns is Cloud Security and the threat of the unknown. Cloud Service Providers (CSP) encourage this perception by not letting their customers see what is behind their virtual curtain. A seldom discussed, but in this regard highly relevant, open issue is the ability to perform digital investigations. This continues to fuel insecurity on the sides of both providers and customers. Cloud Forensics constitutes a new and disruptive challenge for investigators. Due to the decentralized nature of data processing in the cloud, traditional approaches to evidence collection and recovery are no longer practical. This paper focuses on the technical aspects of digital forensics in distributed cloud environments. We contribute by assessing whether it is possible for the customer of cloud computing services to perform a traditional digital investigation from a technical point of view. Furthermore, we discuss possible solutions and possible new methodologies helping customers to perform such investigations.

(Acknowledgments: We would like to thank the reviewers for the helpful comments and Dennis Heinson (Center for Advanced Security Research Darmstadt - CASED) for the profound discussions regarding the legal aspects of cloud forensics.)

I. INTRODUCTION
Although the cloud might appear attractive to small as well as to large companies, it does not come along without its own unique problems. Outsourcing sensitive corporate data into the cloud raises concerns regarding the privacy and security of data. Security policies, a company's main pillar concerning security, cannot be easily deployed into distributed, virtualized cloud environments. This situation is further complicated by the unknown physical location of the company's assets. Normally, if a security incident occurs, the corporate security team wants to be able to perform its own investigation without dependency on third parties. In the cloud, this is not possible anymore: the CSP obtains all the power over the environment and thus controls the sources of evidence. In the best case, a trusted third party acts as a trustee and guarantees the trustworthiness of the CSP. Furthermore, the implementation of the technical architecture and circumstances within cloud computing environments bias the way an investigation may be processed. In detail, evidence data has to be interpreted by an investigator in a proper manner, which is hardly possible due to the lack of circumstantial information. For auditors, this situation does not change: questions about who accessed specific data and information cannot be answered by the customers if no corresponding logs are available. With the increasing demand for using the power of the cloud for processing also sensitive information and data, enterprises face the issue of Data and Process Provenance in the cloud [10]. Digital provenance, meaning meta-data that describes the ancestry or history of a digital object, is a crucial feature for forensic investigations.
In combination with a suitable authentication scheme, it provides information about who created and who modified what kind of data in the cloud. These are crucial aspects for digital investigations in distributed environments such as the cloud. Unfortunately, the aspects of forensic investigations in distributed environments have so far been mostly neglected by the research community. Current discussion centers mostly around security, privacy and data protection issues [35], [9], [12]. The impact of forensic investigations on cloud environments was little noticed, albeit mentioned by the authors of [1] in 2009: "[...] to our knowledge, no research has been published on how cloud computing environments affect digital artifacts, and on acquisition logistics and legal issues related to cloud computing environments." This statement is also confirmed by other authors [34], [36], [40], stressing that further research on incident handling, evidence tracking and accountability in cloud environments has to be done. At the same time, massive investments are being made in cloud technology. Combined with the fact that information technology increasingly transcends people's private and professional lives, thus mirroring more and more of people's actions, it becomes apparent that evidence gathered from cloud environments will be of high significance to litigation or criminal proceedings in the future. Within this work, we focus on the notion of cloud forensics by addressing the technical issues of forensics in all three major cloud service models and consider cross-disciplinary aspects. Moreover, we address the usability of various sources of evidence for investigative purposes and propose potential solutions to the issues from a practical standpoint. This work should be considered as a surveying discussion of an almost unexplored research area. The paper is organized as follows: we discuss the related work and the fundamental technical background information on digital forensics, cloud computing and the fault model in sections II and III. In section IV, we focus on the technical issues of cloud forensics and discuss the potential sources and nature of digital evidence as well as investigations in XaaS environments, including the cross-disciplinary aspects. We conclude in section V.

II. RELATED WORK
Various works have been published in the field of cloud security and privacy [9], [35], [30], focusing on aspects of protecting data in multi-tenant, virtualized environments. Desired security characteristics for current cloud infrastructures mainly revolve around isolation of multi-tenant platforms [12], security of hypervisors in order to protect virtualized guest systems, and secure network infrastructures [32]. Albeit digital provenance, describing the ancestry of digital objects, still remains a challenging issue for cloud environments, several works have already been published in this field [8], [10], contributing to the issues of cloud forensics. Within this context, cryptographic proofs for verifying data integrity, mainly in cloud storage offers, have been proposed, yet they lack practical implementations [24], [37], [23]. Traditional computer forensics already has well-researched methods for various fields of application [4], [5], [6], [11], [13]. Also the aspects of forensics in virtual systems have been addressed by several works [2], [3], [20], including the notion of virtual introspection [25].
In addition, the NIST has already addressed Web Service Forensics [22], which has a huge impact on investigation processes in cloud computing environments. In contrast, the aspects of forensic investigations in cloud environments have mostly been neglected by both the industry and the research community. One of the first papers focusing on this topic was published by Wolthusen [40] after Bebee et al. had already introduced problems within cloud environments [1]. Wolthusen stressed that there is an inherent strong need for interdisciplinary work linking the requirements and concepts of evidence arising from the legal field to what can be feasibly reconstructed and inferred algorithmically or in an exploratory manner. In 2010, Grobauer et al. [36] published a paper discussing the issues of incident response in cloud environments; unfortunately, no specific issues and solutions of cloud forensics have been proposed there, which will be done within this work.

III. TECHNICAL BACKGROUND
A. Traditional Digital Forensics
The notion of Digital Forensics is widely known as the practice of identifying, extracting and considering evidence from digital media. Unfortunately, digital evidence is both fragile and volatile and therefore requires the attention of special personnel and methods in order to ensure that evidence data can be properly isolated and evaluated. Normally, the process of a digital investigation can be separated into three different steps, each having its own specific purpose:
1) In the Securing Phase, the major intention is the preservation of evidence for analysis. The data has to be collected in a manner that maximizes its integrity. This is normally done by a bitwise copy of the original media. As can be imagined, this represents a huge problem in the field of cloud computing, where you never know exactly where your data is and additionally do not have access to any physical hardware. However, the snapshot technology, discussed in section IV-B3, provides a powerful tool to freeze system states and thus makes digital investigations, at least in IaaS scenarios, theoretically possible.
2) We refer to the Analyzing Phase as the stage in which the data is sifted and combined. It is in this phase that the data from multiple systems or sources is pulled together to create as complete a picture and event reconstruction as possible. Especially in distributed system infrastructures, this means that bits and pieces of data are pulled together for deciphering the real story of what happened and for providing a deeper look into the data.
3) Finally, at the end of the examination and analysis of the data, the results of the previous phases will be reprocessed in the Presentation Phase. The report created in this phase is a compilation of all the documentation and evidence from the analysis stage. The main intention of such a report is that it contains all results and is complete and clear to understand.
Apparently, the success of these three steps strongly depends on the first stage. If it is not possible to secure the complete set of evidence data, no exhaustive analysis will be possible. However, in real-world scenarios often only a subset of the evidence data can be secured by the investigator. In addition, an important definition in the general context of forensics is the notion of a Chain of Custody. This chain clarifies how and where evidence is stored and who takes possession of it. Especially for cases which are brought to court it is crucial that the chain of custody is preserved.
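To make the Securing Phase and the Chain of Custody concrete: in practice the investigator records a cryptographic fingerprint of every acquired image so that later stages can show the copy was not altered. The following Java sketch computes such a SHA-256 fingerprint with the standard library only; the file name is a hypothetical placeholder, and the sketch illustrates common practice rather than a method proposed by this paper.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;

// Computes a SHA-256 fingerprint of an acquired evidence image so that later
// analysis steps can prove the bitwise copy has not been altered.
public class EvidenceHash {
    public static void main(String[] args) throws Exception {
        Path image = Paths.get("vm-snapshot.img");   // hypothetical acquired image file
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(image)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                sha256.update(buffer, 0, read);      // hash the image block by block
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sha256.digest()) {
            hex.append(String.format("%02x", b));
        }
        // This value would be recorded in the chain-of-custody documentation.
        System.out.println("SHA-256: " + hex);
    }
}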
B. Cloud Computing
According to the NIST [16], cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal CSP interaction. The new raw definition of cloud computing brought several new characteristics such as multi-tenancy, elasticity, pay-as-you-go and reliability. Within this work, the following three models are used: In the Infrastructure as a Service (IaaS) model, the customer uses the virtual machine provided by the CSP to install his own system on it. The system can be used like any other physical computer with a few limitations. However, the additional customer power over the system comes along with additional security obligations. Platform as a Service (PaaS) offerings provide the capability to deploy application packages created using the virtual development environment supported by the CSP. For the efficiency of the software development process, this service model can be a driving force. In the Software as a Service (SaaS) model, the customer makes use of a service run by the CSP on a cloud infrastructure. In most of the cases this service can be accessed through an API or a thin client interface such as a web browser. Closed-source public SaaS offers such as Amazon S3 and GoogleMail can only be used in the public deployment model, leading to further issues concerning security, privacy and the gathering of suitable evidence. Furthermore, two main deployment models, private and public cloud, have to be distinguished. Common public clouds are made available to the general public. The corresponding infrastructure is owned by one organization acting as a CSP and offering services to its customers. In contrast, the private cloud is exclusively operated for an organization but may not provide the scalability and agility of public offers. The additional notions of community and hybrid cloud are not exclusively covered within this work. However, independently from the specific model used, the movement of applications and data to the cloud comes along with limited control for the customer over the application itself, the data pushed into the applications and also the underlying technical infrastructure.

C. Fault Model
Be it an account for a SaaS application, a development environment (PaaS) or a virtual image of an IaaS environment, systems in the cloud can be affected by inconsistencies. Hence, for both customer and CSP it is crucial to have the ability to assign faults to the causing party, even in the presence of Byzantine behavior [33]. Generally, inconsistencies can be caused by the following two reasons:
1) Maliciously Intended Faults
Internal or external adversaries with specific malicious intentions can cause faults on cloud instances or applications. Economic rivals as well as former employees can be the reason for these faults and constitute a constant threat to customers and CSP. In this model, a malicious CSP is also included, albeit it is assumed to be rare in real-world scenarios. Additionally, from the technical point of view, the movement of computing power to a virtualized, multi-tenant environment can pose further threats and risks to the systems. One reason for this is that if a single system or service in the cloud is compromised, all other guest systems and even the host system are at risk.
Hence, besides the need for further security measures, precautions for potential forensic investigations have to be taken into consideration.
2) Unintentional Faults
Inconsistencies in technical systems or processes in the cloud do not necessarily have to be caused by malicious intent. Internal communication errors or human failures can lead to issues in the services offered to the customer (i.e., loss or modification of data). Although these failures are not caused intentionally, both the CSP and the customer have a strong intention to discover the reasons and deploy corresponding fixes.

IV. TECHNICAL ISSUES
Digital investigations are about control of forensic evidence data. From the technical standpoint, this data can be available in three different states: at rest, in motion or in execution. Data at rest is represented by allocated disk space. Whether the data is stored in a database or in a specific file format, it allocates disk space. Furthermore, if a file is deleted, the disk space is de-allocated for the operating system but the data is still accessible, since the disk space has not been re-allocated and overwritten. This fact is often exploited by investigators, who explore this de-allocated disk space on hard disks. In case the data is in motion, data is transferred from one entity to another; e.g., a typical file transfer over a network can be seen as a data-in-motion scenario. Several encapsulated protocols contain the data, each leaving specific traces on systems and network devices which can in return be used by investigators. Data can be loaded into memory and executed as a process. In this case, the data is neither at rest nor in motion but in execution. On the executing system, process information, machine instructions and allocated/de-allocated data can be analyzed by creating a snapshot of the current system state. In the following sections, we point out the potential sources of evidential data in cloud environments and discuss the technical issues of digital investigations in XaaS environments, as well as suggest several solutions to these problems.

A. Sources and Nature of Evidence
Concerning the technical aspects of forensic investigations, the amount of potential evidence available to the investigator strongly diverges between the different cloud service and deployment models. The virtual machine (VM), hosting in most of the cases the server application, provides several pieces of information that could be used by investigators. On the network level, network components can provide information about possible communication channels between the different parties involved. The browser on the client, often acting as the user agent for communicating with the cloud, also contains a lot of information that could be used as evidence in a forensic investigation. Independently of the model used, the following three components could act as sources of potential evidential data.
1) Virtual Cloud Instance: The VM within the cloud, where, e.g., data is stored or processes are handled, contains potential evidence [2], [3]. In most of the cases, it is the place where an incident happened and hence provides a good starting point for a forensic investigation. The VM instance can be accessed by both the CSP and the customer who is running the instance. Furthermore, virtual introspection techniques [25] provide access to the runtime state of the VM via the hypervisor, and snapshot technology supplies a powerful technique for the customer to freeze specific states of the VM.
Therefore, virtual instances can still be running during analysis, which leads to the case of live investigations [41], or can be turned off, leading to static image analysis. In SaaS and PaaS scenarios, the ability to access the virtual instance for gathering evidential information is highly limited or simply not possible.
2) Network Layer: Traditional network forensics is known as the analysis of network traffic logs for tracing events that have occurred in the past. Since the different ISO/OSI network layers provide various pieces of information on protocols and communication between instances within as well as with instances outside the cloud [4], [5], [6], network forensics is theoretically also feasible in cloud environments. In practice, however, ordinary CSP currently do not provide any log data from the network components used by the customer's instances or applications. For instance, in the case of a malware infection of an IaaS VM, it will be difficult for the investigator to get any form of routing information and network log data in general, which is crucial for further investigative steps. This situation gets even more complicated in the case of PaaS or SaaS. So again, the situation of gathering forensic evidence is strongly affected by the support the investigator receives from the customer and the CSP.
3) Client System: On the system layer of the client, it completely depends on the used model (IaaS, PaaS, SaaS) if and where potential evidence could be extracted. In most of the scenarios, the user agent (e.g., the web browser) on the client system is the only application that communicates with the service in the cloud. This especially holds for SaaS applications which are used and controlled by the web browser. But also in IaaS scenarios, the administration interface is often controlled via the browser. Hence, in an exhaustive forensic investigation, the evidence data gathered from the browser environment [7] should not be omitted.
a) Browser Forensics: Generally, the circumstances leading to an investigation have to be differentiated: in ordinary scenarios, the main goal of an investigation of the web browser is to determine if a user has been the victim of a crime. In complex SaaS scenarios with high client-server interaction, this constitutes a difficult task. Additionally, customers make strong use of third-party extensions [17] which can be abused for malicious purposes. Hence, the investigator might want to look for malicious extensions, searches performed, websites visited, files downloaded, information entered in forms or stored in local HTML5 stores, web-based email contents and persistent browser cookies for gathering potential evidence data. Within this context, it is inevitable to investigate the appearance of malicious JavaScript [18] leading to, e.g., unintended AJAX requests and hence modified usage of administration interfaces. Generally, the web browser contains a lot of electronic evidence data that could be used to give an answer to both of the above questions, even if the private mode is switched on [19].
B. Investigations in XaaS Environments
Traditional digital forensic methodologies permit investigators to seize equipment and perform detailed analysis on the media and data recovered [11]. In a distributed infrastructure organization like the cloud computing environment, investigators are confronted with an entirely different situation. They no longer have the option of seizing physical data storage.
Data and processes of the customer are dispersed over an undisclosed number of virtual instances, applications and network elements. Hence, it is questionable whether preliminary findings of the computer forensic community in the field of digital forensics have to be revised and adapted to the new environment. Within this section, specific issues of investigations in SaaS, PaaS and IaaS environments will be discussed. In addition, cross-disciplinary issues which affect several environments uniformly will be taken into consideration. We also suggest potential solutions to the mentioned problems.
1) SaaS Environments: Especially in the SaaS model, the customer does not obtain any control of the underlying operating infrastructure such as the network, servers, operating systems or the application that is used. This means that no deeper view into the system and its underlying infrastructure is provided to the customer. Only limited user-specific application configuration settings can be controlled, contributing to the evidence which can be extracted from the client (see section IV-A3). In a lot of cases this forces the investigator to rely on high-level logs which are eventually provided by the CSP. Given the case that the CSP does not run any logging application, the customer has no opportunity to create any useful evidence through the installation of any toolkit or logging tool. These circumstances do not allow a valid forensic investigation and lead to the assumption that customers of SaaS offers do not have any chance to analyze potential incidents.
a) Data Provenance: The notion of Digital Provenance is known as meta-data that describes the ancestry or history of digital objects. Secure provenance that records ownership and process history of data objects is vital to the success of data forensics in cloud environments, yet it is still a challenging issue today [8]. Albeit data provenance is of high significance also for IaaS and PaaS, it poses a huge problem specifically for SaaS-based applications: currently, globally acting public SaaS CSP offer Single Sign-On (SSO) access control to the set of their services. Unfortunately, in the case of an account compromise, most of the CSP do not offer any possibility for the customer to figure out which data and information have been accessed by the adversary. For the victim, this situation can have tremendous impact: if sensitive data has been compromised, it is unclear which data has been leaked and which has not been accessed by the adversary. Additionally, data could be modified or deleted by an external adversary or even by the CSP, e.g. due to storage reasons. The customer has no ability to prove otherwise. Secure provenance mechanisms for distributed environments can improve this situation but have not been practically implemented by CSP [10].
Suggested Solution: In private SaaS scenarios this situation is improved by the fact that the customer and the CSP are probably under the same authority. Hence, logging and provenance mechanisms could be implemented which contribute to potential investigations. Additionally, the exact location of the servers and the data is known at any time. Public SaaS CSP should offer additional interfaces for the purpose of compliance, forensics, operations and security matters to their customers. Through an API, the customers should have the ability to receive specific information such as access, error and event logs that could improve their situation in case of an investigation.
Furthermore, due to the limited ability to receive forensic information from the server and to prove the integrity of stored data in SaaS scenarios, the client has to contribute to this process. This could be achieved by implementing Proofs of Retrievability (POR), in which a verifier (client) is enabled to determine that a prover (server) possesses a file or data object and that it can be retrieved unmodified [24]. Provable Data Possession (PDP) techniques [37] could be used to verify that an untrusted server possesses the original data without the need for the client to retrieve it. Although these cryptographic proofs have not been implemented by any CSP, the authors of [23] introduced a new data integrity verification mechanism for SaaS scenarios which could also be used for forensic purposes.
2) PaaS Environments: One of the main advantages of the PaaS model is that the developed software application is under the control of the customer and, except for some CSP, the source code of the application does not have to leave the local development environment. Given these circumstances, the customer theoretically obtains the power to dictate how the application interacts with other dependencies such as databases, storage entities etc. CSP normally claim this transfer is encrypted, but this statement can hardly be verified by the customer. Since the customer has the ability to interact with the platform over a prepared API, system states and specific application logs can be extracted. However, potential adversaries who can compromise the application during runtime should not be able to alter these log files afterwards.
Suggested Solution: Depending on the runtime environment, logging mechanisms could be implemented which automatically sign and encrypt the log information before its transfer to a central logging server under the control of the customer. Additional signing and encrypting could prevent potential eavesdroppers from being able to view and alter log data information on the way to the logging server. Runtime compromise of a PaaS application by adversaries could be monitored by push-only mechanisms for log data, presupposing that the information needed to detect such an attack is logged. Increasingly, CSP offering PaaS solutions give developers the ability to collect and store a variety of diagnostics data in a highly configurable way with the help of runtime feature sets [38].
3) IaaS Environments: As expected, even virtual instances in the cloud get compromised by adversaries. Hence, the ability to determine how defenses in the virtual environment failed and to what extent the affected systems have been compromised is crucial not only for recovering from an incident. Forensic investigations also gain leverage from such information and contribute to resilience against future attacks on the systems. From the forensic point of view, IaaS instances provide much more evidence data usable for potential forensics than PaaS and SaaS models do. This fact is caused by the ability of the customer to install and set up the image for forensic purposes before an incident occurs. Hence, as proposed for PaaS environments, log data and other forensic evidence information could be signed and encrypted before it is transferred to third-party hosts, mitigating the chance that a maliciously motivated shutdown process destroys the volatile data. Although IaaS environments provide plenty of potential evidence, it has to be emphasized that the customer VM is in the end still under the control of the CSP.
The CSP controls the hypervisor, which is, e.g., responsible for enforcing hardware boundaries and routing hardware requests among different VM. Hence, besides the security responsibilities of the hypervisor, the CSP exerts tremendous control over how the customer's VM communicate with the hardware and can theoretically intervene in processes executed on the hosted virtual instance through virtual introspection [25]. This could also affect encryption or signing processes executed on the VM and therefore lead to the leakage of the secret key. Although this risk can be disregarded in most of the cases, the impact on the security of high-security environments is tremendous.
a) Snapshot Analysis: Traditional forensics expects target machines to be powered down in order to collect an image (dead virtual instance). This situation completely changed with the advent of the snapshot technology, which is supported by all popular hypervisors such as Xen, VMware ESX and Hyper-V. A snapshot, also referred to as the forensic image of a VM, provides a powerful tool with which a virtual instance can be cloned with one click, including the running system's memory. Due to the invention of the snapshot technology, systems hosting crucial business processes do not have to be powered down for forensic investigation purposes. The investigator simply creates and loads a snapshot of the target VM for analysis (live virtual instance). This behavior is especially important for scenarios in which a downtime of a system is not feasible or practical due to existing SLA. However, the information whether the machine is running or has been properly powered down is crucial [3] for the investigation. Live investigations of running virtual instances become more common, providing evidence data that [...]
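The signing and encrypting of log data suggested above for PaaS and IaaS scenarios can be sketched with the standard Java cryptography APIs. The algorithm choices, ad-hoc keys, and example log entry below are illustrative assumptions rather than recommendations from the paper, and transport to the customer-controlled logging server is omitted:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.Mac;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Base64;

// Sketch: authenticate a log entry with an HMAC and encrypt it before it is pushed
// to a central logging server controlled by the customer. Keys are generated ad hoc
// here; a real deployment would provision and protect them separately.
public class ProtectedLogEntry {
    public static void main(String[] args) throws Exception {
        // Example log entry (assumed); in practice this comes from the application.
        String logLine = "2011-04-01T12:00:00Z app=shop event=admin-login user=alice";

        SecretKey macKey = KeyGenerator.getInstance("HmacSHA256").generateKey();
        SecretKey encKey = KeyGenerator.getInstance("AES").generateKey();

        // 1) Sign: the logging server (and later an investigator) can detect tampering.
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(macKey);
        byte[] tag = mac.doFinal(logLine.getBytes(StandardCharsets.UTF_8));

        // 2) Encrypt: eavesdroppers on the way to the logging server learn nothing.
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, encKey, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(logLine.getBytes(StandardCharsets.UTF_8));

        // The protected record (tag, iv, ciphertext) is what gets pushed to the logging server.
        System.out.println("tag=" + Base64.getEncoder().encodeToString(tag));
        System.out.println("iv="  + Base64.getEncoder().encodeToString(iv));
        System.out.println("log=" + Base64.getEncoder().encodeToString(ciphertext));
    }
}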

Translated Foreign Materials for Computer Science


English original:

Object persistence and Java
By Arsalan Saljoughy, 05/01/97

Object durability, or persistence, is the term you often hear used in conjunction with the issue of storing objects in databases. Persistence is expected to operate with transactional integrity, and as such it is subject to strict conditions. (See the Resources section of this article for more information on transaction processing.) In contrast, language services offered through standard language libraries and packages are often free from transactional constraints. As we'll see in this article, evidence suggests that simple Java persistence will likely stem from the language itself, while sophisticated database functionality will be offered by database vendors.

No object is an island
In the real world, you rarely find an object that lacks relations to other objects. Objects are components of object models. The issue of object durability transcends the issue of object model durability and distribution once we make the observation that objects are interconnected by virtue of their relations to one another. The relational approach to data storage tends to aggregate data by type. Rows in a table represent the physical aggregate of objects of the same type on disk. The relationships among objects are then represented by keys that are shared across many tables. Although through database organization, relational databases sometimes allow tables that are likely to be used together to be co-located (or clustered) in the same logical partition, such as a database segment, they have no mechanism to store object relationships in the database. Hence, in order to construct an object model, these relationships are constructed from the existing keys at run time in a process referred to as table joins. This is the same well-known property of the relational databases called data independence. Nearly all variants of object databases offer some mechanism to enhance the performance of a system that involves complex object relationships over traditional relational databases.

To query or to navigate?
In storing objects on disk, we are faced with the choice of co-locating related objects to better accommodate navigational access, or storing objects in table-like collections that aggregate objects by type to facilitate predicate-based access (queries), or both. The co-location of objects in persistent storage is an area where relational and object-oriented databases widely differ. The choice of the query language is another area of consideration. Structured Query Language (SQL) and extensions of it have provided relational systems with a predicate-based access mechanism. Object Query Language (OQL) is an object variant of SQL, standardized by ODMG, but support for this language is currently scant. Polymorphic methods offer unprecedented elegance in constructing a semantic query for a collection of objects. For example, imagine a polymorphic behavior for account called isInGoodStanding. It may return the Boolean true for all accounts in good standing, and false otherwise. Now imagine the elegance of querying the collection of accounts, where isInGoodStanding is implemented differently based on business rules, for all accounts in good standing. It may look something like:

setOfGoodCustomers = setOfAccounts.query(account.isInGoodStanding());

While several of the existing object databases are capable of processing such a query style in C++ and Smalltalk, it is difficult for them to do so for larger (say, 500+ gigabytes) collections and more complex query expressions.
Several of the relational database companies, such as Oracle and Informix, will soon offer other, SQL-based syntax to achieve the same result.

Persistence and type
An object-oriented language aficionado would say persistence and type are orthogonal properties of an object; that is, persistent and transient objects of the same type can be identical because one property should not influence the other. The alternative view holds that persistence is a behavior supported only by persistable objects and certain behaviors may apply only to persistent objects. The latter approach calls for methods that instruct persistable objects to store and retrieve themselves from persistent storage, while the former affords the application a seamless view of the entire object model -- often by extending the virtual memory system.

Canonicalization and language independence
Objects of the same type in a language should be stored in persistent storage with the same layout, regardless of the order in which their interfaces appear. The processes of transforming an object layout to this common format are collectively known as canonicalization of object representation. In compiled languages with static typing (not Java), objects written in the same language, but compiled under different systems, should be identically represented in persistent storage. An extension of canonicalization addresses language-independent object representation. If objects can be represented in a language-independent fashion, it will be possible for different representations of the same object to share the same persistent storage. One mechanism to accomplish this task is to introduce an additional level of indirection through an interface definition language (IDL). Object database interfaces can be made through the IDL and the corresponding data structures. The downside of IDL-style bindings is twofold: First, the extra level of indirection always requires an additional level of translation, which impacts the overall performance of the system; second, it limits use of database services that are unique to particular vendors and that might be valuable to application developers. A similar mechanism is to support object services through an extension of SQL. Relational database vendors and smaller object/relational vendors are proponents of this approach; however, how successful these companies will be in shaping the framework for object storage remains to be seen. But the question remains: Is object persistence part of the object's behavior or is it an external service offered to objects via separate interfaces? How about collections of objects and methods for querying them? Relational, extended relational, and object/relational approaches tend to advocate a separation between the language and persistence services, while object databases -- and the Java language itself -- see persistence as intrinsic to the language.

Native Java persistence via serialization
Object serialization is the Java language-specific mechanism for the storage and retrieval of Java objects and primitives to streams. It is worth noting that although commercial third-party libraries for serializing C++ objects have been around for some time, C++ has never offered a native mechanism for object serialization. Here's how to use Java's serialization:

// Writing "foo" to a stream (for example, a file)
// Step 1. Create an output stream
// that is, create a bucket to receive the bytes
FileOutputStream out = new FileOutputStream("fooFile");
// Step 2. Create an ObjectOutputStream
// that is, create a hose and put its head in the bucket
ObjectOutputStream os = new ObjectOutputStream(out);
// Step 3. Write a string and an object to the stream
// that is, let the stream flow into the bucket
os.writeObject("foo");
os.writeObject(new Foo());
// Step 4. Flush the data to its destination
os.flush();

The writeObject method serializes foo and its transitive closure -- that is, all objects that can be referenced from foo within the graph. Within the stream only one copy of the serialized object exists. Other references to the objects are stored as object handles to save space and avoid circular references. The serialized object starts with the class followed by the fields of each class in the inheritance hierarchy.

// Reading an object from a stream
// Step 1. Create an input stream
FileInputStream in = new FileInputStream("fooFile");
// Step 2. Create an object input stream
ObjectInputStream ins = new ObjectInputStream(in);
// Step 3. Got to know what you are reading
String fooString = (String)ins.readObject();
Foo foo = (Foo)ins.readObject();

Object serialization and security
By default, serialization writes and reads non-static and non-transient fields from the stream. This characteristic can be used as a security mechanism by declaring fields that may not be serialized as private transient. If a class may not be serialized at all, writeObject and readObject methods should be implemented to throw NoAccessException.

Persistence with transactional integrity: Introducing JDBC
Modeled after X/Open's SQL CLI (Call Level Interface) and Microsoft's ODBC abstractions, Java database connectivity (JDBC) aims to provide a database connectivity mechanism that is independent of the underlying database management system (DBMS). To become JDBC-compliant, drivers need to support at least the ANSI SQL-2 entry-level API, which gives third-party tool vendors and applications enough flexibility for database access. JDBC is designed to be consistent with the rest of the Java system. Vendors are encouraged to write an API that is more strongly typed than ODBC, which affords greater static type-checking at compile time. Here's a description of the most important JDBC interfaces:
java.sql.DriverManager handles the loading of drivers and provides support for new database connections.
java.sql.Connection represents a connection to a particular database.
java.sql.Statement acts as a container for executing an SQL statement on a given connection.
java.sql.ResultSet controls access to the result set.
You can implement a JDBC driver in several ways. The simplest would be to build the driver as a bridge to ODBC. This approach is best suited for tools and applications that do not require high performance. A more extensible design would introduce an extra level of indirection to the DBMS server by providing a JDBC network driver that accesses the DBMS server through a published protocol. The most efficient driver, however, would directly access the DBMS proprietary API.
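A minimal usage sketch of the JDBC interfaces just listed follows; the connection URL, credentials, table, and column names are placeholders, and any JDBC-compliant driver could sit behind them:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal JDBC round trip using the classes described above. The URL, credentials
// and schema are assumptions for illustration, not part of the original article.
public class JdbcSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:odbc:accounts";                                  // hypothetical data source
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement stmt = con.createStatement();                        // container for one SQL statement
             ResultSet rs = stmt.executeQuery("SELECT id, balance FROM account")) {
            while (rs.next()) {                                             // walk the result set
                System.out.println(rs.getLong("id") + " " + rs.getBigDecimal("balance"));
            }
        }
    }
}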
Object databases and Java persistence
A number of ongoing projects in the industry offer Java persistence at the object level. However, as of this writing, Object Design's PSE (Persistent Storage Engine) and PSE Pro are the only fully Java-based, object-oriented database packages available (at least, that I am aware of). Check the Resources section for more information on PSE and PSE Pro. Java development has led to a departure from the traditional development paradigm for software vendors, most notably in the development process timeline. For example, PSE and PSE Pro are developed in a heterogeneous environment. And because there isn't a linking step in the development process, developers have been able to create various functional components independent of each other, which results in better, more reliable object-oriented code. PSE Pro has the ability to recover a corrupted database from an aborted transaction caused by system failure. The classes that are responsible for this added functionality are not present in the PSE release. No other differences exist between the two products. These products are what we call "dribbleware" -- software releases that enhance their functionality by plugging in new components. In the not-so-distant future, the concept of purchasing large, monolithic software will become a thing of the past. The new business environment in cyberspace, together with Java computing, enables users to purchase only those parts of the object model (object graph) they need, resulting in more compact end products.
PSE works by post-processing and annotating class files after they have been created by the developer. From PSE's point of view, classes in an object graph are either persistent-capable or persistent-aware. Persistent-capable classes may persist themselves while persistent-aware classes can operate on persistent objects. This distinction is necessary because persistence may not be a desired behavior for certain classes. The class file post-processor makes the following modifications to classes:
Modifies the class to inherit from odi.Persistent or odi.util.HashPersistent.
Defines the initializeContents() method to load real values into hollow instances of your Persistent subclass. ObjectStore provides methods on the GenericObject class that retrieve each Field type. Be sure to call the correct methods for the fields in your persistent object. A separate method is available for obtaining each type of Field object. ObjectStore calls the initializeContents() method as needed. The method signature is:
public void initializeContents(GenericObject genObj)
Defines the flushContents() method to copy values from a modified instance (active persistent object) back to the database. ObjectStore provides methods on the GenericObject class. Be sure to call the correct methods for the fields in your persistent object. A separate method is available for setting each type of Field object. ObjectStore calls the flushContents() method as needed. The method signature is:
public void flushContents(GenericObject genObj)
Defines the clearContents() method to reset the values of an instance to the default values. This method must set all reference fields that referred to persistent objects to null. ObjectStore calls this method as needed. The method signature is:
public void clearContents()
Modifies the methods that reference non-static fields to call the Persistent.fetch() and Persistent.dirty() methods as needed. These methods must be called before the contents of persistent objects can be accessed or modified, respectively. While this step is not mandatory, it does provide a systematic way to ensure that the fetch() or dirty() method is called prior to accessing or updating object content.
Defines a class that provides schema information about the persistence-capable class.
All these steps can be completed either manually or automatically.

PSE's transaction semantics
You old-time users of ObjectStore probably will find the database and transaction semantics familiar.
There is a system-wide ObjectStore object that initializes the environment and is responsible for system-wide parameters. The Database class offers methods (such as create, open, and close), and the Transaction class has methods to begin, abort, or commit transactions. As with serialization, you need to find an entry point into the object graph; the getRoot and setRoot methods of the Database class serve this function. I think a few examples would be helpful here. This first snippet shows how to initialize ObjectStore:

ObjectStore.initialize(serverName, null);
try {
    db = Database.open(dbName, Database.openUpdate);
} catch (DatabaseNotFoundException exception) {
    db = Database.create(dbName, 0664);
}

This next snippet shows how to start and commit a transaction:

Transaction transaction = Transaction.begin(Transaction.update);
try {
    foo = (Foo)db.getRoot("fooHead");
} catch (DatabaseRootNotFoundException exception) {
    db.createRoot("fooHead", new Foo());
}
transaction.commit();

The three classes specified above -- Transaction, Database, and ObjectStore -- are fundamental classes for ObjectStore. PSE 1.0 does not support nested transactions, backup and recovery, clustering, large databases, object security beyond what is available in the language, or any type of distribution. What is exciting, however, is that all of this functionality will be incrementally added to the same foundation as the product matures.

About the author
Arsalan Saljoughy is a systems engineer specializing in object technology at Sun Microsystems. He earned his M.S. in mathematics from SUNY at Albany, and subsequently was a research fellow at the University of Berlin. Before joining Sun, he worked as a developer and as an IT consultant to financial services companies.

Conclusion
Although it is still too early to establish which methodology for object persistence in general, and Java persistence in particular, will be dominant in the future, it is safe to assume that a myriad of such styles will co-exist. The shift toward storing objects as objects, without disassembly into rows and columns, is sure to be slow, but it will happen. In the meantime, we are more likely to see object databases better utilized in advanced engineering and telecommunications applications than in banking and back-office financial applications.

英文翻译
对象持久化和Java - 深入了解面向对象语言中对象持久化的讨论
Arsalan Saljoughy, 05/01/97
对象持久化这个术语你常常会和数据存储一起听到。

大数据应用的英文作文

Title: The Application of Big Data: Transforming Industries

In today's digital age, the proliferation of data has become unprecedented, ushering in the era of big data. This vast amount of data holds immense potential, revolutionizing various sectors and industries. In this essay, we will explore the applications of big data and its transformative impact across different domains.

One of the primary areas where big data has made significant strides is in healthcare. With the advent of electronic health records (EHRs) and wearable devices, healthcare providers can now collect and analyze vast amounts of patient data in real-time. This data includes vital signs, medical history, genomic information, and more. By applying advanced analytics and machine learning algorithms to this data, healthcare professionals can identify patterns, predict disease outbreaks, personalize treatments, and improve overall patient care. For example, predictive analytics can help identify patients at risk of developing chronic conditions such as diabetes or heart disease, allowing for proactive interventions to prevent or mitigate these conditions.

Another sector that has been transformed by big data is finance. In the financial industry, data-driven algorithms are used for risk assessment, fraud detection, algorithmic trading, and customer relationship management. By analyzing large volumes of financial transactions, market trends, and customer behavior, financial institutions can make more informed decisions, optimize investment strategies, and enhance the customer experience. For instance, banks employ machine learning algorithms to detect suspicious activities and prevent fraudulent transactions in real-time, safeguarding both the institution and its customers.

Furthermore, big data has revolutionized the retail sector, empowering companies to gain deeper insights into consumer preferences, shopping behaviors, and market trends. Through the analysis of customer transactions, browsing history, social media interactions, and demographic data, retailers can personalize marketing campaigns, optimize pricing strategies, and enhance inventory management. For example, e-commerce platforms utilize recommendation systems powered by machine learning algorithms to suggest products based on past purchases and browsing behavior, thereby improving customer engagement and driving sales.

The transportation industry is also undergoing a profound transformation fueled by big data. With the proliferation of GPS-enabled devices, sensors, and telematics systems, transportation companies can collect vast amounts of data on vehicle performance, traffic patterns, weather conditions, and logistics operations. By leveraging this data, companies can optimize route planning, reduce fuel consumption, minimize delivery times, and enhance overall operational efficiency. For instance, ride-sharing platforms use predictive analytics to forecast demand, allocate drivers more effectively, and optimize ride routes, resulting in improved service quality and customer satisfaction.

In addition to these sectors, big data is making significant strides in fields such as manufacturing, agriculture, energy, and government. In manufacturing, data analytics is used for predictive maintenance, quality control, and supply chain optimization. In agriculture, precision farming techniques enabled by big data help optimize crop yields, minimize resource usage, and mitigate environmental impact.
In energy, smart grid technologies leverage big data analytics to optimize energy distribution, improve grid reliability, and promote energy efficiency. In government, big data is utilized for urban planning, public safety, healthcare management, and policy formulation.

In conclusion, the application of big data is transforming industries across the globe, enabling organizations to make data-driven decisions, unlock new insights, and drive innovation. From healthcare and finance to retail and transportation, the impact of big data is profound and far-reaching. As we continue to harness the power of data analytics and machine learning, we can expect further advancements and breakthroughs that will shape the future of our society and economy.

云计算Cloud-Computing-外文翻译

毕业设计说明书
英文文献及中文翻译
学生姓名：    学号：
计算机与控制工程学院
专业：    指导教师：
2017 年 6 月

英文文献
Cloud Computing
1.

Cloud Computing at a Higher Level

In many ways, cloud computing is simply a metaphor for the Internet, the increasing movement of compute and data resources onto the Web. But there's a difference: cloud computing represents a new tipping point for the value of network computing. It delivers higher efficiency, massive scalability, and faster, easier software development. It's about new programming models, new IT infrastructure, and the enabling of new business models.

For those developers and enterprises who want to embrace cloud computing, Sun is developing critical technologies to deliver enterprise scale and systemic qualities to this new paradigm:

(1) Interoperability: while most current clouds offer closed platforms and vendor lock-in, developers clamor for interoperability.

云计算与物联网外文翻译文献

文献信息:
文献标题:Integration of Cloud Computing with Internet of Things: Challenges and Open Issues(云计算与物联网的集成:挑战与开放问题)
国外作者:HF Atlam等人
文献出处:《IEEE International Conference on Internet of Things》, 2017
字数统计:英文4176单词,23870字符;中文7457汉字

外文文献:
Integration of Cloud Computing with Internet of Things: Challenges and Open Issues

Abstract
The Internet of Things (IoT) is becoming the next Internet-related revolution. It allows billions of devices to be connected and communicate with each other to share information that improves the quality of our daily lives. On the other hand, Cloud Computing provides on-demand, convenient and scalable network access which makes it possible to share computing resources; indeed, this, in turn, enables dynamic data integration from various data sources. There are many issues standing in the way of the successful implementation of both Cloud and IoT. The integration of Cloud Computing with the IoT is the most effective way on which to overcome these issues. The vast number of resources available on the Cloud can be extremely beneficial for the IoT, while the Cloud can gain more publicity to improve its limitations with real world objects in a more dynamic and distributed manner. This paper provides an overview of the integration of the Cloud into the IoT by highlighting the integration benefits and implementation challenges. Discussion will also focus on the architecture of the resultant Cloud-based IoT paradigm and its new applications scenarios. Finally, open issues and future research directions are also suggested.

Keywords: Cloud Computing, Internet of Things, Cloud based IoT, Integration.

I. INTRODUCTION
It is important to explore the common features of the technologies involved in the field of computing. Indeed, this is certainly the case with Cloud Computing and the Internet of Things (IoT) – two paradigms which share many common features. The integration of these numerous concepts may facilitate and improve these technologies. Cloud computing has altered the way in which technologies can be accessed, managed and delivered. It is widely agreed that Cloud computing can be used for utility services in the future. Although many consider Cloud computing to be a new technology, it has, in actual fact, been involved in and encompassed various technologies such as grid, utility computing, virtualisation, networking and software services. Cloud computing provides services which make it possible to share computing resources across the Internet. As such, it is not surprising that the origins of Cloud technologies lie in grid, utility computing, virtualisation, networking and software services, as well as distributed computing and parallel computing.

On the other hand, the IoT can be considered both a dynamic and global networked infrastructure that manages self-configuring objects in a highly intelligent way. The IoT is moving towards a phase where all items around us will be connected to the Internet and will have the ability to interact with minimum human effort. The IoT normally includes a number of objects with limited storage and computing capacity. It could well be said that Cloud computing and the IoT will be the future of the Internet and next-generation technologies.
However, Cloud services are dependent on service providers which are extremely interoperable, while IoT technologies are based on diversity rather than interoperability. This paper provides an overview of the integration of Cloud Computing into the IoT; this involves an examination of the benefits resulting from the integration process and the implementation challenges encountered. Open issues and research directions are also discussed. The remainder of the paper is organised as follows: Section II provides the basic concepts of Cloud computing, IoT, and Cloud-based IoT; Section III discusses the benefits of integrating the IoT into the Cloud; Cloud-based IoT architecture is presented in Section IV; Section V illustrates different Cloud-based IoT applications scenarios. Following this, the challenges facing Cloud-based IoT integration and open research directions are discussed in Section VI and Section VII respectively, before Section VIII concludes the paper.

II. BASIC CONCEPTS
This section reviews the basic concepts of Cloud Computing, the IoT, and Cloud-based IoT.

1. Cloud Computing
There exist a number of proposed definitions for Cloud computing, although the most widely agreed upon seems to be that put forth by the National Institute of Standards and Technology (NIST). Indeed, the NIST has defined Cloud computing as "a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction". As stated in this definition, Cloud computing comprises four types of deployment models, three different service models, and five essential characteristics.

Cloud computing deployment models are most commonly classified as belonging to the public Cloud, where resources are made available to consumers over the Internet. Public Clouds are generally owned by a profitable organisation (e.g. Amazon EC2). Conversely, the infrastructure of a private Cloud is commonly provided by a single organisation to serve the particular purposes of its users. The private Cloud offers a secure environment and a higher level of control (Microsoft Private Cloud). Hybrid Clouds are a mixture of private and public Clouds. This choice is provided for consumers as it makes it possible to overcome some of the limitations of each model. In contrast, a community Cloud is a Cloud infrastructure which is delivered to a group of users by a number of organisations which share the same need.

In order to allow consumers to choose the service that suits them, services in Cloud computing are provided at three different levels, namely: the Software as a Service (SaaS) model, where software is delivered through the Internet to users (e.g. GoogleApps); the Platform as a Service (PaaS) model, which offers a higher level of integrated environment that can build, test, and deploy specific software (e.g. Microsoft Azure); and finally, with the Infrastructure as a Service (IaaS) model, infrastructure such as storage, hardware and servers are delivered as a service (e.g. Amazon Web Services).

2. Internet of Things
The IoT represents a modern approach where boundaries between real and digital domains are progressively eliminated by consistently changing every physical device to a smart alternative ready to provide smart services. All things in the IoT (smart devices, sensors, etc.) have their own identity.
They are combined to form the communication network and will become actively participating objects. These objects include not only daily usable electronic devices, but also things like food, clothing, materials, parts, and subassemblies; commodities and luxury items; monuments and landmarks; and various forms of commerce and culture. In addition, these objects are able to create requests and alter their states. Thus, all IoT devices can be monitored, tracked and counted, which significantly decreases waste, loss, and cost.

The concept of the IoT was first mentioned by Kevin Ashton in 1999, when he stated that "The Internet of Things has the potential to change the world, just as the Internet did. Maybe even more so". Later, the IoT was formally presented by the International Telecommunication Union (ITU) in 2005. A great many definitions of the IoT have been put forth by numerous organisations and researchers. According to the ITU (2012), the IoT is "a global infrastructure for the Information Society, enabling advanced services by interconnecting (physical and virtual) things based on, existing and evolving, interoperable information and communication technologies". The IoT introduces a variety of opportunities and applications. However, it faces many challenges which could potentially hinder its successful implementation, such as data storage, heterogeneous resource-constrained Things, scalability, variable geospatial deployment, and energy efficiency.

3. Cloud-Based Internet of Things
The IoT and Cloud computing are both rapidly developing services, and have their own unique characteristics. On the one hand, the IoT approach is based on smart devices which intercommunicate in a global network and dynamic infrastructure. It enables ubiquitous computing scenarios. The IoT is typically characterised by widely-distributed devices with limited processing capabilities and storage. These devices encounter issues regarding performance, reliability, privacy, and security. On the other hand, Cloud computing comprises a massive network with unlimited storage capabilities and computation power. Furthermore, it provides a flexible, robust environment which allows for dynamic data integration from various data sources. Cloud computing has partially resolved most of the IoT issues. Indeed, the IoT and Cloud are two comparatively challenging technologies, and are being combined in order to change the current and future environment of internetworking services.

The Cloud-based Internet of Things is a platform which allows for the smart usage of applications, information, and infrastructure in a cost-effective way. While the IoT and Cloud computing are different from each other, their features are almost complementary, as shown in TABLE 1. This complementarity is the primary reason why many researchers have proposed their integration.

TABLE 1. COMPARISON OF THE IOT WITH CLOUD COMPUTING

III. BENEFITS OF INTEGRATING IOT WITH CLOUD
Since the IoT suffers from limited capabilities in terms of processing power and storage, it must also contend with issues such as performance, security, privacy, and reliability. The integration of the IoT into the Cloud is certainly the best way to overcome most of these issues. The Cloud can even benefit from the IoT by expanding its limits with real world objects in a more dynamic and distributed way, and providing new services for billions of devices in different real life scenarios.
In addition, the Cloud provides simplicity of use and reduces the cost of the usage of applications and services for end-users. The Cloud also simplifies the flow of IoT data gathering and processing, and provides quick, low-cost installation and integration for complex data processing and deployment. The benefits of integrating the IoT into the Cloud are discussed in this section as follows.

1. Communication
Application and data sharing are two significant features of the Cloud-based IoT paradigm. Ubiquitous applications can be transmitted through the IoT, whilst automation can be utilised to facilitate low-cost data distribution and collection. The Cloud is an effective and economical solution which can be used to connect, manage, and track anything by using built-in apps and customised portals. The availability of fast systems facilitates dynamic monitoring and remote objects control, as well as real-time data access. It is worth declaring that, although the Cloud can greatly develop and facilitate the IoT interconnection, it still has weaknesses in certain areas. Thus, practical restrictions can appear when an enormous amount of data needs to be transferred from the Internet to the Cloud.

2. Storage
As the IoT can be used on billions of devices, it comprises a huge number of information sources, which generate an enormous amount of semi-structured or non-structured data. This is known as Big Data, and has three characteristics: variety (e.g. data types), velocity (e.g. data generation frequency), and volume (e.g. data size). The Cloud is considered to be one of the most cost-effective and suitable solutions when it comes to dealing with the enormous amount of data created by the IoT. Moreover, it produces new chances for data integration, aggregation, and sharing with third parties.

3. Processing capabilities
IoT devices are characterised by limited processing capabilities which prevent on-site and complex data processing. Instead, gathered data is transferred to nodes that have high capabilities; indeed, it is here that aggregation and processing are accomplished. However, achieving scalability remains a challenge without an appropriate underlying infrastructure. Offering a solution, the Cloud provides unlimited virtual processing capabilities and an on-demand usage model. Predictive algorithms and data-driven decision making can be integrated into the IoT in order to increase revenue and reduce risks at a lower cost.

4. Scope
With billions of users communicating with one another and a variety of information being collected, the world is quickly moving towards the Internet of Everything (IoE) realm - a network of networks with billions of things that generate new chances and risks. The Cloud-based IoT approach provides new applications and services based on the expansion of the Cloud through the IoT objects, which in turn allows the Cloud to work with a number of new real world scenarios, and leads to the emergence of new services.

5. New abilities
The IoT is characterised by the heterogeneity of its devices, protocols, and technologies. Hence, reliability, scalability, interoperability, security, availability and efficiency can be very hard to achieve. Integrating the IoT into the Cloud resolves most of these issues. It provides other features such as ease-of-use and ease-of-access, with low deployment costs.

6. New Models
Cloud-based IoT integration empowers new scenarios for smart objects, applications, and services.
Some of the new models are listed as follows:
•SaaS (Sensing as a Service), which allows access to sensor data;
•EaaS (Ethernet as a Service), the main role of which is to provide ubiquitous connectivity to control remote devices;
•SAaaS (Sensing and Actuation as a Service), which provides control logics automatically;
•IPMaaS (Identity and Policy Management as a Service), which provides access to policy and identity management;
•DBaaS (Database as a Service), which provides ubiquitous database management;
•SEaaS (Sensor Event as a Service), which dispatches messaging services that are generated by sensor events;
•SenaaS (Sensor as a Service), which provides management for remote sensors;
•DaaS (Data as a Service), which provides ubiquitous access to any type of data.

IV. CLOUD-BASED IOT ARCHITECTURE
According to a number of previous studies, the well-known IoT architecture is typically divided into three different layers: the application, perception and network layers. Most assume that the network layer is the Cloud layer, which realises the Cloud-based IoT architecture, as depicted in Fig. 1.

Fig. 1. Cloud-based IoT architecture

The perception layer is used to identify objects and gather data, which is collected from the surrounding environment. In contrast, the main objective of the network layer is to transfer the collected data to the Internet/Cloud. Finally, the application layer provides the interface to different services.

V. CLOUD-BASED IOT APPLICATIONS
The Cloud-based IoT approach has introduced a number of applications and smart services, which have affected end users' daily lives. TABLE 2 presents a brief discussion of certain applications which have been improved by the Cloud-based IoT paradigm.

TABLE 2. CLOUD-BASED IOT APPLICATIONS

VI. CHALLENGES FACING CLOUD-BASED IOT INTEGRATION
There are many challenges which could potentially prevent the successful integration of the Cloud-based IoT paradigm. These challenges include:

1. Security and privacy
Cloud-based IoT makes it possible to transport data from the real world to the Cloud. Indeed, one particularly important issue which has not yet been resolved is how to provide appropriate authorisation rules and policies while ensuring that only authorised users have access to sensitive data; this is crucial when it comes to preserving users' privacy, and particularly when data integrity must be guaranteed. In addition, when critical IoT applications move into the Cloud, issues arise because of the lack of trust in the service provider, information regarding service level agreements (SLAs), and the physical location of data. Sensitive information leakage can also occur due to multi-tenancy. Moreover, public key cryptography cannot be applied to all layers because of the processing power constraints imposed by IoT objects. New challenges also require specific attention; for example, the distributed system is exposed to a number of possible attacks, such as SQL injection, session riding, cross-site scripting, and side-channel attacks. Moreover, important vulnerabilities, including session hijacking and virtual machine escape, are also problematic.

2. Heterogeneity
One particularly important challenge faced by the Cloud-based IoT approach is related to the extensive heterogeneity of devices, platforms, operating systems, and services that exist and might be used for new or developed applications.
Cloud platforms suffer from heterogeneity issues; for instance, Cloud services generally come with proprietary interfaces, thus allowing for resource integration based on specific providers. In addition, the heterogeneity challenge can be exacerbated when end-users adopt multi-Cloud approaches, and thus services will depend on multiple providers to improve application performance and resilience.

3. Big data
With many predicting that Big Data will reach 50 billion IoT devices by 2020, it is important to pay more attention to the transportation, access, storage and processing of the enormous amount of data which will be produced. Indeed, given recent technological developments, it is clear that the IoT will be one of the core sources of big data, and that the Cloud can facilitate the storage of this data for a long period of time, in addition to subjecting it to complex analysis. Handling the huge amount of data produced is a significant issue, as the application's whole performance is heavily reliant on the properties of this data management service. Finding a perfect data management solution which will allow the Cloud to manage massive amounts of data is still a big issue. Furthermore, data integrity is a vital element, not only because of its effect on the service's quality, but also because of security and privacy issues, the majority of which relate to outsourced data.

4. Performance
Transferring the huge amount of data created by IoT devices to the Cloud requires high bandwidth. As a result, the key issue is obtaining adequate network performance in order to transfer data to Cloud environments; indeed, this is because broadband growth is not keeping pace with storage and computation evolution. In a number of scenarios, services and data provision should be achieved with high reactivity. This is because timeliness might be affected by unpredictable matters, and real-time applications are very sensitive to performance efficiency.

5. Legal aspects
Legal aspects have been very significant in recent research concerning certain applications. For instance, service providers must adapt to various international regulations. On the other hand, users should give donations in order to contribute to data collection.

6. Monitoring
Monitoring is a primary action in Cloud Computing when it comes to performance, managing resources, capacity planning, security, SLAs, and troubleshooting. As a result, the Cloud-based IoT approach inherits the same monitoring demands from the Cloud, although there are still some related challenges that are impacted by the velocity, volume, and variety characteristics of the IoT.

7. Large scale
The Cloud-based IoT paradigm makes it possible to design new applications that aim to integrate and analyse data coming from the real world into IoT objects. This requires interacting with billions of devices which are distributed throughout many areas. The large scale of the resulting systems raises many new issues that are difficult to overcome. For instance, achieving computational capability and storage capacity requirements is becoming difficult. Moreover, the monitoring process has made the distribution of the IoT devices more difficult, as IoT devices have to face connectivity issues and latency dynamics.

VII. OPEN ISSUES AND RESEARCH DIRECTIONS
This section will address some of the open issues and future research directions related to Cloud-based IoT which still require more research effort.
These issues include:

1. Standardisation
Many studies have highlighted the issue of a lack of standards, which is considered critical in relation to the Cloud-based IoT paradigm. Although a number of proposed standardisations have been put forth by the scientific society for the deployment of IoT and Cloud approaches, it is obvious that architectures, standard protocols, and APIs are required to allow for interconnection between heterogeneous smart things and the generation of new services, which make up the Cloud-based IoT paradigm.

2. Fog Computing
Fog computing is a model which extends Cloud computing services to the edge of the network. Similar to the Cloud, Fog supplies application services to users. Fog can essentially be considered an extension of Cloud Computing which acts as an intermediate between the edge of the network and the Cloud; indeed, it works with latency-sensitive applications that require other nodes to satisfy their delay requirements. Although storage, computing, and networking are the main resources of both Fog and the Cloud, the Fog has certain features, such as location awareness and edge location, that provide geographical distribution and low latency; moreover, there is a large number of nodes. This is in contrast with the Cloud, which is supported for real-time interaction and mobility.

3. Cloud Capabilities
As in any networked environment, security is considered to be one of the main issues of the Cloud-based IoT paradigm. There are more chances of attacks on both the IoT and the Cloud side. In the IoT context, data integrity, confidentiality and authenticity can be guaranteed by encryption. However, insider attacks cannot be resolved, and it is also hard to use the IoT on devices with limited capabilities.

4. SLA enforcement
Cloud-based IoT users need created data to be conveyed and processed based on application-dependent limitations, which can be tough in some cases. Ensuring a specific Quality of Service (QoS) level regarding Cloud resources by depending on a single provider raises many issues. Thus, multiple Cloud providers may be required to avoid SLA violations. However, dynamically choosing the most appropriate mixture of Cloud providers still represents an open issue due to time, costs, and heterogeneity of QoS management support.

5. Big data
In the previous section, we discussed Big Data as a critical challenge that is tightly coupled with the Cloud-based IoT paradigm. Although a number of contributions have been proposed, Big Data is still considered a critical open issue, and one in need of more research. The Cloud-based IoT approach involves the management and processing of huge amounts of data stemming from various locations and from heterogeneous sources; indeed, in the Cloud-based IoT, many applications need complicated tasks to be performed in real-time.

6. Energy efficiency
Recent Cloud-based IoT applications include frequent data that is transmitted from IoT objects to the Cloud, which quickly consumes the node energy. Thus, generating efficient energy when it comes to data processing and transmission remains a significant open issue. Several directions have been suggested to overcome this issue, such as compression technologies, efficient data transmission, and data caching techniques for reusing collected data with time-insensitive applications.

7. Security and privacy
Although security and privacy are both critical research issues which have received a great deal of attention, they are still open issues which require more efforts.
Indeed, adapting to different threats from hackers is still an issue. Moreover, another problem is providing the appropriate authorisation rules and policies while ensuring that only authorised users have access to sensitive data; this is crucial for preserving users' privacy, specifically when data integrity must be guaranteed.

VIII. CONCLUSION
The IoT is becoming an increasingly ubiquitous computing service which requires huge volumes of data storage and processing capabilities. The IoT has limited capabilities in terms of processing power and storage, while there also exist consequential issues such as security, privacy, performance, and reliability; as such, the integration of the Cloud into the IoT is very beneficial in terms of overcoming these challenges. In this paper, we presented the need for the creation of the Cloud-based IoT approach. Discussion also focused on the Cloud-based IoT architecture, different application scenarios, challenges facing successful integration, and open research directions. In future work, a number of case studies will be carried out to test the effectiveness of the Cloud-based IoT approach in healthcare applications.

中文译文:
云计算与物联网的集成:挑战与开放问题
摘要
物联网(IoT)正在成为下一次与互联网相关的革命。


云计算大数据外文翻译文献(文档含英文原文和中文翻译)

原文:
Meet Hadoop

In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers.
—Grace Hopper

Data!
We live in the data age. It's not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the "digital universe" at 0.18 zettabytes in 2006, and is forecasting a tenfold growth by 2011 to 1.8 zettabytes. A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That's roughly the same order of magnitude as one disk drive for every person in the world.

This flood of data is coming from many sources. Consider the following:
• The New York Stock Exchange generates about one terabyte of new trade data per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
• , the genealogy site, stores around 2.5 petabytes of data.
• The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
• The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.

So there's a lot of data out there. But you are probably wondering how it affects you. Most of the data is locked up in the largest web properties (like search engines), or scientific or financial institutions, isn't it? Does the advent of "Big Data," as it is being called, affect smaller organizations or individuals?

I argue that it does. Take photos, for example. My wife's grandfather was an avid photographer, and took photographs throughout his adult life. His entire corpus of medium format, slide, and 35mm film, when scanned in at high resolution, occupies around 10 gigabytes. Compare this to the digital photos that my family took last year, which take up about 5 gigabytes of space. My family is producing photographic data at 35 times the rate my wife's grandfather's did, and the rate is increasing every year as it becomes easier to take more and more photos.

More generally, the digital streams that individuals are producing are growing apace. Microsoft Research's MyLifeBits project gives a glimpse of the archiving of personal information that may become commonplace in the near future. MyLifeBits was an experiment where an individual's interactions—phone calls, emails, documents—were captured electronically and stored for later access. The data gathered included a photo taken every minute, which resulted in an overall data volume of one gigabyte a month. When storage costs come down enough to make it feasible to store continuous audio and video, the data volume for a future MyLifeBits service will be many times that.

The trend is for every individual's data footprint to grow, but perhaps more importantly the amount of data generated by machines will be even greater than that generated by people. Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions—all of these contribute to the growing mountain of data.

The volume of data being made publicly available increases every year too.
Organizations no longer have to merely manage their own data: success in the future will be dictated to a large extent by their ability to extract value from other organizations' data.

Initiatives such as Public Data Sets on Amazon Web Services, , and exist to foster the "information commons," where data can be freely (or in the case of AWS, for a modest price) shared for anyone to download and analyze. Mashups between different information sources make for unexpected and hitherto unimaginable applications.

Take, for example, the project, which watches the Astrometry group on Flickr for new photos of the night sky. It analyzes each image, and identifies which part of the sky it is from, and any interesting celestial bodies, such as stars or galaxies. Although it's still a new and experimental service, it shows the kind of things that are possible when data (in this case, tagged photographic images) is made available and used for something (image analysis) that was not anticipated by the creator.

It has been said that "More data usually beats better algorithms," which is to say that for some problems (such as recommending movies or music based on past preferences), however fiendish your algorithms are, they can often be beaten simply by having more data (and a less sophisticated algorithm).

The good news is that Big Data is here. The bad news is that we are struggling to store and analyze it.

Data Storage and Analysis
The problem is simple: while the storage capacities of hard drives have increased massively over the years, access speeds--the rate at which data can be read from drives--have not kept up. One typical drive from 1990 could store 1370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Almost 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.

This is a long time to read all the data on a single drive, and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.

Only using one hundredth of a disk may seem wasteful. But we can store one hundred datasets, each of which is one terabyte, and provide shared access to them. We can imagine that the users of such a system would be happy to share access in return for shorter analysis times, and, statistically, that their analysis jobs would be likely to be spread over time, so they wouldn't interfere with each other too much.

There's more to being able to read and write data in parallel to or from multiple disks, though. The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. This is how RAID works, for instance, although Hadoop's filesystem, the Hadoop Distributed Filesystem (HDFS), takes a slightly different approach, as you shall see later. The second problem is that most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with the data from any of the other 99 disks.
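To make the parallel-read idea concrete, here is a deliberately naive, single-machine sketch in Java. The partition file names and the word-count merge are assumptions chosen purely for illustration, and the code copes with neither disk failure nor data that lives on other machines:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class NaiveParallelRead {

    // Count words in one partition file (assumed to be plain text).
    static Map<String, Long> countWords(String file) throws Exception {
        Map<String, Long> counts = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(file))) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical partition files, imagined as living on separate disks.
        List<String> partitions = Arrays.asList("part-000.txt", "part-001.txt", "part-002.txt");
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());

        // Read every partition in parallel.
        List<Future<Map<String, Long>>> partials = new ArrayList<>();
        for (String partition : partitions) {
            partials.add(pool.submit(() -> countWords(partition)));
        }

        // Combine the partial results into a single map.
        Map<String, Long> combined = new HashMap<>();
        for (Future<Map<String, Long>> partial : partials) {
            partial.get().forEach((word, n) -> combined.merge(word, n, Long::sum));
        }
        pool.shutdown();
        System.out.println(combined);
    }
}

Even this toy version has to coordinate threads and merge results by hand, and it would need to be rewritten the moment a disk fails or the partitions stop fitting on one machine.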
Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into a computation over sets of keys and values. We will look at the details of this model in later chapters, but the important point for the present discussion is that there are two parts to the computation, the map and the reduce, and it's the interface between the two where the "mixing" occurs. Like HDFS, MapReduce has reliability built-in.

This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS, and analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.

Comparison with Other Systems
The approach taken by MapReduce may seem like a brute-force approach. The premise is that the entire dataset—or at least a good portion of it—is processed for each query. But this is its power. MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative. It changes the way you think about data, and unlocks data that was previously archived on tape or disk. It gives people the opportunity to innovate with data. Questions that took too long to get answered before can now be answered, which in turn leads to new questions and new insights.

For example, Mailtrust, Rackspace's mail division, used Hadoop for processing email logs. One ad hoc query they wrote was to find the geographic distribution of their users. In their words: "This data was so useful that we've scheduled the MapReduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow." By bringing several hundred gigabytes of data together and having the tools to analyze it, the Rackspace engineers were able to gain an understanding of the data that they otherwise would never have had, and, furthermore, they were able to use what they had learned to improve the service for their customers. You can read more about how Rackspace uses Hadoop in Chapter 14.

RDBMS
Why can't we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce needed? The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk's head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth.

If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.

In many ways, MapReduce can be seen as a complement to an RDBMS. (The differences between the two systems are shown in Table 1-1.) MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis.
An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.

Table 1-1. RDBMS compared to MapReduce
              Traditional RDBMS              MapReduce
Data size     Gigabytes                      Petabytes
Access        Interactive and batch          Batch
Updates       Read and write many times      Write once, read many times
Structure     Static schema                  Dynamic schema
Integrity     High                           Low
Scaling       Nonlinear                      Linear

Another difference between MapReduce and an RDBMS is the amount of structure in the datasets that they operate on. Structured data is data that is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other hand, is looser, and though there may be a schema, it is often ignored, so it may be used only as a guide to the structure of the data: for example, a spreadsheet, in which the structure is the grid of cells, although the cells themselves may hold any form of data. Unstructured data does not have any particular internal structure: for example, plain text or image data. MapReduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not an intrinsic property of the data, but they are chosen by the person analyzing the data.

Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for MapReduce, since it makes reading a record a nonlocal operation, and one of the central assumptions that MapReduce makes is that it is possible to perform (high-speed) streaming reads and writes.

A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same client may appear many times), and this is one reason that logfiles of all kinds are particularly well-suited to analysis with MapReduce.

MapReduce is a linearly scalable programming model. The programmer writes two functions—a map function and a reduce function—each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster that they are operating on, so they can be used unchanged for a small dataset and for a massive one. More importantly, if you double the size of the input data, a job will run twice as slow. But if you also double the size of the cluster, a job will run as fast as the original one. This is not generally true of SQL queries.

Over time, however, the differences between relational databases and MapReduce systems are likely to blur, both as relational databases start incorporating some of the ideas from MapReduce (such as Aster Data's and Greenplum's databases) and, from the other direction, as higher-level query languages built on MapReduce (such as Pig and Hive) make MapReduce systems more approachable to traditional database programmers.

Grid Computing
The High Performance Computing (HPC) and Grid Computing communities have been doing large-scale data processing for years, using such APIs as Message Passing Interface (MPI).
Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared filesystem, hosted by a SAN. This works well for predominantly compute-intensive jobs, but becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes, the point at which MapReduce really starts to shine), since the network bandwidth is the bottleneck and compute nodes become idle.

MapReduce tries to colocate the data with the compute node, so data access is fast because it is local. This feature, known as data locality, is at the heart of MapReduce and is the reason for its good performance. Recognizing that network bandwidth is the most precious resource in a data center environment (it is easy to saturate network links by copying data around), MapReduce implementations go to great lengths to preserve it by explicitly modelling network topology. Notice that this arrangement does not preclude high-CPU analyses in MapReduce.

MPI gives great control to the programmer, but requires that he or she explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs such as sockets, as well as the higher-level algorithm for the analysis. MapReduce operates only at the higher level: the programmer thinks in terms of functions of key and value pairs, and the data flow is implicit.

Coordinating the processes in a large-scale distributed computation is a challenge. The hardest aspect is gracefully handling partial failure—when you don't know if a remote process has failed or not—and still making progress with the overall computation. MapReduce spares the programmer from having to think about failure, since the implementation detects failed map or reduce tasks and reschedules replacements on machines that are healthy. MapReduce is able to do this since it is a shared-nothing architecture, meaning that tasks have no dependence on one another. (This is a slight oversimplification, since the output from mappers is fed to the reducers, but this is under the control of the MapReduce system; in this case, it needs to take more care rerunning a failed reducer than rerunning a failed map, since it has to make sure it can retrieve the necessary map outputs, and if not, regenerate them by running the relevant maps again.) So from the programmer's point of view, the order in which the tasks run doesn't matter. By contrast, MPI programs have to explicitly manage their own checkpointing and recovery, which gives more control to the programmer, but makes them more difficult to write.

MapReduce might sound like quite a restrictive programming model, and in a sense it is: you are limited to key and value types that are related in specified ways, and mappers and reducers run with very limited coordination between one another (the mappers pass keys and values to reducers). A natural question to ask is: can you do anything useful or nontrivial with it?

The answer is yes. MapReduce was invented by engineers at Google as a system for building production search indexes because they found themselves solving the same problem over and over again (and MapReduce was inspired by older ideas from the functional programming, distributed computing, and database communities), but it has since been used for many other applications in many other industries. It is pleasantly surprising to see the range of algorithms that can be expressed in MapReduce, from image analysis, to graph-based problems, to machine learning algorithms.
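To make the key-value model concrete, here is a minimal word-count sketch written against Hadoop's org.apache.hadoop.mapreduce API. It is an illustrative sketch only: the job driver and configuration are omitted, and details vary between Hadoop versions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // The map function turns each line of input into (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // The reduce function sums the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

Neither function knows or cares how big the input is or how many machines the job runs on; the framework splits the input, feeds the map outputs to the reducers, and reruns failed tasks.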
It can't solve every problem, of course, but it is a general data-processing tool. You can see a sample of some of the applications that Hadoop has been used for in Chapter 14.

Volunteer Computing
When people first hear about Hadoop and MapReduce, they often ask, "How is it different from SETI@home?" SETI, the Search for Extra-Terrestrial Intelligence, runs a project called SETI@home in which volunteers donate CPU time from their otherwise idle computers to analyze radio telescope data for signs of intelligent life outside earth. SETI@home is the most well-known of many volunteer computing projects; others include the Great Internet Mersenne Prime Search (to search for large prime numbers) and Folding@home (to understand protein folding, and how it relates to disease).

Volunteer computing projects work by breaking the problem they are trying to solve into chunks called work units, which are sent to computers around the world to be analyzed. For example, a SETI@home work unit is about 0.35 MB of radio telescope data, and takes hours or days to analyze on a typical home computer. When the analysis is completed, the results are sent back to the server, and the client gets another work unit. As a precaution to combat cheating, each work unit is sent to three different machines, and needs at least two results to agree to be accepted.

Although SETI@home may be superficially similar to MapReduce (breaking a problem into independent pieces to be worked on in parallel), there are some significant differences. The SETI@home problem is very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world, since the time to transfer the work unit is dwarfed by the time to run the computation on it. Volunteers are donating CPU cycles, not bandwidth.

MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. By contrast, SETI@home runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality.

译文:
初识Hadoop
古时候，人们用牛来拉重物，当一头牛拉不动一根圆木的时候，他们不曾想过培育个头更大的牛。
