Scaling Issues for Multi-Paradigm Network Modeling


An Improved Clustering Algorithm for Unequal-Length Multidimensional Time Series

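Improved clustering algorithms for unequal-length multidimensional time series have long been an important research direction in machine learning. Such algorithms can effectively recognize and classify patterns in time-series data, and have therefore been applied in fields such as pattern recognition and natural language processing.

Traditional clustering algorithms have limitations when handling multidimensional sequences of unequal length and cannot effectively predict the data distribution characteristics of the target sequences. In recent years, research has been carried out on improved clustering algorithms for this setting.

Researchers have proposed an improved clustering algorithm based on a genetic algorithm and least-squares regression. The idea is to build an algorithm that, given the attributes of a multidimensional time series, predicts its feature values and distinguishes series of unequal length, so that cluster analysis can be performed accurately. The algorithm is trained with a set of parameters as input; the genetic algorithm and least-squares regression then search for the best parameters, so as to fit the unequal-length multidimensional time series, analyze their pattern features, and perform cluster analysis.

Researchers have also proposed an improved clustering algorithm based on maintenance error. By evaluating the maintenance error, it allows traditional clustering algorithms to handle multidimensional sequences of unequal length well and to predict the data distribution characteristics of the target sequences more accurately during clustering, thereby achieving the ultimate goal of pattern recognition.

In addition, to better handle unequal-length multidimensional time series, researchers have used deep learning and related techniques to design several new improved clustering algorithms. Combining clustering with deep learning, they proposed a deep-learning clustering algorithm based on time-series feature extraction. It uses deep restricted Boltzmann machines (DRBM), decision trees, and other techniques to extract time-series features and learn features of deep, complex data, enabling effective pattern recognition and cluster analysis.

In summary, improved clustering algorithms for unequal-length multidimensional time series are an important research direction in time-series pattern recognition and classification, and traditional clustering algorithms remain limited when handling such series.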

Isilon Configuration Guide for VMWare

Isilon IQ and VMware vSphere 4.0
Configuration Guide for VMWare vSphere and Isilon IQ™ with OneFS® v5.x
By Shai Harmelin, Sr. Solutions Architect
An Isilon Systems® Technical Configuration Guide
Updated August 2009

Table of Contents
1. Introduction
   Scale-out NAS for Server Virtualization
2. Using Network Attached Storage (NAS) with VMWare
   VMWare Support for NAS
3. Advantages of Isilon Scale-Out NAS over Traditional NAS
4. Isilon IQ Cluster Configuration
   Cluster Configuration Considerations
   Isilon Networking Concepts
   Isilon Network Configuration
   High Availability with NFS Failover
   Isilon Network Design Best Practices
5. ESX Configuration
   Creating a Virtual Switch
   Configuring a Service Console (VI3 Only)
   Creating a VMware Datastore
6. Virtual Machine Configuration
   Placing VM Files in a Datastore on Isilon Cluster
   Creating Virtual Disks in a Datastore on Isilon IQ Cluster
7. Migrating VMs between ESX Hosts
8. Working with Snapshots
   Snapshots Explained
   Taking Snapshots with the ESX Snapshot Utility
   Reverting To a Snapshot
9. Disaster Recovery with SyncIQ
10. Performing Backups
   Introduction
   Best Practices for NDMP Backup of VMs
   VMware Consolidated Backup (VCB) Best Practices

1. Introduction

Scale-out NAS for Server Virtualization

Isilon IQ Scale-out NAS, with its clustered architecture and single file system, is designed to be a scalable, highly reliable, and easy-to-manage platform for storing Virtual Machines (VMs) hosted on VMware vSphere 4.0 with ESX 4.0 or ESX 3.0 using the industry-standard file-sharing protocol, NFS.

Server virtualization is quickly becoming a standard in major enterprises to reduce the overhead and cost of managing large-scale server environments for test, development, and production applications as well as hosted services in the cloud.
VMware is a leader in this trend, vastly simplifying server management, driving up server utilization rates, driving down costs, and providing a new virtual infrastructure to simplify the ongoing management of many VMs.

While virtualization offers a solution for server sprawl, a new challenge arises for enterprises with traditional SAN or scale-up NAS, especially when consolidating or deploying large numbers of VMs along with non-virtualized servers. These challenges with traditional storage can often negate the cost savings and efficiencies expected in virtual environments.

While Storage Area Networks (SANs) enable sharing of storage across multiple ESX Server hosts (an advancement over Direct Attached Storage), each LUN requires a separate management point, either dedicated or shared across a set of VMs. If LUNs are dedicated to individual VMs, the number of management points grows quickly along with the number of VMs. In turn, if VMs are consolidated on an individual LUN, changes to a storage device or LUNs often impact a large number of VMs. With traditional SAN or scale-up NAS in a virtualized environment, the complexity of volume sprawl, capacity and load balancing, mount management and other storage administration adds to management time and slows deployments.

With a single file system and clustered architecture, Isilon Scale-out NAS is a scalable, agile and easy-to-manage storage platform for your virtualized environment and large-scale deployments.

Isilon Certifications

Isilon is a VMWare Ready certified storage vendor for both ESX 3.0 and ESX 4.0 (vSphere).
This certification for the Isilon IQ product family, including IQ1920x, IQ3000x, IQ6000x, IQ9000x and IQ12000x, ensures compatibility with VMware vSphere™ 4 and that Isilon is ready for deployment in customer environments.

Assumptions

In writing this Configuration Guide, it was assumed the reader has:
• Understanding of the NFS protocol
• Working knowledge of Isilon IQ storage, the OneFS® operating system, the WebUI and the command-line interface
• Working knowledge of VMware ESX Server and Virtual Center

This configuration guide provides the necessary steps to configure the Isilon IQ storage cluster and ESX hosts to manage virtual machine datastores on Isilon storage systems over NFS.

2. Using Network Attached Storage (NAS) with VMWare

VMWare Support for NAS

VMware introduced support for NAS datastores in ESX Server 3.0. Prior to ESX 3.0, VMware supported only block-level storage options, i.e. direct attached storage (DAS) or storage area networks (SAN). With NAS support in ESX Server, customers have a more manageable and flexible alternative to traditional block-level DAS or SAN storage.

Fundamentally, NAS stores large VM datastores on a Network File System (NFS, an industry-standard file sharing protocol) export rather than on VMFS volumes and storage LUNs. The storage system is presented to each ESX Server as a network mount, and ESX then stores and accesses VMs on the storage system using NFS.

Among the advantages of NAS-based datastores are:
• Rapid and simple storage provisioning: Once storage is allocated to an ESX host, it can be used and re-used as required.
• Lower costs: Implementing a NAS infrastructure costs less than comparable SAN-based architectures.
This is primarily due to lower networking and hardware costs, but also due to lower management costs, as dedicated storage administrators are typically not required.
• Higher storage utilization rates: VMware disk files (VMDK files) are thin-provisioned by default with NAS datastores.
• Easier to manage storage with multiple VMs: Instead of managing LUNs for individual VMs, all VMDK files may be stored on a common file export.
• Simplified backup scenarios: All VM files may be backed up behind a single, central mount point.

However, not all NAS products are created equal. Traditional NAS vendors that rely on a single-head architecture that simply manages SAN storage “under the covers” have some disadvantages:
• Capacity scalability is limited to a single device: Traditional NAS systems are based on a scale-up architecture, where a finite amount of storage capacity is added within an individual storage device. However, a volume or LUN is often the limiting factor to scalability, typically at 2 to 16 TB.
• Single point of failure: If the device or head fails, access to VMs may be lost. Traditional failover clustering options provided by NAS vendors are often not sufficient.
• Management complexity at scale: While traditional NAS systems are relatively easy to set up and configure, they become complicated to manage in large numbers. As more ESX hosts require storage, multiple file systems and mount points need to be provisioned and managed across multiple storage devices, each of which represents a separate management point.
• Performance scalability is limited to a single device: While individual NAS heads may provide adequate performance for a limited number of VMs (like a traditional server), at some point the NAS system will run out of performance resources depending on the number of VMs (and associated application workloads) stored on the device.
• Unsupported features: Some features for ESX Server (or through vSphere) may not be available using traditional NAS-based solutions.
These include the ability to boot ESX server from SAN, Raw Device Mapping (RDM) for accessing SAN LUNs directly, Microsoft server clustering services, and VMWare SRM 1.0 (however, SRM 1.1 will support NAS storage).

3. Using Isilon Scale-Out NAS with VMware

The Isilon IQ platform is a “scale-out” approach to NAS storage and offers significant reductions in management overhead by simplifying management of the cluster through a single, common management point. In an Isilon cluster, petabytes of storage can be administered in a single file system instead of many small islands of storage.

Isilon's OneFS® operating system combines the conventional, separate layers for RAID, volume management and file system into one unified software layer, creating a single symmetric cluster file system that spans all nodes within a cluster.

Scalability and Performance

Each Isilon IQ node contains disk capacity, CPU, memory and network connectivity. As additional Isilon IQ nodes are added to a cluster, all aspects of the cluster scale symmetrically, including capacity, throughput, memory, CPU and network connectivity. In contrast to traditional NAS designs, adding capacity to an Isilon node does not create bottlenecks with other system resources. Isilon IQ offers aggregate throughput from a single file system of up to 45 GB/second, with up to 5.2 petabytes of storage.

Availability and Reliability

Isilon IQ is a fully distributed architecture where all nodes work together to form a unified file system, tolerant of any component failure, including entire nodes. The Isilon file system goes beyond traditional RAID to protect against multiple failures in a cluster without losing data availability, and leverages the compute power of all nodes to deliver fast drive rebuild times. In addition, Isilon IQ clusters provide flexible protection levels on a file-by-file basis, protecting files independently from the location where they are stored.
Additionally, with local data protection available with the Isilon SnapshotIQ application and seamless NFS failover available with Isilon SmartConnect, Isilon provides the high availability and reliability required for a virtualized datacenter.

4. Isilon IQ Cluster Configuration

This section provides requirements and best practices for configuring an Isilon cluster for use with an ESX Server.

Cluster Configuration Considerations

• When an ESX datastore is created, the directory that the datastore points to must already exist.
• Take care with directory ownership: in most cases, the directories for VM images will be created locally on the cluster by the root user. By default, root access to the cluster over NFS is limited by mapping the root user to the user nobody. If the directory is created by root and the ownership isn't changed, the ESX server(s) can't write to the directory. Write access can be assured by one of two methods:
  1. Using chown to change the owner to nobody, i.e. chown nobody:wheel <directory>
  2. Using the NFS Exports page in the WebUI to map root access to the root user. However, this is not recommended as it can be a security hole.
• By default, the cluster's NFS write commit behavior is set to synchronous. This ensures every write operation issued by a VM is committed to disk as soon as possible. This extra level of data consistency incurs per-operation latency overhead and may not be necessary for many virtualized applications. Disabling synchronous writes may increase performance, but be careful when determining whether applications can support asynchronous writes.
• To configure this behavior:
  1. From the WebUI select File System > File Sharing Services > Configure NFS.
  2. On the Configure NFS page, select Synchronous or Asynchronous in the Write commit behavior section.
  3. Click Save.

Figure 1 - ESX Datastore on Isilon IQ Cluster

Isilon Networking Concepts

FlexNet 2.0

FlexNet™ is the OneFS subsystem used for configuring and managing network interfaces.
With the introduction of OneFS 5.0, major improvements were made to FlexNet, now in version 2.0.
• FlexNet 2.0 is designed to support complex and variable network topologies. It has several hierarchical and overlapping management objects that allow for extremely flexible configurations, simply defined and managed.
• FlexNet 2.0 is tightly integrated with Isilon SmartConnect to provide increased network connectivity and availability, as well as easy management.

The following terms are important for understanding the operation of FlexNet 2.0:
1. Subnet – Specifies a network subnet, netmask, gateway and other parameters related to layer-3 networking. VLAN tagging is configured here. A subnet contains one or more pool objects, which assign a range of IP addresses to network interfaces on the cluster nodes.
2. Pool – Also referred to as an IP Address Pool, containing one or more network interfaces (e.g. External-1) and a set of IP addresses to be assigned to them. SmartConnect settings, such as the zone name and whether IPs are allocated statically or dynamically, are also configured at the Pool level.
3. SmartConnect – Provides the ability to distribute client connections across a set of IPs in the pool based on a common DNS name. SmartConnect Advanced Dynamic IP allocation allows the IPs in the dynamic pool to migrate across all interface members in the pool and fail over from one member to another in case of an interface or complete node failure.
4. Provisioning Rule – Specifies subnet and IP pool assignment actions when a node is added to the cluster, based on the node type and interface. For example, a rule could state that when a node of type storage is added, External-1 and External-2 are assigned to two different pools, which in turn belong to two separate subnets.

Figure 2 - FlexNet 2.0 Pools, Subnets and Rules

FlexNet 2.0 and SmartConnect

FlexNet 2.0 is tightly coupled with SmartConnect, the OneFS client load-balancing and failover application.
At the subnet level, the SmartConnect Service IP Address, formerly known as the Virtual IP (VIP), is specified. This is the IP address used primarily by a DNS server to forward SmartConnect zone lookups to the cluster.

SmartConnect options having to do with zone name, load balancing and failover are set at the pool level. Different pools inside the same subnet can have different configurations for different use cases.

NOTE: There are limitations to using SmartConnect with Virtual Center. Please see “Limitations of Virtual Center and SmartConnect”, below, for details.

Isilon Network Configuration

Initial Configuration

FlexNet 2.0 introduces a new process for configuring external networking. When initially configuring a cluster, the first external network interface (typically External-1) is set up as part of the configuration process. In order for this process to complete successfully, the following information is required:
• Netmask
• IP address range
• Default gateway
• Domain name server list (optional)
• DNS search list (optional)
• SmartConnect zone name (optional)
• SmartConnect service address (optional)

When this information is provided, the following actions occur:
• A default external subnet is created, named subnet0, with the netmask and optional SmartConnect service address.
• A default IP address pool is created, named pool0, with the specified IP address range, the gateway, the optional SmartConnect zone name, and the initial external interface with the first node in the cluster as the only member.
• A default network provisioning rule is created, named rule0, which automatically assigns the first external interface for all newly added nodes to pool0.
• pool0 is added to subnet0 and configured to use subnet0 as its SmartConnect service address.
• The global outbound DNS settings are configured with the optional domain name server list and DNS search list, if provided.

Upgrade Configuration

When an Isilon cluster is upgraded to OneFS v5.0, the following external networking and connection balancing configuration changes will automatically occur:

Each FlexNet profile in earlier versions of OneFS will be transformed into a subnet. You can view the new subnets by clicking Networking on the Cluster menu in the web administration interface.

In a simple external network configuration consisting of one SmartConnect zone with dynamic IP addresses, the upgrade will retain all settings from the earlier OneFS version, including dynamic IP addresses that were part of the FlexNet profile, the load balancing policy, the SmartConnect zone name, and the interface members.

If your Isilon cluster contains multiple SmartConnect zones with both dynamic and static IP addresses, after upgrading to OneFS v5.0 all the dynamic IP addresses will be consolidated into one SmartConnect zone and all the static IP addresses will be consolidated into a second SmartConnect zone. External network settings can be edited using the WebUI or CLI. Figure 3 shows the Edit Subnet page from the WebUI.

Figure 3 - WebUI Subnet Configuration

VLAN Tagging

Virtual LANs (VLANs) are used to logically group together network endpoints, and to partition network traffic, e.g. for security. VLANs are tagged with a unique identifier to segregate traffic. FlexNet 2.0 supports VLAN tagging for use in external networks using VLANs. In FlexNet, VLANs are configured at the Subnet level.

Configuring Link Aggregation

Isilon OneFS supports the use of redundant NICs to provide layer-2 failover.
OneFS link aggregation supports the IEEE 802.3ad static LAG protocol, and works with switches and clients that support this protocol.

Note: OneFS uses link aggregation primarily for Network Interface Card (NIC) failover purposes. Both NICs are used for client I/O, but the two channels are not bonded into a single 2 Gigabit link. Each NIC serves a separate TCP connection.

Link Aggregation Switch Support

Isilon network link aggregation requires 802.3ad static support and proper configuration on the switch. Cisco switches offer this support using the EtherChannel feature. It is highly recommended to configure cross-stack EtherChannel to provide protection against switch failures as well as NIC failures.

Link aggregation can be configured for a new subnet or an existing one, and requires creating an IP pool with the aggregated interface on each node as the pool's members:
1. On the Edit Subnet page, at the top of the IP Address Pools section, click the Add pool link.
2. In the Create Pool wizard, enter a name for the pool, an optional description, and a range of IP addresses to use for this pool. Click Next.
3. If SmartConnect is used, options for the pool can be set on the next page of the wizard. Once these options have been selected, click Next.
4. On the next page, the interfaces to be members of this pool are selected. To use link aggregation, select the ext-agg interface for each node to be in the pool. The interface type is also listed as AGGREGATION.
5. Click Submit to complete the wizard.

Note: Link aggregation provides protection against NIC failures but does not increase performance. A recommended alternative is to assign both NICs in a node to the same dynamic IP pool to gain both a performance increase and NIC failure redundancy through dynamic IP failover.
This is covered in the section to follow, High Availability with NFS Failover.

Configuring SmartConnect

Limitations of Virtual Center and SmartConnect

Due to the way Virtual Center manages datastore location paths, it does not support a DNS infrastructure in which a hostname is bound to multiple IP addresses, which is required for SmartConnect zone names to work for datastore creation and use. This means datastores must be created using the IP addresses of a cluster.

This limitation does not preclude the use of dynamic IP addresses to implement NFS failover on the cluster. NFS failover is supported with VI 3 and vSphere when the dynamic IP addresses of cluster nodes are used.

High Availability with NFS Failover

How NFS Failover Works

SmartConnect implements NFS failover by assigning one or more dynamic IP addresses to each node in the cluster from a configured range of addresses. If a single interface or an entire node experiences a failure, SmartConnect moves the dynamic IP address to the remaining interfaces or to another node. Any I/O taking place to the failed node continues without interruption.

When a node's interface or the entire node is brought back online, the dynamic IPs in the pool are redistributed across the new set of interface pool members. This failback mechanism can occur automatically or manually.

NFS failover is configured at the FlexNet pool object level, either at the time the pool is created, or by changing the pool settings on the network configuration page of the WebUI. Please see the OneFS 5.0 User Manual for specific steps.

When a VMware NFS datastore is created, the dynamic IP address of the node is used, not the static IP address assigned to the node during initial cluster configuration. In case of an NFS failover, NFS datastore traffic continues without interruption to the new storage interface that has been assigned the dynamic IP on which the datastore path is defined.
In case of failback, the dynamic IPs on the storage cluster are redistributed, again without interruption to the NFS datastore traffic.

Isilon Network Design Best Practices

Isilon has developed a network topology that provides maximum performance, flexibility and availability for VMware installations. The design is, in effect, a mesh connectivity design, in which every ESX server is connected to every IP address on a cluster, up to configuration maximums (see next section). Connecting “everything to everything” enables the following capabilities:
• Since by definition all servers are connected to all datastores, VMotion between all ESX servers can be performed knowing that both servers can see the same datastore and thus the migration will be successful.
• VMs can be created on different datastores to balance the I/O load between ESX servers and the cluster; these can be easily moved between datastores to eliminate hot spots. The more NFS datastores are created, the more TCP connections an ESX host can leverage to balance VM I/O.

Figure 4 illustrates an example of this recommended configuration. Each ESX host has a primary datastore, with secondary connections to additional datastores located on the cluster.

Figure 4 - Multiple datastores on a single NFS volume

Increasing Performance with the Maximum Number of NFS Mounts

In VMware, every NFS datastore represents a separate TCP connection to the NFS server, increasing aggregate storage throughput by parallelizing VM I/O to the storage system.

With the Isilon clustered storage architecture, multiple NFS datastores can all point to the same NFS mount on the Isilon cluster through different (preferably dynamic) IPs. These multiple NFS datastores can also share a single pool of storage, granting each datastore access to all VMs and allowing an administrator to quickly register and unregister a VM across datastores.
This single pool of storage increases availability, performance and adaptability to changing performance requirements and growth.

When this topology is implemented with larger numbers of ESX servers and/or cluster IP addresses, it may be necessary to increase the number of NFS mounts available to an ESX server machine from the default of eight.
1. In the VI console, select the ESX Server, then select the Configuration tab.
2. In the Software section, select Advanced Settings.
3. In the Advanced Settings dialog, select NFS from the left-side list.
4. Locate the setting NFS.MaxVolumes, then set the value to a number between 8 and 32 inclusive.
5. Click OK.

5. ESX Configuration

This section details the steps necessary to configure ESX Server for use with Isilon storage. Follow the steps below to configure a network between the ESX server machine and the Isilon cluster.

Creating a Virtual Switch

The first step is to create a virtual switch for all network traffic between the ESX server machine and the Isilon cluster.
1. In the VMware Infrastructure or vSphere Client, select the ESX server machine in the left-side tree view, then select the Configuration tab in the right-side pane.
2. Under Hardware, select Networking, then select Add Networking.
3. In the Add Network Wizard, in the Connection Types section, select VMkernel, then click Next.
4. On the Network Access screen, select Create a virtual switch, or select an existing virtual switch. Click Next.
   Best practice: Create the virtual switch using at least one dedicated network interface card (NIC) for network storage traffic. This will ensure good performance, as well as isolate any problems from other traffic.
5. On the Connection Settings screen, enter a network label and optional VLAN ID.
   It's often helpful to give the virtual switch a meaningful label, such as “NFS Storage”.
   Note: For more information on VLAN usage in ESX Server, see the VMware whitepaper VMware ESX Server 3 802.1Q VLAN Solutions.
6. In the IP Settings section, enter an IP address and subnet mask for the VMkernel port.
7. If necessary, click the Edit button to change the default gateway. Click Next to go to the Summary screen.
8. On the Summary screen, review the settings, and if correct, click Finish.
9. Figure 5 provides an example configuration with virtual machine and VMkernel networks using separate physical NICs.

Figure 5: Example Network Configuration

Configuring a Service Console (VI3 Only)

It's important to configure a service console on the virtual switch you just created. Without a service console, it is possible for the ESX server machine to lose connectivity to storage located on the virtual switch. This step is NOT necessary for vSphere and ESX 4.0.
1. In the VI Client, on the Configuration tab for the ESX server machine, select Properties next to the virtual switch that you just created.
2. In the Properties dialog, on the Ports tab, click Add.
3. In the Add Network Wizard, in the Connection Types section, select Service Console, then click Next.
4. On the Connection Settings screen, enter a network label and optional VLAN ID.
5. Give the console a static IP address or let it obtain one via DHCP, then click Next.
6. On the Summary screen, review the settings, and if correct, click Finish.

Using JUMBO Frames

Best practice: Isilon recommends using JUMBO frames with MTU 9000 rather than the default 1500. This requires both the ESX NIC and the switch to support JUMBO frames. Isilon storage nodes (4.7 and above) already support JUMBO frames. Enabling JUMBO frames on ESX is performed through the ESX service console CLI:
1. Assuming the VMkernel port of NFS Storage is created on vSwitch1, run the following command line:
   esxcfg-vswitch -m 9000 vSwitch1
2. A quick run of “esxcfg-vswitch -l” (that's a lowercase L) will show the vSwitch's MTU is now 9000; in addition, “esxcfg-nics -l” (again, a lowercase L) will show the MTU for the NICs linked to that vSwitch is now set to 9000 as well.
3. Create a VMkernel interface with JUMBO frames (unfortunately an existing VMkernel switch cannot be updated to use JUMBO frames). This step is a bit more complicated, because a port group needs to be in place already, and that port group needs to be on the vSwitch whose MTU was set previously:
   esxcfg-vmknic -a -i 172.16.1.1 -n 255.255.0.0 -m 9000 "NFS Storage"
4. This creates a port group called “NFS Storage” on vSwitch1 (the vSwitch whose MTU was previously set to 9000) and then creates a VMkernel port with an MTU of 9000 on that port group. Be sure to use an IP address that is appropriate for your network when creating the VMkernel interface.
5. Go back to the Isilon Cluster Network Management WebUI page and make sure the subnet MTU is set to 9000.
6. Set up your switch to support JUMBO frame traffic. On a Cisco Catalyst 3750 switch the command is:
   system mtu jumbo 9000
7. To test that everything is working so far, use the vmkping command from the ESX service console:
   vmkping -s 9000 172.16.1.20

Configuring Link Aggregation

Link aggregation, also known as NIC failover or NIC teaming, is one approach to ensuring higher network availability between the ESX server and Isilon cluster. NIC teaming is a layer-2 IEEE standard known as 802.3ad. Perform the following steps to configure NIC teaming on an ESX server.

NOTE: NIC teaming requires that both NICs involved in the team are on the same subnet.
1. If two NICs are not configured in the virtual switch, add a second NIC by selecting Properties on the Configuration tab for the ESX Server.
2. On the Properties dialog, select the Network Adapters tab, then click Add.
   Follow the instructions in the Add Adapter wizard.
3. After adding the second NIC, the virtual switch diagram will look like Figure 6.

Figure 6: NIC Teaming Configured on ESX host

• Once the second NIC is added to the virtual switch, teaming is enabled using a default configuration. To change NIC teaming options, select Properties... for the virtual switch.
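The mesh design described under Isilon Network Design Best Practices (every ESX host mounts one NFS datastore per cluster IP, within the NFS.MaxVolumes range of 8 to 32) can be sketched as a small planning helper. This is only an illustration: the function name, the datastore naming scheme, the host names, the IP addresses and the export path are all hypothetical and are not part of OneFS or of any VMware tool.

```python
DEFAULT_MAX_VOLUMES = 8   # ESX default quoted in the guide
UPPER_MAX_VOLUMES = 32    # largest NFS.MaxVolumes value quoted in the guide


def plan_mesh(esx_hosts, cluster_ips, export_path):
    """Plan one NFS datastore per cluster IP for every ESX host.

    Returns (plan, needed), where plan maps each host to a list of
    (datastore_name, ip, export_path) tuples and needed is the
    NFS.MaxVolumes value each host requires for that layout.
    """
    mounts_per_host = len(cluster_ips)
    if mounts_per_host > UPPER_MAX_VOLUMES:
        raise ValueError("more cluster IPs than NFS.MaxVolumes allows")
    plan = {
        host: [
            # Hypothetical naming scheme: isilon-ds1, isilon-ds2, ...
            (f"isilon-ds{i + 1}", ip, export_path)
            for i, ip in enumerate(cluster_ips)
        ]
        for host in esx_hosts
    }
    # The default of eight only needs raising when there are more IPs than that.
    needed = max(DEFAULT_MAX_VOLUMES, mounts_per_host)
    return plan, needed


if __name__ == "__main__":
    ips = [f"10.0.0.{n}" for n in range(1, 11)]  # ten assumed dynamic IPs
    plan, needed = plan_mesh(["esx1", "esx2"], ips, "/ifs/vmware")
    print("NFS.MaxVolumes needed per host:", needed)
    for name, ip, path in plan["esx1"][:2]:
        print(name, ip, path)
```

With ten cluster IPs the helper reports that NFS.MaxVolumes must be raised from the default of eight to ten on each host, which is the situation the Advanced Settings steps in the guide address.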

Matching Document Pairs Using Multi-Feature Semantic Fusion Based on Knowledge Graph

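Journal of Central South University (Science and Technology), Vol. 54, No. 8, Aug. 2023

Matching document pairs using multi-feature semantic fusion based on knowledge graph

CHEN Yibo1, ZHANG Zuping2, HUANG Xin1, XIANG Xing1, HE Zhiqiang1
(1. State Grid Hunan Electric Power Company Limited, Changsha 410004, China; 2. School of Computer Science and Engineering, Central South University, Changsha 410083, China)

Abstract (translated from the Chinese): To distinguish homogeneity and heterogeneity between documents, a Multi-Feature Semantic Fusion Model (MFSFM) is first proposed to capture document keywords. It represents entities with a semantically enhanced multi-feature representation and introduces a local attention mechanism into the multi-convolutional mixed residual CNN module to improve sensitivity to entity boundary information. Next, a keyword co-occurrence graph is built for each document, and a community detection algorithm is applied to detect concepts and thereby represent the document, so that document pairs can be matched. Finally, two multi-feature document datasets are built to validate the feasibility of the proposed MFSFM-based matching method; each dataset contains about 500 real feasibility reports of science and technology projects.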

The results show that the proposed model improves classification accuracy on the CNSR and CNSI datasets by 13.67% and 15.83%, respectively, and converges quickly.

CLC number: TP391  Document code: A  Article ID: 1672-7207(2023)08-3122-10

Matching document pairs using multi-feature semantic fusion based on knowledge graph

CHEN Yibo1, ZHANG Zuping2, HUANG Xin1, XIANG Xing1, HE Zhiqiang1
(1. State Grid Hunan Electric Power Company Limited, Changsha 410004, China; 2. School of Computer Science and Engineering, Central South University, Changsha 410083, China)

Abstract: To distinguish the homogeneity and heterogeneity among documents, a Multi-Feature Semantic Fusion Model (MFSFM) was firstly proposed to capture document keywords, which employed a semantically enhanced multi-feature representation to depict entities. A local attention mechanism in the multi-convolutional mixed residual CNN module was introduced to enhance sensitivity to entity boundary information. Secondly, a keyword co-occurrence graph for documents was constructed and a community detection algorithm was applied to represent concepts, thus facilitating document matching. Finally, two multi-feature document datasets were established to validate the feasibility of the proposed MFSFM-based matching approach, with each dataset comprising approximately 500 real feasibility reports of scientific and technological projects. The results indicate that the proposed model achieves an increase in classification accuracy of 13.67% and 15.83% on the CNSR and CNSI datasets, respectively, and demonstrates rapid convergence.

Key words: document pairs matching; multi-feature semantic fusion; knowledge graph; concept graph

Received: 2022-05-15; Revised: 2022-09-09
Foundation item: Project (2019TP1016) supported by Hunan Key Laboratory for Internet of Things in Electricity; Project (5216A6200037) supported by Key Technologies of Power Knowledge Graph; Project (72061147004) supported by the National Natural Science Foundation of China; Project (2021JJ30055) supported by the Natural Science Foundation of Hunan Province
Corresponding author: ZHANG Zuping, PhD, Professor, engaged in research on big data analysis and processing; E-mail: ***************.cn
DOI: 10.11817/j.issn.1672-7207.2023.08.016
Citation: CHEN Yibo, ZHANG Zuping, HUANG Xin, et al. Matching document pairs using multi-feature semantic fusion based on knowledge graph[J]. Journal of Central South University (Science and Technology), 2023, 54(8): 3122−3131.

Identifying the relationship between a pair of documents is a natural language understanding task, and an essential step in document duplicate checking and document search.

How to Overcome Data Imbalance and Labeling Difficulties in Computer Vision

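Data imbalance and labeling difficulties are among the challenges most often encountered in computer vision. They can make model training inaccurate and undermine reliability in real applications. This article discusses how to overcome both.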

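Data imbalance means that the number of samples differs greatly between classes in the training set. In this situation the model tends to be biased toward the classes with many samples and performs poorly on the classes with few.

To overcome this problem, the following strategies can be used:

1. Data augmentation: data augmentation generates additional training samples by applying transformations such as rotation and scaling to existing data. Applied to the minority classes, it increases their sample count and makes the training set more balanced. For example, random cropping, rotation, and reflection of images increase the diversity of the data.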

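2. Sampling strategies: sampling can be used to balance the number of samples between classes. Undersampling randomly removes some samples from the majority classes, and oversampling duplicates some samples from the minority classes, so that the classes become balanced. In addition, stratified sampling keeps every class represented in each split in proportion to its share of the data, so minority classes are not lost when the dataset is divided.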

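3. Class weight adjustment: assigning a different weight to each class makes the model pay more attention to the minority classes and so reduces the effect of the imbalance. A common approach is to use the cross-entropy loss with a weight per class, so that errors on minority classes contribute more to the loss during training.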
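The class-weighted cross-entropy in strategy 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name, the normalisation by the sum of weights, and the example weights are all hypothetical choices.

```python
import math

def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted mean of -log p[y]; each sample is weighted by its class weight.

    probs: list of per-class probability lists, one per sample.
    labels: list of integer class indices.
    class_weights: list of weights indexed by class (assumed, user-chosen).
    """
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        total += -w * math.log(p[y])
        weight_sum += w
    return total / weight_sum

probs = [[0.9, 0.1], [0.2, 0.8]]   # predicted probabilities for two samples
labels = [0, 1]                    # class 1 plays the role of the minority class
plain = weighted_cross_entropy(probs, labels, [1.0, 1.0])
weighted = weighted_cross_entropy(probs, labels, [1.0, 9.0])
```

With the larger weight on class 1, the loss is dominated by the error on the minority-class sample, which is the effect the strategy aims for.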

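Annotation difficulty means that labels in the training set are wrong, inconsistent, or missing. This makes it hard for the model to learn the features and attributes of the samples accurately.

To overcome annotation difficulties, the following methods can be used:

1. Label checking and correction: inspect the training data carefully and correct labeling errors. Specialised tools, such as annotation comparison tools and annotation quality assessment models, can assist with checking and correction. Collaborative annotation, in which several annotators label the same sample, also improves the accuracy of the labels.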

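2. Semi-supervised learning: semi-supervised learning trains a model on a small number of labeled samples together with a large number of unlabeled samples. A small subset can be labeled accurately and then combined with the unlabeled data during training, which increases the amount of usable training data.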

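How to Deal with Data Imbalance in Computer Vision

Data imbalance is a common problem in computer vision. When training and testing a vision model, some classes in the dataset may have far more samples than others. Such an imbalanced distribution can bias the trained model toward the majority classes and cause it to neglect the minority classes, so handling data imbalance is an important topic.

The following methods can be used to deal with data imbalance in computer vision.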

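First, a common method is undersampling. It reduces the number of samples in the majority classes so that the class sizes in the dataset become more balanced.

Undersampling comes in several forms, such as random undersampling and cluster-based undersampling. Random undersampling randomly removes samples from the majority classes until their size is close to that of the minority classes. Cluster-based undersampling clusters the majority-class samples and then selects representative samples from each cluster.

Undersampling reduces the influence of the majority classes on the model and improves its ability to learn the minority classes.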
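Random undersampling as described above can be sketched in a few lines. This is only an illustration under assumptions: the helper name `random_undersample` is hypothetical, and every class is reduced to the size of the smallest class, which is one possible target among many.

```python
import random
from collections import Counter, defaultdict

def random_undersample(samples, labels, seed=0):
    """Randomly keep the same number of samples from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    # Assumed target size: the size of the smallest class.
    target = min(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        for sample in rng.sample(items, target):
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

samples = list(range(100))
labels = ["majority"] * 90 + ["minority"] * 10
kept_samples, kept_labels = random_undersample(samples, labels)
kept_counts = Counter(kept_labels)
```

The discarded majority samples are simply lost, so this approach trades information for balance.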

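Besides undersampling, oversampling can also be used. It increases the number of samples in the minority classes so that the class sizes become more balanced.

One oversampling method is SMOTE (Synthetic Minority Over-sampling Technique), which increases the number of minority-class samples by synthesizing new ones. Specifically, for a minority-class sample, SMOTE finds its K nearest neighbors within the minority class, randomly picks one of them, and generates a new sample at a random position on the line segment between the two.

Oversampling improves the model's ability to learn the minority classes and reduces the effect of the imbalance.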
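The interpolation step of SMOTE can be sketched as below. This is a simplified illustration, not the reference algorithm: the function name `smote`, its parameters, and the brute-force neighbor search are assumptions, and it expects at least two minority samples.

```python
import numpy as np

def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng() if rng is None else rng
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)
    # Brute-force pairwise distances inside the minority class.
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbor
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_new, minority.shape[1]))
    for t in range(n_new):
        i = rng.integers(n)                   # pick a minority sample
        j = neighbours[i, rng.integers(k)]    # pick one of its k nearest neighbors
        gap = rng.random()                    # position along the segment, in [0, 1)
        synthetic[t] = minority[i] + gap * (minority[j] - minority[i])
    return synthetic

rng = np.random.default_rng(0)
minority = rng.random((10, 2))
new_points = smote(minority, n_new=20, k=3, rng=rng)
```

Because each synthetic point is a convex combination of two real minority samples, it always lies inside the bounding box of the minority class.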

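In addition, class weighting can be used. Class weights give the minority classes a higher weight so that the model pays more attention to them during training.

Typically, the weight of a class is the reciprocal of its share of the samples in the dataset. Using these weights in the loss function during training makes the model more sensitive to the minority classes and improves its ability to learn them.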
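The inverse-frequency weights described above can be computed directly from the label counts. A minimal sketch, assuming the weight is exactly the reciprocal of the class proportion with no further normalisation; the function name and example labels are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight of a class = 1 / (its share of the samples) = n / count."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / count for cls, count in counts.items()}

labels = ["cat"] * 90 + ["dog"] * 10
weights = inverse_frequency_weights(labels)
```

Here the minority class "dog" makes up one tenth of the data and therefore receives a weight nine times larger than the weight of "cat". In practice these weights are often rescaled, for example so that they average to one.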

Graph-regularized Bayesian broad learning system

Chinese Journal of Intelligent Science and Technology, Vol. 4, No. 1, March 2022

DUAN Junwei 1, XU Lincan 1, QUAN Yujuan 1, CHEN Long 2, CHEN C. L. Philip 3
1. College of Information Science and Technology, Jinan University, Guangzhou 510632, China
2. Faculty of Science and Technology, University of Macau, Macau 999078, China
3. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China

Abstract: As a feedforward neural network, the broad learning system (BLS) has attracted much attention because of its high accuracy, fast training speed, and ability to effectively replace deep learning methods. However, it is sensitive to the number of feature nodes, and the pseudo-inverse method is likely to cause overfitting in the BLS model. To address these issues, Bayesian inference and graph regularization are introduced into the BLS model. On the one hand, introducing prior knowledge through Bayesian learning effectively improves the sparsity of the weights and the stability of the model; on the other hand, graph regularization takes the intrinsic graph information of the data into account and further improves the generalization ability of the model. The UCI and NORB datasets were used to evaluate the performance of the proposed model. The experimental results demonstrate that the proposed graph-regularized Bayesian broad learning system further improves classification accuracy and has better stability.

Key words: broad learning system; Bayesian inference; graph regularization; pattern recognition
CLC number: TP181  Document code: A
doi: 10.11959/j.issn.2096−6652.202203

Received: 2021-02-01; Revised: 2021-04-17
Corresponding author: QUAN Yujuan, Tquanyj@
Foundation items: The National Key Research and Development Program of China (No.2018YFC2002500), The Guangdong Basic and Applied Basic Research Foundation (No.2021A1515011999), The Guangzhou Science and Technology Innovation and Development Special Fund Project (No.201902010041)

0 Introduction
At present, using artificial intelligence technology to acquire, analyze, and process data quickly and accurately has become a widely studied problem [1-2].

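Multi-manifold

• Classical MDS algorithm
• Node-Weighted MDS algorithm
Idea 1: Manifold Clustering
• K-manifolds can cluster multiple manifolds that intersect one another, but it cannot be applied to manifolds that are far apart. The reason is that the algorithm has to compute the geodesic distance matrix and therefore cannot handle disconnected data.
Idea 2: Spectral clustering
• SMMC (Spectral Multi-Manifold Clustering)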
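\[ \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}(q_i - q_j)^2 = 2q^{T}(D - W)q \]
Now define a matrix L:
\[ L = D - W \]
L is called the Laplacian matrix, W is the weight matrix (also called the adjacency matrix), and D is the degree matrix. Then
\[ \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}(q_i - q_j)^2 = 2q^{T}Lq \]
Spectral Clustering
Laplacian matrix
\[ q^{T}Lq = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}(q_i - q_j)^2 \ge 0 \]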
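Much high-dimensional sampled data is determined by only a few latent variables. A face image, for example, is determined by factors such as the lighting, the distance between the person and the camera, the head pose, and the facial muscles. Such high-dimensional data lies on a low-dimensional manifold, and human cognition is based on this kind of low-dimensional manifold.
Manifold learning aims to learn and discover, from a finite set of discrete samples, the smooth low-dimensional manifold embedded in the high-dimensional space, to reveal the intrinsic low-dimensional structure hidden in the high-dimensional dataset, and to reconstruct it for nonlinear dimensionality reduction or visualization.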
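\[ \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}(q_i - q_j)^2 = -2\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}q_i q_j + 2\sum_{i=1}^{n} q_i^2 \sum_{j=1}^{n} w_{ij} = 2q^{T}(D - W)q \]
where D is the diagonal matrix with
\[ D_{ii} = \sum_{j=1}^{n} w_{ij} \]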
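The Laplacian identity can be checked numerically. The sketch below is only an illustration: it assumes a symmetric, non-negative weight matrix with zero diagonal (the setting in which the identity holds), and the random matrix and vector are arbitrary test data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Assumed setting: symmetric non-negative weights, no self-loops.
A = rng.random((n, n))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)

D = np.diag(W.sum(axis=1))   # degree matrix, D_ii = sum_j w_ij
L = D - W                    # graph Laplacian

q = rng.normal(size=n)
quad = q @ L @ q                                              # q^T L q
pairwise = 0.5 * np.sum(W * (q[:, None] - q[None, :]) ** 2)   # (1/2) sum_ij w_ij (q_i - q_j)^2
smallest_eigenvalue = np.linalg.eigvalsh(L).min()
```

Both expressions agree up to rounding, and the smallest eigenvalue of L is zero up to rounding, which reflects that the quadratic form is never negative.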

ADAPTIVE DISTRIBUTED MULTIDIMENSIONAL SCALING FOR LOCALIZATION IN SENSOR NETWORKS

Jose A. Costa, Neal Patwari and Alfred O. Hero III
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor, MI 48109
Emails: jcosta@, npatwari, hero@.

ABSTRACT

Accurate, distributed localization algorithms are needed for a wide variety of wireless sensor network applications. This paper introduces a scalable, distributed weighted-multidimensional scaling (dwMDS) algorithm that adaptively emphasizes the most accurate range measurements and naturally accounts for communication constraints within the sensor network. For received signal-strength (RSS) based range measurements, we demonstrate via simulation that location estimates are nearly unbiased with variance close to the Cramér-Rao lower bound (CRB). Further, RSS and time-of-arrival (TOA) channel measurements are used to demonstrate performance as good as the centralized maximum-likelihood estimator (MLE) in a real-world sensor network.

1. INTRODUCTION

For monitoring and control applications using wireless sensor networks, automatic localization of every sensor will be a key enabling technology. Sensor data must be registered to its physical location to permit deployment of energy-efficient routing schemes, source localization algorithms, and distributed compression techniques. Moreover, for applications such as inventory management and manufacturing logistics, localization and tracking of sensors are the primary purposes of the wireless network. For large-scale networks of inexpensive, energy-efficient devices, it is not feasible to include GPS capability on every device or to require a system administrator to manually enter all device coordinates. In this paper, we consider the location estimation problem for which only a small fraction of sensors have a priori coordinate knowledge, and range measurements between many pairs of neighboring sensors permit the estimation of all sensor coordinates.

Two major difficulties hinder accurate sensor
location estimation: first, accurate range measurements are expensive; and second, centralized estimation becomes impossible as the scale of the network increases. This paper proposes a distributed localization algorithm, based on a weighted version of multidimensional scaling (MDS), which naturally incorporates local communication constraints within the sensor network. Its key features are: (1) a weighted cost function that allows range measurements that are believed to be more accurate to be weighted more heavily; (2) an adaptive neighbor selection method that avoids the biasing effects of selecting neighbors based on noisy range measurements; (3) a majorization method which has the property that each iteration is guaranteed to improve the value of the cost function.

, where, with accuracy, is believed to lie around and many pairwise range measurements, , taken over time. The available range measurements indexes are in some subset of . We assume that this subset of measurements results in a connected network; otherwise, each connected subset should be considered individually.

3. DISTRIBUTED WEIGHTED MULTIDIMENSIONAL SCALING

3.1. The dwMDS Cost Function

We define MDS as the solution to the minimization of the following global cost function (a.k.a. STRESS function [1]): , and we assume that for each pair, up to dissimilarity measurements are available. The arbitrary weight () can be selected to quantify the predicted accuracy of measurement . If no such measurement is available between and , or its accuracy is zero, then . We assume that , and , i.e., the weights are symmetric. Note that function (1) differs from the standard MDS objective function in that we have added a penalty term to account for prior knowledge about node locations.

After simple manipulations, can be rewritten as follows: (2) where functions are defined for each unknown-location node (i.e. ), and measurement . As only depends on the measurements available at node and the positions of neighboring nodes, i.e., nodes for which (for some ), it can be viewed as
the local cost function at node .

3.2. Minimizing the dwMDS Cost Function

Motivated by the additive structure of the global cost (2), we propose an iterative distributed algorithm in which each sensor updates its position estimate by minimizing the corresponding local cost function, after taking measurements and receiving position estimates from its neighboring nodes.

However, unlike classical MDS, no closed form expression exists for the minimum of the cost function (or ). We address this problem by minimizing iteratively using quadratic majorizing functions as in SMACOF (Scaling by MAjorizing a COmplicated Function [2]). This method has the attractive property of generating a sequence of non-increasing STRESS values. Due to space limitations, we omit the derivation of the majorization function and its minimization and present only the final update equations for the nodes positions. The corresponding details can be found in [3]. If is the matrix whose columns contain the position estimates for all points at iteration , quadratic majorization of results in the following update equation for the position estimate of node : , , initial condition; Initialize: , , compute from equation (5); repeat ; for to compute from equation (6);

(a) neighborhood selection using actual distances (b) neighborhood selection using measured ranges (c) adaptive neighborhood selection
Fig. 2. Estimator mean () and 1-uncertainty ellipse (—) for each blindfolded sensor compared to the true location () and CRB on the 1-uncertainty ellipse (---).

which are closer than a threshold distance. However, when ranges are measured with noise, this selection process will tend to choose devices whose measured distances are shorter than the true distances, creating a negative bias phenomenon. Motivated by this phenomenon, we propose a two stage neighborhood selection process, based on the predicted distances between sensors. In the first step, the dwMDS algorithm from Fig. 1 is run with a neighborhood structure based on the available range measurements, i.e., set
if . After convergence, this step provides an interim estimate of the sensors locations, that, with high probability, will have negatively biased predicted distances. In the second step, these predicted distances from the estimated sensor locations are used to compute a new neighborhood structure, by assigning if . Some neighbors with low range measurements will be dropped, while others with possibly longer range measurements will be added. Then, using as an initial condition and the new neighborhood structure, the dwMDS algorithm is re-run, resulting in the final location estimates. Note that the predicted distances are used only to select neighbors (i.e., which weights are positive) – the measured ranges are still used to determine the weight values.

We remark that this 2-step algorithm does not imply twice the computation. The dwMDS algorithm is based on majorization, and each iteration brings it closer to convergence. Since the first step only needs to provide coarse localization information, the dwMDS algorithm can be stopped quickly with a large . Next, the second step begins with very good (although biased) coordinate estimates, so the second run of the dwMDS algorithm will likely require fewer iterations to converge.

4. EXPERIMENTAL RESULTS

4.1. Simulations

In this subsection, all the simulated data were generated according to the log-normal model for RSS range measurements (see [6] for details). If the received power in mW at sensor transmitted by sensor , mW, is log-normal, then received power in decibels, mW, is Gaussian. Typically is modeled as where is the mean power in decibel milliwatts at distance , is the variance of the shadowing and dBm is the received power at a reference distance . Typically m, and is calculated from the free space path loss formula. The path loss exponent is a parameter determined by the environment. This leads to the following expression for the range measurements: (8) In all the simulations presented in this subsection, .

We first demonstrate the performance of the proposed
algorithms on a network of sensors arranged on a uniform grid of unit area, in which the corner devices are anchor nodes and the remaining are unknown location devices. For all experiments on this configuration, we use m (yielding an average of 14 neighbors per device). We ran Monte Carlo simulation trials to determine confidence ellipses, root-mean-square error (RMSE) and bias performance (per sensor) of the location estimates. The results are displayed in Figure 2, where we plot the mean and 1-uncertainty ellipse of the estimator, and compare it to the actual device location and the CRB lower bound on the uncertainty ellipses [6]. We remark that the CRB shown is calculated assuming full connectivity (all devices measure range to all other devices), and as such provides only a loose lower bound on the best performance achievable by any unbiased estimator.

In the first experiment, we provide a baseline best-case scenario by using perfect (noise-free) distance measurements to select neighborhoods. The baseline assumes that we have an oracle to tell us when the true distance between and is less than a threshold, i.e., . This is shown in Figure 2(a), resulting in a RMSE of the location estimates of m and an average bias of m.

For the second experiment, the (noisy) RSS measurements are used to select neighbors, i.e., devices and are neighbors if . The results are shown in Figure 2(b). The estimates are strongly pulled towards the center of the square, due to the negative bias of the range estimates which are 'selected' by the connectivity condition. Now, the RMSE is m and the bias is m.

Table 1. RMSE of location estimates in experimental network
        Classical MDS    dwMDS
RSS     m                m
TOA     m                m

A third experiment uses the adaptive neighborhood selection method proposed in subsection 3.4. The results are displayed in Figure 2(c), where it can be seen that this method succeeds in removing the negative bias effect. The bias has gone back down to m, while the RMSE is m, just slightly higher than the baseline experiment using the
oracle.

4.2. Localization in an Experimental Sensor Network

To test the performance of the proposed algorithm on real-world channel measurements, we used the RSS and TOA measurements presented in [6]. This data set includes the RSS and TOA range measurements from a network of 44 devices (of which are anchor nodes) using a wideband direct-sequence spread-spectrum transmitter and receiver pair operating at a center frequency of GHz. The measurements were made in an open plan office building, within a m square area. We compare the performance of the dwMDS algorithm with adaptive neighborhood selection to classical MDS and the MLE based solutions from [6]. Table 1 summarizes the RMSE of the location estimates.

Figures 3(a) and 3(b) show the location estimates using classical MDS (which used all the pairwise range measurements between sensors) and the dwMDS algorithm, for the RSS measurement data set. The true and estimated sensor positions are marked by 'o' and '', respectively, where the lines represent the estimation errors. It can be observed that the dwMDS algorithm does much better than classical MDS. On the other hand, the RMSE of the dwMDS algorithm is slightly higher than the RMSE of the centralized MLE reported in [6]. However, that method, unlike the dwMDS, not only uses all pairwise range measurements, but also relies on previously estimating the channel parameters. If we allow to increase at the expense of increasing communication costs, the dwMDS algorithm can reach an RMSE as low as m for m.

Figures 3(c) and 3(d) show again the same location estimates for the TOA measurement data set. From Table 1, it can be seen that the dwMDS algorithm outperforms all other location estimators.

5. CONCLUSION

This paper introduced a distributed weighted-MDS method specially suited for node localization in a wireless sensor network.
The proposed algorithm is nonparametric in its nature, in the sense that it does not depend on any particular channel or range measurement models, making it applicable to a broad range of distance measurements, e.g., RSS, TOA, proximity, without the need to tweak any parameters.

(a) Classical MDS (RSS) (b) dwMDS (RSS) (c) Classical MDS (TOA) (d) dwMDS (TOA)
Fig. 3. Location estimates using RSS and TOA range measurements from experimental sensor network. True and estimated sensor locations are marked, respectively, by 'o' and '', while anchor nodes are marked by 'x'. The dwMDS algorithm uses adaptive neighbor selection, with m.

6. REFERENCES

[1] T. Cox and M. Cox, Multidimensional Scaling, Chapman & Hall, London, 1994.
[2] P. Groenen, The majorization approach to multidimensional scaling: some problems and extensions, DSWO Press, Leiden University, 1993.
[3] J. A. Costa, N. Patwari, and A. O. Hero III, "Distributed multidimensional scaling with adaptive weighting for node localization in sensor networks," submitted to ACM Trans. on Sensor Networks, 2004.
[4] C. Savarese, J. M. Rabaey, and J. Beutel, "Locationing in distributed ad-hoc wireless sensor networks," in Proc. of IEEE Int. Conf. on Acoust. Speech and Signal Proc., May 2001.
[5] S. Čapkun, M. Hamdi, and J. P. Hubaux, "GPS-free positioning in mobile ad-hoc networks," in IEEE Hawaii Int. Conf. on System Sciences, Jan. 2001.
[6] N. Patwari, A. O. Hero III, M. Perkins, N. Correal, and R. J. O'Dea, "Relative location estimation in wireless sensor networks," IEEE Trans. Sig. Proc., vol. 51, no. 8, pp. 2137–2148, 2003.
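The update equations referred to in Section 3.2 did not survive in this copy, so the following is only a generic sketch of a local majorization step for a weighted STRESS function, not the paper's equations (5) and (6). It assumes a cost of the form sum of w_ij (delta_ij - ||x_i - x_j||)^2 over pairs, omits the prior-knowledge penalty term and the adaptive neighbor selection, and uses hypothetical names throughout.

```python
import numpy as np

def stress(X, delta, W):
    """Weighted STRESS: sum over pairs i<j of w_ij * (delta_ij - ||x_i - x_j||)^2."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    iu = np.triu_indices(len(X), k=1)
    return float(np.sum(W[iu] * (delta[iu] - dist[iu]) ** 2))

def local_update(i, X, delta, W):
    """Minimise the quadratic majorizer of node i's local cost, neighbors held fixed."""
    num = np.zeros(X.shape[1])
    den = 0.0
    for j in range(len(X)):
        if j == i or W[i, j] == 0:
            continue                      # only nodes with positive weight are neighbors
        d = np.linalg.norm(X[i] - X[j])
        unit = (X[i] - X[j]) / d if d > 0 else np.zeros(X.shape[1])
        num += W[i, j] * (X[j] + delta[i, j] * unit)
        den += W[i, j]
    return num / den if den > 0 else X[i].copy()

def run(X0, delta, W, unknown, sweeps=50):
    """Sweep over the unknown-location nodes, updating one node at a time."""
    X = X0.copy()
    history = [stress(X, delta, W)]
    for _ in range(sweeps):
        for i in unknown:
            X[i] = local_update(i, X, delta, W)
        history.append(stress(X, delta, W))
    return X, history

# Toy network: four anchors on the unit square and two unknown-location nodes.
truth = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.3, 0.4], [0.7, 0.6]], dtype=float)
delta = np.linalg.norm(truth[:, None, :] - truth[None, :, :], axis=2)  # noise-free ranges
W = np.ones_like(delta) - np.eye(len(truth))                           # equal weights
rng = np.random.default_rng(0)
X0 = truth.copy()
X0[4:] += rng.normal(scale=0.2, size=(2, 2))                           # perturbed initial guess
X, history = run(X0, delta, W, unknown=[4, 5])
```

Because each node update minimises a function that lies above the local cost and touches it at the current estimate, the recorded STRESS values never increase, which matches the non-increasing property the text attributes to SMACOF.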

Approach to optimize threshold of ANN on imbalance datasets

LI Ming-fang, ZHANG Hua-xiang, ZHANG Wen, JI Hua
School of Information Science and Engineering, Shandong Normal University, Jinan 250014, China
E-mail: lmgc21713@
LI Ming-fang, ZHANG Hua-xiang, ZHANG Wen, et al. Approach to optimize threshold of ANN on imbalance datasets. Computer Engineering and Applications, 2010, 46(20): 168-171.

Abstract: The classification of imbalance datasets is a hot research area in the field of machine learning, and recently many researchers have proposed theories and algorithms to improve the performance of classical classification algorithms on imbalance datasets. One of the most important methods is adopting threshold selection criteria to determine the output threshold of an Artificial Neural Network (ANN). The commonly used threshold selection criteria have some drawbacks, such as failing to reach optimal classification performance on both the minority class and the majority class, or focusing only on the classification accuracy of the majority class. This paper proposes a new threshold selection criterion with which both the minority class and the majority class can reach optimal classification accuracies without being affected by the sample proportion. When the new threshold selection criterion is applied as a classifier evaluation criterion to classifiers trained using Artificial Neural Networks and genetic approaches, good results can be obtained.
Key words: imbalance datasets; threshold selection criterion; Artificial Neural Network (ANN); genetic method

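The Evolution of Mixed-Frequency Data Models

The development of mixed-frequency data models is one of the important directions of technological progress. From simple mixed models to multi-dimensional mixed data models, their history has been a continuous process that has met many of the information management needs of enterprises.

The main advances in the development of mixed-frequency data models are as follows:

1. Multi-dimensional Model: also called a fixed hierarchy or faceted model, the multi-dimensional model manages information as a whole by building a structured network of general levels, implicit levels, and process levels. It converts complex information into a multi-level structure so that data can be managed in a three-dimensional arrangement.

2. Deep Convolutional Neural Network: a deep convolutional neural network passes data through successively stacked layers to classify and refine it repeatedly, adding data dimensions so that the data becomes easier to manage.

3. Spatial Semantic Network: a spatial semantic network is a multi-dimensional data model that structures data at different levels while taking changes in the smallest information fragments into account. This makes mixed-frequency data easier to manage precisely and supports multi-level refinement of the data.

4. Hidden Model: a hidden model abstracts multi-dimensional data into an implicit model that adapts to different requirements and allows fine-grained adjustment of information, yielding an extensible mixed-frequency data model.

5. AI Technology Applications: AI technology makes information management more intelligent and closer to real time. The data model can be adjusted in real time according to specific needs, so that information is used and managed more fully and effectively and duplicated work is avoided.

6. Hybrid Data Representation: hybrid data representation combines multiple data sources that use different dimensions and representations, making mixed data easier to manage. It supports multi-level refinement and is extensible and easy to visualize.

These are the main advances in the development of mixed-frequency data models.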


Flexibility
scalable, flexible, but use statistical application model
DAWN Meeting Sept, 2006
© Rajive Bagrodia, 2006;
Simulation, Emulation or Physical ?
Physical testbed
Realistic, but not scalable, limited flexibility and controllability
flexible, transparent execution of application and protocol, but not scalable
Email, file, or content distribution server
Physical Testbed
Analytical Model
Backbone network
Simulation
Community mesh network
Emulation
WLAN with mobile hosts
Emulation
Fluidflow model(UMass), Bianchi-model (JSAC’03)
scalable, but use statistical application model, limited fidelity and flexibility
Scalability Analytical models
– 1~4 backlogged UDP sessions
– Data rate: 11 Mbps
backlogged UDP traffic
backlogged UDP traffic
Orthogonal channel
– Multi-hop MANET protocols
– Impact of transport subsystems on end-to-end performance
Scalability of Emulation Entity
• Metric
– Ratio of late packets, i.e., packets that cannot be delivered to the real application before the scheduled time (= time_of_transmission + network_delay)
• Observation
Fidelity of past wireless emulation tools below expectation NIST Net (NIST), EMPOWER (MSU), MobiEmu (HRL)
Simulation
ns-2(ISI), QualNet(SNT), GloMoSim(UCLA), OpNet
Challenges
• Achieving emulation fidelity
– Realistically model characteristics of wireless channel in real time at microseconds granularity
• Seamless integration of MAC and PHY models in the real protocol stack
Single channel
• Observation
– The ratio of late packets remains rather constant, thanks to the optimization that aggregates emulation events with close timestamps.
– Emulation fidelity remains rather constant as we increase the size of the emulated WLAN.
– Due to the small emulation overhead, the ratio of late packets remains very low, and its impact on application throughput is negligible.
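The late-packet metric used in these observations can be computed as in the sketch below. It is only an illustration under assumptions: the function name, the tuple layout, and the choice to count a packet as late only when it is delivered strictly after its scheduled time are all hypothetical.

```python
def late_packet_ratio(packets):
    """Fraction of packets delivered after time_of_transmission + network_delay.

    packets: iterable of (time_of_transmission, network_delay, delivery_time),
    all in the same unit (microseconds in this example).
    """
    packets = list(packets)
    if not packets:
        return 0.0
    # Assumption: delivery exactly at the scheduled time is not counted as late.
    late = sum(1 for sent, delay, delivered in packets if delivered > sent + delay)
    return late / len(packets)

trace = [
    (0, 100, 90),    # delivered early
    (10, 100, 110),  # delivered exactly on schedule
    (20, 100, 150),  # late
    (30, 100, 130),  # delivered exactly on schedule
]
ratio = late_packet_ratio(trace)
```

In this made-up trace one of the four packets misses its scheduled delivery time.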
Others
(802.11b…)
Cross Layer
Past studies
Fidelity
Roofnet (MIT), PlanetLab (Princeton etc), ORBIT (Rutgers), MVWT (CalTech)
Achievements
• Scaling Emulations
– Emulate multiple target nodes on a single host CPU
– Integrate detailed simulations into emulation testbeds
Cover the entire range by moving along the curve
Reality
Our approach
• A multi-paradigm framework that combines analytical models, simulation, emulation, and physical testbeds for heterogeneous networks and devices:
– Minimizing emulation overhead
– Support easy integration of new models
• Real time synchronization of simulation and emulation entities
– Accurately model the communications among emulated, simulated and physical hosts
Our Approach
Select and operate at any point in the scalability vs. reality curve Or,
Deployment
Simulation
Flexibility Scalability
Real code Emulation Physical Test-bed
Good
Bad
• Observation:
– The processing delay of the emulation kernel is within that of an actual wireless device (15~20 µs) with 94% probability
• Setup
– 1 laptop emulating 1~4 wireless hosts and 4 laptops each emulating one wireless host.
• Scenario 1: UDP sources listen to orthogonal channels
• Scenario 2: UDP sources listen to a single channel
1 ~ 9 hops
UDP traffic
Scalable Emulations for MANET (2)
• Metric
– Processing delay of the emulation kernel: the delay between the timestamp of the radio hardware interrupt and the actual delivery by the emulation framework to the network layer
Scaling Issues for Multi-Paradigm Network Modeling
Rajive L. Bagrodia Professor Computer Science Dept UCLA rajive@
Cross Layer Interactions
• 'Cross-layer interactions' are key to provisioning dynamic QoS among the voice, video, and data traffic of next generation wireless networks.
• Traditional approaches based on simulation or physical testbeds are unable to sufficiently capture the impact of cross-layer interactions on the performance of real applications and protocol stacks.
• Networked systems increasingly encompass heterogeneous networks.
• Need for a new generation of multi-paradigm evaluation approaches.