Large-Scale High-Precision Topic Modeling on Twitter

Shuang Yang (syang@), Alek Kolcz (ark@), Andy Schlaikjer (hazen@), Pankaj Gupta (pankaj@) -- Twitter, Inc.

KDD'14, August 24-27, 2014, New York, NY, USA. Copyright 2014 ACM 978-1-4503-2956-9/14/08. DOI: 10.1145/2623330.2623336.

ABSTRACT

We are interested in organizing a continuous stream of sparse and noisy texts, known as "tweets", in real time into an ontology of hundreds of topics with measurable and stringently high precision. This inference is performed over a full-scale stream of Twitter data, whose statistical distribution evolves rapidly over time. The implementation in an industrial setting with the potential of affecting and being visible to real users made it necessary to overcome a host of practical challenges. We present a spectrum of topic modeling techniques that contribute to a deployed system. These include non-topical tweet detection, automatic labeled data acquisition, evaluation with human computation, diagnostic and corrective learning and, most importantly, high-precision topic inference. The latter represents a novel two-stage training algorithm for tweet text classification and a close-loop inference mechanism for combining texts with additional sources of information. The resulting system achieves 93% precision at substantial overall coverage.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning

1. INTRODUCTION

Twitter [1] is a global, public, distributed and real-time social and information network in which users post short messages called "tweets". Users on Twitter follow other users to form a network such that a user receives all the tweets posted by the users he follows. Tweets are restricted to contain no more than 140 characters of text, including any links. This constraint fosters immense creativity, leading to many diverse types of styles and information carried in the tweets. As of early 2014, Twitter has more than 240 million monthly active users all over the world. These users send more than 500 million tweets every day, which corresponds to an average of 5,700 tweets per second, with spikes at up to 25 times of that velocity during special events [2].
Tweets generated from all over the world are expected to be about a variety of topics. Figuring out exactly which tweet is about which topic(s) is interesting to us at Twitter in our goal to serve our users better, as it enables personalization, discovery, targeted recommendations, organization of content, as well as aiding in studies and analyses to gain better insights into our platform as a whole. Such broad usefulness, and the potential of affecting and being visible to real users, however, imposes demanding precision requirements. In this paper, we consider the problem of high-precision topic modeling of tweets in real time as they are flowing through the network. This task raises a set of unique challenges given Twitter's scale and the short, noisy and ambiguous nature of tweets. We present how we address these challenges via a unique collection of techniques that we have employed in order to build a real-time high-precision tweet topic modeling system that is currently deployed in production inside Twitter. Our system achieves 93% precision on ~300 topics and 37% overall coverage on English tweets.

2. OVERVIEW & RELATED WORK

Given a tweet, and all the information associated with it (e.g., author, entities, URLs, engagements, context), we are interested in figuring out what topic(s) this tweet is about. Topic modeling tasks have been commonly approached with unsupervised clustering algorithms (e.g., k-means, pLSI and LDA [5]), information filtering approaches (e.g., language models [19]), or weakly supervised models (e.g., supervised LDA [5], labeled LDA [22]). These approaches are effective in grouping documents into a predefined number of coarse clusters based on inter-document similarity or the co-occurrence patterns of terms (with help from very mild supervision). They are also cheap to train, as no or very few labeled data are required. Nevertheless, they are not suitable for our use because it is very difficult to align the topic clusters produced by these approaches to a predefined ontology and perform low-latency topic inference with measurable and decent precision. In fact, none of these approaches could attain precision close to what we demand.

Instead, we primarily consider supervised approaches that classify tweets into our taxonomy with controllable high precision. Figure 1 illustrates the overview of our system.
On the training path, our "training data collector" constantly listens to the tweet stream to automatically acquire labeled training data, which, once accumulated to a certain amount, are fed into the "trainer" module to produce classification models; these models are then validated and calibrated to be used in our service. On the operation path, the classifier models are used by the "inference" module to provide high-precision topic annotation of tweets in real time. The inference results are used to generate and refine the user-and-entity knowledge, making this a close-loop inference. The results are also scribed and used to evaluate and monitor the quality performance with help from crowd-sourced human annotation. Once the service is exposed to the user, we also employ a "feedback collector" and a "diagnosis module" to gather high-precision user-curated data, which are fed back into the trainer module to fine-tune the trained models.

Related Works & Contributions. Topic modeling of tweets has been examined extensively in the literature. Most existing research has been based on un- or weakly supervised techniques such as variations of LDA [17, 27]. The inference in these approaches usually incurs considerable latency, and the results are not directly controllable when precision is concerned. Recent studies employ information filtering approaches to detect tweets on specific topics of interest [19, 9]. These approaches usually have low latency and can perform inference in real time; and when configured properly, the top-ranked results obtained by them can usually achieve a reasonable precision target. Nonetheless, because they identify only a small fraction of tweets that are apparently relevant to a given topic while disregarding the majority of the rest, the recall and coverage of these approaches are very poor, yielding heavily biased inference results that are less useful in many of our use cases. Information filtering approaches are also only good at specific topics, usually topics of relatively focused meanings (e.g., "San Francisco" and "NBA"), and not directly applicable to a broad spectrum of topics such as an ontology, as what we study here. To the best of our knowledge, this is the first published documentation of a deployed topic modeling system that infers topics of short noisy texts at high precision in real time.

3. TAXONOMY CONSTRUCTION

We create a taxonomy represented as a fixed (and potentially deep) ontology to encode the structure and knowledge about the topics of our interest. The goal is to find a topic structure that is as complete as possible, covering as many semantic concepts (and their relationships) as we can, while focusing on topics that are frequently discussed on Twitter. For example, by building topic filters based on ontology corpora (e.g., ODP, Wikipedia), we can estimate coverage of topics on Twitter and trim or merge topics that are too niche; unsupervised models such as LDA were also used to map clusters to a well-defined ontology (e.g., ODP, Freebase) in order to estimate coverage. This process has been iterated through multiple refinements and over different time periods to avoid bias. The current taxonomy consists of ~300 topics with a maximum of 6 levels, a fragment of which is shown in Figure 2.

[Figure 2: A fragment of the topic taxonomy. Top-level topics include Sports, Technology, Entertainment (with children such as Movie & TV, Performing Arts, Visual Arts, Music & Radio), Government & Politics, and Business & Finance; Movie & TV further splits into Action & Adventure, Horror, Romance, Sci-Fi & Fantasy, Drama, and Animation.]
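The taxonomy itself is a rooted tree of roughly 300 topic nodes with depth at most 6. A tiny, purely illustrative representation is sketched below; node names beyond those visible in Figure 2, and the helper methods, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TopicNode:
    name: str
    parent: Optional["TopicNode"] = None
    children: List["TopicNode"] = field(default_factory=list)

    def add(self, name: str) -> "TopicNode":
        child = TopicNode(name, parent=self)
        self.children.append(child)
        return child

    def depth(self) -> int:
        return 0 if self.parent is None else 1 + self.parent.depth()

# A fragment mirroring Figure 2; the full taxonomy has ~300 nodes and max depth 6.
root = TopicNode("Top")
sports = root.add("Sports")
entertainment = root.add("Entertainment")
movie_tv = entertainment.add("Movie & TV")
for genre in ["Action & Adventure", "Horror", "Romance", "Sci-Fi & Fantasy", "Drama", "Animation"]:
    movie_tv.add(genre)
```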
4. TWEET TOPIC MODELING: TEXT CLASSIFICATION

A tweet can be associated with multiple modalities of information such as text, named entities, users (the author and people who engaged with the tweet), etc. In this section, we consider topic classification of tweets based solely on their textual features, which will be combined with other sources of information for integrative topic inference in Section 5. High-precision tweet classification faces unique challenges because, unlike ordinary documents, tweets are much sparser and noisier, and the scale is daunting. The demanding precision criteria make this even more challenging. A tweet may belong to multiple topical categories at the same time, or be non-topical, which makes this a multi-label classification problem. To meet quality criteria, the participating classifier needs to be abstaining, i.e., providing topic labels only if the expected precision or confidence is sufficiently high.

4.1 Chatter detection

One key challenge for high-precision tweet classification is that a substantial proportion of tweets are non-topical [3], or at least their topics are not clear in the context of a single tweet. We use the term chatter to denote tweets that often primarily carry emotions, feelings or are otherwise related to personal status updates. While they are clearly an important part of user expression on Twitter, it is necessary to reject classifying these tweets (e.g., to save system latency and to control quality) and restrict the domain of interest to tweets that are not chatter. To that end, we build content filters to eliminate chatter tweets, as described in previous work [3].

4.2 Training data acquisition

Training high-quality tweet classification models requires high-quality labeled data. Use of human annotation is impractical here for a number of reasons. First, it is too expensive at the scale we consider (i.e., 300 topics, millions of tweets): given the sparseness (i.e., only ~10s of feature presences per tweet), any reliable estimation would require hundreds of thousands of labeled tweets and cost millions of dollars even with the cheapest crowd-sourced service in the market. Second, even human-assigned labels could be too noisy. A taxonomy like ours presents considerable cognitive load to human annotators; for example, when asked to select relevant topic labels (out of the 300 candidates) for a tweet, human raters tend to identify only a fraction of the true positive labels. While it is possible to obtain samples within each category, such samples are likely to be biased, and to the extent that some of these instances also belong to other categories, most of these memberships will remain unknown. Human-labeled data are also subject to expertise bias, because a lot of niche topics (e.g., "data mining") exceed the literacy of crowd-source workers, forcing them to respond with random-guess annotations. Because of these concerns, in our system, human-labeled data (with enhanced quality assurance) are only used in quality evaluation (Section 4.7). For model training, we devise scalable algorithms to collect labeled data automatically from unlabeled tweets.

The labeled data acquisition pipeline is illustrated in Figure 3. First, tweets are streamed through a set of topic priors to prefilter tweets that are weakly relevant to each topic. These topic priors are white-list rules that include the following (a minimal filtering sketch follows the list):

User-level priors: users who tweet predominantly about a single topic (e.g., @ESPN about "Sports").

Entity-level priors: unambiguous named entities and hashtags that are highly concentrated on one or a few topics (e.g., #NBA about "Basketball"); see Section 5.2.

URL-level priors: for tweets containing URLs, the URL string usually contains topic information about the webpage it links to (e.g., /nba/*.html).
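The following is a minimal, illustrative sketch of such white-list prior filters. The paper does not publish its implementation, so the rule contents, helper names, and the Tweet container here are assumptions; only the three rule types and the examples (@ESPN, #NBA, /nba/ URLs) come from the text above.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Tweet:                      # hypothetical container for the fields the rules need
    text: str
    author: str                   # e.g. "@ESPN"
    hashtags: list = field(default_factory=list)
    urls: list = field(default_factory=list)

# Example white-list rules; the real rule sets would be much larger and curated.
USER_PRIORS   = {"@ESPN": "Sports"}
ENTITY_PRIORS = {"#NBA": "Basketball"}
URL_PRIORS    = [(re.compile(r"/nba/.*\.html$"), "Basketball")]

def topic_priors(tweet: Tweet) -> set:
    """Return the set of topics a tweet is weakly associated with."""
    topics = set()
    if tweet.author in USER_PRIORS:
        topics.add(USER_PRIORS[tweet.author])
    for tag in tweet.hashtags:
        if tag in ENTITY_PRIORS:
            topics.add(ENTITY_PRIORS[tag])
    for url in tweet.urls:
        for pattern, topic in URL_PRIORS:
            if pattern.search(url):
                topics.add(topic)
    return topics

# Tweets with a non-empty prior set become weakly-labeled positives for those topics.
```

Data passing these filters is only weakly labeled; the co-training-based cleaning step described next removes much of the noise.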
We also leverage the social annotation functionality to extract topic priors. Many Twitter users represent authorities on a small set of topics by virtue of their profession, experience, and the fact that they tend to publish on those topics on/off Twitter. This is recognized by other Twitter users, who sometimes group users "known for" a particular topic using Twitter's List feature.

The data coming through the topic prior filters are only weakly relevant to each topic and thus very noisy. To obtain high-quality positive data for each topic, we employ a co-training [6] based data cleaning algorithm. In particular, we consider only those tweets containing URLs, and iteratively apply the following for each topic c:

1. train a webpage classifier to remove tweets whose associated URLs satisfy: p(c | URL) is below the ε1-percentile;

2. train a tweet classifier to remove URLs whose associated tweets satisfy: p(c | tweet) is below the ε2-percentile.

The key assumption here is that when a URL is embedded in a tweet, the webpage that the URL links to is highly likely to be on the same topics as the tweet. To make this procedure more scalable and less coupled with the final classifier models, we use Naive Bayes classifiers in this step.

[Figure 3: Training pipeline. User-level, URL-level, entity-level and social-annotation rules feed the pools of positive and negative training examples.]

Acquiring negative training data is also nontrivial. Because a large share of tweets concern a few popular topics on Twitter (e.g., "Sports", "Technology" and "News"), randomly sampled negatives are very noisy. One-vs-all splitting doesn't work well for topics that are intercorrelated (e.g., "Sports" and "Basketball"), because semantically equivalent documents can be presented to a learner as both positive and negative examples. To this end, we employ algorithms developed for learning from positive and unlabeled data (or PU-learning [12]) to acquire high-precision negative examples for each topic. In particular, we select tweets/webpages as negative instances for a particular topic only when the similarity scores with the centroid of that class are below the (1 - ε3)-percentile, i.e., via a Rocchio classifier [20].
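A compact sketch of the two cleaning ideas above, the co-training-style iteration over the tweet/webpage views and the Rocchio-style negative selection, is given below. It is an illustration under assumed interfaces (the score_page and score_tweet callables stand in for the Naive Bayes classifiers, and the percentile values are assumptions), not the production implementation.

```python
import numpy as np

def cotrain_clean(pairs, score_page, score_tweet, eps1=5.0, eps2=5.0, rounds=3):
    """Iteratively drop the least-topical (tweet, page) pairs for one topic.

    pairs       : list of (tweet_text, page_text) weakly labeled for the topic
    score_page  : callable(pages)  -> np.array of p(c | page), e.g. a Naive Bayes model
    score_tweet : callable(tweets) -> np.array of p(c | tweet)
    eps1, eps2  : percentile cut-offs (the paper's epsilon_1 and epsilon_2; values assumed)
    """
    for _ in range(rounds):
        # View 1: the webpage classifier prunes pairs whose page score is in the bottom eps1 percent.
        p = score_page([page for _, page in pairs])
        pairs = [pr for pr, s in zip(pairs, p) if s >= np.percentile(p, eps1)]
        # View 2: the tweet classifier prunes pairs whose tweet score is in the bottom eps2 percent.
        q = score_tweet([tweet for tweet, _ in pairs])
        pairs = [pr for pr, s in zip(pairs, q) if s >= np.percentile(q, eps2)]
    return pairs

def rocchio_negatives(X_unlabeled, X_positive, cut_percentile=5.0):
    """Rocchio-style negative selection: keep the unlabeled rows least similar to the
    positive-class centroid (the paper's (1 - epsilon_3)-percentile cut; value assumed)."""
    centroid = X_positive.mean(axis=0)
    sims = X_unlabeled @ centroid / (
        np.linalg.norm(X_unlabeled, axis=1) * np.linalg.norm(centroid) + 1e-12)
    return X_unlabeled[sims <= np.percentile(sims, cut_percentile)]
```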
4.3 Feature extraction

The effectiveness and efficiency of feature extraction are vitally important to the success of our system given our scale and quality requirements. Conventional unigram text features cannot meet our needs in both regards. First, tweets are short (bounded at 140 characters) and amount to merely ~7 unigram terms (including common stop-words) on average, which makes it very difficult for any machine learning model to infer topics with high confidence. Second, traditional implementations of unigram extraction can take considerable computational resources and dominate our system latency, especially for webpages, which could exceed 10K terms; this is problematic as we require real-time scoring. The latter factor is actually more critical and essentially prohibits us from using potentially more effective features such as n-grams, word2vec [21], or other time-consuming feature processing such as term pre-selection and stemming. Instead, in our system, we use the following two types of feature extraction for tweets and webpages, respectively:

Binary hashed byte-4-gram: We use a circular extractor that scans through the tweet text string [1] with a sliding window of size four: every four consecutive bytes (i.e., UTF-8 chars) are extracted and the binary value (a 4-byte integer) is hashed into the range of 0 to d - 1, where d is chosen to be a prime number to minimize collisions (e.g., d = 1,000,081 in our experiments). The occurrences of these hashed indices are used as binary features in tweet classifiers. This extractor (Byte4Gram) yields significantly denser feature vectors than unigrams, e.g., for a tweet of length l, it produces exactly l feature occurrences.

Hashed unigram frequency: Term unigrams are extracted and hashed to integers at an extremely fast speed [13], also with a feature size d of ~1 million. The term frequencies of the indices are then transformed to log-scale, i.e., log(1 + tf_w), and the feature vectors are normalized so that each webpage has the same l1-norm. This extractor (Unigram-logTF-norm) is extremely fast and does not require any string preprocessing, as Unicode conversion, lowercasing and word boundary detection are automatically taken care of. [2]

We found that these feature extractors achieve the best balance of effectiveness and efficiency in our system.

[1] Tweet texts are preprocessed at the time of scoring such that less meaningful strings such as URLs and @mentions are stripped off and all strings are converted to lower case.

[2] For webpages, Byte4Gram consumes far more extraction time than Unigram and could slow down our system in real-time classification, especially for very long documents.
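As an illustration, simple versions of the two extractors described above could look like the sketch below. The prime d, the circular window, the log-TF transform and the l1 normalization follow the description; the exact hash function used in production is not specified in the paper, so zlib.crc32 here is an assumption.

```python
import math
import re
import zlib
from collections import Counter

D = 1_000_081  # prime feature-space size, as in the paper

def preprocess(tweet: str) -> str:
    # Strip URLs and @mentions, then lowercase (footnote [1] above).
    tweet = re.sub(r"https?://\S+|@\w+", "", tweet)
    return tweet.lower().strip()

def byte4gram_features(tweet: str) -> set:
    """Binary hashed byte-4-grams: circular 4-byte sliding window, hashed into [0, D)."""
    data = preprocess(tweet).encode("utf-8")
    if len(data) < 4:
        return set()
    circular = data + data[:3]   # wrap around so a length-l text yields exactly l windows
    return {zlib.crc32(circular[i:i + 4]) % D for i in range(len(data))}

def unigram_logtf_features(page_text: str) -> dict:
    """Hashed unigram log-TF features, l1-normalized per document."""
    counts = Counter(zlib.crc32(w.encode("utf-8")) % D
                     for w in re.findall(r"\w+", page_text.lower()))
    logtf = {idx: math.log1p(tf) for idx, tf in counts.items()}
    norm = sum(logtf.values()) or 1.0
    return {idx: v / norm for idx, v in logtf.items()}
```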
4.4 Model pretraining

Both tweet and webpage topic inference can be cast naturally as multi-class multi-label classification problems, making regularized logistic regression a perfect fit. Given a set of n training instances x_i in R^d together with their corresponding labels y_i, a subset of {1, 2, ..., C}, where C is the total number of topics, we consider two types of LR models:

Multinomial logistic regression (MLR), also known as multinomial logit or softmax regression, which models the conditional topic distribution given an instance x as the following multinomial distribution:

    $p(c \in y \mid x) = \frac{\exp(w_c^\top x + b_c)}{\sum_{c'=1}^{C} \exp(w_{c'}^\top x + b_{c'})}, \quad \forall c = 1, \ldots, C.$

One-vs-all logistic regression (LR), which models "whether x belongs to a topic class c" as a Bernoulli variable:

    $p(c \in y \mid x) = \frac{1}{1 + \exp(-(w_c^\top x + b_c))}, \quad \forall c = 1, \ldots, C.$

Let l(w, b | x, y) be the log-likelihood of parameters (w, b) given example (x, y). To estimate the parameters (w, b), we minimize the negative log-likelihood plus a penalty term:

    $\min_{w,b} \; \lambda \Big( \alpha \|w\|_1 + \tfrac{1-\alpha}{2} \|w\|_2^2 \Big) - \frac{1}{n} \sum_{i=1}^{n} l(w, b \mid x_i, y_i), \qquad (1)$

where the parameter λ controls the strength of regularization. Here we consider the ElasticNet regularization, which is a hybrid of the l1 and l2 regularization types and includes both as special cases (when α takes 1 or 0). The non-differentiable nature of this regularization, when α > 0, enforces sparsity of the model w, which is valuable to us because (1) a more compact model consumes less memory, loads faster and has better latency in real-time scoring; and (2) less useful (e.g., redundant or noisy) features are automatically ruled out in training, which is important to us as we do not do attribute pre-pruning or stemming.

In our machine learning library, we provide a number of versatile model trainers on Hadoop, including distributed streaming training via SGD [18], and batch-mode training based on L-BFGS, conjugate gradient (CG) or coordinate descent (CD) algorithms. We also provide full regularization path sweeping for l1 and ElasticNet regularization.

The major difference between the two LR models lies in the softmax function MLR employs to normalize the posterior scores p(y | x). There are both pros and cons to doing so. On the one hand, MLR provides comparable topic scores, which enables us to apply max-a-posteriori inference to increase topic coverage. As topics are competing against one another, the inference results of MLR are less ambiguous (e.g., if topic c fires for a tweet x, a topic c' distinct from c is less likely to fire for x due to the "explaining-away" effect). On the other hand, MLR requires the label space to be exhaustive (i.e., a tweet must belong to at least one of the topics), and it discourages usage of examples with missing labels, which is not perfectly compatible with our setting. Also, because topics are coupled with one another in MLR, it is hard to train models for multiple topics in parallel, or to retrain the model for a subset of topics without affecting others.

Experiment Results. We use AUC (i.e., area under the ROC curve) for model validation for its independence of the choice of decision threshold; evaluation of calibrated models at their desired thresholds is reported in Section 4.9. We test models on a held-out subset of the "training data acquisition" output data, and estimate AUC with confidence intervals using standard Bootstrap. Because the standard errors are very small, we report here only the AUC scores.

Historically, our classifiers were trained in streaming mode on Hadoop [18] via stochastic optimization, such as Pegasos [26]. Streaming training consumes a constant amount of memory regardless of the scale of the training data, and it is very fast when the data is scanned through only once. However, we noticed that the models generated by streaming training are significantly worse than batch-mode trained ones (e.g., via L-BFGS, CD or CG). For example, on tweet classification (with the Byte4Gram feature), Pegasos-trained models, with 1 scan over the data, are 1.9% worse on average than the batch-mode trained model, as seen in Table 1. Increasing the number of iterations helps to make the gap smaller, but at the same time also makes the training time much longer (e.g., overhead in data I/O and communication). As we note in Section 4.10, we were able to scale up batch-mode training on massive data with distributed optimization. Hence, in the following experiments, the models are all trained in batch mode unless noted explicitly.

Table 1: Streaming training (e.g., Pegasos) generates models with lower AUCs than batch-mode training. Increasing the number of iterations (i.e., number of scans over the data) slowly improves the AUCs but makes the training time much longer.

    Method      #iterations   Avg AUC    Training time
    Batch       NA            baseline   baseline
    Streaming   1             -1.9%      -62%
    Streaming   3             -1.2%      +57%
    Streaming   10            -1.0%      +663%

Besides the advantages we mentioned previously, we found that 1-vs-all LR also performs better than MLR on both tweet and webpage classification. The results are summarized in Table 2.

Table 2: Comparison of MLR and LR 1-vs-all classifiers, in terms of average AUC across topics and the % of topics with improved AUCs (% topic win).

    Model         Avg AUC    % topic win
    Tweet MLR     baseline   baseline
    Tweet LR      +3.9%      86%
    Webpage MLR   baseline   baseline
    Webpage LR    +1.7%      77%

Finally, using ElasticNet with regularization path sweeping (i.e., optimized λ) further improves average AUC by 0.75-0.98%, as shown in Table 4. Due to the sparse nature of tweets, we found that relatively dense models have slightly better AUCs: models that are too sparse tend to generalize poorly on Byte4Grams that are not seen in training data. In our experiments, we found α ≈ 0.05 provides the best trade-off between test-set AUC and training speed (smaller α tends to slow down the regularization sweeping).
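For concreteness, here is a small sketch of the one-vs-all ElasticNet-regularized logistic regression setup described above. scikit-learn is used only as a stand-in for the paper's in-house Hadoop trainers; the library choice, the toy data, the solver and the C value are assumptions, while the model form, the 1-vs-all scheme and the α ≈ 0.05 mixing ratio come from the text.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# X: sparse binary Byte4Gram matrix (n_tweets x D); Y: multi-label topic indicator matrix.
rng = np.random.default_rng(0)
X = csr_matrix((rng.random((200, 1000)) < 0.01).astype(np.float64))  # toy stand-in data
Y = rng.integers(0, 2, size=(200, 3))                                # 3 toy topics

# One-vs-all logistic regression with an ElasticNet penalty, cf. Eq. (1):
# l1_ratio plays the role of alpha; C is the inverse of lambda.
clf = OneVsRestClassifier(
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.05,   # alpha ~ 0.05, the trade-off reported in the paper
                       C=10.0,          # assumed regularization strength
                       max_iter=500))
clf.fit(X, Y)

# Per-topic posterior scores p(c in y | x); decision thresholds are calibrated separately (Section 4.6).
scores = np.column_stack([est.predict_proba(X)[:, 1] for est in clf.estimators_])
```

Note that the paper found batch solvers (L-BFGS, CG, CD) preferable to pure streaming SGD; the saga solver is used here only because it exposes the ElasticNet penalty directly.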
4.5 Relational regularization

The ontological relations among topics, as defined by the hierarchical tree structure of our taxonomy, can be used to make our models smarter in learning classifiers for conceptually related topics. We consider three approaches here:

Label expansion. This approach regularizes the learner by applying ontology-based label propagation to the training data, i.e., via: (1) ancestor inclusion: if x belongs to topic c, then it should also belong to topic $c'$ if $c'$ is an ancestor of c; and (2) offspring exclusion: if x belongs to c, then it should not be used as a negative example of $c'$ in 1-vs-all splitting unless labeled so explicitly, where $c'$ is an offspring of c. This method enables relational shrinkage via data sharing.

Cost-sensitive learning. The structure of the taxonomy indicates that we should penalize mistakes involving different topic pairs differently; e.g., misclassifying a "movie" tweet as "entertainment" should receive less cost than misclassifying it as "sports". This can be done by encoding the ontological relations into a cost matrix E, where the $(c, c')$-th element, $e_{cc'}$, represents the cost of misclassifying an instance of topic c into topic $c'$, and optimizing the regularized expected cost $E[e(y, \hat{y}) \mid x] = \sum_{c=1}^{k} e_{yc}\, p(\hat{y}=c \mid x)$, where we use as $e_{cc'}$ the tree-distance between c and $c'$ in the taxonomy.

Hierarchical regularization [28, 15]. We can encode the hierarchical dependency among topics into the regularization so that the model of a topic c is shrunk towards that of its parent node parent(c), e.g., by adding a penalty term, $\frac{1}{2}\eta \sum_{c=1}^{k} \|w_c - w_{\mathrm{parent}(c)}\|_2^2$, to Eq. (1).

These approaches, while tackling relational regularization from three different angles (i.e., data sharing, relational objective, parameter sharing), usually achieve equivalent effects. Note that "cost-sensitive optimization" is more versatile, as it can also handle label dependencies that are discovered from data (e.g., topic correlations) rather than given as prior knowledge.

Experiment Results. In Table 5, we compare LR without relational regularization (denoted "Flat") against the three approaches described above. Relational regularization significantly improves model quality, e.g., a boost of over 2% in average AUC. The differences among the three methods are nevertheless very small. For simplicity of implementation, hereafter we primarily use label expansion.
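As an illustration of the label expansion rules, here is a minimal sketch on a toy taxonomy; the tree, topic names, and data structures are hypothetical and are not our production taxonomy or code.

```python
# Hypothetical sketch of ontology-based label expansion for 1-vs-all splitting:
# (1) ancestor inclusion: a positive example of a topic is also a positive
#     example of all of that topic's ancestors;
# (2) offspring exclusion: an example labeled only with an ancestor of the
#     target topic is withheld from the target's negative pool, since it may
#     in fact be about the target topic.

# Toy taxonomy as child -> parent edges (illustrative only).
PARENT = {
    "movies": "entertainment",
    "music": "entertainment",
    "basketball": "sports",
    "entertainment": None,
    "sports": None,
}

def ancestors(topic):
    out = []
    p = PARENT.get(topic)
    while p is not None:
        out.append(p)
        p = PARENT.get(p)
    return out

def expand_labels(labels):
    """Ancestor inclusion: close the label set upward in the taxonomy."""
    expanded = set(labels)
    for t in labels:
        expanded.update(ancestors(t))
    return expanded

def one_vs_all_split(dataset, target_topic):
    """Build (positives, negatives) for `target_topic` with offspring exclusion."""
    positives, negatives = [], []
    anc = set(ancestors(target_topic))
    for x, labels in dataset:
        labels = expand_labels(labels)
        if target_topic in labels:
            positives.append(x)
        elif labels & anc:
            continue  # offspring exclusion: skip, do not count as a negative
        else:
            negatives.append(x)
    return positives, negatives

dataset = [
    ("tweet about a new blockbuster", {"movies"}),
    ("tweet about celebrity gossip", {"entertainment"}),
    ("tweet about a playoff game", {"basketball"}),
]
pos, neg = one_vs_all_split(dataset, "movies")
# pos -> the "movies" tweet; the "entertainment" tweet is withheld from neg;
# neg -> only the "basketball" tweet.
print(pos, neg)
```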
4.6 Model calibration

LR models return soft probabilistic scores $p(y \mid x)$ in the range [0, 1]. In order to apply them to label the topics of tweets/webpages, we need to calibrate the models towards the specific quality criteria required, i.e., a precision target in our case. There are two key questions here: (1) "over what distribution should precision be measured?" and (2) "how to find the optimal decision threshold?". For precision measurement, stratified sampling in the range of posterior scores produced by a classifier has been advocated as a way to reduce the variance of precision estimates [4]. While this method is very effective once the decision threshold is set, it is not straightforward to use when one seeks to determine the optimal thresholds.

Given a topic c, a small seed set of positive examples $P_c$, an unlimited stream of unlabeled data $U_c$ and a labeling oracle O, we use the following method to tune the threshold $\theta_c$:

Initial threshold. Use $P_c$ to estimate a rough lower bound $\theta^l_c$ for $\theta_c$ by using a much lower precision target.

Stratified sampling. Apply the model to the output distribution and divide the interval $[\theta^l_c, 1]$ into equal-percentile bins (the bin boundaries can be defined based on the percentiles of the posterior scores of $P_c$ or $U_c$). Apply stratified sampling to $U_c$ according to these bins and send the samples to O for confirmation labeling (Section 4.7), with the results denoted $L_c$.

Threshold estimation. Let $\theta_j$ represent the posterior score of bin j, and estimate the precision $\pi_j$ of each bin j based on $L_c$. The optimal threshold is the minimum breakpoint at which the calibrated precision exceeds the precision target $\tilde{\pi}$, i.e.,

$$\theta_c = \arg\min_{\theta} \theta \quad \text{s.t.} \quad \frac{\sum_{j:\, \theta_j \ge \theta} s_j \pi_j}{\sum_{j:\, \theta_j \ge \theta} s_j} \ge \tilde{\pi},$$

where $s_j$ represents the size or prevalence of examples in bin j, which can be estimated using $U_c$.

Note that, besides the above one-sided calibration approach, a two-sided approach is also possible, i.e., by initially narrowing down the range with both upper and lower bounds and applying golden-section search to find the optimal $\theta_c$ by progressively dividing bins (rather than using fixed-size bins) and sub-stratified sampling. Although this two-sided method can produce threshold estimates with arbitrary resolution, we found in our experiments that the one-sided approach performs well enough and is much simpler to implement.
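For concreteness, the threshold-estimation step can be sketched as below; the bin breakpoints, precisions, and sizes are illustrative inputs (in practice $\pi_j$ comes from the confirmation labels $L_c$ and $s_j$ from the unlabeled stream $U_c$).

```python
# Hypothetical sketch of one-sided threshold estimation.
# bins: list of (theta_j, pi_j, s_j) = (bin score breakpoint, labeled precision
# estimate, bin size/prevalence).

def estimate_threshold(bins, precision_target):
    """Return the smallest bin breakpoint theta such that the size-weighted
    precision of all bins with theta_j >= theta meets the precision target."""
    bins = sorted(bins, key=lambda b: b[0], reverse=True)
    best = None
    weighted_correct, total = 0.0, 0.0
    for theta_j, pi_j, s_j in bins:
        weighted_correct += s_j * pi_j
        total += s_j
        if weighted_correct / total >= precision_target:
            best = theta_j  # lowest breakpoint (so far) still meeting the target
    return best  # None if even the top bin misses the target

# Illustrative bins over [theta_l, 1]: (breakpoint, precision, size).
bins = [(0.95, 0.98, 120), (0.90, 0.96, 150), (0.85, 0.91, 200),
        (0.80, 0.86, 260), (0.75, 0.78, 300)]
print(estimate_threshold(bins, precision_target=0.90))  # -> 0.80 with these toy numbers
```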
4.7 Quality evaluation

As we strive to achieve 90% or higher precision, high-quality gold-standard data and accurate evaluation are critical to assess whether we achieve that goal. We use crowdsourced human-labeled data for quality evaluation. As we previously discussed in Section 4.2, human annotation is less accurate in a multi-label setting due to the cognitive overload caused by any nontrivial taxonomy. We instead ask for confirmation labeling. That is, rather than presenting the whole taxonomy to crowdsource workers and asking them to select whatever labels are relevant to a given tweet, we present a (tweet, topic) pair and ask them to provide a binary answer to "whether or not the supplied topic is correct for the given tweet". As our primary goal is to estimate precision, binary judgments on a sample of our system's output are sufficient for our purposes. In principle, this protocol would require many more questions (i.e., labeling tasks), and in turn incur more cost, if we were to estimate recall; e.g., for a given tweet and a 300-topic taxonomy, we would need 300 labeling tasks to know the ground-truth topics of the tweet, compared to one task in the multi-label setting. Nevertheless, for precision assessment, this protocol has better quality assurance and is more economical: because binary tasks are much easier, we found crowdsource workers are more willing to take them at a much lower price and, more importantly, are less likely to provide mistaken judgments.

To effectively control cost, we assign each (tweet, topic) pair sequentially to multiple workers; once the standard error of the responses falls below a certain threshold, the label is considered final and the task is frozen without being assigned further. The quality of responses varies from worker to worker. To ensure quality, when assigning tasks we also incorporate a small set of randomly-sampled probe tasks for which we already know the true answers with high confidence. Workers whose response accuracy consistently falls below our requirement are not admitted for future tasks, and their responses in the current batch are ignored.

Recall estimation faces challenges due to the nature of our labeling protocol, the need for an unbiased corpus, and the fact that the distribution of our data evolves over time; this is the subject of another paper [7]. Instead, we primarily use precision and coverage for quality evaluation.

4.8 Diagnosis and corrective learning

Once the system is deployed in production to label the topics of tweets, it is important to be able to identify cases where the model fails, and to provide diagnostic insights as well as corrective actions to fix it on the fly. To this end, we instrumented a mechanism that associates topic labels as tags with tweets and exposes them to users. When a user scrolls over a topic tag, two buttons show up that allow the user to provide "right or wrong" feedback about the topic tag. Clicking on a topic tag also takes the user to a channel consisting of tweets all about that topic, which is useful for diagnosing a topic model as tweets flow through. The UI of this feature is shown in the top chart of Figure 5. This feedback collection module makes it easy to exploit the wisdom of the crowd in an ad hoc way to receive instantaneous feedback on the performance of our topic models, and provides opportunities to identify patterns for correction.

Once failing cases or trouble areas are identified, another tool is used to visualize which parts of the tweet contributed most strongly to the offending decision, as shown in the bottom charts of Figure 5. This allows the modeler to manually zoom into the model and identify potential overfitting patterns. With all the diagnostic patterns, we then employ corrective learning [25, 23] to correct the mistakes of an existing topic classifier, either manually or automatically, on the fly. Corrective learning is used because it is desirable to adjust models gradually using a relatively small
