Smoothing methods in maximum entropy language modeling


Maximum Entropy Models and Natural Language Processing (MaxEntModelNLP), 94-slide PPT document

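... is independent of the specific content of Y and depends only on |Y|.
• What is the expressive capacity of two Y's (that is, y1y2)?
• y1 can express three cases and y2 can express three cases. Placed side by side, they give 3*3 = 9 cases in total (multiplication principle). Therefore:
$H(y_1) + H(y_2) = H(Y) + H(Y) = H(Y \times Y)$
Note: $|Y \times Y| = |Y| \times |Y|$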
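The counting argument above can be checked numerically. The sketch below assumes a uniform distribution over a three-valued set Y (the labels "a", "b", "c" are placeholders, not from the slides) and compares the entropy of one variable with the entropy of a pair.

```python
import math
from itertools import product

def uniform_entropy(outcomes):
    """Entropy in bits of a uniform distribution over `outcomes`."""
    n = len(outcomes)
    return -sum((1 / n) * math.log2(1 / n) for _ in outcomes)

Y = ["a", "b", "c"]            # placeholder labels; only |Y| = 3 matters
pairs = list(product(Y, Y))    # all (y1, y2) combinations, i.e. Y x Y

h_single = uniform_entropy(Y)
h_pair = uniform_entropy(pairs)
print(len(pairs), h_single, h_pair)
```

The pair has 9 outcomes and its entropy is twice that of a single variable, which is the additivity stated above.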
Coin weighing (cont.)
Coin weighing, version 2
From "Data Structures": this is a Huffman coding problem.
[Figure: decision tree over coins 1 to 5 with probabilities 1/3, 1/3, 1/9, 1/9, 1/9; the internal node labeled "3?5" carries probability 1/3.]
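The Huffman view of the coin problem can be sketched in a few lines. The probabilities are the ones on the slide; everything else is an assumption made for illustration: each weighing is taken to have three outcomes (left lighter, balanced, right lighter), the fake coin is assumed to be known lighter, and the weighing depths 1, 1, 2, 2, 2 correspond to weighing coin 1 against coin 2 first and then coin 3 against coin 5.

```python
import math

probs = [1/3, 1/3, 1/9, 1/9, 1/9]   # probability that coin 1..5 is the fake one
depths = [1, 1, 2, 2, 2]            # assumed number of weighings to identify each coin

# Lower bound: entropy measured in base 3, because a balance has three outcomes.
entropy_base3 = -sum(p * math.log(p, 3) for p in probs)

# Expected number of weighings for the assumed ternary decision tree.
expected_weighings = sum(p * d for p, d in zip(probs, depths))

# Kraft sum equal to 1 means the ternary tree is complete (no wasted outcomes).
kraft_sum = sum(3.0 ** -d for d in depths)

print(entropy_base3, expected_weighings, kraft_sum)
```

Because every probability is a power of 1/3, the expected depth of the tree meets the entropy bound exactly.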
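Given:
$p(x_1) + p(x_2) = 1$
$\sum_{i=1}^{4} p(y_i) = 1$
The word "学习" (xuexi, "study") may be a verb or a noun. It can be tagged as subject, predicate, object, attributive, ...
"学习" is very unlikely to be tagged as an attributive, with probability only 0.05: $p(y_4) = 0.05$
When "学习" is tagged as a verb, the probability that it is tagged as a predicate is 0.95.
Introducing this new knowledge: $p(y_2 \mid x_1) = 0.95$
Find: y4
…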
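NLP and stochastic processes
Each y_i may take several values. What is the probability that y_i is tagged as a? A stochastic process is a sequence of random variables.
Each tag is conditioned on the input and on all earlier tags:
$p(y_1 = a \mid x_1 x_2 \ldots x_n)$
$p(y_2 = a \mid x_1 x_2 \ldots x_n\, y_1)$
$p(y_3 = a \mid x_1 x_2 \ldots x_n\, y_1 y_2)$
$p(y_4 = a \mid x_1 x_2 \ldots x_n\, y_1 y_2 y_3)$
…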

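Common expressions in the abstracts of SCI papers

To write a good abstract, build a library of sentence patterns that suits your needs, with vocabulary drawn from highly cited SCI papers.

Introduction section
(1) Reviewing the research background. Common words: review, summarize, present, outline, describe.
(2) Stating the purpose of the paper. Common words: purpose, attempt, aim. An infinitive can also serve as an adverbial of purpose.
(3) Introducing the main content or scope of the paper. Common words: study, present, include, focus, emphasize, emphasis, attention.

Methods section
(1) Describing the research or experimental process. Common words: test, study, investigate, examine, experiment, discuss, consider, analyze, analysis.
(2) Explaining the research or experimental method. Common words: measure, estimate, calculate.
(3) Introducing applications and uses. Common words: use, apply, application.

Results section
(1) Presenting the results. Common words: show, result, present.
(2) Stating conclusions. Common words: summary, introduce, conclude.

Discussion section
(1) Stating the paper's argument and the authors' views. Common words: suggest, report, present, expect, describe.
(2) Giving supporting evidence. Common words: support, provide, indicate, identify, find, demonstrate, confirm, clarify.
(3) Making recommendations. Common words: suggest, suggestion, recommend, recommendation, propose, necessity, necessary, expect.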

Abstract, introduction section example (reviewing the research background), keyword: review
• Author(s): ROBINSON, TE; BERRIDGE, KC
• Title: THE NEURAL BASIS OF DRUG CRAVING - AN INCENTIVE-SENSITIZATION THEORY OF ADDICTION
• Source: BRAIN RESEARCH REVIEWS, 18 (3): 247-291 SEP-DEC 1993 (Netherlands). SCI citations: 1774
We review evidence for this view of addiction and discuss its implications for understanding the psychology and neurobiology of addiction.

Highly cited SCI abstract, introduction section example (reviewing the research background), keyword: summarize
• Author(s): Barnett, RM; Carone, CD;
• Title: Particles and field .1. Review of particle physics
• Source: PHYSICAL REVIEW D, 54 (1): 1-+ Part 1 JUL 1 1996 (USA). SCI citations: 1571
• Abstract: This biennial review summarizes much of Particle Physics. Using data from previous editions, plus 1900 new measurements from 700 papers, we list, evaluate, and average measured properties of gauge bosons, leptons, quarks, mesons, and baryons. We also summarize searches for hypothetical particles such as Higgs bosons, heavy neutrinos, and supersymmetric particles. All the particle properties and search limits are listed in Summary Tables. We also give numerous tables, figures, formulae, and reviews of topics such as the Standard Model, particle detectors, probability, and statistics. A booklet is available containing the Summary Tables and abbreviated versions of some of the other sections of this full Review.

SCI abstract, introduction section example, keyword: attention

SCI abstract, methods section example, keyword: consider

Highly cited SCI abstract, introduction section example (reviewing the research background), keyword: outline
• Author(s): TIERNEY, L
• Title: MARKOV-CHAINS FOR EXPLORING POSTERIOR DISTRIBUTIONS
• Source: ANNALS OF STATISTICS, 22 (4): 1701-1728 DEC 1994 (USA). SCI citations: 728
• Abstract: Several Markov chain methods are available for sampling from a posterior distribution. Two important examples are the Gibbs sampler and the Metropolis algorithm. In addition, several strategies are available for constructing hybrid algorithms. This paper outlines some of the basic methods and strategies and discusses some related theoretical and practical issues.
On the theoretical side, results from the theory of general state space Markov chains can be used to obtain convergence rates, laws of large numbers and central limit theorems for estimates obtained from Markov chain methods. These theoretical results can be used to guide the construction of more efficient algorithms. For the practical use of Markov chain methods, standard simulation methodology provides several variance reduction techniques and also gives guidance on the choice of sample size and allocation.

Highly cited SCI abstract, introduction section example (reviewing the research background), keyword: present
• Author(s): LYNCH, M; MILLIGAN, BG
• Title: ANALYSIS OF POPULATION GENETIC-STRUCTURE WITH RAPD MARKERS
• Source: MOLECULAR ECOLOGY, 3 (2): 91-99 APR 1994 (UK). SCI citations: 661
• Abstract: Recent advances in the application of the polymerase chain reaction make it possible to score individuals at a large number of loci. The RAPD (random amplified polymorphic DNA) method is one such technique that has attracted widespread interest. The analysis of population structure with RAPD data is hampered by the lack of complete genotypic information resulting from dominance, since this enhances the sampling variance associated with single loci as well as induces bias in parameter estimation. We present estimators for several population-genetic parameters (gene and genotype frequencies, within- and between-population heterozygosities, degree of inbreeding and population subdivision, and degree of individual relatedness) along with expressions for their sampling variances. Although completely unbiased estimators do not appear to be possible with RAPDs, several steps are suggested that will insure that the bias in parameter estimates is negligible. To achieve the same degree of statistical power, on the order of 2 to 10 times more individuals need to be sampled per locus when dominant markers are relied upon, as compared to codominant (RFLP, isozyme) markers.
Moreover, to avoid bias in parameter estimation, the marker alleles for most of these loci should be in relatively low frequency. Due to the need for pruning loci with low-frequency null alleles, more loci also need to be sampled with RAPDs than with more conventional markers, and some problems of bias cannot be completely eliminated.

Highly cited SCI abstract, introduction section example (reviewing the research background), keyword: describe
• Author(s): CLONINGER, CR; SVRAKIC, DM; PRZYBECK, TR
• Title: A PSYCHOBIOLOGICAL MODEL OF TEMPERAMENT AND CHARACTER
• Source: ARCHIVES OF GENERAL PSYCHIATRY, 50 (12): 975-990 DEC 1993 (USA). SCI citations: 926
• Abstract: In this study, we describe a psychobiological model of the structure and development of personality that accounts for dimensions of both temperament and character. Previous research has confirmed four dimensions of temperament: novelty seeking, harm avoidance, reward dependence, and persistence, which are independently heritable, manifest early in life, and involve preconceptual biases in perceptual memory and habit formation. For the first time, we describe three dimensions of character that mature in adulthood and influence personal and social effectiveness by insight learning about self-concepts. Self-concepts vary according to the extent to which a person identifies the self as (1) an autonomous individual, (2) an integral part of humanity, and (3) an integral part of the universe as a whole. Each aspect of self-concept corresponds to one of three character dimensions called self-directedness, cooperativeness, and self-transcendence, respectively. We also describe the conceptual background and development of a self-report measure of these dimensions, the Temperament and Character Inventory. Data on 300 individuals from the general population support the reliability and structure of these seven personality dimensions.
We discuss the implications for studies of information processing, inheritance, development, diagnosis, and treatment.

Abstract, introduction section examples
• (2) Stating the purpose of the paper. Common words: purpose, attempt, aim.

Highly cited SCI abstract, introduction section example (stating the purpose), keyword: attempt
• Author(s): Donoho, DL; Johnstone, IM
• Title: Adapting to unknown smoothness via wavelet shrinkage
• Source: JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 90 (432): 1200-1224 DEC 1995. SCI citations: 429
• Abstract: We attempt to recover a function of unknown smoothness from noisy sampled data. We introduce a procedure, SureShrink, that suppresses noise by thresholding the empirical wavelet coefficients. The thresholding is adaptive: A threshold level is assigned to each dyadic resolution level by the principle of minimizing the Stein unbiased estimate of risk (Sure) for threshold estimates. The computational effort of the overall procedure is order N.log(N) as a function of the sample size N. SureShrink is smoothness adaptive: If the unknown function contains jumps, then the reconstruction (essentially) does also; if the unknown function has a smooth piece, then the reconstruction is (essentially) as smooth as the mother wavelet will allow. The procedure is in a sense optimally smoothness adaptive: It is near minimax simultaneously over a whole interval of the Besov scale; the size of this interval depends on the choice of mother wavelet. We know from a previous paper by the authors that traditional smoothing methods (kernels, splines, and orthogonal series estimates), even with optimal choices of the smoothing parameter, would be unable to perform in a near-minimax way over many spaces in the Besov scale. Examples of SureShrink are given.
The advantages of the method are particularly evident when the underlying function has jump discontinuities on a smooth background.

Highly cited SCI abstract, introduction section example (stating the purpose), keyword: To investigate
• Author(s): OLTVAI, ZN; MILLIMAN, CL; KORSMEYER, SJ
• Title: BCL-2 HETERODIMERIZES IN-VIVO WITH A CONSERVED HOMOLOG, BAX, THAT ACCELERATES PROGRAMMED CELL-DEATH
• Source: CELL, 74 (4): 609-619 AUG 27 1993. SCI citations: 3233
• Abstract: Bcl-2 protein is able to repress a number of apoptotic death programs. To investigate the mechanism of Bcl-2's effect, we examined whether Bcl-2 interacted with other proteins. We identified an associated 21 kd protein partner, Bax, that has extensive amino acid homology with Bcl-2, focused within highly conserved domains I and II. Bax is encoded by six exons and demonstrates a complex pattern of alternative RNA splicing that predicts a 21 kd membrane (alpha) and two forms of cytosolic protein (beta and gamma). Bax homodimerizes and forms heterodimers with Bcl-2 in vivo. Overexpressed Bax accelerates apoptotic death induced by cytokine deprivation in an IL-3-dependent cell line. Overexpressed Bax also counters the death repressor activity of Bcl-2. These data suggest a model in which the ratio of Bcl-2 to Bax determines survival or death following an apoptotic stimulus.

Highly cited SCI abstract, introduction section example (stating the purpose), keyword: purposes
• Author(s): ROGERS, FJ; IGLESIAS, CA
• Title: RADIATIVE ATOMIC ROSSELAND MEAN OPACITY TABLES
• Source: ASTROPHYSICAL JOURNAL SUPPLEMENT SERIES, 79 (2): 507-568 APR 1992 (USA). SCI citations: 512
• Abstract: For more than two decades the astrophysics community has depended on opacity tables produced at Los Alamos. In the present work we offer new radiative Rosseland mean opacity tables calculated with the OPAL code developed independently at LLNL. We give extensive results for the recent Anders-Grevesse mixture which allow accurate interpolation in temperature, density, hydrogen mass fraction, as well as metal mass fraction. The tables are organized differently from previous work.
Instead of rows and columns of constant temperature and density, we use temperature and follow tracks of constant R, where R = density/(temperature)^3. The range of R and temperature are such as to cover typical stellar conditions from the interior through the envelope and the hotter atmospheres. Cool atmospheres are not considered since photoabsorption by molecules is neglected. Only radiative processes are taken into account so that electron conduction is not included. For comparison purposes we present some opacity tables for the Ross-Aller and Cox-Tabor metal abundances. Although in many regions the OPAL opacities are similar to previous work, large differences are reported. For example, factors of 2-3 opacity enhancements are found in stellar envelope conditions.

Highly cited SCI abstract, introduction section example (stating the purpose), keyword: aim
• Author(s): EDVARDSSON, B; ANDERSEN, J; GUSTAFSSON, B; LAMBERT, DL; NISSEN, PE; TOMKIN, J
• Title: THE CHEMICAL EVOLUTION OF THE GALACTIC DISK .1. ANALYSIS AND RESULTS
• Source: ASTRONOMY AND ASTROPHYSICS, 275 (1): 101-152 AUG 1993. SCI citations: 934
• Abstract: With the aim to provide observational constraints on the evolution of the galactic disk, we have derived abundances of O, Na, Mg, Al, Si, Ca, Ti, Fe, Ni, Y, Zr, Ba and Nd, as well as individual photometric ages, for 189 nearby field F and G disk dwarfs. The galactic orbital properties of all stars have been derived from accurate kinematic data, enabling estimates to be made of the distances from the galactic center of the stars' birthplaces.
[Structured abstract]
• Our extensive high resolution, high S/N, spectroscopic observations of carefully selected northern and southern stars provide accurate equivalent widths of up to 86 unblended absorption lines per star between 5000 and 9000 angstrom. The abundance analysis was made with greatly improved theoretical LTE model atmospheres. Through the inclusion of a great number of iron-peak element absorption lines the model fluxes reproduce the observed UV and visual fluxes with good accuracy.
A new theoretical calibration of T(eff) as a function of Stromgren b - y for solar-type dwarfs has been established. The new models and T(eff) scale are shown to yield good agreement between photometric and spectroscopic measurements of effective temperatures and surface gravities, but the photometrically derived very high overall metallicities for the most metal rich stars are not supported by the spectroscopic analysis of weak spectral lines.

• Author(s): PAYNE, MC; TETER, MP; ALLAN, DC; ARIAS, TA; JOANNOPOULOS, JD
• Title: ITERATIVE MINIMIZATION TECHNIQUES FOR ABINITIO TOTAL-ENERGY CALCULATIONS - MOLECULAR-DYNAMICS AND CONJUGATE GRADIENTS
• Source: REVIEWS OF MODERN PHYSICS, 64 (4): 1045-1097 OCT 1992 (USA, American Physical Society). SCI citations: 2654
• Abstract: This article describes recent technical developments that have made the total-energy pseudopotential the most powerful ab initio quantum-mechanical modeling method presently available. In addition to presenting technical details of the pseudopotential method, the article aims to heighten awareness of the capabilities of the method in order to stimulate its application to as wide a range of problems in as many scientific disciplines as possible.

Highly cited SCI abstract, introduction section example (introducing the main content or scope of the paper), keyword: includes
• Author(s): MARCHESINI, G; WEBBER, BR; ABBIENDI, G; KNOWLES, IG; SEYMOUR, MH; STANCO, L
• Title: HERWIG 5.1 - A MONTE-CARLO EVENT GENERATOR FOR SIMULATING HADRON EMISSION REACTIONS WITH INTERFERING GLUONS
• Source: COMPUTER PHYSICS COMMUNICATIONS, 67 (3): 465-508 JAN 1992 (Netherlands, Elsevier). SCI citations: 955
• Abstract: HERWIG is a general-purpose particle-physics event generator, which includes the simulation of hard lepton-lepton, lepton-hadron and hadron-hadron scattering and soft hadron-hadron collisions in one package. It uses the parton-shower approach for initial-state and final-state QCD radiation, including colour coherence effects and azimuthal correlations both within and between jets.
This article includes a brief review of the physics underlying HERWIG, followed by a description of the program itself. This includes details of the input and control parameters used by the program, and the output data provided by it. Sample output from a typical simulation is given and annotated.

Highly cited SCI abstract, introduction section example (introducing the main content or scope of the paper), keyword: presents
• Author(s): IDSO, KE; IDSO, SB
• Title: PLANT-RESPONSES TO ATMOSPHERIC CO2 ENRICHMENT IN THE FACE OF ENVIRONMENTAL CONSTRAINTS - A REVIEW OF THE PAST 10 YEARS RESEARCH
• Source: AGRICULTURAL AND FOREST METEOROLOGY, 69 (3-4): 153-203 JUL 1994 (Netherlands, Elsevier). SCI citations: 225
• Abstract: This paper presents a detailed analysis of several hundred plant carbon exchange rate (CER) and dry weight (DW) responses to atmospheric CO2 enrichment determined over the past 10 years. It demonstrates that the percentage increase in plant growth produced by raising the air's CO2 content is generally not reduced by less than optimal levels of light, water or soil nutrients, nor by high temperatures, salinity or gaseous air pollution. More often than not, in fact, the data show the relative growth-enhancing effects of atmospheric CO2 enrichment to be greatest when resource limitations and environmental stresses are most severe.

Highly cited SCI abstract, introduction section example (introducing the main content or scope of the paper), keyword: emphasizing
• Author(s): BESAG, J; GREEN, P; HIGDON, D; MENGERSEN, K
• Title: BAYESIAN COMPUTATION AND STOCHASTIC-SYSTEMS
• Source: STATISTICAL SCIENCE, 10 (1): 3-41 FEB 1995 (USA). SCI citations: 296
• Abstract: Markov chain Monte Carlo (MCMC) methods have been used extensively in statistical physics over the last 40 years, in spatial statistics for the past 20 and in Bayesian image analysis over the last decade. In the last five years, MCMC has been introduced into significance testing, general Bayesian inference and maximum likelihood estimation.
This paper presents basic methodology of MCMC, emphasizing the Bayesian paradigm, conditional probability and the intimate relationship with Markov random fields in spatial statistics. Hastings algorithms are discussed, including Gibbs, Metropolis and some other variations. Pairwise difference priors are described and are used subsequently in three Bayesian applications, in each of which there is a pronounced spatial or temporal aspect to the modeling. The examples involve logistic regression in the presence of unobserved covariates and ordinal factors; the analysis of agricultural field experiments, with adjustment for fertility gradients; and processing of low-resolution medical images obtained by a gamma camera. Additional methodological issues arise in each of these applications and in the Appendices. The paper lays particular emphasis on the calculation of posterior probabilities and concurs with others in its view that MCMC facilitates a fundamental breakthrough in applied Bayesian modeling.

Highly cited SCI abstract, introduction section example (introducing the main content or scope of the paper), keyword: focuses
• Author(s): HUNT, KJ; SBARBARO, D; ZBIKOWSKI, R; GAWTHROP, PJ
• Title: NEURAL NETWORKS FOR CONTROL-SYSTEMS - A SURVEY
• Source: AUTOMATICA, 28 (6): 1083-1112 NOV 1992 (Netherlands, Elsevier). SCI citations: 427
• Abstract: This paper focuses on the promise of artificial neural networks in the realm of modelling, identification and control of nonlinear systems. The basic ideas and techniques of artificial neural networks are presented in language and notation familiar to control engineers. Applications of a variety of neural network architectures in control are surveyed.
We explore the links between the fields of control science and neural networks in a unified presentation and identify key areas for future research.

Highly cited SCI abstract, introduction section example (introducing the main content or scope of the paper), keyword: focus
• Author(s): Stuiver, M; Reimer, PJ; Bard, E; Beck, JW;
• Title: INTCAL98 radiocarbon age calibration, 24,000-0 cal BP
• Source: RADIOCARBON, 40 (3): 1041-1083 1998 (USA). SCI citations: 2131
• Abstract: The focus of this paper is the conversion of radiocarbon ages to calibrated (cal) ages for the interval 24,000-0 cal BP (Before Present, 0 cal BP = AD 1950), based upon a sample set of dendrochronologically dated tree rings, uranium-thorium dated corals, and varve-counted marine sediment. The C-14 age-cal age information, produced by many laboratories, is converted to Delta(14)C profiles and calibration curves, for the atmosphere as well as the oceans. We discuss offsets in measured C-14 ages and the errors therein, regional C-14 age differences, tree-coral C-14 age comparisons and the time dependence of marine reservoir ages, and evaluate decadal vs. single-year C-14 results. Changes in oceanic deepwater circulation, especially for the 16,000-11,000 cal BP interval, are reflected in the Delta(14)C values of INTCAL98.

Highly cited SCI abstract, introduction section example (introducing the main content or scope of the paper), keyword: emphasis
• Author(s): LEBRETON, JD; BURNHAM, KP; CLOBERT, J; ANDERSON, DR
• Title: MODELING SURVIVAL AND TESTING BIOLOGICAL HYPOTHESES USING MARKED ANIMALS - A UNIFIED APPROACH WITH CASE-STUDIES
• Source: ECOLOGICAL MONOGRAPHS, 62 (1): 67-118 MAR 1992 (USA)
• Abstract: The understanding of the dynamics of animal populations and of related ecological and evolutionary issues frequently depends on a direct analysis of life history parameters.
For instance, examination of trade-offs between reproduction and survival usually rely on individually marked animals, for which the exact time of death is most often unknown, because marked individuals cannot be followed closely through time. Thus, the quantitative analysis of survival studies and experiments must be based on capture-recapture (or resighting) models which consider, besides the parameters of primary interest, recapture or resighting rates that are nuisance parameters.
[Structured abstract]
• This paper synthesizes, using a common framework, these recent developments together with new ones, with an emphasis on flexibility in modeling, model selection, and the analysis of multiple data sets. The effects on survival and capture rates of time, age, and categorical variables characterizing the individuals (e.g., sex) can be considered, as well as interactions between such effects. This "analysis of variance" philosophy emphasizes the structure of the survival and capture process rather than the technical characteristics of any particular model. The flexible array of models encompassed in this synthesis uses a common notation.
As a result of the great level of flexibility and relevance achieved, the focus is changed from fitting a particular model to model building and model selection.

SCI abstract, methods section examples
• Methods section
• (1) Describing the research or experimental process. Common words: test, study, investigate, examine, experiment, discuss, consider, analyze, analysis.
• (2) Explaining the research or experimental method. Common words: measure, estimate, calculate.
• (3) Introducing applications and uses. Common words: use, apply, application.

Highly cited SCI abstract, methods section example (describing the research or experimental process), keyword: discusses
• Author(s): LIANG, KY; ZEGER, SL; QAQISH, B
• Title: MULTIVARIATE REGRESSION-ANALYSES FOR CATEGORICAL-DATA
• Source: JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 54 (1): 3-40 1992. SCI citations: 298
• Abstract: It is common to observe a vector of discrete and/or continuous responses in scientific problems where the objective is to characterize the dependence of each response on explanatory variables and to account for the association between the outcomes. The response vector can comprise repeated observations on one variable, as in longitudinal studies or genetic studies of families, or can include observations for different variables. This paper discusses a class of models for the marginal expectations of each response and for pairwise associations. The marginal models are contrasted with log-linear models. Two generalized estimating equation approaches are compared for parameter estimation. The first focuses on the regression parameters; the second simultaneously estimates the regression and association parameters.
The robustness and efficiency of each is discussed. The methods are illustrated with analyses of two data sets from public health research.

Highly cited SCI abstract, methods section example (describing the research or experimental process), keyword: examines
• Author(s): Huo, QS; Margolese, DI; Stucky, GD
• Title: Surfactant control of phases in the synthesis of mesoporous silica-based materials
• Source: CHEMISTRY OF MATERIALS, 8 (5): 1147-1160 MAY 1996 (USA). SCI citations: 643
• Abstract: The low-temperature formation of liquid-crystal-like arrays made up of molecular complexes formed between molecular inorganic species and amphiphilic organic molecules is a convenient approach for the synthesis of mesostructure materials. This paper examines how the molecular shapes of covalent organosilanes, quaternary ammonium surfactants, and mixed surfactants in various reaction conditions can be used to synthesize silica-based mesophase configurations, MCM-41 (2d hexagonal, p6m), MCM-48 (cubic Ia3d), MCM-50 (lamellar), SBA-1 (cubic Pm3n), SBA-2 (3d hexagonal P6(3)/mmc), and SBA-3 (hexagonal p6m from acidic synthesis media). The structural function of surfactants in mesophase formation can to a first approximation be related to that of classical surfactants in water or other solvents with parallel roles for organic additives. The effective surfactant ion pair packing parameter, g = V/alpha(0)l, remains a useful molecular structure-directing index to characterize the geometry of the mesophase products, and phase transitions may be viewed as a variation of g in the liquid-crystal-like solid phase. Solvent and cosolvent structure direction can be effectively used by varying polarity, hydrophobic/hydrophilic properties and functionalizing the surfactant molecule, for example with hydroxy group or variable charge. Surfactants and synthesis conditions can be chosen and controlled to obtain predicted silica-based mesophase products. A room-temperature synthesis of the bicontinuous cubic phase, MCM-48, is presented.
A low-temperature (100 degrees C) and low-pH (7-10) treatment approach that can be used to give MCM-41 with high-quality, large pores (up to 60 Angstrom), and pore volumes as large as 1.6 cm(3)/g is described.

Highly cited SCI abstract, methods section example (describing the research or experimental process), keyword: estimates
• Author(s): KESSLER, RC; MCGONAGLE, KA; ZHAO, SY; NELSON, CB; HUGHES, M; ESHLEMAN, S; WITTCHEN, HU; KENDLER, KS
• Title: LIFETIME AND 12-MONTH PREVALENCE OF DSM-III-R PSYCHIATRIC-DISORDERS IN THE UNITED-STATES - RESULTS FROM THE NATIONAL-COMORBIDITY-SURVEY
• Source: ARCHIVES OF GENERAL PSYCHIATRY, 51 (1): 8-19 JAN 1994 (USA). SCI citations: 4350
• Abstract: Background: This study presents estimates of lifetime and 12-month prevalence of 14 DSM-III-R psychiatric disorders from the National Comorbidity Survey, the first survey to administer a structured psychiatric interview to a national probability sample in the United States. Methods: The DSM-III-R psychiatric disorders among persons aged 15 to 54 years in the noninstitutionalized civilian population of the United States were assessed with data collected by lay interviewers using a revised version of the Composite International Diagnostic Interview. Results: Nearly 50% of respondents reported at least one lifetime disorder, and close to 30% reported at least one 12-month disorder. The most common disorders were major depressive episode, alcohol dependence, social phobia, and simple phobia. More than half of all lifetime disorders occurred in the 14% of the population who had a history of three or more comorbid disorders. These highly comorbid people also included the vast majority of people with severe disorders. Less than 40% of those with a lifetime disorder had ever received professional treatment, and less than 20% of those with a recent disorder had been in treatment during the past 12 months.
Consistent with previous risk factor research, it was found that women had elevated rates of affective disorders and anxiety disorders, that men had elevated rates of substance use disorders and antisocial personality disorder, and that most disorders declined with age and with higher socioeconomic status. Conclusions: The prevalence of psychiatric disorders is greater than previously thought to be the case. Furthermore, this morbidity is more highly concentrated than previously recognized in roughly one sixth of the population who have a history of three or more comorbid disorders. This suggests that the causes and consequences of high comorbidity should be the focus of research attention. The majority of people with psychiatric disorders fail to obtain professional treatment. Even among people with a lifetime history of three or more comorbid disorders, the proportion who ever obtain specialty sector mental health treatment is less than 50%. These results argue for the importance of more outreach and more research on barriers to professional help-seeking.

Highly cited SCI abstract, methods section example (explaining the research or experimental method), keyword: measure
• Author(s): Schlegel, DJ; Finkbeiner, DP; Davis, M
• Title: Maps of dust infrared emission for use in estimation of reddening and cosmic microwave background radiation foregrounds
• Source: ASTROPHYSICAL JOURNAL, 500 (2): 525-553 Part 1 JUN 20 1998 (USA). SCI citations: 2972
• The primary use of these maps is likely to be as a new estimator of Galactic extinction. To calibrate our maps, we assume a standard reddening law and use the colors of elliptical galaxies to measure the reddening per unit flux density of 100 mu m emission. We find consistent calibration using the B-R color distribution of a sample of the 106 brightest cluster ellipticals, as well as a sample of 384 ellipticals with B-V and Mg line strength measurements. For the latter sample, we use the correlation of intrinsic B-V versus Mg2 index to tighten the power of the test greatly.
We demonstrate that the new maps are twice as accurate as the older Burstein-Heiles reddening estimates in regions of low and moderate reddening. The maps are expected to be significantly more accurate in regions of high reddening. These dust maps will also be useful for estimating millimeter emission that contaminates cosmic microwave background radiation experiments and for estimating soft X-ray absorption. We describe how to access our maps readily for general use.

Highly cited SCI abstract, results section example (introducing applications and uses), keyword: application
• Author(s): MALLAT, S; ZHONG, S
• Title: CHARACTERIZATION OF SIGNALS FROM MULTISCALE EDGES
• Source: IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 14 (7): 710-732 JUL 1992 (USA). SCI citations: 508
• Abstract: A multiscale Canny edge detection is equivalent to finding the local maxima of a wavelet transform. We study the properties of multiscale edges through the wavelet

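Implementing the SMO algorithm for SVM

SVM (Support Vector Machine) is a widely used classification algorithm. It looks for the hyperplane that separates the classes with the largest margin; with a kernel function, the data are implicitly mapped into a higher-dimensional space in which such a separating hyperplane can be found.

SMO (Sequential Minimal Optimization) is an optimization algorithm for solving the SVM problem. Its core idea is to break the large problem into a series of small subproblems and to reach the optimum by solving these subproblems iteratively.

SMO optimizes only two variables at a time: it picks two variables α_i and α_j and optimizes them jointly while all others stay fixed.

The optimization steps are as follows:

1. Select a pair of variables α_i and α_j to optimize, using a heuristic.

Typically, the whole α vector is scanned first to find the point that violates the KKT conditions most severely. The KKT (Karush-Kuhn-Tucker) conditions are the optimality conditions of the SVM problem, so checking which entries of α fail them tells us which variables to optimize.

2. Fix all other variables and update the selected pair by solving a subproblem.

Solving the quadratic program in these two variables gives the updated α_i and α_j.

3. Update the threshold b.

The threshold b has to be recomputed after every update of α_i and α_j.

Following the KKT conditions, b is computed from whichever of α_i or α_j lies strictly between 0 and C.

4. Check the termination condition.

During the iteration, a termination criterion decides whether to stop; common choices are a maximum number of iterations or a target error.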

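The concrete implementation of SMO proceeds as follows:

1. Initialize the α vector, the threshold b, and the error vector E, where E_i is the difference between the predicted output and the label of sample i.

2. Select the two variables α_i and α_j to optimize.

3. Compute the bounds L and H within which the pair may move.

4. Use the bounds to decide how to proceed: if L = H the pair cannot move and is skipped; otherwise the new value is clipped to the interval [L, H].

5. Optimize the two selected variables.

Solve the two-variable quadratic programming subproblem to obtain the updated α_i and α_j.

6. Update the threshold b.

7. Update the error vector E.

8. Check the termination condition.

If the termination condition is met, stop; otherwise return to step 2 and continue iterating.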

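A complete SMO implementation is given below:

```python
import numpy as np


def smo(X, y, C, tol, max_iter):
    m, n = X.shape
    alpha = np.zeros(m)
    b = 0.0
    iters = 0  # number of consecutive passes in which no alpha changed
    while iters < max_iter:
        alpha_changed = 0
        for i in range(m):
            # Prediction error on sample i: f(x_i) - y_i
            E_i = np.sum(alpha * y * kernel(X, X[i, :])) + b - y[i]
            # Optimize only if alpha[i] violates the KKT conditions by more than tol
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = select_second_alpha(i, m)
                E_j = np.sum(alpha * y * kernel(X, X[j, :])) + b - y[j]
                alpha_i_old = alpha[i]
                alpha_j_old = alpha[j]
                # Bounds that keep both multipliers inside [0, C]
                if y[i] != y[j]:
                    L = max(0, alpha[j] - alpha[i])
                    H = min(C, C + alpha[j] - alpha[i])
                else:
                    L = max(0, alpha[i] + alpha[j] - C)
                    H = min(C, alpha[i] + alpha[j])
                if L == H:
                    continue
                # Second derivative of the objective along the constraint line
                eta = 2 * kernel(X[i, :], X[j, :]) - kernel(X[i, :], X[i, :]) - kernel(X[j, :], X[j, :])
                if eta >= 0:
                    continue
                alpha[j] = alpha[j] - y[j] * (E_i - E_j) / eta
                alpha[j] = clip_alpha(alpha[j], H, L)
                # Skip negligible steps (the original listing had a bare `continue` here)
                if abs(alpha[j] - alpha_j_old) < 1e-5:
                    alpha[j] = alpha_j_old
                    continue
                alpha[i] = alpha[i] + y[i] * y[j] * (alpha_j_old - alpha[j])
                b1 = (b - E_i
                      - y[i] * (alpha[i] - alpha_i_old) * kernel(X[i, :], X[i, :])
                      - y[j] * (alpha[j] - alpha_j_old) * kernel(X[i, :], X[j, :]))
                b2 = (b - E_j
                      - y[i] * (alpha[i] - alpha_i_old) * kernel(X[i, :], X[j, :])
                      - y[j] * (alpha[j] - alpha_j_old) * kernel(X[j, :], X[j, :]))
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                alpha_changed += 1
        if alpha_changed == 0:
            iters += 1
        else:
            iters = 0
    return alpha, b
```

This is a simple implementation of SMO. It relies on several helper functions (`kernel` for the kernel function, `select_second_alpha` for choosing the second variable, and `clip_alpha` for clipping), which have to be implemented for the problem at hand.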
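The helper functions used by `smo` are left open in the text. One possible set is sketched here; all three bodies are assumptions made for illustration (a linear kernel, a uniformly random choice of the second index, and plain interval clipping), not the original author's code. Only the names and the argument order are taken from the calls in the listing.

```python
import numpy as np


def kernel(A, x):
    """Linear kernel (an assumed choice).

    If A is an (m, n) matrix the result is the length-m vector of K(a_k, x);
    if A is a single length-n vector the result is the scalar K(A, x).
    """
    return np.dot(A, x)


def select_second_alpha(i, m, rng=None):
    """Pick an index j != i uniformly at random from 0..m-1 (requires m >= 2)."""
    if rng is None:
        rng = np.random.default_rng()
    j = int(rng.integers(0, m - 1))   # draws from 0..m-2
    return j if j < i else j + 1      # shift past i so that j != i


def clip_alpha(a, H, L):
    """Clip a to the interval [L, H]; the argument order follows the call in smo."""
    return min(max(a, L), H)


# Small demonstration of each helper.
X_demo = np.array([[1.0, 0.0], [0.0, 2.0]])
print(kernel(X_demo, X_demo[1]))       # one kernel value per row of X_demo
print(kernel(X_demo[0], X_demo[0]))    # single kernel value
print(clip_alpha(5.0, 2.0, 0.0))
```

A random second index is the simplest heuristic; choosing j to maximize |E_i - E_j| usually converges faster but needs an error cache.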

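A sparse greedy randomized Kaczmarz algorithm for sparse solutions of linear systems

... size $\hat{k}$. ② Output $x_j$. ③ Initialize $S = \{1, \ldots, n\}$, $x_0 = 0$, $j = 0$. ④ While $j \le M$, set $j = j + 1$. ⑤ Select a row vector $a_i$, $i \in \{1, \ldots, m\}$, where each row is chosen with probability $\|a_i\|_2^2 / \|A\|_F^2$. ⑥ Determine the estimated support set $S = \mathrm{supp}\left(x_{j-1}|_{\max\{\hat{k},\, n-j+1\}}\right)$.

... rows, thereby speeding up the convergence of the algorithm. Algorithm 3 gives the sparse greedy randomized Kaczmarz algorithm.

Algorithm 3: sparse greedy randomized Kaczmarz algorithm. ① Input $A \in R^{m \times n}$, $b \in R^m$, the maximum number of iterations $M$ and the estimated support size $\hat{k}$. ② Output $x_k$. ③ Initialize $S = \{1, \ldots, n\}$, $x_0 = x_0^* = 0$. ④ Set $k = 0$; while $k \le M - 1$: ⑤ compute

$$\epsilon_k = \frac{1}{2}\left( \frac{1}{\|b - A x_k\|_2^2} \max_{1 \le i_k \le m} \left\{ \frac{|b_{i_k} - a_{i_k} x_k|^2}{\|a_{i_k}\|_2^2} \right\} + \frac{1}{\|A\|_F^2} \right) \qquad (2)$$

⑥ Determine the index set of positive integers

$$U_k = \left\{ i_k \;\middle|\; |b_{i_k} - a_{i_k} x_k|^2 \ge \epsilon_k \|b - A x_k\|_2^2 \, \|a_{i_k}\|_2^2 \right\}$$

$$w_j(l) = \begin{cases} 1, & l \in S \\ \dfrac{1}{\sqrt{j}}, & l \in S^c \end{cases}$$

where $j$ is the iteration number. As $j \to \infty$, $w_j \odot a_i \to a_{iS}$, so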
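The threshold ε_k and the index set U_k from steps ⑤ and ⑥ can be tried out in a short sketch. The remaining steps are not shown in the text, so the sampling rule and the projection update below are assumptions: they follow the usual greedy randomized Kaczmarz scheme (sample a row from U_k with probability proportional to its squared residual, then project onto that row's hyperplane). The sparsity weighting by w_j and the support estimate S are left out, and the function name is hypothetical.

```python
import numpy as np


def greedy_randomized_kaczmarz(A, b, max_iter=1000, seed=0):
    """Greedy randomized Kaczmarz sketch for a consistent system A x = b."""
    rng = np.random.default_rng(seed)
    row_norms2 = np.sum(A * A, axis=1)     # ||a_i||_2^2, rows assumed nonzero
    fro2 = row_norms2.sum()                # ||A||_F^2
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        r = b - A @ x                      # residual b - A x_k
        res2 = float(r @ r)
        if res2 < 1e-24:                   # already solved to rounding accuracy
            break
        ratio = r ** 2 / row_norms2
        eps = 0.5 * (ratio.max() / res2 + 1.0 / fro2)            # step 5, formula (2)
        U = np.flatnonzero(r ** 2 >= eps * res2 * row_norms2)    # step 6, index set U_k
        if U.size == 0:                    # guard against rounding at equality
            U = np.array([int(np.argmax(ratio))])
        probs = r[U] ** 2 / np.sum(r[U] ** 2)   # assumed sampling rule
        i = rng.choice(U, p=probs)
        x = x + (r[i] / row_norms2[i]) * A[i]   # project onto row i's hyperplane
    return x


# Small consistent test system (made up for this example).
A = np.vstack([np.eye(3), np.ones((1, 3))])
x_true = np.array([1.0, -2.0, 0.5])
x_hat = greedy_randomized_kaczmarz(A, A @ x_true)
print(np.round(x_hat, 6))
```

In exact arithmetic U_k is never empty, because the row attaining the maximum in formula (2) always satisfies the inequality; the guard only covers floating-point ties.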

手推SVM算法(含SMO证明)

手推SVM算法（含SMO证明）SVM（支持向量机）是一种二元分类模型，它通过在特征空间中找到一个最优的超平面来进行分类。

SVM算法的核心是构造最优的分类超平面，使得它能够有力地将两类样本分开，并且使得与超平面相距最近的样本点的距离最大化。

SMO（序列最小优化）算法是一种高效求解SVM问题的方法。

为了简化讲解，我们假设SVM的两类样本是线性可分的，即存在一个超平面可以将两类样本完全分开。

在此基础上，我们来推导最优化问题和SMO算法的推导。

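1. The SVM optimization problem: we can normalize $w$ in the hyperplane $w \cdot x + b = 0$, rewriting the hyperplane as $w \cdot x + b = 0$ with $\|w\| = 1$, where $w$ is the normal vector of the hyperplane and $b$ is its intercept.

Since the goal of the SVM is to make the points closest to the hyperplane as far from it as possible, we introduce the notion of the geometric margin.

For a sample point $(x_i, y_i)$, its geometric margin with respect to the hyperplane is defined as $\gamma_i = y_i (w \cdot x_i + b) / \|w\|$.

The geometric margin does not change when $(w, b)$ is rescaled, so instead of fixing $\|w\| = 1$ we may rescale so that the closest points satisfy $y_i (w \cdot x_i + b) = 1$; the margin is then $1 / \|w\|$. The SVM optimization problem can therefore be written as the following convex optimization problem:

$$\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, 2, \dots, n$$

The objective minimizes the squared norm of $w$, which maximizes the margin of the hyperplane.

The constraints ensure that every sample is classified correctly.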
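As a small illustration of the geometric margin defined above, the following sketch computes $\gamma_i$ for a toy data set and checks that rescaling $(w, b)$ leaves the margins unchanged. The function name and the data are hypothetical.

```python
import math


def geometric_margins(w, b, X, y):
    """Return gamma_i = y_i * (w . x_i + b) / ||w|| for every sample."""
    norm_w = math.sqrt(sum(wk * wk for wk in w))
    return [
        yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b) / norm_w
        for xi, yi in zip(X, y)
    ]


# Toy data: two positive points and one negative point in the plane.
X = [[2.0, 2.0], [0.0, 0.0], [3.0, 1.0]]
y = [1, -1, 1]
margins = geometric_margins([1.0, 1.0], -2.0, X, y)
# Rescaling (w, b) by a positive constant leaves every margin unchanged.
rescaled = geometric_margins([2.0, 2.0], -4.0, X, y)
```

All margins are positive exactly when the hyperplane classifies every sample correctly, and the smallest of them is the quantity the SVM maximizes.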

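2. Derivation of the SMO algorithm: to solve the SVM optimization problem, the method of Lagrange multipliers can be used to convert it into its dual problem.

Solving the dual problem for the dual variables then yields the solution of the primal problem.

Introducing a Lagrange multiplier $\alpha_i \ge 0$ for each constraint $y_i (w \cdot x_i + b) \ge 1$ gives the Lagrangian

$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_i \alpha_i \left( y_i (w \cdot x_i + b) - 1 \right)$$

where $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_n)^T$ is the vector of Lagrange multipliers.

Setting the partial derivatives of $L(w, b, \alpha)$ with respect to $w$ and $b$ to zero gives

$$w = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0$$

Substituting $w$ back into the Lagrangian yields the dual problem in $\alpha$:

$$\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0, \quad i = 1, 2, \dots, n$$

This is a convex optimization problem; the $\alpha$ obtained by solving the dual problem can then be used to recover the optimal hyperplane.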
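The substitution step is stated without intermediate algebra. Written out, using only the two stationarity conditions $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$, it reads:

```latex
\begin{aligned}
L(w, b, \alpha)
  &= \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i y_i (w \cdot x_i) - b \sum_i \alpha_i y_i + \sum_i \alpha_i \\
  &= \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
     - \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - 0 + \sum_i \alpha_i \\
  &= \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
\end{aligned}
```

The term in $b$ vanishes because of the constraint $\sum_i \alpha_i y_i = 0$, which is why that constraint reappears in the dual problem.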

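Optimization in Gaussian

Optimization, step one: determine the molecular geometry. It can be built from what you know about the molecule using software such as GVIEW and CHEM3D, but more often it is built from experimental data (for example, a Gaussian Cartesian-coordinate input file can be obtained from crystallography software, which can be downloaded from the 大话西游 site, and GVIEW can then generate a Gaussian input file in Z-matrix form). Note that the numbering of the atoms is determined by the order in which they are entered or built, so to enter the structure with its symmetry, make sure that the first atom entered is the center of symmetry; this can speed up the calculation.

The molecules I compute are fairly large, so I have never tried this; I hope that friends who have done this kind of work can fill in the details.

The following posts were downloaded from this forum, from 大话西游, and from the 宏剑公司 site.

Suppose that bond lengths that are close to each other, such as B12 1.08589, B13 1.08581, B14 1.08544; bond angles that are close, such as A6 119.66589, A7 120.46585, A8 119.36016; and dihedral angles that are close, such as D10 -179.82816, D11 -179.71092, are all set to the same value. I have heard that this reduces the number of variables and improves computational efficiency. Is that right? Setting certain bond lengths and angles equal in the first step or later on seems to amount to the same thing.

It is just that setting them equal in the very first step, unless there is experimental evidence, is done purely on experience.

If you build on an earlier calculation and trust it, then setting them equal has some justification.

However, setting them equal always carries some risk.

For a system without symmetry, there should be no exact equalities.

Perhaps try this: first PM3, then B3LYP/6-31G (with certain bond lengths and angles set equal), then B3LYP/6-31G again (releasing the artificially imposed equality constraints on those bond lengths and angles).

For bond lengths, bond angles, and the question of whether a bond exists at all, Gview is simply not precise, although it is basically fine. Constraining them may cause serious problems; the energy will generally differ, sometimes considerably. If you want to reduce the number of optimization parameters, do not merely set similar parameters equal; instead, use identical parameters according to the symmetry.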

For example, for the benzene molecule, the molecule specification section is as follows:

C
C 1 B1
C 2 B2 1 A1
C 3 B3 2 A2 1 D1
C 4 B4 3 A3 2 D2
C 1 B5 2 A4 3 D3
H 1 B6 2 A5 3 D4
H 2 B7 1 A6 6 D5
H 3 B8 2 A7 1 D6
H 4 B9 3 A8 2 D7
H 5 B10 4 A9 3 D8
H 6 B11 1 A10 2 D9

B1 1.395160
B2 1.394712
B3 1.395427
B4 1.394825
B5 1.394829
B6 1.099610
B7 1.099655
B8 1.099680
B9 1.099680
B10 1.099761
B11 1.099604
A1 120.008632
A2 119.994165
A3 119.993992
A4 119.998457
A5 119.997223
A6 119.980770
A7 120.012795
A8 119.981142
A9 120.011343
A10 120.007997
D1 -0.056843
D2 0.034114
D3 0.032348
D4 -179.972926
D5 179.953248
D6 179.961852
D7 -179.996436
D8 -179.999514
D9 179.989175

There are many parameters, but by applying the symmetry and using dummy atoms they can be reduced to:

X
X 1 B0
C 1 B1 2 A1
C 1 B1 2 A1 3 D1
C 1 B1 2 A1 4 D1
C 1 B1 2 A1 5 D1
C 1 B1 2 A1 6 D1
C 1 B1 2 A1 7 D1
H 1 B2 2 A1 8 D1
H 1 B2 2 A1 3 D1
H 1 B2 2 A1 4 D1
H 1 B2 2 A1 5 D1
H 1 B2 2 A1 6 D1
H 1 B2 2 A1 7 D1

B0 1.0
B1 1.2
B2 2.2
A1 90.0
D1 60.0

These two jobs took 57 s and 36 s, with symmetry C01 and D6H respectively; the latter is clearly far better than the former.

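Research on an E-mail Filtering System Based on the Maximum Entropy Model

1 Introduction

With the rapid development and spread of the Internet, e-mail, being convenient, fast, and cheap, has gradually become one of the main means of communication in daily life.

However, the appearance of large volumes of spam has caused huge losses to users worldwide.

According to the Third Survey Report on China's Anti-Spam Market [1], spam accounts for 65.7% of all the e-mail that the average user in China receives each week.

The flood of spam has had serious consequences, so distinguishing effectively between legitimate e-mail and spam has become an urgent task.

In recent years, research on spam filtering techniques has gradually gained momentum.

Early work started from the semi-structured nature of e-mail, looking for the characteristics of spam and filtering on the mail header, the mail body, and other parts of the message.

Common filtering methods include blacklists, whitelists, and filtering rules, but because senders keep changing, rules are hard to maintain, and accuracy is low, all of these methods have their limitations.

At present, combining spam filtering with machine learning, text classification, and information filtering techniques to analyze the content of the message body has become a focus of research.

Content-based analysis can acquire the characteristics of spam automatically and is a more accurate e-mail filtering technique [2].

The maximum entropy model is a technique widely used in signal processing. In recent years it has been applied to many areas of natural language processing, including word segmentation, part-of-speech tagging, word sense disambiguation, phrase recognition, and machine translation.

Adwait Ratnaparkhi first applied the maximum entropy model to text classification in his doctoral dissertation [3], and Li Ronglu et al. were the first to use the maximum entropy model for Chinese text classification [4].

In view of the probabilistic nature of legitimate e-mail and spam, this paper introduces the maximum entropy model into e-mail filtering. Taking the semi-structured nature of e-mail into account, it improves the definition of the feature functions in the traditional model, forms e-mail feature vectors, trains a maximum entropy model, and uses this model to filter the messages in the test set.

The experimental results show that the e-mail filtering method based on the maximum entropy model with the improved feature function definition performs well.

2 The maximum entropy model

The basic idea of the maximum entropy principle is: given the training samples, choose a model that is fully consistent with the training samples while keeping the distribution over unknown events as uniform as possible.

To adapt the probability estimates of the maximum entropy model to e-mail filtering, we introduce the following variables: $A = \{a_1, a_2, \dots, a_m\}$ is the set of e-mail classes, $B = \{b_1, b_2, \dots, b_n\}$ is the set of e-mail features, and $\mathrm{num}(a_i, b_j)$ is the number of times the pair $(a_i, b_j)$ occurs in the training set. For any given $a \in A$ and $b \in B$, the probability $p(a \mid b)$ denotes the probability that a message containing feature $b$ belongs to class $a$.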
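To make the role of the counts $\mathrm{num}(a_i, b_j)$ concrete, the sketch below computes the empirical conditional frequencies $\mathrm{num}(a, b) / \sum_{a'} \mathrm{num}(a', b)$ from a list of (class, feature) pairs. This is only the count-based reference estimate that a trained model is compared against, not the maximum entropy model itself; the function name and the toy data are hypothetical.

```python
from collections import Counter


def empirical_conditional(pairs):
    """Empirical estimate of p(a | b) from observed (class, feature) pairs."""
    num = Counter(pairs)                            # num(a, b)
    feature_totals = Counter(b for _, b in pairs)   # sum over a of num(a, b)
    return {(a, b): c / feature_totals[b] for (a, b), c in num.items()}


pairs = [("spam", "free"), ("spam", "free"), ("ham", "free"), ("ham", "meeting")]
p = empirical_conditional(pairs)
```

Pairs that never occur in the training set are simply absent from the result, which is exactly the situation where a smoothed or maximum entropy estimate is needed.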

English Translation

Mode-space spatial spectral estimation for circular arrays

R. Eiges, PhD; H.D. Griffiths, PhD, CEng, MIEE

From: IEE Proc.-Radar, Sonar Navig., Vol. 141, No. 6, December 1994

Abstract: The application of super-resolution techniques to circular arrays in mode space is proposed. This formulation allows the use of algorithms that are otherwise specific to equispaced linear arrays, with the added benefit of full 360° azimuth coverage. Coherent signals can be de-correlated by a mode-space version of the spatial smoothing technique, and, in the case of broadband signals, through an inherent property of direction-independent frequency-domain smoothing.

Indexing terms: Circular arrays, Super-resolution techniques, Spatial spectral estimation

I. INTRODUCTION

Over the last three decades there has been tremendous interest in surpassing the Rayleigh resolution limit of spatial spectral estimation. A variety of superresolution algorithms have been developed and analysed, among them one-dimensional parameter-search methods such as Capon's MVDR, Burg's Max-Entropy, MUSIC and Min-Norm [1-5], search-free methods such as ESPRIT and TAM [6, 7] and multidimensional-search schemes such as IMP, Stochastic and Deterministic Max-Likelihood and WSF [8-11]. A number of these estimators have either been restricted or better modelled when applied to equispaced linear arrays. The inherent symmetry and full peripheral coverage that characterize sensor arrays with circular geometry have, rather surprisingly, attracted only limited interest [12-17]. Superresolution estimators may be directly applied to the signals received by the array elements, but it often proves beneficial to preprocess the array outputs, transforming the superresolution scheme from 'element space' onto 'beam space'. In this paper we propose and discuss the merits of a particularly useful transformation from element space to circular-array mode space, sometimes referred to as spatial harmonics. Such preprocessing transformation has been suggested
in the past┊┊┊┊┊┊┊┊┊┊┊┊┊装┊┊┊┊┊订┊┊┊┊┊线┊┊┊┊┊┊┊┊┊┊┊┊┊where a scheme akin to Prony’s method was applied for solving the multiple source problem[18].More recently,the Maximum Likelihood and the Root-MUSIC estimators as well as a linear prediction technique have been reformulated for the case of isotropic circular-array elements in mode space[14-161,with the Root-MUSIC algorithm also adapted to phase-mode-based beamspace[17].II.Circular-array mode spaceA phase mode is formed when a spatial discrete Fourier transform(DFT)is applied to the element outputs of a circular array.The excitation of each phase mode in a horizontally oriented equispaced M-element circular array has been shown to yield a far-field pattern),,(ωϕθuΦwhich,for sufficiently small inter-element spacing,and phase periodicity(mode number)lower than M/2,approaches omnidirectionality in amplitude at each constant-elevation observation cut,with linear phaseagainst-angle characteristics[19]2/),(),(),,()(MutermsdistortioneCeCjuqqqMujuqu<+==Φ−∞−∞=+−∑ωϕϕθωθωωϕθ(1)In eqn.1,)1(−=j,u is the mode number,w is the(angular)frequency, ),(ϕθare the usual spherical coordinate angles(withϕmeasured on the azimuth plane of the array)and{),(θωuqC}are phase mode coefficients whose frequency characteristics depend on the element patterns.For symmetrically identical element patterns of the forms∑∞−∞==ijiiehgωωθωϕθ),(),,(),(θωuC is given by)sin(),(),()(θωωθθωcRJhjCiuiiiuu+∞−∞=+∑=where R and c denote the array radius and the speed of propagation,respectively, and)(zJnis a Bessel function of the first kind of order n and argument z.A special case is that of azimuthally omnidirectional radiators,i.e.0,0≠=ihifor which)sin()(),(θωθθωcRJhjCuuu=Thus,whenever the frequency and the array radius are such that ()[]sin/θωcRJuhits one of its zeros,there will effectively be a'hole'in the far-fieldcircumferential coverage of that mode around elevationθθ=,[providedθis within the angular coverage of)(θh].In the vicinity of the zero,the far-field 
azimuth pattern is not completely cut out, but ripple from higher-order terms (especially those characterised by $q = \pm 1$) will dominate, and thus limit the practical usefulness of that mode. If on the other hand the element patterns are of the form

$$g(\theta, \varphi) = g(\theta) \cos^2(\varphi / 2) = g(\theta) \left[ \tfrac{1}{2} + \tfrac{1}{2} \cos \varphi \right]$$

then

$$C_u(\omega, \theta) = \frac{j^u g(\theta)}{2} \left[ J_u\!\left( \frac{\omega R}{c} \sin\theta \right) - j J_u'\!\left( \frac{\omega R}{c} \sin\theta \right) \right]$$

where $J_v'(\cdot)$ signifies the derivative of the Bessel function of the first kind with respect to its argument. Since the zeros of $J_v(\cdot)$ and of $J_v'(\cdot)$ do not coincide for any $v$, it follows that for this type of element pattern the phase mode coefficients never fall to zero. In fact the above phase mode coefficient is close to 'ideal' in its broadband behaviour, as is revealed by examining the asymptotic expression for the relevant Bessel functions of large arguments

$$C_u(\omega, \theta) \sim g(\theta) \, e^{-j\pi/4} \left[ \frac{c}{2 \pi \omega R \sin\theta} \right]^{1/2} e^{j (\omega R / c) \sin\theta}$$

which is linear in its phase response, although requiring amplitude equalisation for broadband operation.

In general, the broadband alignment of a set of phase modes for a given elevation angle involves the mode-wise deconvolution of their zero-order ($q = 0$) coefficients. This has been demonstrated for a circular array of directional elements fed by an analogue Butler matrix [20] and is similarly implementable in the context of a digital DFT beamformer [21].

III. Mode-space formulation

Denoting the number of circular-array sensors and number of far-field signal sources, respectively, by $M$ and $K$, the vector $x$ of complex (analytical) signals received by the array sensors is expressible, under single-frequency narrowband conditions, as

$$x(t) = A(\omega) s(t) + w(t) \qquad (2)$$

where $s(t) = [s_0(t) \; s_1(t) \; \dots \; s_{K-1}(t)]^T$ and $w(t) = [w_0(t) \; w_1(t) \; \dots \; w_{M-1}(t)]^T$ (with $[\cdot]^T$ denoting transposition) are the corresponding source signal and noise vectors, and $A(\omega)$ is the $M \times K$ steering matrix, each of whose columns denotes the array response at frequency $\omega$ to a plane wave incident from one of the $K$ source bearings. The narrowband preprocessing transformation of a circular
sensor array from $M$-dimensional element space onto $M'$-dimensional azimuth-plane mode space may be represented by the matrix operation

$$y(t) = Q^H E^H x(t) \qquad (3)$$

where the $M' \times 1$ vector $y$ groups the complex representation of the linearly transformed signals at the beamformer outputs, $E = [E_{-\Lambda} \; \dots \; E_0 \; \dots \; E_{\Lambda}]$ is a phasing matrix whose $M' = 2\Lambda + 1$ orthonormal columns ($\Lambda$ is the absolute mode number of the highest processed phase mode) are given by

$$E_u = M^{-1/2} \left[ 1 \;\; e^{j(2\pi/M)u} \;\; e^{j(2\pi/M)2u} \;\; \dots \;\; e^{j(2\pi/M)(M-1)u} \right]^T, \quad -\Lambda \le u \le \Lambda \qquad (4)$$

and $Q$ is an $M' \times M'$ diagonal matrix whose $uu$-th element is given by

$$Q_{uu} = \frac{1}{M^{1/2} \, C_u^*(\omega, \pi/2)}, \quad -\Lambda \le u \le \Lambda \qquad (5)$$

with $(\cdot)^*$ and $(\cdot)^H$ denoting complex conjugation and complex conjugate transposition, respectively. A similar formulation also appears in Reference 17 for the case of isotropic array sensors. From eqns. 2 and 3 one has

$$y(t) = Q^H E^H A(\omega) s(t) + Q^H E^H w(t) \qquad (6)$$

and the $M' \times M'$ 'mode-space' covariance matrix (assuming zero-mean signals uncorrelated with zero-mean noises) is given by

$$\tilde{R}_y = \tilde{A} R_s \tilde{A}^H + \tilde{R}_w \qquad (7)$$

where $\xi$ is the expectation operator,

$$\tilde{A} = Q^H E^H A \qquad (8)$$

$$\tilde{R}_w = Q^H E^H R_w E Q \qquad (9)$$

$R_s = \xi[s s^H]$ and $R_w = \xi[w w^H]$ are the source signal and (element-space) noise covariance matrices, respectively, and for spatially-white homoscedastic noise of element-space variance (noise power) $\sigma_w^2$

$$\tilde{R}_w = \sigma_w^2 Q^H Q \qquad (10)$$

Although the mode-space noise remains spatially white, the noise powers at the phase-mode outputs are not the same. However, the spatial whiteness assumption which leads to eqn. 10, although reasonable as far as internally generated (thermal) noise is concerned, does not necessarily hold for spatial contributions from ambient noise fields.
Denoting the spatial power density of the ambient noise field by),,(ϕθωN,and the cross-spectral density matrix of the ambient noise field contribution at the M'mode outputs by)(~ωwaP we have,for a circular array of closely spaced sensors in a noise field that is statistically independent with respect to direction┊┊┊┊┊┊┊┊┊┊┊┊┊装┊┊┊┊┊订┊┊┊┊┊线┊┊┊┊┊┊┊┊┊┊┊┊┊[]∫∫−−−×≈ππϕπϕθωϕθπωπωθωθωθω)(**0''''''''''),,(sin)2/,()2/,(),(),()(~uujuuuuuuweNdCCCCdP(11)where[]u uwP')(~ωenotes the'''uu th element ofwaP~.If the noise field is omnidirectional inφand concentrated around zero(2/πθ=)elevation,then[]⎪⎩⎪⎨⎧≠=×≈∫''''''sin),()2/,(/)2/,(2~'''''uuuuNCCdP uuuwauθθωπωπωθππ(12))(~ωwaP is then a diagonal matrix with equal elements,and,for the narrowband problem,so is the spatial covariance matrix contributed by the ambient noise field.A similar result is obtained for the(element-space)covariance matrix of a linear array in an isotropic or hemispherically-isotropic noise field,provided that the array sensors are isotropic and spaced half a wavelength apart[22].Taking the(internally generated) thermal noise power to be relatively small(which is a fair assumption in the case of a sonar system[22]),the noise at the phase-mode outputs of a(horizontal)circular-array is thus seen to be spatially white and homoscedastic,for a noise field that is omnidirectional in azimuth and impulsive at zero elevation.Note that in the context of an underwater sonar system,the elevation-wise confined noise model is not unreasonable,especially under a cylindrical array geometry in which the directivity in elevation is increased.This noise model allows the convenient eigen-decomposition of the mode-space covariance matrix in estimation algorithms such as MUSIC and the Min-Norm method.In contrast,the element-space covariance matrix for a circular array or a linear array with non-isotropic sensors in an isotropic noise-field,or that of a horizontal linear array under an azimuthally omnidirectional noise-field that varies 
in elevation are not white.From eqn.1it also follows that for closely spaced array elements each column of A~is approximately given by[]1...1~)1(21'−≤≤=Λ−−−KkzzzzAkTMkkkk(13)where for each direction of arrival(DOA)kφϕ=10−≤≤=Kkez k jkφ(14) The modified M'x K steering matrix A~may thus be written as┊┊┊┊┊┊┊┊┊┊┊┊┊装┊┊┊┊┊订┊┊┊┊┊线┊┊┊┊┊┊┊┊┊┊┊┊┊⎟⎟⎟⎟⎟⎟⎠⎞⎜⎜⎜⎜⎜⎜⎝⎛×⎟⎟⎟⎟⎟⎟⎠⎞⎜⎜⎜⎜⎜⎜⎝⎛=Λ−ΛΛ−−−−−−−−−−−−1)1()1()1(222111.......................111~KkMKMkMKkKkzzzzzzzzzzzzA(15)The first matrix on the right-hand side of eqn.15is characterized by a Vandermondestructure and consequently has full rank K for K distinct DOA angles{kφ}and therefore Kdistinctkz,while the diagonal matrix which multiplies it from the right is clearly non-singular.A~is thus a full rank matrix whose structure is identical to that of the steering matrix of a linear array of M'equal-pattern uniformly spaced sensors.Note though that since in the case of a linear array with half-wavelength inter-element spacing kjkezθπsin=,the equivalence is between360°in(circular-array)φ-space and180°in(linear-array)sinθspace.Since the mode-space signal-only covariance matrix and radiation pattern of a uniformly spaced circular array are of the same structure as the corresponding signal-only covariance matrix and radiation pattern of a uniformly spaced linear array under element-space formulation,and as,under isotropic noise-field conditions or-azimuthally-omnidirectional/elevation-wise-impulsive noise(for the circular array),the noise covariance matrix is in both cases diagonal and with equal elements,it follows that super-resolution schemes which are ordinarily restricted to uniformly spaced linear arrays in element-space,are equally applicable to uniformly spaced circular arrays in mode-space. 
That includes multiple invariance (overlapped) ESPRIT, TAM, the Max-Entropy and Min-Norm methods, as well as root-finding versions of all one-dimensional parameter-search algorithms.

IV. Mode-space spatial smoothing

Preprocessing in the form of spatial or frequency-domain smoothing is required by eigenstructure-based super-resolution algorithms, such as MUSIC, Min-Norm and the ESPRIT method, to enable them to cope with coherent signals. When some of the received signals are coherent, the signal covariance matrix $R_s$ becomes rank-deficient, and consequently the subspace spanned by the eigenvectors of the covariance matrix $\xi[x x^H]$ (or $\xi[y y^H]$) associated with its minimal eigenvalue is no longer orthogonal to the columns of the steering matrix $A$ ($\tilde{A}$). Thus, although these DOA estimators are asymptotically unbiased for uncorrelated or partially correlated signals, they may completely fail in the presence of multipath 'image sources' or when subjected to coherent jamming.

The spatial smoothing technique for the decorrelation of coherent signals was first introduced by Evans [23] and further developed and analysed by Shan [24] and by a number of other authors. As formulated for a uniformly spaced linear array of identical sensors, the scheme involves the reduction of the spatial covariance matrix into a set of 'partial' covariance matrices defined for a corresponding set of equal-size (interlaced) sub-arrays, each with a different phase-center. These matrices are averaged to form the 'smoothed' covariance matrix, which can be shown to have the same structure as the covariance matrix for noncoherent signals. The effective size of the array is, however, reduced to the sub-array size, which implies a lower angular resolution.

Consider the $M'$ aligned outputs of phase-modes $\{-\Lambda, -\Lambda + 1, \dots, \Lambda\}$ formed by applying the preprocessing transformation (eqn. 3) to the element channels of an $M$-sensor
circular array,and let the approximate far-field radiation pattern of the phase-modes in the above set be given by eqn.1.Next,form the following(M'-M"+1) overlapping subsets of M"aligned phase-modes{}}...21{}1...11{1...1''''''Λ−+−Λ+−Λ+Λ−+Λ−+Λ−−+Λ−+Λ−Λ−MMMand denote byvy,v=0,1,...,M'-M the vector of aligned phase-modes belonging to the vth subset.Under narrowband formulation we then have)()()()()(*tyt sBtynvv+Ψ=φωv=0,1,...,M'-M"(16) Where)(tynvis the mode-space additive noise vector belonging to the vth subset, the M"x K matrix&I,)consists of the top M"rows of the modified steering matrix)(~ωA,andϕdenotes the vth power of the K x K diagonal matrix)...()(110−=ΨK jjj eeediagφφφφ(17) The covariance matrix for the vth subset is given byΙ+ΨΨ=2~wHHvsvvBRBRσI(18)where2~wσis the(white homoscedastic)noise power at the phase-mode outputs,and I is the M"x M"identity matrix.The construction of a spatially smoothed covariance matrix R follows the lines outlined for a uniformly spaced linear array[24],according to which R is simply given by the sample mean of the subset covariances┊┊┊┊┊┊┊┊┊┊┊┊┊装┊┊┊┊┊订┊┊┊┊┊线┊┊┊┊┊┊┊┊┊┊┊┊┊Ι+=+−=∑−=2'''~11~'''wHsMMvvyBRBRMMRσ(19)wheresR~,the modified signal covariance matrix,is given by∑−=ΨΨ+−=''''''11MMvHvsvsRMMR(20)Also,from Reference24,the modified signal covariance matrixsR will be of fullrank as long as(M'-M+1)>K,which together with the condition M">K needed for the subsequent eigen-decomposition procedure,means that we must also have M'2K.Sincethe smoothed covariance matrixyR is of exactly the same signal and noise structure as the(unsmoothed)covariance matrix for the incoherent signal case,it is equally applicable to eigen-structure based spatial spectral estimation algorithms.The dimension of the covariance matrix is,however,reduced from M'x M'to M"x M",which may be viewed as a decrease in the effective aperture of the array.V.Frequency-domain smoothingFrequency smoothing refers to frequency-domain averaging of the preprocessed cross-spectral 
density matrix which has a de-correlating effect on relatively delayed signals from wideband sources.Preprocessing is necessary to enable the(modified)steering matrix to maintain the same rank1description per source over the whole frequency band, so that(essentially narrowband)eigendecomposition algorithms may be applied.A number of coherent'focusing'techniques have been considered[25-281,of which the spatial resampling method[28]is the only approach to provide true direction-independent focusing,with a single evaluation of the transformed covariance matrix that does not require preliminary estimates of source locations.This method,however,has been limited to linear equispaced arrays.Consider again the spatial preprocessing transformation(eqn.3)of a circular sensor array from M-dimensional element-space,applied this time to M'-dimensional mode-space, applied this time to generally wideband signals convolved with the array responses to temporal impulses emanating from far-field sources.The signal received at the mth array sensor is given by∫∞∞−=+=tjmkmkmkmkmeAdtatwtstatxωωωπ)(21)()()(*)()(10,10−≤≤−≤≤KkMm(21)with*denoting convolution,and let each phase-mode be(digitally or analogue)filtered so that the response of its zero-order coefficient is de-convolved over the relevant frequency band.This means that the elements of the time-domain diagonal matrix Q in eqn.3are replaced by convolution operators,such that the temporal Fourier transformation of the covariance of the elements of At),as given by eqn.3,results in the following mode-space┊┊┊┊┊┊┊┊┊┊┊┊┊装┊┊┊┊┊订┊┊┊┊┊线┊┊┊┊┊┊┊┊┊┊┊┊┊cross-spectral density matrix(CSDM))(~)(~)()(~)()(ωωωωττωωτwHsjyyPAPAeRdP+==∫∞∞−−(22)where∫∞∞−−=ωτττωjsseRdP)()((23)∫∞∞−−=ωτττωjwweRdP)(~)((24) are the signal and mode-space noise CSDMs,respectively,and)(τsR,)(~τwR and)(τyR are the covariance matrices for the wideband time-domain signals,noise3and phase-mode outputs,respectively.The steering matrix is modified to)()(~)(~ωωωAEQA HH=(25) and in the 
relevant frequency band, the $uu$-th element of $\tilde{Q}(\omega)$ is given by

$$[\tilde{Q}(\omega)]_{uu} = \frac{1}{M^{1/2} \, C_u^*(\omega, \pi/2)}, \quad -\Lambda \le u \le \Lambda \qquad (26)$$

Note that if the inter-element spacing at the upper frequency is small in wavelengths, then $\tilde{A}$ as defined by eqn. 25 is approximately given by eqn. 15 throughout the relevant frequency band, which is the assumed bandwidth of the signals and noises and of their CSDMs. That means that the focusing stage, ordinarily required to render the steering matrix independent of frequency, is unnecessary and the mode-space covariance matrix, being the sum of the frequency-averaged signal and noise CSDMs, is also the required frequency-domain 'smoothed' covariance matrix that enables a wideband source to be represented by a rank 1 model.

VI. Simulation results

The spectral MUSIC algorithm has been applied to three arrays placed in an ambient Gaussian noise field:

(i) A 10-element five-mode circular array of directional sensors [amplitude element patterns of $\sin^{1/2}\theta \, \cos^2(\varphi/2)$], embedded in a circumferentially isotropic and elevationwise impulsive ambient noise field, with mode-space processing applied to modes numbered {-2 to 2}. The interelement spacing is 0.3 wavelengths at the (narrowband) operating frequency or at the upper (octave-bandwidth) frequency. For the case under discussion these modes are effectively unrippled and may therefore 'impersonate' isotropic linear-array elements.

(ii) A five-element linear array of isotropic sensors at half-wavelength interelement spacing embedded in an isotropic or semi-isotropic ambient noise field.

(iii) A five-element linear array of isotropic sensors at half-wavelength interelement spacing embedded in an elevation-wise-impulsive ambient noise field at $\theta = \pi/2$.

The signal scenario consists of two equipower zero-mean Gaussian sources at bearings $\varphi = \pm 12°$ for the circular array, and at $180° \sin\varphi = \pm 12°$ for the linear array, which constitutes a sixth of the null-to-null beamwidth separation for the uniformly illuminated five-mode circular array and five-element
linear array, respectively. The simulated results shown below are aimed both at demonstrating mode-space superresolution processing and at comparing the performance of a (mode-space) circular array with that of an (element-space) linear array. It is important to realise that the linear-array to circular-array 'equivalence' means that an $M$-element linear array receiving signals from direction $\varphi$ has a resolution factor of approximately (for half-wavelength interelement spacing) $\pi \cos\varphi$ in its favour when compared with the $M$-fold mode-space processing of a circular array. This factor equals $\pi$ for sources near broadside, but becomes smaller and eventually less than unity (for $\varphi \ge 72°$) as the location of sources moves away from broadside, and, of course, the linear array lacks the full circumferential coverage of the circular array. Also, a simple comparison of a circular and a linear array of equal number of elements is misleading, as the generation of a set of $M'$ 'well behaved' phase modes requires a larger (typically twice) number of circular-array elements. Note for instance that mode number $\pm M/2$ excited in an (even) $M$-element circular array is an amplitude mode that follows a spatial cosine far-field pattern [18]. However, the physical size of the array will usually remain smaller than the length of a linear array of $M'$ elements. The circular array we have chosen to simulate comprises twice the number of the simulated linear-array elements, but its diameter (for $\lambda/3$ interelement spacing) is smaller than the long dimension of the corresponding ($\lambda/3$-spaced) linear array by an approximate factor of $\pi$. As far as our graphical output is concerned, all circular-array results are displayed in angle $\varphi°$ space, whereas the linear-array outputs have been plotted against $180° \sin\varphi$.
Fig. 1 depicts the narrowband MUSIC spectral patterns for the two arrays when the power received from each of the sources corresponds to a signal-to-noise ratio of 26 dB. It should be noticed that the two sources are easily resolved by both the linear array and the circular array, although the linear-array resolution appears to be somewhat better. In Fig. 2, there is a 99% correlation between the two sources, which are now also 5 dB more powerful. As no processing has been applied to 'decorrelate' the sources, the result is an almost complete loss of resolution. This situation is remedied by applying spatial smoothing to the arrays, with the linear-array elements and, similarly, the circular-array phase modes divided into three sets of three interlaced elements and phase modes, respectively. The resulting spectral patterns are shown in Fig. 3. Both the linear and the circular array have fully regained their resolving power, with the highest resolution exhibited by the circular array. But it should also be noted that the spatially-smoothed result for the linear array under an elevationwise impulsive noise field is biased by approximately 3 (transformed) degrees [i.e. by $\sin^{-1}(3°/180°)$], which may be attributed to its element noise not being spatially white. No such bias has been noticed in the smoothed MUSIC pattern for the circular array or for the linear array when the noise field is isotropic.

Fig. 1 Narrowband MUSIC spectral pattern for two uncorrelated sources, SNR = 25 dB (horizontal axis: angle φ, circular array, degrees; 180 sin φ, linear array, degrees)
a 5-sensor linear array in (full or hemispherically) isotropic noise
b 5-sensor linear array in elevationwise-impulsive noise
c 10-sensor/5-mode circular array in elevationwise-impulsive noise

Fig. 2 Narrowband MUSIC spectral pattern for unsmoothed arrays excited by two 99% correlated sources, SNR = 30 dB (horizontal axis: angle φ, circular array, degrees; 180 sin φ, linear array, degrees)
a 5-sensor linear array in (full or hemispherically) isotropic noise
b 5-sensor linear array in
elevationwise-impulsive noise
c 10-sensor/5-mode circular array in elevationwise-impulsive noise


SMOOTHING METHODS IN MAXIMUM ENTROPY LANGUAGE MODELING

S.C. Martin, H. Ney, J. Zaplo

Lehrstuhl für Informatik VI, RWTH Aachen, University of Technology, D-52056 Aachen, Germany

ABSTRACT

This paper discusses various aspects of smoothing techniques in maximum entropy language modeling, a topic not sufficiently covered by previous publications. We show (1) that straightforward maximum entropy models with nested features, e.g. tri-, bi-, and unigrams, result in unsmoothed relative frequencies models; (2) that maximum entropy models with nested features and discounted feature counts approximate backing-off smoothed relative frequencies models with Kneser's advanced marginal back-off distribution; this explains some of the reported success of maximum entropy models in the past; (3) perplexity results for nested and non-nested features, e.g. trigrams and distance-trigrams, on a 4-million word subset of the Wall Street Journal Corpus, showing that the smoothing method has more effect on the perplexity than the method to combine information.

1. MAXIMUM ENTROPY APPROACH

The maximum entropy principle [1, 5] is a well-defined method for incorporating different types of features into a language model [4, 9]. For a word $w$ given its history $h$ it has the following functional form [2, pp. 83-87]:

$$p_\Lambda(w \mid h) = \frac{\exp\left[ \sum_i \lambda_i f_i(h, w) \right]}{\sum_{w'} \exp\left[ \sum_i \lambda_i f_i(h, w') \right]} \qquad (1)$$

The parameters $\Lambda = \{\lambda_i\}$ are determined by a set of constraint equations that relate a $\Lambda$-dependent auxiliary function to the $\Lambda$-independent feature counts. There is no closed solution to the set of constraint equations. We train them with the Generalized Iterative Scaling (GIS) algorithm [3], implemented as described in [10] with the addition of Ristad's speedup technique [11].

In this paper the baseline maximum entropy model uses the nested trigram, bigram, and unigram features. Each feature is a binary indicator function that equals 1 if three, two, and one word-identity conditions hold, respectively, and 0 otherwise.

Motivated by the good results in [10], the baseline model is extended by non-nested features: either two types of distance-2-trigrams, each an indicator over three word-identity conditions, or, alternatively, distance-3-bigrams and distance-4-bigrams, each an indicator over two word-identity conditions.

As opposed to the non-nested features, closed
solutions exist in part for nested features, allowing some analysis of the maximum entropy models.

2. SMOOTHING OF MAXIMUM ENTROPY MODELS

2.1. Unsmoothed Models: Relative Frequencies

For the straightforward baseline maximum entropy model, there is a closed solution due to the nested features. The constraint equations result in relative frequencies: for a trigram $(u, v, w)$, $p(w \mid u, v) = N(u, v, w) / N(u, v)$.

Since the probabilities of all seen trigrams of a given history sum up to 1, the probability of unseen trigrams is not properly defined using the model of Eq. (1) because of …, even though bigram and unigram features exist for backing-off. Therefore, smoothing must be applied, a technique that redistributes probability mass from seen to unseen events [8].

2.2. Smoothing Using Cut-Offs and Absolute Discounting

We do not know an obvious smoothing technique for maximum entropy, so we adapted techniques from known smoothing methods:

Cut-Offs: Probability mass is gained by omitting features with a feature count at or below a threshold. However, this results in a coarser model.

Absolute Discounting: This method was first presented in [10] without detailed analysis. All features with a positive feature count are allowed. Probability mass is gained by reducing the feature count by a fixed discounting value. It is important to note that we now diverge from the maximum likelihood principle and risk inconsistent constraint equations. In the experiments, we use three different discounting values for trigram, bigram and unigram features, respectively.

We analyse the effect of the smoothing methods for the case that all bigram features are seen and thus not smoothed. This is unrealistic but leads to a closed solution. If we apply both smoothing methods at the same time, we get the model

… (2)

and the constraint equations for the upper branch

… (3)

resulting in

… (4)

The approximation is possible because almost all trigrams are unseen in real cases. The computation, also using Eq. (3), results in

… (5)

Thus, the resulting model is a standard backing-off model [8],
but with a back-off distribution not known from previous publications. However, this back-off distribution is not properly defined if …. For absolute discounting, we have … Thus, the resulting model is a standard backing-off model [8] with a back-off distribution known as Kneser's marginal distribution [7]. A closed solution including unigram features has not yet been found for either smoothing approach, but we assume that the resulting models would be similar to the above.

Table 1: CPU hours per GIS iteration for maximum entropy models with different features on an Alpha 21164 500 MHz processor on the WSJ0-4M corpus.

3. EXPERIMENTAL RESULTS

For the experiments, a 4.5-million word text from the Wall Street Journal task was used (exact size: 4,472,827 words). The vocabulary consisted of approximately 20,000 words (vocab20o.nvp). All other words in the text were replaced by the label <UNK> for the unknown word. The test set perplexity was calculated on a separate text of 325,000 words. In the perplexity calculations, the unknown word was included. The corpora used are the same as in [8] and [9]. The CPU time needed for the improved GIS training can be seen in Table 1.

For nested features, we compared the two smoothing methods for maximum entropy with known smoothing methods for relative frequencies [8]: (1) backing-off with absolute discounting and relative frequencies back-off distribution, (2) the same, but with Kneser's marginal back-off distribution, and (3) the standard smoothing method at our site, interpolation with absolute discounting and singleton back-off distribution.

| Model | PP |
| --- | --- |
| smoothed relative frequencies: backing-off | 163.4 |
| smoothed relative frequencies: backing-off, marginal distribution | 153.2 |
| smoothed relative frequencies: standard model | 152.1 |
| interpolation of standard model and maximum entropy (20 iterations): cut-offs | 148.2 |
| interpolation of standard model and maximum entropy (20 iterations): absolute discounting | 150.0 |

… back-off distributions. This underlines that the discounted maximum entropy model only approximates the latter. The cut-off maximum entropy model performs worse, probably because of the poorer modeling and the
problematic back-off distribution. The standard model performs best, because it employs interpolation instead of backing-off for smoothing. The superiority of smoothing by interpolation over smoothing by backing-off has been observed earlier [8]. Interpolating the standard model with the maximum entropy models results in only a modest improvement. All these results show that the performance of a language model with nested features is clearly dominated by the smoothing method, not by the way the features are combined. A baseline maximum entropy model with a better smoothing method or more efficient features may exist but has yet to be found.

For non-nested features, we compared the effects of extending the models by distance-2-trigrams. For smoothed relative frequencies, each of the three trigram models was separately smoothed by absolute discounting and interpolation, like the standard model, with and without the singleton back-off distribution. The discounting parameters were estimated using leaving-one-out. The three smoothed models were combined by linear interpolation, with interpolation parameters estimated by a simplified cross-validation method. The contesting maximum entropy model was extended by the distance-2-trigram features and initialized for GIS training with the parameters from the baseline nested trigram model. The discounting parameter for absolute discounting for both distance-2-trigram features and the number of GIS iterations were optimized on the testing data. Thus, the training procedure was slightly in favour of the maximum entropy models.
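The combination and evaluation steps described above can be illustrated with a minimal Python sketch: linear interpolation of the probabilities that several component models assign to the same word, followed by the test set perplexity of the combined model. The function names, the example probabilities and the weights are assumptions chosen for illustration; they are not taken from the experiments.

```python
import math


def interpolate(component_probs, weights):
    """Linearly interpolate component probabilities p_i(w | h) for one event.

    The weights are assumed to be non-negative and to sum to 1, so that the
    combined values still form a probability distribution.
    """
    if len(component_probs) != len(weights):
        raise ValueError("need exactly one weight per component model")
    if any(lam < 0 for lam in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(lam * p for lam, p in zip(weights, component_probs))


def perplexity(word_probs):
    """Perplexity of a test text: exp of the negative mean log-probability."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))


# Hypothetical numbers, for illustration only (not results from the paper):
# each row holds the three component probabilities for one test word.
rows = [(0.20, 0.10, 0.40), (0.05, 0.02, 0.01), (0.30, 0.25, 0.35)]
weights = (0.5, 0.25, 0.25)  # assumed interpolation parameters
combined = [interpolate(row, weights) for row in rows]
print(perplexity(combined))
```

In practice the weights would be estimated on held-out data, as with the cross-validation method mentioned above, rather than fixed by hand as in this sketch.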
Even so, as seen from Table 3, the maximum entropy models are still outperformed by the smoothed relative frequencies model with marginal back-off distribution. The interpolation of the maximum entropy model with the standard model results in only a slight perplexity improvement. Again, results are dominated by the smoothing method.

Table 3: Test set perplexities for trigram and distance-2-trigram language models on the WSJ0-4M corpus.
Model                                                      PP
maximum entropy:
  absolute discounting, 3 iter.                            146.9
interpolation of smoothed relative frequencies:
  singleton distribution                                   148.6
interpolation of standard model and maximum entropy:
  absolute discounting, 3 iter.                            142.7

The extension of the trigram models by distance bigrams was performed in the very same way, but with a slightly different result. As can be seen from Table 4, the maximum entropy model now reaches the performance of the smoothed relative frequencies model. An explanation could be that smoothing has a weaker effect on bigrams because bigrams are better trained than trigrams.
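To make the notion of a distance bigram concrete, the following minimal Python sketch counts such features in a token sequence. It assumes the common definition in which a distance-d bigram pairs the word d positions back with the current word, so that d = 1 gives ordinary bigrams; the function name and the toy sentence are hypothetical and not taken from the paper.

```python
from collections import Counter


def distance_bigram_counts(tokens, distance):
    """Count pairs (w[n - distance], w[n]) over a token sequence.

    Assumed definition: distance=1 yields ordinary bigrams, distance=2
    skips exactly one intervening word, and so on.
    """
    if distance < 1:
        raise ValueError("distance must be a positive integer")
    # zip stops at the shorter sequence, so no pair runs past the text end.
    return Counter(zip(tokens, tokens[distance:]))


# Toy example for illustration only.
tokens = "the cat sat on the mat".split()
print(distance_bigram_counts(tokens, 2))
```

Because each word pair is observed across many different intervening words, such counts tend to be larger than the corresponding trigram counts, which is consistent with the remark above that bigrams are better trained than trigrams.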
Thus, the way in which the features are combined becomes more dominant, obviously in favour of the maximum entropy model, as theory suggests [1, 9]. Compared to the backing-off smoothed relative frequencies model without marginal back-off distribution, we get a reduction in perplexity of 10% for the maximum entropy model with distance-m-gram features. A similar figure is reported in [9] using Good-Turing smoothing [6] for the maximum entropy model [9, p. 204], a smoothing method comparable to absolute discounting [8]. However, as can be seen from Table 2, roughly a third of this perplexity reduction is already achieved by the marginal back-off distribution implicitly modeled by the maximum entropy model without distance-m-grams, a fact not discussed in earlier publications.

4. CONCLUSION

In this paper we discussed various aspects of smoothing techniques in maximum entropy language modeling.

For nested features,
• the unsmoothed maximum entropy model leads to relative frequencies without proper probabilities for events not seen in training;
• discounted feature counts approximate the well-known backing-off smoothing, implicitly using Kneser’s advanced marginal back-off distribution;
• the discounted maximum entropy model is outperformed by relative frequencies models with state-of-the-art smoothing.

For non-nested features,
• no closed solutions are known;
• if smoothing is important, smoothing methods, not the method of integrating information, dominate the global performance of language models;
• if the features become better trained, smoothing becomes less important, and maximum entropy appears to outperform linear interpolation.

The authors would like to thank Christoph Hamacher for his support in the experiments.

5. REFERENCES

[1] A. L. Berger, S. Della Pietra, V. Della Pietra: “A Maximum Entropy Approach to Natural Language Processing”, Computational Linguistics, Vol. 22, No. 1, pp. 39–71, 1996.
[2] Y. M. M. Bishop, S. E. Fienberg, P. W. Holland: Discrete Multivariate Analysis, MIT Press, Cambridge, MA, 1975.
[3] J. N. Darroch, D. Ratcliff: “Generalized
Iterative Scaling for Log-Linear Models”, Annals of Mathematical Statistics, Vol. 43, pp. 1470–1480, 1972.
[4] S. Della Pietra, V. Della Pietra, R. L. Mercer, S. Roukos: “Adaptive Language Modeling Using Minimum Discriminant Information”, IEEE Int. Conf. on Acoustics, Speech and Signal Processing, San Francisco, CA, Vol. I, pp. 633–636, 1992.
[5] S. Della Pietra, V. Della Pietra, J. Lafferty: “Inducing Features of Random Fields”, Technical Report CMU-CS-95-144, Carnegie Mellon University, Pittsburgh, PA, 1995.
[6] I. J. Good: “The Population Frequencies of Species and the Estimation of Population Parameters”, Biometrika, Vol. 40, pp. 237–264, Dec. 1953.
[7] R. Kneser, H. Ney: “Improved Backing-Off for M-gram Language Modeling”, IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Detroit, MI, Vol. I, pp. 181–184, May 1995.
[8] H. Ney, F. Wessel, S. C. Martin: “Statistical Language Modeling Using Leaving-One-Out”, In S. Young, G. Bloothooft (eds.): Corpus-Based Methods in Speech and Language, Kluwer Academic Publishers, pp. 174–207, 1997.
[9] R. Rosenfeld: “A Maximum Entropy Approach to Adaptive Statistical Language Modeling”, Computer Speech and Language, Vol. 10, No. 3, pp. 187–228, July 1996.
[10] M. Simons, H. Ney, S. C. Martin: “Distant Bigram Language Modelling using Maximum Entropy”, IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Munich, Vol. II, pp. 787–790, April 1997.
[11] A. Stolcke, C. Chelba, D. Engle, V. Jimenez, L. Mangu, H. Printz, E. Ristad, R. Rosenfeld, D. Wu, F. Jelinek, S. Khudanpur: “Dependency Language Modeling”, 1996 Large Vocabulary Continuous Speech Recognition Summer Research Workshop Technical Reports, Research Note 24, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, April 1997.