Introduction to Statistical Learning (Chinese Edition)

Introduction to Statistical Learning (《统计学习导论》) is a classic textbook on statistical learning, widely used in both teaching and research.
The book was written by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, and was first published in 2013.
A Chinese edition, 《统计学习导论》, has also been published.
This article takes the Chinese edition as its subject and introduces the book's content, characteristics and influence.
First, the content. The book covers the basic concepts, methods and applications of statistical learning.
Starting from first principles, it gives a broad account of the theoretical foundations and practical techniques of the field, including linear regression and classification, resampling techniques such as cross-validation and the bootstrap, model selection and regularization, tree-based methods, support vector machines, and unsupervised learning.
The book's strength lies in its systematic and comprehensive treatment.
It proceeds step by step from basic concepts to the principles and applications of each statistical learning method, and when introducing a method it presents both the underlying rationale and concrete case studies with accompanying code (the book's labs are written in R).
This combination makes it suitable as an introductory textbook on statistical learning as well as a reference for researchers and practitioners.
Its influence is equally hard to ignore.
It has become the standard textbook for many courses related to statistical learning and is popular with students and teachers alike.
The book has also had an important impact on research in related areas: its methods and ideas are widely applied in machine learning, data mining, pattern recognition and other fields, providing powerful tools for solving practical problems.
In summary, the Chinese edition of Introduction to Statistical Learning holds an important position in the field of statistical learning.
Its content is comprehensive and systematic, suitable both for beginners and for researchers and practitioners who want to study and apply the material in depth.
Its methods and ideas are widely used in solving real problems and have advanced the development of the field.
For students, teachers and researchers alike, Introduction to Statistical Learning is a rare treasure of a book.
Randomness

Paul Vitányi (CWI and Universiteit van Amsterdam)

Abstract. Here we present in a single essay a combination and completion of the several aspects of the problem of randomness of individual objects which of necessity occur scattered in our text [10]. The reader can consult different arrangements of parts of the material in [7, 20].

Contents

1 Introduction
 1.1 Occam's Razor Revisited
 1.2 Lacuna of Classical Probability Theory
 1.3 Lacuna of Information Theory
2 Randomness as Unpredictability
 2.1 Von Mises' Collectives
 2.2 Wald-Church Place Selection
3 Randomness as Incompressibility
 3.1 Kolmogorov Complexity
 3.2 Complexity Oscillations
 3.3 Relation with Unpredictability
 3.4 Kolmogorov-Loveland Place Selection
4 Randomness as Membership of All Large Majorities
 4.1 Typicality
 4.2 Randomness in Martin-Löf's Sense
 4.3 Random Finite Sequences
 4.4 Random Infinite Sequences
 4.5 Randomness of Individual Sequences Resolved
5 Applications
 5.1 Prediction
 5.2 Gödel's incompleteness result
 5.3 Lower bounds
 5.4 Statistical Properties of Finite Sequences
 5.5 Chaos and Predictability

1 Introduction

Pierre-Simon Laplace (1749–1827) has pointed out the following reason why intuitively a regular outcome of a random event is unlikely.

"We arrange in our thought all possible events in various classes; and we regard as extraordinary those classes which include a very small number. In the game of heads and tails, if head comes up a hundred times in a row then this appears to us extraordinary, because the almost infinite number of combinations that can arise in a hundred throws are divided in regular sequences, or those in which we observe a rule that is easy to grasp, and in irregular sequences, that are incomparably more numerous." [P.-S. Laplace, A Philosophical Essay on Probabilities, Dover, 1952. Originally published in 1819. Translated from the 6th French edition. Pages 16–17.]

If by 'regularity' we mean that the complexity is significantly less than maximal, then the number of all regular events is small (because by simple counting the number of different objects of low complexity is small). Therefore, the event that any one of them occurs has small probability (in the uniform distribution). Yet the classical calculus of probabilities tells us that 100 heads are just as probable as any other sequence of heads and tails, even though our intuition tells us that it is less 'random' than some others. Listen to the redoubtable Dr.
Samuel Johnson (1709–1784):

"Dr. Beattie observed, as something remarkable which had happened to him, that he chanced to see both the No. 1 and the No. 1000 of the hackney-coaches, the first and the last; 'Why, Sir,' said Johnson, 'there is an equal chance for one's seeing those two numbers as any other two.' He was clearly right; yet the seeing of two extremes, each of which is in some degree more conspicuous than the rest, could not but strike one in a stronger manner than the sight of any other two numbers." [James Boswell (1740–1795), Life of Johnson, Oxford University Press, Oxford, UK, 1970. (Edited by R.W. Chapman, 1904 Oxford edition, as corrected by J.D. Fleeman, third edition. Originally published in 1791.) Pages 1319–1320.]

Laplace distinguishes between the object itself and a cause of the object.

"The regular combinations occur more rarely only because they are less numerous. If we seek a cause wherever we perceive symmetry, it is not that we regard the symmetrical event as less possible than the others, but, since this event ought to be the effect of a regular cause or that of chance, the first of these suppositions is more probable than the second. On a table we see letters arranged in this order: C o n s t a n t i n o p l e, and we judge that this arrangement is not the result of chance, not because it is less possible than others, for if this word were not employed in any language we would not suspect it came from any particular cause, but this word being in use among us, it is incomparably more probable that some person has thus arranged the aforesaid letters than that this arrangement is due to chance." [P.-S. Laplace, Ibid.]

Let us try to turn Laplace's argument into a formal one. First we introduce some notation. If x is a finite binary sequence, then l(x) denotes the length (number of occurrences of binary digits) in x. For example, l(010) = 3.

1.1 Occam's Razor Revisited

Suppose we observe a binary string x of length l(x) = n and want to know whether we must attribute the occurrence of x to pure chance or to a cause.
To put things in a mathematical framework, we define chance to mean that the literal x is produced by independent tosses of a fair coin. More subtle is the interpretation of cause as meaning that the computer on our desk computes x from a program provided by independent tosses of a fair coin. The chance of generating x literally is about 2^-n. But the chance of generating x in the form of a short program x*, the cause from which our computer computes x, is at least 2^-l(x*). In other words, if x is regular, then l(x*) ≪ n, and it is about 2^(n-l(x*)) times more likely that x arose as the result of computation from some simple cause (like a short program x*) than literally by a random process.

This approach will lead to an objective and absolute version of the classic maxim of William of Ockham (1290?–1349?), known as Occam's razor: "if there are alternative explanations for a phenomenon, then, all other things being equal, we should select the simplest one". One identifies 'simplicity of an object' with 'an object having a short effective description'. In other words, a priori we consider objects with short descriptions more likely than objects with only long descriptions. That is, objects with low complexity have high probability while objects with high complexity have low probability. This principle is intimately related with problems in both probability theory and information theory. These problems, as outlined below, can be interpreted as saying that the related disciplines are not 'tight' enough; they leave things unspecified which our intuition tells us should be dealt with.

1.2 Lacuna of Classical Probability Theory

An adversary claims to have a true random coin and invites us to bet on the outcome. The coin produces a hundred heads in a row. We say that the coin cannot be fair. The adversary, however, appeals to probability theory, which says that each sequence of outcomes of a hundred coin flips is equally likely, 1/2^100, and one sequence had to come up. Probability theory gives us no basis to challenge an outcome after it has happened. We could only exclude unfairness in advance by putting a penalty side-bet on an outcome of 100 heads. But what about 1010...? What about an initial segment of the binary expansion of π?

Regular sequence: Pr(00000000000000000000000000) = 1/2^26
Random sequence: Pr(10010011011000111011010000) = 1/2^26

1.3 Lacuna of Information Theory

In Shannon's theory of information, the information in a message is the number of bits needed to select it from an ensemble of possible messages; all messages being equally probable, this quantity is the number of bits needed to count all possibilities. This expresses the fact that each message in the ensemble can be communicated using this number of bits. However, it does not say anything about the number of bits needed to convey any individual message in the ensemble. To illustrate this, consider the ensemble consisting of all binary strings of length 9999999999999999. By Shannon's measure, we require 9999999999999999 bits on the average to encode a string in such an ensemble. However, the string consisting of 9999999999999999 1's can be encoded in about 55 bits by expressing 9999999999999999 in binary and adding the repeated pattern '1'. A requirement for this to work is that we have agreed on an algorithm that decodes the encoded string.
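The bit count is easy to verify (a quick sketch in Python; the variable names are ours):

```python
n = 9999999999999999

# Describe the all-1s string by its length written in binary, plus the
# repeated pattern '1'.  The length alone costs:
print(len(bin(n)) - 2)   # 54 bits -- hence "about 55 bits" in total,
                         # versus n bits under Shannon's ensemble measure
```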
We can compress the string still further when we note that 9999999999999999 equals 3^2 × 1111111111111111, and that 1111111111111111 consists of 2^4 1's. Thus, we have discovered an interesting phenomenon: the description of some strings can be compressed considerably, provided they exhibit enough regularity. This observation, of course, is the basis of all systems to express very large numbers and was exploited early on by Archimedes (287 BC–212 BC) in his treatise The Sand-Reckoner, in which he proposes a system to name very large numbers:

"There are some, King Gelon, who think that the number of sand is infinite in multitude [... or] that no number has been named which is great enough to exceed its multitude. [...] But I will try to show you, by geometrical proofs, which you will be able to follow, that, of the numbers named by me [...] some exceed not only the mass of sand equal in magnitude to the earth filled up in the way described, but also that of a mass equal in magnitude to the universe." [Archimedes, The Sand-Reckoner, pp. 420–429 in: The World of Mathematics, Vol. 1, J.R. Newman, Ed., Simon and Schuster, New York, 1956. Page 420.]

However, if regularity is lacking, it becomes more cumbersome to express large numbers. For instance, it seems easier to compress the number 'one billion' than the number 'one billion seven hundred thirty-five million two hundred sixty-eight thousand and three hundred ninety-four', even though they are of the same order of magnitude.

The above example shows that we need too many bits to transmit regular objects. The converse problem, too few bits, arises as well, since Shannon's theory of information and communication deals with the specific technology problem of data transmission. That is, with the information that needs to be transmitted in order to select an object from a previously agreed upon set of alternatives; agreed upon by both the sender and the receiver of the message.
If we have an ensemble consisting of the Odyssey and the sentence "let's go drink a beer", then we can transmit the Odyssey using only one bit. Yet Greeks feel that Homer's book has more information content. Our task is to widen the limited set of alternatives until it is universal. We aim at a notion of 'absolute' information of individual objects, which is the information which by itself describes the object completely. Formulation of these considerations in an objective manner leads again to the notion of shortest programs and Kolmogorov complexity.

2 Randomness as Unpredictability

What is the proper definition of a random sequence, the 'lacuna in probability theory' we have identified above? Let us consider how mathematicians test randomness of individual sequences. To measure randomness, criteria have been developed which certify this quality. Yet, in recognition that they do not measure 'true' randomness, we call these criteria 'pseudo' randomness tests. For instance, statistical surveys of initial segments of the sequence of decimal digits of π have failed to disclose any significant deviations from randomness. But clearly, this sequence is so regular that it can be described by a simple program to compute it, and this program can be expressed in a few bits.

"Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin. For, as has been pointed out several times, there is no such thing as a random number - there are only methods to produce random numbers, and a strict arithmetical procedure is of course not such a method. (It is true that a problem we suspect of being solvable by random methods may be solvable by some rigorously defined sequence, but this is a deeper mathematical question than we can go into now.)" [John Louis von Neumann (1903–1957), Various techniques used in connection with random digits, J. Res. Nat. Bur. Stand. Appl. Math. Series, 3 (1951), pp. 36–38. Page 36. Also, Collected Works, Vol. 1, A.H. Taub, Ed., Pergamon Press, Oxford, 1963, pp. 768–770. Page 768.]

This fact prompts more sophisticated definitions of randomness. In his famous address to the International Congress of Mathematicians in 1900, David Hilbert (1862–1943) proposed twenty-three mathematical problems as a program to direct the mathematical efforts in the twentieth century. The 6th problem asks "to treat (in the same manner as geometry) by means of axioms, those physical sciences in which mathematics plays an important part; in the first rank are the theory of probability...". Thus, Hilbert views probability theory as a physical applied theory. This raises the question about the properties one can expect from typical outcomes of physical random sources, which a priori have no relation whatsoever with an axiomatic mathematical theory of probabilities.
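As an aside, the 'simple program' for π mentioned above is easy to exhibit. The sketch below is our own construction (Machin's formula, pi/4 = 4 arctan(1/5) - arctan(1/239), in scaled integer arithmetic); its text is short, yet its output passes the digit-frequency test just described:

```python
from collections import Counter

def pi_digits(n):
    """First n decimal digits of pi via Machin's formula,
    computed in scaled integer arithmetic."""
    scale = 10 ** (n + 10)                 # 10 guard digits
    def arctan_inv(x):                     # arctan(1/x) * scale
        total, term, k = 0, scale // x, 1
        while term:
            q = term // k
            total += q if k % 4 == 1 else -q
            term //= x * x
            k += 2
        return total
    return str(4 * (4 * arctan_inv(5) - arctan_inv(239)))[:n]

digits = pi_digits(1_000)
print(digits[:10])       # 3141592653
print(Counter(digits))   # each decimal digit occurs roughly 100 times
```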
That is, a mathematical system has no direct relation with physical reality. To obtain a mathematical system that is an appropriate model of physical phenomena one needs to identify and codify essential properties of the phenomena under consideration by empirical observations.

Notably Richard von Mises (1883–1953) proposed notions that approach the very essence of true randomness of physical phenomena. This is related with the construction of a formal mathematical theory of probability, to form a basis for real applications, in the early part of this century. While von Mises' objective was to justify the applications to the real phenomena, Andrei Nikolaevitch Kolmogorov's (1903–1987) classic 1933 treatment constructs a purely axiomatic theory of probability on the basis of set theoretic axioms.

"This theory was so successful, that the problem of finding the basis of real applications of the results of the mathematical theory of probability became rather secondary to many investigators. ... [however] the basis for the applicability of the results of the mathematical theory of probability to real 'random phenomena' must depend in some form on the frequency concept of probability, the unavoidable nature of which has been established by von Mises in a spirited manner." [A.N. Kolmogorov, On tables of random numbers, Sankhyā, Series A, 25 (1963), 369–376. Page 369.]

The point made is that the axioms of probability theory are designed so that abstract probabilities can be computed, but nothing is said about what probability really means, or how the concept can be applied meaningfully to the actual world. Von Mises analyzed this issue in detail, and suggested that a proper definition of probability depends on obtaining a proper definition of a random sequence. This makes him a 'frequentist', a supporter of the frequency theory.

The following interpretation and formulation of this theory is due to John Edensor Littlewood (1885–1977), The dilemma of probability theory, Littlewood's Miscellany, Revised Edition, B. Bollobás, Ed., Cambridge University Press, 1986, pp. 71–73. The frequency theory to interpret probability says, roughly, that if we perform an experiment many times, then the ratio of favorable outcomes to the total number n of experiments will, with certainty, tend to a limit, p say, as n → ∞. This tells us something about the meaning of probability, namely, the measure of the positive outcomes is p. But suppose we throw a coin 1000 times and wish to know what to expect. Is 1000 enough for convergence to happen? The statement above does not say. So we have to add something about the rate of convergence. But we cannot assert a certainty about a particular number of n throws, such as 'the proportion of heads will be p ± ε for large enough n (with ε depending on n)'. We can at best say 'the proportion will lie between p ± ε with at least such and such probability (depending on ε and n0) whenever n > n0'. But now we have defined probability in an obviously circular fashion.

2.1 Von Mises' Collectives

In 1919 von Mises proposed to eliminate the problem by simply dividing all infinite sequences into special random sequences (called collectives), having relative frequency limits, which are the proper subject of the calculus of probabilities, and other sequences. He postulates the existence of random sequences (thereby circumventing circularity) as certified by abundant empirical evidence, in the manner of physical laws, and derives mathematical laws of probability as a consequence. In his view a naturally occurring sequence can be nonrandom or unlawful in the sense that it is not a proper
collective. Von Mises views the theory of probabilities, insofar as they are numerically representable, as a physical theory of definitely observable phenomena, repetitive or mass events, for instance as found in games of chance, population statistics, Brownian motion. 'Probability' is a primitive notion of the theory, comparable to those of 'energy' or 'mass' in other physical theories.

Whereas energy or mass exist in fields or material objects, probabilities exist only in the similarly mathematical idealization of collectives (random sequences). All problems of the theory of probability consist of deriving, according to certain rules, new collectives from given ones and calculating the distributions of these new collectives. The exact formulation of the properties of the collectives is secondary and must be based on empirical evidence. These properties are the existence of a limiting relative frequency and randomness.

The property of randomness is a generalization of the abundant experience in gambling houses, namely, the impossibility of a successful gambling system. Including this principle in the foundation of probability, von Mises argues, we proceed in the same way as the physicists did in the case of the energy principle. Here too, the experience of hunters of fortune is complemented by solid experience of insurance companies and so forth.

A fundamentally different approach is to justify a posteriori the application of a purely mathematically constructed theory of probability, such as the theory resulting from the Kolmogorov axioms. Suppose we can show that the appropriately defined random sequences form a set of measure one, and without exception satisfy all laws of a given axiomatic theory of probability. Then it appears practically justifiable to assume that as a result of an (infinite) experiment only random sequences appear.

Von Mises' notion of an infinite random sequence of 0's and 1's (collective) essentially appeals to the idea that no gambler, making a fixed number of wagers of 'heads', at fixed odds [say p versus 1−p] and in fixed amounts, on the flips of a coin [with bias p versus 1−p], can have profit in the long run from betting according to a system instead of betting at random. Says Alonzo Church (1903–1995): "this definition [below] ... while clear as to general intent, is too inexact in form to serve satisfactorily as the basis of a mathematical theory." [A. Church, On the concept of a random sequence, Bull. Amer. Math. Soc., 46 (1940), pp. 130–135. Page 130.]

Definition 1. An infinite sequence a1, a2, ... of 0's and 1's is a random sequence in the special meaning of collective if the following two conditions are satisfied.

1. Let f_n be the number of 1's among the first n terms of the sequence. Then the limit lim_{n→∞} f_n/n = p exists, with 0 < p < 1.

2. Every infinite subsequence selected from the sequence by an admissible place-selection rule φ (where the decision to select a_i depends only on a_1, ..., a_{i−1}) satisfies the same condition, with the same limit p.

"we should distinguish between randomness proper (as absence of any regularity) and stochastic randomness (which is the subject of probability theory). There emerges the problem of finding reasons for the applicability of the mathematical theory of probability to the real world." [A.N. Kolmogorov, On logical foundations of probability theory, Probability Theory and Mathematical Statistics, Lecture Notes in Mathematics, Vol. 1021, K. Itô and J.V. Prokhorov, Eds., Springer-Verlag, Heidelberg, 1983, pp. 1–5. Page 1.]

Intuitively, we can distinguish between sequences that are irregular and do not satisfy the regularity implicit in stochastic randomness, and sequences that are irregular but do satisfy the regularities associated with stochastic randomness.
Formally, we will distinguish the second type from the first type by whether or not a certain complexity measure of the initial segments goes to a definite limit. The complexity measure referred to is the length of the shortest description of the prefix (in the precise sense of Kolmogorov complexity) divided by its length. It will turn out that almost all infinite strings are irregular of the second type and satisfy all regularities of stochastic randomness.

"In applying probability theory we do not confine ourselves to negating regularity, but from the hypothesis of randomness of the observed phenomena we draw definite positive conclusions." [A.N. Kolmogorov, Combinatorial foundations of information theory and the calculus of probabilities, Russian Mathematical Surveys, 38:4 (1983), pp. 29–40. Page 34.]

Considering the sequence as fair coin tosses with p = 1/2, the second condition in Definition 1 says there is no strategy φ (principle of the excluded gambling system) which assures a player betting at fixed odds and in fixed amounts, on the tosses of the coin, to make infinite gain. That is, no advantage is gained in the long run by following some system, such as betting 'head' after each run of seven consecutive tails, or (more plausibly) by placing the n-th bet 'head' after the appearance of n+7 tails in succession. According to von Mises, the above conditions are sufficiently familiar and an uncontroverted empirical generalization to serve as the basis of an applicable calculus of probabilities.

Example 1. It turns out that the naive mathematical approach to a concrete formulation, admitting simply all partial functions, comes to grief as follows. Let a = a1 a2 ... be any collective. Define φ1 as φ1(a1 ... a_{i−1}) = 1 if a_i = 1, and undefined otherwise. But then p = 1. Defining φ0 by φ0(a1 ... a_{i−1}) = b_i, with b_i the complement of a_i, for all i, we obtain by the second condition of Definition 1 that p = 0. Consequently, if we allow functions like φ1 and φ0 as strategy, then von Mises' definition cannot be satisfied at all.

2.2 Wald-Church Place Selection

In the thirties, Abraham Wald (1902–1950) proposed to restrict the a priori admissible φ to any fixed countable set of functions. Then collectives do exist.
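To see the excluded-gambling-system idea in code, the sketch below (rule and function names ours) applies a computable place-selection rule in the spirit of the essay's example, selecting the outcome that follows each run of seven consecutive tails, to a sequence of fair coin flips; the selected subsequence still shows relative frequency near 1/2:

```python
import random

def select_after_tail_run(flips, run_len=7):
    """A computable place-selection rule: select the outcome following
    each run of `run_len` consecutive tails (0s).  The decision to
    select position i depends only on flips[0..i-1]."""
    selected, run = [], 0
    for b in flips:
        if run >= run_len:
            selected.append(b)
        run = run + 1 if b == 0 else 0
    return selected

random.seed(42)
flips = [random.getrandbits(1) for _ in range(1_000_000)]
sub = select_after_tail_run(flips)
print(len(sub), sum(sub) / len(sub))   # ~7800 selected, frequency near 0.5
```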
But which countable set? In 1940, Alonzo Church proposed to choose a set of functions representing 'computable' strategies. According to Church's Thesis, this is precisely the set of recursive functions. With recursive φ, not only is the definition completely rigorous, and random infinite sequences do exist, but moreover they are abundant, since the infinite random sequences with p = 1/2 form a set of measure one. From the existence of random sequences with probability 1/2, the existence of random sequences associated with other probabilities can be derived. Let us call sequences satisfying Definition 1 with recursive φ Mises-Wald-Church random. That is, the involved Mises-Wald-Church place-selection rules consist of the partial recursive functions.

Appeal to a theorem by Wald yields as a corollary that the set of Mises-Wald-Church random sequences associated with any fixed probability has the cardinality of the continuum. Moreover, each Mises-Wald-Church random sequence qualifies as a normal number. (A number is normal in the sense of Émile Félix Édouard Justin Borel (1871–1956) if each digit of the base, and each block of digits of any length, occurs with equal asymptotic frequency.) Note, however, that not every normal number is Mises-Wald-Church random. This follows, for instance, from Champernowne's sequence (or number),

0.1234567891011121314151617181920...

due to David G. Champernowne (1912–), which is normal in the scale of 10 and where the i-th digit is easily calculated from i. The definition of a Mises-Wald-Church random sequence implies that its consecutive digits cannot be effectively computed. Thus, an existence proof for Mises-Wald-Church random sequences is necessarily nonconstructive.
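Champernowne's sequence is easy to generate and test, which makes the point concrete: its digit statistics look random, yet a few lines compute it (a sketch; function names ours):

```python
from collections import Counter
from itertools import count, islice

def champernowne_digits():
    """Digits of 0.123456789101112...: concatenate 1, 2, 3, ..."""
    for n in count(1):
        yield from (int(d) for d in str(n))

freq = Counter(islice(champernowne_digits(), 1_000_000))
print({d: c / 1_000_000 for d, c in sorted(freq.items())})
# each digit's share is roughly 1/10, as Borel normality demands,
# although the i-th digit is trivially computable from i
```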
Unfortunately, the von Mises-Wald-Church definition is not yet good enough, as was shown by Jean Ville in 1939. There exist sequences that satisfy the Mises-Wald-Church definition of randomness, with limiting relative frequency of ones of 1/2, but nonetheless have the property that

f_n/n ≥ 1/2 for all n.

The probability of such a sequence of outcomes in random flips of a fair coin is zero. Intuition: if you bet '1' all the time against such a sequence of outcomes, then your accumulated gain is always positive! Similarly, other properties of randomness in probability theory, such as the Law of the Iterated Logarithm, do not follow from the Mises-Wald-Church definition. An extensive survey on these issues (and parts of the sequel) is given in [8].

3 Randomness as Incompressibility

Above it turned out that describing 'randomness' in terms of 'unpredictability' is problematic and possibly unsatisfactory. Therefore, Kolmogorov tried another approach. The antithesis of 'randomness' is 'regularity', and a finite string which is regular can be described more shortly than giving it literally. Consequently, a string which is 'incompressible' is 'random' in this sense. With respect to infinite binary sequences it is seductive to call an infinite sequence 'random' if all of its initial segments are 'random' in the above sense of being 'incompressible'. Let us see how this intuition can be made formal, and whether it leads to a satisfactory solution.

Intuitively, the amount of effectively usable information in a finite string is the size (number of binary digits or bits) of the shortest program that, without additional data, computes the string and terminates. A similar definition can be given for infinite strings, but in this case the program produces element after element forever. Thus, a long sequence of 1's such as

111...1 (10,000 times)   (1)

contains little information because a program of size about log 10,000 bits outputs it:

for i := 1 to 10,000 print 1

Likewise, the transcendental number π = 3.1415..., an infinite sequence of seemingly 'random' decimal digits, contains but a few bits of information. (There is a short program that produces the consecutive digits of π forever.) Such a definition would appear to make the amount of information in a string (or other object) depend on the particular programming language used. Fortunately, it can be shown that all reasonable choices of programming languages lead to a quantification of the amount of 'absolute' information in individual objects that is invariant up to an additive constant. We call this quantity the 'Kolmogorov complexity' of the object. If an object is regular, then it has a shorter description than itself. We call such an object 'compressible'.

More precisely, suppose we want to describe a given object by a finite binary string. We do not care whether the object has many descriptions; however, each description should describe but one object. From among all descriptions of an object we can take the length of the shortest description as a measure of the object's complexity. It is natural to call an object 'simple' if it has at least one short description, and to call it 'complex' if all of its descriptions are long.

But now we are in danger of falling into the trap so eloquently described in the Richard-Berry paradox, where we define a natural number as "the least natural number that cannot be described in less than twenty words". If this number does exist, we have just described it in thirteen words, contradicting its definitional statement. If such a number does not exist, then all natural numbers can be described in less than twenty words. (This paradox is described in [Bertrand Russell (1872–1970) and Alfred North Whitehead, Principia Mathematica, Oxford, 1917]. In a footnote they state that it "was suggested to us by Mr. G.G. Berry of the Bodleian Library".)

We need to look very carefully at the notion of 'description'. Assume that each description describes at most one object. That is, there is a specification method D which associates at most one object x with a description y. This means that D is a function from the set of descriptions, say Y, into the set of objects, say X. It seems also reasonable to require that, for each object x in X, there is a description y in Y such that D(y) = x. (Each object has a description.) To make descriptions useful we like them to be finite. This means that there are only countably many descriptions. Since there is a description for each object, there are also only countably many describable objects.

How do we measure the complexity of descriptions? Taking our cue from the theory of computation, we express descriptions as finite sequences of 0's and 1's. In communication technology, if the specification method D is known to both a sender and a receiver, then a message x can be transmitted from sender to receiver by transmitting the sequence of 0's and 1's of a description y with D(y) = x. The cost of this transmission is measured by the number of occurrences of 0's and 1's in y, that is, by the length of y. The least cost of transmission of x is given by the length of a shortest y such that D(y) = x. We choose this least cost of transmission as the 'descriptional' complexity of x under specification method D.
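A concrete, if far from optimal, specification method is D = zlib.decompress: the compressed form of x is one description of x, so its length upper-bounds the descriptional complexity of x under this D (the choice of zlib is ours, purely for illustration):

```python
import os
import zlib

def description_length(x: bytes) -> int:
    """Length of one description of x under D = zlib.decompress."""
    y = zlib.compress(x, 9)
    assert zlib.decompress(y) == x     # check that D(y) = x
    return len(y)

regular = b'1' * 10_000                # the 10,000 1's of display (1)
random_ = os.urandom(10_000)           # incompressible with overwhelming probability

print(description_length(regular))    # a few dozen bytes
print(description_length(random_))    # slightly over 10,000 bytes
```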
Obviously, this descriptional complexity of x depends crucially on D. The general principle involved is that the syntactic framework of the description language determines the succinctness of description. In order to objectively compare the descriptional complexities of objects, to be able to say "x is more complex than z", the descriptional complexity of x should depend on x alone. This complexity can be viewed as related to a universal description method which is a priori assumed by all senders and receivers. This complexity is optimal if no other description method assigns a lower complexity to any object.

We are not really interested in optimality with respect to all description methods. For specifications to be useful at all, it is necessary that the mapping from y to D(y) can be executed in an effective manner. That is, it can at least in principle be performed by humans or machines. This notion has been formalized as 'partial recursive functions'. According to generally accepted mathematical viewpoints it coincides with the intuitive notion of effective computation.

The set of partial recursive functions contains an optimal function which minimizes the description length of every other such function. We denote this function by D0. Namely, for any other recursive function D, for all objects x, there is a description y of x under D0 which is shorter than any description z of x
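To make these definitions concrete, here is a toy specification method and a brute-force computation of the shortest description length under it. This is our own illustrative construction, not the essay's; the real theory uses an optimal universal partial recursive function, under which the minimal description length (the Kolmogorov complexity) is uncomputable:

```python
from itertools import product

def D(y: str):
    """Toy specification method: '0'+p decodes to the literal string p;
    '1'*k + '0' + p (k >= 1) decodes to p repeated k+1 times."""
    if y.startswith('0'):
        return y[1:]
    if '0' in y:
        k = y.index('0')
        return y[k + 1:] * (k + 1)
    return None                        # not a valid description

def C(x: str) -> int:
    """Descriptional complexity of x under D, by brute-force search
    over all binary descriptions, shortest first."""
    for length in range(len(x) + 2):   # '0'+x always works, so this terminates
        for bits in product('01', repeat=length):
            if D(''.join(bits)) == x:
                return length

print(C('01' * 8))             # 8: '1110'+'0101' repeats '0101' four times
print(C('0110100110010110'))   # 17: no regularity D can exploit, literal only
```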
Protein Sequences

Digital Signal Processing Techniques in the Analysis of DNA/RNA and Protein Sequences

by Ivan V. Bajić

Submitted to the Department of Electronic Engineering in partial fulfillment of the requirements for the degree of Bachelor of Science in Engineering (Electronic) at the UNIVERSITY OF NATAL, December 1998. © University of Natal 1998

Signature of Author: Department of Electronic Engineering, 26 October 1998
Certified by: Professor Anthony D. Broadhurst, Chairman of the School of Electrical and Electronic Engineering, Research Head
Accepted by: ...

Abstract

This report investigates the possibility of using signal processing techniques in the analysis of biological sequences: DNA, RNA and proteins. The starting point of the investigation was the Resonant Recognition Model (RRM), and the method proposed here builds on the foundations set by the RRM. However, some fundamental modifications to the RRM had to be made in order to make it suitable for the analysis of a greater variety of sequences. The concepts were tested on the problem of promoter recognition. During the course of the project, several methods for extracting features from the spectra of biological sequences and several types of classifiers were tested. For promoter classification the suitable choices were found to be Principal Component Analysis (PCA) feature extraction and a General Regression Neural Network (GRNN) classifier. A promoter recognition system designed around this classifier showed very good performance in comparison to other promoter recognition tools. The results indicate that signal processing methods may be very suitable for analyzing biological sequences.

Research Head: Professor Anthony D. Broadhurst
Title: Chairman of the School of Electrical and Electronic Engineering

Contents

1 Introduction
 1.1 Structure and function of DNA
 1.2 The role of promoters in the synthesis of proteins
 1.3 Promoter recognition
2 The Resonant Recognition Model (RRM)
 2.1 Application of the RRM to promoter characterization
 2.2 Guidelines for development of a signal processing based method for biological sequence analysis
3 Spectral analysis of DNA/RNA and protein sequences
 3.1 Problem definition
 3.2 Reduction of spectral resolution
 3.3 An example
4 Promoter classification
 4.1 Selection of features for the input space
  4.1.1 Principal Component Analysis (PCA) feature extraction
  4.1.2 Canonical Discriminant Analysis (CDA) feature extraction
 4.2 Artificial Neural Network (ANN) classifier
  4.2.1 Generalized Regression Neural Network (GRNN)
 4.3 The choice of feature extraction method
5 Promoter recognition
 5.1 Basic promoter recognition system
 5.2 Improved promoter recognition system
 5.3 The experiment
6 Conclusion and future work
7 Appendix 1 - Glossary
8 Appendix 2 - Typical data record from the EMBL genetic sequence database
9 Appendix 3 - Some figures from Chapter 4
 9.1 Structure of the space of PSDVs from group 1
 9.2 Characteristics of the classifier with PCA feature extraction
10 Appendix 4 - Test sequences used in Chapter 5
11 Appendix 5 - MATLAB program code

Chapter 1: Introduction

Towards the end of this century, the fields of genetics and genetic engineering have experienced rapid development. Since the introduction of reliable techniques for experimental determination of deoxyribonucleic acid (DNA) sequences of various organisms, the number of such sequences available for analysis has increased greatly. This has prompted the need for the development of methods for the analysis of such sequences and the extraction of information encoded in them. This new science that deals with the extraction and utilization of information contained in biological sequences (DNA, RNA and proteins) has become known as bioinformatics. It draws its inspiration from various natural sciences: biological, physical, mathematical and computational.

At this point, we need to explain what is meant by 'DNA sequence analysis'. Just as electrical circuit analysis involves determination of voltages and currents throughout the circuit, DNA sequence analysis involves determination of the function of specific portions of the DNA sequence. It has been found that some regions (segments) of the DNA sequence are responsible for some specific tasks within the overall function of the DNA. A group of DNA segments (sub-sequences) that perform the same function within the DNA is called a functional group. Several functional groups have been experimentally discovered so far (e.g. promoters, terminators, enhancers, attenuators, etc.). In this report we concentrate on the development of a method for recognition and localization of sequences from a particular functional group: promoters. As will be seen, however, the method can be easily modified for recognition of other functional groups.

The rest of this chapter gives introductory details about the DNA and the function that promoters perform within it (the material is presented at the level of an introductory text in genetics). Chapter 2 presents a particular signal processing-based method for biological sequence analysis, the Resonant Recognition Model (RRM), and investigates its applicability to the problem of promoter recognition. In Chapter 3 an alternative method is proposed, and Chapters 4 and 5 are devoted to the investigation of the ways in which it can be used for promoter recognition. The appendices include a glossary of the genetics terminology used throughout the report, some figures and tables related to the first five chapters, and the supporting MATLAB program code.

1.1 Structure and function of DNA

The DNA molecule has the form of a double helix [1], with each strand (helix) represented by a sequence composed of four nucleotides: adenine (A), thymine (T), guanine (G) and cytosine (C). The two helices are bonded to each other with a hydrogen bond between each nucleotide pair. The only bindings possible between the two helices, due to the physical shape of the nucleotide molecules, are A-T and G-C. Hence, if we know the sequence of one of the helices, we also know the sequence of the other (for example, if a portion of one of the DNA helices is AGATC, the corresponding portion of the other helix of that DNA is TCTAG). It is a widely accepted belief that the function of the DNA is determined mainly by the sequence of nucleotides from which it is composed, which is why methods for DNA analysis are usually termed 'DNA sequence analysis' methods. Due to the one-to-one correspondence between the sequences of the two helices, as described above, most of these methods analyze the sequence of only one of the helices. The DNA molecule itself is not directly involved
in the functioning of the cell: in eukaryotic (non-bacterial) organisms it is located within the nucleus of the cell and isolated from most of the chemical processes taking place in the cytoplasm of the cell. Proteins are the main 'workhorse' of the cell and of the organism as a whole. They account for more than half of the dry weight of most cells and they perform almost all of the biochemical functions of an organism [1]: structural support, storage, transport of other substances, signalling between different parts of an organism and defense against alien substances. (Of course, each specific protein has only one function; e.g. hemoglobin, a protein present in red blood cells, transports oxygen around the organism; lysozyme, a protein present in tears, protects the surface of the eyes by destroying specific molecules on the surface of many bacteria; etc.) But the DNA sequence itself determines the structure of the proteins that the cell is able to synthesize, which is why the information it contains is so important for the functioning of any organism. Promoters, as parts of the DNA sequence, are crucial in the synthesis of proteins.

1.2 The role of promoters in the synthesis of proteins

To explain the role of promoters in the synthesis of proteins, we will look into this process in more detail. The process essentially consists of two steps [1]: transcription and translation.

In eukaryotic cells transcription takes place in the nucleus of the cell (prokaryotic (bacterial) cells do not have a nucleus, so transcription takes place in the cytoplasm, but the process is otherwise very similar to eukaryotic transcription; in further text we concentrate on the eukaryotic case). During this process a messenger ribonucleic acid (mRNA) is synthesized as a replica of a portion of one strand of the DNA. The process of transcription of a DNA sequence x_DNA(n), n = 0, 1, ..., N−1, of length N into an mRNA sequence x_mRNA(n), n = 0, 1, ..., N−1, of the same length can be described as follows:

x_mRNA(n) = x_DNA(n) if x_DNA(n) ∈ {A, G, C};  x_mRNA(n) = U if x_DNA(n) = T;  n = 0, 1, ..., N−1.   (1.1)

Thus, the information contained in the DNA sequence is exactly transcribed (a bijective mapping) into an mRNA sequence, the only difference between the two sequences being that each T in the DNA is replaced by a very similar nucleotide, uracil (U), in the mRNA.

Figure 1-1: Illustration of the transcription process

An illustration of the physical process of transcription is depicted in Figure 1-1. The figure shows how an mRNA chain is created by copying one coding region, a portion of a DNA sequence containing the information of how to synthesize a particular protein. The coding region is bounded by the promoter (the region where transcription starts) and the terminator (the region where transcription stops). A particular molecule, called RNA polymerase, enables the transcription. RNA polymerase recognizes the promoter and binds to it, thereby starting the process of transcription. As it continues to move along the DNA strand it attracts the free nucleotides floating around in the nucleus and binds them together according to the sequence determined by the DNA. When it reaches the terminator it detaches itself from the DNA and releases the newly created mRNA chain, which is a replica (in the sense of (1.1)) of the particular coding region. The role of a promoter in this process is to show the RNA polymerase exactly where to start the transcription. Obviously, recognition of the correct position for the start of transcription by RNA polymerase is of crucial importance: unless the transcription starts at the correct point, the wrong protein would be synthesized, which may ultimately result in the malfunctioning of the whole organism.
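In code, the strand-pairing rule of Section 1.1 and the mapping (1.1) are one-liners (a sketch; the function names are ours):

```python
COMPLEMENT = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def other_strand(seq: str) -> str:
    """Base pairing of Section 1.1: only A-T and G-C bindings occur."""
    return ''.join(COMPLEMENT[n] for n in seq)

def transcribe(seq: str) -> str:
    """Eq. (1.1): the mRNA replica keeps A, G, C and replaces T with U."""
    return ''.join('U' if n == 'T' else n for n in seq)

print(other_strand("AGATC"))   # TCTAG, as in the text
print(transcribe("AGATC"))     # AGAUC
```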
This newly created mRNA molecule travels from the nucleus into the cytoplasm, where translation takes place. This process is analogous to that of transcription, but this time a chain of amino-acids (instead of nucleotides) is created according to the mRNA sequence. This chain of amino-acids (also called a polypeptide) represents the protein. The process of translation will not be discussed here, since promoters do not play a significant role in it. Details on translation can be found in [1].

1.3 Promoter recognition

Since promoters are not a part of a coding region, they do not influence the structure of a subsequently synthesized protein, and therefore it is coding regions, and not promoters, that are of primary interest in the study of genetics. However, the task of locating coding regions in the DNA sequence is not an easy one. Since coding regions are responsible for encoding a great variety of proteins, it should be expected that there is a great structural variety among them, so the knowledge of the structure of several particular coding regions would not help us to recognize the others. In other words, it would be hard to make some sort of generalization in this case. Fortunately, however, we know that each coding region is preceded by a promoter, so the search for a coding region may be reduced to the search for a promoter. As we have seen above, a promoter sequence contains some unique information that enables the RNA polymerase molecule to recognize it. Thus, although the number of different promoters is large, it would be reasonable to expect that they all share some structural similarity that enables their recognition. This is the main reason that sparked the interest in the problem of computational promoter recognition, that is, the automated (computer-based) recognition and localization of promoters within the DNA sequence. It is still not known exactly what type of similarity is shared by promoters, and many approaches to the problem (arising from many theories about what this similarity may be) have been developed [3].
They range from homology-based methods to ones based on Generalized Hidden Markov Models (GHMM) and Artificial Neural Networks (ANN). A list of some of the currently used programs based on these methods is shown in Table 1.1, along with their accuracies as determined experimentally in [3]. More details about the methods used in these programs can be found in [3]. Here we will simply explain the information given in the table. The first row (TP) gives the so-called 'sensitivity', or the number of true positives (TP): the number of correctly recognized promoters on a test segment of the DNA sequence. For example, the first program (Audic) recognized 5 out of 24 promoters, or 21% (second row). The third row (FP) gives the so-called 'specificity', or the number of false positives: the number of
organic molecules[8].Values of EIIP for the four nucleotides are given in Table2:1(from[6]).Thus,a set S n=f y i g;i=1;2;:::;M of M numerical sequences is obtained.Step2.Perform the detrending operation(removal of the mean value-DC component)from eachTable2.1:Values of EIIP for the four nucleotides10of the sequences y i .Thus,a set of M detrended numerical sequences S nd =f x i g ;i =1;2;:::;M is obtained.This step is necessary since all the values of EIIP are positive so each of the sequences y i has a strong DC component,which is not important for the RRM analysis,but may cause errors due to spectral leakage later in the process.Step 3.Let the longest sequence in S nd be x i 0of length N .Perform the zero-padding operation onall sequences from S nd nf x i 0g by adding zeros to the end of each of the sequences up to the length N .Thus,a set of zero-padded sequences S (0)nd =f x (0)i g ;i =1;2;:::;M ,each of length N ,is obtained.Note that x (0)i 0=x i 0since the longest sequence is not zero-padded.Also,if there are several sequences in S nd of the same (largest)length N ,then neither of these are zero-padded.Step 4.A spectrum of each of the sequences from S (0)nd is found as X (0)i =DF T (x (0)i )whereDF T denotes the Discrete Fourier Transform.The distance between elements of x (0)iis assumed to be constant (since,indeed,distance between nucleotides in the DNA sequence is very nearly constant [6])and normalized to d =1.Thus,each of the spectra X (0)i are de…ned on a normalized spatial frequency interval [0;1),with non-redundant components in the interval [0;0:5].(This is analogous to the case where a signal is sampled with a sampling frequency of 1).Step 5.A cross-spectrum (or consensus spectrum)of sequences from S (0)nd is found asX c (j )=M Y i =1¯¯¯X (0)i (j )¯¯¯;j =0;1;:::;N ¡1(2.1)and a signal-to-noise ratio (SNR)of the consensus spectrum is found as a magnitude of the largest frequency component relative to the mean value of the spectrum [6].Step 6.The largest frequency component in the consensus spectrum is considered to be signi…cant if the value of SNR is at least 20[6].Signi…cant frequency component is the characteristic RRM frequency (f RRM )for the entire group of biological sequences having the same biological function as those in S ,since it is the strongest frequency component common to all of the biological sequences from that particular functional group.Apart from being an interesting approach to the analysis of biological sequences,RRM also o¤ers some physical explanation of the selective interactions between biological macromolecules,based on their structure.RRM proposes that these selective interactions (that is,recognition of a target molecule by another molecule,e.g.recognition of a promoter by RNA polymerase)are caused by resonant electromag-netic energy exchange -hence the name ’Resonant Recognition Model’.According to RRM,charge that is being transferred along the backbone of a macromolecule travels through the changing electric …eld described by a sequence of EIIPs,causing the radiation of some small amount of electromagnetic energy at particular frequencies that can be recognized by other molecules.If there is a frequency component11common to the radiation patterns of two molecules,then they can recognize each other and interact.So far,RRM has had some success in the design of new proteins with desired biological functions[9]and [10],but there have also been some problems in its application,as will be seen later in the text.In the following section we apply the 
RRM procedure in the attempt to characterize a group of human promoters by a common RRM frequency(f RRM).If we were able to…nd a single strong frequency component common to all promoter sequences,the search for promoters within the large segment of a DNA sequence would be signi…cantly simpli…ed.2.1Application of the RRM to promoter characterizationIn this section we examine the applicability of the RRM to the problem of promoter recognition,in the similar way as it was done in[11].For this purpose,set H of41promoters was arbitrarily extracted from the EMBL and GenBank databases(public databases that contain segments of DNA sequences examined and documented so far).Three subsets were arbitrarily chosen from set H.Sets E and F are distinct subsets of H(i.e.E\F=?and E[F=H),while G was created by combining some promoters from E with some promoters from F.RRM procedure,as described in the previous section was applied to each of the sets E,F,G and H.Results are summarized in Table2:2and resulting normalized consensus spectra are shown in Figure2-1.Results show that each of the four sets considered is characterized by a di¤erent RRM frequency. In each case SNR is much greater than20and therefore each of the frequencies may be considered as a characteristic RRM frequency for all promoters,which contradicts the fundamental hypothesis of the RRM that each group of biological sequences with the same biological function is characterized by a single frequency.Also,each of the obtained frequencies is signi…cantly di¤erent from the characteristic frequency for promoters quoted in[6]as0:34375.Reason for these discrepancies is suspected to be zero-padding in step3of the RRM procedure and is discussed in more detail in[12]and in the following section.Here it will be su¢cient to say that zero-padding results in occurrence of some frequency components in the spectrum of the zero-padded sequence whose existence is not guaranteed by the original sequence(this phenomenon,represented by the creation of sidelobes between the original frequency components,is called spectral leakage).These newly introduced frequency components enter the process of cross-spectral multiplication in step5of the RRM procedure and become candidates to be recognized as a characteristic RRM frequencies.This is probably the reason why di¤erent characteristic frequencies were obtained for the four sets of promoters examined in this section.12Table2.2:Results of RRM analysis of set H of41promoters and three of its subsetsFigure2-1:RRM consensus spectra for promoters from(a)set E;(b)set F;(c)set G;(d)set H.2.2Guidelines for development of a signal processing basedmethod for biological sequence analysisIn the previous section we saw that RRM,in its current form,cannot be applied to the problem of promoter recognition,since it is not possible to characterize promoters by a single frequency compo-13nent in the normalized spatial frequency domain.This does not mean that the concept of resonant recognition between macromolecules is wrong.Rather,it seems that signal processing aspects in the RRM are incorrectly applied in the form of zero-padding.In this section we present basic guidelines for development of a method for spectral analysis of biological sequences that attempts to avoid problems associated with the RRM that were demonstrated in the previous section.In order to do this,we must …rst examine the essence of the problem we are faced with.Consider two sequences x1and x2of lengths N1and N2,respectively,such that N1<N2.Let X1=DF T(x1)and 
X2=DF T(x2).Now,if we consider the elements of x1and x2to be equidistant, with normalized distance d=1,then X1and X2are de…ned on the following two sets of points, respectivelyS1=f k=N1j k=0;1;:::;N1¡1g(2.2)S2=f k=N2j k=0;1;:::;N2¡1gwhich,if N2is not a multiple of N1,do not have any points in common(apart from0).Equivalently,we say that X1and X2have di¤erent resolutions:1=N1and1=N2,respectively.Therefore,direct comparison of X1and X2is not possible.As an example,spectra of sequences x1=0;1;3;2and x2=1;2;3;2;1;2;0 are shown in Figure2-2.Figure shows the complete spectra on interval[0;1).Non-redundant components are,of course,lo-cated in the interval[0;0:5].Since the sequences are of di¤erent lengths(4and7),their spectra X1and X2 are de…ned on two di¤erent sets of points S1=f0;1=4;2=4;3=4g and S1=f0;1=7;2=7;3=7;4=7;5=7;6=7g, respectively.Our problem is how to compare these two spectra and see to what extent are they similar.Obviously,we have to modify sets S1and S2in some way.RRM proposes zero-padding as a way to increase the length of x1up to the length of x2in order to make sets S1and S2equal.This provides a14way of interpolation in the frequency domain and increases the computational resolution of the spectrum of x1to1=N2[13].However,due to the Uncertainty Principle of Fourier analysis[14],physical resolution of the spectrum of x1is limited by its original length and remains1=N1[13].This e¤ectively means that we cannot discover any new information about the spectrum of the signal by performing the zero-padding operation-interpolation is done on the wrong curve[13](unless the signal indeed happens to be zero outside of its original domain of de…nition,which is,in case of biological sequences,physically virtually never possible[12]).We therefore conclude that zero-padding introduces an error into the spectrum of the signal.This error can often be large enough to make zero-padding unsuitable as a way to enable the comparison of spectra of di¤erent resolutions.Hence,if this comparison cannot be made by increasing the resolution of the spectrum with lower resolution(X1),it seems natural to attempt to enable the comparison by reducing the resolution of the spectrum with higher resolution(X2).This reduction in resolution may be achieved in the following way.First,physical resolution of X1is limited to1=N1,which means that it is limited to a set of intervalsI1=f[a¡12N1;a+12N1]j a2S1g.We can therefore take as the estimate of the power of x1in anyparticular frequency interval from I1the square of the amplitude of the frequency component of x1that belongs to that interval(e.g.for the signal x1given above,estimate of its power in frequency interval [0:125;0:375]would be(3:16)2=4¼2:5,since component of X1at point0:25in the normalized frequency domain has an amplitude of3:16-refer to Figure2-2;note:division by4above is necessary since the power of the frequency component X1(k)is j X1(k)j2=N1-see[15]).Equivalently,we estimate the power of x2in any particular frequency interval from I1as the sum of squares amplitudes of frequency components of x2that belong to that interval(e.g.for the signal x2given above,estimate of its power in frequency interval[0:125;0:375]would be(3:36)2=7+(2:21)2=7¼2:31,since the components of X2that belong to[0:125;0:375]have amplitudes of3:36and2:21-refer to Figure2-2).Performing this process for all intervals from I1for signals x1and x2given above,we arrive at the following two graphs shown in Figure2-3.From the…gure,we see that estimated power of signals represented by 
sequences x1 and x2 is different in the intervals [0, 0.125] and [0.375, 0.625], but is fairly similar in the intervals [0.125, 0.375] and [0.625, 0.875], which is what might have been expected from the graphs in Figure 2-2. On the other hand, consider what happens when x1 is zero-padded up to length 7 (Figure 2-4). Now the region of greatest similarity between the spectra of x1 and x2 is actually closer to 0.5. Also, note that the cross-spectral multiplication (2.1) would in this case give non-zero values at all 7 points in the spectrum, even at 3/7 ≈ 0.43 and 4/7 ≈ 0.57, although in the original spectrum of x1 in Figure 2-2 there is no evidence that x1 contains any power in the interval [0.4, 0.6]. This does not happen if we multiply the spectra shown in Figure 2-3. Comparing the spectra of sequences of different lengths by reducing the resolution of the spectrum
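The worked example above is easy to check numerically. The sketch below is our own illustration, not code from the thesis: it uses NumPy's FFT to reproduce the quoted power estimates for x1 and x2 in the interval [0.125, 0.375], and the helper power_in_interval is our own construction following the set I1 defined above.

```python
import numpy as np

# Sequences from the worked example in the text.
x1 = np.array([0.0, 1.0, 3.0, 2.0])                 # N1 = 4
x2 = np.array([1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 0.0])  # N2 = 7

def power_in_interval(x, lo, hi):
    """Estimate the power of x in [lo, hi] on the normalized frequency
    axis as the sum of |X[k]|^2 / N over the DFT points k/N inside it."""
    X = np.fft.fft(x)
    N = len(x)
    freqs = np.arange(N) / N
    mask = (freqs >= lo) & (freqs <= hi)
    return np.sum(np.abs(X[mask]) ** 2) / N

# x1: only the component at 1/4 = 0.25 lies in the interval;
# |X1[1]| = sqrt(10) ~ 3.16, so the estimate is 10/4 = 2.5.
print(power_in_interval(x1, 0.125, 0.375))   # 2.5
# x2: the components at 1/7 and 2/7 lie in the interval;
# their amplitudes are ~3.36 and ~2.21, giving ~2.31 as in the text.
print(power_in_interval(x2, 0.125, 0.375))   # ~2.31
```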
A quantitative analysis of measures of quality in science

arXiv:physics/0701311v1 [physics.soc-ph] 27 Jan 2007

A Quantitative Analysis of Measures of Quality in Science

Sune Lehmann, Informatics and Mathematical Modeling, Technical University of Denmark, Building 321, DK-2800 Kgs. Lyngby, Denmark.
Andrew D. Jackson and Benny E. Lautrup, The Niels Bohr Institute, Blegdamsvej 17, DK-2100 København Ø, Denmark.
(Dated: February 2, 2008)

Condensing the work of any academic scientist into a one-dimensional measure of scientific quality is a difficult problem. Here, we employ Bayesian statistics to analyze several different measures of quality. Specifically, we determine each measure's ability to discriminate between scientific authors. Using scaling arguments, we demonstrate that the best of these measures require approximately 50 papers to draw conclusions regarding long term scientific performance with usefully small statistical uncertainties. Further, the approach described here permits the value-free (i.e., statistical) comparison of scientists working in distinct areas of science.

PACS numbers: 89.65.-s, 89.75.Da

I. INTRODUCTION

It appears obvious that a fair and reliable quantification of the 'level of excellence' of individual scientists is a near-impossible task [1, 2, 3, 4, 5]. Most scientists would agree on two qualitative observations: (i) It is better to publish a large number of articles than a small number. (ii) For any given paper, its citation count, relative to citation habits in the field in which the paper is published, provides a measure of its quality. It seems reasonable to assume that the quality of a scientist is a function of his or her full citation record. The question is whether this function can be determined and whether quantitatively reliable rankings of individual scientists can be constructed.

A variety of 'best' measures based on citation data have been proposed in the literature and adopted in practice [6, 7]. The specific merits claimed for these various measures rely largely on intuitive arguments and value judgments that are not amenable to quantitative investigation. (Honest people can disagree, for example, on the relative merits of publishing a single paper with 1000 citations and publishing 10 papers with 100 citations each.) The absence of quantitative support for any given measure of quality based on citation data is of concern, since such data is now routinely considered in matters of appointment and promotion which affect every working scientist.

Citation patterns became the target of scientific scrutiny in the 1960s as large citation databases became available through the work of Eugene Garfield [8] and other pioneers in the field of bibliometrics. A surprisingly large body of work on the statistical analysis of citation data has been performed by physicists. Relevant papers in this tradition include the pioneering work of D. J. de Solla Price, e.g. [9], and, more recently, [7, 10, 11, 12]. In addition, physicists are a driving force in the emerging field of complex networks. Citation networks represent one popular network specimen in which papers correspond to nodes connected by references (out-links) [...]

Footnote 2: We use the Greek alphabet when binning with respect to m and the Roman alphabet for binning citations.

[...] a fixed author bin, α. Bayes' theorem allows us to invert this probability to yield

P(α|{n_i}) ∼ P({n_i}|α) p(α),   (1)

where P(α|{n_i}) is the probability that the citation record {n_i} was drawn at random from author bin α. By considering the actual citation histories of authors in bin β, we can thus construct the probability P(α|β) that the citation record of an author
initially assigned to bin β was drawn from the distribution appropriate for bin α. In other words, we can determine the probability that an author assigned to bin β on the basis of the tentative quality measure should actually be placed in bin α. This allows us to determine both the accuracy of the initial author assignment and its uncertainty in a purely statistical fashion.

While a good choice of measure will assign each author to the correct bin with high probability, this will not always be the case. Consider extreme cases in which we elect to bin authors on the basis of measures unrelated to scientific quality, e.g., by hair/eye color or alphabetically. For such measures P(i|α) and P({n_i}|α) will be independent of α, and P(α|{n_i}) will become proportional to the prior distribution p(α). As a consequence, the proposed measure will have no predictive power whatsoever. It is obvious, for example, that a citation record provides no information about its author's hair/eye color. The utility of a given measure (as indicated by the statistical accuracy with which a value can be assigned to any given author) will obviously be enhanced when the basic distributions P(i|α) depend strongly on α. These differences can be formalized using the standard Kullback-Leibler divergence. As we shall see, there are significant variations in the predictive power of various familiar measures of quality.

The organization of the paper is as follows. Section II is devoted to a description of the data used in the analysis; Section III introduces the various measures of quality that we will consider. In Sections IV and V, we provide a more detailed discussion of the Bayesian methods adopted for the analysis of these measures and a discussion of which of these measures is best, in the sense described above of providing the maximum discriminatory power. This will allow us in Section VI to address the question of how many papers are required in order to make reliable estimates of a given author's scientific quality; finally, Appendix A discusses the origin of asymmetries in some of the measures. A discussion of the results and various conclusions will be presented in Section VII.

II. DATA

The analysis in this paper is based on data from the SPIRES database of papers in high energy physics. Our data [...]

FIG. 2: Logarithmically binned histogram of the citations in bin 6 of the median measure. The △ points show the citation distribution of the first 25 papers by all authors. The points marked by ⋆ show the distribution of citations from the first 50 papers by authors who have written more than 50 papers. Finally, the remaining data points show the distribution of all papers by all authors. The axes are logarithmic.

[...] histogram. Studies performed on the first 25, first 50 and all papers for a given value of m show the absence of temporal correlations. It is of interest to see this explicitly. Consider the following example. In Figure 2, we have plotted the distribution for bin 6 of the median measure. There are 674 authors in this bin. Two thirds of these authors have written 50 papers or more. Only this subset is used when calculating the first-50-papers results. In this bin, the means for the total, first 25 and first 50 papers are 11.3, 12.8, and 12.9 citations per paper, respectively. The medians of the distributions are 4, 6, and 6. The plot in Figure 2 confirms these observations. The remaining bins and the other measures yield similar results.

Note that Figure 2 confirms the general observations on the shapes of the conditional distributions made above. Figure 2 also shows two distinct power-laws. Both of the power-laws in this bin are flatter than the ones found in the total
distribution, and the transition point is lower than in the total distribution from Figure 1.

III. MEASURES OF SCIENTIFIC EXCELLENCE

Despite differing citation habits in different fields of science, most scientists agree that the number of citations of a given paper is the best objective measure of the quality of that paper. The belief underlying the use of citations as a measure of quality is that the number of citations to a paper provides [...]

Footnote 6: We realize that there are a number of problems related to the use of citations as a proxy for quality. Papers may be cited or not for reasons other than their high quality. Geo- and/or socio-political circumstances can keep works of high quality out of the mainstream. Credit for an important idea can be attributed incorrectly. Papers can be cited for historical rather than scientific reasons. Indeed, the very question of whether authors actually read the papers they cite is not a simple one [18]. Nevertheless, we assume that correct citation usage dominates the statistics.

Footnote 7: Diverging higher moments of power-law distributions are discussed in the literature, e.g. [19].

[...] 1000 citations is of greater value to science than the author of 10 papers with 100 citations each (even though the latter is far less probable than the former). In this sense, the maximally cited paper might provide better discrimination between authors of 'high' and 'highest' quality, and this measure merits consideration.

Another simple and widely used measure of scientific excellence is the average number of papers published by an author per year. This would be a good measure if all papers were cited equally. As we have just indicated, scientific papers are emphatically not cited equally, and few scientists hold the view that all published papers are created equal in quality and importance. Indeed, roughly 50% of all papers in SPIRES are cited ≤ 2 times (including self-citation). This fact alone is sufficient to invalidate publication rate as a measure of scientific excellence. If all papers were of equal merit, citation analysis would provide a measure of industry rather than one of intrinsic quality.

In an attempt to remedy this problem, Thomson Scientific (ISI) introduced the Impact Factor, which is designed to be a "measure of the frequency with which the 'average article' in a journal has been cited in a particular year or period". The Impact Factor can be used to weight individual papers. Unfortunately, citations to articles in a given journal also obey power-law distributions [12]. This has two consequences. First, the determination of the Impact Factor is subject to the large fluctuations which are characteristic of power-law distributions. Second, the tail of a power-law distribution displaces the mean citation to higher values of k, so that the majority of papers have citation counts much smaller than the mean. This fact is expressed, for example, in the large difference between mean and median citations per paper: for the total SPIRES database, the median is 2 citations per paper, while the mean is approximately 15. Indeed, only 22% of the papers in SPIRES have a number of citations in excess of the mean, cf. [11]. Thus, the dominant role played by a relatively small number of highly cited papers in determining the Impact Factor implies that it is subject to relatively large fluctuations and that it tends to overestimate the level of scientific excellence of high impact journals. This fact was directly verified by Seglen [20], who showed explicitly that the citation rate for individual papers is uncorrelated with the impact factor of the journal in which it was
published.

An alternative way to measure excellence is to categorize each author by the median number of citations of his papers, k_{1/2}. Clearly, the median is far less sensitive to statistical fluctuations, since all papers play an equal role in determining its value. To demonstrate the robustness of the median, it is useful to note that the median of 2N + 1 random draws on any normalized probability distribution, q(x), is normally distributed in the limit N → ∞. To this end we define the integral

Q(x) = ∫_{-∞}^{x} q(x') dx'.   (2)

The probability distribution of the median of 2N + 1 draws is then

P_{x_{1/2}}(x) = ((2N+1)! / (N! N!)) q(x) Q(x)^N [1 − Q(x)]^N.   (3)

For large N, the maximum of P_{x_{1/2}}(x) occurs at x = x_{1/2}, where Q(x_{1/2}) = 1/2. Expanding P_{x_{1/2}}(x) about its maximum value, we see that

P_{x_{1/2}}(x) ≈ (1 / √(2πσ²)) exp[−(x − x_{1/2})² / (2σ²)],  with  σ² = 1 / (4 q(x_{1/2})² (2N+1)).   (4)

A similar argument applies for every percentile. The statistical stability of percentiles suggests that they are well suited for dealing with the power laws which characterize citation distributions.

Recently, Hirsch [7] proposed a different measure, h, intended to quantify scientific excellence. Hirsch's definition is as follows: "A scientist has index h if h of his/her N_p papers have at least h citations each, and the other (N_p − h) papers have fewer than h citations each" [7]. Unlike the mean and the median, which are intensive measures largely constant in time, h is an extensive measure which grows throughout a scientific career. Hirsch assumes that h grows approximately linearly with an author's professional age, defined as the time between the publication dates of the first and last paper. Unfortunately, this does not lead to an intensive measure. Consider, for example, the case of authors with large time gaps between publications, or the case of authors whose citation data are recorded in disjoint databases. A properly intensive measure can be obtained by dividing an author's h-index by the total number of his/her publications. We will consider both approaches below.

The h-index represents an attempt to strike a balance between productivity and quality and to escape the tyranny of power-law distributions, which place strong weight on a relatively small number of highly cited papers. The problem is that Hirsch assumes an equality between incommensurable quantities. An author's papers are listed in order of decreasing citations, with paper i having C(i) citations. Hirsch's measure is determined by the equality h = C(h), which posits an equality between two quantities with no evident logical connection.
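For concreteness, the median measure k_{1/2} and Hirsch's h just defined can be written as small self-contained functions. This is a minimal sketch of our own; the citation record used for illustration is invented, not taken from SPIRES.

```python
def h_index(citations):
    """Hirsch's h: the largest h such that h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(ranked, start=1):
        if c >= rank:
            h = rank
    return h

def median_citations(citations):
    """k_1/2: the median citation count; every paper carries equal weight,
    so a single highly cited paper barely moves it."""
    s = sorted(citations)
    m = len(s) // 2
    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2

record = [187, 95, 40, 33, 29, 28, 12, 7, 3, 1, 0]   # hypothetical author
print(h_index(record))           # 7: seven papers with at least 7 citations each
print(median_citations(record))  # 28
```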
While it might be reasonable to assume that h^γ ∼ C(h), there is no reason to assume that γ and the constant of proportionality are both 1.

We will also include one intentionally nonsensical choice in the following analysis of the various proposed measures of author quality. Specifically, we will consider what happens when authors are binned alphabetically. In the absence of historical information, it is clear that an author's citation record should provide us with no information regarding the author's name. Binning authors in alphabetic order should thus fail any statistical test of utility and will provide a useful calibration of the methods adopted. The measures of quality described in this section are the ones we will consider in the remainder of this paper.

IV. A BAYESIAN ANALYSIS OF CITATION DATA

The rationale behind all citation analyses lies in the fact that citation data is strongly correlated, such that a 'good' scientist has a far higher probability of writing a good (i.e., highly cited) paper than a 'poor' scientist. Such correlations are clearly present in SPIRES [11, 21]. We thus categorize each author by some tentative quality index based on their total citation record. Once assigned, we can empirically construct the prior distribution, p(α), that an author is in author bin α, and the probability P(N|α) that an author in bin α has a total of N publications. We also construct the conditional probability P(i|α) that a paper written by an author in bin α will lie in citation bin i. As we have seen earlier, studies performed on the first 25, first 50 and all papers of authors in a given bin reveal no signs of additional temporal correlations in the lifetime citation distributions of individual authors. In performing this construction, we have elected to bin authors in deciles. We bin papers into L bins according to the number of citations. The binning of papers is approximately logarithmic (see Appendix A). We have confirmed that the results stated below are largely independent of the bin sizes chosen.
We now wish to calculate the probability, P({n_i}|α), that an author in bin α will have the full (binned) citation record {n_i}. In order to perform this calculation, we assume that the various counts n_i are obtained from N independent random draws on the appropriate distribution, P(i|α). Thus,

P({n_i}|α) = P(N|α) N! ∏_{i=1}^{L} P(i|α)^{n_i} / n_i!   (5)

Inverting this with Bayes' theorem, as in Eq. (1), gives the posterior probability

P(α|{n_i}) = p(α) P(N|α) ∏_j P(j|α)^{n_j} / P({n_i}),   (6)

where the combinatorial factors have cancelled between numerator and denominator.

FIG. 3: A single author example. We analyze the citation record of author A with respect to the eight different measures defined in the text. Author A has written a total of 88 papers. The mean of this citation record is 26 citations per paper, the median is 13 citations, the h-index is 29, the maximally cited paper has 187 citations, and papers have been published at the average rate of 2.5 papers per year. The various panels show the probability that author A belongs to each of the ten deciles for the corresponding measure; the vertical arrow displays the initial assignment. Panel (a) displays P(first initial|A), (b) shows P(papers per year|A), (c) shows P(h/T|A), (d) shows P(h/N|A), panel (e) shows P(k_max|A), panel (f) displays P(<k>|A), (g) shows P(k_{1/2}|A), and finally (h) shows P(k_{.65}|A).

[...] the probabilities that an author initially assigned to bin α belongs in decile bin β. This probability is proportional to the area of the corresponding squares. Obviously, a perfect measure would place all of the weight in the diagonal entries of these plots. Weights should be centered about the diagonal for an accurate identification of author quality, and the certainty of this identification grows as weight accumulates in the diagonal boxes. Note that an assignment of a decile based on Eq. (6) is likely to be more reliable than the value of the initial assignment, since the former is based on all information contained in the citation record.

Figure 4 emphasizes that 'first initial' and 'publications per year' are not reliable measures. The h-index normalized by professional age performs poorly; when normalized by number of papers, the trend towards the diagonal is enhanced. We note the appearance of vertical bars in each figure in the top row. This feature is explained in Appendix A. All four measures in the bottom row perform fairly well. The initial assignment of the k_max measure always underestimates an author's correct bin. This is not an accident and merits comment. Specifically, if an author has produced a single paper with citations in excess of the values contained in bin α, the probability that he will lie in this bin, as calculated with Eq. (6), is strictly 0. Non-zero probabilities can be obtained only for bins including maximum citations greater than or equal to the maximum value already obtained by this author. (The fact that the probabilities for these bins shown in Fig. 4 are not strictly 0 is a consequence of the use of finite bin sizes.) Thus, binning authors on the basis of their maximally cited paper necessarily underestimates their quality. The mean, median and 65th percentile appear to be the most balanced measures, with roughly equal predictive value.

It is clear from Eq. (6) that the ability of a given measure to discriminate is greatest when the differences between the conditional probability distributions, P(i|α), for different author bins are largest. These differences can be quantified by measuring the 'distance' between two such conditional distributions with the aid of the Kullback-Leibler (KL) divergence (also known as the relative entropy). The KL divergence between two discrete probability
distributions, p and p', is defined as

KL[p, p'] = Σ_i p_i ln(p_i / p'_i).

Footnote 10: The non-standard choice of the natural logarithm, rather than the logarithm base two, in the definition of the KL divergence will be justified below.

Footnote 11: Figure 5 gives a misleading picture of the k_max measure, since the KL divergences KL[P(i|α+1), P(i|α)] are infinite, as discussed above.

FIG. 4: Eight different measures. Each horizontal row shows the average probabilities (proportional to the areas of the squares) that authors initially assigned to decile bin α are predicted to belong in bin β. Panels as in Fig. 3.

FIG. 5: The Kullback-Leibler divergences KL[P(i|α), P(i|α+1)]. Results are shown for the following distributions: h-index normalized by number of publications, maximum number of citations, mean, median, and 65th percentile.

[...] dramatically smaller than the other measures shown, except for the extreme deciles. The reduced ability of all measures to discriminate in the middle deciles is immediately apparent from Fig. 5. This is a direct consequence of any percentile binning: given that the distribution of author quality has a maximum at some non-zero value, the bin size of a percentile distribution near the maximum will necessarily be small. The accuracy with which authors can be assigned to a given bin in the region around the maximum is reduced, since one is attempting to distinguish [...]

[...] limit the utility of such analyses in the academic appointment process. This raises the question of whether there are more efficient measures of an author's full citation record than those considered here. Our object has been to find the measure which is best able to assign the most similar authors together.
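As a numerical illustration of Eq. (6) and the KL divergence, here is a minimal sketch with made-up conditional distributions P(i|α) over three author bins and six citation bins. The flat prior and the omission of P(N|α) are simplifying assumptions of this sketch, and the multinomial prefactor is dropped because it cancels in the normalized posterior.

```python
import numpy as np

# Made-up conditional probabilities P(i|alpha): rows = author bins,
# columns = citation bins. Each row sums to 1.
P = np.array([
    [0.43, 0.19, 0.18, 0.12, 0.06, 0.02],   # alpha = 1 (weakest authors)
    [0.26, 0.14, 0.18, 0.18, 0.14, 0.10],   # alpha = 2
    [0.12, 0.08, 0.12, 0.16, 0.28, 0.24],   # alpha = 3 (strongest authors)
])
prior = np.array([1.0, 1.0, 1.0]) / 3        # flat prior p(alpha)

n = np.array([3, 2, 4, 6, 7, 3])             # one author's binned record {n_i}

# Eq. (6) in log form: log P(alpha|{n_i}) = log p(alpha) + sum_i n_i log P(i|alpha) + const.
log_post = np.log(prior) + n @ np.log(P).T
post = np.exp(log_post - log_post.max())
post /= post.sum()
print(post)          # posterior probability of each author bin

# KL divergence KL[p, p'] = sum_i p_i ln(p_i/p'_i), natural log as in the text.
def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

print(kl(P[0], P[2]), kl(P[2], P[0]))   # note the asymmetry
```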
Straightforward iterative schemes can be constructed to this end, i.e., to find such a measure, and are found to converge rapidly (i.e., exponentially) to an optimal binning of authors. (The result is optimal in the sense that it maximizes the sum of the KL divergences, KL[P(•|α), P(•|β)], over all α and β.) The results are only marginally better than those obtained here with the mean, median or 65th percentile measures.

Finally, it is also important to recognize that it takes time for a paper to accumulate its full complement of citations. While there are indications that an author's early and late publications are drawn (at random) on the same conditional distribution [11], many highly cited papers accumulate citations at a constant rate for many years after their publication. This effect, which has not been addressed in the present analysis, represents a serious limitation on the value of citation analyses for younger authors. The presence of this effect also poses the additional question of whether there are other kinds of statistical publication data that can deal with this problem. Co-author linkages may provide a powerful supplement or alternative to citation data. (Preliminary studies of the probability that authors in bins α and β will co-author a publication reveal a striking concentration along the diagonal α = β.) Since each paper is created with its full set of co-authors, such information could be useful in evaluating younger authors. This work will be reported elsewhere.

APPENDIX A: VERTICAL STRIPES

The most striking feature of the calculated P(β|α) shown in Fig. 4 is the presence of vertical 'stripes'. These stripes are most pronounced for the poorest measures and disappear as the reliability of the measure improves. Here, we offer a schematic but qualitatively reliable explanation of this phenomenon. To this end, imagine that each author's citation record is actually drawn at random on the true distributions Q(i|A). For simplicity, assume that every author has precisely N publications, that each author in true class A has the same distribution of citations with n_i^(A) = N Q(i|A), and that there are equal numbers of authors in each true author class. These authors are then distributed into author bins, α, according to some chosen quality measure. The methods of Sections IV and V can then be used to determine P(i|α), P({n_i^(A)}|β), P(β|{n_i^(A)}) and P(β|α). Given the form of the n_i^(A) and assuming that N is large, we find that

P(β|{n_i^(A)}) ≈ exp(−N KL[Q(•|A), P(•|β)])   (A1)

and

P̃(β|α) ∼ Σ_A P(A|α) exp(−N KL[Q(•|A), P(•|β)]),   (A2)

where P(A|α) is the probability that the citation record of an author assigned to class α was actually drawn on Q(i|A).

FIG. 8: A comparison of the approximate P̃(β|α) from Eq. (A2) and the exact P(β|α) for the papers-published-per-year measure.

The results of this approximate evaluation are shown in Fig. 8 and compared with the exact values of P(β|α) for the papers per year measure. The approximations do not affect the qualitative features of interest. We now assume that the measure defining the author bins, α, provides a poor approximation to the true bins, A. In this case, authors will be roughly uniformly distributed, and the factor P(A|α) appearing in Eq. (A2) will not show large variations. Significant structure will arise from the exponential terms, where the presence of the factor N (assumed to be large) will amplify the differences in the KL divergences. The KL divergence will have a minimum value for some value of A = A_0(β), and this single term will dominate the sum. Thus, P̃(β|α) reduces
to

P̃(β|α) ∼ P(A_0|α) exp(−N KL[Q(•|A_0), P(•|β)]).   (A3)

The vertical stripes prominent in Figs. 4(a) and (b) emerge as a consequence of the dominant β-dependent exponential factor. The present arguments also apply to the worst possible measure, i.e., a completely random assignment of authors to the bins α. In the limit of a large number of authors, N_aut, all P(i|β) will be equal except for statistical fluctuations. The resulting KL divergences will respond linearly to these fluctuations. These fluctuations will be amplified as before, provided only that N_aut grows less rapidly than N². The argument here does not apply to good measures, where there is significant structure in the term P(A|α). (For a perfect measure, P(A|α) = δ_{Aα}.) In the case of good measures, the expected dominance of diagonal terms (seen in the lower row of Fig. 4) remains unchallenged.

APPENDIX B: EXPLICIT DISTRIBUTIONS

For convenience, we present the data needed to determine the probabilities P(α|{n_i}) for authors who publish in the theory subsection of SPIRES. Data is presented only for the case of the mean [...]

TABLE I: The binning of citations and total number of papers. The first and second columns show the bin number and bin ranges for the citation bins used to determine the conditional citation probabilities P(i|α) for each α, shown in Table III. The third and fourth columns display the bin number and total-number-of-papers ranges used in the creation of the conditional probabilities P(m|α) for each α, displayed in Table IV.

Citation bin i   Citation range
i = 1            k = 1
i = 2            k = 2
i = 3            2 < k ≤ 4
i = 4            4 < k ≤ 8
i = 5            8 < k ≤ 16
i = 6            16 < k ≤ 32
i = 7            32 < k ≤ 64
i = 8            64 < k ≤ 128
i = 9            128 < k ≤ 256
i = 10           512 < k ≤ k_max
(The paper bins are labeled m = 1, ..., 9; their ranges are not preserved in this extract.)

TABLE II: The author bins. Each decile α = 1, ..., 10 contains 10% of the authors (p(α) = 0.1); the corresponding ranges of the mean citation n̄(α) are 0–1.69, 1.69–3.08, 3.08–4.88, 4.88–6.94, 6.94–9.40, 9.40–12.56, 12.56–16.63, 16.63–22.19, 22.19–33.99, and 33.99–285.88. (The numbers of authors per bin are not preserved in this extract.)

TABLE III: The distributions P(i|α). This table displays the conditional probabilities that an author writes a paper in paper bin i given that his author bin is α. (Only the odd-numbered rows survive in this extract.)
α = 1: 0.433 0.188 0.181 0.122 0.055 0.016 0.004 0.000 0.000 0.000 0.000
α = 3: 0.263 0.143 0.178 0.184 0.140 0.067 0.019 0.005 0.001 0.000 0.000
α = 5: 0.177 0.113 0.150 0.181 0.173 0.126 0.058 0.017 0.004 0.001 0.000
α = 7: 0.118 0.080 0.121 0.155 0.182 0.169 0.110 0.048 0.012 0.003 0.000
α = 9: 0.068 0.045 0.071 0.107 0.145 0.171 0.166 0.121 0.067 0.027 0.012

TABLE IV: The conditional probabilities P(m|α). This table contains the conditional probabilities that an author has a total number of publications in publication bin m given that his author bin is α. (Only the odd-numbered rows survive in this extract.)
α = 1: 0.058 0.049 0.103 0.187 0.236 0.217 0.122 0.025 0.003
α = 3: 0.043 0.049 0.095 0.141 0.198 0.247 0.162 0.061 0.004
α = 5: 0.031 0.039 0.068 0.126 0.162 0.245 0.215 0.099 0.015
α = 7: 0.028 0.024 0.049 0.096 0.178 0.243 0.248 0.101 0.033
α = 9: 0.027 0.028 0.043 0.077 0.131 0.212 0.199 0.223 0.061

[14] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74:47, 2002.
[15] S. N. Dorogovtsev and J. F. F. Mendes. Evolution of networks. Advances in Physics, 51:1079, 2002.
[16] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45:167, 2003.
[17] S. Lehmann. Spires on the building of science. Master's thesis, The Niels Bohr Institute, 2003. May be downloaded from www.imm.dtu.dk/~slj/.
[18] M. V. Simkin and V. P. Roychowdhury. Read before you cite! Complex Systems, 14:269, 2003.
[19] M. E. J. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46:323, 2005.
[20] P. O. Seglen. Causal relationship between article citedness and journal impact. Journal of the American Society for Information Science, 45:1, 1994.
[21] S. Lehmann, A. D. Jackson, and B. E. Lautrup. Life, death, and preferential attachment. Europhysics Letters, 69:298, 2005.
[22] A. J. Lotka. The frequency distribution of scientific productivity. Journal of the Washington Academy of Sciences, 16:317, 1926.
A Sparse SVD Method for High-dimensional Data

Key Words: Cross-validation; Denoising; Low-rank matrix approximation; Penalization; Principal component analysis; Power iterations; Thresholding.
Introduction
[dimension re]duction, data visualization, data compression and information extraction by extracting the first few singular vectors or eigenvectors; see, for example, Alter et al. (2001), Prasantha et al. (2007), Huang et al. (2009), Thomasian et al. (1998). In recent years, the demands on multivariate methods have escalated as the dimensionality of data sets has grown rapidly in such fields as genomics, imaging, and financial markets. A critical issue that has arisen in large datasets is that in very high dimensional settings classical SVD and PCA can have poor statistical properties (Shabalin and Nobel 2010, Nadler 2009, Paul 2007, and Johnstone and Lu 2009). The reason is that in such situations the noise can overwhelm the signal to such an extent that traditional estimates of SVD and PCA loadings are not even near the ballpark of the underlying truth and can therefore be entirely misleading. Compounding the problems in large datasets are the difficulties of computing numerically precise SVD or PCA solutions at affordable cost. Obtaining statistically viable estimates of eigenvectors and eigenspaces for PCA on high-dimensional data has been the focus of a considerable literature; a representative but incomplete list of references is Lu (2002), Zou et al. (2006), Paul (2007), Paul and Johnstone (2007), Shen and Huang (2008), Johnstone and Lu (2009), Shen et al. (2011), Ma (2011). On the other hand, overcoming similar problems for the classical SVD has been the subject of far less work, pertinent articles being Witten et al. (2009), Lee et al. (2010a), Huang et al. (2009) and Allen et al. (2011). In the high dimensional setting, statistical estimation is not possible without the assumption of strong structure in the data. This is the case for vector data under Gaussian sequence models (Johnstone, 2011), but even more so for matrix data, which require assumptions such as low rank in addition to sparsity or smoothness. Of the latter two, sparsity has slightly greater generality because certain types of smoothness can be reduced to sparsity through suitable basis changes (Johnstone, 2011). By imposing sparseness on singular vectors, one may be able to "sharpen" the structure in data and thereby expose "checkerboard" patterns that convey biclustering structure, that is, joint clustering in the row- and column-domains of the data (Lee et al. 2010a and Sill et al. 2011). Going one step further, Witten and
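To make the penalization and thresholding ideas surveyed above concrete, here is a minimal sketch of a rank-one sparse SVD computed by soft-thresholded power iterations, in the spirit of methods such as Shen and Huang (2008). It is a generic illustration of the technique, not the specific algorithm of this paper, and the penalty levels are arbitrary assumptions.

```python
import numpy as np

def soft(v, lam):
    """Soft-thresholding, the proximal operator of the l1 penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_rank1(X, lam_u=0.1, lam_v=0.1, iters=200):
    """Rank-one sparse SVD by alternating penalized power iterations."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = soft(X @ v, lam_u)
        u /= max(np.linalg.norm(u), 1e-12)
        v = soft(X.T @ u, lam_v)
        v /= max(np.linalg.norm(v), 1e-12)
    return u, u @ X @ v, v   # left vector, singular value, right vector

# Noisy low-rank test matrix with sparse underlying factors.
u0 = np.zeros(50); u0[:5] = 1 / np.sqrt(5)
v0 = np.zeros(80); v0[:4] = 0.5
X = 5.0 * np.outer(u0, v0) + 0.1 * np.random.default_rng(1).standard_normal((50, 80))

u, d, v = sparse_rank1(X)
print(np.count_nonzero(u), np.count_nonzero(v))   # most entries are exactly zero
```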
The Current State and Development Trends of Research on Jungian Analytical Psychology in China: A Knowledge-Mapping Analysis Based on CiteSpace

Received: 4 January 2021. About the authors: Xie Wei is a doctoral student at the School of Psychology, Nanjing Normal University; his main research interests are theoretical psychology and the history of psychology. Guo Benyu is a professor of psychology and doctoral supervisor at Nanjing Normal University; his main research interests are theoretical psychology and the history of psychology.
① Freud, trans. Gao Juefu: Introductory Lectures on Psychoanalysis (精神分析引论), The Commercial Press, 1986.
② Che Wenbo and Guo Benyu: New Perspectives on Freudianism (弗洛伊德主义新论), Shanghai Educational Publishing House, 2018.
③ Wolman, trans. Hu Jinan: "Jung: Analytical Psychology", Digest of Modern Foreign Philosophy and Social Sciences, No. 11, 1961, pp. 20-25.
④ Hall et al.: A Primer of Jungian Psychology (荣格心理学入门), SDX Joint Publishing Company, 1987.
⑤ Jung, trans. Su Ke: Modern Man in Search of a Soul, Guizhou People's Publishing House, 1987.
⑥ Jung, trans. Liu Guobin: Memories, Dreams, Reflections, Liaoning People's Publishing House, 1988.
⑦ Jung, ed. Gao Lan: The Collected Works of Jung (荣格文集), Changchun Publishing House, 2014.
⑧ Jung: Selected Works of Jung (荣格精选集), Yilin Press, 2019.
Journal of Nanjing Xiaozhuang University (南京晓庄学院学报), No. 2, March 2021.

The Current State and Development Trends of Research on Jungian Analytical Psychology in China: A Knowledge-Mapping Analysis Based on CiteSpace

Xie Wei, Guo Benyu (School of Psychology, Nanjing Normal University, Nanjing, Jiangsu 210097)

Abstract: Using the CiteSpace literature-analysis and visualization tool, we analyzed 2,094 Chinese-language publications on Jungian analytical psychology in the CNKI database from 1960 to 2020, in order to reveal the current state and trends of analytical psychology research in China over the past six decades.
The results show that domestic research on Jungian analytical psychology has passed through a period of introduction and exploration, a period of explosive growth, and a period of stable development. There are 14 core authors in China; among them, the research team of Shen Heyong and Gao Lan stands alone in the field, while collaboration among other individual researchers is weak. The School of Educational Science of Shanxi University and the School of Literature of Jilin Normal University are stable research institutions, with weak collaboration among other institutions. Research hotspots concentrate on Jung, archetypes, the collective unconscious, the persona, and related themes; the collective unconscious is the central focus of research, and the overall trend is shifting from broad, general hotspots toward smaller, more focused ones. The research frontier falls into 12 categories; the hotspots remain classical Jungian analytical psychology, with little work on post-Jungian schools.
Using Data Mining Techniques to Analyze Thermodynamics-Related Data

Data mining has become a powerful tool for solving complex problems in many fields. In thermodynamics, applying data mining techniques can give us a better understanding of thermodynamic processes and provide more precise and specific data analysis for thermodynamics research. This article introduces several ways of using data mining techniques to analyze thermodynamics-related data.
First, we can use clustering algorithms. A clustering algorithm partitions the points of a dataset into several classes, such that points within the same class are similar to one another while points in different classes differ considerably. For example, in the automotive industry a clustering algorithm can group vehicles into classes by model, drivetrain, displacement and similar attributes, in order to study differences in thermal performance between the classes. Along the way we may run into some hard questions: How do we choose the number of clusters? How do we measure the quality of a clustering? Answering these requires a close study of the characteristics and performance of clustering algorithms, combined with the specifics of the problem at hand.
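A minimal clustering sketch with scikit-learn follows. The vehicle-like features (displacement, power, mass) are our hypothetical stand-ins, and the data are synthetic; the silhouette coefficient shown is one common answer to the "how many clusters?" and "how good is the clustering?" questions raised above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic records: [displacement (L), rated power (kW), mass (kg)],
# drawn around two invented vehicle classes.
X = np.vstack([
    rng.normal([1.5, 80.0, 1200.0], [0.2, 10.0, 100.0], size=(100, 3)),
    rng.normal([3.0, 180.0, 1900.0], [0.3, 20.0, 150.0], size=(100, 3)),
])
Xs = StandardScaler().fit_transform(X)   # put features on a common scale

# Compare candidate cluster counts by silhouette score (higher is better).
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xs)
    print(k, round(silhouette_score(Xs, labels), 3))
```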
Second, we can use decision tree algorithms to analyze more complex data structures. For example, industrial production processes often require measuring a variety of process parameters to determine product quality; these parameters can be fed into a decision tree algorithm for classification and prediction. Here several important questions arise: How do we select the right features? How do we handle missing values and erroneous data? How do we evaluate a decision tree's performance? Before tackling them, we must first understand the structure of decision tree algorithms and how decision tree models are built and evaluated.
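A hedged sketch of the decision-tree step with scikit-learn: the three process parameters and the pass/fail rule below are synthetic placeholders, not data from any real production line.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                    # e.g. temperature, pressure, flow
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # invented quality rule

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print(tree.score(X_te, y_te))        # held-out accuracy, one basic evaluation
print(tree.feature_importances_)     # which parameters drive the splits
```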
Finally, we can use neural network algorithms for still more complex thermodynamic analyses. A neural network is a mathematical model that can learn and remember the relationships present in a dataset, and it is particularly well suited to highly continuous, complex thermodynamic problems. For example, in materials physics neural network models can be used to predict properties of a material such as elemental abundance, and the predictions can then guide the control of manufacturing processes. In this process we must of course also attend to how the network is constructed and trained, and to how its performance on the problem is evaluated and optimized.
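A minimal regression sketch with scikit-learn's MLPRegressor: the "material property" target below is a synthetic function of invented composition features, standing in for the abundance-prediction task mentioned above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(1000, 4))        # hypothetical composition fractions
y = np.sin(3.0 * X[:, 0]) + X[:, 1] * X[:, 2] + 0.05 * rng.standard_normal(1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(
    StandardScaler(),                             # scaling helps the optimizer
    MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0),
)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))                    # R^2 on held-out data
```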
In summary, data mining techniques hold considerable promise for applications in thermodynamics. With sensible algorithm selection, careful feature design and proper data handling, we can provide more accurate and comprehensive analyses for the field. This also requires continual adjustment and optimization, and keeping pace with advances in algorithms, if we are to work fluently in thermodynamic analysis and keep uncovering the deeper regularities of thermodynamic problems.
A Reading List for Writing Data Analysis Reports

1. Data Analysis in Practice: Python Programming and Case Applications (数据分析实战:Python编程与案例应用), by Wei Hongrui. A practice-oriented data analysis guide aimed at beginners. Through the Python programming language and real case applications, readers learn the basic concepts, methods and tools of data analysis. The book contains many sample code listings and real datasets, helping readers get started quickly and master basic data analysis skills.
2. Python Data Science Handbook, by Jake VanderPlas. A comprehensive guide to data science in Python, intended to help readers use Python for data analysis and data visualization. It covers data wrangling, data analysis, machine learning and more, and is an essential reference for readers who want to go deeper into data science.
3. R in Action (R语言实战), by Hadley Wickham. R is a very popular data analysis tool, applied especially widely in statistics. This book is a practical R guide: through many examples and case studies, it teaches readers how to use R for data cleaning, exploratory data analysis, statistical model building and more. Both beginners and readers with some R experience can pick up practical data analysis techniques from it.
4. Data Visualization in Practice (数据可视化实战), by Nathan Yau. Data visualization is one of the most important stages of data analysis, helping people understand and explain data better. This book introduces the basic principles and practical techniques of data visualization and, through concrete cases and examples, teaches readers how to create compelling visualizations with a range of tools and techniques. For readers who want to improve the visual impact of their data analysis reports, it is not to be missed.
To summarize: the books above are all worth recommending on data analysis. Whether you are a beginner or an experienced data analyst, you can draw useful knowledge and techniques from them. Data analysis is a vital skill today, and a good command of its methods and tools brings great value to individuals and organizations. We hope readers will use these books to raise their own data analysis abilities.
Winthrop University Student Religion Survey, 1996
Data summary:
This study was designed by the principal investigator and his students in an upper-division sociology of religion course. The survey items were formulated around key issues in class and administered to Winthrop University students in General Education classes. The topics covered include religious background and behavior, spiritual beliefs, and attitudes toward deviant religious groups.
Keywords:
Winthrop University, students, religion survey, key issues, spiritual beliefs, attitudes
Data format:
TEXT
Data use:
The data can be used for data mining.
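A hedged loading sketch with pandas: the file name, delimiter and column names below are assumptions for illustration, since the exact layout of the TEXT distribution is not documented on this page.

```python
import pandas as pd

# Assumed file name and tab delimiter; adjust to the actual download.
df = pd.read_csv("winthrop_religion_1996.txt", sep="\t")

print(df.shape)    # expected (306, 46): 306 cases, 46 variables
print(df.head())

# Example exploration with hypothetical variable names:
# pd.crosstab(df["religious_background"], df["attends_services"])
```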
Detailed description:
Winthrop University Student Religion Survey, 1996
Data File
Cases: 306
Variables: 46
Weight Variable:
Data Collection
Date Collected: Fall, 1996
Collection Procedures
Self-report questionnaire
Sampling Procedures
The target population for this survey was students at Winthrop University, a state university in South Carolina. The sample was drawn from students in lower-division General Education classes. Classes were surveyed with the consent of professors. There is no information on response rate, though refusals were quite rare.
Principal Investigators
Douglas Eckberg
Data preview:
Click here to download the complete dataset.