Applications of Unsupervised Learning in Natural Language Generation

Chapter 1: Introduction

1.1 Research Background

Natural Language Generation (NLG) is one of the important research directions in artificial intelligence. With the rapid development of deep learning, and in particular the introduction of unsupervised learning methods, natural language generation has made remarkable progress. This article focuses on the applications of unsupervised learning in natural language generation.

1.2 Purpose and Significance

Natural language generation is a foundational technology for human-computer interaction, intelligent question answering, game design, and other fields, and it has broad prospects in practical applications. This article examines the applications of unsupervised learning in natural language generation, in the hope of providing a reference for follow-up research and of advancing the technology further.

Chapter 2: Overview of Unsupervised Learning

2.1 Introduction to Unsupervised Learning

Unsupervised learning refers to machine learning methods that learn patterns or structure from unlabeled data. Compared with supervised learning, unsupervised learning needs no labeled data and can discover hidden patterns in data on its own, which gives it a wider range of application scenarios.

2.2 Unsupervised Learning Methods

In natural language generation, commonly used unsupervised learning methods include clustering, dimensionality reduction, and generative models. Clustering algorithms group similar data points together, partitioning a corpus and revealing its latent categories. Dimensionality reduction algorithms map high-dimensional data into a low-dimensional space while preserving the key information, making the data easier to process and analyze. Generative models learn the distribution of the data and can then generate new samples that resemble the original data.
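As a concrete illustration of dimensionality reduction on text, the sketch below applies truncated SVD (latent semantic analysis) to a TF-IDF matrix with scikit-learn. The toy corpus and the choice of two components are illustrative assumptions, not recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus (illustrative only).
corpus = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock prices rose sharply today",
    "the market closed higher on strong earnings",
]

# High-dimensional sparse TF-IDF representation.
tfidf = TfidfVectorizer().fit_transform(corpus)

# Project to a 2-dimensional latent semantic space.
svd = TruncatedSVD(n_components=2, random_state=0)
embeddings = svd.fit_transform(tfidf)
print(embeddings.shape)  # (4, 2)
```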
Chapter 3: Applications of Unsupervised Learning in Natural Language Generation

3.1 Topic Discovery and Text Summarization

Topic discovery is one of the important applications of clustering in natural language processing. Clustering algorithms can partition text data into different topics, enabling automatic organization of large-scale text collections. On this basis, keywords can be extracted and text summaries generated, giving users faster and more precise access to information.
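A minimal sketch of this idea, assuming a tiny toy corpus and a recent scikit-learn (for `get_feature_names_out`): documents are clustered in TF-IDF space, and the heaviest terms of each cluster centroid serve as topic keywords.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the goalkeeper saved the penalty in the final",
    "the striker scored twice in the match",
    "the central bank raised interest rates",
    "inflation and interest rates worry investors",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Cluster documents into 2 topics.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Report the top keywords of each discovered topic.
terms = vectorizer.get_feature_names_out()
for c in range(2):
    center = km.cluster_centers_[c]
    top = center.argsort()[::-1][:3]
    print(f"topic {c}:", [terms[i] for i in top])
```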
3.2 Language Modeling

Building language models is one of the central tasks of natural language generation, with applications in machine translation, dialogue systems, and text generation. Unsupervised learning can model large-scale text corpora with generative models and discover the linguistic regularities and structure they contain. A generative model learns the data distribution and can then generate new samples that resemble the original data.
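The simplest possible instance of such a generative model is a character-level bigram model. The sketch below, with a made-up two-sentence corpus, counts character transitions and samples new text from them; real language models are vastly larger, but the principle (learn a distribution, then sample from it) is the same.

```python
import random
from collections import defaultdict, Counter

text = "the cat sat on the mat. the dog sat on the log."

# Count character bigrams: an empirical estimate of P(next | current).
counts = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    counts[a][b] += 1

def sample_next(ch):
    nxt = counts[ch]
    chars, weights = zip(*nxt.items())
    return random.choices(chars, weights=weights)[0]

# Generate 40 characters starting from 't'.
ch, out = "t", ["t"]
for _ in range(40):
    ch = sample_next(ch)
    out.append(ch)
print("".join(out))
```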
3.3 Word Embeddings and Semantic Representation

Word embeddings map discrete words into a continuous vector space so that the semantic similarity between words can be captured more effectively.
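Word embeddings are typically trained without labels, for example with word2vec. A minimal sketch using the gensim library follows; the toy sentences and hyperparameter values are assumptions for illustration, and a useful model would need a corpus of millions of tokens.

```python
from gensim.models import Word2Vec

sentences = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
]

# Skip-gram word2vec on a toy corpus (sg=1 selects skip-gram).
model = Word2Vec(sentences, vector_size=50, window=2,
                 min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("king", topn=2))
```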
Common Mistakes in the Use of Unsupervised Learning (III)

Unsupervised Learning is an important branch of machine learning that discovers regularities and structure in data through automatic analysis and pattern recognition. Unlike supervised learning, it requires no pre-labeled training data, which makes it attractive in many practical applications. However, precisely because of this greater freedom, unsupervised learning is also prone to a number of common mistakes. This article examines several of them.

Data preprocessing is a critical step in unsupervised learning. Incorrect preprocessing directly degrades the performance of the subsequent model. One common mistake is to ignore the distribution of the data. When running clustering or dimensionality reduction, many practitioners feed the raw data straight into the algorithm without examining its distribution. If the distribution is highly uneven or contains outliers, the results can deviate substantially from reality. Therefore, before applying unsupervised learning, analyze the data distribution; if it is strongly skewed or contains outliers, handle them accordingly, for example by removing outliers or rebalancing the data.
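A minimal preprocessing sketch along these lines, assuming purely numeric features and synthetic data: rows outside 1.5 x IQR on any feature are dropped, and the rest are standardized so that no single feature dominates distance computations.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:5] *= 50  # inject a few gross outliers

# Flag rows falling outside 1.5 * IQR on any feature.
q1, q3 = np.percentile(X, [25, 75], axis=0)
iqr = q3 - q1
mask = np.all((X >= q1 - 1.5 * iqr) & (X <= q3 + 1.5 * iqr), axis=1)
X_clean = X[mask]

# Standardize the surviving rows.
X_scaled = StandardScaler().fit_transform(X_clean)
print(X.shape, "->", X_scaled.shape)
```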
Another common mistake is choosing an unsuitable model. In unsupervised learning, model selection matters a great deal, and different problems call for different models. For clustering, for example, one can choose among K-means, hierarchical clustering, DBSCAN, and other algorithms; but if the chosen algorithm does not match the characteristics of the data, the resulting clusters can be badly distorted. Therefore, get to know the data first, and then choose a model that fits it.
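One hedged way to compare candidates is an internal validity index such as the silhouette score, sketched below on two interleaved half-moons (a shape K-means handles poorly); the eps and min_samples values are illustrative guesses. Note that the silhouette score itself favors convex clusters, so it should complement, not replace, inspection of the data.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Two interleaved half-moons: non-convex clusters.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

models = {
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=2),
    "dbscan": DBSCAN(eps=0.2, min_samples=5),
}
for name, model in models.items():
    labels = model.fit_predict(X)
    if len(set(labels)) > 1:  # silhouette needs at least 2 labels
        print(name, round(silhouette_score(X, labels), 3))
```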
Many users of unsupervised learning also fall into the "black box" trap: attending only to a model's outputs while ignoring its internal workings. This leads to a shallow understanding of the data and makes internal problems with the model impossible to detect. When using an unsupervised model, therefore, develop a clear picture of its mechanism, its input-output relationship, and how its parameters affect the results; only then can the model be used well.

Users of unsupervised learning are also prone to the curse of dimensionality, which refers to the challenges of analyzing data in high-dimensional spaces, such as data sparsity and computational cost.
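A common mitigation is to reduce dimensionality before clustering. The sketch below, on synthetic 200-dimensional data, first projects onto the 10 highest-variance directions with PCA; the dimensions and cluster count are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))  # 200-dimensional data

# Reduce to the 10 directions of highest variance before clustering.
X_low = PCA(n_components=10, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_low)
print(X_low.shape, np.bincount(labels))
```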
A Survey of Neural Networks and Deep Learning (Deep Learning, 15 May 2014)

Draft: Deep Learning in Neural Networks: An Overview
Technical Report IDSIA-03-14 / arXiv:1404.7828 (v1.5) [cs.NE]
Jürgen Schmidhuber
The Swiss AI Lab IDSIA, Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, University of Lugano & SUPSI, Galleria 2, 6928 Manno-Lugano, Switzerland
15 May 2014

Abstract

In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects. I review deep supervised learning (also recapitulating the history of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

PDF of earlier draft (v1): http://www.idsia.ch/~juergen/DeepLearning30April2014.pdf
LaTeX source: http://www.idsia.ch/~juergen/DeepLearning30April2014.tex
Complete BibTeX file: http://www.idsia.ch/~juergen/bib.bib

Preface

This is the draft of an invited Deep Learning (DL) overview. One of its goals is to assign credit to those who contributed to the present state of the art. I acknowledge the limitations of attempting to achieve this goal. The DL research community itself may be viewed as a continually evolving, deep network of scientists who have influenced each other in complex ways. Starting from recent DL results, I tried to trace back the origins of relevant ideas through the past half century and beyond, sometimes using "local search" to follow citations of citations backwards in time. Since not all DL publications properly acknowledge earlier relevant work, additional global search strategies were employed, aided by consulting numerous neural network experts. As a result, the present draft mostly consists of references (about 800 entries so far). Nevertheless, through an expert selection bias I may have missed important work. A related bias was surely introduced by my special familiarity with the work of my own DL research group in the past quarter-century. For these reasons, the present draft should be viewed as merely a snapshot of an ongoing credit assignment process. To help improve it, please do not hesitate to send corrections and suggestions to juergen@idsia.ch.

Contents

1 Introduction to Deep Learning (DL) in Neural Networks (NNs)
2 Event-Oriented Notation for Activation Spreading in FNNs/RNNs
3 Depth of Credit Assignment Paths (CAPs) and of Problems
4 Recurring Themes of Deep Learning
  4.1 Dynamic Programming (DP) for DL
  4.2 Unsupervised Learning (UL) Facilitating Supervised Learning (SL) and RL
  4.3 Occam's Razor: Compression and Minimum Description Length (MDL)
  4.4 Learning Hierarchical Representations Through Deep SL, UL, RL
  4.5 Fast Graphics Processing Units (GPUs) for DL in NNs
5 Supervised NNs, Some Helped by Unsupervised NNs
  5.1 1940s and Earlier
  5.2 Around 1960: More Neurobiological Inspiration for DL
  5.3 1965: Deep Networks Based on the Group Method of Data Handling (GMDH)
  5.4 1979: Convolution + Weight Replication + Winner-Take-All (WTA)
  5.5 1960-1981 and Beyond: Development of Backpropagation (BP) for NNs
    5.5.1 BP for Weight-Sharing Feedforward NNs (FNNs) and Recurrent NNs (RNNs)
  5.6 Late 1980s-2000: Numerous Improvements of NNs
    5.6.1 Ideas for Dealing with Long Time Lags and Deep CAPs
    5.6.2 Better BP Through Advanced Gradient Descent
    5.6.3 Discovering Low-Complexity, Problem-Solving NNs
    5.6.4 Potential Benefits of UL for SL
  5.7 1987: UL Through Autoencoder (AE) Hierarchies
  5.8 1989: BP for Convolutional NNs (CNNs)
  5.9 1991: Fundamental Deep Learning Problem of Gradient Descent
  5.10 1991: UL-Based History Compression Through a Deep Hierarchy of RNNs
  5.11 1992: Max-Pooling (MP): Towards MPCNNs
  5.12 1994: Contest-Winning Not So Deep NNs
  5.13 1995: Supervised Recurrent Very Deep Learner (LSTM RNN)
  5.14 2003: More Contest-Winning/Record-Setting, Often Not So Deep NNs
  5.15 2006/7: Deep Belief Networks (DBNs) & AE Stacks Fine-Tuned by BP
  5.16 2006/7: Improved CNNs/GPU-CNNs/BP-Trained MPCNNs
  5.17 2009: First Official Competitions Won by RNNs, and with MPCNNs
  5.18 2010: Plain Backprop (+ Distortions) on GPU Yields Excellent Results
  5.19 2011: MPCNNs on GPU Achieve Superhuman Vision Performance
  5.20 2011: Hessian-Free Optimization for RNNs
  5.21 2012: First Contests Won on ImageNet & Object Detection & Segmentation
  5.22 2013-: More Contests and Benchmark Records
    5.22.1 Currently Successful Supervised Techniques: LSTM RNNs/GPU-MPCNNs
  5.23 Recent Tricks for Improving SL Deep NNs (Compare Sec. 5.6.2, 5.6.3)
  5.24 Consequences for Neuroscience
  5.25 DL with Spiking Neurons?
6 DL in FNNs and RNNs for Reinforcement Learning (RL)
  6.1 RL Through NN World Models Yields RNNs With Deep CAPs
  6.2 Deep FNNs for Traditional RL and Markov Decision Processes (MDPs)
  6.3 Deep RL RNNs for Partially Observable MDPs (POMDPs)
  6.4 RL Facilitated by Deep UL in FNNs and RNNs
  6.5 Deep Hierarchical RL (HRL) and Subgoal Learning with FNNs and RNNs
  6.6 Deep RL by Direct NN Search / Policy Gradients / Evolution
  6.7 Deep RL by Indirect Policy Search / Compressed NN Search
  6.8 Universal RL
7 Conclusion

1 Introduction to Deep Learning (DL) in Neural Networks (NNs)

Which modifiable components of a learning system are responsible for its success or failure? What changes to them improve performance? This has been called the fundamental credit assignment problem (Minsky, 1963). There are general credit assignment methods for universal problem solvers that are time-optimal in various theoretical senses (Sec. 6.8). The present survey, however, will focus on the narrower, but now commercially important, subfield of Deep Learning (DL) in Artificial Neural Networks (NNs). We are interested in accurate credit assignment across possibly many, often nonlinear, computational stages of NNs. Shallow NN-like models have been around for many decades if not centuries (Sec. 5.1). Models with several successive nonlinear layers of neurons date back at least to the 1960s (Sec. 5.3) and 1970s (Sec. 5.5).
An efficient gradient descent method for teacher-based Supervised Learning (SL) in discrete, differentiable networks of arbitrary depth called backpropagation (BP) was developed in the 1960s and 1970s, and applied to NNs in 1981 (Sec. 5.5). BP-based training of deep NNs with many layers, however, had been found to be difficult in practice by the late 1980s (Sec. 5.6), and had become an explicit research subject by the early 1990s (Sec. 5.9). DL became practically feasible to some extent through the help of Unsupervised Learning (UL) (e.g., Sec. 5.10, 5.15). The 1990s and 2000s also saw many improvements of purely supervised DL (Sec. 5). In the new millennium, deep NNs have finally attracted wide-spread attention, mainly by outperforming alternative machine learning methods such as kernel machines (Vapnik, 1995; Schölkopf et al., 1998) in numerous important applications. In fact, supervised deep NNs have won numerous official international pattern recognition competitions (e.g., Sec. 5.17, 5.19, 5.21, 5.22), achieving the first superhuman visual pattern recognition results in limited domains (Sec. 5.19). Deep NNs also have become relevant for the more general field of Reinforcement Learning (RL) where there is no supervising teacher (Sec. 6).

Both feedforward (acyclic) NNs (FNNs) and recurrent (cyclic) NNs (RNNs) have won contests (Sec. 5.12, 5.14, 5.17, 5.19, 5.21, 5.22). In a sense, RNNs are the deepest of all NNs (Sec. 3) — they are general computers more powerful than FNNs, and can in principle create and process memories of arbitrary sequences of input patterns (e.g., Siegelmann and Sontag, 1991; Schmidhuber, 1990a). Unlike traditional methods for automatic sequential program synthesis (e.g., Waldinger and Lee, 1969; Balzer, 1985; Soloway, 1986; Deville and Lau, 1994), RNNs can learn programs that mix sequential and parallel information processing in a natural and efficient way, exploiting the massive parallelism viewed as crucial for sustaining the rapid decline of computation cost observed over the past 75 years.

The rest of this paper is structured as follows. Sec. 2 introduces a compact, event-oriented notation that is simple yet general enough to accommodate both FNNs and RNNs. Sec. 3 introduces the concept of Credit Assignment Paths (CAPs) to measure whether learning in a given NN application is of the deep or shallow type. Sec. 4 lists recurring themes of DL in SL, UL, and RL. Sec. 5 focuses on SL and UL, and on how UL can facilitate SL, although pure SL has become dominant in recent competitions (Sec. 5.17-5.22).
Sec. 5 is arranged in a historical timeline format with subsections on important inspirations and technical contributions. Sec. 6 on deep RL discusses traditional Dynamic Programming (DP)-based RL combined with gradient-based search techniques for SL or UL in deep NNs, as well as general methods for direct and indirect search in the weight space of deep FNNs and RNNs, including successful policy gradient and evolutionary methods.

2 Event-Oriented Notation for Activation Spreading in FNNs/RNNs

Throughout this paper, let $i, j, k, t, p, q, r$ denote positive integer variables assuming ranges implicit in the given contexts. Let $n, m, T$ denote positive integer constants.

An NN's topology may change over time (e.g., Fahlman, 1991; Ring, 1991; Weng et al., 1992; Fritzke, 1994). At any given moment, it can be described as a finite subset of units (or nodes or neurons) $N = \{u_1, u_2, \ldots\}$ and a finite set $H \subseteq N \times N$ of directed edges or connections between nodes. FNNs are acyclic graphs, RNNs cyclic. The first (input) layer is the set of input units, a subset of $N$. In FNNs, the $k$-th layer ($k > 1$) is the set of all nodes $u \in N$ such that there is an edge path of length $k-1$ (but no longer path) between some input unit and $u$. There may be shortcut connections between distant layers. The NN's behavior or program is determined by a set of real-valued, possibly modifiable, parameters or weights $w_i$ ($i = 1, \ldots, n$). We now focus on a single finite episode or epoch of information processing and activation spreading, without learning through weight changes. The following slightly unconventional notation is designed to compactly describe what is happening during the runtime of the system.

During an episode, there is a partially causal sequence $x_t$ ($t = 1, \ldots, T$) of real values that I call events. Each $x_t$ is either an input set by the environment, or the activation of a unit that may directly depend on other $x_k$ ($k < t$) through a current NN topology-dependent set $in_t$ of indices $k$ representing incoming causal connections or links. Let the function $v$ encode topology information and map such event index pairs $(k, t)$ to weight indices. For example, in the non-input case we may have $x_t = f_t(net_t)$ with real-valued $net_t = \sum_{k \in in_t} x_k w_{v(k,t)}$ (additive case) or $net_t = \prod_{k \in in_t} x_k w_{v(k,t)}$ (multiplicative case), where $f_t$ is a typically nonlinear real-valued activation function such as $\tanh$. In many recent competition-winning NNs (Sec. 5.19, 5.21, 5.22) there also are events of the type $x_t = \max_{k \in in_t}(x_k)$; some network types may also use complex polynomial activation functions (Sec. 5.3). $x_t$ may directly affect certain $x_k$ ($k > t$) through outgoing connections or links represented through a current set $out_t$ of indices $k$ with $t \in in_k$. Some non-input events are called output events.

Note that many of the $x_t$ may refer to different, time-varying activations of the same unit in sequence-processing RNNs (e.g., Williams, 1989, "unfolding in time"), or also in FNNs sequentially exposed to time-varying input patterns of a large training set encoded as input events. During an episode, the same weight may get reused over and over again in topology-dependent ways, e.g., in RNNs, or in convolutional NNs (Sec. 5.4, 5.8). I call this weight sharing across space and/or time. Weight sharing may greatly reduce the NN's descriptive complexity, which is the number of bits of information required to describe the NN (Sec. 4.3).
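To make this notation concrete, here is a minimal sketch (my own illustration, not code from the paper) that spreads activations through a tiny feedforward event sequence using the additive rule $net_t = \sum_{k \in in_t} x_k w_{v(k,t)}$ with $f_t = \tanh$; the topology and weight values are made-up toy choices.

```python
import math

# Toy topology: events 1-2 are inputs; events 3-5 depend on earlier events.
# in_t maps each non-input event t to its incoming event indices,
# and w[(k, t)] plays the role of w_{v(k,t)}.
in_t = {3: [1, 2], 4: [1, 2], 5: [3, 4]}
w = {(1, 3): 0.5, (2, 3): -0.3,
     (1, 4): 0.8, (2, 4): 0.1,
     (3, 5): 1.2, (4, 5): -0.7}

x = {1: 0.9, 2: -0.4}  # input events set by the environment
for t in range(3, 6):
    net = sum(x[k] * w[(k, t)] for k in in_t[t])  # additive case
    x[t] = math.tanh(net)                         # f_t = tanh

print(x[5])  # activation of the output event
```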
In Supervised Learning (SL), certain NN output events $x_t$ may be associated with teacher-given, real-valued labels or targets $d_t$ yielding errors $e_t$, e.g., $e_t = \frac{1}{2}(x_t - d_t)^2$. A typical goal of supervised NN training is to find weights that yield episodes with small total error $E$, the sum of all such $e_t$. The hope is that the NN will generalize well in later episodes, causing only small errors on previously unseen sequences of input events. Many alternative error functions for SL and UL are possible.

SL assumes that input events are independent of earlier output events (which may affect the environment through actions causing subsequent perceptions). This assumption does not hold in the broader fields of Sequential Decision Making and Reinforcement Learning (RL) (Kaelbling et al., 1996; Sutton and Barto, 1998; Hutter, 2005) (Sec. 6). In RL, some of the input events may encode real-valued reward signals given by the environment, and a typical goal is to find weights that yield episodes with a high sum of reward signals, through sequences of appropriate output actions.

Sec. 5.5 will use the notation above to compactly describe a central algorithm of DL, namely, backpropagation (BP) for supervised weight-sharing FNNs and RNNs. (FNNs may be viewed as RNNs with certain fixed zero weights.) Sec. 6 will address the more general RL case.

3 Depth of Credit Assignment Paths (CAPs) and of Problems

To measure whether credit assignment in a given NN application is of the deep or shallow type, I introduce the concept of Credit Assignment Paths or CAPs, which are chains of possibly causal links between events.

Let us first focus on SL. Consider two events $x_p$ and $x_q$ ($1 \le p < q \le T$). Depending on the application, they may have a Potential Direct Causal Connection (PDCC) expressed by the Boolean predicate $pdcc(p, q)$, which is true if and only if $p \in in_q$. Then the 2-element list $(p, q)$ is defined to be a CAP from $p$ to $q$ (a minimal one). A learning algorithm may be allowed to change $w_{v(p,q)}$ to improve performance in future episodes.

More general, possibly indirect, Potential Causal Connections (PCC) are expressed by the recursively defined Boolean predicate $pcc(p, q)$, which in the SL case is true only if $pdcc(p, q)$, or if $pcc(p, k)$ for some $k$ and $pdcc(k, q)$. In the latter case, appending $q$ to any CAP from $p$ to $k$ yields a CAP from $p$ to $q$ (this is a recursive definition, too). The set of such CAPs may be large but is finite. Note that the same weight may affect many different PDCCs between successive events listed by a given CAP, e.g., in the case of RNNs, or weight-sharing FNNs.

Suppose a CAP has the form $(\ldots, k, t, \ldots, q)$, where $k$ and $t$ (possibly $t = q$) are the first successive elements with modifiable $w_{v(k,t)}$. Then the length of the suffix list $(t, \ldots, q)$ is called the CAP's depth (which is 0 if there are no modifiable links at all). This depth limits how far backwards credit assignment can move down the causal chain to find a modifiable weight. (An alternative would be to count only modifiable links when measuring depth. In many typical NN applications this would not make a difference, but in some it would, e.g., Sec. 6.1.)

Suppose an episode and its event sequence $x_1, \ldots, x_T$ satisfy a computable criterion used to decide whether a given problem has been solved (e.g., total error $E$ below some threshold). Then the set of used weights is called a solution to the problem, and the depth of the deepest CAP within the sequence is called the solution's depth. There may be other solutions (yielding different event sequences) with different depths. Given some fixed NN topology, the smallest depth of any solution is called the problem's depth.

Sometimes we also speak of the depth of an architecture: SL FNNs with fixed topology imply a problem-independent maximal problem depth bounded by the number of non-input layers. Certain SL RNNs with fixed weights for all connections except those to output units (Jaeger, 2001; Maass et al., 2002; Jaeger, 2004; Schrauwen et al., 2007) have a maximal problem depth of 1, because only the final links in the corresponding CAPs
are modifiable. In general, however, RNNs may learn to solve problems of potentially unlimited depth.

Note that the definitions above are solely based on the depths of causal chains, and agnostic of the temporal distance between events. For example, shallow FNNs perceiving large "time windows" of input events may correctly classify long input sequences through appropriate output events, and thus solve shallow problems involving long time lags between relevant events.

At which problem depth does Shallow Learning end, and Deep Learning begin? Discussions with DL experts have not yet yielded a conclusive response to this question. Instead of committing myself to a precise answer, let me just define for the purposes of this overview: problems of depth > 10 require Very Deep Learning.

The difficulty of a problem may have little to do with its depth. Some NNs can quickly learn to solve certain deep problems, e.g., through random weight guessing (Sec. 5.9) or other types of direct search (Sec. 6.6) or indirect search (Sec. 6.7) in weight space, or through training an NN first on shallow problems whose solutions may then generalize to deep problems, or through collapsing sequences of (non)linear operations into a single (non)linear operation — but see an analysis of non-trivial aspects of deep linear networks (Baldi and Hornik, 1994, Section B). In general, however, finding an NN that precisely models a given training set is an NP-complete problem (Judd, 1990; Blum and Rivest, 1992), also in the case of deep NNs (Síma, 1994; de Souto et al., 1999; Windisch, 2005); compare a survey of negative results (Síma, 2002, Section 1).

Above we have focused on SL. In the more general case of RL in unknown environments, $pcc(p, q)$ is also true if $x_p$ is an output event and $x_q$ any later input event — any action may affect the environment and thus any later perception. (In the real world, the environment may even influence non-input events computed on a physical hardware entangled with the entire universe, but this is ignored here.) It is possible to model and replace such unmodifiable environmental PCCs through a part of the NN that has already learned to predict (through some of its units) input events (including reward signals) from former input events and actions (Sec. 6.1). Its weights are frozen, but can help to assign credit to other, still modifiable weights used to compute actions (Sec. 6.1). This approach may lead to very deep CAPs though.

Some DL research is about automatically rephrasing problems such that their depth is reduced (Sec. 4).
In particular, sometimes UL is used to make SL problems less deep, e.g., Sec. 5.10. Often Dynamic Programming (Sec. 4.1) is used to facilitate certain traditional RL problems, e.g., Sec. 6.2. Sec. 5 focuses on CAPs for SL, Sec. 6 on the more complex case of RL.

4 Recurring Themes of Deep Learning

4.1 Dynamic Programming (DP) for DL

One recurring theme of DL is Dynamic Programming (DP) (Bellman, 1957), which can help to facilitate credit assignment under certain assumptions. For example, in SL NNs, backpropagation itself can be viewed as a DP-derived method (Sec. 5.5). In traditional RL based on strong Markovian assumptions, DP-derived methods can help to greatly reduce problem depth (Sec. 6.2). DP algorithms are also essential for systems that combine concepts of NNs and graphical models, such as Hidden Markov Models (HMMs) (Stratonovich, 1960; Baum and Petrie, 1966) and Expectation Maximization (EM) (Dempster et al., 1977), e.g., (Bottou, 1991; Bengio, 1991; Bourlard and Morgan, 1994; Baldi and Chauvin, 1996; Jordan and Sejnowski, 2001; Bishop, 2006; Poon and Domingos, 2011; Dahl et al., 2012; Hinton et al., 2012a).

4.2 Unsupervised Learning (UL) Facilitating Supervised Learning (SL) and RL

Another recurring theme is how UL can facilitate both SL (Sec. 5) and RL (Sec. 6). UL (Sec. 5.6.4) is normally used to encode raw incoming data such as video or speech streams in a form that is more convenient for subsequent goal-directed learning. In particular, codes that describe the original data in a less redundant or more compact way can be fed into SL (Sec. 5.10, 5.15) or RL machines (Sec. 6.4), whose search spaces may thus become smaller (and whose CAPs shallower) than those necessary for dealing with the raw data. UL is closely connected to the topics of regularization and compression (Sec. 4.3, 5.6.3).
4.3 Occam's Razor: Compression and Minimum Description Length (MDL)

Occam's razor favors simple solutions over complex ones. Given some programming language, the principle of Minimum Description Length (MDL) can be used to measure the complexity of a solution candidate by the length of the shortest program that computes it (e.g., Solomonoff, 1964; Kolmogorov, 1965b; Chaitin, 1966; Wallace and Boulton, 1968; Levin, 1973a; Rissanen, 1986; Blumer et al., 1987; Li and Vitányi, 1997; Grünwald et al., 2005). Some methods explicitly take into account program runtime (Allender, 1992; Watanabe, 1992; Schmidhuber, 2002, 1995); many consider only programs with constant runtime, written in non-universal programming languages (e.g., Rissanen, 1986; Hinton and van Camp, 1993). In the NN case, the MDL principle suggests that low NN weight complexity corresponds to high NN probability in the Bayesian view (e.g., MacKay, 1992; Buntine and Weigend, 1991; De Freitas, 2003), and to high generalization performance (e.g., Baum and Haussler, 1989), without overfitting the training data. Many methods have been proposed for regularizing NNs, that is, searching for solution-computing, low-complexity SL NNs (Sec. 5.6.3) and RL NNs (Sec. 6.7). This is closely related to certain UL methods (Sec. 4.2, 5.6.4).

4.4 Learning Hierarchical Representations Through Deep SL, UL, RL

Many methods of Good Old-Fashioned Artificial Intelligence (GOFAI) (Nilsson, 1980) as well as more recent approaches to AI (Russell et al., 1995) and Machine Learning (Mitchell, 1997) learn hierarchies of more and more abstract data representations. For example, certain methods of syntactic pattern recognition (Fu, 1977) such as grammar induction discover hierarchies of formal rules to model observations. The partially (un)supervised Automated Mathematician / EURISKO (Lenat, 1983; Lenat and Brown, 1984) continually learns concepts by combining previously learnt concepts. Such hierarchical representation learning (Ring, 1994; Bengio et al., 2013; Deng and Yu, 2014) is also a recurring theme of DL NNs for SL (Sec. 5), UL-aided SL (Sec. 5.7, 5.10, 5.15), and hierarchical RL (Sec. 6.5). Often, abstract hierarchical representations are natural by-products of data compression (Sec. 4.3), e.g., Sec. 5.10.

4.5 Fast Graphics Processing Units (GPUs) for DL in NNs

While the previous millennium saw several attempts at creating fast NN-specific hardware (e.g., Jackel et al., 1990; Faggin, 1992; Ramacher et al., 1993; Widrow et al., 1994; Heemskerk, 1995; Korkin et al., 1997; Urlbe, 1999), and at exploiting standard hardware (e.g., Anguita et al., 1994; Muller et al., 1995; Anguita and Gomes, 1996), the new millennium brought a DL breakthrough in form of cheap, multi-processor graphics cards or GPUs. GPUs are widely used for video games, a huge and competitive market that has driven down hardware prices. GPUs excel at fast matrix and vector multiplications required not only for convincing virtual realities but also for NN training, where they can speed up learning by a factor of 50 and more. Some of the GPU-based FNN implementations (Sec. 5.16-5.19) have greatly contributed to recent successes in contests for pattern recognition (Sec. 5.19-5.22), image segmentation (Sec. 5.21), and object detection (Sec. 5.21-5.22).

5 Supervised NNs, Some Helped by Unsupervised NNs

The main focus of current practical applications is on Supervised Learning (SL), which has dominated recent pattern recognition contests (Sec. 5.17-5.22). Several methods, however, use additional Unsupervised Learning (UL) to facilitate SL (Sec. 5.7, 5.10, 5.15). It does make sense to treat SL and UL in the same section: often gradient-based methods, such as
BP (Sec. 5.5.1), are used to optimize objective functions of both UL and SL, and the boundary between SL and UL may blur, for example, when it comes to time series prediction and sequence classification, e.g., Sec. 5.10, 5.12.

A historical timeline format will help to arrange subsections on important inspirations and technical contributions (although such a subsection may span a time interval of many years). Sec. 5.1 briefly mentions early, shallow NN models since the 1940s, Sec. 5.2 additional early neurobiological inspiration relevant for modern Deep Learning (DL). Sec. 5.3 is about GMDH networks (since 1965), perhaps the first (feedforward) DL systems. Sec. 5.4 is about the relatively deep Neocognitron NN (1979) which is similar to certain modern deep FNN architectures, as it combines convolutional NNs (CNNs), weight pattern replication, and winner-take-all (WTA) mechanisms. Sec. 5.5 uses the notation of Sec. 2 to compactly describe a central algorithm of DL, namely, backpropagation (BP) for supervised weight-sharing FNNs and RNNs. It also summarizes the history of BP 1960-1981 and beyond. Sec. 5.6 describes problems encountered in the late 1980s with BP for deep NNs, and mentions several ideas from the previous millennium to overcome them. Sec. 5.7 discusses a first hierarchical stack of coupled UL-based Autoencoders (AEs) — this concept resurfaced in the new millennium (Sec. 5.15). Sec. 5.8 is about applying BP to CNNs, which is important for today's DL applications. Sec. 5.9 explains BP's Fundamental DL Problem (of vanishing/exploding gradients) discovered in 1991. Sec. 5.10 explains how a deep RNN stack of 1991 (the History Compressor) pre-trained by UL helped to solve previously unlearnable DL benchmarks requiring Credit Assignment Paths (CAPs, Sec. 3) of depth 1000 and more. Sec. 5.11 discusses a particular WTA method called Max-Pooling (MP) important in today's DL FNNs. Sec. 5.12 mentions a first important contest won by SL NNs in 1994. Sec. 5.13 describes a purely supervised DL RNN (Long Short-Term Memory, LSTM) for problems of depth 1000 and more. Sec. 5.14 mentions an early contest of 2003 won by an ensemble of shallow NNs, as well as good pattern recognition results with CNNs and LSTM RNNs (2003). Sec. 5.15 is mostly about Deep Belief Networks (DBNs, 2006) and related stacks of Autoencoders (AEs, Sec. 5.7) pre-trained by UL to facilitate BP-based SL. Sec. 5.16 mentions the first BP-trained MPCNNs (2007) and GPU-CNNs (2006). Sec. 5.17-5.22 focus on official competitions with secret test sets won by (mostly purely supervised) DL NNs since 2009, in sequence recognition, image classification, image segmentation, and object detection.
Many RNN results depended on LSTM (Sec. 5.13); many FNN results depended on GPU-based FNN code developed since 2004 (Sec. 5.16, 5.17, 5.18, 5.19), in particular, GPU-MPCNNs (Sec. 5.19).

5.1 1940s and Earlier

NN research started in the 1940s (e.g., McCulloch and Pitts, 1943; Hebb, 1949); compare also later work on learning NNs (Rosenblatt, 1958, 1962; Widrow and Hoff, 1962; Grossberg, 1969; Kohonen, 1972; von der Malsburg, 1973; Narendra and Thathatchar, 1974; Willshaw and von der Malsburg, 1976; Palm, 1980; Hopfield, 1982). In a sense NNs have been around even longer, since early supervised NNs were essentially variants of linear regression methods going back at least to the early 1800s (e.g., Legendre, 1805; Gauss, 1809, 1821). Early NNs had a maximal CAP depth of 1 (Sec. 3).

5.2 Around 1960: More Neurobiological Inspiration for DL

Simple cells and complex cells were found in the cat's visual cortex (e.g., Hubel and Wiesel, 1962; Wiesel and Hubel, 1959). These cells fire in response to certain properties of visual sensory inputs, such as the orientation of edges. Complex cells exhibit more spatial invariance than simple cells. This inspired later deep NN architectures (Sec. 5.4) used in certain modern award-winning Deep Learners (Sec. 5.19-5.22).

5.3 1965: Deep Networks Based on the Group Method of Data Handling (GMDH)

Networks trained by the Group Method of Data Handling (GMDH) (Ivakhnenko and Lapa, 1965; Ivakhnenko et al., 1967; Ivakhnenko, 1968, 1971) were perhaps the first DL systems of the Feedforward Multilayer Perceptron type. The units of GMDH nets may have polynomial activation functions implementing Kolmogorov-Gabor polynomials (more general than traditional NN activation functions). Given a training set, layers are incrementally grown and trained by regression analysis, then pruned with the help of a separate validation set (using today's terminology), where Decision Regularisation is used to weed out superfluous units. The numbers of layers and units per layer can be learned in problem-dependent fashion.
This is a good example of hierarchical representation learning (Sec. 4.4). There have been numerous applications of GMDH-style networks, e.g. (Ikeda et al., 1976; Farlow, 1984; Madala and Ivakhnenko, 1994; Ivakhnenko, 1995; Kondo, 1998; Kordík et al., 2003; Witczak et al., 2006; Kondo and Ueno, 2008).

5.4 1979: Convolution + Weight Replication + Winner-Take-All (WTA)

Apart from deep GMDH networks (Sec. 5.3), the Neocognitron (Fukushima, 1979, 1980, 2013a) was perhaps the first artificial NN that deserved the attribute deep, and the first to incorporate the neurophysiological insights of Sec. 5.2. It introduced convolutional NNs (today often called CNNs or convnets), where the (typically rectangular) receptive field of a convolutional unit with given weight vector is shifted step by step across a 2-dimensional array of input values, such as the pixels of an image. The resulting 2D array of subsequent activation events of this unit can then provide inputs to higher-level units, and so on. Due to massive weight replication (Sec. 2), relatively few parameters may be necessary to describe the behavior of such a convolutional layer.

Competition layers have WTA subsets whose maximally active units are the only ones to adopt non-zero activation values. They essentially "down-sample" the competition layer's input. This helps to create units whose responses are insensitive to small image shifts (compare Sec. 5.2).

The Neocognitron is very similar to the architecture of modern, contest-winning, purely supervised, feedforward, gradient-based Deep Learners with alternating convolutional and competition layers (e.g., Sec. 5.19-5.22). Fukushima, however, did not set the weights by supervised backpropagation (Sec. 5.5, 5.8), but by local unsupervised learning rules (e.g., Fukushima, 2013b), or by pre-wiring. In that sense he did not care for the DL problem (Sec. 5.9), although his architecture was comparatively deep indeed. He also used Spatial Averaging (Fukushima, 1980, 2011) instead of Max-Pooling (MP, Sec. 5.11), currently a particularly convenient and popular WTA mechanism. Today's CNN-based DL machines profit a lot from later CNN work (e.g., LeCun et al., 1989; Ranzato et al., 2007) (Sec. 5.8, 5.16, 5.19).

5.5 1960-1981 and Beyond: Development of Backpropagation (BP) for NNs

The minimisation of errors through gradient descent (Hadamard, 1908) in the parameter space of complex, nonlinear, differentiable, multi-stage, NN-related systems has been discussed at least since the early 1960s (e.g., Kelley, 1960; Bryson, 1961; Bryson and Denham, 1961; Pontryagin et al., 1961; Dreyfus, 1962; Wilkinson, 1965; Amari, 1967; Bryson and Ho, 1969; Director and Rohrer, 1969; Griewank, 2012), initially within the framework of Euler-Lagrange equations in the Calculus of Variations (e.g., Euler, 1744). Steepest descent in such systems can be performed (Bryson, 1961; Kelley, 1960; Bryson and Ho, 1969) by iterating the ancient chain rule (Leibniz, 1676; L'Hôpital, 1696) in Dynamic Programming (DP) style (Bellman, 1957). A simplified derivation of the method uses the chain rule only (Dreyfus, 1962).

The methods of the 1960s were already efficient in the DP sense. However, they backpropagated derivative information through standard Jacobian matrix calculations from one "layer" to the previous one, explicitly addressing neither direct links across several layers nor potential additional efficiency gains due to network sparsity (but perhaps such enhancements seemed obvious to the authors).
Unsupervised Reinforcement Learning

What is unsupervised reinforcement learning, and why is it so important in machine learning? Unsupervised Reinforcement Learning is a technique that combines unsupervised learning with reinforcement learning. In traditional reinforcement learning, an agent learns an optimal policy by interacting with its environment; in unsupervised learning, an agent learns the intrinsic structure of data by observing its statistical features. The goal of unsupervised reinforcement learning is to have the agent explore the environment autonomously and learn an unsupervised representation that supports the reinforcement learning task.

To appreciate why this matters, first recall some limitations of traditional reinforcement learning. There, the agent learns an optimal policy through interaction with the environment, which usually requires large numbers of samples and corresponding reward signals. In many practical problems, however, obtaining these reward signals is expensive or difficult. Moreover, reinforcement learning algorithms usually need long training times, because the agent must repeatedly try and adjust its policy before finding an optimal solution.

Combining unsupervised learning with traditional reinforcement learning can overcome some of these limitations. Unsupervised learning lets the agent extract useful information from unlabeled data without any reward signal. By discovering the intrinsic structure and patterns of the data, the agent can build more effective representations, improving the efficiency of learning and decision making. Unsupervised learning can also help the agent predict future states of the environment, further improving learning speed and accuracy.

How, then, is unsupervised reinforcement learning implemented? Some common techniques follow.

1. Autoencoders: an autoencoder is an unsupervised neural network that learns the structure of data by compressing inputs into a latent representation and reconstructing them. In reinforcement learning, an autoencoder can help the agent learn a feature representation of the environment and extract useful information from it, as in the sketch below.
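A minimal PyTorch sketch, assuming for illustration flat 64-dimensional observations and arbitrary layer sizes: the autoencoder is trained on collected states, and its encoder output can then serve as a compact state representation for a downstream policy.

```python
import torch
import torch.nn as nn

# Encoder compresses a 64-d observation to an 8-d latent code;
# decoder reconstructs the observation from the code.
encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 8))
decoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 64))

params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.MSELoss()

states = torch.randn(256, 64)  # stand-in for collected observations
for epoch in range(100):
    recon = decoder(encoder(states))
    loss = loss_fn(recon, states)  # reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()

latent = encoder(states)  # 8-d codes usable as compact RL state features
```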
2. Generative Adversarial Networks (GANs): a GAN consists of a generator and a discriminator trained against each other. The generator tries to produce samples that resemble real data, while the discriminator tries to distinguish real samples from generated ones. In reinforcement learning, GANs can be used to generate new environment states or reward signals, providing additional training samples and opportunities for exploration; a toy sketch follows.
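A minimal adversarial training loop in PyTorch, on a made-up one-dimensional task (matching samples from N(3, 1)); network sizes, learning rates, and step counts are illustrative assumptions, not tuned values.

```python
import torch
import torch.nn as nn

# Toy task: learn to generate samples from N(3, 1) given 2-d noise.
G = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) + 3.0  # real samples
    fake = G(torch.randn(64, 2))     # generated samples

    # Discriminator: push real toward 1, fake toward 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + \
             bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: fool the discriminator into outputting 1.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print(G(torch.randn(1000, 2)).mean().item())  # should approach 3.0
```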
Fundamentals of Supervised Learning Algorithms

Chapter 3: Supervised Learning Algorithms

Supervised learning is also known as classification or inductive learning. It applies to almost every domain, including text and web processing. Given a dataset D, the goal of machine learning is to produce a classification/prediction function that links the set of attribute values A to the set of class labels C; this function can then predict the class labels of new attribute sets. The function is also called a classification model or prediction model, and it can take any form, for example a decision tree, a rule set, a Bayesian model, or a hyperplane.

In supervised learning, the data come with class labels. In contrast, in unsupervised learning all class attributes are unknown, and the algorithm must derive the classes automatically from the features of the dataset. The dataset the algorithm learns from is called the training set; once a model has been learned from the training set, a test set is used to evaluate its accuracy. The most basic assumption of machine learning is that the distribution of the training data matches the distribution of the test data.

Training algorithm: a training algorithm is the procedure by which the model's parameters are computed from a given set of samples. This section briefly introduces several commonly used machine learning algorithms, such as decision trees, naive Bayes, neural networks, support vector machines, linear least-squares fitting, kNN, and maximum entropy.
3.1 Two-Class Perceptron: see the textbook.

3.2 Multi-Class Perceptron: see the textbook.

3.3 Decision Tree Algorithms

Decision tree learning is the most widely used classification technique; its accuracy is competitive with other algorithms, and it is very efficient. A decision tree is a predictive model: it represents a mapping between object attributes and object values. Each internal node of the tree represents an attribute, each branch a possible value of that attribute, and each leaf the value (class) of the objects described by the path from the root to that leaf. A decision tree has a single output; if multiple outputs are needed, separate decision trees can be built for each output.
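A minimal train-and-evaluate sketch with scikit-learn's decision tree on the bundled iris data; the depth limit and split ratio are arbitrary illustrative choices. The held-out test split reflects the assumption above that training and test data share a distribution.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Learn a classification model from the training set...
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
# ...and measure its accuracy on the test set.
print("test accuracy:", tree.score(X_test, y_test))
```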
Supervised Learning and Unsupervised Learning

The common methods of machine learning fall mainly into supervised learning and unsupervised learning. Supervised learning is what people usually mean by classification: an optimal model is trained from existing training samples, that is, known inputs together with their corresponding outputs. The model belongs to some set of functions, and "optimal" means best under some evaluation criterion. The trained model then maps every input to an output, and a simple decision on that output yields a classification, giving the system the ability to classify unseen data.

This mirrors how people learn about things: from childhood, adults tell us "this is a bird", "that is a pig", "that is a house", and so on. The scenes we see are the input data, and the adults' judgments about them (house or bird) are the corresponding outputs. As our experience grows, the brain gradually forms generalized models; these correspond to the trained function(s), so that even without an adult's guidance we can tell houses from birds. Typical examples of supervised learning are kNN and SVM.
Unsupervised learning (also called non-supervised learning; the terms are interchangeable) is another heavily studied approach. It differs from supervised learning in that there are no training samples at all; the data must be modeled directly. This may sound implausible, but we use unsupervised learning in many places as we learn about the world ourselves. For example, at an art exhibition, even knowing nothing about art, after viewing enough works we can still divide them into different schools, say, which are more impressionistic and which more realistic; even without knowing the words "impressionist" or "realist", we can at least separate them into two groups. The typical example of unsupervised learning is clustering. The goal of clustering is to group similar things together, without caring what each group is; so a clustering algorithm usually only needs to know how to compute similarity before it can start working.
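To make that last point concrete, here is a sketch that clusters short strings given nothing but a hand-rolled pairwise dissimilarity function (a crude set-of-characters measure, chosen purely for illustration). Note that the metric="precomputed" parameter of AgglomerativeClustering was named affinity in scikit-learn versions before 1.2.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Items clustered purely through a pairwise distance function.
items = ["apple", "apples", "banana", "bananas", "cherry"]

def dissimilarity(a, b):
    sa, sb = set(a), set(b)
    return 1 - len(sa & sb) / len(sa | sb)  # 1 - Jaccard similarity

D = np.array([[dissimilarity(a, b) for b in items] for a in items])

# metric="precomputed" lets the algorithm work from distances alone.
labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average").fit_predict(D)
print(dict(zip(items, labels)))
```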
So when should supervised learning be used, and when unsupervised learning? I only began to think seriously about the answer after being asked this question in an interview.
[Semi-Supervised Classification] (1): An Overview of Semi-Supervised Learning

Semi-Supervised Learning (SSL) is a subfield of Machine Learning (ML).

I. ML has two basic types of learning task:

1. Supervised Learning (SL): from input-output sample pairs $L = \{(x_1, y_1), \ldots, (x_l, y_l)\}$, learn a mapping $f: X \to Y$ from inputs to outputs, in order to predict the outputs of test examples. SL comprises classification and regression. In classification, the examples satisfy $x_i \in \mathbb{R}^m$ (the input space) with class labels $y_i \in \{c_1, c_2, \ldots, c_c\}$, $c_j \in \mathbb{N}$; in regression, the inputs satisfy $x_i \in \mathbb{R}^m$ and the outputs $y_i \in \mathbb{R}$ (the output space).

2. Unsupervised Learning (UL): use the information contained in unlabeled examples $U = \{x_1, \ldots, x_n\}$ to learn their corresponding labels $Y_u = [y_1 \cdots y_n]^T$, and use the learned labels to partition the examples into clusters or to find the low-dimensional structure of high-dimensional input data. UL comprises clustering and dimensionality reduction.

II. Semi-Supervised Learning (SSL): in many practical ML applications it is easy to collect huge numbers of unlabeled examples, whereas labeled samples require special equipment or expensive, time-consuming manual annotation, leaving very few labeled samples and a surplus of unlabeled ones. Researchers therefore add large numbers of unlabeled examples to the limited labeled samples and train on both together, hoping to improve learning performance; this gave rise to SSL, as shown in Figure 1. SSL avoids wasting data and resources, and at the same time addresses the weak generalization of purely supervised models and the imprecision of purely unsupervised ones.
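A minimal SSL sketch using scikit-learn's LabelSpreading on the bundled iris data, hiding 90% of the labels (marked with -1, the library's convention for unlabeled points); the kernel and neighbor count are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading

X, y = load_iris(return_X_y=True)

# Pretend only about 10% of the labels are known; unlabeled points get -1.
rng = np.random.default_rng(0)
y_partial = y.copy()
unlabeled = rng.random(len(y)) > 0.1
y_partial[unlabeled] = -1

# Propagate the few known labels over the data's neighborhood graph.
model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)
acc = (model.transduction_[unlabeled] == y[unlabeled]).mean()
print("accuracy on originally unlabeled points:", round(acc, 3))
```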
Unsupervised Learning for Anomaly Detection: Practice and Comparative Analysis

Anomaly detection is an important task in machine learning; its goal is to identify abnormal or unusual behavior in data. Traditional anomaly detection methods usually rely on manual labeling or hand-written rules, which limits their applicability and scalability. Unsupervised learning, which needs no manual labels, has therefore been widely applied to anomaly detection in recent years. This article examines the practice of unsupervised learning in anomaly detection and compares it with traditional methods. We first introduce common unsupervised algorithms, covering clustering, density estimation, and subspace analysis, and then discuss how these algorithms are applied to anomaly detection, comparing their strengths and weaknesses.
Clustering is a common unsupervised approach that partitions a dataset into clusters or groups. In anomaly detection, a clustering algorithm can assign the normal samples to clusters, while anomalous samples fall into other clusters or remain isolated points. Commonly used clustering algorithms include K-means, hierarchical clustering, and DBSCAN. These algorithms detect anomalies by measuring the distance of a sample to its cluster center or the similarity between samples.
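A minimal distance-to-centroid sketch on synthetic 2-D data: each point is scored by its distance to the nearest K-means center, and the top 2% of scores are flagged. The cluster count and threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(300, 2))
outliers = rng.uniform(-8, 8, size=(5, 2))
X = np.vstack([normal, outliers])

# Score each point by its distance to the nearest cluster center.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dist = np.min(km.transform(X), axis=1)  # distance to closest center
threshold = np.percentile(dist, 98)
print("flagged anomalies:", np.where(dist > threshold)[0])
```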
Another common family is density estimation, which identifies anomalies by estimating the density distribution of the dataset. LOF (Local Outlier Factor) is a local-density-based algorithm: it judges whether a sample is anomalous from the ratio between the density around its neighbors and its own local density. Anomaly detection methods based on Gaussian Mixture Models are also widely used; they build a probabilistic model of the data distribution and flag samples that fit the model poorly.
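A minimal LOF sketch with scikit-learn on the same kind of synthetic data; the n_neighbors and contamination values are illustrative guesses that would need tuning on real data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),
               rng.uniform(-8, 8, size=(5, 2))])

# fit_predict returns -1 for outliers and 1 for inliers.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
labels = lof.fit_predict(X)
print("flagged anomalies:", np.where(labels == -1)[0])
```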
Besides clustering and density estimation, subspace-analysis algorithms are also commonly used for unsupervised anomaly detection. Subspace analysis maps the data into a low-dimensional subspace that captures its main features. In anomaly detection, a point that does not fit those main features, that is, one that deviates strongly from the principal subspace, can be flagged as anomalous. Common subspace methods include Principal Component Analysis (PCA) and subspace clustering.
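A minimal PCA reconstruction-error sketch on synthetic data whose normal points lie near a 2-D plane inside a 10-D space; points far from the plane reconstruct poorly and receive high anomaly scores. The dimensions and noise level are made-up illustrative values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Normal data lies near a 2-d plane inside a 10-d space.
basis = rng.normal(size=(2, 10))
X = rng.normal(size=(300, 2)) @ basis + 0.05 * rng.normal(size=(300, 10))
X = np.vstack([X, rng.normal(size=(5, 10)) * 3])  # off-plane anomalies

# Project onto the principal subspace and measure reconstruction error.
pca = PCA(n_components=2).fit(X)
recon = pca.inverse_transform(pca.transform(X))
error = np.linalg.norm(X - recon, axis=1)
print("flagged anomalies:", np.argsort(error)[-5:])
```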
Compared with traditional anomaly detection methods, unsupervised learning algorithms offer many advantages in practice.