Deep Sentence Embedding Using Long Short-Term Memory Networks

Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, Rabab Ward

Abstract—This paper develops a model that addresses sentence embedding, a hot topic in current natural language processing research, using recurrent neural networks (RNN) with Long Short-Term Memory (LSTM) cells. The proposed LSTM-RNN model sequentially takes each word in a sentence, extracts its information, and embeds it into a semantic vector. Due to its ability to capture long term memory, the LSTM-RNN accumulates increasingly richer information as it goes through the sentence, and when it reaches the last word, the hidden layer of the network provides a semantic representation of the whole sentence. In this paper, the LSTM-RNN is trained in a weakly supervised manner on user click-through data logged by a commercial web search engine. Visualization and analysis are performed to understand how the embedding process works. The model is found to automatically attenuate the unimportant words and detect the salient keywords in the sentence. Furthermore, these detected keywords are found to automatically activate different cells of the LSTM-RNN, where words belonging to a similar topic activate the same cell. As a semantic representation of the sentence, the embedding vector can be used in many different applications. These automatic keyword detection and topic allocation abilities enabled by the LSTM-RNN allow the network to perform document retrieval, a difficult language processing task, where the similarity between the query and documents can be measured by the distance between their corresponding sentence embedding vectors computed by the LSTM-RNN. On a web search task, the LSTM-RNN embedding is shown to significantly outperform several existing state-of-the-art methods.

Index Terms—Deep Learning, Long Short-Term Memory, Sentence Embedding.

I. INTRODUCTION

LEARNING a good representation (or features) of input data is an important task in machine learning. In text and language processing, one such problem is learning of an embedding vector for a sentence; that is, to train a model that can automatically transform a sentence to a vector that encodes the semantic meaning of the sentence. While word embedding is learned using a loss function defined on word pairs, sentence embedding is learned using sentence pairs. As a result, sentence embedding can better discover salient words and topics in a sentence, and thus is more suitable for tasks that require computing semantic similarities between text strings. By mapping texts into a unified semantic representation, the embedding vector can be further used for different language processing applications, such as machine translation [1], sentiment analysis [2], and information retrieval [3]. In machine translation, a recurrent neural network (RNN) with Long Short-Term Memory (LSTM) cells, or the LSTM-RNN, is used to encode an English sentence into a vector, which contains the semantic meaning of the input sentence, and then another LSTM-RNN is used to generate a French sentence from the vector. The model is trained to best predict the output sentence. In [2], a paragraph vector is learned in an unsupervised manner as a distributed representation of sentences and documents, which are then used for sentiment analysis. Sentence embedding can also be applied to information retrieval, where the contextual information is properly represented by vectors in the same space for fuzzy text matching [3].

In this paper, we propose to use an RNN to sequentially accept each word in a sentence and recurrently map it into a latent space together with the historical information. As the RNN reaches the last word in the sentence, the hidden activations form a natural embedding vector for the contextual information of the sentence. We further incorporate LSTM cells into the RNN model (i.e., the LSTM-RNN) to address the difficulty of learning long term memory in RNN. The learning of such a model is performed in a weakly supervised manner on the click-through data logged by a commercial web search engine. Although manually labelled data are insufficient in machine learning, logged data with limited feedback signals are massively available due to the widely used commercial web search engines. Limited feedback information such as click-through data provides a weak supervision signal that indicates the semantic similarity between the text on the query side and the clicked text on the document side. To exploit such a signal, the objective of our training is to maximize the similarity between the two vectors mapped by the LSTM-RNN from the query and the clicked document, respectively.

H. Palangi and R. Ward are with the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada (e-mail: {hamidp, rababw}@ece.ubc.ca).

L. Deng, Y. Shen, J. Gao, X. He, J. Chen and X. Song are with Microsoft Research, Redmond, WA 98052 USA (e-mail: {deng, jfgao, xiaohe, yeshen, jianshuc, xinson}@microsoft.com).

arXiv:1502.06922v2 [cs.CL] 5 Jul 2015


Consequently, the learned vector is expected to represent different sentences of a similar meaning.

An important contribution of this paper is to analyse the embedding process of the LSTM-RNN by visualizing the internal activation behaviours in response to different text inputs. We show that the embedding process of the learned LSTM-RNN effectively detects the keywords, while attenuating less important words in the sentence automatically, by switching on and off the gates within the LSTM-RNN cells. We further show that different cells in the learned model indeed correspond to different topics, and the keywords associated with a similar topic activate the same cell unit in the model. As the LSTM-RNN reads to the end of the sentence, the topic activation accumulates and the hidden vector at the last word encodes the rich contextual information of the entire sentence. For this reason, a natural application of sentence embedding is web search ranking, in which the embedding vector from the query can be used to match the embedding vectors of the candidate documents according to the maximum cosine similarity rule. Evaluated on a real web document ranking task, our proposed method significantly outperforms all the existing state-of-the-art methods in NDCG scores.

II. RELATED WORK

Inspired by the word embedding method [4], [5], the authors in [2] proposed an unsupervised learning method to learn a paragraph vector as a distributed representation of sentences and documents, which are then used for sentiment analysis with superior performance. However, the model is not designed to capture the fine-grained sentence structure.

Similar to the recurrent models in this paper, the DSSM [3] and CLSM [6] models, developed for information retrieval, can also be interpreted as sentence embedding methods. However, DSSM treats the input sentence as a bag-of-words and does not model word dependencies explicitly. CLSM treats a sentence as a bag of n-grams, where n is defined by a window, and can capture local word dependencies; a max-pooling layer is then used to form a global feature vector. The methods in [7] are also convolution-based networks for Natural Language Processing (NLP). These models, by design, cannot capture long distance dependencies, i.e., dependencies among words belonging to non-overlapping n-grams.

Long short-term memory networks were developed in [8] to address the difficulty of capturing long term memory in RNN. They have been successfully applied to speech recognition, achieving state-of-the-art performance [9], [10]. In text analysis, the LSTM-RNN treats a sentence as a sequence of words with internal structures, i.e., word dependencies. It encodes a semantic vector of a sentence incrementally, which differs from DSSM and CLSM. The encoding process is performed left-to-right, word-by-word. At each time step, a new word is encoded into the semantic vector, and the word dependencies embedded in the vector are "updated". When the process reaches the end of the sentence, the semantic vector has embedded all the words and their dependencies, and hence can be viewed as a feature vector representation of the whole sentence. In the machine translation work [1], an input English sentence is converted into a vector representation using an LSTM-RNN, and then another LSTM-RNN is used to generate an output French sentence. The model is trained to maximize the probability of predicting the correct output sentence. In [11], there are two main composition models: the ADD model, which is a bag of words, and the BI model, which is a summation over bi-gram pairs plus a non-linearity. In our proposed model, instead of a simple summation, we use an LSTM model with letter tri-grams, which keeps valuable information over long intervals (for long sentences) and throws away useless information. In [12], an encoder-decoder approach is proposed to jointly learn to align and translate sentences from English to French using RNNs. The concept of "attention" in the decoder, discussed in this paper, is closely related to how our proposed model extracts keywords on the document side. For further explanations please see Section V-A2.

Different from the aforementioned studies, the method developed in this paper trains the model so that sentences that are paraphrases of each other are close in their semantic embedding vectors; see the description in Sec. IV further ahead. Another reason that the LSTM-RNN is particularly effective for sentence embedding is its robustness to noise. For example, in the web document ranking task, the noise comes from two sources: (i) not every word in the query/document is equally important, and we only want to "remember" salient words using the limited "memory"; (ii) a word or phrase that is important to a document may not be relevant to a given query, and we only want to "remember" related words that are useful to compute the relevance of the document for a given query. We will illustrate the robustness of the LSTM-RNN in this paper. The structure of the LSTM-RNN will also circumvent the serious limitation of using a fixed window size in CLSM. Our experiments show that this difference leads to significantly better results in the web document retrieval task. Furthermore, it has other advantages: it allows us to capture keywords and key topics effectively. The models in this paper also do not need the extra max-pooling layer, as required by the CLSM, to capture global contextual information, and they do so more effectively.


III. SENTENCE EMBEDDING USING RNNS WITH AND WITHOUT LSTM CELLS

In this section, we introduce the model of recurrent neural networks and its long short-term memory version for learning the sentence embedding vectors. We start with the basic RNN and then proceed to LSTM-RNN.

A. The basic version of RNN

The RNN is a type of deep neural network that is "deep" in the temporal dimension and has been used extensively in time sequence modelling [13], [14], [15], [16], [17], [18], [19], [20], [21]. The main idea of using RNN for sentence embedding is to find a dense and low dimensional semantic representation by sequentially and recurrently processing each word in a sentence and mapping it into a low dimensional vector. In this model, the global contextual features of the whole text will be in the semantic representation of the last word in the text sequence; see Figure 1, where x(t) is the t-th word, coded as a 1-hot vector, W_h is a fixed hashing operator similar to the one used in [3] that converts the word vector to a letter tri-gram vector, W is the input weight matrix, W_rec is the recurrent weight matrix, y(t) is the hidden activation vector of the RNN, which can be used as a semantic representation of the t-th word, and the y(t) associated to the last word x(m) is the semantic representation vector of the entire sentence. Note that this is very different from the approach in [3], where the bag-of-words representation is used for the whole text and no context information is used. This is also different from [6], where a sliding window of a fixed size (akin to an FIR filter) is used to capture local features and a max-pooling layer on top is used to capture global features. In the RNN there is neither a fixed-sized window nor a max-pooling layer; rather, the recurrence is used to capture the context information in the sequence (akin to an IIR filter).

Without loss of generality, we assume the bias is zero. Then the mathematical formulation of the above RNN model for sentence embedding can be expressed as

l(t) = W_h x(t)
y(t) = f(W l(t) + W_rec y(t−1))    (1)

where W and W_rec are the input and recurrent matrices to be learned, W_h is a fixed word hashing operator, and f(·) is assumed to be tanh(·). Note that the architecture proposed here for sentence embedding is slightly different from a traditional RNN in that there is a word hashing layer that converts the high dimensional input into a relatively lower dimensional letter tri-gram representation. There is also no per-word supervision during training; instead, the whole sentence has a label. This is explained in more detail in Section IV.
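To make the recurrence in (1) concrete, the following minimal NumPy sketch (ours, not the authors' code) runs the basic RNN left-to-right over a sentence and returns the hidden activation of the last word as the sentence embedding. The dimensions, the random parameters, and the assumption that the fixed hashing operator W_h has already produced the letter tri-gram vectors l(t) are illustrative only.

```python
import numpy as np

def rnn_sentence_embedding(l_seq, W, W_rec):
    """Sketch of Eq. (1): y(t) = tanh(W l(t) + W_rec y(t-1)), bias assumed zero.
    l_seq holds l(1), ..., l(m), i.e. the letter tri-gram vectors W_h x(t)."""
    y = np.zeros(W_rec.shape[0])          # y(0) = 0
    for l_t in l_seq:                     # left-to-right over the sentence
        y = np.tanh(W @ l_t + W_rec @ y)  # recurrence over the context
    return y                              # y(m): embedding of the whole sentence

# illustrative sizes, not taken from the paper
tri_dim, hid_dim = 500, 32
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(hid_dim, tri_dim))
W_rec = rng.normal(scale=0.1, size=(hid_dim, hid_dim))
sentence = [rng.random(tri_dim) for _ in range(3)]   # e.g. a 3-word query
embedding = rnn_sentence_embedding(sentence, W, W_rec)
```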

Fig. 1. The basic architecture of the RNN for sentence embedding, where temporal recurrence is used to model the contextual information across words in the text string. The hidden activation vector corresponding to the last word is the sentence embedding vector (blue).

B. The RNN with LSTM cells

Although the RNN performs the transformation from the sentence to a vector in a principled manner, it is generally difficult to learn the long term dependency within the sequence due to the vanishing gradient problem. One of the effective solutions for this problem in RNNs is to use memory cells instead of neurons, originally proposed in [8] as Long Short-Term Memory (LSTM) and completed in [22] and [23] by adding the forget gate and peephole connections to the architecture.

We use the architecture of the LSTM illustrated in Fig. 2 for the proposed sentence embedding method. In this figure, i(t), f(t), o(t), c(t) are the input gate, forget gate, output gate and cell state vector respectively, W_p1, W_p2 and W_p3 are peephole connections, W_i, W_reci and b_i, i = 1, 2, 3, 4, are input connections, recurrent connections and bias values, respectively, g(·) and h(·) are tanh(·) functions and σ(·) is the sigmoid function. We use this architecture to find y for each word, and then use the y(m) corresponding to the last word in the sentence as the semantic vector for the entire sentence.

Considering Fig. 2, the forward pass for the LSTM-RNN model is as follows:

y_g(t) = g(W_4 l(t) + W_rec4 y(t−1) + b_4)
i(t) = σ(W_3 l(t) + W_rec3 y(t−1) + W_p3 c(t−1) + b_3)
f(t) = σ(W_2 l(t) + W_rec2 y(t−1) + W_p2 c(t−1) + b_2)
c(t) = f(t) ∘ c(t−1) + i(t) ∘ y_g(t)
o(t) = σ(W_1 l(t) + W_rec1 y(t−1) + W_p1 c(t) + b_1)
y(t) = o(t) ∘ h(c(t))    (2)

where ∘ denotes the Hadamard (element-wise) product.
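As an illustration of the forward pass in (2), the sketch below (our re-implementation under stated assumptions, not the authors' code) applies one LSTM-RNN step per word and returns y(m) as the sentence embedding; parameter names follow the paper, while sizes and values of the parameter dictionary are left to the caller.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_rnn_embedding(l_seq, p):
    """Sketch of Eq. (2). p is a dict with W1..W4, Wrec1..Wrec4, Wp1..Wp3, b1..b4;
    g(.) and h(.) are tanh, as in the text."""
    dim = p["Wrec1"].shape[0]
    y, c = np.zeros(dim), np.zeros(dim)
    for l in l_seq:
        y_g = np.tanh(p["W4"] @ l + p["Wrec4"] @ y + p["b4"])               # input without gating
        i = sigmoid(p["W3"] @ l + p["Wrec3"] @ y + p["Wp3"] @ c + p["b3"])  # input gate
        f = sigmoid(p["W2"] @ l + p["Wrec2"] @ y + p["Wp2"] @ c + p["b2"])  # forget gate
        c = f * c + i * y_g                                                  # cell state update
        o = sigmoid(p["W1"] @ l + p["Wrec1"] @ y + p["Wp1"] @ c + p["b1"])  # output gate (uses c(t))
        y = o * np.tanh(c)                                                   # hidden activation
    return y  # y(m): semantic vector of the sentence
```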

Fig. 2. The basic LSTM architecture used for sentence embedding.

IV. LEARNING METHOD

To learn a good semantic representation of the input sentence, our objective is to make the embedding vectors for sentences of similar meaning as close as possible, and meanwhile, to make sentences of different meanings as far apart as possible. This is challenging in practice since it is hard to collect a large amount of manually labelled data that give the semantic similarity signal between different sentences. Nevertheless, the widely used commercial web search engine is able to log massive amounts of data with some limited user feedback signals. For example, given a particular query, the click-through information about the user-clicked document among many candidates is usually recorded and can be used as a weak (binary) supervision signal to indicate the semantic similarity between two sentences (on the query side and the document side). In this section, we explain how to leverage such a weak supervision signal to learn a sentence embedding vector that achieves the aforementioned training objective.

We now describe how to train the model to achieve the above objective using the click-through data logged by a commercial search engine. For a complete description of the click-through data please refer to Section 2 in [24]. To begin with, we adopt the cosine similarity between the semantic vectors of two sentences as a measure for their similarity:

R(Q, D) = y_Q(T_Q)^T y_D(T_D) / (‖y_Q(T_Q)‖ · ‖y_D(T_D)‖)    (3)

where T_Q and T_D are the lengths of the sentence Q and sentence D, respectively. In the context of training over click-through data, we will use Q and D to denote "query" and "document", respectively.
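As a small sketch (ours), the cosine similarity of (3) between two sentence embedding vectors can be computed as:

```python
import numpy as np

def cosine_similarity(y_q, y_d):
    """R(Q, D) in Eq. (3): inner product of the two final hidden vectors
    divided by the product of their Euclidean norms."""
    return float(y_q @ y_d / (np.linalg.norm(y_q) * np.linalg.norm(y_d)))
```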

In Figure 3, we show the sentence embedding vectors corresponding to the query, y_Q(T_Q), and all the documents, {y_{D+}(T_{D+}), y_{D1−}(T_{D1−}), ..., y_{Dn−}(T_{Dn−})}, where the subscript D+ denotes the (clicked) positive sample among the documents, and the subscript Dj− denotes the j-th (un-clicked) negative sample. All these embedding vectors are generated by feeding the sentences into the RNN or LSTM-RNN model described in Sec. III and taking the y corresponding to the last word; see the blue box in Figure 1.

Fig. 3. The click-through signal (click: '1', un-clicked: '0') can be used as a weak indication of the similarity between the sentence on the query side and the sentence on the document side. The negative samples are taken from the training data.

We want to maximize the likelihood of the clicked document given the query, which can be formulated as the following optimization problem:

L(Λ) = min_Λ { −log ∏_{r=1}^{N} P(D_r^+ | Q_r) } = min_Λ ∑_{r=1}^{N} l_r(Λ)    (4)

where Λ denotes the collection of the model parameters; in the regular RNN case, it includes W_rec and W in Figure 1, and in the LSTM-RNN case, it includes W_1, W_2, W_3, W_4, W_rec1, W_rec2, W_rec3, W_rec4, W_p1, W_p2, W_p3, b_1, b_2, b_3 and b_4 in Figure 2. D_r^+ is the clicked document for the r-th query, P(D_r^+ | Q_r) is the probability of the clicked document given the r-th query, N is the number of query / clicked-document pairs in the corpus, and

l_r(Λ) = −log ( e^{γ R(Q_r, D_r^+)} / (e^{γ R(Q_r, D_r^+)} + ∑_{j=1}^{n} e^{γ R(Q_r, D_{r,j}^-)}) )
       = log ( 1 + ∑_{j=1}^{n} e^{−γ·Δ_{r,j}} )    (5)

where Δ_{r,j} = R(Q_r, D_r^+) − R(Q_r, D_{r,j}^-), R(·,·) was defined earlier in (3), D_{r,j}^- is the j-th negative candidate document for the r-th query and n denotes the number of negative samples used during training.

The expression in (5) is a logistic loss over Δ_{r,j}. It upper-bounds the pairwise accuracy, i.e., the 0–1 loss. Since the similarity measure is the cosine function, Δ_{r,j} ∈ [−2, 2]. To have a larger range for Δ_{r,j}, we use γ for scaling; it helps to penalize the prediction error more. Its value is set empirically by experiments on a held-out dataset.
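For illustration, here is a minimal sketch of the per-query loss in (5), assuming the cosine similarities R(Q_r, D_r^+) and R(Q_r, D_{r,j}^-) have already been computed; the γ value and the scores in the example are placeholders, not values from the paper.

```python
import numpy as np

def l_r(R_pos, R_negs, gamma):
    """Eq. (5): l_r = log(1 + sum_j exp(-gamma * Delta_{r,j})),
    with Delta_{r,j} = R(Q_r, D_r^+) - R(Q_r, D_{r,j}^-)."""
    deltas = R_pos - np.asarray(R_negs, dtype=float)
    return float(np.log1p(np.sum(np.exp(-gamma * deltas))))

# The corpus-level objective L in Eq. (4) is the sum of l_r over all N pairs,
# minimized with respect to the model parameters.
loss = l_r(R_pos=0.92, R_negs=[0.35, 0.41, 0.28, 0.50], gamma=10.0)
```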

To train the RNN and LSTM-RNN, we use Back Propagation Through Time (BPTT). The update equations for parameter Λ at epoch k are as follows:

ΔΛ_k = Λ_k − Λ_{k−1}
ΔΛ_k = μ_{k−1} ΔΛ_{k−1} − ε_{k−1} ∇L(Λ_{k−1} + μ_{k−1} ΔΛ_{k−1})    (6)

where ∇L(·) is the gradient of the cost function in (4), ε is the learning rate and μ_k is a momentum parameter determined by the scheduling scheme used for training. The above equations are equivalent to Nesterov's method in [25]. To see why, please refer to Appendix A.1 of [26], where Nesterov's method is derived as a momentum method. The gradient of the cost function, ∇L(Λ), is:

∇L(Λ) = −∑_{r=1}^{N} ∑_{j=1}^{n} ∑_{τ=0}^{T} α_{r,j} ∂Δ_{r,j,τ}/∂Λ    (one large update)    (7)

where T is the number of time steps that we unfold the network over time and

α_{r,j} = −γ e^{−γ Δ_{r,j}} / (1 + ∑_{j=1}^{n} e^{−γ Δ_{r,j}})    (8)

The term ∂Δ_{r,j,τ}/∂Λ in (7) and the error signals for the different parameters of RNN and LSTM-RNN that are necessary for training are presented in Appendix A. Full derivation of the gradients in both models is presented in Appendix D.

To accelerate training by parallelization, we use mini-batch training and one large update instead of incremental updates during back propagation through time. To resolve the gradient explosion problem we use the gradient re-normalization method described in [27], [16]. To accelerate the convergence, we use Nesterov's method [25] and found it effective in training both RNN and LSTM-RNN for sentence embedding.

We have used a simple yet effective scheduling of μ_k for both the RNN and LSTM-RNN models: in the first and last 2% of all parameter updates μ_k = 0.9, and for the other 96% of all parameter updates μ_k = 0.995. We have used a fixed step size for training RNN and a fixed step size for training LSTM-RNN.
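The Nesterov update of (6) together with the μ_k schedule just described can be sketched as follows (our illustration; `grad_L` stands for any routine that evaluates ∇L of (4) and is a hypothetical helper, not part of the paper):

```python
import numpy as np

def momentum(update_idx, total_updates):
    """mu_k = 0.9 in the first and last 2% of parameter updates, 0.995 otherwise."""
    frac = update_idx / float(total_updates)
    return 0.9 if frac < 0.02 or frac > 0.98 else 0.995

def nesterov_step(Lambda, dLambda, grad_L, eps, mu):
    """One step of Eq. (6): evaluate the gradient at the look-ahead point
    Lambda_{k-1} + mu * dLambda_{k-1}, then update the velocity and the parameters."""
    grad = grad_L(Lambda + mu * dLambda)      # gradient of (4) at the look-ahead point
    dLambda = mu * dLambda - eps * grad       # DeltaLambda_k
    return Lambda + dLambda, dLambda          # Lambda_k, DeltaLambda_k
```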

A summary of the training method for LSTM-RNN is presented in Algorithm 1.

V. ANALYSIS OF THE SENTENCE EMBEDDING PROCESS AND PERFORMANCE EVALUATION

To understand how the LSTM-RNN performs sentence embedding, we use visualization tools to analyze the semantic vectors generated by our model. We would like to answer the following questions: (i) How are word dependencies and context information captured? (ii) How does the LSTM-RNN attenuate unimportant information and detect critical information from the input sentence? Or, how are the keywords embedded into the semantic vector? (iii) How are the global topics identified by the LSTM-RNN?

Algorithm 1 Training LSTM-RNN for Sentence Embedding

Inputs: Fixed step size ε, scheduling for μ, gradient clip threshold th_G, maximum number of epochs nEpoch, total number of query / clicked-document pairs N, total number of un-clicked (negative) documents for a given query n, maximum sequence length for truncated BPTT T.
Outputs: Two trained models, one on the query side Λ_Q, one on the document side Λ_D.
Initialization: Set all parameters in Λ_Q and Λ_D to small random numbers, i = 0, k = 1.

procedure LSTM-RNN(Λ_Q, Λ_D)
  while i ≤ nEpoch do
    for "first minibatch" → "last minibatch" do
      r ← 1
      while r ≤ N do
        for j = 1 → n do
          Compute α_{r,j}    ▷ use (8)
          Compute ∑_{τ=0}^{T} α_{r,j} ∂Δ_{r,j,τ}/∂Λ_{k,Q}    ▷ use (14) to (44) in Appendix A
          Compute ∑_{τ=0}^{T} α_{r,j} ∂Δ_{r,j,τ}/∂Λ_{k,D}    ▷ use (14) to (44) in Appendix A
          Sum the above terms for Q and D over j
        end for
        Sum the above terms for Q and D over r
        r ← r + 1
      end while
      Compute ∇L(Λ_{k,Q})    ▷ use (7)
      Compute ∇L(Λ_{k,D})    ▷ use (7)
      if ‖∇L(Λ_{k,Q})‖ > th_G then
        ∇L(Λ_{k,Q}) ← th_G · ∇L(Λ_{k,Q}) / ‖∇L(Λ_{k,Q})‖
      end if
      if ‖∇L(Λ_{k,D})‖ > th_G then
        ∇L(Λ_{k,D}) ← th_G · ∇L(Λ_{k,D}) / ‖∇L(Λ_{k,D})‖
      end if
      Compute ΔΛ_{k,Q}    ▷ use (6)
      Compute ΔΛ_{k,D}    ▷ use (6)
      Update: Λ_{k,Q} ← ΔΛ_{k,Q} + Λ_{k−1,Q}
      Update: Λ_{k,D} ← ΔΛ_{k,D} + Λ_{k−1,D}
      k ← k + 1
    end for
    i ← i + 1
  end while
end procedure
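A compact sketch of the outer loop of Algorithm 1 is given below; `batch_gradients` is a hypothetical helper that accumulates the BPTT gradients of (7) for the query-side and document-side parameters over one minibatch, and all hyper-parameter values are placeholders.

```python
import numpy as np

def clip(grad, th_G):
    """Gradient re-normalization of [27], [16]: rescale when the norm exceeds th_G."""
    n = np.linalg.norm(grad)
    return grad * (th_G / n) if n > th_G else grad

def train(L_Q, L_D, minibatches, batch_gradients, eps, mu, th_G, n_epoch):
    """Sketch of Algorithm 1: per minibatch, one large gradient step with clipping
    and the Nesterov update of Eq. (6), applied to both models."""
    dL_Q, dL_D = np.zeros_like(L_Q), np.zeros_like(L_D)
    for epoch in range(n_epoch):
        for batch in minibatches:                       # minibatches assumed to be a list
            g_Q, g_D = batch_gradients(L_Q + mu * dL_Q, L_D + mu * dL_D, batch)
            g_Q, g_D = clip(g_Q, th_G), clip(g_D, th_G)
            dL_Q, dL_D = mu * dL_Q - eps * g_Q, mu * dL_D - eps * g_D   # DeltaLambda_k
            L_Q, L_D = L_Q + dL_Q, L_D + dL_D                            # Lambda_k
    return L_Q, L_D
```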

To answer these questions, we train the RNN with and without LSTM cells on the click-through dataset logged by a commercial web search engine. The training method has been described in Sec. IV. The description of the corpus is as follows. The training set includes 200,000 positive query/document pairs where only the clicked signal is used as a weak supervision for training the LSTM. The relevance judgement set (test set) is constructed as follows. First, the queries are sampled from a year of search engine logs. Adult, spam, and bot queries are all removed. Queries are de-duped so that only unique queries remain. To reflect a natural query distribution, we do not try to control the quality of these queries. For example, in our query sets, there are around 20% misspelled queries, around 20% navigational queries and 10% transactional queries, etc. Second, for each query, we collect Web documents to be judged by issuing the query to several popular search engines (e.g., Google, Bing) and fetching the top-10 retrieval results from each. Finally, the query-document pairs are judged by a group of well-trained assessors. In this study all the queries are preprocessed as follows. The text is white-space tokenized and lower-cased, numbers are retained, and no stemming/inflection treatment is performed.

We now proceed to perform a comprehensive analysis by visualizing the trained RNN and LSTM-RNN models. In particular, we will visualize the on-and-off behaviors of the input gates, output gates, cell states, and the semantic vectors in the LSTM-RNN model, which reveals how the model extracts useful information from the input sentence and embeds it properly into the semantic vector according to the topic information.

Although we have given the full learning formula for all the model parameters in the previous section, we will remove the peephole connections and the forget gate from the LSTM-RNN model in the current task. This is because the length of each sequence, i.e., the number of words in a query or a document, is known in advance, and we set the state of each cell to zero at the beginning of a new sequence. Therefore, forget gates are not a great help here. Also, as long as the order of words is kept, the precise timing in the sequence is not of great concern. Therefore, peephole connections are not that important as well. Removing the peephole connections and the forget gate will also reduce the amount of training time, since a smaller number of parameters need to be learned.

A. Analysis

In this section we would like to examine how the information in the input sentence is sequentially extracted and embedded into the semantic vector over time by the LSTM-RNN model.

1) Attenuating Unimportant Information: First, we examine the evolution of the semantic vector and how unimportant words are attenuated. Specifically, we feed the following input sentences from the test dataset into the trained LSTM-RNN model:

• Query: "hotels in shanghai"
• Document: "shanghai hotels accommodation hotel in shanghai discount and reservation"

Activations of the input gate, output gate, cell state and the embedding vector for each cell, for the query and the document, are shown in Fig. 4 and Fig. 5, respectively. The vertical axis is the cell index from 1 to 32, the horizontal axis is the word index from 1 to 10 numbered from left to right in the sequence of words, and the color codes show activation values. From Figs. 4–5, we make the following observations:

Fig. 4. Query: "hotels in shanghai". Panels: (a) i(t), (b) c(t), (c) o(t), (d) y(t). Since the sentence ends at the third word, all the values to the right of it are zero (green color).

Fig. 5. Document: "shanghai hotels accommodation hotel in shanghai discount and reservation". Panels: (a) i(t), (b) c(t), (c) o(t), (d) y(t). Since the sentence ends at the ninth word, all the values to the right of it are zero (green color).

• The semantic representation y(t) and the cell states c(t) are evolving over time. Valuable context information is gradually absorbed into c(t) and y(t), so that the information in these two vectors becomes richer over time, and the semantic information of the entire input sentence is embedded into the vector y(t), which is obtained by applying the output gates to the cell states c(t).

Fig. 6. Activation values, y(t), of the 10 most active cells for Query: "hotels in shanghai" and Document: "shanghai hotels accommodation hotel in shanghai discount and reservation". Panels: (a) y(t) top 10 for the query, (b) y(t) top 10 for the document.

• The input gates evolve in such a way that they attenuate the unimportant information and detect the important information from the input sentence. For example, in Fig. 5(a), most of the input gate values corresponding to word 3, word 7 and word 9 have very small values (light green-yellow color)^1, which correspond to the words "accommodation", "discount" and "reservation", respectively, in the document sentence. Interestingly, the input gates reduce the effect of these three words in the final semantic representation, y(t), such that the semantic similarity between sentences from the query and document sides is not affected by these words.

2) Keywords Extraction: In this section, we show how the trained LSTM-RNN extracts the important information, i.e., keywords, from the input sentences. To this end, we backtrack the semantic representations, y(t), over time. We focus on the 10 most active cells in the final semantic representation. Whenever there is a large enough change in a cell activation value (y(t)), we assume an important keyword has been detected by the model. We illustrate the result using the above example ("hotels in shanghai"). The evolution of the 10 most active cells' activations, y(t), over time is shown in Fig. 6 for the query and the document sentences.^2 From Fig. 6, we also observe that different words activate different cells. In Tables I–II, we show the number of cells each word activates.^3 From the tables, we observe that the keywords activate more cells than the unimportant words, meaning that they are selectively embedded into the semantic vector.
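The backtracking heuristic just described can be sketched as follows (our approximation of the procedure; the change threshold is an assumed value, not one reported in the paper):

```python
import numpy as np

def keyword_cell_counts(y_over_time, top_cells=10, threshold=0.05):
    """For each word t, count how many of the `top_cells` most active cells of the
    final semantic vector show a large enough change in y(t); large counts indicate
    keywords, as in Tables I-II. y_over_time is an (m, n_cells) array of y(1..m)."""
    y_over_time = np.asarray(y_over_time, dtype=float)
    active = np.argsort(-np.abs(y_over_time[-1]))[:top_cells]   # 10 most active cells
    prev = np.zeros(top_cells)
    counts = []
    for y_t in y_over_time:
        counts.append(int(np.sum(np.abs(y_t[active] - prev) > threshold)))
        prev = y_t[active]
    return counts   # per-word number of assigned cells
```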

^1 If this is not clearly visible, please refer to Fig. 8 in Section B of Appendix E. We have adjusted the color bar for all figures to have the same range; for this reason the structure might not be clearly visible. More visualization examples can also be found in Appendix E of the Supplementary Material.
^2 Likewise, the vertical axis is the cell index and the horizontal axis is the word index in the sentence.
^3 Note that before presenting the first word of the sentence, the activation values are initially zero, so that there is always a considerable change in the cell states after presenting the first word. For this reason, we have not indicated the number of cells detecting the first word as a keyword. Moreover, another keyword extraction example can be found in Appendix E.

TABLE I
KEYWORDS FOR QUERY: "hotels in shanghai"

|                                    | hotels | in | shanghai |
| Number of assigned cells out of 10 | -      | 0  | 7        |

3) Topic Allocation: Now, we further show that the trained LSTM-RNN model not only detects the keywords, but also allocates them properly to different cells according to the topics they belong to. To do this, we go through the test dataset using the trained LSTM-RNN model and search for the keywords that are detected by a specific cell. For simplicity, we use the following simple approach: for each given query we look into the keywords that are extracted by the 5 most active cells of the LSTM-RNN and list them in Table III. Interestingly, each cell collects keywords of a specific topic. For example, cell 26 in Table III extracts keywords related to the topic "food" and cells 2 and 6 mainly focus on the keywords related to the topic "health".

B. Performance Evaluation

1) Web Document Retrieval Task: In this section, we apply the proposed sentence embedding method to an important web document retrieval task for a commercial web search engine. Specifically, the RNN models (with and without LSTM cells) embed the sentences from the query and the document sides into their corresponding semantic vectors, and then compute the cosine similarity between these vectors to measure the semantic similarity between the query and candidate documents.

Experimental results for this task are shown in Table IV using the standard metric mean Normalized Discounted Cumulative Gain (NDCG) [28] (the higher the better) for evaluating the ranking performance of the RNN and LSTM-RNN on a standalone human-rated test dataset. We also trained several strong baselines, such as DSSM [3] and CLSM [6], on the same training dataset and evaluated their performance on the same task. For fair comparison, our proposed RNN and LSTM-RNN models are trained with the same number of parameters as the DSSM and CLSM models (14.4M parameters). Besides, we also include in Table IV two well-known information retrieval (IR) models, BM25 and PLSA, for the sake of benchmarking. The BM25 model uses the bag-of-words representation for queries and documents, and is a state-of-the-art document ranking model based on term matching, widely used as a baseline in the IR community. PLSA (Probabilistic Latent Semantic Analysis) is a topic model proposed in [29], which is trained using the Maximum A Posteriori estimation [30] on the document side of the same training dataset. We experimented with a varying number of topics from 100 to 500 for PLSA, which gives similar performance, and we report in Table IV the results of using 500 topics.
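For reference, the NDCG metric of [28] used in Table IV can be computed as in the following sketch (a standard formulation; the graded relevance values in the example are placeholders):

```python
import numpy as np

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the model's ranking divided by the DCG of the ideal ranking.
    `relevances` are the human relevance grades in the order the model ranks the documents."""
    rel = np.asarray(relevances, dtype=float)[:k]
    dcg = np.sum((2.0 ** rel - 1.0) / np.log2(np.arange(2, rel.size + 2)))
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = np.sum((2.0 ** ideal - 1.0) / np.log2(np.arange(2, ideal.size + 2)))
    return float(dcg / idcg) if idcg > 0 else 0.0

# example: NDCG@1 and NDCG@3 for one query with graded judgments
scores = [ndcg_at_k([3, 1, 0, 2, 0], k) for k in (1, 3)]
```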

TABLE II
KEYWORDS FOR DOCUMENT: "shanghai hotels accommodation hotel in shanghai discount and reservation"

|                                    | shanghai | hotels | accommodation | hotel | in | shanghai | discount | and | reservation |
| Number of assigned cells out of 10 | -        | 4      | 3             | 8     | 1  | 8        | 5        | 3   | 4           |

TABLE III
KEYWORDS ASSIGNED TO EACH CELL OF LSTM-RNN FOR DIFFERENT QUERIES OF TWO TOPICS, "FOOD" AND "HEALTH"

Results for a language model based method, the unigram language model (ULM) with Dirichlet smoothing, are also presented in the table. As shown in Table IV, the LSTM-RNN significantly outperforms all these models, and exceeds the best baseline model (CLSM) by 1.3% in NDCG@1 score, which is a statistically significant improvement. As we pointed out in Sec. V-A, such an improvement comes from the LSTM-RNN's ability to embed the contextual and semantic information of the sentences into a finite dimension vector.

A comparison between the value of the cost function during training for LSTM-RNN and RNN on the click-through data is shown in Fig. 7. From this figure, we conclude that LSTM-RNN is optimizing the cost function in (4) more effectively. Please note that all parameters of both models are initialized randomly.

VI. CONCLUSIONS AND FUTURE WORK

This paper addresses deep sentence embedding. We propose a model based on long short-term memory to model the long range context information and embed the key information of a sentence in one semantic vector. We show that the semantic vector evolves over time and only takes useful information from any new input. This has been made possible by input gates that detect useless information and attenuate it. Due to the general limitation of available human labelled data, we proposed and implemented training of the model with a weak supervision signal using user click-through data of a commercial web search engine.

By performing a detailed analysis on the model, we showed that: 1) the proposed model is robust to noise, i.e., it mainly embeds keywords in the final semantic vector representing the whole sentence, and 2) in the proposed model, each cell is usually allocated to keywords from a specific topic. These findings have been supported using extensive examples. As a concrete sample application of the proposed sentence embedding method, we evaluated it on the important language processing task of web document retrieval. We showed that, for this task, the proposed method significantly outperforms all existing state-of-the-art methods.

This work has been motivated by the earlier successes of deep learning methods in speech [31], [32], [33], [34], [35] and in semantic modelling [3], [6], [36], [37], and it adds further evidence for the effectiveness of these methods. Our future work will further extend the methods to include 1) developing a bi-directional version of the proposed model, 2) using the proposed sentence embedding method for other important language processing tasks for which we believe sentence embedding plays a key role, e.g., the question/answering task, and 3) exploiting the prior information about the structure of the different matrices in Fig. 2 to develop a more effective cost function and learning method.

APPENDIX A
EXPRESSIONS FOR THE GRADIENTS

In this appendix we present the final gradient expressions that are necessary for training the proposed models. Full derivations of these gradients are presented in Appendix D of the supplementary materials.

A. RNN

For the recurrent parameters, Λ = W_rec (we have omitted the r subscript for simplicity):

∂Δ_{j,τ}/∂W_rec = [δ_{y_Q}^{D+}(t−τ) y_Q^T(t−τ−1) + δ_{y_D}^{D+}(t−τ) y_{D+}^T(t−τ−1)] − [δ_{y_Q}^{D−_j}(t−τ) y_Q^T(t−τ−1) + δ_{y_D}^{D−_j}(t−τ) y_{D−_j}^T(t−τ−1)]    (9)

where D−_j means the j-th candidate document that is not clicked and

δ_{y_Q}(t−τ−1) = (1 − y_Q(t−τ−1)) ∘ (1 + y_Q(t−τ−1)) ∘ W_rec^T δ_{y_Q}(t−τ)    (10)

and the same as (10) for δ_{y_D}(t−τ−1) with the D subscript for the document side model. Please also note that:

δ_{y_Q}(T_Q) = (1 − y_Q(T_Q)) ∘ (1 + y_Q(T_Q)) ∘ (b.c.y_D(T_D) − a.b^3.c.y_Q(T_Q)),
δ_{y_D}(T_D) = (1 − y_D(T_D)) ∘ (1 + y_D(T_D)) ∘ (b.c.y_Q(T_Q) − a.b.c^3.y_D(T_D))    (11)

where

a = y_Q(t = T_Q)^T y_D(t = T_D)
b = 1 / ‖y_Q(t = T_Q)‖,  c = 1 / ‖y_D(t = T_D)‖    (12)

For the input parameters, Λ = W:

∂Δ_{j,τ}/∂W = [δ_{y_Q}^{D+}(t−τ) l_Q^T(t−τ) + δ_{y_D}^{D+}(t−τ) l_{D+}^T(t−τ)] − [δ_{y_Q}^{D−_j}(t−τ) l_Q^T(t−τ) + δ_{y_D}^{D−_j}(t−τ) l_{D−_j}^T(t−τ)]    (13)

A full derivation of BPTT for the RNN is presented in Appendix D-A.

B. LSTM-RNN

Starting with the cost function in (4), we use the Nesterov method described in (6) to update the LSTM-RNN model parameters. Here, Λ is one of the weight matrices or bias vectors {W_1, W_2, W_3, W_4, W_rec1, W_rec2, W_rec3, W_rec4, W_p1, W_p2, W_p3, b_1, b_2, b_3, b_4} in the LSTM-RNN architecture. The general format of the gradient of the cost function, ∇L(Λ), is the same as (7). By definition of Δ_{r,j}, we have:

∂Δ_{r,j}/∂Λ = ∂R(Q_r, D_r^+)/∂Λ − ∂R(Q_r, D_{r,j}^-)/∂Λ    (14)

We omit the r and j subscripts for simplicity and present ∂R(Q,D)/∂Λ for different parameters of each cell of the LSTM-RNN in the following subsections. This will complete the process of calculating ∇L(Λ) in (7), and then we can use (6) to update the LSTM-RNN model parameters. In the subsequent subsections the vectors v_Q and v_D are defined as:

v_Q = (b.c.y_D(t = T_D) − a.b^3.c.y_Q(t = T_Q))
v_D = (b.c.y_Q(t = T_Q) − a.b.c^3.y_D(t = T_D))    (15)

where a, b and c are defined in (12). The full derivation of truncated BPTT for the LSTM-RNN model is presented in Appendix D-B.

1) Output Gate: For recurrent connections we have:

∂R(Q,D)/∂W_rec1 = δ_{y_Q}^{rec1}(t) . y_Q(t−1)^T + δ_{y_D}^{rec1}(t) . y_D(t−1)^T    (16)

where

δ_{y_Q}^{rec1}(t) = o_Q(t) ∘ (1 − o_Q(t)) ∘ h(c_Q(t)) ∘ v_Q(t)    (17)

and the same as (17) for δ_{y_D}^{rec1}(t) with the subscript D for the document side model. For input connections, W_1, and peephole connections, W_p1, we will have:

∂R(Q,D)/∂W_1 = δ_{y_Q}^{rec1}(t) . l_Q(t)^T + δ_{y_D}^{rec1}(t) . l_D(t)^T    (18)

∂R(Q,D)/∂W_p1 = δ_{y_Q}^{rec1}(t) . c_Q(t)^T + δ_{y_D}^{rec1}(t) . c_D(t)^T    (19)

The derivative for the output gate bias values will be:

∂R(Q,D)/∂b_1 = δ_{y_Q}^{rec1}(t) + δ_{y_D}^{rec1}(t)    (20)

2) Input Gate: For the recurrent connections we have:

∂R(Q,D)/∂W_rec3 = diag(δ_{y_Q}^{rec3}(t)) . ∂c_Q(t)/∂W_rec3 + diag(δ_{y_D}^{rec3}(t)) . ∂c_D(t)/∂W_rec3    (21)

where

δ_{y_Q}^{rec3}(t) = (1 − h(c_Q(t))) ∘ (1 + h(c_Q(t))) ∘ o_Q(t) ∘ v_Q(t)
∂c_Q(t)/∂W_rec3 = diag(f_Q(t)) . ∂c_Q(t−1)/∂W_rec3 + b_{i,Q}(t) . y_Q(t−1)^T
b_{i,Q}(t) = y_{g,Q}(t) ∘ i_Q(t) ∘ (1 − i_Q(t))    (22)

In equation (21), δ_{y_D}^{rec3}(t) and ∂c_D(t)/∂W_rec3 are the same as (22) with the D subscript. For the input connections we will have the following:

∂R(Q,D)/∂W_3 = diag(δ_{y_Q}^{rec3}(t)) . ∂c_Q(t)/∂W_3 + diag(δ_{y_D}^{rec3}(t)) . ∂c_D(t)/∂W_3    (23)

where

∂c_Q(t)/∂W_3 = diag(f_Q(t)) . ∂c_Q(t−1)/∂W_3 + b_{i,Q}(t) . x_Q(t)^T    (24)

For the peephole connections we will have:

∂R(Q,D)/∂W_p3 = diag(δ_{y_Q}^{rec3}(t)) . ∂c_Q(t)/∂W_p3 + diag(δ_{y_D}^{rec3}(t)) . ∂c_D(t)/∂W_p3    (25)

where

∂c_Q(t)/∂W_p3 = diag(f_Q(t)) . ∂c_Q(t−1)/∂W_p3 + b_{i,Q}(t) . c_Q(t−1)^T    (26)

For the bias values, b_3, we will have:

∂R(Q,D)/∂b_3 = diag(δ_{y_Q}^{rec3}(t)) . ∂c_Q(t)/∂b_3 + diag(δ_{y_D}^{rec3}(t)) . ∂c_D(t)/∂b_3    (27)

where

∂c_Q(t)/∂b_3 = diag(f_Q(t)) . ∂c_Q(t−1)/∂b_3 + b_{i,Q}(t)    (28)

3) Forget Gate: For the recurrent connections we will have:

∂R(Q,D)/∂W_rec2 = diag(δ_{y_Q}^{rec2}(t)) . ∂c_Q(t)/∂W_rec2 + diag(δ_{y_D}^{rec2}(t)) . ∂c_D(t)/∂W_rec2    (29)

where

δ_{y_Q}^{rec2}(t) = (1 − h(c_Q(t))) ∘ (1 + h(c_Q(t))) ∘ o_Q(t) ∘ v_Q(t)
∂c_Q(t)/∂W_rec2 = diag(f_Q(t)) . ∂c_Q(t−1)/∂W_rec2 + b_{f,Q}(t) . y_Q(t−1)^T
b_{f,Q}(t) = c_Q(t−1) ∘ f_Q(t) ∘ (1 − f_Q(t))    (30)

For the input connections to the forget gate we will have:

∂R(Q,D)/∂W_2 = diag(δ_{y_Q}^{rec2}(t)) . ∂c_Q(t)/∂W_2 + diag(δ_{y_D}^{rec2}(t)) . ∂c_D(t)/∂W_2    (31)

where

∂c_Q(t)/∂W_2 = diag(f_Q(t)) . ∂c_Q(t−1)/∂W_2 + b_{f,Q}(t) . x_Q(t)^T    (32)

For peephole connections we have:

∂R(Q,D)/∂W_p2 = diag(δ_{y_Q}^{rec2}(t)) . ∂c_Q(t)/∂W_p2 + diag(δ_{y_D}^{rec2}(t)) . ∂c_D(t)/∂W_p2    (33)

where

∂c_Q(t)/∂W_p2 = diag(f_Q(t)) . ∂c_Q(t−1)/∂W_p2 + b_{f,Q}(t) . c_Q(t−1)^T    (34)

For the forget gate's bias values we will have:

∂R(Q,D)/∂b_2 = diag(δ_{y_Q}^{rec2}(t)) . ∂c_Q(t)/∂b_2 + diag(δ_{y_D}^{rec2}(t)) . ∂c_D(t)/∂b_2    (35)

where

∂c_Q(t)/∂b_2 = diag(f_Q(t)) . ∂c_Q(t−1)/∂b_2 + b_{f,Q}(t)    (36)

4) Input without Gating (y_g(t)): For the recurrent connections we will have:

∂R(Q,D)/∂W_rec4 = diag(δ_{y_Q}^{rec4}(t)) . ∂c_Q(t)/∂W_rec4 + diag(δ_{y_D}^{rec4}(t)) . ∂c_D(t)/∂W_rec4    (37)

where

δ_{y_Q}^{rec4}(t) = (1 − h(c_Q(t))) ∘ (1 + h(c_Q(t))) ∘ o_Q(t) ∘ v_Q(t)
∂c_Q(t)/∂W_rec4 = diag(f_Q(t)) . ∂c_Q(t−1)/∂W_rec4 + b_{g,Q}(t) . y_Q(t−1)^T
b_{g,Q}(t) = i_Q(t) ∘ (1 − y_{g,Q}(t)) ∘ (1 + y_{g,Q}(t))    (38)

For the input connections we have:

∂R(Q,D)/∂W_4 = diag(δ_{y_Q}^{rec4}(t)) . ∂c_Q(t)/∂W_4 + diag(δ_{y_D}^{rec4}(t)) . ∂c_D(t)/∂W_4    (39)

where

∂c_Q(t)/∂W_4 = diag(f_Q(t)) . ∂c_Q(t−1)/∂W_4 + b_{g,Q}(t) . x_Q(t)^T    (40)

For the bias values we will have:

∂R(Q,D)/∂b_4 = diag(δ_{y_Q}^{rec4}(t)) . ∂c_Q(t)/∂b_4 + diag(δ_{y_D}^{rec4}(t)) . ∂c_D(t)/∂b_4    (41)

where

∂c_Q(t)/∂b_4 = diag(f_Q(t)) . ∂c_Q(t−1)/∂b_4 + b_{g,Q}(t)    (42)

5) Error signal backpropagation: Error signals are back propagated through time using the following equations:

δ_Q^{rec1}(t−1) = [o_Q(t−1) ∘ (1 − o_Q(t−1)) ∘ h(c_Q(t−1))] ∘ W_rec1^T . δ_Q^{rec1}(t)    (43)

δ_Q^{rec_i}(t−1) = [(1 − h(c_Q(t−1))) ∘ (1 + h(c_Q(t−1))) ∘ o_Q(t−1)] ∘ W_rec_i^T . δ_Q^{rec_i}(t),  for i ∈ {2, 3, 4}    (44)

REFERENCES

[1] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[2] Q. V. Le and T. Mikolov, "Distributed representations of sentences and documents," Proceedings of the 31st International Conference on Machine Learning, pp. 1188–1196, 2014.
[3] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, "Learning deep structured semantic models for web search using clickthrough data," in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, ser. CIKM '13. ACM, 2013, pp. 2333–2338.
[4] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proceedings of Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[5] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[6] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil, "A latent semantic model with convolutional-pooling structure for information retrieval," CIKM, November 2014.
[7] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in International Conference on Machine Learning, ICML, 2008.
[8] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[9] A. Graves, A. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. ICASSP, Vancouver, Canada, May 2013.
[10] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.
[11] K. M. Hermann and P. Blunsom, "Multilingual models for compositional distributed semantics," arXiv preprint arXiv:1404.4641, 2014.
[12] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," ICLR 2015, 2015. [Online]. Available: http://arxiv.org/abs/1409.0473
[13] J. L. Elman, "Finding structure in time," Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
[14] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, August 1994.
[15] L. Deng, K. Hassanein, and M. Elmasry, "Analysis of the correlation structure for a neural predictive model with application to speech recognition," Neural Networks, vol. 7, no. 2, pp. 331–339, 1994.
[16] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur, "Recurrent neural network based language model," in Proc. INTERSPEECH, Makuhari, Japan, September 2010, pp. 1045–1048.
[17] A. Graves, "Sequence transduction with recurrent neural networks," in Representation Learning Workshop, ICML, 2012.
[18] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu, "Advances in optimizing recurrent networks," in Proc. ICASSP, Vancouver, Canada, May 2013.
[19] J. Chen and L. Deng, "A primal-dual method for training recurrent neural networks constrained by the echo-state property," in Proceedings of the International Conf. on Learning Representations (ICLR), 2014.
[20] G. Mesnil, X. He, L. Deng, and Y. Bengio, "Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding," in Proc. INTERSPEECH, Lyon, France, August 2013.
[21] L. Deng and J. Chen, "Sequence classification using high-level features extracted from deep neural networks," in Proc. ICASSP, 2014.
[22] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Computation, vol. 12, pp. 2451–2471, 1999.
[23] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, "Learning precise timing with LSTM recurrent networks," J. Mach. Learn. Res., vol. 3, pp. 115–143, Mar. 2003.
[24] J. Gao, W. Yuan, X. Li, K. Deng, and J.-Y. Nie, "Smoothing clickthrough data for web search ranking," in Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR '09. New York, NY, USA: ACM, 2009, pp. 355–362.
[25] Y. Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)," Soviet Mathematics Doklady, vol. 27, pp. 372–376, 1983.
[26] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton, "On the importance of initialization and momentum in deep learning," in ICML (3) '13, 2013, pp. 1139–1147.
[27] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in ICML 2013, ser. JMLR Proceedings, JMLR.org, 2013, pp. 1310–1318.
[28] K. Jarvelin and J. Kekalainen, "IR evaluation methods for retrieving highly relevant documents," in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR. ACM, 2000, pp. 41–48.
[29] T. Hofmann, "Probabilistic latent semantic analysis," in Proc. of Uncertainty in Artificial Intelligence, UAI '99, 1999, pp. 289–296.
[30] J. Gao, K. Toutanova, and W.-t. Yih, "Clickthrough-based latent semantic models for web search," ser. SIGIR '11. ACM, 2011, pp. 675–684.
[31] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Large vocabulary continuous speech recognition with context-dependent DBN-HMMs," in Proc. IEEE ICASSP, Prague, Czech Republic, May 2011, pp. 4688–4691.
[32] D. Yu and L. Deng, "Deep learning and its applications to signal and information processing [exploratory DSP]," IEEE Signal Processing Magazine, vol. 28, no. 1, pp. 145–154, Jan. 2011.
[33] G. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.
[34] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, November 2012.
[35] L. Deng, D. Yu, and J. Platt, "Scalable stacking and learning for building deep architectures," in Proc. ICASSP, March 2012, pp. 2133–2136.
[36] J. Gao, P. Pantel, M. Gamon, X. He, L. Deng, and Y. Shen, "Modeling interestingness with deep neural networks," in Proc. EMNLP, 2014.
[37] J. Gao, X. He, W.-t. Yih, and L. Deng, "Learning continuous phrase representations for translation modeling," in Proc. ACL, 2014.

Fig. 8. Input gate, i(t), for Document: "shanghai hotels accommodation hotel in shanghai discount and reservation".

SUPPLEMENTARY MATERIAL

APPENDIX B
A MORE CLEAR FIGURE FOR THE INPUT GATE FOR THE "hotels in shanghai" EXAMPLE

In this section we present a clearer figure for part (a) of Fig. 5 that shows the structure of the input gate for the document side of the "hotels in shanghai" example. As is clearly visible from this figure, the input gate values for most of the cells corresponding to word 3, word 7 and word 9 on the document side of the LSTM-RNN have very small values (light green-yellow color). These correspond to the words "accommodation", "discount" and "reservation", respectively, in the document title. Interestingly, the input gates are trying to reduce the effect of these three words in the final representation (y(t)) because the LSTM-RNN model is trained to maximize the similarity between the query and the document if they are a good match.

APPENDIX C
A CLOSER LOOK AT RNNS WITH AND WITHOUT LSTM CELLS IN THE WEB DOCUMENT RETRIEVAL TASK

In this section we further show examples to reveal the advantage of the LSTM-RNN sentence embedding compared to the RNN sentence embedding.

First, we compare the scores assigned by the trained RNN and LSTM-RNN to our "hotels in shanghai" example. On average, each query in our test dataset is associated with 15 web documents (URLs). Each query/document pair has a relevance label which is human generated. These relevance labels are "Bad", "Fair", "Good" and "Excellent". This example is rated as a "Good" match in the dataset. The score for this pair assigned by the RNN is 0.8165, while the score assigned by the LSTM-RNN is 0.9161. Please note that the score is between 0 and 1. This means that the score assigned by the LSTM-RNN corresponds better to the human generated label.

TABLE V
RNNS WITH & WITHOUT LSTM CELLS FOR THE SAME QUERY: "hotels in shanghai"

|                                               | hotels | in | shanghai |
| Number of assigned cells out of 10 (LSTM-RNN) | -      | 0  | 7        |
| Number of assigned neurons out of 10 (RNN)    | -      | 2  | 9        |

Second, we compare the number of assigned neurons and cells to each word by the RNN and LSTM-RNN respectively. To do this, we rely on the 10 most active cells and neurons in the final semantic vectors of both models. Results are presented in Table V and Table VI for the query and the document respectively. An interesting observation is that the RNN sometimes assigns neurons to unimportant words, e.g., 6 neurons are assigned to the word "in" in Table VI.

As another example we consider the query "how to fix bath tub wont turn off". This example is rated as a "Bad" match in the dataset by a human. It is good to know that the score for this pair assigned by the RNN is 0.7016, while the score assigned by the LSTM-RNN is 0.5944. This shows that the score generated by the LSTM-RNN is closer to the human generated label.

The number of assigned neurons and cells to each word by the RNN and LSTM-RNN are presented in Table VII and Table VIII for the query and the document. This is out of the 10 most active neurons and cells in the semantic vectors of the RNN and LSTM-RNN. Examples of the RNN assigning neurons to unimportant words are 3 neurons to the word "a" and 4 neurons to the word "you" in Table VIII.

APPENDIX D
DERIVATION OF BPTT FOR RNN AND LSTM-RNN

In this appendix we present the full derivation of the gradients for the RNN and LSTM-RNN.

A. Derivation of BPTT for RNN

From (4) and (5) we have:

∂L(Λ)/∂Λ = ∑_{r=1}^{N} ∂l_r(Λ)/∂Λ = −∑_{r=1}^{N} ∑_{j=1}^{n} α_{r,j} ∂Δ_{r,j}/∂Λ    (45)

where

α_{r,j} = −γ e^{−γ Δ_{r,j}} / (1 + ∑_{j=1}^{n} e^{−γ Δ_{r,j}})    (46)

TABLE VI
RNNS WITH & WITHOUT LSTM CELLS FOR THE SAME DOCUMENT: "shanghai hotels accommodation hotel in shanghai discount and reservation"

|                                               | shanghai | hotels | accommodation | hotel | in | shanghai | discount | and | reservation |
| Number of assigned cells out of 10 (LSTM-RNN) | -        | 4      | 3             | 8     | 1  | 8        | 5        | 3   | 4           |
| Number of assigned neurons out of 10 (RNN)    | -        | 10     | 7             | 9     | 6  | 8        | 3        | 2   | 6           |

TABLE VII
RNN VERSUS LSTM-RNN FOR QUERY: "how to fix bath tub wont turn off"

|                                               | how | to | fix | bath | tub | wont | turn | off |
| Number of assigned cells out of 10 (LSTM-RNN) | -   | 0  | 4   | 7    | 6   | 3    | 5    | 0   |
| Number of assigned neurons out of 10 (RNN)    | -   | 1  | 10  | 4    | 6   | 2    | 7    | 1   |

TABLE VIII
RNN VERSUS LSTM-RNN FOR DOCUMENT: "how do you paint a bathtub and what paint should ..."

|                                               | how | do | you | paint | a | bathtub | and | what | paint | should |
| Number of assigned cells out of 10 (LSTM-RNN) | -   | 1  | 1   | 7     | 0 | 9       | 2   | 3    | 8     | 4      |
| Number of assigned neurons out of 10 (RNN)    | -   | 1  | 4   | 4     | 3 | 7       | 2   | 5    | 4     | 7      |

and

Δ_{r,j} = R(Q_r, D_r^+) − R(Q_r, D_{r,j}^-)    (47)

We need to find ∂Δ_{r,j}/∂Λ for the input weights and recurrent weights. We omit the r subscript for simplicity.

1) Recurrent Weights:

∂Δ_j/∂W_rec = ∂R(Q, D^+)/∂W_rec − ∂R(Q, D_j^-)/∂W_rec    (48)

We divide R(Q, D) into three components:

R(Q, D) = [y_Q(t = T_Q)^T y_D(t = T_D)] · [1/‖y_Q(t = T_Q)‖] · [1/‖y_D(t = T_D)‖] = a · b · c    (49)

then

∂R(Q,D)/∂W_rec = (∂a/∂W_rec).b.c + a.(∂b/∂W_rec).c + a.b.(∂c/∂W_rec) ≡ D + E + F    (50)

We have

D = ∂(y_Q(t = T_Q)^T y_D(t = T_D).b.c)/∂W_rec
  = [∂(y_Q(t = T_Q)^T y_D(t = T_D).b.c)/∂y_Q(t = T_Q)] . ∂y_Q(t = T_Q)/∂W_rec + [∂(y_Q(t = T_Q)^T y_D(t = T_D).b.c)/∂y_D(t = T_D)] . ∂y_D(t = T_D)/∂W_rec
  = y_D(t = T_D).b.c . ∂y_Q(t = T_Q)/∂W_rec + y_Q(t = T_Q).b.c . ∂y_D(t = T_D)/∂W_rec    (51)

Since f(·) = tanh(·), using the chain rule we have

∂y_Q(t = T_Q)/∂W_rec = [(1 − y_Q(t = T_Q)) ∘ (1 + y_Q(t = T_Q))] y_Q(t−1)^T    (52)

and therefore

D = [b.c.y_D(t = T_D) ∘ (1 − y_Q(t = T_Q)) ∘ (1 + y_Q(t = T_Q))] y_Q(t−1)^T + [b.c.y_Q(t = T_Q) ∘ (1 − y_D(t = T_D)) ∘ (1 + y_D(t = T_D))] y_D(t−1)^T    (53)

To find E we use the following basic rule:

∂‖x − a‖_2/∂x = (x − a)/‖x − a‖_2    (54)

Therefore

E = a.c . ∂(‖y_Q(t = T_Q)‖)^{−1}/∂W_rec
  = −a.c.(‖y_Q(t = T_Q)‖)^{−2} . ∂‖y_Q(t = T_Q)‖/∂W_rec
  = −a.c.(‖y_Q(t = T_Q)‖)^{−2} . [y_Q(t = T_Q)/‖y_Q(t = T_Q)‖] . ∂y_Q(t = T_Q)/∂W_rec
  = −[a.c.b^3.y_Q(t = T_Q) ∘ (1 − y_Q(t = T_Q)) ∘ (1 + y_Q(t = T_Q))] y_Q(t−1)^T    (55)

F is calculated similarly to (55):

F = −[a.b.c^3.y_D(t = T_D) ∘ (1 − y_D(t = T_D)) ∘ (1 + y_D(t = T_D))] y_D(t−1)^T    (56)

Considering (50), (53), (55) and (56) we have:

∂R(Q,D)/∂W_rec = δ_{y_Q}(t) y_Q(t−1)^T + δ_{y_D}(t) y_D(t−1)^T    (57)

where

δ_{y_Q}(t = T_Q) = (1 − y_Q(t = T_Q)) ∘ (1 + y_Q(t = T_Q)) ∘ (b.c.y_D(t = T_D) − a.b^3.c.y_Q(t = T_Q)),
δ_{y_D}(t = T_D) = (1 − y_D(t = T_D)) ∘ (1 + y_D(t = T_D)) ∘ (b.c.y_Q(t = T_Q) − a.b.c^3.y_D(t = T_D))    (58)

Equation (58) will just unfold the network one time step; to unfold it over the rest of the time steps using backpropagation we have:

δ_{y_Q}(t−τ−1) = (1 − y_Q(t−τ−1)) ∘ (1 + y_Q(t−τ−1)) ∘ W_rec^T δ_{y_Q}(t−τ),
δ_{y_D}(t−τ−1) = (1 − y_D(t−τ−1)) ∘ (1 + y_D(t−τ−1)) ∘ W_rec^T δ_{y_D}(t−τ)    (59)

where τ is the number of time steps that we unfold the network over time, which is from 0 to T_Q and T_D for queries and documents respectively. Now using (48) we have:

∂Δ_{j,τ}/∂W_rec = [δ_{y_Q}^{D+}(t−τ) y_Q^T(t−τ−1) + δ_{y_D}^{D+}(t−τ) y_{D+}^T(t−τ−1)] − [δ_{y_Q}^{D−_j}(t−τ) y_Q^T(t−τ−1) + δ_{y_D}^{D−_j}(t−τ) y_{D−_j}^T(t−τ−1)]    (60)

To calculate the final value of the gradient we should fold back the network over time and use (45); we will have:

∂L(Λ)/∂W_rec = −∑_{r=1}^{N} ∑_{j=1}^{n} ∑_{τ=0}^{T} α_{r,j} ∂Δ_{r,j,τ}/∂W_rec    (one large update)    (61)

2) Input Weights: Using a similar procedure we will have the following for the input weights:

∂R(Q,D)/∂W = δ_{y_Q}(t−τ) l_Q(t−τ)^T + δ_{y_D}(t−τ) l_D(t−τ)^T    (62)

where

δ_{y_Q}(t−τ) = (1 − y_Q(t−τ)) ∘ (1 + y_Q(t−τ)) ∘ (b.c.y_D(t−τ) − a.b^3.c.y_Q(t−τ)),
δ_{y_D}(t−τ) = (1 − y_D(t−τ)) ∘ (1 + y_D(t−τ)) ∘ (b.c.y_Q(t−τ) − a.b.c^3.y_D(t−τ))    (63)

Therefore:

∂Δ_{j,τ}/∂W = [δ_{y_Q}^{D+}(t−τ) l_Q^T(t−τ) + δ_{y_D}^{D+}(t−τ) l_{D+}^T(t−τ)] − [δ_{y_Q}^{D−_j}(t−τ) l_Q^T(t−τ) + δ_{y_D}^{D−_j}(t−τ) l_{D−_j}^T(t−τ)]    (64)

and therefore:

∂L(Λ)/∂W = −∑_{r=1}^{N} ∑_{j=1}^{n} ∑_{τ=0}^{T} α_{r,j} ∂Δ_{r,j,τ}/∂W    (one large update)    (65)

B. Derivation of BPTT for LSTM-RNN

Following from (50), for every parameter Λ in the LSTM-RNN architecture we have:

∂R(Q,D)/∂Λ = (∂a/∂Λ).b.c + a.(∂b/∂Λ).c + a.b.(∂c/∂Λ) ≡ D + E + F    (66)

and from (51):

D = y_D(t = T_D).b.c . ∂y_Q(t = T_Q)/∂Λ + y_Q(t = T_Q).b.c . ∂y_D(t = T_D)/∂Λ    (67)

From (55) and (56) we have:

E = −a.c.b^3.y_Q(t = T_Q) . ∂y_Q(t = T_Q)/∂Λ    (68)

F = −a.b.c^3.y_D(t = T_D) . ∂y_D(t = T_D)/∂Λ    (69)

Therefore

∂R(Q,D)/∂Λ = D + E + F = v_Q . ∂y_Q(t = T_Q)/∂Λ + v_D . ∂y_D(t = T_D)/∂Λ    (70)

where

v_Q = (b.c.y_D(t = T_D) − a.b^3.c.y_Q(t = T_Q))
v_D = (b.c.y_Q(t = T_Q) − a.b.c^3.y_D(t = T_D))    (71)

1) Output Gate: Since α ∘ β = diag(α)β = diag(β)α, where diag(α) is a diagonal matrix whose main diagonal entries are the entries of the vector α, we have:

∂y(t)/∂W_rec1 = ∂(diag(h(c(t))).o(t))/∂W_rec1
             = [∂diag(h(c(t)))/∂W_rec1].o(t) + diag(h(c(t))).∂o(t)/∂W_rec1    (the first term is zero)
             = o(t) ∘ (1 − o(t)) ∘ h(c(t)) . y(t−1)^T    (72)

Substituting (72) in (70) we have:

∂R(Q,D)/∂W_rec1 = δ_{y_Q}^{rec1}(t) . y_Q(t−1)^T + δ_{y_D}^{rec1}(t) . y_D(t−1)^T    (73)

where

δ_{y_Q}^{rec1}(t) = o_Q(t) ∘ (1 − o_Q(t)) ∘ h(c_Q(t)) ∘ v_Q(t)
δ_{y_D}^{rec1}(t) = o_D(t) ∘ (1 − o_D(t)) ∘ h(c_D(t)) ∘ v_D(t)    (74)

With a similar derivation for W_1 and W_p1 we get:

∂R(Q,D)/∂W_1 = δ_{y_Q}^{rec1}(t) . l_Q(t)^T + δ_{y_D}^{rec1}(t) . l_D(t)^T    (75)

∂R(Q,D)/∂W_p1 = δ_{y_Q}^{rec1}(t) . c_Q(t)^T + δ_{y_D}^{rec1}(t) . c_D(t)^T    (76)

For the output gate bias values we have:

∂R(Q,D)/∂b_1 = δ_{y_Q}^{rec1}(t) + δ_{y_D}^{rec1}(t)    (77)

2)Input Gate:Similar to output gate we start with:

?y(t)?W rec3=

?

?W rec3

(diag(o(t)).h(c(t)))

=?diag(o(t))

?W rec3

zero

.h(c(t))+diag(o(t)).

?h(c(t))

?W rec3

=diag(o(t)).(1?h(c(t)))?(1+h(c(t)))?c(t)

rec3

(78)

To?nd?c(t)

?W rec3assuming f(t)=1(we derive formula-

tion for f(t)=1from this simple solution)we have:

c(0)=0

c(1)=c(0)+i(1)?y g(1)=i(1)?y g(1)

c(2)=c(1)+i(2)?y g(2)

...

c(t)=

t

k=1

i(k)?y g(k)=

t

k=1

diag(y g(k)).i(k)(79)

Therefore

?c(t)

?W rec3

=

t

k=1

[

?diag(y g(k))

W rec3

zero

.i(k)+diag(y g(k)).

?i(k)

W rec3

]

=

t

k=1

diag(y g(k)).i(k)?(1?i(k)).y(k?1)T(80)

(81)

and

?y(t)

?W rec3

=

t

k=1

[o(t)?(1?h(c(t)))?(1+h(c(t)))

a(t)

?y g(k)?i(k)?(1?i(k))

b(k)

]y(k?1)T(82)

This is expensive to implement directly; to resolve it we write:

\frac{\partial y(t)}{\partial W_{rec3}} = \underbrace{\sum_{k=1}^{t-1}\big[a(t) \circ b(k)\big]\,y(k-1)^{T}}_{\text{expensive part}} + \big[a(t) \circ b(t)\big]\,y(t-1)^{T} = \mathrm{diag}(a(t))\,\underbrace{\sum_{k=1}^{t-1} b(k)\,y(k-1)^{T}}_{\partial c(t-1)/\partial W_{rec3}} + \mathrm{diag}(a(t))\,b(t)\,y(t-1)^{T}   (83)

Therefore

\frac{\partial y(t)}{\partial W_{rec3}} = \big[\mathrm{diag}(a(t))\big]\Big[\frac{\partial c(t-1)}{\partial W_{rec3}} + b(t)\,y(t-1)^{T}\Big]   (84)

For f(t)=1we have

?y(t)

?W rec3

=[diag(a(t))][diag(f(t)).

?c(t?1)

?W rec3

+b i(t).y(t?1)T](85)

where

a(t)=o(t)?(1?h(c(t)))?(1+h(c(t)))

b i(t)=y g(t)?i(t)?(1?i(t))(86)

Substituting the above equation in (70), we have:

\frac{\partial R(Q,D)}{\partial W_{rec3}} = \mathrm{diag}(\delta^{rec3}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial W_{rec3}} + \mathrm{diag}(\delta^{rec3}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial W_{rec3}}   (87)

where

\delta^{rec3}_{y_Q}(t) = (1 - h(c_Q(t))) \circ (1 + h(c_Q(t))) \circ o_Q(t) \circ v_Q(t)
\frac{\partial c_Q(t)}{\partial W_{rec3}} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial W_{rec3}} + b_{i,Q}(t)\,y_Q(t-1)^{T}
b_{i,Q}(t) = y_{g,Q}(t) \circ i_Q(t) \circ (1 - i_Q(t))   (88)


In equation (87), \delta^{rec3}_{y_D}(t) and \frac{\partial c_D(t)}{\partial W_{rec3}} are the same as in (88) with the D subscript. Therefore, the update equations for W_{rec3} are (87) and (88) for Q and D, together with (6).
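The practical value of (85) to (88) is that \frac{\partial c(t)}{\partial W_{rec3}} can be carried forward as a running matrix instead of re-evaluating the sum over k at every step. Below is a minimal numpy sketch for one side of the pair, assuming tanh cell nonlinearities and gate activations stored as one numpy array per time step; the names and indexing convention are ours.

    import numpy as np

    def wrec3_grad_one_side(i_seq, f_seq, yg_seq, y_seq, o_T, c_T, v):
        # i_seq, f_seq, yg_seq: input gate, forget gate and y_g activations for t = 1..T
        # y_seq: outputs y(0), ..., y(T); o_T, c_T: o(T), c(T); v: v_Q or v_D from eq. (71)
        cell_dim, hid_dim = len(c_T), len(y_seq[0])
        P = np.zeros((cell_dim, hid_dim))              # P(t) plays the role of dc(t)/dW_rec3
        for t in range(1, len(y_seq)):
            b_i = yg_seq[t - 1] * i_seq[t - 1] * (1 - i_seq[t - 1])        # eq. (88)
            P = f_seq[t - 1][:, None] * P + np.outer(b_i, y_seq[t - 1])    # diag(f) P + b_i y(t-1)^T
        h_c = np.tanh(c_T)
        delta = (1 - h_c) * (1 + h_c) * o_T * v        # delta^{rec3}, eq. (88)
        return delta[:, None] * P                      # diag(delta) P, one term of eq. (87)

Treating \frac{\partial c(t)}{\partial W_{rec3}} as a matrix of the same shape as W_{rec3} works because each row of W_{rec3} only affects the corresponding component of c(t), which is what the \mathrm{diag}(\cdot) notation in (87), (88) compactly encodes.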

With a similar procedure for W_3 we have:

\frac{\partial R(Q,D)}{\partial W_{3}} = \mathrm{diag}(\delta^{rec3}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial W_{3}} + \mathrm{diag}(\delta^{rec3}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial W_{3}}   (89)

where

\frac{\partial c_Q(t)}{\partial W_{3}} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial W_{3}} + b_{i,Q}(t)\,x_Q(t)^{T}   (90)

Therefore, the update equations for W_3 are (89) and (90) for Q and D, together with (6).

For the peephole connections we have:

\frac{\partial R(Q,D)}{\partial W_{p3}} = \mathrm{diag}(\delta^{rec3}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial W_{p3}} + \mathrm{diag}(\delta^{rec3}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial W_{p3}}   (91)

where

\frac{\partial c_Q(t)}{\partial W_{p3}} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial W_{p3}} + b_{i,Q}(t)\,c_Q(t-1)^{T}   (92)

Hence, the update equations for W_{p3} are (91) and (92) for Q and D, together with (6).

Following a similar derivation for the bias values b_3, we have:

\frac{\partial R(Q,D)}{\partial b_{3}} = \mathrm{diag}(\delta^{rec3}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial b_{3}} + \mathrm{diag}(\delta^{rec3}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial b_{3}}   (93)

where

\frac{\partial c_Q(t)}{\partial b_{3}} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial b_{3}} + b_{i,Q}(t)   (94)

The update equations for b_3 are (93) and (94) for Q and D, together with (6).

3) Forget Gate: For the forget gate, with a derivation similar to that of the input gate, we have

\frac{\partial y(t)}{\partial W_{rec2}} = \big[\mathrm{diag}(a(t))\big]\Big[\mathrm{diag}(f(t))\,\frac{\partial c(t-1)}{\partial W_{rec2}} + b_f(t)\,y(t-1)^{T}\Big]   (95)

where

a(t) = o(t) \circ (1 - h(c(t))) \circ (1 + h(c(t)))
b_f(t) = c(t-1) \circ f(t) \circ (1 - f(t))   (96)

Substituting the above equation in (70), we have:

\frac{\partial R(Q,D)}{\partial W_{rec2}} = \mathrm{diag}(\delta^{rec2}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial W_{rec2}} + \mathrm{diag}(\delta^{rec2}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial W_{rec2}}   (97)

where

\delta^{rec2}_{y_Q}(t) = (1 - h(c_Q(t))) \circ (1 + h(c_Q(t))) \circ o_Q(t) \circ v_Q(t)
\frac{\partial c_Q(t)}{\partial W_{rec2}} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial W_{rec2}} + b_{f,Q}(t)\,y_Q(t-1)^{T}
b_{f,Q}(t) = c_Q(t-1) \circ f_Q(t) \circ (1 - f_Q(t))   (98)

Therefore, the update equations for W_{rec2} are (97) and (98) for Q and D, together with (6).

For the input weights to the forget gate, W_2, we have

\frac{\partial R(Q,D)}{\partial W_{2}} = \mathrm{diag}(\delta^{rec2}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial W_{2}} + \mathrm{diag}(\delta^{rec2}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial W_{2}}   (99)

where

\frac{\partial c_Q(t)}{\partial W_{2}} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial W_{2}} + b_{f,Q}(t)\,x_Q(t)^{T}   (100)

Therefore, the update equations for W_2 are (99) and (100) for Q and D, together with (6).

For the peephole connections, W_{p2}, we have

\frac{\partial R(Q,D)}{\partial W_{p2}} = \mathrm{diag}(\delta^{rec2}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial W_{p2}} + \mathrm{diag}(\delta^{rec2}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial W_{p2}}   (101)

where

\frac{\partial c_Q(t)}{\partial W_{p2}} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial W_{p2}} + b_{f,Q}(t)\,c_Q(t-1)^{T}   (102)

Therefore, the update equations for W_{p2} are (101) and (102) for Q and D, together with (6).

The update equations for the forget gate bias values, b_2, are the following equations together with (6):

\frac{\partial R(Q,D)}{\partial b_{2}} = \mathrm{diag}(\delta^{rec2}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial b_{2}} + \mathrm{diag}(\delta^{rec2}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial b_{2}}   (103)

where

\frac{\partial c_Q(t)}{\partial b_{2}} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial b_{2}} + b_{f,Q}(t)   (104)

4) Input without Gating (y_g(t)): The gradients for the y_g(t) parameters are as follows:

\frac{\partial y(t)}{\partial W_{rec4}} = \big[\mathrm{diag}(a(t))\big]\Big[\mathrm{diag}(f(t))\,\frac{\partial c(t-1)}{\partial W_{rec4}} + b_g(t)\,y(t-1)^{T}\Big]   (105)

where

a(t) = o(t) \circ (1 - h(c(t))) \circ (1 + h(c(t)))
b_g(t) = i(t) \circ (1 - y_g(t)) \circ (1 + y_g(t))   (106)

Substituting the above equation in (70), we have:

\frac{\partial R(Q,D)}{\partial W_{rec4}} = \mathrm{diag}(\delta^{rec4}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial W_{rec4}} + \mathrm{diag}(\delta^{rec4}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial W_{rec4}}   (107)

where

\delta^{rec4}_{y_Q}(t) = (1 - h(c_Q(t))) \circ (1 + h(c_Q(t))) \circ o_Q(t) \circ v_Q(t)
\frac{\partial c_Q(t)}{\partial W_{rec4}} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial W_{rec4}} + b_{g,Q}(t)\,y_Q(t-1)^{T}
b_{g,Q}(t) = i_Q(t) \circ (1 - y_{g,Q}(t)) \circ (1 + y_{g,Q}(t))   (108)

Therefore, the update equations for W_{rec4} are (107) and (108) for Q and D, together with (6).

For the input weight parameters, W_4, we have

\frac{\partial R(Q,D)}{\partial W_{4}} = \mathrm{diag}(\delta^{rec4}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial W_{4}} + \mathrm{diag}(\delta^{rec4}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial W_{4}}   (109)

where

\frac{\partial c_Q(t)}{\partial W_{4}} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial W_{4}} + b_{g,Q}(t)\,x_Q(t)^{T}   (110)

Therefore, the update equations for W_4 are (109) and (110) for Q and D, together with (6).

The gradients with respect to the bias values, b_4, are

\frac{\partial R(Q,D)}{\partial b_{4}} = \mathrm{diag}(\delta^{rec4}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial b_{4}} + \mathrm{diag}(\delta^{rec4}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial b_{4}}   (111)

where

\frac{\partial c_Q(t)}{\partial b_{4}} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial b_{4}} + b_{g,Q}(t)   (112)

Therefore, the update equations for b_4 are (111) and (112) for Q and D, together with (6). There are no peephole connections for y_g(t).
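The four parameter groups above share the same recursion; only the b(t) term and the vector that multiplies it from the right change (y(t-1) for the recurrent weights, the input vector for the input weights, c(t-1) for the peepholes, and 1 for the biases). The small helper below, whose name is ours, collects the three b(t) definitions of (86), (96) and (106) as a sketch of how an implementation can reuse one recursion for all gates.

    import numpy as np

    def gate_b_terms(i_t, f_t, yg_t, c_prev):
        # i_t, f_t, yg_t: input gate, forget gate and y_g activations at time t
        # c_prev: cell state c(t-1); all arguments are numpy arrays
        b_i = yg_t * i_t * (1 - i_t)          # input gate, eq. (86)
        b_f = c_prev * f_t * (1 - f_t)        # forget gate, eq. (96)
        b_g = i_t * (1 - yg_t) * (1 + yg_t)   # input without gating, eq. (106)
        return b_i, b_f, b_g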

APPENDIX E
LSTM-RNN VISUALIZATION

In this appendix we present more examples of LSTM-RNN visualization.

Fig. 11. Query: "how to fix bath tub wont turn off" (evolution of the 10 most active cells over time).

A. LSTM-RNN Semantic Vectors: Another Example

Consider the following example from the test dataset:
• Query: "how to fix bath tub wont turn off"
• Document: "how do you paint a bathtub and what paint should you use yahoo answers" (where "yahoo answers" is treated as one word)

The activations of the input gate, output gate, cell state and cell output for each cell, for the query and the document, are presented in Fig. 9 and Fig. 10 respectively, based on a trained LSTM-RNN model.

Three interesting observations from Fig. 9 and Fig. 10:
• The semantic representation y(t) and the cell states c(t) are evolving over time.
• In part (a) of Fig. 10, we observe that the input gate values for most of the cells corresponding to word 3, word 4, word 7 and word 9 on the document side of the LSTM-RNN are very small (light blue color). These correspond to the words "you", "paint", "and" and "paint" respectively in the document title. Interestingly, the input gates are trying to reduce the effect of these words on the final representation (y(t)), because the LSTM-RNN model is trained to maximize the similarity between the query and the document if they are a good match.
• y(t) is used as the semantic representation after applying the output gate to the cell states. Note that valuable context information is stored in the cell states c(t).

B. Key Word Extraction: Another Example

The evolution of the 10 most active cells over time for the second example is presented in Fig. 11 for the query and Fig. 12 for the document. The number of cells (out of the 10 most active cells) assigned to each word is presented in Table IX and Table X.
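For readers who want to reproduce counts of this kind, the sketch below shows one plausible way to assign the 10 most active cells to words. The selection rule used here (overall absolute activity of y(t) over time, with each selected cell assigned to the word at which its activation peaks) is our assumption for illustration only and is not necessarily the exact criterion used to build Tables IX and X.

    import numpy as np

    def keyword_cell_counts(y_seq, n_cells=10):
        # y_seq: array of shape (T, cell_dim) holding y(t) for each word position t
        activity = np.abs(y_seq).sum(axis=0)             # overall activity of each cell
        top = np.argsort(activity)[-n_cells:]            # the n_cells most active cells
        peak_word = np.abs(y_seq[:, top]).argmax(axis=0) # word at which each cell peaks
        return np.bincount(peak_word, minlength=y_seq.shape[0])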

Fig. 9. Query: "how to fix bath tub wont turn off". Panels: (a) i(t), (b) c(t), (c) o(t), (d) y(t).

TABLE IX
KEYWORD EXTRACTION FOR QUERY: "how to fix bath tub wont turn off"

Word:                                 how   to   fix   bath   tub   wont   turn   off
Number of assigned cells out of 10:    -    0     4      7     6      3      5

TABLE X
KEYWORD EXTRACTION FOR DOCUMENT: "how do you paint a bathtub and what paint should ..."

Word:                                 how   do   you   paint   a   bathtub   and   what   paint   should   you   ...
Number of assigned cells out of 10:    -    1     1      7     9      2       3     8       4

Fig. 10. Document: "how do you paint a bathtub and what paint should ...". Panels: (a) i(t), (b) c(t), (c) o(t), (d) y(t).

Fig. 12. Document: "how do you paint a bathtub and what paint should ..." (evolution of the 10 most active cells over time).
