• NMT: Neural Machine Translation
  • SMT: Statistical Machine Translation
  • BLEU: Bilingual Evaluation Understudy
  • WMT-14 dataset: the WMT'14 English-to-French machine translation dataset used in the paper's experiments
  • beam search: an improvement over the greedy strategy. At each time step, instead of keeping only the single highest-scoring output, the num_beams best partial hypotheses are kept; when num_beams=1, beam search degenerates to greedy search.
  • <EOS>: end-of-sentence symbol
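The beam-search procedure described above can be sketched as follows. This is a toy version over fixed per-step log-probability tables (a real decoder would condition each step's distribution on the prefix chosen so far); the function name and inputs are illustrative assumptions, not from the paper.

```python
import math
import heapq

def beam_search(step_log_probs, num_beams=3):
    """Toy beam search.

    step_log_probs: a list, one dict per time step, mapping token -> log-probability.
    At every step, keep the num_beams highest-scoring partial sequences;
    with num_beams=1 this reduces to greedy search.
    """
    beams = [(0.0, [])]  # (cumulative log-probability, token sequence)
    for dist in step_log_probs:
        candidates = []
        for score, seq in beams:
            for tok, lp in dist.items():
                candidates.append((score + lp, seq + [tok]))
        # prune: keep only the num_beams best hypotheses
        beams = heapq.nlargest(num_beams, candidates, key=lambda c: c[0])
    return beams
```

For example, with two steps and `num_beams=2`, the top hypothesis is the pair of tokens whose log-probabilities sum highest, while `num_beams=1` tracks only the greedy path.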

Sequence to Sequence Learning with Neural Networks

Sequential problems: sequence lengths are not known a priori.

The approach uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of fixed dimensionality, then another deep LSTM to decode the target sequence from that vector.


$$BLEU = BP \times \exp\left(\sum_{n=1}^{N}w_n\log{P_n}\right)$$

where $BP$ is the brevity penalty, introduced to keep the score from being biased toward overly short translations.
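A minimal sentence-level implementation of the formula above, with uniform weights $w_n = 1/N$ and a single reference (the function name and simplifications are assumptions; real evaluation is corpus-level and handles multiple references):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: BP * exp(sum_n w_n * log P_n), uniform w_n = 1/max_n.

    candidate, reference: lists of tokens. A sketch for illustration only.
    """
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        # clipped n-gram matches: each candidate n-gram counts at most as
        # often as it appears in the reference
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        if clipped == 0:
            return 0.0  # a zero n-gram precision drives the geometric mean to 0
        log_p += (1.0 / max_n) * math.log(clipped / total)
    # brevity penalty: penalize candidates shorter than the reference
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_p)
```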

Recurrent Neural Network

Given a sequence of inputs $(x_1, \dots, x_T)$, a standard RNN computes a sequence of outputs $(y_1, \dots , y_T)$ by iterating the following equations:

$$h_t = sigm(W^{hx}x_t+W^{hh}h_{t-1})$$

$$y_t = W^{yh}h_t$$
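The two equations above can be unrolled directly. A NumPy sketch (weight shapes and the zero initial hidden state are assumptions for illustration):

```python
import numpy as np

def rnn_forward(xs, W_hx, W_hh, W_yh):
    """Iterate h_t = sigm(W_hx x_t + W_hh h_{t-1}), y_t = W_yh h_t.

    xs: array of shape (T, input_dim); the initial hidden state is zero.
    Returns the stacked outputs y_1..y_T.
    """
    sigm = lambda z: 1.0 / (1.0 + np.exp(-z))  # logistic sigmoid
    h = np.zeros(W_hh.shape[0])
    ys = []
    for x in xs:
        h = sigm(W_hx @ x + W_hh @ h)
        ys.append(W_yh @ h)
    return np.array(ys)
```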

However, it is not clear how to apply an RNN to problems whose input and output sequences have different lengths and complicated, non-monotonic relationships.

The LSTM is used because it is known to learn problems with long-range temporal dependencies; here it estimates the conditional probability $p(y_1, \dots, y_{T'}|x_1, \dots, x_T)$, where the length $T'$ may differ from $T$.

  • Two different LSTMs are used: one to process the input sequence and one to produce the output sequence;
  • deeper LSTMs outperform shallow ones, so the paper uses four layers;
  • reversing the order of the input sequence before feeding it in improves results.
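The input-reversal trick in the last bullet is trivial to implement; the point is that for source $a\,b\,c$ and target $x\,y\,z$, feeding $c\,b\,a$ places $a$ close to $x$, shortening many source-target dependencies (the function name is an assumption):

```python
def reverse_input(src_tokens):
    """Reverse the source sentence before encoding, e.g. ["a", "b", "c"]
    becomes ["c", "b", "a"], so early source words sit nearer to the
    decoder's first steps."""
    return src_tokens[::-1]
```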

Decoding and Rescoring

The model is trained by maximizing the average log probability of a correct translation $T$ given the source sentence $S$ over the training set $\mathcal{S}$:

$$ \frac{1}{|\mathcal{S}|} \sum_{(T,S)\in \mathcal{S}} \log p(T|S)$$
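Read concretely, the objective is just the mean of $\log p(T|S)$ over training pairs. A sketch, where `log_p` is a hypothetical scoring function standing in for the trained seq2seq model (an assumption, not the paper's code):

```python
import math

def avg_log_likelihood(pairs, log_p):
    """Training objective: (1/|S|) * sum over (T, S) pairs of log p(T|S).

    pairs: iterable of (target, source) sentence pairs.
    log_p: hypothetical function returning the model's log p(target|source).
    """
    pairs = list(pairs)
    return sum(log_p(t, s) for t, s in pairs) / len(pairs)
```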