• NMT: Neural Machine Translation
• SMT: Statistical Machine Translation
• BLEU: Bilingual Evaluation Understudy
• WMT-14 dataset: the WMT'14 English-to-French machine translation task, used for training and evaluation in the paper
• beam search: an improvement over the greedy decoding strategy. At each time step, instead of keeping only the single highest-scoring output, the num_beams highest-scoring candidates are kept; when num_beams = 1, beam search reduces to greedy search.
• <EOS>: end-of-sentence symbol

## Sequence to Sequence Learning with Neural Networks

Sequential problems: sequence lengths are not known a priori.

The approach uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of fixed dimensionality, and then another deep LSTM to decode the target sequence from that vector.

### BLEU

$$\mathrm{BLEU} = \mathrm{BP} \times \exp\Big(\sum_{n=1}^{N}w_n\log{P_n}\Big)$$

where $\mathrm{BP}$ is the brevity penalty, $w_n$ are the n-gram weights, and $P_n$ is the modified n-gram precision.
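A minimal sentence-level sketch of this formula in pure Python (single reference, uniform weights $w_n = 1/N$, no smoothing; `bleu` is an illustrative helper name, not a standard API — real implementations also aggregate corpus-level statistics):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sketch of BLEU = BP * exp(sum_n w_n * log(P_n)) for token lists."""
    weights = [1.0 / max_n] * max_n
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Modified precision: clip each n-gram count by its reference count.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            # Unsmoothed sketch: any zero precision makes the score zero.
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    return bp * math.exp(sum(w * lp for w, lp in zip(weights, log_precisions)))
```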

### Recurrent Neural Network

Given a sequence of inputs $(x_1, \dots, x_T)$, a standard RNN computes a sequence of outputs $(y_1, \dots, y_T)$ by iterating the following equations:

$$h_t = \mathrm{sigm}(W^{hx}x_t+W^{hh}h_{t-1})$$

$$y_t = W^{yh}h_t$$
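One forward step of these two equations can be sketched in pure Python (weights as nested lists standing in for the matrices $W^{hx}$, $W^{hh}$, $W^{yh}$; `rnn_step` is an illustrative name, not an API):

```python
import math

def sigm(x):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def rnn_step(x_t, h_prev, W_hx, W_hh, W_yh):
    # h_t = sigm(W_hx @ x_t + W_hh @ h_prev)
    h_t = [sigm(sum(W_hx[i][j] * x_t[j] for j in range(len(x_t)))
                + sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev))))
           for i in range(len(h_prev))]
    # y_t = W_yh @ h_t  (no nonlinearity on the output)
    y_t = [sum(W_yh[i][j] * h_t[j] for j in range(len(h_t)))
           for i in range(len(W_yh))]
    return h_t, y_t
```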

However, it is not clear how to apply an RNN to problems whose input and output sequences have different lengths and a complicated, non-monotonic relationship.

The LSTM is known to learn problems with long-range temporal dependencies; the goal here is to estimate the conditional probability $p(y_1, \dots, y_{T'}|x_1, \dots, x_T)$, where the output length $T'$ may differ from the input length $T$.

• Two separate LSTMs are used: one to process the input sequence and one to generate the output sequence;
• deeper LSTMs outperform shallow ones, so the paper uses four layers;
• reversing the word order of the input sequence improves results.
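The input reversal and the `<EOS>` termination symbol are simple preprocessing steps; a sketch, with `prepare_pair` as a hypothetical helper name:

```python
def prepare_pair(source_tokens, target_tokens, eos="<EOS>"):
    """Build one (encoder input, decoder target) training pair.

    The source sentence is reversed (the paper's input-reversal trick),
    and <EOS> is appended to the target so the decoder learns to stop.
    """
    encoder_input = list(reversed(source_tokens))
    decoder_target = list(target_tokens) + [eos]
    return encoder_input, decoder_target
```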

### Decoding and Rescoring

The model is trained by maximizing the average log probability of a correct translation $T$ given the source sentence $S$:

$$\frac{1}{|\mathcal{S}|} \sum_{(T,S)\in \mathcal{S}} \log p(T|S)$$

where $\mathcal{S}$ is the training set.
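Decoding then searches for the most likely translation left to right with beam search, as described in the glossary. A minimal sketch, where `step_log_probs` is a hypothetical callback standing in for the trained decoder LSTM:

```python
import math

def beam_search(step_log_probs, num_beams, eos, max_len):
    """Keep the num_beams highest-scoring partial translations per step.

    step_log_probs(prefix) -> {token: log_prob} for the next token
    given the decoded prefix (a stand-in for the decoder LSTM).
    """
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))  # finished hypothesis
                continue
            for tok, lp in step_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]
```

With `num_beams=1` each step keeps only the single best candidate, so the procedure reduces to greedy search, exactly as noted in the glossary.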