Attention Is All You Need

  • Replicated the “Attention Is All You Need” paper (Vaswani et al., 2017)
  • Modified the architecture by swapping in different attention scoring functions to compare their behavior (see the sketch after this list) [TensorFlow]
  • Trained and tested on the WMT14 English-German dataset
  • Of the three scoring functions, scaled dot-product attention (the function used in the original paper) performed best on the WMT14 English-German test set, reaching 21.8 BLEU case-sensitive and 22.3 BLEU case-insensitive after 100,000 training steps (about 12 hours on a single K80 GPU)
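
For reference, below is a minimal TensorFlow sketch of scaled dot-product attention, the scoring function from the original paper, alongside two alternative scoring functions. Since this summary does not name the other two functions tested, the unscaled dot-product and additive (Bahdanau-style) variants shown here are illustrative assumptions, and all function and layer names are hypothetical.

```python
import tensorflow as tf


def scaled_dot_product_attention(q, k, v, mask=None):
    """The original paper's scoring function: softmax(QK^T / sqrt(d_k)) V."""
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(d_k)
    if mask is not None:
        scores += mask * -1e9  # push masked (padded/future) positions toward zero weight
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, v), weights


def dot_product_attention(q, k, v):
    """Unscaled variant (assumed alternative): softmax(QK^T) V."""
    scores = tf.matmul(q, k, transpose_b=True)
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, v), weights


class AdditiveAttention(tf.keras.layers.Layer):
    """Bahdanau-style scoring (assumed alternative): v^T tanh(W_q q + W_k k)."""

    def __init__(self, units):
        super().__init__()
        self.wq = tf.keras.layers.Dense(units)
        self.wk = tf.keras.layers.Dense(units)
        self.score = tf.keras.layers.Dense(1)

    def call(self, q, k, v):
        # Broadcast (batch, len_q, 1, units) + (batch, 1, len_k, units)
        # -> pairwise scores of shape (batch, len_q, len_k).
        scores = self.score(tf.nn.tanh(
            self.wq(q)[:, :, None, :] + self.wk(k)[:, None, :, :]))
        scores = tf.squeeze(scores, -1)
        weights = tf.nn.softmax(scores, axis=-1)
        return tf.matmul(weights, v), weights


# Example: 5 queries attending over 7 key/value positions with d_model = 64.
q = tf.random.normal((1, 5, 64))
k = tf.random.normal((1, 7, 64))
v = tf.random.normal((1, 7, 64))
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape)   # (1, 5, 64)
print(attn.shape)  # (1, 5, 7)
```

The 1/sqrt(d_k) factor keeps the dot products from growing with the key dimension; without it, large magnitudes push the softmax into regions with vanishing gradients, which is the paper's stated motivation for scaling.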