RoBERTa: A Robustly Optimized BERT Pretraining Approach

2019/11/18

Paper review of “RoBERTa: A Robustly Optimized BERT Pretraining Approach” (2019-07-26)

Abstract

  • This paper demonstrates the importance of hyperparameter choices and training data size.
  • Achieves state-of-the-art (SOTA) results on GLUE, RACE, and SQuAD.

Introduction

Modifications to BERT

  • Train with larger batch sizes and for more epochs
  • Remove the NSP (Next Sentence Prediction) objective
  • Train on longer sequences
  • Change the masking pattern every epoch (dynamic masking)

Contributions

  • better design choices and training strategies
  • CC-News, a newly collected dataset
  • better masked language model pretraining

Experimental Setup

Implementation

  1. A tuned Adam epsilon term
  2. β₂ = 0.98 (more stable than 0.999 when training with large batch sizes)
  3. Large batch sizes
  4. Training only with full-length sequences (up to 512 tokens)
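
A minimal sketch of these optimizer settings in PyTorch. β₂ = 0.98 and a tuned ε come from the paper; the learning rate, weight decay, and the stand-in model are illustrative assumptions.

```python
import torch

# Stand-in model; the actual model is a BERT-style Transformer.
model = torch.nn.Linear(768, 768)

# beta_2 = 0.98 is the paper's change for stability with large batches;
# lr, eps, and weight_decay values below are illustrative assumptions.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,
)
```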

Data

Increasing the pretraining data size improves end-task performance.

  1. BOOKCORPUS plus ENGLISH WIKIPEDIA (the original BERT training data)
  2. CC-NEWS: collected from the English portion of the CommonCrawl News dataset
  3. OPENWEBTEXT: web content extracted from URLs shared on Reddit
  4. STORIES: CommonCrawl data filtered to match the story-like style of Winograd schemas

Evaluation

GLUE, SQuAD, RACE

Training Procedure Analysis

Static vs Dynamic Masking

The original BERT applies masking once during preprocessing, so every epoch sees the same static mask. The authors compare this against duplicating the training data 10 times so each sequence is seen with 10 different masks, and against dynamic masking, which generates a new masking pattern every time a sequence is fed to the model. Dynamic masking performs comparably or slightly better, so RoBERTa adopts it (see the sketch below).
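
A minimal sketch of dynamic masking, under toy assumptions (whitespace tokens, a made-up vocabulary, and BERT's 80/10/10 replacement rule; the function name is illustrative, not from the paper):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]  # toy vocabulary

def dynamic_mask(tokens, mask_prob=0.15):
    """Draw a fresh random mask on every call (dynamic masking).

    Uses BERT's 80/10/10 rule: of the selected positions, 80% become
    [MASK], 10% a random token, 10% stay unchanged. Static masking
    would instead fix one pattern during preprocessing.
    """
    out = list(tokens)
    for i in range(len(out)):
        if random.random() < mask_prob:
            r = random.random()
            if r < 0.8:
                out[i] = MASK
            elif r < 0.9:
                out[i] = random.choice(VOCAB)
            # else: leave the token unchanged (but still predict it)
    return out

tokens = "the cat sat on the mat".split()
for epoch in range(3):
    print(epoch, dynamic_mask(tokens))  # a different pattern each time
```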

Model Input Format and Next Sentence Prediction

  1. SEGMENT-PAIR+NSP: the original BERT format; each input is a pair of segments, and a segment may contain multiple natural sentences

  2. SENTENCE-PAIR+NSP: each input is a pair of single natural sentences sampled from documents

  3. FULL-SENTENCES: each input is packed with full sentences sampled contiguously from one or more documents, so inputs may cross document boundaries

  4. DOC-SENTENCES: like FULL-SENTENCES, but inputs may not cross document boundaries

Depending on the downstream task, NSP helps in some cases and hurts in others. DOC-SENTENCES performs slightly better than FULL-SENTENCES, but it requires dynamically adjusting the batch size (inputs near document ends are shorter), so for ease of implementation the paper adopts FULL-SENTENCES. In short, RoBERTa trains with FULL-SENTENCES and without the NSP loss; a packing sketch follows below.
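
A minimal sketch of FULL-SENTENCES packing, under toy assumptions (whitespace tokenization instead of BPE; the function name, separator token, and tiny max length are illustrative):

```python
def pack_full_sentences(documents, max_len=12, sep="</s>"):
    """FULL-SENTENCES packing: each input is filled with contiguous
    sentences and may cross document boundaries; a separator token marks
    each boundary. Whitespace tokenization stands in for real BPE."""
    inputs, current = [], []
    for doc in documents:
        for sentence in doc:
            tokens = sentence.split()
            if current and len(current) + len(tokens) > max_len:
                inputs.append(current)   # this input is full; start a new one
                current = []
            current.extend(tokens)
        current.append(sep)              # mark the document boundary
    if current:
        inputs.append(current)
    return inputs

docs = [
    ["First doc sentence one .", "First doc sentence two ."],
    ["Second doc starts here ."],
]
for seq in pack_full_sentences(docs, max_len=12):
    print(seq)
```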


Training with large batches

Training with larger mini-batches (with an appropriately increased learning rate) improves both masked-LM perplexity and end-task accuracy, and large batches are easier to parallelize.
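
The paper trains with very large batches (up to 8K sequences). A common way to simulate such batches on limited hardware, shown here as a general technique rather than the paper's own setup, is gradient accumulation:

```python
import torch

model = torch.nn.Linear(768, 2)           # stand-in for the Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

accum_steps = 32                          # 32 micro-batches of 256 -> effective batch 8K
optimizer.zero_grad()
for step in range(accum_steps):
    x = torch.randn(256, 768)             # stand-in micro-batch
    y = torch.randint(0, 2, (256,))
    loss = loss_fn(model(x), y) / accum_steps  # average across the large batch
    loss.backward()                       # gradients accumulate in .grad
optimizer.step()                          # single update for the whole large batch
```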

Text Encoding

With a character-level vocabulary, rare unicode characters can take up a sizeable portion of the vocabulary when encoding large datasets. RoBERTa instead uses bytes as the base subword units (a byte-level BPE, following GPT-2), so a vocabulary of about 50K units can encode any input text without introducing unknown tokens. This does not noticeably improve end-task performance, but the authors argue that the advantages of a universal encoding scheme outweigh the minor differences.
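
A toy illustration of the byte-level idea: any string, in any script, reduces to UTF-8 bytes drawn from a base alphabet of only 256 symbols, so a byte-level BPE never needs an unknown token.

```python
# Any string, in any script, reduces to UTF-8 bytes from a base alphabet
# of only 256 symbols, so a byte-level BPE never needs an <unk> token.
for text in ["hello", "안녕하세요", "naïve"]:
    byte_units = list(text.encode("utf-8"))
    print(f"{text!r}: {len(text)} chars -> {len(byte_units)} byte units")

# In practice one would use a trained byte-level BPE vocabulary such as
# the ~50K-unit GPT-2 vocabulary RoBERTa reuses, e.g. via Hugging Face
# transformers (shown commented out; an assumption, not from the paper):
# from transformers import RobertaTokenizer
# tok = RobertaTokenizer.from_pretrained("roberta-base")
# print(tok.tokenize("hello 안녕하세요"))
```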

RoBERTa

  1. dynamic masking
  2. FULL-SENTENCES without the NSP loss
  3. large mini-batches
  4. a larger byte-level BPE vocabulary (50K units)

Important factors

  1. data used for pretraining
  2. the number of training passes (epochs) through the data

Results

RoBERTa achieves strong performance on the GLUE, SQuAD, and RACE benchmarks. The authors emphasize that the results do not rely on data augmentation (for SQuAD) or on ensembling and multi-task finetuning (for GLUE).

Conclusion

  • Performance improves by training the model longer, with bigger batches, and on more data.
  • Removing the NSP objective and applying dynamic masking also improve results.
  • The CC-NEWS dataset is introduced.
