ParaBank v1.0 Full (~9 GB) (zip), ParaBank v1.0 Large, 50m pairs (~3 GB) (zip), ParaBank v1.0 Small Diverse, 5m pairs (zip), ParaBank v1.0 Large Diverse, 50m pairs (zip)


We present ParaBank, a large-scale English paraphrase dataset that surpasses prior work in both quantity and quality. Following the approach of ParaNMT, we train a Czech-English neural machine translation (NMT) system to generate novel paraphrases of English reference sentences. By adding lexical constraints to the NMT decoding procedure, however, we are able to produce multiple high-quality sentential paraphrases per source sentence, yielding an English paraphrase resource with more than 4 billion generated tokens and exhibiting greater lexical diversity. Using human judgments, we also demonstrate that ParaBank’s paraphrases improve over ParaNMT on both semantic similarity and fluency. Finally, we use ParaBank to train a monolingual NMT model with the same support for lexically-constrained decoding for sentence rewriting tasks.
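As a rough illustration of how lexical constraints can steer a decoder toward different paraphrases, the toy sketch below runs a small beam search over a hand-written table of token probabilities. Everything in it is hypothetical and is not the system described in the papers: `TOY_MODEL`, `constrained_beam_search`, and the `banned` and `required` parameters are names made up for this example. Banned tokens are pruned during search, while required tokens are only checked on finished hypotheses, which is a simplification of what real constrained decoders do.

```python
import math
from typing import Dict, FrozenSet, List, Optional, Tuple

# Hypothetical stand-in for an NMT model: for each output position,
# the candidate tokens and their log-probabilities.
TOY_MODEL: List[Dict[str, float]] = [
    {"the": math.log(0.9), "a": math.log(0.1)},
    {"movie": math.log(0.6), "film": math.log(0.4)},
    {"was": math.log(1.0)},
    {"great": math.log(0.5), "excellent": math.log(0.3), "wonderful": math.log(0.2)},
]


def constrained_beam_search(
    model: List[Dict[str, float]],
    banned: FrozenSet[str] = frozenset(),
    required: FrozenSet[str] = frozenset(),
    beam_size: int = 4,
) -> Optional[Tuple[List[str], float]]:
    """Return the best (tokens, log-prob) hypothesis that satisfies the constraints."""
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for step in model:
        expanded = []
        for tokens, score in beams:
            for tok, logp in step.items():
                if tok in banned:
                    continue  # negative constraint: never extend with a banned token
                expanded.append((tokens + [tok], score + logp))
        # Keep only the highest-scoring partial hypotheses.
        expanded.sort(key=lambda b: b[1], reverse=True)
        beams = expanded[:beam_size]
    # Positive constraint (simplified): filter finished hypotheses only.
    finished = [b for b in beams if required.issubset(b[0])]
    return finished[0] if finished else None


if __name__ == "__main__":
    for banned, required in [
        (frozenset(), frozenset()),
        (frozenset({"great"}), frozenset()),
        (frozenset({"movie", "great"}), frozenset()),
        (frozenset(), frozenset({"film"})),
    ]:
        result = constrained_beam_search(TOY_MODEL, banned, required)
        print(" ".join(result[0]) if result else None)
    # prints:
    # the movie was great
    # the movie was excellent
    # the film was excellent
    # the film was great
```

Each constraint set yields a different output for the same "source", which is the intuition behind generating several paraphrases per sentence. A real system would apply such constraints inside an NMT decoder rather than over a fixed table.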

For a detailed description of the collection and our methods, please see the following paper:

Hu, J. E., R. Rudinger, M. Post, & B. Van Durme. 2019. ParaBank: Monolingual Bitext Generation and Sentential Paraphrasing via Lexically-constrained Neural Machine Translation. Proceedings of AAAI 2019, Honolulu, Hawaii, January 26 – February 1, 2019.

Our evaluation data is available for download here.

To interact with the monolingual rewriter described in the paper, please check out this live demo. The rewriter can be downloaded here.

We also present an improved constrained-decoding framework, together with a better rewriter, in the following paper:

Hu, J. E., H. Khayrallah, R. Culkin, P. Xia, T. Chen, M. Post, & B. Van Durme. 2019b. Improved Lexically Constrained Decoding for Translation and Monolingual Rewriting. Proceedings of NAACL 2019, Minneapolis, Minnesota, June 2 – 7, 2019.

In our experiments on data augmentation, we overlooked a nice related article on NMT-based paraphrasing for adversarial model improvement; please see Ribeiro, Singh, and Guestrin (ACL 2018). There, the authors explicitly look for paraphrases that break models.

The improved rewriter is demonstrated here, and can be downloaded here.

We have also released our augmented MNLI and QA data, which was shown to improve the performance of some existing models.


Benjamin Van Durme
Matt Post
Rachel Rudinger
Huda Khayrallah
Ryan Culkin
Patrick Xia
Tongfei Chen
Edward Hu