Byte pair encoding or digram coding is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. A table of the replacements is required to rebuild the original data. The algorithm was first described publicly by Philip Gage in a February 1994 article "A New Algorithm for Data Compression" in the C Users Journal.
A variant of the technique has shown to be useful in several natural language processing (NLP) applications, such as Google's SentencePiece, and OpenAI's GPT-3. Here, the goal is not data compression, but encoding text in a given language as a sequence of 'tokens', using a fixed vocabulary of different tokens. Typically, most words will be encoded as a single token, while rare words will be encoded as a sequence of a few tokens, where these tokens represent meaningful word parts. This translation of text into tokens can be found by a variant of byte pair encoding.
Byte pair encoding exampleEdit
Suppose the data to be encoded is
The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, "Z". Now there is the following data and replacement table:
Then the process is repeated with byte pair "ab", replacing it with Y:
ZYdZYac Y=ab Z=aa
The only literal byte pair left occurs only once, and the encoding might stop here. Or the process could continue with recursive byte pair encoding, replacing "ZY" with "X":
XdXac X=ZY Y=ab Z=aa
This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once.
To decompress the data, simply perform the replacements in the reverse order.
- Gage, Philip (1994). "A New Algorithm for Data Compression". The C User Journal.
- "A New Algorithm for Data Compression". Dr. Dobb's Journal. 1 February 1994. Retrieved 10 August 2020.
- Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1994). Managing Gigabytes. New York: Van Nostrand Reinhold. ISBN 978-0-442-01863-4.
- "Byte Pair Encoding". Archived from the original on 2016-03-26.
- "google/sentencepiece". Google. 2021-03-02. Retrieved 2021-03-02.
- Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini (2020-06-04). "Language Models are Few-Shot Learners". arXiv:2005.14165 [cs.CL].
- Sennrich, Rico; Haddow, Barry; Birch, Alexandra (2016-08-12). "Neural Machine Translation of Rare Words with Subword Units". Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (ACL): 1715–1725. doi:10.18653/v1/P16-1162.