This page should probably be moved to BERT (language representation model) rather than BERT (language model). A language model has a specific meaning: it models the joint probability distribution over word sequences. BERT doesn't do that; although it can predict a masked word, it can't give you the probability of a whole sequence.
This would also be consistent with Wikipedia's own definition of a language model.
- I agree. Let us move it unless we see substantial protests. Trondtr (talk) 14:03, 1 October 2021 (UTC).