Wikipedia:WikiProject Languages

This WikiProject aims primarily to provide a consistent treatment of each human language on Wikipedia. Many languages already have extensive pages, but the systematic information on those pages is not presented in a consistent way. The purpose of this WikiProject is to present that information consistently, and to ensure that each of the major areas is covered at least briefly for each language.

These are only suggestions, things to give you focus and to get you going, and you shouldn't feel obligated in the least to follow them. However, try to stick to the format for the Infobox for each language. See the template for an example Infobox.

The easiest way to get started writing for a language that doesn't already have an article or to convert an article to the WikiProject format is to start with the template.

Article alerts

Articles for deletion

Categories for discussion

Redirects for discussion

Good article nominees

Requested moves

Articles to be merged

Articles to be split

Articles for creation

Quality articles

Featured articles marked in bold have appeared on the Main Page.

Article assessment

Place the {{WikiProject Languages}} project banner template on the talk pages of any language-related articles. To rate the article on the quality scale, add one of the following parameters:

  • class=FA for featured articles
  • class=A for A-class articles
  • class=GA for good articles
  • class=B for B-class articles
  • class=start for Start-class articles
  • class=stub for Stub-class articles (which may not necessarily have a "stub" message on them!)
  • class=NA for non-articles (templates, images, etc.)

See WP:GRADES for pointers on classification.

Statistics

Index · Statistics · Log


Article names

The guidelines for article titles for languages are at Wikipedia:Naming conventions (languages). In short, most language articles should be titled XXX language. Reasons for this recommendation:

  1. Ambiguity. While some languages have special forms that refer unambiguously to the language, English is inherently ambiguous about language names. Having a standard of "XXX language" ensures that the title is always unambiguous.
  2. Precedent. This is how Encyclopædia Britannica and many other English-language encyclopedias name their articles.

When there is nothing to disambiguate a language name from, as with Hindi, Esperanto or Inuktitut, there is no need for the word "language".

Whether the varieties of Arabic and Chinese should be called "languages" or "dialects" continues to be a highly controversial issue. The current convention is: use NAME + Arabic for Arabic varieties (e.g. Egyptian Arabic) and NAME + Chinese for Chinese varieties (e.g. Mandarin Chinese). Infoboxes are put at both Arabic language and Chinese language and at their first-level subdivisions. However, where there is little controversy that a variety of Arabic or Chinese is a dialect (when it is demonstrably mutually intelligible with other dialects), then 'dialect' is acceptable in the title.
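
The title conventions above can be summarized in a minimal sketch. The function `language_article_title` and its parameters are hypothetical names made up for illustration and are not part of any Wikipedia tool; deciding whether a name actually needs disambiguation remains an editorial judgment that the code simply takes as input.

```python
from typing import Optional


def language_article_title(
    name: str,
    needs_disambiguation: bool = True,
    variety_of: Optional[str] = None,
) -> str:
    """Return a suggested article title under the conventions described above.

    All parameter names are illustrative assumptions:
    - variety_of: "Arabic" or "Chinese" for varieties that follow the
      NAME + Arabic / NAME + Chinese convention.
    - needs_disambiguation: False when the bare name is already unambiguous.
    """
    if variety_of is not None:
        # e.g. "Egyptian" + "Arabic" gives "Egyptian Arabic"
        return f"{name} {variety_of}"
    if not needs_disambiguation:
        # e.g. Hindi, Esperanto, Inuktitut
        return name
    # Default: "XXX language"
    return f"{name} language"


print(language_article_title("Moriori"))
print(language_article_title("Esperanto", needs_disambiguation=False))
print(language_article_title("Egyptian", variety_of="Arabic"))
```

The sketch does not cover titles that use 'dialect', since that choice depends on the state of the literature rather than on a mechanical rule.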

Even in cases in which there is a consensus that varieties of a language have dialect status, the number of such dialects and the divisions between them are often vaguely defined, and controversies exist among dialectologists over whether certain varieties should be treated in a unified way or are best understood as separate though related varieties. Separate articles should only be written on varieties (e.g., Estuary English) or related groups of varieties (e.g., Hispanic English) that have been studied well enough by linguists that at least a minimal body of literature exists about that variety or group of varieties as a distinct dialect or group of dialects. Phonological, morphosyntactic, or lexical variation that may be considered subdialectal should be noted as "differences within X dialect", where X is a dialect as discussed in the relevant literature. Controversies over dialect status can be noted in articles as such, but should also be based on citable work. In the title, names used in the relevant literature to refer to a dialect should be preferred over folk-linguistic terms (e.g., Inland North versus Midwestern Accent).

Article structure

If you would like to create an article on a new language, you can use {{subst:New language article}} to help streamline the process. An example structure and explanation of the sections can be found at /Template for oral languages and /Template (sign language) for sign languages. Language articles are subject to Wikipedia's inclusion criteria.

Open tasks

General

Updates

Population data has been mostly updated from Ethnologue 16 to 17. However, an unknown number of articles which did not have the ref field set to "e16" slipped through the cracks; an example is Cumanagoto, which did not have a ref'd population figure because E16 had mistakenly listed it as extinct. Articles which are not ref'd to Ethnologue could be checked in case E17 has a more recent figure.

User:PotatoBot helps keep ISO redirects in sync with changing WP articles and ISO standards. The results of the latest run are displayed at ISO 639 log and ISO 639 language articles missing.

Names at Spurious_languages#Spurious_according_to_Glottolog with asterisks have not been addressed.

Articles to be created

Red links should either be redirected or have their own articles.

Articles with red links

99.9% of ISO language names have articles, though not always one-to-one (e.g. Fulani, Zhuang, and Mazatec); the 0.1% which do not are spurious, dubious, or insufficiently attested to justify their own article, and are redirected to an article stating that.

Lists for evaluation

The lists below are of self-links in our articles, language names from various sources which do not have articles or redirects, and suspicious cases to keep track of.

Lists of obscure names from common refs
INALI
  • 48 at INALI names for Mexican languages (27 Mixtec & 6 Nahuatl to be reviewed; 12 Zapotec & 3 others attempted). Even blue links may be wrong, due to confusion of similar town names or misidentification at Ethnologue.
AIATSIS
  • 7 potential languages w data. The AIATSIS db is periodically updated, with new languages confirmed.
Ethnologue 11
  • Holima ["near Dobu" – misreading of Molima?], Waelulu ["existence unconfirmed"; taken from V&V]
Voegelin (1977)
36 red-linked names; the list doesn't bother with red links for what Loukotka says is unattested.
Blue links have not been checked. Many are presumably inadvertent homonyms rather than the language intended by V&V.
Ruhlen (1987)
  • S.Am.: 12 (see key) extremely obscure names of mostly unattested languages, not even listed in Campbell & Grondona 2012, and for only a few does Loukotka say anything other than 'unknown'. Those not found in Loukotka might be copy errors.
There are also at least half a dozen names in Ruhlen which take you to what is apparently the wrong article. One is a typo, 3 are unidentified, and 2 have perhaps just been reclassified.
Campbell & Grondona
Linguist List local-use ISO
Glottolog
25 at Talk:Glottolog#Unclassified_languages
93 more at Wikipedia:WikiProject Languages/Glottolog languages without ISO codes -- both for Glottolog 2.2
Circular and suspicious links
Identity suspect
Nshi, Sotatipo, Lui, Pasto (wrong ISO?), Kanamarí and Karipuná (contradicted by E17), Gulei (marked "?" in list), Sonde, Ngoni, Pretoria-Tsonga (marked "§" in list) & Mangala
Circular links of ISO names with summary data
Loloish, Qiangic (3 listed + old name Pingfang, which I can't ID), unclassified Asian (Bhatola: presumably a Gond dialect, Warduji: presumably a Persian dialect), Hindi (Ghera: Pakistani enclave of unidentified Indian language), conlang codes (Kotava, Romanova: old articles were deleted as not-notable)
Cases to track
No 1-to-1 correspondence to ISO
Tracking only; no need to fix.
Gbaya language (Central African Republic), Gbaya language (Sudan), Syriac language
ISO languages without info box
Typically because there are problems in defining the language. Tracking only; no need to fix.
Minor languages covered in family article: Loloish (4)
Language uncertain: Mina, Majhwar
Rd. to script or history article: Epi-Olmec (undeciphered), Ancient Zapotec, Middle Korean
Rd. to spurious-language article: Parsi-Dari, Parsi, Tapeba
Newly discovered or unattested languages without ISO codes
Lubu (unattested and extinct)
Cuyama (unattested and extinct)

Requests for expansion

Images for articles in Category:Wikipedia requested photographs of languages.

Requests for attention

(no article Ashéninka people; Keres functions as the lang article but reads as a family article)

Tagged categories

Category:Articles lacking sources

Only language varieties are included here. Subjects such as 'French language in Jordan' and 'Westernized Chinese language', though in bad shape, are not listed because they would not be representative of the many unreferenced articles that are not about specific varieties.

  • 2004–2014: (only articles with 'language', 'dialect', 'creole', or 'pidgin' in name are included; distilled from an insane number of articles)
English: Jewish English languages
Germanic: Central Franconian dialects, Eastphalian dialect, Hamburgisch dialect, Norwegian dialects, Orsamål dialect, Ripuarian language, Sognamål dialect
Romance: Chipilo Venetian dialect, Comasco-Lecchese dialects, Fornes dialects, Pavese dialect, Sabino dialect, Sutsilvan dialects (Romansh)
Slavic: Debar dialect, Reka dialect, Strumica dialect
Maltese: Qormi dialect, Żejtun dialect
Chinese: Luoyang dialect, Mango dialect, Qihai dialect, Weihai dialect, Ningbo dialect, Ganyu dialect, Fu'an dialect, Xuzhou dialect
other: Kfar Kama Adyghe dialect (Adyghe), Enuani dialect (Igbo), Thanjavur Marathi dialect, South Korean standard language

Category:Orphaned articles

(same search terms as missing sources)

Ordek-Burnu language (moved to 'stele')

Open ISO issues

The following ISO requests for new languages from previous years were still open in January 2016. The articles should be updated if the requests are accepted. (See the current list, reviewed to 2021-02.)

Old open ISO change requests[3]
2020-039	tki	Iraqi Turkman language
2020-009	nww	Ndwewe language
2019-007	rrm	Moriori language
2011-041	vsn	Vedic Sanskrit
2009-081	elr	Katharevousa Greek
2009-060	ecg	Ecclesiastical Greek
2006-084	gkm	Medieval Greek

Articles proposed for deletion

including WP:AFD, WP:PROD and other processes

Articles to watch

The following are language articles which come under repeated POV attack, often for ethnic or nationalistic reasons. Feel free to add ones you've noticed, and to remove languages which have not been a problem for some time. That way, if one of us drops out from editing, the articles we've been watching hopefully won't go to pot.

(Note: Ethnologue 17 and the Swedish Nationalencyklopedin use Indian census data, which is not a RS because it does not have a consistent definition of Hindi. For example, part of the Awadhi population is listed under Awadhi, but most is counted as Hindi. This problem is acknowledged in the presentation of the census results, but has been lost in secondary sources.)
  • Serbo-Croatian & Croatian (subject to ARBMAC)
  • Saraiki dialect, Punjabi dialects, and "Panjistani" (requires text searches to purge repeated additions of contradictory claims of "Panjistani" to multiple articles)
  • Southern Luri language. It may be worthwhile splitting the Luri article, but so far the attempts to do so have been incompetent and motivated by OR redefinition of the language. The present description of the two varieties in the Luri article is so intertwined that splitting them would create something close to a content fork. — kwami (talk) 02:32, 4 September 2015 (UTC)
  • Assyrian Neo-Aramaic and Chaldean Neo-Aramaic, along with the ethnic articles. A seemingly chronic ethnic dispute.
  • Luganda and Baganda: deletion of ISO name
  • Misleading maps: Many national languages have had maps with half the world filled in because of emigration, with no apparent standard for what counts as a speaking population. Most of these will be caught by checking the top 100 at List of languages by number of native speakers.

Interpreting online sources of data

Ethnologue has long been the default source for language data on WP, despite its often poor referencing. It was the only global reference that was freely available online when the majority of WP language articles were created, but it has since become a very expensive pay site (2,400 US$/year with maps as of 2021). For those editors who do not have access to Ethnologue (and perhaps also for those who do), a combination of Glottolog, for classification and for general sourcing, and the Endangered Languages Project, for demographic data, is probably the most reliable default combination of free online sources, though there are also reliable specialized sites such as AIATSIS for Australia. Linguist List/MultiTree retains some value for long-extinct languages.

[Note, 2021-01-03: Links to ELP should be added to all relevant infoboxes this year. Upon request, ELP has sent us an ELP-to-ISO/LL code mapping, so we're just waiting for approval for a bot to add the links.]

There are several advantages to Ethnologue: for many languages, it's the only demographic data we have; for others, it provides a check on the politicization and population inflation that we experience when we allow advocates of a language to cherry-pick sources. Nonetheless, Ethnologue data needs to be carefully evaluated. Besides the now prohibitive cost, there are a few common and serious problems:

  • The family trees are auto-generated, and should not be relied on. Auto-generation is skewed by idiosyncratic entries in the language articles. In E16, for example, the Maban family was listed as a branch of the Luo languages, because one of the Luo languages was named Maban; meanwhile, there were two separate Luo branches of Nilotic due to the spelling of "Luo" not matching across articles. The more obvious problems of this sort had been remedied in E17, but Ethnologue trees are still not a RS for classification, and the languages under a node are not a RS for the membership of a particular group. Many of our articles still say that there are X languages in the Y branch of a family, based on Ethnologue, but all that can be relied on is the classification cited in individual Ethnologue articles, and those are not sourced.
  • Speaker data is inconsistent. For instance, in E14, Gawwada was cited as having 32,698 mother tongue speakers, including 27,477 monolinguals, based on the 1998 census. In E17, it is cited as having 68,600 speakers based on the 2007 census, but still 27,500 monolinguals, without informing the reader that that figure comes from an older census. Similarly, the cited size of the ethnic group may be only half the cited number of speakers, due to it being several decades older. If the number of monolinguals or ethnic members is not given a citation date by Ethnologue, it is useless and should not be repeated by us. The number of speakers and the dialects of the language may be from different sources, with the result that the number of speakers may not be that of all dialects, or may include speakers of other ISO languages. (This is occasionally noted in the Ethnologue entry.) Very commonly, when a language is named after one of its dialects, the speaker number is that of the dialect, not of the language as a whole. Also, a language may be split up into separate ISO codes with the result that one article covers one variety but inherits the number of speakers of all varieties from the old article. Ethnologue has handled this well in recent years, but has not been able to go back and fix such errors inherited from old editions.
  • Ethnologue's arithmetic is consistently bad. For instance, Ethnologue lists five Central Iranian languages as having had 7,030 speakers reported in 2000. It appears that their source listed 35,000 speakers total, and Ethnologue divided that figure by 5 for the individual articles, with no indication that the result was no more than a guess. This kind of problem is not uncommon. Even more commonly, Ethnologue will add together incompatible data from various sources, paying no attention to significant figures. For example, if one source reported 2 to 5 million speakers in country A in 1975, and another 5 to 10 thousand in country B in 2006, Ethnologue will report the total as 3,507,500 speakers (3.5 million, the median of 2 and 5 million, plus 7,500, the median of 5–10,000). Old editions such as E14 are actually more reliable in this regard, as they tend to note that the estimate for country A was 2 to 5 million, when later editions will simply report 3.5 million as if that were the figure in the source. If the original source cannot be verified, we should at least look at each of the country figures that make up the total and redo the arithmetic, so as to avoid spurious precision as much as practicable.
  • Dates are not reliable indicators of when the data was taken. Unless they are census data, which has the problem all censuses do of speakers intentionally misreporting their language, the dates given by Ethnologue are the date of publication of their source. That can be several decades after the date the data was collected. The result is that an older cited date may report the same or more recent data than a newer cited date. For instance, several Australian languages were cited as "SIL 2011" in E17. However, in E16 they all had the same numbers of speakers cited to "Wurm and Hattori 1983". In other cases the source that Ethnologue uses may cite an old edition of Ethnologue, or the source that Ethnologue used in an old edition. And the sources themselves may have problems that are not mentioned in Ethnologue. For instance, one source from the 1990s notes that its numbers are copied from a 1980s publication that was based on unpublished fieldwork that had been conducted in the 1950s. In the Ethnologue entry, however, only the 1990s date was given. For another example, the data for the Hindi languages was updated between E16 and E17, based on the new Indian census. However, the census makes it clear that many Awadhi speakers, for example, reported their language to be "Hindi" rather than Awadhi. The result is that the E17 figure for Hindi is inflated by perhaps 100 million people who should be listed under other languages, but there is no warning about this in Ethnologue. Many entries are also undated. Some of these are recent oversights that will be fixed in the next edition, but many are inherited from old editions of Ethnologue, and the editorial team may be unable to identify their source. In such cases, citing the edition of Ethnologue that first reported the figure might give the reader some indication that it is not recent data.
  • Figures may be ethnic numbers and an order of magnitude greater than the actual number of speakers. A good start in cleaning this up was made in E17, but there has been some backsliding as well, with old linguistic survey data of heritage languages being replaced with recent census data that reports ethnic identification rather than language ability.
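
The arithmetic problem described above can be illustrated with a short sketch that uses the figures from the example (2 to 5 million in country A, 5 to 10 thousand in country B). The helper names `midpoint` and `round_sig` are made up for this illustration, and the choice of one significant figure is an assumption based on the precision of the larger source range, not a rule stated anywhere.

```python
import math


def midpoint(low: float, high: float) -> float:
    """Midpoint of a reported range."""
    return (low + high) / 2


def round_sig(value: float, sig: int) -> int:
    """Round a positive value to `sig` significant figures."""
    magnitude = math.floor(math.log10(value))
    factor = 10 ** (magnitude - sig + 1)
    return int(round(value / factor) * factor)


# Ranges from the example above.
country_a = (2_000_000, 5_000_000)   # reported in 1975
country_b = (5_000, 10_000)          # reported in 2006

# What the summed midpoints look like: spuriously precise.
naive_total = midpoint(*country_a) + midpoint(*country_b)

# Redoing the arithmetic on the ranges, then rounding each end to the
# precision of the coarsest source (assumed here: 1 significant figure).
combined_low = round_sig(country_a[0] + country_b[0], 1)
combined_high = round_sig(country_a[1] + country_b[1], 1)

print(f"naive total: {naive_total:,.0f}")
print(f"defensible range: {combined_low:,} to {combined_high:,}")
```

At this precision the figure for country B disappears into the rounding, which is the point: the combined total is no better known than the 1975 estimate for country A.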

Such problems are understandable: Ethnologue is an enormous project with a very small editorial team. For years, Ethnologue had a reputation for being unresponsive, so many linguists do not bother to correct the errors they find, but since ca. 2012 the editors have been appreciative of feedback, and the quality of their coverage has improved markedly. Nonetheless, Ethnologue's sources (when they can be identified) should be checked for the accuracy of its claims whenever possible, and other sources should be used when they are available and Ethnologue's sources cannot be identified.

Glottolog is a reliably cited and well-researched alternative to Ethnologue. Apart from not covering demographics, it does a generally superior job, for instance in verifying and updating the classifications it adopts, in marking languages as 'spurious' when they cannot be verified to exist, and most importantly in citing its sources both for the languages and for their classifications. But it is largely the work of a single person (Harald Hammarström), and he has not had the time to improve on Ethnologue for all the languages of the world, so in some cases Glottolog is not (yet) an independent source. In most cases Hammarström has personally vetted the sources, even to the extent of doing his own comparison of the raw lexical or morphological data to evaluate which classification is the most accurate, though it may take some digging for the reader to determine all his evidence. He does not, however, distinguish whether a language with no known relatives is an isolate (a family of one) or simply unclassified due to lack of data or research, listing all such cases as 'isolates'. Maps are included, but the locations are points rather than areas as in Ethnologue (not that the areas in Ethnologue are necessarily accurate), and in some cases appear to be offset from where the language is actually spoken (all points on the map shifted by seemingly the same amount and direction, a problem that besets our automated location maps as well). Finally, Glottolog should not be relied on for dialects, as they were copied wholesale from MultiTree without verification and are often spurious. Only in a few cases has Glottolog since evaluated dialect data. (Dialects are typeset in italics, languages in boldface. Where the Glottolog dialects differ from those of MultiTree, they are likely to be Hammarström's or a colleague's work and thus reliable.)

The Endangered Languages Project does not attempt to include all the world's languages (it ignores languages with millions of speakers, for example, and as of 2021 doesn't cover some poorly documented areas of the world), but as of 2021 it has articles on 3585 languages/lects, 285 without ISO codes. ELP concentrates on demographic data, and tries to provide the most recent reliable sources for speaker population, transmission rate, bilingualism, etc., and so nicely complements Glottolog. In some cases it provides the date of the data, not just the date of its publication. Non-demographic data is minimal, and (like Glottolog) the maps show the languages as points rather than areas (e.g. Comorian, where the location dot is in the middle of the ocean between the islands). There are indications that some of the data has been input by people who don't understand it, such as locations of African languages being on the wrong side of the continent. It should therefore be used for its references rather than as a reliable source in its own right.

AIATSIS presents data from multiple sources for the indigenous languages spoken within the national borders of Australia. Its primary focus is on identifying the many names found in the literature, resolving synonyms and ambiguities, evaluating whether putative lects can be confirmed to be distinct languages or dialects, and identifying which names might benefit from further investigation of archival sources.

Unreliable sites

Linguist List / MultiTree is a former undergrad student project that includes a large number of language names not found in Ethnologue, but their identification is highly unreliable, and can often be seen to be spurious with even a cursory glance at the literature. Since the creation of Glottolog they are no longer of much value as a source of references for living languages, though they do provide some informative expert summaries of the literature for long-extinct languages.

ISO 639-3 is only a reliable source for ISO codes and names. It should not be relied on for preferred names or spellings, whether a lect is a distinct language or a dialect, or whether it is still spoken. For example, despite its stated ideal of distinguishing languages by mutual intelligibility, for political reasons ISO maintains separate 639-3 codes for Serbian and Croatian, Urdu and Modern Standard Hindi, and Malaysian and Indonesian, despite private acknowledgements that doing so violates their stated aim. (Such pluricentric distinctions would be better maintained at ISO 639-2.) However, because ISO 639-3 codes are widely used to identify languages, WP language articles should include the ISO 639-3 name in the lead or a dedicated section if they use something different for the article name, and we should create redirects for those ISO names and for the codes themselves.

Global Recordings Network copies much of its data from Ethnologue, misidentifies alternative names as languages, and contradicts itself with speaker numbers.

Templates

Infoboxes

Project banner

Please add {{WikiProject Languages}} to talk pages of relevant articles. Articles with this template are put into Category:WikiProject Languages articles.

Stubs

Language stubs should be tagged with the most appropriate template of these:

Userbox

After you sign up, you can add the project userbox to your user page by adding the following: {{User WikiProject Languages}}. Your username will then automatically be added to the Category:WikiProject Language members.

Related WikiProjects

This WikiProject is a descendant of WikiProject Linguistics. It has descendants of its own, most of which aren't particularly active at present.

See also:

Active

Inactive or defunct

Project volunteers

If you'd like to help out, be contacted by others interested in this WikiProject's subject, and receive task assignments and project-related updates on your talk page, please add your name here:

Categories
