User:Peter coxhead/Wikidata issues

Wikidata is a useful resource, but it has some issues that cause problems when it is used in connection with articles about organisms in a wikipedia, including the English Wikipedia. In summary, it correctly models neither taxa and their scientific names nor the relationships between articles in different language wikipedias.

Confusion between taxa and taxon names

Wikidata items conflate taxa and taxon names. Thus Lotus tenellus (Q50322536) is stated to be an "instance of taxon", where taxon (Q16521) is defined as a "group of one or more organism(s), which a taxonomist adjudges to be a unit". Pedrosia tenella (Q50328673), Lotus leptophyllus (Q49625795) and Pedrosia leptophylla (Q50329708) are also labelled "instance of taxon". So in Wikidata, there are four instances of taxon, i.e. there appear to be four distinct taxa. However, there is actually only one taxon, which has four different names: one accepted name (which may vary by source) and three synonyms. All Wikidata items that are said to be "instance of taxon" are actually instances of taxon names. There are no items in Wikidata that correspond to taxa.
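The distinction can be made concrete with a minimal sketch. The `TaxonName` and `Taxon` classes are hypothetical and are not a Wikidata schema; the Q-identifiers are the ones cited above. No name is marked as accepted, since that may vary by source.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaxonName:
    """A scientific name; this is what a Wikidata "taxon" item actually records."""
    name: str
    qid: str


@dataclass
class Taxon:
    """A group of organisms; hypothetical class with no Wikidata counterpart."""
    names: list[TaxonName] = field(default_factory=list)


# The four items that Wikidata labels "instance of taxon".
wikidata_items = [
    TaxonName("Lotus tenellus", "Q50322536"),
    TaxonName("Pedrosia tenella", "Q50328673"),
    TaxonName("Lotus leptophyllus", "Q49625795"),
    TaxonName("Pedrosia leptophylla", "Q50329708"),
]

# What those items describe: a single taxon that carries all four names.
taxon = Taxon(names=list(wikidata_items))
taxa = [taxon]
```

Counting `wikidata_items` gives four, while counting `taxa` gives one, which is the mismatch described above.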

See below for a consequence of this confusion.

Long discussions at Wikidata (see e.g. here and here) have led to the conclusion that no-one knows how to represent taxa and taxon names in Wikidata, or indeed in any kind of regular relational database. The problem can be illustrated by considering the case of Lemnaceae versus Lemnoideae. There are two well-supported views on how to classify duckweeds. Molecular phylogenetic evidence shows that they are firmly embedded within the arum family, Araceae. So duckweeds can be treated as the subfamily Lemnoideae of the family Araceae.[1] On the other hand, the morphology of duckweeds is radically different from the rest of the family. Accordingly, those who are willing to accept paraphyletic taxa make duckweeds a separate family, Lemnaceae, and accept that the reduced Araceae is paraphyletic.[2]

Define a taxon as a group of organisms considered to be a unit by one or more taxonomists. Then there are three taxa involved:

  • Taxon 1 consists of all plants that can be placed in a broadly defined family Araceae, including duckweeds.
  • Taxon 2 consists of all plants that can be placed in a more narrowly defined (and paraphyletic) family Araceae, excluding duckweeds.
  • Taxon 3 consists of all plants that can be considered to be duckweeds.

The diagram below attempts to represent this situation, showing two taxonomic views.

 
  • In the "red view", Taxon 3 has the name Lemnoideae and has Taxon 1 as its parent. Taxon 2 is not recognized.
  • In the "green view", Taxon 3 has the name Lemnaceae. Taxon 2 is the related but separate family Araceae. Taxon 1 is not recognized.
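The two views can be sketched as plain mappings, assuming taxa are identified by what they contain rather than by a name. The identifiers and helper functions below are illustrative only. The green view's parent mapping is left empty because no parent for Taxon 3 is given in that view.

```python
# Taxa are identified by their content (circumscription), not by a name.
TAXON_1 = "aroids including duckweeds"
TAXON_2 = "aroids excluding duckweeds"
TAXON_3 = "duckweeds"

# Each taxonomic view assigns names only to the taxa it recognizes.
views = {
    "red": {TAXON_1: "Araceae", TAXON_3: "Lemnoideae"},
    "green": {TAXON_2: "Araceae", TAXON_3: "Lemnaceae"},
}

# Parent links are also per view; only the red view nests Taxon 3 in Taxon 1.
parents = {
    "red": {TAXON_3: TAXON_1},
    "green": {},  # no parent for Taxon 3 is stated for this view
}


def names_of(taxon):
    """All names given to one taxon across the views."""
    return {names[taxon] for names in views.values() if taxon in names}


def taxa_named(name):
    """All taxa that carry the given name in at least one view."""
    return {t for names in views.values() for t, n in names.items() if n == name}
```

Here `names_of(TAXON_3)` yields two names, and `taxa_named("Araceae")` yields two taxa.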

"Lemnaceae" and "Lemnoideae" are two different names at different ranks for the same taxon. "Araceae" is the same name at the same rank for two different taxa, only one of which is the parent of Lemnoideae. As of June 2020, no-one seems to know how to represent this situation in Wikidata, and so it only models the taxon names (although appearing to model taxa). The relationship between the Wikidata taxon name/taxon items (including the duckweed genus Lemna) as of 12 June 2020 is shown below. (The dashed arrows represent several extra parent links in the Wikidata items.)

 

The Wikidata relationships do not capture the fact that Lemnoideae and Lemnaceae are alternative names (at different ranks) for precisely the same taxon. The "parent" relationships between so-called "taxon" items in Wikidata generate a classification network rather than a classification tree.
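A minimal sketch of how a network differs from a tree: in a tree, no item has more than one parent. The edges below are illustrative assumptions obtained by merging the two views. They are not a snapshot of the actual Wikidata statements, and the function names are hypothetical.

```python
from collections import defaultdict

# Illustrative (child, parent) links between name items; NOT real Wikidata data.
parent_links = [
    ("Lemna", "Lemnoideae"),  # parent under the red view
    ("Lemna", "Lemnaceae"),   # parent under the green view
    ("Lemnoideae", "Araceae"),
]


def multi_parent_items(links):
    """Return the items that have more than one distinct parent, sorted by name."""
    parents = defaultdict(set)
    for child, parent in links:
        parents[child].add(parent)
    return sorted(child for child, ps in parents.items() if len(ps) > 1)


def could_be_tree(links):
    """True if no item has more than one parent (necessary for a tree)."""
    return not multi_parent_items(links)
```

With these edges, `multi_parent_items(parent_links)` reports Lemna, so the merged structure cannot be a tree.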

Using Wikidata for classification hierarchies

It is sometimes suggested that the English Wikipedia should use Wikidata to derive the classification hierarchy in taxoboxes. The diagram above should make clear why this will not work. Wikidata is, rightly, neutral between reliable data sources. It does not, and should not, choose one taxonomic view over another. On the other hand, it has been agreed that although the text in articles must always observe WP:NPOV, article titles and taxoboxes in one area of the tree of life need to present one coherent view to avoid inconsistencies (e.g. having articles on the same group of organisms under different names). Articles describe taxa, not taxon names, so there is only one article at Lemnoideae with Lemnaceae as a redirect, because WP:PLANTS uses the Angiosperm Phylogeny Group system for flowering plant taxoboxes.

(The automated taxobox system is able to encode alternative classifications for different areas of the tree of life. For example, ornithologists concerned with living species and dinosaur taxonomists concerned with extinct species use different classifications for birds. This is handled by the use of variant taxonomy templates, with more than one taxonomy template for a given taxon. Thus Template:Taxonomy/Ornithurae includes Dinosauria in the hierarchy, whereas Template:Taxonomy/Aves does not, because it goes to Template:Taxonomy/Ornithurae/skip.)

Non-1:1 relationships

Wikidata assumes there are 1:1 relationships between articles in different language wikipedias and the corresponding Wikidata items. An article in a particular language Wikipedia can only be linked to a single Wikidata item, and hence to only one article in another language wikipedia. However, this simply does not model reality, particularly for biology-related articles.

Multiple meanings in some languages

In English, the term "berry" has two overlapping but different uses: as a technical term in botany, discussed at Berry (botany), and as a general term for a variety of kinds of soft fruit, discussed at Berry. There is an item for each in Wikidata. The meaning of the technical term will be constant across languages, but there is no a priori reason to suppose that the same general term exists or is used in the same way in other languages. Hence there is no reason to suppose that there will be a 1:1 relationship between the articles in different wikis. For example, as of 11 June 2020, the German and Greek wikipedias had only one article. Where there is only one article, a choice has to be made in order to link it to only one Wikidata item. It cannot be linked to both, even if it covers both English meanings. (The consequences of this are perhaps clearer for monotypic taxa – see below.)

Monotypic taxa

In the English Wikipedia, monotypic taxa (i.e. taxa with only one lower ranked member) have a single article. For example, a monotypic genus will have a single article, normally titled as the genus, covering both the genus and its only species. The article Amborella also covers the order Amborellales, the family Amborellaceae and the sole species Amborella trichopoda, since the order, family and genus are monotypic. Other language wikipedias have different policies. Some also have a single article, but at a different title. For example, as of 20 June 2023, the Spanish Wikipedia has a single article covering the same taxa at es:Amborellaceae, and the Italian Wikipedia has a single article at it:Amborella trichopoda. Others may have articles at multiple ranks, regardless of whether they are monotypic. For example, as of 20 June 2023, the French Wikipedia has articles at fr:Amborellales, fr:Amborellaceae and fr:Amborella trichopoda.

Consider a simple example. One language Wikipedia (Wiki 1) has two articles, one for a monotypic genus X and one for its sole species X y. Another (Wiki 2 – the English Wikipedia, say) has only one article, for the monotypic genus, and a redirect for the species. Wikidata has taxon items for both the genus and the species.

 

There are two ways of linking Wiki 2 to Wikidata:

  1. Wikidata now allows links to be created to redirects, so two links can be created as shown by the solid red lines in the diagram. This seems the most obvious way of creating the links. One consequence, however, is that if you look at the article at "X" in Wiki 2, the interlanguage links in the sidebar will only show the article at "X" in Wiki 1. For a monotypic taxon, most of the information in the combined genus and species article will be about the species, so connecting to the genus article in Wiki 1 is generally not the most useful choice.
  2. The alternative is to use one link shown by the dashed blue line in the diagram. Now the sidebar in Wiki 2's article will show the probably more useful article at "X y" in Wiki 1, although the titles won't match.

To model the actual situation fully, all three links from the articles in Wiki 2 to Wikidata are needed, but Wikidata does not allow this.
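The two options can be compared with a minimal sketch. The item labels and the `sidebar_targets` helper are hypothetical. Each page maps to exactly one item, which is how the sketch encodes the 1:1 restriction.

```python
# Hypothetical item labels for the genus and species in the example above.
GENUS_ITEM, SPECIES_ITEM = "item:X", "item:X y"

# Wiki 1 has a separate article for the genus and for the species.
wiki1_links = {"X": GENUS_ITEM, "X y": SPECIES_ITEM}

# Option 1: Wiki 2's article and its redirect each link to the matching item.
wiki2_option1 = {"X": GENUS_ITEM, "X y (redirect)": SPECIES_ITEM}

# Option 2: Wiki 2's single article links to the species item instead.
wiki2_option2 = {"X": SPECIES_ITEM}


def sidebar_targets(page, own_links, other_links):
    """Pages in the other wiki that share the single item linked from `page`."""
    item = own_links.get(page)
    return [p for p, linked_item in other_links.items() if linked_item == item]
```

Under option 1 the sidebar of Wiki 2's "X" lists only Wiki 1's "X"; under option 2 it lists only "X y". Neither mapping can list both.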

Returning to an actual example, the English article Amborella cannot be explicitly linked to all three of the French articles at fr:Amborellales, fr:Amborellaceae and fr:Amborella trichopoda, even though the English article covers all three. (As of June 2023, the sidebar at Amborella connects to fr:Amborella trichopoda.)

Synonyms

As noted above, Wikidata models taxon names, although claiming to model taxa. Acceptance of a name is to some degree a matter of taxonomic opinion. Thus it is both possible and legitimate for different language wikipedias to use different synonyms as the titles for their articles about the same taxon. Since only 1:1 links are allowed, this causes problems in deciding which Wikidata item to link to.

For example, Streptocarpus ionanthus is a species previously placed in the genus Saintpaulia as Saintpaulia ionantha. Saintpaulia is now placed inside Streptocarpus as Streptocarpus sect. Saintpaulia by some reliable sources (e.g. Plants of the World Online). As Wikidata models taxon names, there are two Wikidata items purporting to be instances of taxa, Saintpaulia ionantha (Q165406) and Streptocarpus ionanthus (Q50870323). A realistic model for the situation in the English and French Wikipedias as of 11 June 2020 would have at least all the boxes and lines shown below.

 

However, the box marked with dashed lines does not exist in Wikidata, which only models taxon names. As of 11 June 2020, the actual relationship is as shown below. (The links between the taxon name items in Wikidata use properties like "basionym" or "synonym".)

 

When the issue is how just two articles at different synonyms connect, as here, redirects do help (although they don't solve the underlying modeling problem). As of June 2023, the Wikidata item Saintpaulia ionantha (Q165406) connects to both the French article at fr:Saintpaulia ionantha and the English redirect at Saintpaulia ionantha. Hence the language links at the top of the French Saintpaulia ionantha article do include a link that leads to the English Streptocarpus ionanthus. The reverse is also true: the Wikidata item Streptocarpus ionanthus (Q50870323) connects to both the French redirect at fr:Streptocarpus ionanthus and the English article Streptocarpus ionanthus, so the latter leads to the French article.

Making the interconnections work when different language wikis use different synonyms as titles requires:

  1. The necessary redirects to exist in the language wikis.
  2. Editors to have created the links to redirects at the relevant Wikidata items.
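The Saintpaulia example can be traced with a minimal sketch of the lookup, assuming the June 2023 sitelinks and redirects described above. The data structures and the `interlanguage_target` function are illustrative and are not how MediaWiki implements interlanguage links.

```python
# Sitelinks per Wikidata item, as described above (state as of June 2023).
sitelinks = {
    "Q165406": {"fr": "Saintpaulia ionantha", "en": "Saintpaulia ionantha"},
    "Q50870323": {"fr": "Streptocarpus ionanthus", "en": "Streptocarpus ionanthus"},
}

# Redirects inside each wiki: (language, title) -> title of the actual article.
redirects = {
    ("en", "Saintpaulia ionantha"): "Streptocarpus ionanthus",
    ("fr", "Streptocarpus ionanthus"): "Saintpaulia ionantha",
}


def interlanguage_target(lang, title, other_lang):
    """Follow a page's item to the other wiki, then resolve any redirect there."""
    for links in sitelinks.values():
        if links.get(lang) == title:
            linked = links.get(other_lang)
            if linked is None:
                return None  # the item has no sitelink in the other wiki
            return redirects.get((other_lang, linked), linked)
    return None  # the page is not linked to any item
```

If either redirect or either sitelink to a redirect is removed from the data, the corresponding lookup no longer reaches the article in the other language.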

Taxonbars

The {{Taxonbar}} template allows an English Wikipedia article to link to multiple taxonomic databases (and other sources) via their identifiers for taxon names. Some taxonomic databases, like Plants of the World Online, have an accepted scientific name for every taxon, and treat other names as synonyms, although these synonyms have their own identifiers. Other databases, like Tropicos, are more-or-less neutral, aiming to capture all scientific names that have been used. Since Wikidata "taxon" items are actually instances of taxon names, they can be linked to scientific names in taxonomic databases via their identifiers. Multiple Wikidata items and their links to taxonomic databases can then be shown in a taxonbar by using multiple parameters. Thus

{{taxonbar|from2=Q165406|from1=Q50870323}}

displays a taxonbar linking to both the Streptocarpus ionanthus and Saintpaulia ionantha items in Wikidata, and hence to the entries for these names in taxonomic databases. (Note, however, that only one of these Wikidata items will be shown in the left sidebar of the desktop version of the article, because of Wikidata's strict 1:1 rule.)
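When several items are involved, the wikitext can be generated rather than typed. The helper below is a hypothetical sketch that assumes only the numbered `from1`, `from2` parameter pattern seen in the example above.

```python
def taxonbar(*qids):
    """Build Taxonbar wikitext with one numbered `from` parameter per item."""
    params = "".join(f"|from{n}={qid}" for n, qid in enumerate(qids, start=1))
    return "{{Taxonbar" + params + "}}"


# The two items from the example above, numbered in the order given.
wikitext = taxonbar("Q50870323", "Q165406")
```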

References

  1. Cabrera, Lidia I.; Salazar, Gerardo A.; Chase, Mark W.; Mayo, Simon J.; Bogner, Josef & Dávila, Patricia (2008), "Phylogenetic relationships of aroids and duckweeds (Araceae) inferred from coding and noncoding plastid DNA", American Journal of Botany, 95 (9): 1153–1165, doi:10.3732/ajb.0800073, PMID 21632433
  2. Stace, Clive A. (2010), "141. Lemnaceae", New Flora of the British Isles (3rd ed.), Cambridge University Press, pp. 833–834, ISBN 978-0-521-70772-5. Stace takes the alternative view in the 2019 4th edition.