Wikipedia Is Trying to Transcend the Limits of Human Language


Welcome to Source Notes, a Future Tense column about the internet’s information ecosystem.

Wikipedia has 323 language editions, and at times, there are huge differences between them.

For instance, Jasenovac was a concentration and extermination camp during World War II, a fact documented in detail on English Wikipedia, Hebrew Wikipedia, and other language versions. But according to Croatian Wikipedia, Jasenovac was merely a labor camp.

Spanish Wikipedia refers to Catalonia as a Spanish autonomous community, whereas Catalan-language Wikipedia declares Catalonia to be its own country.

Until relatively recently, Cebuano Wikipedia said that the mayor of San Francisco was Dianne Feinstein. (Feinstein has not been mayor since 1988; Cebuano is a language spoken in the southern Philippines.)

Why are there such differences? Each language version of Wikipedia has historically been its own project, operating largely independently with the content managed by its own community of volunteer editors. In other words, there is not a singular Wikipedia—there are 323 separate Wikipedias. But at a conference in August, Wikipedia leaders presented a new initiative that could theoretically unify the information presented by all of the Wikipedias, a proposed language-independent encyclopedia that has been generating buzz and prompting a lot of questions within the free content movement.

“Functions are a type of knowledge, and therefore it’s our job to allow everyone to share in this knowledge,” Denny Vrandečić said while introducing Wikifunctions during Wikimania, the user conference for Wikipedia and the other free knowledge projects hosted by the Wikimedia Foundation, which this year had more than 4,000 registered virtual attendees. Wikifunctions is the first new Wikimedia project to be launched since 2012, and although the site itself is not expected to be available until 2022, development has already kicked into high gear.

At heart, Wikifunctions is rather technical: It will let the community create functions—that is, sequences of computer programming instructions. These functions will use data as inputs, apply an algorithm, and calculate an output, which can be rendered into one of the natural human languages to answer questions. That could have enormous implications for what you actually read on Wikipedia. A simple function might involve calculating how many days have passed between someone’s date of birth and date of death. The output would be the person’s lifespan, a fact that could appear in the content of that person’s Wikipedia biography.
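To make the lifespan example concrete, here is a minimal Python sketch of what such a function might look like. The function names and the English rendering are illustrative only, not actual Wikifunctions code:

```python
from datetime import date

def lifespan_days(born: date, died: date) -> int:
    """Input: two dates from structured data. Output: days between them."""
    return (died - born).days

def render_lifespan(name: str, born: date, died: date) -> str:
    """Render the computed value as a natural-language sentence."""
    years = lifespan_days(born, died) // 365
    return f"{name} lived for about {years} years."

# Example with Marie Curie's real birth and death dates:
sentence = render_lifespan("Marie Curie", date(1867, 11, 7), date(1934, 7, 4))
```

In a real Wikifunctions setup, the same computation would be rendered into each reader's language rather than hard-coded English.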

Returning to the Dianne Feinstein example: When Vrandečić reviewed how San Francisco was described in each language back in 2019, he noticed that 62 Wikipedia language editions listed an out-of-date mayor. The most egregious instance was the Cebuano Wikipedia, which still listed Feinstein as the current mayor of San Francisco. This is where Wikidata, the Wikimedia project that stores structured data in a central repository, could have helped. Wikidata allocates each item a unique QID; the concept “mayor of San Francisco,” for instance, is Q795295. Different language editions of Wikipedia can then insert Wikidata queries within their articles. That way, if the mayor of San Francisco is updated after an election, one change to the central Wikidata item can update all of the language editions of Wikipedia automatically.
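The mechanism can be sketched in a few lines of Python, with a dictionary standing in for Wikidata. Q795295 is the genuine QID, but the data layout and helper function here are invented for illustration, not Wikidata's actual API:

```python
# A toy, in-memory stand-in for Wikidata: one central record per QID.
wikidata = {
    "Q795295": {"label": "Mayor of San Francisco", "officeholder": "London Breed"},
}

def render_infobox_line(qid: str, language: str) -> str:
    """Each language edition queries the central item instead of
    hard-coding the officeholder in its own article text."""
    item = wikidata[qid]
    templates = {
        "en": "{label}: {officeholder}",
        "ceb": "{label}: {officeholder}",  # the Cebuano page reuses the same data
    }
    return templates[language].format(**item)

before = render_infobox_line("Q795295", "ceb")

# One edit to the central item updates every edition automatically.
wikidata["Q795295"]["officeholder"] = "Newly Elected Mayor"  # hypothetical name
after = render_infobox_line("Q795295", "ceb")
```

The point is that the Cebuano page never goes stale on its own: it re-renders from the shared record.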

But Wikidata is “limited,” Vrandečić explained at the conference. “It cannot do narration, and narration is fundamental for humans to learn.” Consider the scientist Marie Curie. The Wikidata item on Marie Curie, Q7186, reports that she received a Nobel Prize in Chemistry and a Nobel Prize in Physics. But Wikidata alone cannot express another concept that appears on her Wikipedia pages: that Curie is “the only person to win the Nobel Prize in two different scientific fields.” To achieve this, a future version of the Marie Curie page would use a machine-readable Wikifunction, something like “$Person is the only person $Condition.” By using the data about Q7186 and other data from Wikidata as input, the Wikifunction would generate an output to describe Marie Curie’s special status.
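A rough Python sketch of how such a templated function might work, assuming an English-only renderer. Real Wikifunctions would render into many languages, and the function name here is hypothetical:

```python
def only_person(person_label: str, condition_label: str) -> str:
    """Illustrative renderer for the abstract pattern
    '$Person is the only person $Condition'."""
    return f"{person_label} is the only person {condition_label}."

# The labels would come from Wikidata items like Q7186 (Marie Curie);
# here they are supplied directly for clarity.
sentence = only_person(
    "Marie Curie",
    "to win the Nobel Prize in two different scientific fields",
)
```

Because the inputs are structured data rather than English prose, the same abstract statement could in principle be re-rendered in Hausa, Hebrew, or Cebuano without anyone retranslating it by hand.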

In its future state, Wikifunctions is expected to be closely related to another project that has yet to launch, Abstract Wikipedia, an idea that Vrandečić first proposed in a Google working paper entitled “Architecture for a multilingual encyclopedia.” Before joining the foundation, Vrandečić worked as an ontologist at Google, and he explained to me in an interview that the name “Abstract Wikipedia” is trying to communicate that it’s a Wikipedia not written in a natural language, but in content abstracting from a concrete natural language. So, for example, the future Abstract Wikipedia page for Marie Curie might consist of several curated Wikifunctions, and these Wikifunctions would be used to express biographical information about Marie Curie, such as the fact that she was both a physicist and a chemist. The machine-readable abstract version of the Wikipedia page can then, theoretically, be piped out to the 323 language versions.

Abstract Wikipedia could help Wikipedias that currently have fewer articles. For instance, there are fewer than 12,000 Wikipedia articles in Hausa, a language spoken in West and Central Africa, compared with 6.3 million articles in English Wikipedia. Without automation, it would take a lot of human time and energy for Hausa Wikipedia to expand from its current thousands of articles to millions of articles—and of course, not everyone has the economic means to donate copious amounts of free labor to an internet encyclopedia. But the articles on Abstract Wikipedia could provide a good starting place. Since these articles are written in the machine-readable format of Wikifunctions, they can more easily be translated by machine into the many natural human-language editions of Wikipedia.

A few Wikipedia language editions with smaller communities of volunteer editors have famously gone off the rails, such as the “legendarily bad” Scots Wikipedia or the far-right historical revisionism of Croatian Wikipedia. Perhaps leveraging Abstract Wikipedia as a standard starting place for those editions could help bring some factual rigor to those projects, stopping bad actors whose political goal is to minimize the horrors of the Holocaust.

On the other hand, there are good reasons to tread carefully with machine-oriented initiatives like Wikifunctions, according to Brent Hecht, an associate professor at Northwestern University, where he leads the People, Space, and Algorithms research group. “It’s not so much that human knowledge can’t be reduced to data, it’s that we’re often bad at doing so,” he told me in an email. He shared an anecdote about a previous attempt to generate a Wikipedia article based on Wikidata that rendered something like the following: “Adolph Hitler was a painter, soldier, politician, art collector, and statesperson.”

Besides the potential risk of misleading the reader, another concern is clunky language. Perhaps the software rendering a Wikipedia page about Slate based on our Wikidata entry would produce something choppy and robotic-sounding like “Slate is an instance of an online magazine. It was founded in 1996. Its country of origin is the United States of America.” But when we spoke, Vrandečić was optimistic that Wikifunctions could eventually convey more nuanced and complex concepts. He also reminded me that Wikipedia isn’t necessarily supposed to be written in highly stylized or flashy prose. “This is an encyclopedia, and you’re reading information,” Vrandečić said. “It’s not a manifesto, it’s not a novel.”
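A naive renderer along these lines can be sketched in Python. The property-to-sentence rules below are invented for illustration, not the foundation's actual software, but they show how mapping one data statement to one canned sentence yields exactly this kind of choppy prose:

```python
# Turn Wikidata-style (property, value) pairs into one sentence each.
# Property names mirror Wikidata's labels; the sentence templates are invented.
def naive_render(subject: str, statements: list[tuple[str, str]]) -> str:
    rules = {
        "instance of": "{s} is an instance of {v}.",
        "inception": "It was founded in {v}.",
        "country of origin": "Its country of origin is {v}.",
    }
    return " ".join(rules[p].format(s=subject, v=v) for p, v in statements)

text = naive_render("Slate", [
    ("instance of", "an online magazine"),
    ("inception", "1996"),
    ("country of origin", "the United States of America"),
])
```

Getting from this one-fact-per-sentence output to fluent paragraphs is precisely the narration problem Wikifunctions would have to solve.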

Even if you accept the premise that an internet encyclopedia should be written from a neutral point of view, it is clear that some knowledge is contested within cultures, such as whether the population of Israel should include occupied and contested territories, or whether Catalonia is better described as a Spanish autonomous community or its own country. If more language editions relied on Abstract Wikipedia as a central source of truth, then a single dominant point of view could crowd out alternative perspectives. But Vrandečić countered that each volunteer community could decide for itself whether Abstract Wikipedia should be used as a baseline. The Wikimedia Foundation, the nonprofit organization behind Wikipedia, will not mandate that different language versions use the machine-readable, abstract version. That means that, for example, Hebrew Wikipedia and Arabic Wikipedia could each continue to present very different articles for the topic of Jerusalem.

Then again, not all differences between language editions reflect deep cultural divisions. The Cebuano Wikipedia did not say that Dianne Feinstein was mayor of San Francisco because Feinstein holds some special cultural importance to the Cebuano language community in the Philippines—no, that specific page just happened to be really out-of-date. Hecht pointed out that there are millions of articles in non-English editions of Wikipedia that do not have corresponding pages in English, including pages on local villages and landmarks. “The issue here isn’t that language communities necessarily disagree about this knowledge, it’s that they simply focused their attention differently,” Hecht said.

Some editors are also concerned about unintended consequences: because a single abstract article would feed hundreds of language editions, one error could multiply across all of them at once. During Vrandečić’s presentation at Wikimania, he was asked how Wikifunctions could maintain a diverse contributor base. After all, fewer than 20…

