Kevin Golde: AI in Wikipedia - possible, but clear rules are needed

At the Wikipedia Day held in Welle7, Bern, on April 27, 2024, Wikipedian Kevin Golde (username: Wikiolo) was among our speakers under the theme “Wikipedia and artificial intelligence (AI): the race for free knowledge”. He replied to questions from our moderator Lisa Stähli. Here are his answers.

Hi Kevin, what is the current state of AI collaboration on Wikipedia and other knowledge platforms, and how is it set to evolve in the future?

AI itself has not generally been used in Wikipedia to date. This is mainly due to the fact that applications such as ChatGPT act as chatbots (computer programs that enable humans to interact with digital terminals as if they were communicating with a real person), and do not substantiate statements, for example. Particularly in the case of AIs that generate images, there is also the question of whether the content is freely licensed so that Wikipedia can use it.

The situation is different with Wikipedia as a source for AI. The free licenses in Wikipedia make it an excellent source of training data. There are also official cooperative ventures, such as the WikiGPT project with OpenAI, where ChatGPT directly accesses Wikipedia content.

But even today, simple AI tools could already prove useful in Wikipedia, for example to smooth out wording. In the medium term, I can imagine an AI tool being able to point out possible errors to a user, or to find (better) source(s). There is already a pilot project for this with the System Side software. AI could also make Wikipedia more interactive, for example by asking the reader questions about the article he or she has just read, by rephrasing an article on higher mathematics in simple language, by summarizing XXL articles, or by creating a set of slides for a presentation in the blink of an eye.

In addition, content that is missing in a language could be quickly supplemented by an AI translation. It may also be possible to set up a central Wikipedia from which the information can then be translated into the respective native language using an IT tool. This could be English, but another language is probably more suitable. It is important that this central Wikipedia is written in an unambiguous language so that no content is lost, simplified or distorted by translations. I don’t know whether an optimal language for this already exists; if not, an artificial language designed for this purpose could still be developed.

In the long term, I can imagine an AI tool itself writing, completing and updating articles based on sources. Even so, it’s essential that there are still people in the background to verify the information.

How can AI-generated content be prevented from spreading misinformation?

Machine learning applications based solely on data never work quite as intended: at the end of the day, they’re just statistical methods. What’s more, the quality of the results depends on the strength, completeness and impartiality of the training data. To err is human, and for this reason alone an AI tool that is trained on human data will always make mistakes. However, in order to keep misinformation to a minimum, at least for the sources that can be quoted, there should always be humans in the background who carry out a fact check and ensure an unbiased and balanced presentation of those facts. If such verification is not carried out, I see the risk that AI tools will end up quoting themselves so often that the result is nothing but gibberish. Another danger is that information is altered in favor of certain groups of people. In order to prevent this, we should think very carefully about whether this technology is to be owned by a handful of people, or whether it belongs in the hands of society as an infrastructure oriented towards the common good, somewhat along the lines of public broadcasting, as a truly democratic tool.

What strategies are there to promote culturally sensitive and inclusive language in AI-generated content?

It is up to the AI operator to ensure that issues such as culturally sensitive and inclusive language are not neglected. Laws or guidelines may be beneficial. Depending on the model, the authorities or, if artificial intelligence is organized along the lines of a public-law institution, the institutions themselves should ensure that this framework is adhered to. With ChatGPT, we can already see that this tool blocks certain requests. For example, it won’t spit out a list if you ask ChatGPT for the dumbest politicians. On the other hand, if you ask who the smartest politicians are, ChatGPT will, indeed, spit out a list. So patterns and typical requests can already be restricted in this way.

What ethical challenges arise from AI-generated content and how can they be specifically addressed?

For me, the first question that arises here is who should own such a powerful tool and what rules apply here. We all know the saying “knowledge is power”. That this saying is just as valid in the digital world is abundantly clear, even though Wikipedia disseminates knowledge freely. After all, today’s rules of the game, under which people with enough money can buy almost anything and do almost anything they want with it, have shown just how much power individuals can wield with their spare change. This became clear at the latest with Elon Musk’s demonstration of power in buying Twitter, after he was disturbed by the fact that a democratically voted-out person who had called for a coup had his Twitter account suspended.

If we assume that a commercial AI manages to ensure that Wikipedia suffers a similar fate to the Brockhaus encyclopedia, because most people will use AI instead of Wikipedia in the future, I dare not imagine what manipulations will then be possible. Depending on who is at the helm of this enterprise, I can imagine that the democratic world would face even greater domestic political challenges than it does today.

Filter bubbles could also become an even bigger problem for our society than they already are today. What can already be seen on social media could be exacerbated if companies program their AI to tailor answers to user profiles and then, for example, answer questions about climate change differently to a speeding fanatic on German freeways than to a Fridays for Future protester, in line with the motto “the customer is king”.

It is possible that such bubbles can be avoided by laws that prohibit personalized responses. However, the question here is always how well authorities can enforce these laws. To be on the safe side, the greatest possible transparency is therefore important, which in my opinion is best achieved by a platform organized under public law. With a well thought-out organizational structure, this should also generally be the best way to prevent attempts at manipulation.

How can Wikipedia make use of AI advances without compromising its integrity as a trusted source of knowledge?

I see a major problem for the future of Wikipedia in the fact that it does not completely fulfill the criteria of a trustworthy source. That is why we in the Wikipedia community always say that Wikipedia is not a source in itself. For example, even today, statements are often waved through that are unsubstantiated and often incorrect, and we have a large number of older articles that are still largely devoid of evidence. In my experience, these articles in particular contain content errors, but even statements with individual references should not be blindly trusted, as there is no peer review process in Wikipedia. In addition, Wikipedia allows you to write anonymously, so people who have no intention of contributing to a neutral encyclopedia have an easy time manipulating Wikipedia to their liking, especially when it comes to small topics.

And this is precisely where I think AI is promising. In future, it may be able to criticize possible errors, exaggerations or embellishments in articles and present them on the discussion page, for example. Or even suggest evidence that can be used to substantiate certain unsubstantiated statements. But as I said, for a truly reliable source, I see the human checking the AI in the final instance.

On the other hand, I also have to say that if Wikipedia remains as it is, it will only be a matter of time before it is overtaken in quality by an AI tool. I am sure that Wikipedia will be history by then at the latest, as even its last readers, for whom the AI was too unreliable, will turn away from the online encyclopedia.

How can wiki projects continue to exist in the long term if there is hardly anyone left to access the pages directly, participate or make donations?

I don’t see this danger at the moment. In case of doubt, the donation banner will be up for a week or two longer than it was the year before and the donation campaign will reach the same number of people in the end. If this actually becomes a problem in the medium term, the companies on the platforms that benefit from Wikimedia content would probably make large donations to keep Wikipedia going. The question here, of course, is whether they would then also try to influence Wikimedia in order to restructure it according to their own ideas.

In order to avoid this, the above-mentioned model of Wikimedia as an institution under public law would help to achieve the goal of sustaining Wikimedia in the long term. Depending on the model, universities, for example, could also participate in the improvement and updating of Wikipedia with the task of imparting their knowledge. New financial opportunities would also arise: for example, Wikimedia could develop or take over data-driven methods and its own LLMs (large language models) in the future. But for the time being, this is hardly feasible for financial reasons.

How can Wikimedia empower its community to navigate and adapt in a rapidly changing digital landscape?

Wikimedia has to move with the times and adapt to them. At the time, Wikipedia emerged from Nupedia, which still pursued a commercial approach. The innovative thing about Wikipedia was that, unlike Nupedia, everyone could contribute to the online encyclopedia, simply and democratically, without even needing a user account. That was of course great for laying the foundations of the encyclopedia, but I doubt very much whether this is the right formula for all eternity. As many people know, public trust in Wikipedia has increased enormously since its beginnings. It hasn’t come out of nowhere, but has been built on rules and criteria that a Wikipedia article must meet. With an encyclopedia that gradually completes itself, it is therefore clear that it is becoming increasingly difficult to start volunteering in Wikipedia, so that despite our best efforts, we are virtually unable to attract any newcomers today.

Due to the very low number of new authors and the constantly dwindling number of active authors, there are more and more articles that have to be maintained by fewer people. This work is mostly done by the classic Wikipedian, who is usually white, male and quite a bit older than me. As a consequence, Wikipedia is often unable to meet its own goals such as impeccable correctness, consistent topicality and neutrality of the articles. Wikimedia can still maintain itself because there is still no serious alternative to the main product, Wikipedia.

As I see it, there are signs that this comfortable time is slowly coming to an end. I think that now that we see what is possible with AI, we need to think about how Wikimedia should develop further. In my opinion, we should start by rethinking everything and analyzing whether Wikimedia is still sustainable or whether a different model should take over. Should Wikipedia and its sister projects still be a purely voluntary project, or should we also integrate colleges, universities and other institutions oriented towards the common good into our project? And is a donation-financed encyclopedia still appropriate as a source of world knowledge, or is it not actually a public task to provide an infrastructure oriented towards the common good, which Wikipedia definitely is today?

We should also question how we can fully develop Wikipedia into a trustworthy source. We should, for example, ask ourselves whether we should continue to leave relics of articles from the early days that are completely undocumented, or whether we should not rather rephrase these articles so that they can be considered a reliable source.

Thank you, Kevin!

Photo: Amrei-Marie, Wikiolo 03, Image editing: WMCH, CC BY-SA 4.0
