The importance of Open Source AI and the challenges of liberating data

(This post is taken from a speech given remotely at LLW 2023 by OSI Executive Director Stefano Maffulli.)

The conference program places two talks back to back, titled “The goals of Open Source AI” followed by “The goals of a Free Software AI”… But to me, the distinction between Open Source and Free Software is insignificant. Open Source is the English term for something I’d call Software Libero in Italian. It’s time we stop making a distinction that only a small cabal understands and that the general public tends to ignore or, worse, misunderstand.

I started looking into AI with the fear that its complexities might make Open Source irrelevant. I say this after seeing what happened when two crucial technologies, the iPhone and AWS, passed us by, and I don’t want to repeat the mistake. Both radically impacted how software is distributed and executed, but the Open Source communities underestimated that impact. There were reactions like “the cloud is someone else’s computer” and “iPhones are locked, don’t use them.”

Today, the values of Open Source are largely foreign to both mobile and cloud.

If we miss addressing the impact of AI, too, we could kiss over 35 years of history goodbye, wrap it up and go fishing.

So what is the OSI doing about it? Last year, we started an investigation to understand the AI topic from multiple angles. TL;DR: this thing is useful, dangerous, and introduces new digital artifacts. You can read more in the report.

As long as we’re talking about nomenclature: I use the term Artificial Intelligence to refer to Machine Learning, Large Language Models, Deep Neural Networks and all of those systems. I’m aware of the over-hype around the supposed “intelligence” of modern systems. At the same time, the term AI is more than 70 years old and tied to a well-established scientific discipline. I find the alternatives proposed by some groups (SALAMI and others) reductive of the importance of the topic. Let’s keep it serious: legislation is coming, and legislators call it AI, too. We will keep using this term while remaining skeptical of the hype.

Open Source origin story 

It’s worth remembering that in the early days of computer science, software was widely available and not covered by copyright. The hacker community at the MIT AI Lab had complete freedom to run, copy, share and modify software. It was the introduction of copyright and trade secrets that forced Richard Stallman to devise a hack and introduce copyleft. Then came the GNU Manifesto and, finally, the GNU GPL. This sequence is important.

As a new artifact of human production came into existence (the software), a community was established around principles (like the Manifesto) to create new software (the GNU operating system), shared under a legal agreement that subverted the system (the copyleft license).

Back then, software was relatively simple: source code written by a human in an understandable language, irreversibly transcoded by a compiler into machine-readable code (the “binary”). It wasn’t until the 1970s that copyright was applied to software, too. In the US, it wasn’t until Apple v. Franklin in 1983 that it was clear that software fell under copyright protection.

Copyright puts obstacles in the way of sharing knowledge and innovation. So the GNU Manifesto sets out the Golden Rule:

If I like a program, I must share it with other people who like it. Software sellers want to divide the users and conquer them, making each user agree not to share with others.

It then lists the benefits of the GNU operating system:

  • […] much wasteful duplication of system programming effort will be avoided
  • Schools will be able to provide a much more educational environment… by encouraging all students to study and improve the system code
  • […] the overhead of considering who owns the system software and what one is or is not entitled to do with it will be lifted

The Golden Rule and its benefits can easily be adapted to modern AI systems by substituting the word “program” with “AI system”:

If I like an AI system, I must share it with other people who like it.

What do I need to share such an AI system?

Open Source AI is built on data

Modern AI is built on three components: hardware, knowledge and data. Acquiring hardware is only a function of money: richer organizations can procure enough GPUs and other custom chips fairly easily, as the recent announcement by Elon Musk shows. Legally, there aren’t many obstacles.

Knowledge is a function of time and money. There aren’t many developers and systems engineers capable of setting up clusters suitable for training large AI systems. But groups like EleutherAI, LAION and others demonstrate that it’s not too hard to gather enough knowledge to train complex models.

Data, instead, is a function of a variety of factors. First, large models require large datasets… ginormous ones. The Pile, used by EleutherAI to train LLMs, is 825 GiB (JSON, compressed). For comparison, all of Wikipedia is 43 GiB (XML, uncompressed).
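To get a sense of the scale gap, here is a minimal back-of-the-envelope sketch in Python, using only the two figures cited above. Note that it understates the difference, since The Pile figure is for compressed data while the Wikipedia figure is not.

```python
# Rough comparison of the corpus sizes cited above. The gap is
# understated: The Pile figure is *compressed* JSON, while the
# Wikipedia figure is *uncompressed* XML.

PILE_COMPRESSED_GIB = 825  # The Pile (EleutherAI), JSON, compressed
WIKIPEDIA_RAW_GIB = 43     # Wikipedia dump, XML, uncompressed

ratio = PILE_COMPRESSED_GIB / WIKIPEDIA_RAW_GIB
print(f"The Pile is roughly {ratio:.0f}x the size of Wikipedia, "
      f"even before accounting for compression.")  # ~19x
```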

Assembling large quantities of data is a technical challenge that’s also full of legal obstacles. Data is covered by a variety of laws and regulations: copyright, sui generis database rights, a patchwork of privacy laws that differ around the world, terms of use, and bilateral contracts.

AI systems are not as simple as software in the 1970s. There is no simple split between source code and binary. To create a GNU Manifesto for an Open Source AI, we need to start from data, because creating large datasets is not a simple function of time, money or knowledge.

Liberating data is the first step for an Open Source AI

Visual artists and developers reacted to the brouhaha following the announcements by OpenAI and other large corporations by reaching for copyright: “Thou shalt not use [my code|my art] in your dataset.”

This approach goes directly against the declared objectives of the GNU Manifesto adapted to AI. Putting copyright-based obstacles in the way of aggregating data forces users to agree not to share with others. The benefits for schools would be removed, and vast amounts of overhead would be added.

Plus, by putting obstacles in the way of data mining, we’re not preventing large corporations from accumulating data anyway. We’re leaving this space to big tech and big government agencies, which have already proved to be good at accumulating data.

I’d argue that creating datasets is already highly regulated by other laws: anti-discrimination, consumer protection, human rights, disability protection, privacy, national security, and many more legal frameworks. Why add copyright on top?

Instead, we should consider this an opportunity to remove copyright as much as possible in order to produce and spread knowledge and freedom. This is a historic opportunity to set new norms, just like copyleft hacked the legal system imposed on software back in the day.

Open Source has been spectacularly successful at addressing the proprietary, secret, overly protected software made and distributed by software vendors. However, this success was due to a combination of factors (the nature of software, the concept of a derivative work for copyleft, the actual distribution of software, the intrinsic inefficiency of proprietary development in many fields, etc.) that favored creating and maintaining commons that work. The same tools don’t apply to other fields, like data.

Keep the models out of copyright, too

Models are the product of processing datasets. For these, we don’t need copyright either. The upshot: we shouldn’t really be thinking about writing AI licenses.

But how can we protect the public from abuse? How can we keep paying jobs for writers and artists? How can we prevent mass disinformation campaigns and all the other doomsday scenarios we read about every time ChatGPT is mentioned?

My bet is that we already have all the laws we need to keep things under control. Anti-discrimination, labor protection, privacy, accessibility, slander and defamation laws all either already have provisions or can be amended to cover the new corner cases opened by AI.

Conclusion 

The values in Open Source are encapsulated in its Definition, but they can be distilled to “autonomy, transparency, frictionless innovation, education, community improvement”. The licenses are a way to enable these values in the face of a copyright law that defaults to the contrary. But the licenses themselves are not the mechanism that achieves these goals; it’s the community and the innovation they produce when you remove the legal barriers to collaboration.

The licenses do something else, too: they remove liability for sharing, and this lack of liability has been instrumental in allowing people to share. Upcoming regulation will block collaboration and sharing, both for software and for ML, and we should be exploring terms and mechanisms to avoid, as much as we can, the negative consequences of these new legal blockers to sharing.

It’s time to put our heads together not to write new licenses but to support policy makers so that Open Source can flourish in AI as it did in its early heyday. Reach out to me on Mastodon.

Image from Alma Studio via Canva.com