The Open Source Initiative recently kicked off a multi-stakeholder process to define machine learning systems that can be characterized as “Open Source.” A long list of non-profit organizations, corporations and research groups has joined our call to find a common understanding of “open” principles applied to artificial intelligence (AI).
A group of people who work at Mozilla Foundation, Creative Commons, Wikimedia Foundation, Internet Archive, Linux Foundation Europe and OSS Capital, along with OSI board members, met recently in San Francisco to start framing the conversation.
Participants, who were not representing their employers, included: Lila Bailey, Adam Bouhenguel, Gabriele Columbro, Heather Meeker, Daniel Nazer, Jacob Rogers, Derek Slater and Luis Villa. The OSI’s Executive Director Stefano Maffulli and board members Pam Chestek, Aeva Black, and Justin Colannino also weighed in during the four-hour afternoon meeting at Mozilla’s San Francisco headquarters.
As the legislators accelerate and the doomsayers chant, one thing is clear: It’s time to define what “open” means in this context before it’s defined for us. AI is a controversial term and, for right now, the conversation about what to call this “open” definition is ongoing.
We want you to get involved: Send a proposal to speak at the online webinar series before August 4, 2023 and check out the timeline for upcoming in-person workshops. Up next is the first community review in Portland at FOSSY.
Why we’re in this together
This first small gathering aimed to set ground rules and create the first working document of a “Definition of AI systems” that reflects Open Source values.
The group brainstormed over 20 reasons for dedicating time to this milestone project. These included reducing confusion for policymakers, helping developers understand data sharing and transparency, reducing confusion for re-users and modifiers, creating a permission structure and fighting open washing.
A few in detail:
Good for business, good for the world
Participants agreed there’s value in understanding which startups and technologies to invest in, based on their “open practices” and contributions to the community.
One participant commented, “The point is not that we need a definition [of open AI] for business. The point is we need a definition to identify people who are doing technology in a way that shares it with the world, and that is what is important. Even if companies fail, they’ve still given something to the world.”
Cracking the black box
The group was soundly divided on the tensions and tradeoffs around transparency in ML training data. There’s a huge question when it comes to the sausage making that is today’s AI systems – what goes in and what comes out? Who gets to see the ingredients? What data should be transparent – zip codes, for example – and what information should not be – single-patient tumor scans, for example?
“When a private company creates private machine learning models, we have no idea what is forming or shaping those models, to the detriment of society as a whole,” one person commented. Another person added, “I’m very concerned about people blocking access to [their own personal financial or health care] data [that could be] used to train models because we’re going to get inherently biased…I hope that those designing the models are thinking long and hard about what data is important and valuable, especially if there are people saying ‘you shouldn’t use my medical data to train your model.’ That’s a very harmful road to go down.”
The value of openness
Open Source is about giving users self-sovereignty over their software. Presumably an “Open AI” would aim to give users the same self-sovereignty over their use of, and input into, AI systems. Self-sovereignty is the reason field-of-use restrictions are forbidden in Open Source: Those imply requiring permission from a gatekeeper to proceed.
“Part of this work involves reflecting on the past 20-to-30 years of learning about what has gone well and what hasn’t in terms of the open community and the progress it has made,” one participant said, adding that “It’s important to understand that openness does not automatically mean ethical, right or just.” Other factors, such as privacy and safety concerns, come into play when developing open systems – there’s an ongoing tension between a system being open and it being safe rather than potentially harmful.
“It is crucial to establish a document that not only offers a definition of openness but that also provides the necessary context to support it.”
Participants generally agreed that the Definition of Open Source, drafted 25 years ago and maintained by the OSI, does not cover this new era. “This is not a software-only issue. It’s not something that can be solved by using the same exact terms as before,” noted one participant.
“Tensions” may have been the word that popped up most frequently in the course of the afternoon. The push-and-pull between best practices and formal requirements, between what’s desirable in a definition and what’s legally possible, and between the value of private data (e.g. healthcare) and reproducibility and transparency were just a few.
Most participants felt that the new definition should not limit the scope of the user’s right to adopt the technology for a specific purpose. There have been a number of AI creators leaving projects over ethical concerns and a push for “responsible” licenses that restrict usage.
“People are shortsighted in all the ways that matter,” one participant said, citing the example of Stable Diffusion’s ban on using the deep learning, text-to-image model for medical applications. “There are researchers who have figured out how to read the minds of people with locked-in syndrome, people who have figured out how to see mental imagery. And yet they can’t help these people and make their lives better because, technically, it’d be violating a license.” These researchers, for context, do not have the millions of dollars necessary to create a Stable Diffusion-type model from scratch, so the innovation is stalled.
“With field-of-use restrictions, we’re depriving creators of these tools a way to affect positive outcomes in society,” another participant noted.
While several participants noted their support for the intent behind ethical constraints, the consensus was that licenses are the wrong vehicle for enforcement.
There was much talk about a “landscape of tradeoffs” around attribution requirements, too. In a discussion about data used to train models, participants said that requiring attribution may not be meaningful because there’s not a single author. Even though communities like Wikipedia care about acknowledging who wrote what, it doesn’t hold up in this context and the creators of automated AI tools already have ways of being recognized. The sheer length of the resulting attribution documents is another argument for skipping these requirements. One group member pointed out that “attribution” for a dataset might result in a 300-million-page PDF. “Completely useless. It would compress well, because most of it would be redundant.”
This conversation dovetails with the tension between transparency and observability on the one hand and requirements imposed by other regulations, such as those on privacy and safety, on the other.
This half-day discussion is only the beginning. Participants were well aware that the community will need more conversations and more collective thinking before finding a common ground. Send a proposal to speak at the online webinar series before August 4, 2023 and check out the timeline for upcoming in-person workshops. OSI members can also book time to chat with Executive Director Stefano Maffulli during office hours.