Open Source software started in academic circles, and AI is not different.

Transcript from October 20th Deep Dive: AI Academia panel

Stefano Maffulli:

All right, here we go. Thanks everyone, and welcome to Deep Dive: AI. This is an event series from the Open Source Initiative. We started as a podcast season exploring how artificial intelligence impacts open source software, from developers and businesses to the rest of us. And the objective with the panel discussions is to better understand the similarities and differences between AI and, let’s call it, classic software, particularly open source software. And today’s panel is the last of four discussions where we deepen our exploration of challenges and opportunities of AI for society as a whole. I’m Stefano Maffulli, I’m the Executive Director of the Open Source Initiative, and today I’m joined by Mark Surman. His job is basically to protect the web. Mark serves as the Executive Director of Mozilla Foundation, which is a global community that does everything from making Firefox to taking stands on issues like online privacy. Mark’s biggest focus is building the movement side of Mozilla, rallying the citizens of the web, building alliances with like-minded organizations and leaders and growing the open internet movement. Mark, thank you for being here.

Mark Surman:

Thank you Stefano. It’s building the alliances and the licenses given we’re on an OSI call.

Stefano Maffulli:

We have Dr. Ibrahim Haddad. He’s the VP of Strategic Programs, AI and Data at the Linux Foundation. He’s focused on facilitating a vendor-neutral environment for advancing the open source AI platform, empowering generations of innovators by providing a neutral, trusted hub for developers to code, manage and scale open source technology projects. Haddad also leads the LF AI & Data Foundation and, as of a few weeks ago, the PyTorch Foundation. So thank you for being here, Ibrahim.

Ibrahim Haddad:

Thank you for having me. And hello to the other panelists as well.

Stefano Maffulli:

Then we have Chris Albon. He is the Director of Machine Learning at the Wikimedia Foundation, where he is applying statistical learning, artificial intelligence, and software engineering to political, social, and humanitarian efforts. He has long experience in data science for humanitarian causes, nonprofits and startups. He’s also written the Machine Learning with Python Cookbook and created the Machine Learning Flashcards. Thank you for being here, Chris.

Chris Albon:

You’re welcome.

Stefano Maffulli:

And Amy Heineike,

Amy Heineike:

You got it. That’s right.

Stefano Maffulli:

Veteran engineering and data science leader who’s been working in fast-growing startups for the past 15 years. Currently, she’s the VP of Engineering for the supply chain software company 7bridges. And she previously was on the founding team of Primer.AI, where she built out their natural language processing summarization engine, which she’ll have to explain, what that means, what that does. And she also scaled the technical team in both the US and UK. So thanks for being here, Amy.

Amy Heineike:

My pleasure. Thank you.

Stefano Maffulli:

You’re a mathematician, right?

Amy Heineike:

Yes. By training. That’s right.

Stefano Maffulli:

So that, yeah, that’s gonna be one of the questions that I wanna focus on because, so there are three topics here today that I’d like to focus on with you. And one is the role of academia and research for AI.

Stefano Maffulli:

What do you think that role should be, and how can we foster collaboration in AI? How do we get better in a very wide sense? And how do we get there faster? And, finally, I’d like to talk about the responsibilities of research and academia to keep society safe from AI. So, let’s start from this: research is one of the fields where the benefits of artificial intelligence start to become obvious, whether you are reviewing medical records for diagnosing cancer or acting as a watchdog for disinformation. And you all have roots in academia and research, but you also have crossed over to other sectors. So there is a lot of variety. In your experience, what do you think should be the role of universities and research institutions? You know, is their reason to be training grounds, ballast for the big technology companies, or watchdogs? Mark, maybe why don’t you take it from there?

Mark Surman:

Well, I mean, there’s so many different roles that the research side of academia can play. I guess it’s worth saying that Mozilla’s interest really is in AI aligning with the values we laid out in the Mozilla Manifesto for the web. And so in many ways, we think of the questions around AI today being similar to the questions we asked about the web 20 to 25 years ago, like how they actually stay open. I mean, we all have to connect to questions about how we deploy AI for people and respect human dignity. And so to me, when I think about the role of researchers, I think about how that contributes to that. Because I do think certainly where the most AI research happens and the most AI development happens is in the big companies, where their incentives are not particularly, by design, lined up with humanity.

Mark Surman:

Whereas academia and academic research actually can lend itself to that. And so I think that’s one thing: it is really to look at how we innovate in a way that can be in the interest of humanity through the research we can do in academia. But I guess the other thing I would just say is really to encourage, and you see this, but you don’t see enough of it, more of a focus on really pragmatic applied research and more of a focus on how we build out, you know, the open source stack for AI that is trustworthy, is human-values aligned, and can outcompete some of the more proprietary stacks. And so I think an open stack that respects human dignity, and you certainly see in Linux and other things in the web stack, for example, that we actually can build things that are open that outcompete the proprietary, or at least that compete equally. And I think building a trustworthy AI stack that outcompetes the proprietary stacks, or the stacks that don’t actually take human interest into mind, like, I believe that’s achievable. I see academic researchers as a key player in that.

Stefano Maffulli:

Amy, you’ve been, you know, from math to corporations. Maybe you have the largest amount of experience on this panel with corporations. How do you see that role?

Amy Heineike:

Yeah, and I think the first thing to say, which is probably an incredibly obvious point, but it’s staggering, the pace and diversity of the research and then innovation that’s going on in AI right now. So we founded Primer about eight years ago, and when we founded it, we were working in the NLP space. You know, the state of the art in some of the models we were using was like 65% for relation extraction. I think, like, summarization was not really that possible. And our entire way of thinking was about how to work around the fact that actually the models weren’t that good, but they did exist and they were starting to become more available and accessible. And now, you know, it’s not that many years later and you look at the incredible things that are possible with things like, I mean, we’re all excited about Stable Diffusion and all these incredible new models that are out there.

Amy Heineike:

It’s an incredible pace of research. I think it’s also very, very early days for us figuring out how you make this stuff useful in people’s day to day life. So what applications and exactly what roles this technology has for people. And so there are these very profound questions about, you know, if you build models, how do people interpret them or understand them? How do you actually tune them and make them applicable for different use cases? So we’ve worked with a lot of different enterprise use cases, for example. And it’s often very hard. So you can work with an organization that, you know, has money to spend that knows they wanna use AI in some way to speed up some workflow, and they can kind of describe the workflow. And it’s still incredibly hard to figure out how you actually use AI to solve, to solve those problems.

Amy Heineike:

So there’s a big open field here and it’s early days. And so, to more specifically think about the role of academia, certainly in my roles we’ve made friends with different research groups and presented to one another and shared ideas. We’ve hired people who’ve come from PhDs or in posts and collaborated with them. You know, ideas are kind of bouncing back and forth very quickly. And so there’s a lot of spaces where we need people to come in and look at these problems in different ways and come up with different ways of thinking about it, I guess. I think we don’t know yet how it’s gonna play out, exactly what’s gonna be possible and how it’s gonna be applied.

Stefano Maffulli:

It’s a recurring topic that I’ve heard mentioned many times in the past, and I wanna get back to that, this fast progress, this very rapid progress. Ibrahim, I see your mic up. What are your thoughts on the role of academia?

Ibrahim Haddad:

So I think it’s very critical to have academia involved in the specific topic of ethical AI, right? I think, because this is where we’re going in our discussion today. When we look at the largest and most successful open source projects in the domain of AI, a lot of them have roots in academia. And this is where really a lot of professors, you know, years and years ago, actually started looking into these different areas of AI in general, not really the ethical part. And a lot of their efforts were kind of smaller building blocks in an effort to get to where we got today in terms of these, you know, open source AI platforms, libraries, tooling and frameworks and so on. And we see that in LF AI & Data, where a lot of our projects have roots in academia, whether it’s graduate students or end of field projects that evolved to a point where they kind of gained some traction and started to grow and so on.

Ibrahim Haddad:

So there’s definitely a lot of influence from academia. And there was, I think, a turning point probably around 2017, 2018, 2019, around that point, where a lot of the commercial organizations realized that there isn’t enough AI talent to go around. And then at that point, they started kind of a land grab in academia, basically sponsoring whole AI departments at different universities. And actually Canada is one of these countries where they have some pretty decent universities and very advanced AI research, where different organizations, including my previous employer, went and, you know, sponsored kind of the whole department and kind of had their own agenda being driven by research, to get kind of the mind share of researchers and so on. So definitely there’s a lot of influence, and a lot of the current successful projects actually came from academia.

Ibrahim Haddad:

So for me it’s kind of essential. As for the specifics on ethical AI, what I would like to mention is that, you know, in LF AI & Data at the Linux Foundation, we consider this a very critical topic. And we actually have a dedicated committee of, you know, a lot of companies, dozens of organizations that are working under the umbrella of what we call the Trusted AI Committee. And this committee actually convenes every two weeks, and they have two different work paths. The first one is focused on kind of principles of trusted AI, and there are published papers that discuss these principles, you know, to a certain degree of detail. And this is all great, but then when it comes to the applicability of these principles, we also have three different hosted projects in the foundation whose goal is to basically look at different software stacks and come to a conclusion on whether these stacks actually deploy ethical AI based on the different governance principles that were defined.

Ibrahim Haddad:

So these are actually source code that you can go download and compile and build and run on different code. And they’re being used today in different industries, more specifically, as a great example, the financial and insurance industry. So for us, this is a great topic and certainly a hot topic. And what makes it more complex, and I’m gonna have this as my ending kind of comment, what makes this very complex to maneuver is really the different legislations that we see across the globe. So, you know, different countries, like in North America, Canada and the US more specifically, in Europe, whether it’s individual countries or the EU bloc as a whole, China, Japan, Singapore, and other countries, they are introducing different legislation with respect to AI and the ethical use of AI. And some countries even have an AI officer or AI minister that oversees the legislation and the different applications of AI and its ethical use. So certainly an extremely hot topic globally.

Stefano Maffulli:

Indeed, it is. And, since you bring up the ethical approaches and the reasons of, I mean, the question of trust Chris, you probably have, I, I’d like to hear thoughts on, on that front, like you have a lot of experience in that, in that field.

Chris Albon:

Yeah, I, I mean definitely this, this leads into what I think is probably one of the really interesting parts is, you know, my PhD’s in, in the social science, and the thing that is interesting to me is, is the fact that even the discussion of, of ethical AI would not have happened except for the fact that like a lot of social scientists with computer scientists in academia got together and triggered the conversation, and now it’s spread into industry and it’s, you know, we’re having more panels on it and it’s being incorporated into work, but that initial core of people who started talking about it were in academia. And that’s really like a very applied idea of how it can be used. And for me, you know, I I, to, to echo Ibrahim’s point, like the stuff that, I think there’s probably two areas where academia plays a huge role.

Chris Albon:

Like one is computer science pushing the limits of what AI can do, right? That’s like, you know, new models, new architectures, new scale. Like how do we really push those techniques forward, which then get incorporated into, you know, more applied settings like us. And then the other one is the social science aspect, like legislation, right? Like, how do politics play a role? I would bring it larger than strictly ethical AI in a little bucket. I would say, like, hey, the societal impact, the economics behind it. Like, does generative art create jobs? Does it lose jobs? Where do people go? Like, there’s a whole world there of understanding, because the impact of AI is so large that, you know, there’s things to learn from lots of different areas. And it’s stuff that we take advantage of a lot at the Wikimedia Foundation because there’s lots of research that uses Wikipedia.

Chris Albon:

And we use that a lot in our work of like, okay, how do you define a sort of type of edit, like a swarm edit, where lots of people come together to figure out how to edit an article together? Like how do we detect something like that using ML? Well, we go to the academic research that has sort of looked at that before and try to figure out how to work from that. So there’s so much value in that, but I do wanna, like, take it, I admit the point that, yeah, there’s tons of stuff that’s happening in that computer science pushing the front forward, and we do that and that’s really valued, but there’s also lots of areas in social science, economics, legislative, political science, you know, sociology and that kind of stuff that are actually very useful. And I think it’s probably a little bit underutilized, but I think that’s going away with time.

Stefano Maffulli:

So, yeah, the way I’ve seen it, this is really progressing very fast. And Amy, you touched on this, and Mark, you also touched on the point that there’s a lot of movement between researchers going to work for companies and then being driven by different motivations. And that has also created quite some amount of friction, with researchers not working for large corporations having, you know, to raise their hand and say, wait a second, you’re doing this wrong. You’re making a mistake. You’re pushing what I’ve been building to do something that is dangerous, or just not meant to do what it’s been doing. And this all makes me think that it’s kind of early to start going out of the labs and from basic research into applied research, or to applied products rather than research. You know, I got that feeling. What are your thoughts on this? I’d like to hear all of your opinions. Is this ready for prime time?

Amy Heineike:

You know, in terms of groups of academics who are also coming in, I just wanna give a shout out to the biologists, because we actually found in the NLP space, we ended up hiring a lot of people who’d come from biology or chemistry or applied sciences, who’d learned machine learning in those contexts and had to therefore wrangle with really messy data and very specific questions that they were trying to answer. And so they ended up kind of unpacking these problems in maybe different ways and very practical ways. And so when they came into businesses and they were thinking about enterprise use cases or something, they brought a very useful perspective. And so, sorry to go back on that point a little bit, I think it’s really fascinating seeing how there are very different perspectives that need to come together still to be able to use any of this stuff at all in any kind of meaningful way when you get down to really practical use cases.

Amy Heineike:

And so I think that dynamic is fascinating. But yeah, as you said, I think it’s early, but I don’t know that you necessarily work out what the problems are when you’re still in the lab setting. I think when we take these things into the applied settings and we try to use them, then we realize, oh, actually there’s a bunch of problems that are very difficult, that frame different problems to go back and solve. So, you know, people actually are very uncomfortable with big black box models that tell them an answer. They want to find tools that give them information that helps them do things. And so often you don’t just want the model to give you the answer. You also want the model to give you a reason why it got there, or give you bits of evidence that you can then reason about and use. So different related problems become the ones that are actually the keys to building useful software at the end of the day. So yeah, I think it has to bounce back and forth, but it’s definitely very early days.

Ibrahim Haddad:

Yeah. I second, sorry, that opinion, especially on the fact of different views. You know, we work with organizations in several countries across the globe and we can ask the question on, you know, what do you consider as ethical AI and what kind of constraints should be put on different models and different others and so on. You can ask that question to a researcher and a policy maker in the government in China, in Japan, and, let’s say, in a couple of countries in Europe, in the UK, in the US, and you’re gonna get, you know, seven, eight different answers. And what we realize is, you know, of course there are different opinions and different directions in terms of policy making and actual applicability on a day-to-day basis. And what’s really hard is to get to that kind of global consensus, in a sense where we have a common understanding first of the issues at hand before we start kind of addressing them.

Ibrahim Haddad:

Because of course, the view of what’s considered ethical AI in China is gonna be very different from what’s considered ethical use in Canada. And that’s deep rooted, you know, in the culture of the countries and their policies and their sense of democracy, et cetera, et cetera. So I feel that there’s a very long way before we get there, you know, from the basic understanding of the issues that we need to deal with to a point where, as a collective effort, we’re able to produce the desired solutions for them.

Mark Surman:

And then let me just build on what Amy and Ibrahim said, which is, of course it’s early days. I don’t think anybody would say anything else. I mean, we can just imagine what late days might look like. And I think that’s the exact reason that we need to think about things in the way that Amy suggested, right? The back and forth, because we’re shaping what those late days are gonna look like. Because it’s a back and forth, nothing kinda platonic just comes outta the lab and they say, oh, now ready for prime time. I mean, we’re talking about massive adaptive systems. The idea that they would ever be done is, you know, kind of counter to what they are, right? And so I think the back and forth that Amy talked about, and then the kinda playing in different contexts that Ibrahim talked about, is key.

Mark Surman:

And then it sounds like we’ve got consensus that part of how you play back and forth is with people from different perspectives and different disciplines. It really is critical. If we think about all the software that the older of us on this call, you know, think about when I started using software, like programming things in BASIC on my own computer when I was, you know, 10 or 12 years old, you just had people who wrote in code. You know, if you wanted to learn how something worked, there were no instructions in the software and you had to read a book and all those things. You think about the evolution in user experience design over the last 30 years, 40 years: you don’t pick up a piece of software or an app or whatever and need a book.

Mark Surman:

And that was a process of going from really engineering-driven ideas about what computing was to stuff that had multiple disciplines involved, including designers, anthropologists, people who think about how we interact with, you know, all of the human-centered design. We’re at the beginning of another, and I think bigger, piece like that, where we need to bring a lot of different people with different perspectives into the teams we build and use that to do the kinda iteration and evolution of what these systems are. And I think the same is true in how this stuff bounces up against regulation. I mean, we keep trying to find a corollary; the best thing I can think about is urban planning, but we don’t know how to regulate systems that live like this at this scale. And so leaving that in a lab, as opposed to actually trying to figure out how to build that muscle into society, would be insane. Like, we have to learn how to make social and democratic decisions about how we want this to go. And we don’t have the tools, and the only way to get the tools is to try it out and build that muscle as societies.

Chris Albon:

Yeah, no, I mean, I definitely think it’s early, and I think one of the most interesting things about the field of ML and AI is that it is growing almost faster over time, right? And so, like, we keep on talking about there being a peak, and people five or six years ago were like, oh, we’re at the peak of the AI hype and that kind of stuff. And it seems like every single month there’s some mind-blowing thing that gets released that no one had ever thought about. And I think that part of it means that it’s accelerating so much faster than the ability of societies, but particularly politics, to really hold onto it. Like, I definitely feel like some of the conversations around politics and legislation of ML are like six years behind, right?

Chris Albon:

They’re like, they’re like debating an issue that is like now completely like outta date. And we’ve gone even farther in the things that like what they’re, what they’re discussing are just like an old version of what’s happening. And it’s a really, it’s, it’s, it’s a really hard point. And one of the things that you definitely see when you work in sort of politics and tech is that it takes a while for politics and legislation to catch up to, to where that is. And you can see that with social media like in the early days of Twitter, it was like if you wanted to talk about a government in a negative way, in a repressive regime, Twitter was actually a place to go because the government didn’t know to look there. And now of course, it’s definitely not the place to talk about stuff in the repressive regime cuz they could search it using all these tools that were developed to basically find you if you’re saying something bad about a government on Twitter and like that’s the government catching up to a new technology.

Chris Albon:

But again, with AI, we’re seeing this super rapid pace of development that is, if anything, going faster, in a way that is pretty rare, even in tech. Normally there’s this point of maturation where, you know, the technology sort of becomes more steady and then we look for applications for it. And so then the time after it matures, it’s like, okay, how do we apply this to biology? How do we apply this to law? How do we apply this to shipping and businesses and society and democracy and that kind of stuff? And you know, it is definitely like every six months we’re at a new area. I mean, if you’re in the field of AI, I could name a bunch of stuff like recurrent neural networks or convolutional neural networks, which were cutting edge five years ago and now are totally not cutting edge at all, and no one uses them in production. Like, it’s that quick of a development that we’re going at, that is a really hard thing for societies and particularly governments to catch up on. So, early and moving fast.

Stefano Maffulli:

I think you all touched on this, but my interpretation, my reading is that this fast progress and this lack of trust come with a certain amount of fear that I think is one of the main drivers of the recent regulation efforts, at least the ones that I have looked at more carefully: the EU AI Act and the American AI Bill of Rights, which I glanced over this past week. It looks like, you know, they’re really afraid of these algorithms making decisions. And the AI Act, you know, with the fear of mind control or subliminal messages that the AI Act mentions, looks to me like they’re afraid of algorithms making decisions and changing voters’ perspectives, for example. I think they all have that.

Stefano Maffulli:

And it goes back to maturity, like, how do we get to trust these models? Like, what do we need to do? For software, when the internet started, you know, we were saying, well, make it open source, have many eyes look at the code so that you can trust the browser, you can trust the web server, you can trust the database to serve you the data that you actually wanted. And, you know, it goes with encryption, like trusting the algorithm because it’s been tested and battle tested by mathematicians. What do you think that research in academia needs to come up with in order to get to that level of, like, oh, you know, this is the solution, this is what you should be looking at? Chris?

Chris Albon:

Yeah, I think that’s a really interesting point and definitely, you know, the analogy that I keep going to with this kind of stuff is, social media. And one of the things that Twitter’s very famous for is they’re very open with researchers. And so it has become like a place to actually do a lot of research on, on social media and to use Twitter as an example. And there’s lots of papers around that. And it’s interesting to contrast that to something like TikTok where there is a very, very popular application with, I mean, I don’t think it’s one algorithm, I think it’s variety of algorithms, but like with AI embedded in it that people don’t see and can’t test and can’t look at. And it’s very difficult for researchers to get a handle on that. And I, I definitely feel like that is the place that I would love to see more focused is like, how do you do research on a, on a black box kind of system?

Chris Albon:

Not just the models like black box but also like, they’re not inviting you in, they’re not giving you a great API access. Like how do you do that sort of adversarial research into that kind of stuff I think is very interesting because one of the things that is useful, not just with TikTok but everyone else is, is the idea that, you know, this is an international like development and stuff. And so a lot of the discussions we have are sort of US based or Europe based, and that’s based on certain notions of democracy and that kind of stuff. And that doesn’t apply. And that’s something we see at the Wikimedia Foundation where we have 330 communities around the world notions of, of what fairness is, of what, of what everything is sort of very different across all those different communities. And there needs to be a way for researchers to operate in, in environments where say, you know, there’s a broad use of AI by the government, but yet they’re not providing, you know, researchers with special accounts, with high API limits to take, you know, to do research on it.

Chris Albon:

And I do think a lot of the research tends to be focused on sort of the lowest hanging fruit, which tends to be a lot of stuff on Twitter. Whereas there’s lots of, you know, really even more popular platforms, things like TikTok, which are just not open to research that well and yet are clearly using the algorithm in ways that we don’t understand, like really genuinely don’t understand, because maybe they’re pointing it to, you know, I mean you could go seriously series left to right center how they wanna use it. Maybe it’s engagement, maybe it’s political engagement, maybe it’s, you know, forming discontent, who knows, right?

Mark Surman:

Let me build on that. And I wanna also just, I’ll loop back to this at the end, this is why I said yes, you know, in terms of this call and why I wanna keep working with OSI, because we did answer this in another era with licenses, and it wasn’t clear in the beginning. Why OSI formed was to adjudicate, right? Because people will go and say all kinds of whatever about this is open and this is free. And so I think we’re kinda at that spot. I mean, that took, I dunno, five, six years of wrestling at OSI’s founding. So, and I would say, partly, who knows how we get to trustworthy. But one piece is what Chris says, transparency, and doing the pragmatic stuff of figuring out what transparency means and building incentives for it at this point.

Mark Surman:

I mean, one of the things we do is try to I don’t know, reverse surveil a bunch of the surveillance economy by crowdsourcing people to, you know, donate their browser data where we can actually look at some of the boxes and show what you might see if they were transparent. Because it’s, it’s amazing that Twitter has opened up a bunch of that data, but we, we just released a report on our second crowdsource research study on the, the YouTube recommender where 20,000 people basically showed how they tested user controls in, in YouTube and showed how, how they don’t work. So I think one of the things, not just yes, transparency, but also like let’s start to figure out what it can look like in practice. And I think the same is true in, you know, what does fairness or at least looking at fairness in design look like in practice?

Mark Surman:

And we have decades to go in trying to figure that out. What do we mean by that? And then good data governance in a particular community. Those are three vectors to explore. And then going back to the beginning, I think look at what kind of licenses or other tools might let us explore those topics faster than law. And that, again, is the big innovation of open source, right? We hacked copyright law to get to a particular end. And I think we need to look at that again. It’s interesting to see some of the work that Hugging Face and others are doing with this RAIL, Responsible AI Licenses, piece. I mean, it’s very vague at this point, but I do think we should take up the spirit that open source had 25 years ago and look at how it would go faster than the law in prototyping some of these questions. Yeah.

Amy Heineike:

Amy, I see your mic up. Yep. So one thing I wanted to kind of point out is that for an AI application, there are kind of layers of the onion. And so I think actually, Mark, you were great because you were pointing at different parts of those layers, but maybe it’s interesting for us to call that out a little bit. So what you often have is this kind of base model, which is maybe one of these large trained models that hoovers up as much data as it can to create kind of an underlying model of the world. So it could be one of these kinds of image generation models or one of these big language models. So you’ve got kind of these base models that are trained on a huge amount of data, and these things can then get built upon. So you can then fine-tune those to do different things, and then you can build applications around that, that use those trained models for different aims.

Amy Heineike:

And so I think what’s interesting is there’s part of the game which is like, how is that end thing being used for people? So I think it’s very scary to think of, you know, models where you don’t know what’s going on suddenly being given the power to, you know, decide who gets hired or doesn’t. And, you know, is that biased in some deep way? And like, how would we even know that if we don’t have access to that final application? But what’s interesting is those underlying models that are going out there, these kinds of big base models that could be used for many different applications. They make it easy for other people to go and build a lot of applications, but they themselves are kind of this cooked-up bag of whatever data went into them. And so I think, you know, there have been some fascinating conversations, as you mentioned, I think with kind of the RAIL license or whatever, about those underlying models, where there’s a huge range of things that could happen.

Amy Heineike:

And I think we don’t necessarily know what’s baked into that data. So, you know, if you poke it in certain ways, it can do really kind of scary and weird things and show us kind of the worst of human nature that got kind of sucked into that training data. And if you poke it in other ways, you see something super delightful, some of the most creative and lovely things that can kind of pop out of it. And so I think we have to kind of wrestle with, you know, how do we feel about this potential being unleashed, where we’ve built these models that can do lots and lots of things. How do we reason about that kind of capability? And then secondly, when we build applications on top of this where we don’t know what it’s doing, how do we check that it’s doing what we even want it to do? So we might have good intentions and just not have realized that it was deeply flawed, or we might have kind of bad intentions as well. But they’re both AI, but they’re kind of different kinds.

Stefano Maffulli:

Oh, that’s, that’s a very fair point. Ibrahim, Oh, you’re muted. Sorry.

Ibrahim Haddad:

Thank you. So I would like to go back to something Chris mentioned about the pace of development and pace of innovation. We actually track the top open source projects in the domain of AI and data. So there are about 3,330 of them that we consider key and critical in the ecosystem, and you’ll be able to explore that via the link I shared on the chat. And what’s really interesting is that the sum of all these projects in terms of lines of code is about 500 million lines of code, which is being incremented every week by one million new lines of code. So that’s like one million new lines of code every single week, week in and week out, being added to the code base of these critical top projects, representing tens of thousands of active developers contributing to these projects, coming from thousands of organizations.

Ibrahim Haddad:

So this is really a massive challenge in terms of keeping up. And, you know, it is a great thing that we have access to this kind of massive resource of external R&D as a collective effort, but also, from a different perspective, keeping up with that pace of innovation and maintaining a good handle on what’s going on in terms of investigations and research in this space of ethical AI is actually extremely challenging. And to the point of, you know, Mark and Amy with respect to what can we do now and kind of the value of open source, I think really there’s massive value from open source, and that lends itself to the domain of AI in general and data as well. And that can be summarized in one word, as everybody mentioned, you know: transparency.

Ibrahim Haddad:

And I think, you know, from our perspective, we try to look at four different challenges in that respect. The first one is, you know, ensuring fairness. So we need to be able to have, of course, open source tools and methods, you know, libraries, whatever the case is, that would allow us to detect and indicate any kind of bias, whether it’s in the data sets or the models. So this is one aspect, it’s kind of the fairness aspect. The second aspect is what’s referred to as robustness, which is basically methods or libraries and tools that allow us to detect if there has been any tampering with the data sets and the models. Basically, to try to identify if there have been any adversarial attacks on them. The third aspect, which I think one of the panelists mentioned in different language, is explainability, which is that we need to be able to understand a model and have it be self-explanatory.

Ibrahim Haddad:

Basically, we need methods and tools or libraries that will allow us to understand and interpret these different models and the outcomes and how the decision tree and so on. And the fourth aspect is lineage. And that applies to both data and models as well. Basically, understanding the origin of the methods, the origin of the models and the data sets, any changes that were done to them and by whom, and the ability to reproduce the same results using the same data sets. So these are kind of four different challenges, and I think this would be kind of a great path to address the general issue of ethical AI, by addressing kind of the smaller subsets of challenges in relation to fairness, robustness, explainability, and lineage through the open source methodology of kind of collaborative work, openness and transparency.

Stefano Maffulli:

There are definitely lots of questions in there, Ibrahim. Like, we have also had that discussion, somewhat related, talking about, you know, the issue of fairness. How do you measure it? And you mentioned a lot of technical tools that are being deployed and available to judge the fairness and the strength of a model. The objection that a lot of the users down in the field raise, like we had Jennifer Lee on Tuesday in the panel, is that a lot of the technical tools still measure things with the bias that we as a society have built in. So there is a lot that needs to be done, a lot of conversations, I think, in this space. And that’s why I’m interested in this thing mentioned multiple times, the ethical approach.

Stefano Maffulli:

And I’ve heard it many, many times, and it looks like it’s becoming the most discussed feature of this community, of the community of the researchers. They seem to be very, very concerned about releasing their papers, releasing their datasets, their models, their trained models and all the tools with care, being extra careful about how the systems are deployed or used downstream. Why do you think this is happening? Is this something that you’ve seen in other fields? Like, I don’t remember this, Mark, I don’t remember this in the early days of the internet. Well, there were some concerns, but they were also being like, eh, we’ll fix it later. Or we’ll fix it with regulation, laws, or someone else will do it. I don’t remember seeing the researchers being so mindful about this. So what’s going on here?

Mark Surman:

Well, I think that’s exactly it. It was, we’ll fix it later. And then, you know, we get Snowden and we get Cambridge Analytica and we get Frances Haugen and we get, you know, whatever your politics are, the mess of polarization we have, and we get the denial of science. And I don’t think it’s about AI. I mean, it’s about a growing consciousness that the design decisions we build into systems, as in all systems, automobiles, cities, whatever, but, you know, we’ve really shaped a lot with digital systems, the design decisions matter and the guardrails matter and the intention. Of course impact matters more than intention, but thinking through potential impact at the, you know, design stage matters. We, you know, the Mozilla Manifesto, which is, you know, what guides us, and we really do use it kind of week in, week out to say, should we do this?

Mark Surman:

Should we do that? I mean, it’s for us internally a very helpful tool. It was written in two chapters. The first chapter, and maybe there’ll be a third, was written in 2007, where I think most of us were like, and these are the words in there, as long as it’s open, interoperable, decentralized, blah, blah, blah, it’s gonna be good. And we still believe in those things. But the web, which was the thing those principles were written about, by being those things without any design about, you know, intention, really has done a lot of harm or been used to do a lot of harm. And partly because it’s open without any other side of the story. So we wrote the second chapter of the Mozilla Manifesto, unsurprisingly, in 2017, which talks about inclusion and human dignity and truth and the need, in the design of and taking the internet forward, to look at how you balance what I think of as the technical values that probably most of us stand for with a set of human values. As the internet and digital technology and AI weave into every aspect of our lives, we have to look at how we do both.

Mark Surman:

So I think it’s just because AI is the current era of computing that that question seems present in AI. I think it’s a question we have about the role of digital systems in the transformation of our society and making sure we still care about democracy and humanity. Ibrahim.

Ibrahim Haddad:

Yeah. So I don’t disagree with much of what Mark said, but I’d like to add kind of a different perspective to it on why this is becoming important now. And I think, you know, my personal view on this is there are a lot of researchers and, you know, technologists and practitioners out there that are realizing, similar to what Chris mentioned, the fact that there’s really an incredible amount of innovation that is happening today. There’s massive development, and basically what we thought was, you know, cutting edge three years ago today is, you know, yesterday’s dish, basically. There are really so many new things to it. And I would like to give you a couple of examples of what I mean by that. I think in 2019 I was in China attending an event at the Microsoft R&D Center there.

Ibrahim Haddad:

And they were showcasing basically an AI system that is able to, one, compose music and, two, play it. And their prediction is that in a few years, but less than five, so we’re almost there, you’ll be sitting in your car driving and listening to the music on the radio and not knowing if there was an actual person like yourself or myself who composed and played the music or if it is actually an AI system. So this is one example. Another example: last week I was in Dubai attending an event and speaking, and they had a robot that, after looking at or scanning a number of images, is able to draw similar images, and it was a really impeccable result. I mean, I was standing there at the booth looking at it and it’s just incredible. So I think there are a lot of people concerned about the advancement that is happening and what kind of potential danger that’s causing to society in general. So there’s really a lot of interest in exploring this and figuring out how to deploy this and keep the deployment in a positive aspect with respect to kind of human nature and society in general, versus, you know, some of these technologies being used in a deceiving way.

Stefano Maffulli:

Yeah, I fully agree with you. It’s really interesting and it’s definitely something that I’ve been observing without really judging. It’s really a central topic. But I want to go back to something that Amy said, and you were talking about the vast amount of data that is necessary to build these models and to train these large models that then become the, I think we lost... Oh no. Okay. So these large data sets, the large amount of data that needs to go in, and then the models that also can be retrained and specialized. So basically we can see that there is an advantage in having a shared commons where we can pull, we can mix and match things. How do we get there? What’s necessary? What’s missing? Are we on the right path to build the commons for building better AI? What’s your experience, what are your thoughts on that?

Amy Heineike:

So I think there are probably answers at these two different levels. So one of them is probably around the data required to build these underlying models that could be used for many things. And, you know, as I said, in general they want to suck up as much data as they possibly can <laugh> from the world. So the more images you can throw through one of these basic image models, the more resolution, the better it’ll understand the world. The more language you can throw through, the better it’ll kind of be able to generate long-form text. So there’s one question around that: what kind of data do we feel comfortable putting through the models? I think the second question is about actually training things for specific purposes. So I might talk about that one, even though maybe the first one’s the more interesting one, because maybe I have a bit more unique insight on the other end.

Amy Heineike:

So one of the things that we found kind of fascinating was fine-tuning models for specific use cases. So this is working with maybe banks or different kinds of businesses. So some of the AI use cases, in business anyway, are about kind of thinking about human workflows and trying to then build tools that can maybe enable people doing those workflows. So maybe they’re meant to read a lot of documents, find things that are relevant to their business, and then do something about it. And what it turns out when you go and work with companies is often they imagine like, oh, it’d be really cool to bring in an AI for this. And then when they start the process, you actually realize that their business process isn’t really documented, and you’ll have individuals who do that job go and train data for you, and you’ll find out that they agree with each other less often.

Amy Heineike:

So they agree with each other less often than they disagree. So they’re actually actively and systematically doing different things than their neighbors. They’ve interpreted what their job is in different ways. And so when you come to application, it’s very fascinating, because we’re often getting into really, like, what do people do? How do they make decisions? How do they want to make decisions? And I think, you know, some of the kind of promise of getting AI to do useful things will actually be about us learning what we do every day in our jobs and talking about that in a systematic way that maybe we didn’t have to do before. And I think that’s kind of fascinating. I think that there are also fascinating questions then about kind of IP, in that if your business operates in a certain way and then you can kind of represent that in terms of training data and examples that get embedded into a model, that fine-tuned model is probably then very, very specific to your business and would become precious to it, if you could train that over time. But I’d love to hear other people’s answers on the question of these big base models and what data we feel comfortable throwing into them. Because certainly the more, the merrier, it seems.

Stefano Maffulli:

Yeah.

Stefano Maffulli:

Because it’s also very hard to build these large data sets. And, you know, I thought that Wikipedia was big, but then I realized it’s really tiny for natural language processing stuff. So, I mean, we’re starting to see some very, very strong pushback also from various, let’s call them copyright holders of various sorts, from the record industry association to the motion picture association, the editors, books and publishers, even some, you know, copyleft supporters also pushing back and saying, look, this is my code, these are my conditions to release this data. So, yeah. What do you think? What should we be doing to create these commons, to widen the availability of data in order to get better AI systems?

Chris Albon:

Yeah, I think there are a few things. I definitely think one of the biggest things that would be interesting to see is more clarity from the legal and the licensing perspective around the data that’s being used. So, as Amy pointed out, a lot of these models are trained on really big data sets. The base models, foundation models, however you wanna call them, are built on really big data sets. In order to acquire those data sets, the licensing is very varied between the different parts. So maybe they take Wikipedia, right? And we have one license for the content, and then they grab everything from GitHub, which has a different license, and then they go to DeviantArt, which is like an art sort of showcase repository, which has a different license for every single author, and take that in.

Chris Albon:

And the end database is actually this hodgepodge of a bunch of different licenses that have been kind of thrown together. And then you train a model on that, and then you build off of that, and then you build an application off of that, and it becomes very difficult to understand the provenance of the sort of legal part of it. And I think we’re seeing that, where we’re seeing at least one case around LinkedIn where it’s like, hey, scraping data and training a model from it is legal, but then we’re also seeing lawsuits around things like GitHub Copilot, which are saying like, hey, you’re taking all my code and then serving it to people basically as a product, this is illegal. We’re seeing stuff with DALL-E and other generative art things literally sometimes displaying the watermark, because they had scraped all these stock photography sites that have the watermark.

Chris Albon:

And so sometimes the algorithm fails to remove the watermark in their generative part. And the more clarity we could get around this, the better. Like, I would love to see actually a really permissive data license that was specifically around ML/AI, that was like, hey, you’re allowed to, you know, make a model from this, but you gotta give us credit for it. Or like, you’re allowed to use something around this that’s really focused on that. And it means that when you were going out and scraping this huge amount of data, you could actually go and check to see if that licensing exists or doesn’t exist, or that kind of stuff. Because, you know, for us, the big issue is that we try to make things at the foundation.

Chris Albon:

We want you to build stuff off of the things that we make. So we try to be very, very permissive, but then also researchers come to us and say, hey, I’ve got this amazing way to detect vandalism on it. And then it’s like, okay, cool. We need to know the licensing behind everything just to make sure that, you know, we’re in a good spot and that we could then release it to the community to build stuff off of. Because that’s what we’re trying to do. And so we need to be clear that the model that you’re submitting to us has that ability to do that. And a lot of times it is not clear because, you know, especially I think in an academic setting, that licensing information is, it’s not in a business context. Everything’s non-profit, everything’s for education. There’s a lot more leeway that’s allowed in those kinds of settings.

Chris Albon:

And then when you try to apply that into, you know, a more business case or nonprofit space or something like that, you run into a lot more concerns. So definitely, I would just love a lot more clarity around, like, what is possible? Where are the lines that I could bump up against, to make sure that if I was gonna scrape every single image on the internet that I could possibly find, what images do I throw out, right? What images do I keep? Where do I go? What are the good sources for it? And we’re getting there, but we’re not totally there.

Stefano Maffulli:

I, I think both Ibrahim and Mark, you have experience in building data sets, or at least Ibrahim don’t you have also licenses specifically written for data?

Ibrahim Haddad:

Yes. So we have what you call the CDLA, which is a license that was specifically created for the purpose of licensing datasets. And as you know, the open source licenses are meant for source code, which is of a different nature than data. So the Linux Foundation members and community went through the exercise multiple times, because today we are at version two of the CDLA. So we have the CDLA, an abbreviation for Community Data License Agreement. There was a version one a couple of years ago and, maybe last year, there was an update to it. And it actually exists in two versions: CDLA license type one, which is kind of permissive, and CDLA license type two, sharing.

Ibrahim Haddad:

And these two licenses, it’s basically the same license, you know, with two different licensing agreements. They’re targeted specifically for data sharing. We don’t host data yet at the Linux Foundation. We are actually working on launching a new initiative with that purpose that will most likely be announced towards the end of the year or early next year. And I think Mark, they have way more experience in terms of data, especially with the voice projects they’ve done, you know, a few years back. And maybe he can speak

Mark Surman:

Yeah, the voice project is still going. I mean, the project that he’s talking about is something called Common Voice, which is, you know, a training data set for speech-to-text tech that we do with Nvidia and a huge community of people around the world. And, you know, there’s a bunch of things that are interesting in that. One is, you know, you asked the question of how do we kinda think about building the commons? And in the cases where there really isn’t a competitive data set, like let’s say in African languages, where we do a lot of work with Common Voice, semi-traditional open source, where you have the community out there contributing voice snippets, contributing text, validating the texts against the voice snippets, works, because you don’t have a competitive huge organic data set or scraped data set.

Mark Surman:

And so one way to build the commons where nothing exists, you know, we actually know from open source. I think [incomprehensible] the more common case, where we don’t know, as Chris was talking about, what the legal ground is that we’re standing on. And it’s gonna be very different if the rights holders were ever successful in somehow saying, yeah, actually these foundation models are full of all kinds of stuff that we own, and you gotta find some way to throw ’em in the garbage or disgorge them. Like, I don’t think we’ll get to that spot legally, but if you got that, you’d be in one situation where it’s like, oh, now we have to go back and say how do we build the commons legally, sort of the way that Chris is talking about. But I think the challenge is the horse has left the barn on that. I just can’t see us rolling that back.

Mark Surman:

And then it’s a question for the commons. Because a true commons of the kind Chris is talking about, where we have thoughts around the licensing, or the licenses that Ibrahim’s talking about, are just never gonna compete against the stuff that is just a [incomprehensible]. And so I really just don’t know how to think about it. And then I guess just a coda that comes from the Voice project: it is also clear, if you do have a commons, that in some of the ways that Ibrahim talks about and in others, traditional open source software licensing just is not adequate. And I think we’re at the earliest stages of that. And one of the things that just is an interesting example of it is also questions of community ownership of data sets. Meaning often data sets are only meaningful in the aggregate, right? And that’s something that we don’t really have in the same way with code. And one of the things in Common Voice the map there’s a set of communities who took their data sets out of Common Voice and said, we don’t want it under CC Zero, which is our main licensing. We want community control and we wanna say who could use this. And so I think there’s gonna be, in places where commons do exist, lots of really juicy licensing questions for us to tackle in the next two decades around data.

Stefano Maffulli:

And I think that the main question is what are the intentions here of the communities that are building these data sets? What are they trying to achieve? Because when the free software and open source software movements started, they were started with the idea of explicitly creating the commons, or actually keeping the software in the commons. Because what was happening at the time was that the software was being privatized all of a sudden, from being shared across the research communities. It was going behind private walls and not distributed in source code form, so interpretable to humans. So we’re basically seeing something like this now, where either because of the scary factor, like, oh, this is too dangerous, you cannot touch it, you can only use it through an API, like the company OpenAI has been doing.

Stefano Maffulli:

Or, you know, it’s too scary, we need to put in laws that prevent some usage, or we recognize this is too scary as a community and we don’t wanna share it. But it’s fascinating to me. It was fascinating to read about that data set being pulled away. That was another question that I received from Hippo AI, which is a foundation that is assembling medical data sets from patients who donate them to the commons, you know, to build a common dataset about, you know, the medical data. And what they’re trying to do is to build some sort of copyleft mechanism, where they would love to have a way to say, if you build something cool from this dataset, you need to release the data, you know, the model and all the information, all the tooling and all this stuff to the public, the same way that we gave you the data, which is somewhat what AlphaFold has done with the protein database that it discovered.

Stefano Maffulli:

So what do you think of that? Like, what are your thoughts on these ideas of building frameworks and policies and social norms first, and then maybe legal contracts that give incentive to the sharing? Chris?

Chris Albon:

Yeah, I both love the idea, and I think, to reiterate a point that Mark made, there has to be a valuable incentive to doing that versus the other option. Like, the reason that folks, you know, say you’re a startup, you vacuum up every piece of data possible about a thing, you build a product on it, you get rich. And then when you’re sitting there and your company’s worth like $4 billion, someone’s like, hey, oh, that licensing information, that’s pretty, you know. But you’ve already gotten the success by the time those considerations come up and people start to have a problem with it. That’s one path. And I don’t think that’s the best path, but that’s definitely a path that is obviously incentivized for an individual to do, and that’s why people do it.

Chris Albon:

I think it’s harder to take, you know, given that that path’s an option, it’s harder to take that and be like, instead what you should do is work out a framework where, you know, every single image that you’ve ever gathered on these massive data sets is licensed in a way, and you’ve contacted every user. And, you know, you could see that one path is very difficult for not a huge amount of individual gain, like societal gain, sure, but individual gain. And then the other one, which is basically the de facto right now, is the easiest. It’s the ideal path for you as an individual rational person, as opposed to at a societal level. And shifting that, I don’t have a great idea of how to do that. So if anybody else has a great idea, I’d love to hear it. But you could definitely see why people are acting in their self-interest when they hoover up every piece of data ever and then make a model from it.

Ibrahim Haddad:

This is actually similar to the standards versus source code question, you know: should we write the standard and implement after, or should we do the implementation and have the standards reflect the implementation? Very interesting.

Mark Surman:

But of course it never works that way in either case, right? I mean, standards don’t work without reference implementations, and they don’t stay the same, and, you know, vice versa. So I guess what we already know probably could be true here if the incentives were there. I think that’s the challenge that, you know, Chris reiterated. I think there’s no incentive to do anything other than, I guess I will say, the bad way, or certainly the Hoover, very selfish way. It may be different where there are other kinds of roadblocks or risks or protections, and I guess it depends what seat you’re sitting in, like, you know, human rights law or health protection of data or labor law. And we may see some innovation in those places, where vacuuming up and using a lot of PII in those settings that are already highly protected will force us to innovate in some of the ways you referenced. I can’t remember if it was, you know, because PII around patient data is already so protected, people are very reticent to just go in, and then it is a zone where innovation around the commons and different licensing regimes I think is more likely to emerge, because the incentives of the guardrails are there.

Mark Surman:

So maybe if you care about the bigger topic in the abstract, looking, I think, especially at health and labor data, we may see ways that people innovate. And one of the places that we’ve seen it, we have this thing called the Data Futures Lab, is in a different way than we’re talking about here. But there is an incentive for some set of actors to figure out collaborative data governance in the gig economy world, because they actually wanna be able to own and use their data as drivers or deliverers or whatever, together. And so there’s some innovation there, and them building ways to bring their data together, because they actually wanna be able to see what the platforms are seeing about them, to be able to have some leverage in negotiation. And, and

Stefano Maffulli:

Amy, I saw you.

Amy Heineike:

Yeah, I think, I mean, one kind of small point to add on to that, and that makes sense, is that the scale of the data makes a huge difference to how good the models are. So I imagine one of the other challenges here is, if you end up with small, nicely licensed data, the models you’ll be able to build on that will not be able to achieve very much. And so, you know, a part of it is we can say people wanna go get rich and build a company on top of all the stuff they’ve harvested, but you can also look at it as maybe some of the things that people will build on top of large amounts of data will actually have a bunch of social good or benefits, or they’ll enable innovations that we want. And so I think you’re right, it’s all about the incentives, and it’s tricky because we would need ways of getting that much data, incredibly large amounts of data, if we want this whole class of models to be able to exist, to be used.

Stefano Maffulli:

Yeah, I can definitely see a conflict in there, where there is so much material out there. The internet has created a wealth of data, a wealth of information. But at the same time as it decreased the importance of copyright, it also increased the amount of checks and controls and the power of large corporations. So I think it’s gonna be tricky, but it’s probably where the first of our concerns should be focused: on this right to data mining that the EU has already started to define in law and regulation. But I think they didn’t go as far as I wish they had, because the right to data mining is only legal and on by default for non-profit or research purposes, and not for commercial purposes. So there is a little bit of a lack of clarity, and there’s definitely gonna be challenges in courts and opinions very quickly, I think. Chris, what are your thoughts?

Chris Albon:

Yeah, I really appreciated Amy’s point. I think it’s so good. I think someone said that a lot of times with this tech, the horse is out of the barn, you know, like we’ve already gotten the benefit from some of these amazing things. And Amy’s point is just that, and this is the reality that you live in, even if there are difficult copyright situations in creating these, if the value people are already getting from it is so great, it’s so hard from a political, legislative standpoint to pull it back. And I remember, you know, back in the day there were discussions about like, oh, social media should be illegal, right?

Chris Albon:

And it was like, everyone was talking on social media. Clearly that’s not gonna happen. And, you know, I know Meta recently released what they call the NLLB model, which is No Language Left Behind, which is direct translation between 200 less popular languages. So like Sesotho or Urdu or something like that. And you can translate between those two languages. And that obviously has a huge benefit, where you could translate a book in Sesotho into Urdu without any kind of bridge through English and that kind of stuff. Like, that’s awesome. And imagine if, in order to accomplish that, you were basing it off of copyrighted books or something like that, right? Imagine that was the case. Would there ever be any kind of strong resistance to say, hey, you can’t do that, and all that benefit that people are getting from it, we have to claw that back because of some kind of copyright thing? Given that whether or not you can use copyrighted material is a fight happening right now, and yet the benefit people are absolutely getting right now, I really feel like we’ve sort of passed the discussion a little bit where –

Stefano Maffulli:

Right, the scale is tipping already.

Chris Albon:

Yeah. I mean, if you’re gonna get a self-driving car, imagine a self-driving car that drove perfectly, and yet it was trained on, you know, copyrighted data. I just don’t think people would claw that back. I just don’t think you’d have the ability to do that as a society.

Mark Surman:

You know, I think that we’re in a tricky zone, of course. With the smaller data sets, you’re not gonna get to these innovations, and then you’re not gonna get to the kind of benefit that Amy and Chris are talking about. And the same is true of, you know, you’re not gonna get the acceleration of greed and the abuse of things as well. And so I guess the question maybe for us is: does the commons have anything to do with mitigating some of the bad uses and encouraging some of the more beneficial uses? I don’t know that it does. I mean, again, I think that’s to some degree what the RAIL experiment, you know, in its really big thing, is trying to be: how do we actually think about building responsibility into licensing, as opposed to just openness or use. So I think that’s a big open question for us. The horse is outta the barn, but the horse is outta the barn for both beneficial uses and evil uses, or these greedy uses. And I think that greedy uses are more likely to keep the horse outta the barn than the beneficial ones, even if we get to have both. So, you know, maybe you end up in this silly but maybe real conversation that it’s really just silly human nature.

Mark Surman:

It’s something I think we have to grapple with because certainly the history of the open source industry and the web in terms of just saying it’s all open and we don’t ask the questions about this stuff being used poorly, you know, hasn’t gone well so far. And I don’t really know what we do about that yet, but I do think we have to pay attention to that.

Stefano Maffulli:

There are definitely good reasons to pay attention, especially given the current situation where we are. I think that we need to keep in mind the fact that the more barriers we put up, the more permission-based access to research and to knowledge becomes full of obstacles, and that removes benefits, clearly. So finding that balance, I think, is where the challenge is. And I want to go back to the role of academia in finding this balance. In one of the podcasts, the researcher that I interviewed, Connor Leahy from EleutherAI, was talking about how in the end AI is, not basic math, but it’s math, and linear algebra is basically much more accessible than we think. It’s not really PhD level. At least that’s what he was claiming. And his call for better AI systems was also to give incentives for more students to start learning math and getting into these fields. What do you think? Is this part of the recipe to get better AI systems, trustworthy and transparent, and to help legislators do better things?

Amy Heineike:

I’m gonna disagree that it’s just maths. And the reason I’m gonna disagree with that is because it’s data plus math. I’m using the s cuz I’ve been living in England for a while. Sorry, feels weird to say math. Anyway, it’s data plus maths. And so the issue is, you know, what is one of these large data sets? If you go grab every image you can find on the web, those are all images that people have taken. It’s like our collective experience and hive mind of the world that we exhibit, and the things we care about, and the selection of things we thought worth taking photos of. If you go grab all of the text on the internet, you know, what kinds of things do people write on Reddit? What kinds of things do people write on Wikimedia?

Amy Heineike:

They’re different. You know, there are different kinds of things in different places on the internet, and they are reflections of us, right? The reflection of humanity’s brilliance and weakness and everything. And so when you are reasoning about an AI model, you are reasoning about the mechanics of it. Like, oh, this thing trains this thing. But you are also reasoning about the kinds of patterns in the data that are driving the patterns in the outputs. And so I think, to the earlier conversation about disciplines, there’s a way of thinking about AI where you go into your mathematics head and you learn about the algebra and you write down a bunch of equations and you understand how you can manipulate them. And there’s another way of thinking about it where you’re almost like a zookeeper or something, and you’re playing with this thing and you’re prodding it in different ways and you’re like, oh, what happens if I do this kind of thing to it? You know, it is more like a biological system, in a funny way.

Amy Heineike:

It’s like a social system. It’s a microcosm of what we put into it. And so I think, you know, we’re trying to teach children social media literacy: what does it mean to be online, and how do you be careful about who you’re talking to online, and how do you think about sources, and how do you evaluate the kinds of news you get from different places? I think we need to have some way to let children, and not just children, cuz I think as adults most people don’t know how to reason about this, and I think this feeds into why this whole thing is so scary. You know, we wanna have some playgrounds where you can take some of these models and push them in different ways and see what happens. So it could be that one of the great things that happens with things like the Stable Diffusion releases is that more and more people get to play with some of these AI models for something like image generation, and they try out some prompts, and that becomes accessible. Then they might be able to start to reason about, hey, sometimes it spits out something that looks like something that really already exists, and sometimes it spits out complete chaos, and sometimes it doesn’t.

Amy Heineike:

And, you know, when do I get what? And then they might realize that when those models are used embedded in bigger systems, similar kinds of weird things happen depending on how people are using the models. And they might be able to ask different kinds of questions, and they might know how to use at the moment. And so yeah, there’s a load of learning to do there, and I think it’s a very interesting question about how we take people along that line, especially when we realize that basic probability is so hard for us to reason about. We’re generally really terrible at it, you know, and that’s just compounded to infinity and beyond with these models that just, you know, layer up in crazy ways. Chris and Mark.

Chris Albon:

Yeah, I mean, definitely to pile onto that, you can imagine a thought experiment where we, say, doubled the linear algebra and calculus education of students. Would that put ’em in a better place to understand how TikTok is manipulating their behavior? Like, no, probably not, right? And of course the math is a big part of it, and if you get into the field, you should definitely know math. But because the impact of ML and AI is so wide, there are so many different areas from which to tackle it. You can tackle it from a legal perspective or a political perspective or from the perspective of biology. Or say you’re an artist, and the only part of the AI elephant that you see is sort of the generative art part.

Chris Albon:

And, to Amy’s point, what gets spit out? How do I build something with that? How do you build generative art that is interesting and cool and takes advantage of the tool the most? Maybe that’s the thing that you spend a lifetime working on. And there are so many different parts of it, cuz it’s so big, that it’s very reductivist to put it in one area. Like, if we just knew more humanities, we’d have figured out AI, or if we just had twice as much linear algebra in our heads, we’d figure out AI. It doesn’t work like that. It’s so big, it’s so encompassing, there are so many different components to it, that it is all of society, all academic disciplines taking part, and every single academic discipline has an interesting say on it.

Chris Albon:

I mean, definitely some of the most interesting stuff I’ve seen has been Japanese literature being translated using ML and AI, you know, and that’s just a field that’s being worked on. I don’t think there’s one place. I think there’s lots and lots of places, and part of that is that there’s value from any field. You could be in any field and think about how AI is taking on, and how you can use it, and how you can understand it better. And it is a rich place for discovery.

Stefano Maffulli:

Indeed. I just wanna interject, before Mark, the context in which this quote appeared in the podcast: we were talking about analyzing the models themselves and understanding what the models do in terms of security and safety. So I just want to be –

Mark Surman:

We’re taking it a little far, but to think on how far we’ve taken it, you know, I will repeat what Amy and Chris have said. I mean, obviously it takes all kinds of perspectives. But I really wanna pick up what you said, Amy: these are systems, right? So then I think, what does it mean to understand them? If you take broadly what the quote was around, you know, questions around security, I think of it in two ways. One is how do we train ourselves to live within living systems safely, productively, respectfully. And that’s where I like the zoo as one metaphor of, where can we go and learn? And you know, if I wanna live in a city, I learn how to know, is this a dangerous place or not a dangerous place?

Mark Surman:

Who can I go to, what’s actually happening? There’s a lot of signals around me that require a lot of skills and knowledge and kinda building experience. And I think we need to start to think about it in that way, in which there are different skills. And then it goes back to something I skated by earlier. It’s also a question of what skills do we need, or what models do we want, in terms of inviting or designing for or regulating for, you know, safety and thriving and goodness in this. And that is where I think going back to big scaled systems is important, and cities are another thing that I think about as being like that. I mean, we struggle to regulate and express broad intent in our cities. But we try to do it anyway, right?

Mark Surman:

And we do it as individuals with our neighbors, and we do it at the city level with, you know, urban planning. And I think we need to start looking at tackling this all together, and also looking for some models: how do we learn to live in it and shape it, and also how do we collectively find some models to shape it, right? It’s not just the onions or the layers of the onion, or anything we talk about. It’s the connections of all the onions. And, you know, how do we play with that?

Stefano Maffulli:

And I love that you bring up cities, because that’s sort of my background, and I think Amy, you have something to do with cities too. Cities and social norms in general, social gatherings and social sciences, they teach us to live with imperfection, right? We have to live with the fact that you’re gonna have to fix a pothole somewhere, but, you know, you will never have everything functioning or everything perfectly aligned with a zero-one type of approach.

Mark Surman:

Yeah, I think if you wanna read a good book about how to think and evolve with AI, read Jane Jacobs.

Stefano Maffulli:

Okay. Yeah. I put that on my list. So, closing. I think we only have three minutes. What’s your message of hope for creating this commons and having AI that works, that serves the public, serves society in a positive way? Chris?

Chris Albon:

Yeah, I think, given the topic of this panel, the biggest thing that I would probably advocate for is that AI and ML is a big tent, right? There are, of course, the folks who are sitting down and making the models day to day. I would include myself in that. But then there’s other folks who, say, are building art from generative models, or folks who are working on legislation around it, or people who are working on, say, how it applies to their particular field of forestry or something like that. And because it’s so large, I wouldn’t think about it as, this is the thing from the computer science department that we kind of poke at and criticize. It’s instead a broad societal change. The technology started in the computer science department; now it’s expanded out to so many different areas that it’s not the realm of the computer science department to understand everything. It’s the realm of all the different areas of academic discipline to take a look at this and find ways they can contribute, and find a perspective that people haven’t thought of, and have that broader discussion. Because you’re never gonna understand what ML and AI is strictly from one discipline.

Stefano Maffulli:

Mark.

Mark Surman:

Yeah, I mean, I think it’s that same thing: we need to kinda look at it from all those angles. And, you know, as an individual, being holistic, well rounded, renaissance, I don’t know what the right word is, as an aspiration is just really gonna be a more and more important thing to bring into this field and, I think, from this field to society. So I think looking at that, and then, to tie it back to the academic piece, really looking at how we evolve, how we teach each other and teach ourselves in a way that isn’t stuck in disciplines. I mean, I think it’s where academia is its own worst enemy in this journey, being, you know, stuck in disciplines and incented around staying in silos. And so I think if we can respect each other and each other’s knowledge, but then kind of break down those patterns and try to be more holistic in who we are and how we teach, we can do better at this.

Stefano Maffulli:

Thanks. Ibrahim, do you wanna add comments to that?

Ibrahim Haddad:

Sure. So from my perspective, given that most of my work is with technical projects, I think a way for me to close on this discussion is to encourage everyone listening to this panel, if they’re working on AI and data, to consider open sourcing their work and driving a high level of transparency in developing solutions that address whatever sector and whatever problem they’re trying to solve.

Stefano Maffulli:

Thanks. And Amy?

Amy Heineike:

Yeah, I think, so to kind of round back, we’re in this incredible time of innovation and exploration. We don’t really quite understand what this thing is that we’re building. And so, some of these questions that you’re raising, I’m so glad that you’re running this series, getting people to try and unpack it, you know, figure out how we wrestle with it. And, you know, there are clearly some very urgent questions around how to reduce harms and difficulties. But I think it is also that the tooling that people are building especially is making it more and more accessible for people to come and play in this area and bring some of those different perspectives, to try to understand what it is that we have and what these tools can do for us. And, you know, what the next decades are gonna look like as these kind of unfold and we see the consequences.

Stefano Maffulli:

Wonderful. Well, thank you very much. It’s been a wonderful, wonderful conversation today. I would go on for another hour, but I wanna be mindful of your time and also the listeners’. So this closes our series of panels. We’ll definitely have another step, which is what we call Fathom III, which is a report where we’re gonna summarize what we learned with the podcasts and these four panel discussions. But I also have the feeling that despite calling the series Deep Dive, we haven’t gone deep enough, or there is a lot more to explore. And I’ve already started to think about what we’re gonna do next in 2023. So everybody stay tuned, we’ll talk more about AI, and we’ll hear more about the challenges and the opportunities that these new technologies are building for the world. Thanks everyone.

Mark Surman:

Thank you and thanks to OSI for doing this.

Stefano Maffulli:

Oh, thank you.
