Friday, April 17, 2026
Executive Summary
At the BTLJ Spring Symposium’s panel on software, copyright, and generative AI, scholars and practitioners traced how AI-generated code — likely uncopyrightable under current Copyright Office guidance — threatens to collapse the legal infrastructure underpinning open source software licenses, and how empirical evidence of large language model memorization is forcing courts to reconsider whether a model’s weights may themselves constitute a copy fixed in a tangible medium of expression.
Instructor(s)
Clark Asay, BYU Law
A. Feder Cooper, Yale University
Jule Sigall, former Microsoft
Pamela Samuelson, UC Berkeley Law
Keywords
AI-generated code copyrightability Copyright Office 2025 • open source software copyright foundation generative AI • LLM memorization fixed copy tangible medium of expression • Baker v. Selden functional works copyright exclusion • Computer Associates v. Altai abstraction filtration comparison test • Google LLC v. Oracle America fair use software copyright • free and open source software FOSS license copyright dependency • trade secrecy patents software AI era post-copyright • Does AI-generated code qualify for copyright protection under the 1976 Act? • Can a large language model’s weights constitute a fixed copy under copyright law? • human authorship requirement AI creative works Copyright Office guidance • 17 U.S.C. § 102(b) functional works software exclusion
Legal Analysis
Baker v. Selden, Section 102(b), and the Long Arc of Software Copyrightability Before Generative AI
The copyrightability of computer software has never been doctrinally settled in the way copyright orthodoxy might suggest. Samuelson opened the panel by grounding the current generative AI crisis in a deeper structural problem: the tension, dating to Baker v. Selden, 101 U.S. 99 (1879), between copyright’s aspiration to protect expression and its constitutional incapacity to monopolize functional methods. Under the Baker doctrine, Samuelson explained, source code could be treated as a literary work, but machine-executable code occupied a more ambiguous position, because “it’s essentially a virtual machine,” a process rather than a text. The Copyright Act of 1976’s Section 102(b), which excludes “procedure, process, system, method of operation” from protectable subject matter, was, in Samuelson’s reading of the legislative history, inserted precisely so “that the scope of copyright protection for computer programs would not be construed too broadly.” The National Commission on New Technological Uses of Copyrighted Works (CONTU) nonetheless recommended in 1978 that copyright extend to software, a recommendation Samuelson acknowledged she once criticized sharply (her tenure article, she noted, accused the commissioners of not knowing what they were talking about) but whose practical wisdom she has since conceded: “I was so wrong, and so it’s nice to be able to admit being wrong after many, many years, because copyright did a really good job, and it gave us a basis for international protection.” The 1980 amendments added a statutory definition of computer program and a revised 17 U.S.C. § 117, but Congress held no hearings on the scope of protection, leaving the doctrinal architecture to the courts. Early cases like Apple Computer, Inc. v. Franklin Computer Corp., 714 F.2d 1240 (3d Cir. 1983), rejected attacks on literal-copying liability but generated dicta suggesting that interfaces might be protectable, fueling the over-expansive Whelan Associates, Inc. v. Jaslow Dental Laboratory, Inc., 797 F.2d 1222 (3d Cir. 1986) regime before the Second Circuit’s Computer Associates International, Inc. v. Altai, Inc., 982 F.2d 693 (2d Cir. 1992) corrected course with the abstraction-filtration-comparison test, reinvigorating merger and scènes à faire as software-specific defenses and, Samuelson credited, drawing heavily on David Nimmer’s scholarly intervention. Google LLC v. Oracle America, Inc., 593 U.S. 1 (2021), added fair use as a potentially dominant doctrinal tool for interoperability, producing what Molly Van Houweling has called “doctrinal cocktails”: a layering of § 102(b), merger, scènes à faire, and fair use whose exact boundaries remain contested. The relevance of this history, Samuelson argued, is that software copyright was always thinner and more contested than its commercial salience suggested, and generative AI is now stress-testing a foundation that was already underbuilt.
Strategic Copyright Choices in the Software Industry and What Generative AI’s Copyrightless Code Means for Open Source
Sigall reframed the panel’s doctrinal history as a story of business strategy, arguing that copyright’s formal role in software monetization has been declining for decades and that the more interesting question is whether AI-generated code that falls outside the copyright system will matter to the industry in practice. Tracing what he called “copyright salience” across five eras, including the PC era of the 1980s, the World Wide Web, the cloud and open source era, and the mobile app ecosystem, Sigall showed that direct copyright-based business models gave way progressively to hardware bundling, ad-supported distribution, Software-as-a-Service subscriptions, and platform-mediated app stores. He observed that by the cloud era, Microsoft’s anti-piracy efforts had migrated into its marketing department rather than its legal department (“the goal was not to stop people from using pirated software, the goal was to get them to use Microsoft software”), a posture that other creative industries, Sigall noted, did not follow. Open source software does rely on copyright, Sigall acknowledged, but in an inverted way: the most prevalent copyright-based model in the cloud era was “one that’s all about making software as accessible as possible,” using copyright licenses to govern redistribution rather than restrict copying. Against this background, Sigall predicted that the Copyright Office’s current position, that prompts alone do not generate copyrightable expression, may simply render AI-generated code a zone outside the copyright system, as machine-executable code arguably was before 1980: “The answer may be yes, copyright is no longer part of the system anymore… but that’s okay.” Asay challenged the sanguine view, arguing that the FOSS movement’s collaborative governance architecture (contribution agreements, copyleft obligations, attribution requirements, and patent grants embedded in licenses) presupposes copyrightable code. 
Without copyright, “there’s nothing to license,” and corporate legal departments’ risk aversion would accelerate a drift toward trade secrecy and patents that had already begun as high-profile projects like MongoDB and Elasticsearch abandoned open licenses for more restrictive models. Asay cautioned that generative AI simultaneously undermines the copyright foundation and produces an “AI-generated chaos” of unvetted pull requests that is causing some open source projects to restrict or close contributions entirely. Samuelson pressed further, asking whether the Creative Commons problem had essentially reproduced ProCD, Inc. v. Zeidenberg, 86 F.3d 1447 (7th Cir. 1996): AI-generated code that is uncopyrightable, with an open source license slapped on it regardless. Asay agreed that the result “is a bad place to be,” though he suggested that “norms play a more important role in this space than the actual technicalities of copyright law.”
LLM Memorization, the Definition of a Fixed Copy, and Whether Model Weights Infringe Copyright
Cooper’s presentation offered the most technically granular analysis of the panel, advancing the argument, developed with co-author Mark Lemley, that memorization in large language models may require courts to revisit whether model weights constitute a “copy” fixed in a tangible medium of expression within the meaning of the 1976 Act. The statutory definition, Cooper noted, is functionally circular: copies are material objects in which a work is fixed, and fixation requires embodiment by or under the authority of the author, a formulation that, read literally, makes it impossible to infringe the reproduction right. Courts, Cooper observed, have simply ignored this incoherence. On the technical side, Cooper distinguished memorization from general pattern learning: memorization occurs when a model’s training causes its probability distribution to become “so sharply peaked” on specific training-data sequences that those sequences dominate outputs in a way that is analogous to high-fidelity compression. Crucially, memorization need not be deterministic (a model might reproduce a memorized work in only one of a thousand prompted generations), but Cooper cautioned that the distinction between deterministic and stochastic extraction is doctrinally significant. As a concrete illustration, Cooper described extracting “a near-pristine reproduction of Harry Potter and the Sorcerer’s Stone” from Meta’s Llama 3.1 70B model using no more than the first line of the text as a prompt and under a hundred lines of boilerplate code, with the extraction running deterministically on an A100 GPU. The same technique yielded nearly all of Ta-Nehisi Coates’s “The Case for Reparations” without modification, demonstrating that the phenomenon is not unique to the most heavily duplicated texts on the web. 
Cooper likened model weights neither to Microsoft Word (a clean slate that requires the user to supply all content) nor to a traditional database (which deterministically returns stored data in response to a query), but to something novel for which copyright doctrine has no direct precedent. The closest cases, Kelley v. Chicago Park District, 635 F.3d 290 (7th Cir. 2011), and Micro Star v. FormGen Inc., 154 F.3d 1107 (9th Cir. 1998), address outputs rather than model architecture, and Cooper suggested Kelley’s refusal to protect non-deterministically fixed works is probably wrong. Cooper and Lemley’s functional conclusion, that a model is a copy of works it can predictably and replicably extract but not of works whose extraction is genuinely probabilistic, has significant consequences: if widely adopted, it would treat copying a sufficiently memorized open-weight model as an act of direct infringement, a result Cooper acknowledged would “really screw open source models.” Cooper noted that fair use might still apply, but that models, as commercial objects, face a harder Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith, 598 U.S. 508 (2023) analysis than training datasets have faced, and that Authors Guild, Inc. v. Google, Inc., 804 F.3d 202 (2d Cir. 2015) does not necessarily provide a safe harbor either. Cooper closed by underscoring that the memorization phenomenon remains scientifically mysterious (work on it has continued past two self-imposed deadlines) and that policy conclusions drawn from incomplete technical understanding are premature: “If we can’t understand basic facts about these models, I’m also just not really sure how we can responsibly come up with policy decisions around them.”
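The difference between deterministic and stochastic extraction can be made concrete with a minimal Python sketch. This is an illustration only, not the code Cooper described: the toy "model" (`make_toy_model`), the stand-in memorized sentence, and the `peak` values are all hypothetical assumptions. The sketch shows that greedy decoding returns the same text on every run whenever the memorized token is the most likely one, while the chance that random sampling reproduces the whole text is the product of the per-token probabilities.

```python
import math
import random

# Hypothetical stand-in for a memorized training sequence (an assumption for illustration).
MEMORIZED = "the quick brown fox jumps over the lazy dog".split()


def make_toy_model(peak):
    """Return a toy next-token function that puts probability `peak` on the
    memorized continuation and the remainder on a catch-all token.
    Only valid while the context is shorter than MEMORIZED."""
    def next_token_dist(context):
        target = MEMORIZED[len(context)]
        return {target: peak, "<other>": 1.0 - peak}
    return next_token_dist


def greedy_decode(next_token_dist, prompt, length):
    """Deterministic decoding: always append the single most likely token."""
    out = list(prompt)
    for _ in range(length):
        dist = next_token_dist(tuple(out))
        out.append(max(dist, key=dist.get))
    return out


def sample_decode(next_token_dist, prompt, length, rng):
    """Stochastic decoding: draw each token at random from the distribution."""
    out = list(prompt)
    for _ in range(length):
        dist = next_token_dist(tuple(out))
        tokens, weights = zip(*dist.items())
        out.append(rng.choices(tokens, weights=weights, k=1)[0])
    return out


def sequence_probability(step_probs):
    """Chance that sampling reproduces every token: product of per-step probabilities."""
    return math.exp(sum(math.log(p) for p in step_probs))


if __name__ == "__main__":
    for peak in (0.99, 0.6):
        model = make_toy_model(peak)
        greedy = greedy_decode(model, ["the"], 8)
        rng = random.Random(0)
        hits = sum(sample_decode(model, ["the"], 8, rng) == MEMORIZED for _ in range(1000))
        print(f"peak={peak}: greedy verbatim={greedy == MEMORIZED}, "
              f"sampled verbatim {hits}/1000, "
              f"theoretical rate={sequence_probability([peak] * 8):.4f}")
```

In this toy setting, a sharply peaked distribution makes both decoding strategies reproduce the text, whereas a flatter one still yields verbatim output under greedy decoding but only rarely under sampling, which is the gap the deterministic-versus-probabilistic distinction turns on.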
Generated by AI based on the Interview/Transcript below.
Key Takeaways
- AI code likely outside copyright system. The Copyright Office’s current guidance — that prompts alone do not generate copyrightable expression — may return AI-generated code to the pre-1980 status quo where machine-executable software was outside the copyright system, a result Sigall characterized as potentially untroubling: “The answer may be yes, copyright is no longer part of the system anymore… but that’s okay.”
- Open source governance faces structural collapse. Asay argued that copyleft obligations, contribution agreements, and patent grants embedded in FOSS licenses all presuppose copyrightable code, so that AI-generated code — if uncopyrightable — could hollow out the legal infrastructure of the most successful collaborative innovation system the world has seen.
- Model weights may be fixed copies. Cooper and Lemley’s framework suggests that if a model can predictably and replicably extract a memorized work, the model weights themselves may constitute a fixed copy under the 1976 Act — a conclusion that, if adopted, would make copying a sufficiently memorized open-weight model an act of direct infringement.
- Memorization is empirically documented and deterministic. Cooper reported that a near-pristine reproduction of Harry Potter and the Sorcerer’s Stone and nearly all of Ta-Nehisi Coates’s “The Case for Reparations” could be extracted from Meta’s Llama 3.1 70B using minimal prompting and code, and that the Harry Potter extraction runs deterministically, producing the same output every time.
- Trade secrecy and patents fill the vacuum. Asay predicted that as copyright’s already thin protection for software erodes further in the AI era, trade secrecy — which has no human-author requirement — and software patents will take on greater prominence, pushing the software ecosystem toward a more closed and fragmented architecture.
- Section 102(b) was always anti-expansionist. Samuelson argued that § 102(b)’s exclusion of procedures, processes, systems, and methods of operation was inserted into the 1976 Act specifically to prevent copyright protection for computer programs from being “construed too broadly” — a limitation the courts honored more in the Altai era than in the Whelan era.
- Copyright was always secondary to business models. Sigall’s historical survey showed that software companies across every era — from shrink-wrap licensing to cloud SaaS to mobile app stores — have found ways to recover their investment without primary reliance on copyright enforcement, and that anti-piracy efforts were most effective when treated as customer-acquisition rather than legal campaigns.
- Open source norms may outlast copyright formalities. Both Asay and an audience questioner noted that open source norms — developer motivation to influence the AI infrastructure stack, reputational incentives, and community standards — may sustain open collaboration even if the copyright technicalities underlying FOSS licenses become legally incoherent, though Asay cautioned this countervailing force may be insufficient without institutional legal support.
- Copyright Office registration guidance is unworkable. Samuelson asked whether the Copyright Office’s requirement to identify and disclaim AI-generated portions of a registered work was administrable; Asay replied that it is “unworkable” as human and AI contributions become increasingly intermixed and suggested that the systems collecting prompt-level data could eventually inform a more practicable registration regime.
- Court engagement remains essential — and contested. An audience questioner asked whether courts are still willing to engage in the kind of creative statutory adaptation they performed in the software era; Samuelson’s response — that “we ended up with sui generis within the copyright system” — acknowledged that the current judicial climate, particularly on digital first sale, suggests the appetite for expansive statutory construction may be lower than it was in the Altai and Oracle eras.
Interview/Transcript
This interview/transcript was based on a conversation on April 16, 2026 at the 29th Annual BTLJ-BCLT Spring Symposium: Origins, Evolution, and Possible Futures of the 1976 Copyright Act, hosted by the Berkeley Center for Law & Technology, UC Berkeley School of Law. The panel, “Copyrightable Subject Matter and the Special Problem of Software,” was presented by Clark Asay, BYU Law; A. Feder Cooper, Yale University; and Jule Sigall, former Microsoft; and moderated by Pamela Samuelson, UC Berkeley Law.
Pamela Samuelson 00:15
Good morning, everyone. Welcome back to the 29th annual BCLT-BTLJ symposium, co-sponsored with the Columbia Kernochan Center. I’m Pam Samuelson, and I’m going to kick us off. I thought I would first introduce our panel. So with us today is Jule Sigall, who for many years was the copyright counsel for Microsoft Corporation. I’m going to give a history of the evolution of copyright for computer software, and Jule is going to tell us something about how copyright was just one little piece of a legal protection strategy; I think many of us just kind of get focused on the copyright stuff and don’t really think about that. And then we have two presentations about AI. Clark Asay will talk to us about artificial code, that is to say, computer programs are now generating code, and as we heard from Shira Perlmutter yesterday, the question about copyrightability of computer-generated works is an old problem, but it raises some new questions these days. And then Feder Cooper, who is a professor now at Yale University, although on leave right now; he’s been extremely active in the generative AI and the law field, and he’s been thinking a lot about memorization and whether these models in the AI systems actually do memorize things, and how the law should be thinking about that. So that’s kind of the program that we have for today. And so I’m going to start off and just remind many people that functionality in copyright law in the United States is actually an old problem, and the Baker v. Selden case from 1880 has had considerable influence. That decision, as you know, said that copyright protection is available for the explanation of a useful art, but not for the useful art itself. And the ramifications of the Baker case for the scope of copyright have basically meant we try to shunt all things that are functional out to the patent system, and we keep copyright for things that are truly expressive. 
And under that Baker doctrine, architectural drawings would be copyrightable as drawings, but the buildings would, until 1990, not be within the scope of copyright protection. And if you take that principle and apply it to computer programs, then the source code forms of programs would fit as literary works within copyright law, but machine-executable code wouldn’t, because it’s essentially a virtual machine. It’s a machine process, and therefore would be outside of copyright. And I think that helps to explain why the Copyright Office, for a very long time, was granting registration certificates to computer programs under what they called the rule of doubt, as in, I’ll give you a certificate, and you can take it to court if you want, but I’m not saying that computer programs in machine-executable form are actually protectable by copyright law. And in terms of what was in the 1976 Act about computer programs, if you listen to CONTU: oh, the ’76 Act already covered this. And I think that was partly true. That is to say, I think the ’76 Act kept going the idea that source code forms of programs would be protected, but if you look at the original section 117 of the Copyright Act of 1976, you will see that it basically said, I’m going to preserve the status quo, whatever that is, for computer programs, and I think that meant that the machine-executable forms of programs weren’t really protected until 1980, after Congress amended the statute to add the definition of computer program and add a new section 117 that had a couple of exceptions that are important. But one thing that I want to note here is that Congress held no hearings about computer software and the scope of copyright protection when they passed the 1980 amendment. 
So we have in this 1976 Act this provision in Section 102(b) that says that in no case does copyright protection extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery. Those words, procedure, process, system, method of operation, I think, were added to the statute, according to the legislative history, in order that the scope of copyright protection for computer programs would not be construed too broadly. And yet it has played a relatively modest role in the protection or the scope of copyright in computer programs. So I think it’s interesting to note that computer program copyrightability was not in the original charter of CONTU. CONTU was actually asked to look at photocopying, whether inputting works into a computer was copyrightable, whether databases could be copyrighted, and whether computer-generated works could be copyrighted. And yet the most important thing that CONTU ended up doing was deciding that copyright and computer programs actually fit together quite well. Now, I think it’s important to understand the CONTU report. I wrote an article on this, my tenure piece, “CONTU Revisited,” basically saying, you know, these people didn’t know what they were talking about. One of the things that they didn’t do: they didn’t pay any attention to then Professor, now Justice Breyer’s tenure piece, which said that the case for copyright protection for computer programs is actually quite weak. And they also gave very little attention to Section 102(b), and they also didn’t think about how programs have to interact with other programs, and what we should do about interfaces. And so that was left to the courts over time. And so it was actually important that the initial kind of legislative history, more or less, is the CONTU report. But the CONTU report is really, shall I say, incomplete. 
Now, the initial attacks on the copyrightability of programs, in the Apple v. Franklin and Apple v. Formula cases, I think, were rightly rejected, because after the 1980 amendments, if copyright protection for computer programs was to mean anything, it had to mean that literal copying of code was an infringement. But in the Apple v. Franklin case, the court interpreted section 102(b) as excluding only abstract ideas and not the procedure, process, system, method of operation, and it contained some dicta that said that Franklin’s desire to be compatible with the Apple II programs may have been a competitive objective, but it was completely irrelevant to copyrightability, and that led to a lot of speculation about whether interfaces of programs would end up being protected. And then came the Whelan v. Jaslow case, which interpreted the scope of copyright protection extremely broadly, so that everything about a computer program except its function was protectable expression, unless there was only one way to achieve a function, in which case there would be merger of idea and expression. It endorsed the idea that the structure, sequence and organization, or SSO, of programs could be protected by copyright, and it also, by the way, said, oh, and look and feel is like a really valuable part of programs. And so this kind of kicked off a series of follow-on cases that gave very broad protection. 
So in 1986 Whelan came down, and then it took six years for the Second Circuit Court of Appeals to basically say, you know, that’s not the right way of thinking about it, that Baker v. Selden is actually the starting point for understanding what to do about functional works that are protected by copyright. And although Computer Associates, the plaintiff in the Altai case, claimed that the list of services and the interfaces were protectable parts of the SSO of its program, the Second Circuit rejected that and basically said computer scientists don’t think about things in terms of structure, sequence and organization, so that’s not a coherent concept. And so it basically then gave us the abstraction-filtration-comparison test. And one of the things that I think is remarkable about this is that it basically reinvigorated certain kinds of defenses in copyright cases involving software. So the merger doctrine ended up being the doctrine that the Second Circuit relied on: efficient elements should be unprotectable under the merger doctrine. And I actually credit David Nimmer for having called attention to that. And also the scènes à faire doctrine was basically reinvigorated and expanded to deal with elements of programs that were constrained by external factors. And again, David is entitled to credit for having helped the courts to have a new understanding about the scope of copyright protection for computer programs. And in years after that, a number of courts adapted Altai in ways that I think made a lot of sense. So the Altai filtration didn’t include processes, procedures, systems and methods of operation. Later decisions such as Gates Rubber said that those have to be filtered out too. 
The Bateman case decided that the Altai case was not limited to cases involving non-literal infringements, and more recently, the SAS v. WPL case involved a copyrightability hearing, so that if you didn’t show what was copyrightable in your computer program, the case would be dismissed, and the Federal Circuit actually did that. So part of what’s interesting to me, studying this over a period of years, is that when Congress was acting, both in 1976 and also in 1980, they had no idea that the software industry was going to be a thing, okay. I mean, you know, at the time that CONTU was doing its work, only 1200 programs had been registered, so you thought maybe the industry doesn’t care about it. But of course, in some sense, the industry changed dramatically, and all of a sudden, copyright turns out to be a really important asset. And so now you have what Molly Van Houweling calls doctrinal cocktails, where in terms of interfaces, sometimes they talk about scènes à faire, sometimes they talk about merger, sometimes they talk about 102(b), and sometimes they talk about fair use, and we still don’t know what exactly it is that causes the exclusion, although Google v. Oracle says fair use, so maybe that’s going to become a new thing. So I think the question that I asked myself is this: I was one of the people who thought a sui generis form of protection for software was really the right thing. I was so wrong, and so it’s nice to be able to admit being wrong after many, many years, because copyright did a really good job, and it gave us a basis for international protection. I think it’s been really important, and so I think it’s enabled some stability in the software industry, which is really, I think, one of the great success stories, and copyright played a role, but not that big of a role, as Jule Sigall will now tell us.
Jule Sigall 14:16
Thanks very much. Do I have a way of advancing the slides? So while that gets set up, I want to thank Pam and Molly and Peter and Berkeley for inviting me, and the Kernochan Center and Jane for co-hosting this tremendous event. Okay, let’s see. Very excited to be here to talk about this. This is fun stuff. I retired three years ago. This is the first copyright thing I’ve done since then. We’ll see if I still have my fastball. A disclaimer: I’m on the board as former Microsoft; I am also former Copyright Office, former a lot of things. Everything I say here will be my own. The only clients I have anymore are my family, and they could care less what I have to say. Let’s start, let’s see if this works here. This is a quiz. There’s two quick pop quizzes here. Raise your hand if you know who this group is. I thought we had more copyright nerds in this group. This is CONTU. Most of the members of CONTU; some people weren’t in there. This was taken in 1978. This is right before they issued their report, which led to the 1980 amendments. So this is a very esteemed group. David’s father is there on the couch in the front. Now, next question: raise your hand if you know who this group is. So we have a couple tech nerds too. This was also taken in 1978. This was taken in Albuquerque, New Mexico. These are all the employees of Microsoft at the time. Inside the company, this is colloquially known as the Manson family photo in some ways. You often see this on social media, saying, would you invest in the company run by these people? That was the question. The reality is, one of the folks there was either dating someone who was at Sears and could do portrait photography, so they all went and did the portrait one afternoon. So in many senses, this panel and my talk is a question of whether this group sufficiently incentivized this group to make software, right? Because I think that’s what the question is. 
Pam has given us the very good legal background to this. My talk is going to be almost entirely pragmatic. It’s entitled strategic choices, because I want to explore the way that the software industry chose to use copyright or not use copyright over its history. And I’m going to start with something that was presented here 10 years ago by David Hayes, a fantastic presentation called brief history of software and IP. In it, he had these charts that kind of mapped the waxing and waning of trade secrets, patents, and copyrights across the various eras of the software industry. The green line is patents, the red line is trademarks or copyrights. The gray line is trade secrets. We can isolate copyrights to look at what David said, and I’m just basically going to use his eras here and try to fill out this chart with a little bit of what the software companies were doing at the time to essentially recoup their investment in the creation of software. So if you start with 1980, the PC era, that’s when the computer software amendments took effect. You see this rapid rise of copyright being relevant to the software industry. In the 80s PC era, if you think about the basic problem copyright is trying to solve, it’s that it takes a lot of time and effort and money to put people together to make software. It takes a lot of time and effort and money to package it and distribute it. How do you recover that fixed-cost investment? This is the model, I think, that CONTU had in mind when it decided, somewhat unexpectedly, as Pam pointed out, to embrace computer software and bring it into the copyright fold. This is the model that they were expecting. So you make a product much like a book or a record, and you send it out there through distribution channels, and you recover your investments that way. 
The key thing to focus on here is less about the law and copyright and more about the practical measures of control that are leveraged here. In this case, it’s physical constraints and physical frictions, but also end user license agreements, shrink-wrap license agreements, and rudimentary technological measures like key disks and dongles and things. The point here is that copyright salience, which is what I’m calling it (just a fancy way of asking, is copyright top of mind to execs in this space?), was relatively high: the notion that you really are relying on a copyright-based business model to make and sell software. You get into the 90s, and you’re talking about the World Wide Web. The obvious challenge is that once you connect all the computers up to a network like the World Wide Web, it becomes much, much easier and much less costly to send that software around as bits, as opposed to packaged products. That creates a threat to getting people to pay for your software, because they can get it for free from other sources. So the question is, what do you do? In that case, the software industry obviously relied on copyright and anti-piracy efforts to a great extent, but also just different ways of getting paid. One of the big ways was to take software and bundle it with hardware and have people pay for the hardware. It’s very convenient to do that. They get access to the software. The Apple v. Franklin case helped Apple do that with its operating system, because you couldn’t put the Apple operating system or a compatible one on a different set of hardware. You had to buy it from Apple. Microsoft did a similar thing, but with independent OEMs, distributors like Dell and Compaq and others who would pay for the operating system and then pass that cost along to the consumer. 
So different models start developing. The leverage there is around B2B contracts, business-to-business contracts, which are a little bit more stable than retail or consumer-based efforts. But more importantly, at the end of the 90s, you start seeing ad-supported models for software. Google primarily led the way here by basically saying, I’m going to assume that my software should be freely accessible by anyone, and I will monetize that by monetizing the audience that it attracts, to search in this case. It’s not a new business model. It’s a business model that radio broadcasters and television broadcasters had used, and newspapers, interestingly, had used for years. But it’s another way of saying, look, I have to deal with this problem of people getting my stuff for free. I’m just going to assume they’ll get it for free, and I will have it paid for through some other means. You get into the 2000s, an era David called the cloud and OSS era, and you see some more business models develop. The great thing about the cloud, from a software provider’s perspective, is you could put all the value on a cloud server and control your customers’ access to it. So instead of you shipping out disks or downloading bits to them, they get access by authenticating themselves on the server, which basically makes sure that they will pay for the software. There’s not really much that a pirate copy will do for them. They have to come to you for the access. So you have subscription models, SaaS models (software as a service), and other things coming along. This is much more useful in terms of getting people to pay without relying as much on copyright law. And in fact, during this period, you saw a lot of companies’ anti-piracy efforts switch to being anti-fraud efforts, because piracy wasn’t as much of a problem.
What was a problem now is that scammers would purport to sell you a subscription to a software suite, but they were just selling you nothing. So you actually repurposed those efforts to try to make sure that people weren’t getting scammed or defrauded. You didn’t need as much anti-piracy in this case. The other thing that happened here, which is interesting, is the rise of open source software. You could get access to this software and bundle it with consulting services, as lots of companies did, or build cloud services based on open source. Open source, of course, does rely on copyright: your access to and use of that software is controlled by the copyright license it comes with. So it’s not to say that copyright is completely waning as we go era to era, but it’s interesting that during this time the most prevalent copyright-based model is one that’s all about making software as accessible as possible, not necessarily about getting paid directly for the software, but about using copyright to make sure that the software is used and redistributed in certain ways. So it’s an interesting wrinkle on how much copyright is used. The next era was the 2010s, which is the mobile era, in David’s terms. It’s really the App Store and the app ecosystem. This looks and feels a lot like the 80s era, in the sense that you were going back to a time when you could actually make software and get paid for it directly. I remember the time I realized that the Apple App Store was going to be a big thing, because on the internal discussion lists about all things Apple inside Microsoft, Microsoft developers started asking questions like, how do I get an EIN for tax purposes? The reason is, they were selling so much software on the iPhone App Store that they actually needed to report the taxes on it too.
So you could see that this feels like, okay, this is the old days. This is like Egghead Software on my phone and in the cloud. I can go and make software and sell it to people directly. I would say it feels like it, but it’s not really a copyright business model, because what you’re actually leveraging there is not the copyright law. You’re leveraging the platform control exercised by Apple, or by Google in the Android store, right? Everything you do that feels like copyright will essentially be mediated by Apple or the platform owner: whether they let you on there, whether they take down something that you claim is infringing on what you’re doing. And they take care of all the payment and other things to make sure that you get paid. So it feels like copyright, but it’s not quite copyright. The other thing that happened is that these apps all connect back to the cloud, so that you make it not worth someone’s time to just get a pirate version of the app, because all their stuff is stored in the cloud. You have enough connection with them, and maybe now a subscription model, that makes it more valuable for them to use your stuff directly than to get an infringing copy, if you will. So this is a quick rush through the history of how software companies made strategic choices. My thesis is that over this time, you see a waning of copyright as the way that software companies monetize their software, and the development of a whole series of business models that would allow them to get paid for it. Obviously, the next question is, what is AI going to do to this? There are lots of question marks up there. I will only remark on a couple of things. Who knows what the business models are going to be? Who knows what types of physical control you leverage to get paid for your software?
Who knows how that’s going to work? It’s all rapidly changing. I would say just a couple of things. One, I think it’s pretty clear that we are entering an era where more software will be developed than ever before, by more people and more organizations, big and small. There’s no question about that. The tools allow people of all stripes, all shapes, all sizes, to make software. They also allow software to make software. So you’ll have agents making software for agents making software for agents, as opposed to humans making software for humans. You’ll just have this explosion in the creation of software. If you take a macro step back and ask how much software is being produced, there will be enormous amounts, much more than you ever thought. Organizations right now are already saying, I can use the AI tools to write software programs that I would never even have considered before, because the cost of hiring someone to do it, and the time and effort, would be too much. I can do this in 30 minutes and have a useful piece of software, doing relatively small things, but it’s almost immediately accessible to them now. I think this is embodied in the tweet from one of the founders of OpenAI. He said, you know, this is the hottest new programming language, right? And it’s interesting, because if you think about that, people will now basically program computers with English. I’ve been talking about strategic choices that companies make in light of copyright. You can also look at the strategic choices that policymakers make in copyright. As Pam was pointing out, there were probably lots of good reasons why CONTU should have said, the world is fine, we don’t need to bring computer software into the world of copyright. Everything is fine. There’s no need for it. However, they were very open and said, let’s bring it in.
Let’s see if it can help this nascent industry. Right now, we have a situation. We have a nascent technology, rapidly moving, rapidly changing. The question is, should we bring it into the copyright fold? The reality is that the Copyright Office, in its pronouncements about what is copyrightable and what is not, although it is mostly talking about other types of creativity that come out of AI, has said that prompts alone do not create copyrightable material. If you apply that to computer software, most of the actual code is going to be determined by the AI in response to your prompts, not by your prompts. So the question is, could Andrej have said the hottest new programming language is prompts alone, right? You could think about it in those terms. So it feels like the Copyright Office is saying we don’t need copyright to embrace this form of creation of computer software; it’s not really part of the system anymore. And if that’s the case, if the courts uphold that view, I think the more interesting question is, will that even matter? Are developers and others using these tools to create software going to rush to the Copyright Office’s door or to Congress’s door and say, we need copyright protection? My prediction, and who knows: obviously, there are lots of questions about displacement of jobs for developers. We have the struggling artist metaphor and example. Will we have the struggling developer example come up now, in light of what these technologies do to the economics of software development? But I’m not sure copyright will be the thing that people will be asking for. So I think the answer may be yes, computer software is no longer part of the copyright system, but that’s okay.
That may not be something that’s necessary anymore. So we may be back where we started before CONTU. The next question, which I’ll end with, is this: AI can do a lot more than create code and write software. It can do lots of different types of creativity that can be digitized. Does its presence, and the interpretations of whether these other types of creativity are part of the copyright system or not, engender for other types of creative works the kind of migration away from copyright that we’ve seen in the software industry? Who knows, but I think it’s a possibility. It’s possible that if we close the door to copyright for these forms of creativity, creation will still go on, and, like the software industry, others will develop business models that don’t need copyright as much to do their creativity. Happy to answer any questions. Thanks again for the opportunity to speak.
Clark Asay 30:22
Thank you. Like everyone else, I’m grateful to Pam, Peter, Molly, Richard, Jane, and the Columbia folks for having me. It’s a great honor to be here. What I wanted to talk about today flows, I think, very nicely from the remarks from Pam and Jule. I’m going to really focus on open source software. My basic thesis might be overstated, but that’s what we do as legal academics: generative AI, by enabling mass production of, as Jule just referenced, copyright-less code, may be sowing the seeds of its own demise by undermining the copyright foundation that has been so important to collaborative software and AI innovation. So I believe there are some reasons to be concerned. I’m also going to acknowledge some countervailing forces, reasons to be more optimistic, and I want to briefly touch upon both. Pam gave us this beautiful history of how the 76 Act eventually gave software a copyright home, and that home became an important basis, as Jule just referenced, for the most successful experiment in collaborative innovation the world has seen: the free and open source software movement, the FOSS movement. I don’t want to rehearse in detail here how that movement came about and how important it has been. It is the backbone, I think we all know, of much modern technological innovation. But importantly for our purposes, software’s copyrightability played an immensely important role in fostering that movement through copyright-based FOSS licenses. Now generative AI, as Jule pointed out, poses some unique challenges to the copyrightability of software and potentially to that collaborative system of software innovation. We’ve talked a lot in this conference about the adaptability of the 76 Act, and currently AI is posing some immense challenges to that adaptability with respect to AI-generated code.
How? FOSS licenses, the system that has been so successful in encouraging and facilitating mass amounts of innovation, presuppose copyrightable code. Contribution agreements, copyleft obligations, and the attribution requirements that come as part of many of these licenses all rest on that copyright foundation. Without copyright, it’s possible this governance architecture becomes much less straightforward, much more complicated and cumbersome. So, a bit more about what Jule brought up: AI-generated artificial code is not copyrightable, at least under the Copyright Office’s current interpretation. The Act requires a human author. Some of the Copyright Office’s reports treat AI outputs as non-human expression, and as Jule mentioned, even detailed, iterative prompting does not seem to be enough to trigger copyrightable expression, because prompts are considered unprotectable ideas, not authored expression. And with English being the language of coding, vibe coding, you know, these low-touch practices are growing and are unlikely to satisfy any future authorship standard. So this, as I mentioned, creates cascading problems for the open source governance model. Contribution agreements break down because contributors cannot represent copyright ownership of AI-generated code, and the projects, and the companies behind many of these projects, might be reluctant to accept contributions without these representations and warranties. Copyleft obligations lose force: there’s nothing to license if there is no copyright, or at least nothing based on copyright. One response to a lot of this is that developers will just do what they’re going to do; they don’t care that much about copyright. But corporate legal departments do. So lawyers always get in the way.
Risk aversion may contribute to this, in light of this uncertain provenance. And I think it’s important that this weakening copyright foundation comes on the heels of several currents in the software industry, and in the open source software industry, that I think have already pushed against open collaboration or threatened it in some respects. A lot of high-profile projects have recently abandoned their open source licenses for more restrictive models, MongoDB and Elasticsearch, just to name a few. Jule talked about how monetizing open source hasn’t been the point, but it’s always been a difficulty for companies, navigating how to make this business model work when we can’t directly monetize the software. So at least some corporate sponsors are switching to source-available or proprietary licensing. Another important force that is accelerating a possible closing of this commons is AI. Everyone’s deploying agents, and those agents are creating, as Jule pointed out, tons of software and making pull requests, contributions, to these projects, and projects are being overwhelmed. In many cases, these haven’t been vetted by humans, and so, at least in some cases that I’ve read about, projects are restricting contributions or closing off entirely, just saying, we can’t handle the AI era and the mass production of all the software and contributions being made to our projects. So I think these open collaboration norms are potentially eroding from multiple directions simultaneously: the copyright foundation, as well as the AI-generated chaos for a lot of these projects. And so where does that leave us? If not copyright, what else? Well, trade secrecy is a natural fit for artificial code. There’s no human author requirement, right?
Trade secrecy has been a very important form of protection already, including in the open source world. Jule referenced cloud-hosted software. This is one of the reasons why many projects are changing their licenses: companies, in many cases, are taking but not giving back, because they can keep their innovations on top of the freely available open source as trade secrets. So it’s already a growing trend, and I think it may accelerate as the copyright foundation becomes weaker. Now, this is all to say, to Pam’s point, copyright has always been constrained and weaker in software code for a variety of reasons, so I’m not suggesting a more robust copyright in software. I’m just saying that what little copyright we had is, in the era of AI, going away completely, and that might push us even more forcefully in the direction of trade secrecy, and then patents gaining ground too. I think the most recent USPTO report points to conception: as long as the human can be said to have conceived of the invention, then whatever the AI’s use, fine. Patents have long been a big deal in the software world. With copyright waning, they are likely to take on an even more prominent role, and as FOSS licenses that come with implicit and explicit patent grants are abandoned, that moderating role of copyright in relation to patents falls away, and enforcement of software patents, in some cases at least, becomes more likely. So I think together, trade secrecy and patents point toward a more closed, fragmented software ecosystem and possibly AI system. So does this closed software ecosystem threaten AI’s future, and software’s future? I think Jule seemed, you know, we’ll find out, perhaps optimistic that it’s all going to be fine.
I think the AI systems, ironically, have really depended on access to a lot of open source software for training. And so if that spigot shuts off, because more and more software is kept as trade secrets or is patented and so is harder to access, relying on synthetic data alone risks model collapse. And so generative AI, by undermining this copyright foundation that has been so instrumental in open innovation in the software industry, may be sowing the seeds of its own stagnation. Now, one really strong countervailing force that’s important to note is developers’ desire to influence the AI stack. I think this is one of the biggest motivations that is likely to keep the ecosystem at least partially open. Many developers have a strong interest in shaping which AI tools, frameworks, and models gain traction, the infrastructure. They want to influence the infrastructure. They want to continue to contribute to it and help move it in the right direction, so to speak. Open contribution and collaboration may continue despite possible copyright complications. Maybe I’m inflating the copyright complications, or they’ll just be somewhat ignored and business as usual will continue because of this proven model of developing and standardizing around different parts of the AI and software stack. So I think this is a meaningful counterweight, but without institutional legal support it may not be enough. As for paths forward, one is perhaps broader recognition of human-AI collaborative authorship under the Act. The Copyright Office is not the last word. We’ve talked a lot about how courts have been instrumental in interpreting the 76 Act in a way that is conducive to new technological developments. This might be another area, and I think it almost certainly will be, where courts will step in.
I think Pam, Matt, and Chris have argued (part of the reports references their argument, and if I’m misstating it, you can correct me, Pam) that at some point iterative prompting should probably be a basis for copyrightability of outputs. At some point you’re going from ideas to expression. We don’t know exactly where that line is, but I think this is the future of creativity. Creative people, including software developers, are going to be using these tools, and they’re still going to be creative, even if the AI tools are making some of the decisions and they don’t have complete control over the outputs. And so one possibility going forward is harking back to Feist, and even to the earlier cases under the 1909 Act, that reference intellectual conception as the guiding principle of authorship, and therefore copyrightability. That phrase and that concept might take on a more prominent role in our AI-generated world. So again, I think the 76 Act gave software a copyright home. That home is now, I think, under threat from artificial intelligence and artificial code, the results of AI systems. And I think the challenge ahead is ensuring that the law can continue to sustain this open collaborative environment that has been so successful in pushing innovation forward. Thank you.
A. Feder Cooper 45:06
All right, I similarly want to thank everyone for having me here today. I’m Cooper. I’m sorry that I’m not Mark. He couldn’t be here, but he’s my co-author, so I will be representing both of us. In a nutshell, I want to talk about revisiting the question of whether model weights that give the possibility, but not necessarily the certainty, of generating infringing outputs may constitute a copy under the statute. We think it’s the right time to bring this to the fore again because of recent empirical evidence about this thing called memorization in large language models. So I want to start with a bit of brief background on the machine learning side to make clear why this question makes sense. It’s common to describe a model as learning statistical correlations, relationships, or patterns in the training data. That description isn’t wrong, but it’s also very high level. It’s a gloss that oversimplifies the information that’s represented by the model. There’s another important part here, which is how the model is used to generate outputs. For an LLM, generating text requires a decoding procedure, a rule for selecting tokens from the model’s predicted probabilities. Some decoding procedures are deterministic: you get exactly the same output every time you run with the same input. But many are stochastic: you get different outputs from the same input because they sample from the distribution. So during generation, sometimes the model might produce “Potter” after “Harry,” and sometimes it might produce “Truman.” How often it produces either depends on the learned distribution in the model weights. This distinction between model and decoding actually matters a lot, even though the two often get collapsed and talked about as the same thing. The reason is that the model itself defines a fixed probability distribution, so variability in outputs doesn’t arise from changes in the model, but from how that distribution is used during decoding.
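As a rough illustration of the model-versus-decoding distinction just described, here is a minimal Python sketch. It is not drawn from the speakers’ research code. The three-token distribution, its probabilities, and the function names `greedy_decode` and `sample_decode` are all invented for illustration; a real model assigns probabilities over its entire vocabulary at every step.

```python
import random

# Hypothetical next-token distribution after the prompt "Harry".
# These numbers are made up; they stand in for the fixed distribution
# that a trained model's weights would define.
NEXT_TOKEN_PROBS = {"Potter": 0.80, "Truman": 0.15, "Styles": 0.05}


def greedy_decode(probs):
    """Deterministic decoding: always pick the most probable token."""
    return max(probs, key=probs.get)


def sample_decode(probs, rng):
    """Stochastic decoding: draw a token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so that reruns are repeatable
    # Same distribution, two decoding rules.
    print("greedy: ", [greedy_decode(NEXT_TOKEN_PROBS) for _ in range(5)])
    print("sampled:", [sample_decode(NEXT_TOKEN_PROBS, rng) for _ in range(5)])
```

The distribution never changes between calls. Greedy decoding returns “Potter” every time, while sampling returns “Potter” most of the time and the other tokens occasionally, so any variation comes from the decoding rule rather than from the model.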
So why have I gone on this LLMs 101 spiel? Well, it turns out that these details matter a lot for understanding what memorization actually is and what it means to extract memorized training data in a model’s outputs. The technical notion of memorization can be understood as a property of that distribution. Loosely speaking, memorization is when, based on its training, the model places unusually high probability on specific sequences from its training data. The model’s probability distribution becomes highly concentrated on particular sequences seen during training, such that those sequences are far more likely than alternatives. This doesn’t mean that the model stops being probabilistic. It still defines a probability distribution, and may still give non-zero probability to many possible sequences, but the distribution is so sharply peaked that one sequence, or some small set of sequences, dominates. For intuition about what sharply peaked means, imagine rolling a die where there’s a 95% chance that you roll a three, as opposed to an equal one-in-six chance for every number. When this happens, the high-level description of models as learning statistical correlations or patterns can be quite misleading. It makes it sound like the model has learned some sophisticated soup where all of the original training data are unrecognizably transformed into abstract patterns. But in these specific cases, that’s not how it works, and this point is actually very closely related to compression. I won’t get into the details of why, but at a high level, it’s reasonable to extend Ted Chiang’s blurry JPEG of all the text on the web analogy to say that memorization basically means that parts of the JPEG are blurry and others aren’t blurry at all. When it comes to memorization, we’re talking about really high-fidelity compression, and that’s a thing that’s in the model.
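The loaded-die intuition above can be made concrete with a few lines of arithmetic. This is only a sketch under stated assumptions: the talk gives the 95% figure for rolling a three, but the even split of the remaining 5% across the other faces is an assumption, as are the helper names `entropy_bits` and `prob_of_run`.

```python
import math

# A fair die: every face has an equal one-in-six chance.
FAIR_DIE = {face: 1 / 6 for face in range(1, 7)}

# A sharply peaked die: 95% on three. Splitting the remaining 5%
# evenly over the other five faces is an assumption for illustration.
LOADED_DIE = {face: 0.01 for face in range(1, 7)}
LOADED_DIE[3] = 0.95


def entropy_bits(dist):
    """Shannon entropy in bits; lower means a more concentrated distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def prob_of_run(dist, face, length):
    """Probability that `length` independent rolls all come up `face`."""
    return dist[face] ** length


if __name__ == "__main__":
    for name, die in (("fair", FAIR_DIE), ("loaded", LOADED_DIE)):
        print(
            f"{name:>6}: entropy = {entropy_bits(die):.3f} bits, "
            f"P(ten threes in a row) = {prob_of_run(die, 3, 10):.2e}"
        )
```

Both dice are still probabilistic, but the loaded one produces the same long run far more often than the fair one. By analogy, and only loosely, a long passage can come out of a model reliably when the distribution is this concentrated at step after step.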
And depending on how good the compression is, you maybe actually can get the memorized training data out deterministically. There might not be any probabilistic element there at all at the time of use. You could use one of those deterministic decoding schemes, and out comes the memorized training data every single time. We still have a lot to learn about memorization on the technical side. It’s pretty mysterious, frankly. It’s why I personally haven’t quit this research area, even though I probably should go do some other things as well. It just keeps giving new insights about language model behavior, and I think it’s fairly commonly accepted on the technical side at this point that this behavior isn’t a bug. It’s far too interesting and far too complicated to dismiss like that. So now let’s turn to copyright, starting with how doctrine addresses what a copy is. The statute gives us an answer, but as Mark has observed, it turns out to be pretty incoherent. The copyright owner has the exclusive right to reproduce the work in copies, and copies are material objects in which a work is fixed, and a work is fixed when its embodiment in a copy, by or under the authority of the copyright owner, is sufficiently permanent to permit it to be perceived, reproduced, or otherwise communicated for a period of more than transitory duration. So if you take those words literally, nothing is a copy unless the copyright owner authorizes it. But if the copyright owner authorizes it, it isn’t infringing. So it’s impossible to infringe the right to make copies. This is clearly not what the statute meant, and it flows from the fact that we use the same definitions for both protectability and infringement without thinking about the consequences. And so courts have basically just ignored this part of the statute. As Mark would say, in brief, so much for textualism.
Nonetheless, the requirement of fixation in material objects is well established, and that brings us to AI. AI cases have mostly been litigated based on training or on outputs, and in those cases, parties often take pretty extreme positions: “models don’t memorize, they just learn word relationships” versus “AI output is just a collage of prior works.” Neither one of those is really entirely right, or even in some cases partially right. The reason memorization is interesting is that it makes it very clear how this is more complicated. Some work that Mark and I have done recently is just some of the latest on this. This is far from the only result in our work, let alone the only result like this one, but I’ll stick to it as an example. We extracted a near-pristine reproduction of Harry Potter and the Sorcerer’s Stone from Llama 3.1 70B. At the time that we did this, it was Meta’s flagship open source LLM. We did this with a very short prompt drawn from the beginning of the book, less than the first line, and we got the remaining 300 or so pages of the book, and it was really near exact. And that particular finding is deterministic. If you run our code on an A100 GPU with the first few words of Harry Potter and the Sorcerer’s Stone, and it’s less than 100 lines of boilerplate code, you will get the same output that I got. It turns out, though, that that’s a pretty extreme result. Extraction turns out to be possible for some works and some models, but not others. Memorization is not a uniform phenomenon, so extraction, of course, isn’t either, and this is why high-level numbers, like “a model only memorizes 1% of its training data,” conceal a lot of underlying nuance. Most of the experiments in that paper measure whether verbatim memorization is occurring.
So this is exact matches to the training data, but our new work, released a couple of weeks ago, shows a lot more extraction if we expand from identical to near-verbatim extraction. Here, the variation is not a substantial-similarity type of thing. It’s really just, oh, is there an extra space, or a comma instead of a semicolon, very, very small changes. There are also other extraction protocols beyond what we’ve studied. In other work I’ve done with folks at Stanford, we were able to extract enormous amounts of copyrighted text from production language models like Claude. In some cases we did need to use adversarial strategies like jailbreaking, but in others, for Google’s Gemini 2.5 Pro, we didn’t at all, with very, very simple prompting strategies. None of that work changes the model’s weights, but you can also do that in controlled ways to expose more memorization. In 2024, with some colleagues at Google, we fine-tuned ChatGPT on new data, which then made it possible to expose more memorization from its pre-training data, from data that it had been trained on previously. And more recently, directly on the copyright side, Jane Ginsburg and some of my colleagues in machine learning have shown that if you fine-tune on public domain books, like books by Virginia Woolf, that can reveal memorization of other works, in particular in-copyright material from other authors that the model was previously trained on. So given all of this, is a work memorized in a model a copy fixed in a tangible medium of expression? Well, the answer is, it’s complicated. Models don’t literally store works like a file system, but that isn’t really a problem. You can clearly make a copy in a computer, with zeros and ones, by storing parts of things in a relational database from which data can be deterministically drawn.
On the other hand, we wouldn’t say that Microsoft Word encodes War and Peace merely because all the elements, the letters, are present in the program and they just need to be placed in the right order. To get War and Peace out of Microsoft Word, you need to have an entire copy of War and Peace to type into it. Traditional databases are clearly different. You put in some query, and deterministically out comes some additional information that was not contained in the query. And memorization is clearly different from both of those things. It’s not like Microsoft Word, which is a perfectly clean slate, and it’s not like a traditional database, in that extraction doesn’t have to be deterministic, even though it can be. So, right, yes, the Harry Potter case was deterministic, but a lot of other extraction isn’t. Sometimes you get the memorized content out once out of every some number of prompts, and in some cases that number might be 1,000, and that’s still technical evidence for memorization, for reasons I won’t go into, but that’s clearly not deterministic. There’s a lot more that I could say about this 1-in-1,000 example. It’s deceptively complicated, but for now, let’s oversimplify and say you produce 1,000 generations, and one of them is the memorized text, and the other 999 aren’t; they give us something different. Presumably, we wouldn’t say the 999 copies, or non-copies, are also stored in the model. We would quickly get to storage of more outputs than there are atoms in the universe. Again, this example is more complicated than the gloss that I’m giving, but the point is that there’s no determinism here, suggesting the potential for a non-deterministic copy in the model. Copyright law hasn’t dealt directly with non-deterministic copies.
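The oversimplified 1-in-1,000 example can be put into numbers with a short sketch. It assumes, purely for illustration, that each generation is an independent draw with the same fixed chance of yielding the memorized text; as the talk itself notes, the real situation is more complicated than this gloss. The function names are hypothetical.

```python
def prob_at_least_one(p, n):
    """Chance of seeing the memorized text at least once in n independent
    generations, if each generation yields it with probability p."""
    return 1.0 - (1.0 - p) ** n


def expected_hits(p, n):
    """Expected number of generations (out of n) that yield the text."""
    return p * n


if __name__ == "__main__":
    p = 1 / 1000  # the 1-in-1,000 rate from the simplified example
    for n in (1, 100, 1000, 5000):
        print(
            f"n={n:>5}: expected hits = {expected_hits(p, n):.2f}, "
            f"P(at least one) = {prob_at_least_one(p, n):.3f}"
        )
```

Under these assumptions, a single prompt almost never surfaces the text, yet a few thousand generations almost always do. That gap between per-prompt rarity and in-aggregate reliability is one way to see why a non-deterministic copy is hard to classify.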
The closest examples that Mark and I have discussed are Kelley v. Chicago Park District, where a garden isn’t copyrightable because it’s not deterministically fixed, and video game output sequences in Micro Star v. FormGen, where video game displays based on new maps are fixed. But we think that Kelley is probably wrong and that the game content is predictable and replicable. Procedurally generated games like No Man’s Sky complicate this, but these are not perfect mappings here. Further, those cases involve the output. The Micro Star court presumably wouldn’t say that the new levels are somehow in the game. It’s only when the infringing output is generated that it matters. Nor would we say that all possible worlds that could be generated in No Man’s Sky already exist in the game. So Mark and I think the question of whether the model itself is a copy depends on how predictable and replicable the output and the extraction are. If the model weights can easily generate the same work, and that is readily replicable, it probably makes sense to say there is a copy in the model as a functional matter. If not, we think it makes more sense to say the model is generating the work only if it produces the work as an output, but that the copy isn’t sitting in the model the whole time. It’s encoded in there, with respect to a valid technical claim about memorization, but it maybe shouldn’t be viewed as a copy in the technical copyright sense of the word copy. And this matters because if the model is a copy of some or all works, making a new copy of the model itself is an act of infringement. So to put it really technically, that would really screw open source models. Maybe the model itself is fair use, just as the training data set has been held to be.
But things are different here, because models are themselves commercial objects, and that matters a lot after Warhol. Some of the success of training data set fair use claims may depend on intermediate copying status, but also not necessarily, because Authors Guild, for example, doesn’t depend on it. So I don’t love the functional conclusion that some, but not all, works are sufficiently memorized to be treated as a copy. Nor do I love the threat that the type of memorization work that we’ve done poses to open-weight models in particular. I don’t think Mark is particularly thrilled about this, either, but we do think this is where the empirical evidence and the current law lead us. Thank you.
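The distinction drawn in these remarks between exact and near-verbatim extraction, and the 1-in-1,000 example, can be illustrated with a small sketch. The Python below is an illustration only and is not the measurement protocol from the papers discussed: the similarity measure (the ratio from difflib's SequenceMatcher), the 0.95 threshold, the sample sentence, and the names near_verbatim, extraction_rate, and make_toy_sampler are all assumptions introduced here, and the toy sampler is a hypothetical stand-in for a model that emits a memorized passage once in every 1,000 generations.

```python
import difflib
import itertools
from typing import Callable


def near_verbatim(generated: str, reference: str, threshold: float = 0.95) -> bool:
    """True when the strings are identical or differ only by small edits.

    Assumption: similarity is difflib's ratio (2 * matches / total length);
    the threshold is illustrative, not taken from any paper.
    """
    matcher = difflib.SequenceMatcher(None, generated, reference, autojunk=False)
    return matcher.ratio() >= threshold


def extraction_rate(sample: Callable[[], str], reference: str, n_samples: int) -> float:
    """Fraction of n_samples generations that are near-verbatim copies."""
    hits = sum(near_verbatim(sample(), reference) for _ in range(n_samples))
    return hits / n_samples


def make_toy_sampler(reference: str, every: int = 1000) -> Callable[[], str]:
    """Hypothetical stand-in for a model: returns the reference on every
    `every`-th call and unrelated text on all other calls."""
    calls = itertools.count(1)

    def sample() -> str:
        if next(calls) % every == 0:
            return reference
        return "Something entirely different came out this time."

    return sample


# Made-up sample text; the variant swaps a semicolon for a comma.
REFERENCE = "The quick brown fox jumps over the lazy dog; then it rests."
VARIANT = REFERENCE.replace(";", ",")

if __name__ == "__main__":
    print(VARIANT == REFERENCE, near_verbatim(VARIANT, REFERENCE))
    print(extraction_rate(make_toy_sampler(REFERENCE), REFERENCE, 1000))
```

In this toy setup an exact-match test misses VARIANT while the near-verbatim test catches it, and a sampler that reproduces the passage once per 1,000 calls yields a nonzero extraction rate even though no single call is guaranteed to return the text.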
Pamela Samuelson 57:11
I should have said at the outset that Mark Lemley was expecting to be here. He had a family emergency, and so he was not able to join us, but fortunately, his co-author was able to do that. So thanks, Cooper, for that contribution. I have a couple of questions for people on the panel before opening it up for questions. Jule, you didn’t say anything at all about patents, and I’ve often imagined that part of the reason the Whelan case came out the way it did, where everything’s protectable, is because at the time, in the early to mid 1980s, the Patent Office wasn’t issuing patents, or wasn’t thinking that it was issuing patents, and so the industry thought they weren’t available, so you needed thick copyright because patents weren’t available. But then patents started becoming available. How did Microsoft and other companies think about the role of patents in relation to copyrights, because those two things were not supposed to overlap?
Jule Sigall 58:24
Yeah, it’s a good question. I mean, the real answer as to why I didn’t say anything about patents is I went to night law school, and the only Friday night class was patent law, so I never took patent law. So that’s why I didn’t say anything about patents. I think the reality, and this is one of the conclusions I drew from David Hayes’ presentation, and to Pam’s point in her presentation about abstraction, filtration, comparison, and all the complicated doctrines that developed under copyright law in the early to mid 90s about copyright protection for software, is that the real-world reaction of the software industry was: if I’m going to go through the trouble of demonstrating that my copyright survives this filtration process under the copyright cases, I might as well just try to get a patent instead. It’s sort of the same type of effort that you would need, and at that time you have State Street and other cases that relaxed the restriction on patents for algorithms and brought software into the patent world in the late 90s. So I think it was just a marriage of those two historical trends, and the notion that if you want to go the IP route for software, it’s probably more efficient and useful to go the software patent route, because it’s the same type of uncertainty and risk as to whether you’re going to actually get the IP protection. The other main point, which I don’t think I made in my presentation, and one of the things I learned advising tech engineers and executives in-house, is that copyright and patents, to a great extent, come with embedded strategic choices about your business. So if you choose to say, I’m going to use IP to protect my business, you will be making choices about what you can and can’t do that may not be the most optimal as a business matter. 
There’s a really good book on this called Capitalism Without Capital, from a few years ago, which talks about the notion that many, if not most, of the successful companies today have intangible assets on their balance sheets, not physical assets. And it explores what it means to have an intangible asset business. They talk about IP explicitly in there, and say that a lot of the benefit of these companies is that they take advantage of synergies and spillovers that are present in intangible assets. You would think, oh, intangible assets, they’re all IP-protected businesses, but they point out that IP can interrupt and interfere with the benefits you might get from synergies and spillovers in these intangible asset worlds. And so it may not be the optimal thing for your business. And we’ve actually seen, as part of what I touched on, a lot of businesses surviving without IP because they’re able to capitalize on the other aspects that come with an intangible business. So the short answer is, it’s complicated, but I do think that there is a natural trade-off that you have to make if you’re going to say, I’m going to protect myself with IP: should I go down the copyright route, or should I go down the patent route? I think by and large, since the late 90s, most of those who want to do it as an IP matter will go down the patent route and the trade secret route rather than the copyright route.
Pamela Samuelson 1:01:31
Well, that was very helpful. Thank you. And so Clark, I found your presentation really interesting, and one of the questions that popped into my mind is, what do you do with the Copyright Office policy? If you have a program and parts of it are AI generated, the Copyright Office policy, at least as I understand it, is that you have to identify the parts that are AI generated, and then you have to disclaim authorship in them. Does that make sense to you? And also, will people do that, or will they just pretend that they actually human-authored the whole thing?
Clark Asay 1:02:16
Does this work? Yeah. Does it work? Yeah, okay. Yeah, I think that’s a great question. I think it’s unworkable, right, especially as creativity and innovation become so intermixed that it will be very, very difficult to separate what the human has done versus what the AI has done. It sounds like a neat division in theory, and I suspect, I haven’t looked into this at all, but I suspect that there are a lot of works that people are registering. You know, the Copyright Office is not doing a thorough examination similar to the Patent Office’s, and so I suspect that a lot of people are submitting and registering works that include this sort of intermixing, and they’re doing their best, or in some cases maybe not their best, to delineate between the parts that they are responsible for and what the AI systems have done. And so I think that needs some deep rethinking in terms of how registration works in this modern era, and I don’t have a great answer. I do think, though, that the systems sort of automatically collect what is happening when you are using AI. They collect information about your prompts and how the process is working, at least on a superficial level. So I do think we have information on that front that could be helpful in terms of registering works.
Pamela Samuelson 1:04:08
I have a second question for you. One of the things that we know about Creative Commons and about other kinds of open licenses is that they depend on copyright: here’s a copyright, now I’m going to slap this open license on it. But the ability to slap the license on partly depends on the availability of copyright, and so if you have AI-generated code that actually isn’t copyrightable, at least under the policy prevailing today, and then you slap a license on it, we’re back to ProCD all over again, aren’t we? Do you have any comments on that?
Clark Asay 1:04:51
I think it’s a bad place to be. Yeah, and I think this gets back to my comment during the presentation that it is a distinct and perhaps likely possibility that the industry, the Creative Commons and open source software developers, will just continue as usual, and they’ll sort of ignore some of these copyright complications. And so they slap the license on, and technically, perhaps there’s nothing to license because it’s AI-generated code. But maybe that’s a problem for lawyers or litigation down the road, and you get another ProCD type of approach by a court. So maybe norms play a more important role in this space than the actual technicalities of copyright law, but adapting copyright law could help avoid the ProCD type of outcome.
Pamela Samuelson 1:05:53
Great. Cooper, I wonder if you would say a few words, less about your paper, and more about the role of developers of these generative AI systems, and technologists who work with them, and their perspectives on some of these copyright issues. I think a lot of the people in the computing field just wish that this whole thing would go away. And now there are 100 lawsuits, and you’ve been one of the few people who’s been willing to engage and cross the boundaries a little bit. Is that a lonely exercise?
A. Feder Cooper 1:06:31
I didn’t realize this was gonna become a therapy session. Okay, so I think it’s actually mixed. I’m not gonna pretend to speak for all computer scientists or machine learning people, or people in academia or in industry, but in terms of the folks that I speak to, people really do just want this to go away for the most part. I was at a wedding recently where someone at one of these companies didn’t realize they were sitting next to me, and he was talking about being subpoenaed, and, can’t this just stop already? And I was like, I’m not the right person to talk to about that. The good news is, I think that there is a growing research community of folks who are really interested in these issues, not just because of the copyright aspects of it all, but because there are really serious questions about labor and dispossession that people are interested in, for which copyright has become this proxy battle for some folks on the academic side. So as I mentioned when I was speaking, Jane Ginsburg and some of my colleagues in machine learning are also very interested in some of these issues, and have been doing work in this space. Is it lonely? Yeah, it’s still a little bit lonely. But I think that it’s still important, because we still don’t really understand basic aspects of model behavior, and I really do want to underscore that even just this memorization stuff really is mysterious. I was supposed to stop working on it two years ago, and I’m still saying, oh, it’s my last paper, it’s my last paper, and still working on it. If we can’t understand basic facts about these models, I’m also just not really sure how we can responsibly come up with policy decisions around them.
Pamela Samuelson 1:08:14
I’d love to open it now to questions from the audience so I see several hands. Can I get some microphones here?
Audience 1:08:25
So thanks, everyone. My question is for Cooper. How do you feel as a careful academic, which is how I read your work, when you see your work being used to say, and the quote is, “large language models don’t learn, they copy”?
A. Feder Cooper 1:08:48
(inaudible)
Audience 1:08:48
I’m sorry, is it not coming through?
A. Feder Cooper 1:08:49
I’m having trouble hearing you. Say it a little bit louder.
Audience 1:08:54
Yeah. So how do you-
A. Feder Cooper 1:08:55
Acoustics in this room are not great, so you really need to talk up.
Audience 1:08:59
All right. How do you feel when you see your work being used as evidence that, and I’m quoting from The Atlantic, “large language models don’t learn, they copy”?
A. Feder Cooper 1:09:10
You’ve caught me. I never, ever say anything definitive about this. I don’t feel good about it, right? Like, we spend a ton of time doing research, and I write very long papers. I’m famous for writing, unfortunately, very long papers, because I want to be very careful with the work that I do. And reducing these things to one sentence is very challenging. I don’t think that that sentence is an accurate gloss of what models are doing. Having said that, I still think it’s important to do this work, because it is illustrating information about model behavior that we didn’t know before. I have a paper that I’m working on that I’ll talk about with you offline that is sort of responding to this, but it’s in progress right now.
Pamela Samuelson 1:10:02
Quite a few hands there.
Audience 1:10:06
First off, this panel was outstanding. Thanks, Pam, and to all of the panelists. My question is for Clark. In the early days of both free and open source software and Creative Commons, I think some of the advocates saw it as a second-best solution, and the best solution would be for software to be in the public domain, for copyright not to apply automatically, as it did as part of the aftermath of the 1976 Act. And so I’m curious how that side of the community should feel about this. I would think maybe great, or maybe, you know, in the decades since, we’ve seen so much utility from being able to leverage what you can leverage from having a copyright, in terms of the viral nature and the share-alike, etc. And also, I just want to say “Copyleft Without Copyright” is a great article title, and if you don’t want it, I would like it.
Clark Asay 1:10:59
Thank you. Actually, one of my first academic articles was arguing that the open source software movement should ditch open source software licenses and dedicate everything to the public domain. And I guess I might just be biased in favor of what has proven to work. I guess my worry is, I mean, Creative Commons is a different beast, just because if it goes into the public domain, then it’s there, and you don’t have patents and trade secrecy necessarily playing a role. I think with software, especially with how the open source software movement has been corporatized, if that’s a word, I just made it up, but where corporate influence is so heavy in the open source software world, I think it’s very unlikely that they will not utilize trade secrecy and patents to the extent they can. Now that might be fine. If the norms of openness continue, if open collaboration norms continue, the fact that copyright is even less robust than it has been maybe is not a problem. Maybe with that countervailing motivation that I spoke to, where it’s like, this really works, let’s keep doing it on certain parts of the infrastructure, then maybe my concerns about trade secrecy and patents are not as relevant as I thought they were. But that’s sort of the distinguishing thing, I think, that comes into play with software in particular.
Audience 1:12:58
Hi, question for Cooper. Given that we know the deduplication of training data is supposed to be a technique that lessens memorization in models, have you thought about the fact that the first Harry Potter book appears to be one of the most duplicated books on the web, as any search query will bring up many, many copies? If that’s not the explanation for your team’s better results on that work as compared with others, then what is?
A. Feder Cooper 1:13:32
So there’s two things. In response to the first thing you said, there are, as far as I know, no foolproof mechanisms for promising that a certain piece of training data is not going to be memorized when you train a model, so I just want to say that. There are people doing research in this area, but it’s a set of open questions. Then there’s the question of duplicates in the training data, which is what you’re getting at, and we absolutely know that duplicates impact the risk of memorization. It is still astonishing, from a technical standpoint, that from the first few lines of a text you could pull out an entire book from a model. I’m still astonished by that finding, and I want to point out that if all of these models are trained on the web, loosely speaking, this is not how I view it, but loosely speaking, wouldn’t they all exhibit this behavior if they’re in the same size class? That is not the case for this finding. That’s why I scoped it specifically to say it was this model, this book. It’s also not just that book. It’s plenty of other texts as well. So duplicates end up being part of the story here. I can loosely gesture to it and say, yes, we know that increases training data memorization risk, but to this specific behavior, I can’t really connect those dots without a lot more information. And again, we wrote this paper, and Harry Potter and the Sorcerer’s Stone has kind of become our machine learning community’s test case for a lot of things, but I really do want to underscore this is not the only text that we were able to do this for. That’s also a popular text, but the only other text that I’ve tried to do this with is Ta-Nehisi Coates’s essay, The Case for Reparations. 
I did that because he was one of the Kadrey plaintiffs, and without any modifications to the code other than the prompt that I used, I got almost the entirety of that essay from the exact same model. And that’s maybe very famous, but not Harry Potter and the Sorcerer’s Stone level famous.
Audience 1:15:32
Hi. I think my question is for Jule. When you were doing your slide and talking about how the business models have changed, it seemed like your point was kind of that copyright wasn’t the operating power guaranteeing the business model. But I want to push back on that a little bit, because copyright is still this monopolistic power that becomes this lever against people doing other things. And I want to bring into the discussion the idea of consumer protection, because all of these things are about, we can assure ourselves of some profit model because we can stop certain behaviors. And I think that all rolls back to copyright as being the lever that they’re using, but they’re stopping some behaviors. And I think from a public policy standpoint, we need to pay attention to the consumer protection effects of stopping these consumer behaviors, because we end up stopping legitimate consumer behaviors to access software they legitimately bought, et cetera. So I guess it’s two questions. Where does consumer protection go in? And what do you think about the idea that even for these other models, it’s still copyright? What if you took copyright out of the equation? Would these business models continue to survive?
Jule Sigall 1:16:53
Yeah, it’s a good point. And you know, obviously, in going through something in 12 to 15 minutes, you elide a lot of things that are the reality. And there’s a lot of overlap over these eras as to what was there and what was not. The right question is the one you’re asking, which is, what role does copyright play as a backdrop, copyright almost as a meme in people’s minds, as to what they can and can’t do with software? I look at that question as a behavioral one. The question is, what behavior is copyright shaping or not shaping? And it is possible, and I think it’s a reality, that the fact that copyright exists, and people think about copyright, and I certainly think executives and engineers and consumers and others have this vision or understanding of copyright, can actually change their behavior, or what they do or don’t do with particular software. When you look at it that way, though, the question is, how do you as a business use that fact? What does that do to your strategic choices? I think one of them, and I think this is a big, relatively clear difference between the software industry and other creative industries, goes to your point where you said, how do you stop some potential consumer behavior that you see as threatening to your business model? The interesting thing, I think, about software companies like Microsoft, for example, is that Microsoft housed its anti-piracy efforts inside the marketing departments of the company, not in the legal or policy parts of the company, because the goal was not to stop people from using pirated software. The goal was to get them to use Microsoft software. Right? I found in my work that that model wasn’t really followed by other creative industries, that most of the anti-piracy stuff was in legal or policy. 
So there wasn’t a channel for taking people who were consumers, who were, quote, unquote, pirating software, and moving them into authorized channels. It was more of what you’re suggesting, I think, which is you want them to stop doing certain things which they might be entirely legally entitled to do, and should be able to do for very good reasons. So it’s a long-winded answer. I don’t know if I answered your question. I do think the software industry, maybe I’m wrong on this, but I think the software industry across the board, not just big companies like Microsoft, but everyone, has done a much better job of trying to understand the actual behavior of users with their creative works and adapt to that, and come up with models that adapt to that, as opposed to saying, well, I’ve already said I’m going to be a copyright computer producer, so I need to change that behavior in some way to conform to that. I do think software, by and large, has been better at saying, let’s adapt to what people are actually doing or not doing with our software, and figure out if we can make a business out of that.
Audience 1:19:56
Hey, good morning, and thank you. This is a question for Clark that I think follows up on Pam’s question. Suppose I’m providing legal counsel to the manager of an open source project, and I say, look, all you have to do is make sure that there’s a certain amount of human coding mixed in with whatever has been produced by AI. You register that, and you disclose to the Copyright Office that some of it’s produced by AI. You do not have to disclose line by line. The disclosure that is in the registration is very vague. It does not disclose which lines are our code. You make sure that you don’t have comments in the code itself that say, this is the human-authored part. And then when it’s released, people would be at risk of not being able to discover which is the human-authored part, and so they have to go ahead and respect copyright, because they can’t easily figure it out. I mean, it seems to me that is the counsel I would probably give. Luckily, I’m not practicing law, but that’s the counsel I would give, and I think it’s pretty effective. And if that’s pretty effective, then I don’t see this current legal regime threatening open source projects that much. Thanks.
Clark Asay 1:21:18
So I think that may be right. I think it might be more about the breakdown. I mean, I think the norms are what are important in this industry. And so if the norms break down based on perceptions about copyrightability, that’s probably more important than the technicalities of whether and how you can register this or not, and how you go about that. So my bigger fear is this issue of, well, this isn’t copyrightable, and the governing mechanism that copyright plays a role in, sustaining open collaboration, more than the technicalities of registration itself. Does that make sense? Ish, yeah. Yeah, right. I think that’s right. And this is why, if Heather Meeker, for those of you who know her, were here, she’d be mad at me for even suggesting that open source software faces any sort of demise or threat from this. I do think, again, just to re-emphasize, copyright is a legal institution, but in this space, I think the main role it’s played is norm setting and helping facilitate norms, and to the extent that those are threatened by uncopyrightability, or perceptions of uncopyrightability, then that might push parties, especially corporations, to trade secrecy and patents.
Pamela Samuelson 1:23:35
This is a little bit of an old problem, though. I think back to some of the cases involving SAS Institute. The original SAS program was actually in the public domain. I think it must have been published without notice or something, or as government sponsored; whatever it was, it was in the public domain. And then they made a derivative work. And the derivative work then was the work that was alleged to be infringed by somebody who was doing similar software and trying to say, oh, this part, this part, this part was in the public domain, and therefore, if I copied that, that’s not really a problem. And in the more recent case, the SAS versus WPL case, the defendant’s expert basically talked about how this, that, and the other thing are in the public domain, because they came from the original SAS program that was in the public domain. So in some sense, it’s not an entirely new problem.
Clark Asay 1:24:42
Yeah, and I guess the role of AI is this: we already have a relatively thin copyright, and AI hollows it out even more. Could you claim some infringement? Possibly. It just becomes, I think, a harder task to prove.
Audience 1:24:58
(Unable to transcribe)
Clark Asay 1:25:04
To the outputs.
Pamela Samuelson 1:25:05
I think it’s time for at least one more question.
Audience 1:25:18
Pam, I hope you can hear me. Yes, you can. It was fascinating to hear you say that you were wrong about sui generis. And of course, the reason why you can say it is that after a difficult period, the courts rolled up their sleeves, went in, took what Congress had done, did their job, and developed and adapted it into exactly a kind of sui generis regime within copyright, within certain limitations. The issue is, do you think that courts are still willing to do it today? They seem much less willing in many contexts. You know, digital first sale is just one example. I think there are others.
Pamela Samuelson 1:26:05
Yeah, we ended up with sui generis within the copyright system.
Audience 1:26:08
Yeah, but are courts still willing to engage in similar activities today? Or is it just scary?
Pamela Samuelson 1:26:13
I just thought people would find it interesting that I confess error.
Audience 1:26:22
A quick follow-up to Bob’s point. I agree it can be made a non-issue. I don’t think many people here focus on a portion of the 1976 Act that continues to this date, 17 U.S.C. section 403, that says, when your work incorporates products of the United States government, your work has to have a copyright notice that specifically points to which aspects are works of the United States government. Any article that quotes a paragraph from a case is a copyrighted work that’s quoting a work of the United States government, and under 17 U.S.C. section 403, should specifically identify that. I put it to you that very few works actually do that, and it’s been effectively a non-issue. And I think Bob has pointed the way for this to be a non-issue as well.
Pamela Samuelson 1:27:17
Great. Thank you. With that wonderful commentary and question, let’s thank the panel for a really stimulating comment.