Judge Sidney Stein affirmed an order compelling OpenAI to produce 20 million anonymized ChatGPT logs in a landmark AI copyright MDL in the Southern District of New York.
U.S. District Judge Sidney H. Stein of the Southern District of New York affirmed a magistrate judge's discovery order requiring OpenAI to produce a 20-million-entry anonymized sample of ChatGPT interaction logs to plaintiffs in a multidistrict copyright infringement proceeding [1]. The ruling rejected OpenAI's proposal to limit production to keyword-filtered logs tied to the plaintiffs' specific works [2]. The court found the full, unfiltered sample relevant under the Federal Rules of Civil Procedure and concluded that anonymization combined with a protective order adequately addresses any residual user privacy concerns [1].
The ruling arises from In re OpenAI, Inc. Copyright Infringement Litigation, a consolidated multidistrict litigation (MDL) before Judge Stein. Magistrate Judge Ona T. Wang issued the underlying order; Judge Stein's affirmance resolves OpenAI's objection to it [1]. Plaintiffs include major news organizations, among them The New York Times and the Chicago Tribune, which contend that ChatGPT was trained on and reproduces their copyrighted journalism without authorization or compensation [2].
The substantive significance turns on fair use. OpenAI has signaled a fair use defense, which requires the court to assess, among other factors, whether ChatGPT's outputs substitute for the original works and thereby harm their market [2]. Judge Stein found that output patterns across 20 million interactions bear directly on that market-harm analysis, and that a plaintiff-specific keyword filter would artificially narrow the evidentiary record in OpenAI's favor [1]. The ruling gives plaintiffs visibility into whether ChatGPT systematically generates content that competes with or displaces publisher content at scale. That evidence could materially strengthen their position on the fourth fair use factor, the effect on the market for the original works, at summary judgment or trial [2].
The production timeline is now operative, and the parties will likely next dispute how the log data is reviewed, categorized, and designated under the protective order [1]. Motion practice over expert analysis of the logs is expected to follow, as the record builds toward dispositive motions on fair use and liability. The case remains an early but consequential data point for how courts apply discovery rules when AI training and output behavior are central to copyright claims across the industry.