
There seems to be a slight contradiction between the following two statements:

1) Interestingly, OpenAI, and anyone training LLMs, invests a lot of energy in actively deduplicating the training data because, as Lee et al. (2022) show, proactive deduplication makes LLMs perform better on downstream tasks.

2) LLM builders are actively trying to minimize memorization because doing so is good for their bottom line.

While memorization is not the intended goal, certain data is clearly important to the model's success. That importance makes the data more valuable; the model makers know this, and the NYT surely knows it too.
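
For context on statement 1: the deduplication Lee et al. (2022) describe is more involved (exact substring matching via suffix arrays, plus MinHash for near-duplicates), but the basic document-level idea can be sketched in a few lines. This is a minimal illustration, not their pipeline; the function name and toy corpus are made up:

```python
import hashlib

def dedup_exact(documents):
    """Drop exact-duplicate documents by hashing normalized text.

    A minimal sketch only; real pipelines (e.g., Lee et al. 2022)
    also catch near-duplicates and repeated substrings.
    """
    seen = set()
    unique = []
    for doc in documents:
        # Normalize lightly so trivially identical docs hash the same.
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["The cat sat.", "the cat sat.", "A different document."]
print(dedup_exact(corpus))  # ['The cat sat.', 'A different document.']
```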
