There seems to be a slight contradiction in the following statements:
1) Interestingly, OpenAI and anyone training LLMs invest a lot of energy in actively deduplicating content in the training data because, as Lee et al. (2022) show, proactive deduplication of training data makes the LLMs perform better at downstream tasks.
2) LLM builders are actively trying to minimize memorization because doing so is good for their bottom line.
While memorization is not the intended goal, the importance of certain data to the success of the model is clear. It makes the data more valuable; the makers know this and the NYT sure knows this too.
There seems to be a slight contradiction in the following statements:
1) Interestingly, OpenAI and anyone training LLMs invest a lot of energy in actively deduplicating content in the training data because, as Lee et al. (2022) show, proactive deduplication of training data makes the LLMs perform better at downstream tasks.
2) LLM builders are actively trying to minimize memorization because doing so is good for their bottom line.
While memorization is not the intended goal, the importance of certain data to the success of the model is clear. It makes the data more valuable; the makers know this and the NYT sure knows this too.