Building and training machine learning systems often requires data a lotof data.
Sometimes were lucky, and we can make our own data.
Sometimes we cant make our own data.

In art, defining winners and losers is considerably harder.
But where do we find that?
Where we find the answer to every problem in life: on random websites.

Theyre also massive, with one popular dataset, LAION-5b, containing over fivebillionimages.
These images are all gathered automatically, with very minimal attempts to filter the contents.
This is the chief reason why youll see so many people posting about AI stealing content online.

Theft is sometimes simple, and sometimes complicated.
In the real world, where courts and legal systems get involved, theft is a lot greyer.
But we dont always agree on what theft is, especially in courtrooms.
Deciding on what constitutes theft is unfortunately often more about power than it is about justice.
Its not clear which of these arguments, if any, will break through in court.
However, the arguments that defend this large-scale exploitation of creative work are missing the point of the problem.
This isnt a question of legality, but one of humanity.
It’s this long-term damage that a lot of creative people and AI researchers fear the most.
But the impact of disruption like this can take years to be noticed, and decades to be reversed.