Researchers at Stanford University say a dataset used to train AI image-generation tools contains at least 1,008 verified instances of child sexual abuse material (CSAM). The researchers noted that the presence of CSAM in the dataset could allow AI models trained on it to generate new, and even realistic, material of this kind.
LAION, the non-profit organization that created the dataset, said it applies a zero-tolerance policy toward illegal content and, as a precaution, has temporarily taken its datasets offline. The organization added that it had built filters to detect and remove illegal content before first publishing its datasets. Even that, it seems, was not enough.
According to earlier reports, the dataset in question, LAION-5B, includes millions of images depicting pornography, violence, child nudity, racist memes, hate symbols, copyrighted artwork, and material scraped from private company websites. In total it covers over 5 billion images and associated labels (the dataset itself does not contain any images, but rather links to images together with their alt text). LAION founder Christoph Schuhmann said earlier this year that while he was not aware of any CSAM in the dataset, he had not examined the data very closely.
Given the huge volume of data scraped from the Internet, CSAM is difficult to avoid entirely. Yet some AI models freely available online have no safeguards against this type of content. According to the Stanford researchers, such models should be phased out as soon as possible.