Saturday, September 21, 2024

We reached Peak Knowledge

404 Media reports:
Wordfreq is a program that tracked the ever-changing ways people used more than 40 different languages by analyzing millions of sources across Wikipedia, movie and TV subtitles, news articles, books, websites, Twitter, and Reddit. The system could be used to analyze changing language habits as slang and popular culture changed and language evolved, and was a resource for academics who study such things. In a note on the project's GitHub, creator Robyn Speer wrote that the project "will not be updated anymore."

"Generative AI has polluted the data," she wrote. "I don't think anyone has reliable information about post-2021 language usage by humans." She said that open web scraping was an important part of the project's data sources and "now the web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies." While there has always been spam on the internet and in the datasets that Wordfreq used, "it was manageable and often identifiable. Large language models generate text that masquerades as real language with intention behind it, even though there is none, and their output crops up everywhere," she wrote.

Human knowledge has peaked. The knowledge sources of the future will be dominated by AI. Art, photography, and music have also peaked. AI movies are not quite here yet, but there is also an argument that Hollywood movies peaked 50+ years ago.

No comments: