Reddit is taking a firm stand against what it calls unauthorized AI data scraping by restricting the Internet Archive’s Wayback Machine to archiving only Reddit’s homepage. As of August 11, 2025, most post and comment data is blocked from archival.
The company says the move is directly tied to protecting its growing data licensing business, which now generates about 10% of its total revenue through high-profile deals with Google and OpenAI.
AI-driven data protection
A Reddit spokesperson said the block will remain until the Internet Archive can enforce stricter anti-scraping measures, including removing deleted content and honoring platform policies. Executives have accused AI developers of bypassing licensing deals by scraping from archived pages.
This stance follows Reddit’s 2023 API pricing changes, which made third-party access prohibitively expensive and signaled a long-term strategy to control who uses Reddit’s user-generated content.
Did you know?
The Internet Archive’s Wayback Machine has indexed over 866 billion web pages since its launch in 2001, making it the largest library of archived websites in history.
Licensing is big business
Reddit’s licensing arrangements are lucrative: a $60 million annual agreement with Google and a $70 million deal with OpenAI, together worth around $130 million per year. These AI training partnerships have transformed data access into a significant new revenue channel.
By restricting archival, Reddit can limit free external access and funnel more negotiations toward paid deals, strengthening its control in an AI-hungry market.
Legal enforcement against AI
Reddit has already filed a lawsuit against AI startup Anthropic, accusing it of scraping more than 100,000 posts without permission, including deleted materials. The suit rests on alleged contractual violations rather than copyright claims.
These proactive measures reflect a growing trend among social platforms to set firm ground rules for AI companies looking to harvest their content.
Impact on web preservation
For researchers, historians, and journalists, the block is a blow to the long-term preservation of online discourse. Future Wayback Machine captures will show trending headlines but not the community discussions that defined Reddit’s cultural footprint.
The Internet Archive has yet to comment, but the standoff underscores a broader tension between monetizing data and maintaining an open record of internet history. Whether Reddit restores access depends on the Internet Archive meeting Reddit’s privacy and anti-scraping requirements.