Website Anti-Scraping Upgrades: Combating the Surge in LLM Training Data Bots

In early 2025, a flood of web scrapers descended on websites to collect LLM training data, forcing administrators to strengthen their anti-scraping measures. These bots predominantly send outdated browser user agents, particularly old Chrome versions, and their request volume places heavy strain on web servers. The article describes how the author identifies and blocks these scrapers by detecting suspicious browser versions in the User-Agent header. It specifically calls out the archive.* archival sites, which crawl with fake user agents and from disguised IP addresses, and recommends archive.org as the better-behaved archival service. The piece shows the real-world impact of AI training data collection on website operations and offers the tech community frontline experience in dealing with LLM training data scrapers.
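The version-based detection described above could look roughly like the following. This is a minimal sketch and not the author's actual rules: the cutoff `MIN_CHROME_MAJOR` and the function names `chrome_major` and `is_suspicious` are assumptions made for illustration, and the summary does not say which versions the author blocks.

```python
import re
from typing import Optional

# Hypothetical cutoff: Chrome major versions below this are treated as
# suspiciously old. The real threshold used by the author is not stated.
MIN_CHROME_MAJOR = 120

# Matches the "Chrome/<major>" token found in Chrome-family User-Agent strings.
_CHROME_RE = re.compile(r"Chrome/(\d+)")


def chrome_major(user_agent: str) -> Optional[int]:
    """Return the Chrome major version claimed by a User-Agent, or None."""
    match = _CHROME_RE.search(user_agent)
    return int(match.group(1)) if match else None


def is_suspicious(user_agent: str, min_major: int = MIN_CHROME_MAJOR) -> bool:
    """Flag User-Agents that claim a Chrome version older than min_major."""
    major = chrome_major(user_agent)
    return major is not None and major < min_major


old_chrome = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"
)
firefox = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"

print(is_suspicious(old_chrome))  # claims Chrome 90, below the cutoff
print(is_suspicious(firefox))     # no Chrome token, so not flagged by this rule
```

A check like this only looks at what the client claims to be, so it would normally be one signal among several (request rate, IP reputation) rather than a complete defense, and the cutoff needs periodic updating as real browsers move forward.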

Original link: Hacker News

Reproduction without permission is prohibited: Toy Tech Blog » Website Anti-Scraping Upgrades: Combating the Surge in LLM Training Data Bots