 

Website Anti-Scraping Upgrades: Combating the Surge in LLM Training Data Bots

2025-12-21 | Category: Frontier Outpost


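Since early 2025, a flood of crawlers collecting LLM training data has forced website administrators to strengthen their anti-scraping measures. These crawlers mostly present outdated browser user agents, particularly old Chrome versions, and place heavy load on web servers. The article details how the author blocks them by identifying suspicious browser versions, and specifically notes that archival sites such as archive.* use forged user agents and IP addresses. The author recommends archive.org instead, as a better-behaved archival service. The article shows the practical impact of AI training data collection on website operations and offers the tech community frontline experience in dealing with LLM training data crawlers.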
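The version check described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual rule set: the function names and the cutoff `MIN_PLAUSIBLE_CHROME_MAJOR` are hypothetical, and a real deployment would pick its own threshold and likely combine this with other signals.

```python
import re
from typing import Optional

# Hypothetical cutoff: Chrome major versions below this are treated as
# implausibly old for a real visitor. The source gives no specific number.
MIN_PLAUSIBLE_CHROME_MAJOR = 120

# Matches the "Chrome/<major>." token found in Chrome-family User-Agent strings.
_CHROME_RE = re.compile(r"Chrome/(\d+)\.")


def chrome_major_version(user_agent: str) -> Optional[int]:
    """Return the Chrome major version claimed by the User-Agent, or None."""
    match = _CHROME_RE.search(user_agent)
    return int(match.group(1)) if match else None


def is_suspicious(user_agent: str,
                  min_major: int = MIN_PLAUSIBLE_CHROME_MAJOR) -> bool:
    """Flag requests that claim to be a Chrome version older than min_major."""
    major = chrome_major_version(user_agent)
    return major is not None and major < min_major


if __name__ == "__main__":
    old_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36")
    print(is_suspicious(old_ua))
```

User agents that do not claim to be Chrome are left alone here; a server would typically call `is_suspicious` on the request's `User-Agent` header and reject or challenge the request when it returns `True`.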

Original link: Hacker News


Reproduction without permission is prohibited: Toy Tech Blog » Website Anti-Scraping Upgrades: Combating the Surge in LLM Training Data Bots


Tags: LLM training, anti-scraping, website security


News Briefs

AnyRouter Transparent Proxy Update: New Web Request Monitoring Feature

The open-source project AnyRouter-Transparent-Proxy recently gained a web request monitoring panel. Built on FastAPI, the project aims to resolve the 500 errors that AnyRouter encounters in the Claude Code for VS Code plugin. The new panel shows the success or failure status of API requests and is reachable at http://IP:port/admin. The configuration page is currently read-only, but this does not affect monitoring. The tool gives developers using Claude Code a local transparent proxy for working around API connection issues. The project is open source on GitHub.

Original link: Linux.do

28 minutes ago
Fix AnyRouter Public Claude API Error 520: A Simple Environment Variable Solution

Users of the Claude service on AnyRouter's public platform have recently been hitting error code 520, reported as 'All providers failed, status: 520' with a retry prompt. The community fix is simple: locate Claude's configuration file (on macOS, /Users/username/.claude/settings.json; a similar path on Windows), create it if it does not exist, and add the property 'CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC':true. After saving, restart Claude to restore normal operation. Analysis suggests the issue may stem from API forwarding interference caused by reverse engineering, and that this environment variable filters out the non-essential traffic involved.
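The edit described above can also be scripted. The sketch below is an assumption-laden illustration: the helper name `set_settings_flag` is hypothetical, and it writes the property at the top level of the JSON file exactly as the post describes. The source does not say whether the key should instead sit under a nested section, so check the result against your own setup.

```python
import json
from pathlib import Path

# Property name taken from the post; top-level placement is an assumption.
FLAG = "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC"


def set_settings_flag(settings_path, key=FLAG, value=True):
    """Add key/value to a JSON settings file, creating the file if needed."""
    path = Path(settings_path)
    settings = {}
    if path.exists():
        text = path.read_text(encoding="utf-8").strip()
        if text:
            # Preserve whatever settings are already present.
            settings = json.loads(text)
    settings[key] = value
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

For the macOS path mentioned in the post, a call would look like `set_settings_flag(Path.home() / ".claude" / "settings.json")`, followed by restarting Claude.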

Original link: Linux.do

28 minutes ago
Self-Hosted PostgreSQL: A Cost-Benefit and Technical Advantage Analysis

The author shares two years of experience self-hosting PostgreSQL and argues that cloud database providers have spent the past decade promoting the narrative that 'database hosting is dangerous', when in reality most cloud hosts simply run slightly modified open-source Postgres servers. The article compares self-hosting with cloud database services on cost, performance, reliability, and operational complexity. It offers specific PostgreSQL configuration parameters and tuning recommendations covering memory, connection management, storage, and WAL. Drawing on an actual migration, the author reports that self-hosting PostgreSQL is more cost-effective (saving hundreds of dollars monthly) and also delivers better performance and greater control. The article concludes that self-hosting is not the best choice for every scenario, but is worth considering for most teams when the conditions fit.

Original link: Hacker News

39 minutes ago
AI Video Generation Breakthrough: Student Cafeteria Scene Shows Major Progress

This article shares a high-quality AI-generated video whose plot centers on stalling students who arrive at the cafeteria before meals are ready. The author notes that the video's tone is somewhat unusual, but that it represents significant progress over earlier AI videos, which could not feature speech. The plot is complex and challenging to render, and the author highly recommends watching it. The link points to the actual video file, which may have been produced with Doubao AI. The brief presents it as a sign of how quickly generative AI is developing in content creation, particularly in video synthesis.

Original link: Linux.do

40 minutes ago
Fixing Antigravity Login Redirect Issues: A Tested 3-Step Solution

This article offers a practical fix for users who hit browser redirect issues and region restrictions when using Antigravity. The three steps are: 1. Use Proxifier to resolve the redirect problem; 2. Change the region associated with your Google account by submitting a request through the official form; 3. Enable proxy software and select a node that matches your account's region. The method is based on real user experience: login succeeded after the region was changed from Hong Kong to Singapore. The proxy software must run in TUN mode or global proxy mode. The post includes detailed step-by-step instructions aimed at technically proficient users.

Original link: Linux.do

40 minutes ago