No One Wants Apple To Scrape Their Websites for AI Training

Wired reports that, as of 2026, a growing number of major websites—including influential news publishers and leading social media platforms—are blocking Apple’s web crawler from scraping their pages for AI training content.

Publishers Draw the Line

According to the report, media companies that have updated their robots.txt files to exclude Applebot-Extended include The New York Times, The Atlantic, The Financial Times, Gannett, Vox Media, and Condé Nast.

Social Platforms Follow Suit

On the social media side, Facebook, Instagram, and Tumblr have all confirmed they are blocking Apple from scraping their sites, as has the long-standing platform Craigslist.

Robots.txt as a Battleground

Robots.txt files have become an increasingly revealing lens on the digital politics of AI. Some of these companies—including Vox, Condé Nast, and The Atlantic—have signed content licensing deals with OpenAI, while The New York Times has taken a firm stance, actively suing OpenAI for copyright infringement.

Facebook and Instagram are owned by Meta, a direct competitor to Apple in the AI space. Meanwhile, user-generated platforms such as Tumblr and Craigslist hold valuable repositories of human-created data. In parallel, Apple has already entered a deal with OpenAI to integrate the chatbot ChatGPT into Apple experiences.

The AI industry remains fiercely competitive, especially when it comes to access to high-quality, human-generated training data. As relationships between AI companies and content sources continue to evolve, the decisions about where crawlers like Applebot are permitted to operate offer a clear window into strategic priorities on both sides.

Applebot-Extended: Opt-Out by Design

According to Wired, these sites have specifically blocked “Applebot-Extended,” a crawler that, per an Apple blog post, gives publishers the explicit option to “opt out of their website content being used to train Apple’s foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools.”

An Apple spokesperson confirmed to Wired that blocking Applebot-Extended does not stop the original Applebot from crawling a site; it simply prevents any collected data from being used to train Apple’s AI models.

Applebot itself continues to gather data for Siri and Spotlight, underscoring Apple’s careful separation of general web indexing from generative AI training.
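
To make the distinction concrete, here is a minimal sketch of what such an opt-out could look like and how it evaluates. The robots.txt contents, the example.com URL, and the is_allowed helper are illustrative assumptions, not taken from any publisher named in the report. The check uses Python's standard urllib.robotparser, whose matching rules may differ in detail from how Apple itself interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher (not a real site's file).
# It opts out of AI training via Applebot-Extended and allows everything else.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given user agent is permitted for the URL.

    Illustrative helper: parses the robots.txt text directly instead of
    fetching it over the network.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/story.html"  # placeholder URL
    for agent in ("Applebot", "Applebot-Extended"):
        status = "allowed" if is_allowed(ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent}: {status}")
```

Under these assumptions, the page stays open to Applebot, and therefore to Siri and Spotlight indexing, while the Applebot-Extended rule signals that the same content should not be used for model training.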

Legal Context and Strategic Caution

The New York Times is not alone in pursuing legal action against AI developers. Given ongoing litigation across the industry, Apple’s decision to offer a clear opt-out mechanism appears designed to keep potentially disputed content out of its training datasets, especially while it relies on OpenAI for key product features. The opt-out serves as an early indicator of how publishers and AI companies alike are navigating copyright risks in 2026.