/ #software / #automation / #API
#ZenRows turns hours of tedious lead hunting into a fast, reliable, and automated data workflow that fills your data pipelines.
Cloudflare will block AI web content crawlers by default https://blog.elhacker.net/2025/07/cloudflare-bloqueara-por-defecto-los-rastreadores-bots-trafico-web.html #inteligenciaartificial #CloudFlare #rastreo #scraper #robots #CDN
»Cloudflare Introduces Default Blocking of A.I. Data Scrapers«
Nice, but it will hardly work. The reason: advanced scrapers use browser emulation and rotating IPs to pose as real users and evade technical detection. Since this is only a server-side measure with no legal force, such actors can ignore the blocks easily and without consequences.
https://www.nytimes.com/2025/07/01/technology/cloudflare-ai-data.html
/kuk
Mastodon prohibits training AIs on content from its own platform.
New #TermsOfService against #AITraining: from July 1, #Mastodon explicitly bans #scraping and the use of user data for #training #AIModels on its #MainServer.
Clear #protection of the #community: the new rules prohibit automated tools such as #bots and #scrapers from harvesting data, with the exception of normal search engines and browsers. (1/2)
@spielleitung well, professionally I make it easier on myself and simply block all known #scrapers...
https://github.com/greyhat-academy/lists.d/blob/main/scrapers.ipv4.block.list.tsv
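A blocklist like this can be checked in a few lines. The sketch below is only an illustration: it assumes a TSV whose first column holds an IPv4 address or CIDR range, which may not match the actual layout of the linked file, and the sample rows and the names `load_blocklist` and `is_blocked` are hypothetical.

```python
import ipaddress

def load_blocklist(tsv_text):
    """Parse TSV text into ip_network objects, reading the first column only.

    Assumption: column 1 is an IPv4 address or CIDR range. The real list
    may be laid out differently.
    """
    networks = []
    for line in tsv_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        first_column = line.split("\t")[0].strip()
        try:
            # strict=False tolerates entries with host bits set, e.g. 192.0.2.5/24
            networks.append(ipaddress.ip_network(first_column, strict=False))
        except ValueError:
            continue  # skip header rows and malformed entries
    return networks

def is_blocked(ip, networks):
    """Return True if the address falls inside any listed network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

# Hypothetical sample rows using documentation-only address ranges.
SAMPLE = (
    "# hypothetical sample data\n"
    "192.0.2.0/24\texample-bot\n"
    "198.51.100.7\tother-bot\n"
)

networks = load_blocklist(SAMPLE)
print(is_blocked("192.0.2.44", networks))
print(is_blocked("203.0.113.9", networks))
```

In practice a list like this would more likely be fed to a firewall or reverse proxy than checked in application code; the sketch only shows the matching logic.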
The most disgusting feature of this relatively new #AI #scraper plague is that it is about to defile everything we like about the *good* internet.
Images with relevant #AltText? Perfect training materials for text-to-image generative models.
Static webpages? No #Anubis - no problem to scrape.
#Anubis uses proof-of-work ( #PoW ), which implies either #JavaScript or manual instructions. No, it is a good solution... Best of the worst (as if there were any good ones...)
In the last few days I learned that (1) #Tor has a #PoW mechanism and (2) Anubis seems to somehow whitelist the #lynx browser, letting no-JS Lynx users in (a big favour for #accessibility and #smolweb ). Good (let's hope both of these persist).
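For readers unfamiliar with proof-of-work, here is a minimal hashcash-style sketch of the general idea: the client must find a nonce whose hash meets a difficulty target, which is cheap for one visitor and costly at scraper scale. It is a generic illustration, not Anubis's or Tor's actual protocol; the challenge string, the difficulty value, and the function names are assumptions.

```python
import hashlib
from itertools import count

def digest_for(challenge: str, nonce: int) -> str:
    """SHA-256 hex digest of the challenge concatenated with the nonce."""
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()

def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force the smallest nonce whose digest starts
    with `difficulty` zero hex digits. Cost grows roughly 16x per digit."""
    prefix = "0" * difficulty
    for nonce in count():
        if digest_for(challenge, nonce).startswith(prefix):
            return nonce

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    return digest_for(challenge, nonce).startswith("0" * difficulty)

# Hypothetical challenge; a real server would issue a fresh random one.
nonce = solve("example-challenge", 3)
print(nonce, verify("example-challenge", nonce, 3))
```

The asymmetry is the point: solving takes thousands of hashes on average at this difficulty, while verifying takes one. Doing the solving in a browser is what usually requires JavaScript.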
MWoffliner, the @mediawiki #scraper has been released in version 1.15!
1.15 brings a significant amount of improvements:
* Support for the widely used (outside Wikimedia) "ActionParse" API
* Use of the latest libzim (we were stuck on an older version), which fixes many suggestion problems with non-Latin alphabets
* Move to Node.js 24 + many install fixes
* Better, more sophisticated remote error handling
Full changelog at https://github.com/openzim/mwoffliner/releases/tag/v1.15.0
Available as container image and Npmjs package!
3/
For more on scraping (as in web-scraping) see here:
https://mastodon.social/@reiver/114353728684249608
CC: @404mediaco
2/
Scraping (as in Web Scraping) is the act of extracting data from HTML web-pages where the data is NOT machine-legible.
If the data, even in an HTML web-page, is in a machine-legible format, then it is NOT scraping.
...
And, getting data in JSON (key-value pairs) is definitely NOT scraping — as JSON's purpose is to communicate data in a machine-legible manner.
CC: @404mediaco
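The distinction drawn in this post can be illustrated with a small sketch that gets the same two messages in both ways. The HTML snippet, the JSON body, the `msg` class name, and the function names are all made up for the example.

```python
import json
from html.parser import HTMLParser

class MessageScraper(HTMLParser):
    """Scraping: dig the wanted text out of markup meant for human readers."""

    def __init__(self):
        super().__init__()
        self.in_msg = False
        self.messages = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and ("class", "msg") in attrs:
            self.in_msg = True
            self.messages.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_msg = False

    def handle_data(self, data):
        if self.in_msg:
            self.messages[-1] += data

# Hypothetical inputs carrying the same two messages.
HTML_PAGE = (
    '<ul><li class="msg">hello</li>'
    '<li class="ad">buy now</li>'
    '<li class="msg">world</li></ul>'
)
JSON_BODY = '{"messages": ["hello", "world"]}'

def from_html(page):
    """Guess at page structure and filter out what is not data."""
    parser = MessageScraper()
    parser.feed(page)
    parser.close()
    return parser.messages

def from_json(body):
    """No guessing: the format already says what the data is."""
    return json.loads(body)["messages"]

print(from_html(HTML_PAGE))
print(from_json(JSON_BODY))
```

The HTML path depends on markup details that can change without notice, while the JSON path reads a structure that was published for machines, which is the line the post draws.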
1/
If these researchers used a typical HTTP-based API that returns JSON, then —
What these researchers did is NOT scraping.
CC: @404mediaco
RE: https://www.404media.co/researchers-scrape-2-billion-discord-messages-and-publish-them-online/
And it's much better
fuck you OpenAI and Amazon
Update: I reported the bot. Thanks.
A Mastodon bot account at mastodon.cloud scans the fediverse, scrapes selected web pages shared there, rewrites them with AI, posts them to its own site, and shares the rewritten AI slop on Mastodon as tech news. The bot scraped a post of mine (including the attached image) within minutes of my federated blog publishing it.
Is it worth flagging the bot and reporting it to its instance? Are the mods likely to take action?
Anubis: self-hostable scraper defense software
Wikimedia is buckling under the weight of AI scrapers.
"Our content is free, our infrastructure is not".
https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects