mastodontech.de is one of many independent Mastodon servers you can use to take part in the Fediverse.
Open to everyone (over 16) and provided by Markus'Blog

#webscraping

New Open-Source Tool Spotlight 🚨🚨🚨

Scrapling is redefining Python web scraping. Adaptive, stealthy, and fast, it can bypass anti-bot measures while auto-tracking changes in website structure. A standout: 4.5x faster than AutoScraper for text-based extractions. #Python #WebScraping

🔗 Project link on #GitHub 👉 github.com/D4Vinci/Scrapling

#Infosec #Cybersecurity #Software #Technology #News #CTF #Cybersecuritycareer #hacking #redteam #blueteam #purpleteam #tips #opensource #cloudsecurity

✨
🔐 P.S. Found this helpful? Tap Follow for more cybersecurity tips and insights! I share weekly content for professionals and people who want to get into cyber. Happy hacking 💻🏴‍☠️

"The report, titled “Are AI Bots Knocking Cultural Heritage Offline?” was written by Weinberg of the GLAM-E Lab, a joint initiative between the Centre for Science, Culture and the Law at the University of Exeter and the Engelberg Center on Innovation Law & Policy at NYU Law, which works with smaller cultural institutions and community organizations to build open access capacity and expertise. GLAM is an acronym for galleries, libraries, archives, and museums. The report is based on a survey of 43 institutions with open online resources and collections in Europe, North America, and Oceania. Respondents also shared data and analytics, and some followed up with individual interviews. The data is anonymized so institutions could share information more freely, and to prevent AI bot operators from undermining their countermeasures.

Of the 43 respondents, 39 said they had experienced a recent increase in traffic. Twenty-seven of those 39 attributed the increase in traffic to AI training data bots, with an additional seven saying the AI bots could be contributing to the increase.

“Multiple respondents compared the behavior of the swarming bots to more traditional online behavior such as Distributed Denial of Service (DDoS) attacks designed to maliciously drive unsustainable levels of traffic to a server, effectively taking it offline,” the report said. “Like a DDoS incident, the swarms quickly overwhelm the collections, knocking servers offline and forcing administrators to scramble to implement countermeasures. As one respondent noted, ‘If they wanted us dead, we’d be dead.’”"

404media.co/ai-scraping-bots-a

404 Media · AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums
“This is a moment where that community feels collectively under threat and isn't sure what the process is for solving the problem.”

Are AI bots overwhelming digital collections?
A new GLAM-E Lab report shows how scrapers for AI training datasets are putting real strain on the infrastructures of galleries, libraries, archives, and museums. Technical bottlenecks, ethical dilemmas, and escalating costs—open culture is under pressure.
Read the full analysis:
glamelab.org/products/are-ai-b
#DigitalHeritage #GLAM #WebScraping #OpenAccess #CulturalData #MuseTech #DigitalHumanities #GLAMlab

GLAM-E Lab · Are AI Bots Knocking Cultural Heritage Offline?

"To reiterate, whatever one's opinion of these particular AI tools, scraping itself is not the problem. Automated access is a fundamental technique of archivists, computer scientists, and everyday users that we hope is here to stay—as long as it can be done non-destructively. However, we realize that not all implementers will follow our suggestions for bots above, and that our mitigations are both technically advanced and incomplete.

Because we see so many bots operating for the same purpose at the same time, it seems there's an opportunity here to provide these automated data consumers with tailored data providers, removing the need for every AI company to scrape every website, seemingly, every day.

And on the operators' end, we hope to see more web-hosting and framework technology that is built with an awareness of these issues from day one, perhaps building in responses like just-in-time static content generation or dedicated endpoints for crawlers."
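
One way a crawler might try to stay non-destructive is sketched below. This is an illustration only: the quoted post does not list its suggestions in this excerpt, so the use of robots.txt, the `Crawl-delay` directive (a non-standard extension that not every site or crawler honors), the `example-bot` user agent, and the helper names `build_policy` and `plan_fetch` are all assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /search
"""


def build_policy(robots_txt):
    """Parse robots.txt text into a RobotFileParser without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetch(parser, user_agent, url, default_delay=1.0):
    """Return the number of seconds to wait before fetching url, or None if disallowed."""
    if not parser.can_fetch(user_agent, url):
        return None
    delay = parser.crawl_delay(user_agent)
    # Fall back to an assumed conservative default when the site states no delay.
    return float(delay) if delay is not None else default_delay


policy = build_policy(ROBOTS_TXT)
allowed_wait = plan_fetch(policy, "example-bot", "https://example.org/collections/1")
blocked_wait = plan_fetch(policy, "example-bot", "https://example.org/search?q=x")
```

A caller would sleep for the returned number of seconds between requests and skip any URL for which `plan_fetch` returns `None`.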

eff.org/deeplinks/2025/06/keep

Electronic Frontier Foundation · Keeping the Web Up Under the Weight of AI Crawlers
If you run a site on the open web, chances are you've noticed a big increase in traffic over the past few months, whether or not your site has been getting more viewers, and you're not alone. Operators everywhere have observed a drastic increase in automated traffic (bots) and in most cases attribute...
Replied in thread

@georgfischer Stealing may be the wrong word, but commercial LLMs cause harm and profit from the work of others without paying for it in any way. Their crawlers overload servers, and they feed on everything they see, including works that are not licensed for commercial reuse. Whatever the right word for this is, I find this practice parasitic and unethical. #llms #Webscraping #aislop #chatgpt #cclizenzen

Continued thread

2/

Scraping (as in Web Scraping) is the act of extracting data from HTML web-pages where the data is NOT machine-legible.

If the data, even in an HTML web-page, is in a machine-legible format, then it is NOT scraping.

...

And, getting data in JSON (key-value pairs) is definitely NOT scraping — as JSON's purpose is to communicate data in a machine-legible manner.
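
The distinction drawn in this post can be sketched in a few lines of Python. The sample page, the JSON payload, and the names `SpanCollector`, `scrape_html`, and `read_json` are all hypothetical and only illustrate the point: the HTML path has to guess at layout, while the JSON path reads keys the publisher defined.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample data: the same record as presentation HTML and as JSON.
HTML_PAGE = """
<html><body>
  <div class="card"><b>Title:</b> <span>Open Archive</span></div>
  <div class="card"><b>Visitors:</b> <span>1400</span></div>
</body></html>
"""
JSON_PAYLOAD = '{"title": "Open Archive", "visitors": 1400}'


class SpanCollector(HTMLParser):
    """Collects the text inside every <span> element."""

    def __init__(self):
        super().__init__()
        self._in_span = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False

    def handle_data(self, data):
        if self._in_span and data.strip():
            self.values.append(data.strip())


def scrape_html(page):
    """Scraping: recover the record from markup meant for human readers."""
    collector = SpanCollector()
    collector.feed(page)
    # Relies on element order, which is a guess about layout and breaks on redesign.
    title, visitors = collector.values
    return {"title": title, "visitors": int(visitors)}


def read_json(payload):
    """Not scraping in the post's sense: the format already names each field."""
    return json.loads(payload)


scraped = scrape_html(HTML_PAGE)
parsed = read_json(JSON_PAYLOAD)
```

Both calls yield the same dictionary here, but only `scrape_html` depends on assumptions about how the page happens to be laid out.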

CC: @404mediaco

#Scraper #Scraping #WebScraper
Replied in thread

@JoergA And how good is the #WebScraping in #FreshRSS? (freshrss.org/ does advertise it as a feature, after all)

I got rid of my separate #RSS #Reader long ago; I simply use #RSS2Email and pump my feeds through my existing email infrastructure. That way I can also use them on all my devices with no extra effort.

freshrss.org · FreshRSS, a free, self-hostable feeds aggregator
FreshRSS is lightweight, easy to work with, powerful, and customizable.