🛜 haze

Oy, whose crawler is abusing TLGS?

I'm getting constant requests hitting the search API. It's not a big problem due to the low volume. And since I log nothing, I don't know who is doing it. But you are generating a lot of errors.

3 years ago

6 Replies

👽 elektito

I'm afraid my first post on the station is going to be an apology. One of those might have been mine. I've already sent you an email explaining my situation and apologizing. Hopefully this has already been resolved. But if not, y'all now know where to find me. · 3 years ago

🛜 haze

It's sad to see I'm not the only one to face this issue. The one I saw is quite weird, though. It's hitting `/search?Gemini%20size:>20KB=========` with a varying number of equals signs each time. It was actually the parsing errors that led me to discover it.

I don't know about logging to blackhole, though. I'm trying my best not to log. But it seems necessary in this case. · 3 years ago

🫠 acidus

Yeah, it's kind of gross. I'm fine with crawlers as long as they 1) respect robots.txt and 2) are somewhat slow (~1 request/sec). Most seem to fail #1, except for ones like TLGS, Lupa, mine, etc. Oh well. · 3 years ago
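
As a rough illustration of the two conditions above (honor robots.txt, stay near one request per second), here is a minimal Python sketch. The class name `PoliteFetchGate`, the default `indexer` user agent, and the one-second default are assumptions made for illustration and are not taken from TLGS, Lupa, or any other crawler mentioned in this thread. It relies on the standard library's `urllib.robotparser` and expects the caller to have fetched the robots.txt lines already.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteFetchGate:
    """Per-host gate: checks robots.txt rules and spaces out requests.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, robots_lines, user_agent="indexer", min_interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        # Parse robots.txt lines that the caller already downloaded.
        self._robots = RobotFileParser()
        self._robots.parse(robots_lines)
        self._user_agent = user_agent
        self._min_interval = min_interval  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def allowed(self, path):
        """Return True if the robots.txt rules permit fetching this path."""
        return self._robots.can_fetch(self._user_agent, path)

    def wait_turn(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self._min_interval - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now
```

A crawler loop would call `allowed(path)` before queueing a URL and `wait_turn()` immediately before each request to the same host. The clock and sleep functions are injectable only so the throttle can be exercised without real delays.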

👽 moddedbear

@acidus Ugh. You made me check mine again and I found another crawler even worse than the first. Over 60,000 requests from it in the logs, with it first appearing March 13. A ton of the responses it's been getting back are rate limit or client certificate errors, so you'd think it'd take a hint. Neither of the two I've blackholed so far is coming from that capsule, though. · 3 years ago

🫠 acidus

Same. I tracked down a crawler coming from the same IP address as this capsule:

gemini://frrobert.net/

They were aggressively crawling links through NewsWaffle, hitting over 450,000 URLs in a week. It was literally crawling the entire Internet through my poor CGI 🤯

Problem was, NewsWaffle caches the HTML, and this caused my VPC's disk to fill up faster than the cron job could clean it. I messaged them several times and got no response, so I blackholed their IP...

🤷🏻 · 3 years ago

👽 moddedbear

I noticed a poorly made crawler on the rocketcaster capsule a couple weeks ago. Ignoring robots.txt, spamming expensive endpoints, and ignoring rate limits were just some of the things it was doing. I wonder if yours is the same one.

If it's causing you trouble, one option is to do some logging to find the IP of the crawler and black hole it. That's what I ended up doing. · 3 years ago
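
The "log, then find the IP" step could look something like the sketch below. It assumes a hypothetical access log where each line starts with the client IP followed by whitespace; real Gemini servers differ in their log layout, so the field position would need adjusting. The function name `top_clients` is made up for this example, and the actual blocking (firewall rule, null route) is left out because it depends on the host.

```python
from collections import Counter


def top_clients(log_lines, limit=5):
    """Count requests per client and return the busiest ones.

    Assumes (hypothetically) that the first whitespace-separated field
    of every log line is the client IP address.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common(limit)
```

Feeding it the lines of a short-lived log (for example `top_clients(open("access.log"))`, where the file name is again an assumption) gives a list of `(ip, request_count)` pairs, which is usually enough to spot a single misbehaving crawler before turning logging back off.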

Proxied content from gemini://station.martinrue.com/haze/6f52994095a342d9b009fdbe28037a85 (external content)
