Ban me at the IP level if you don't like me
It's not like we can capitalize on commerce in China anyway, so I think it's a fairly pragmatic approach.
If it works for my health insurance company, essentially all streaming services (including not even being able to cancel service from abroad), and many banks, it’ll work for you as well.
Surely bad actors wouldn’t use VPNs or botnets, and your customers never travel abroad?
The blocks don't stay in place forever, just a few months.
The only way of communicating with such companies is chargebacks through my bank (which always at least has a phone number reachable from abroad), so I'd make sure to account for those.
Visa/Mastercard chargeback rules largely apply worldwide (with some regional exceptions, but much less than many banks would make you believe).
I have first-hand experience, as I ran a company that geoblocked US users for legal reasons and successfully defended chargebacks by users who made transactions in the EU and disputed them from the US.
Chargebacks outside the US are a true arbitration process, not the rubberstamped refunds they are there.
Chargebacks outside the US are a true arbitration process, not the rubberstamped refunds they are there.
What's true is that in the US, the cardholder can often just say "I've never heard of that merchant", since 3DS is not really a thing, and generally merchants are relatively unlikely to have compelling evidence to the contrary.
But for all non-fraud disputes, they follow the same process.
Again, you're not aware of the reality outside the US.
It's a significant burden of proof for a cardholder to win a dispute for non-compliance with card network rules.
That's true, but "fraud" and "compliance" aren't the only dispute categories, not by far.
In this case, using Mastercard as an example (as their dispute rules are public[1]), the dispute category would be "Refund not processed".
The corresponding section explicitly lists this as a valid reason: "The merchant has not responded to the return or the cancellation of goods or services."
Again, you're not aware of the reality outside the US.
Repeating your incorrect assumption doesn't make it true.
[1] https://www.mastercard.us/content/dam/public/mastercardcom/n...
a) a Refund Not Processed chargeback is for non-compliance with card network rules,
and b) "When the merchant informed the cardholder of its refund policy at the time of purchase, the cardholder must abide by that policy."
We won these every time, because we had a lawful and compliant refund policy and we stuck to it. These are a complete non-issue for vendors outside the US, unless they are genuinely fraudulent.
Honestly, I think you have no experience with card processors outside the US (or maybe at all) and you just can't admit you're wrong, but anyone with experience would tell you how wrong you are in a heartbeat. The idea that you can "defeat" geoblocks with chargebacks is much more likely to result in you losing access to credit than in a refund.
It's quite possible that both of our experiences are real – at least I'm not trying to cast doubt on yours – but my suspicion is that the generalization you're drawing from yours (i.e. chargeback rules, or at least their practical interpretation, being very different between the US and other countries) isn't accurate.
Both in and outside the US, merchants can and do win chargebacks, but a merchant being completely unresponsive to cancellation requests of future services not yet provided (i.e. not of "buyer's remorse" for a service that's not available to them, per terms and conditions) seems like an easy win for the issuer.
Are you even trying to see things from a different perspective, or are you just dead set on winning an argument via ad hominems based on incorrect assumptions about my background?
I'm very open to a different perspective if it's grounded in reality. I'm only judging you on your comments, which to date have been factually inaccurate (to the point that I wonder if you're trolling?).
Both in and outside the US, merchants can and do win chargebacks,
At vastly different rates (~10% vs ~80%)
but a merchant being completely unresponsive to cancellation requests of future services not yet provided (i.e. not of "buyer's remorse" for a service that's not available to them, per terms and conditions)
Geoblocking a region is not being unresponsive and will not result in a breach of network rules. Lots of precedent, and completely uncontroversial, yet you believe otherwise.
seems like an easy win for the issuer.
"Seems" is the operative word here, but it only seems so from your uninformed position. Even after quoting the MC terms that show you're incorrect, you're still not open to new information.
At vastly different rates (~10% vs ~80%)
Is that your observed rate or an industry-wide trend?
If it's the former, I'll stick with my theory – you're extrapolating from a pretty specific scenario to a different one. My guess would be that you're conflating geoblocking of content (what you seem to have experience with) with geoblocking of the cancellation method (what this thread is about).
If it's the latter, you're wildly off base:
Merchants win an average of 50% of representments, though there are differences by country: U.S.: 54%, U.K.: 49.1%, AU: 46.7% and Brazil: 36.9%.
(from https://www.mastercard.com/us/en/news-and-trends/Insights/20...)
In fact, this is the opposite of what you're claiming (i.e. a higher win rate for merchants outside the US).
"Visiting the website" is the method. It's nonsense to say that visiting from a different location is a different method.
This is a naive view of the internet that does not stand the test of legislative reality. It's perfectly reasonable (and in our case was the only path to compliance) to limit access to certain geographic locations.
I don't care if you won those disputes, you did a bad thing and screwed over your customers.
In our case, our customers were trying to commit friendly fraud by requesting a chargeback because they didn't like a geoblock, which is also what the GP was suggesting.
Using chargebacks this way is nearly unique to the US and thankfully EU banks will deny such frivolous claims.
Are you saying they tried a chargeback just because they were annoyed at being unable to reach your website? Something doesn't add up here, or am I giving those customers too much credit?
Were you selling them an ongoing website-based service? Then the fair thing would usually be a prorated refund when they change country. A chargeback is bad but keeping all their money while only doing half your job is also bad.
Are you saying they tried a chargeback just because they were annoyed at being unable to reach your website?
In our case it was friendly fraud when users tried to use a service which we could not provide in the US (and many other countries due to compliance reasons) and had signed up in the EU, possibly via VPN.
If you read back in the thread, we're talking about the claim that adding geoblocking will result in chargebacks, which outside the US, it won't.
As a response to someone talking about customers traveling and needing support. But yeah geoblocks can occur in different situations with different appropriate resolutions.
In our case it was friendly fraud when users tried to use a service which we could not provide in the US (and many other countries due to compliance reasons) and had signed up in the EU, possibly via VPN.
If you provided zero service at all, they should get their money back. And calling a chargeback in that situation "friendly fraud" is ridiculous.
If they weren't even asking for a refund and using a chargeback out of spite, that's bad, but that's a different problem from fraud.
For someone that did sign up via VPN, would they be able to access the cancellation page via VPN?
If you provided zero service at all, they should get their money back. And calling a chargeback in that situation "friendly fraud" is ridiculous.
No, if a company upholds their side of a contract, the customer must too, within the bounds of the law.
A chargeback in that situation is the _definition_ of "friendly fraud" and is actual criminal fraud.
If they weren't even asking for a refund and using a chargeback out of spite, that's bad, but that's a different problem from fraud.
That's also criminal fraud.
US consumers are often shocked that "the customer is always right" customer service doesn't extend beyond their borders and that they can't chargeback their way out of contracts they've signed.
For someone that did sign up via VPN, would they be able to access the cancellation page via VPN?
It doesn't matter. If our terms prohibited VPN use to avoid geoblocking (which they did), it's irrelevant whether your VPN can or cannot access the cancellation page on a given day. You can email or write to us. All perfectly legal, lawful, and backed by merchant account providers.
No, if a company upholds their side of a contract, the customer must too, within the bounds of the law.
The company upholding their side by... doing nothing? Just give a refund if you're not providing service. And what is this about upholding your side if you're legally unable to provide the service in the first place?
A chargeback in that situation is the _definition_ of "friendly fraud" and is actual criminal fraud.
They have to get the thing and then chargeback. Your definition is nonsense if it doesn't include them getting the thing.
That's also criminal fraud.
It might be if they lie about something. But this isn't worth going on a tangent.
It doesn't matter. If our terms prohibited VPN use to avoid geoblocking (which they did), it's irrelevant whether your VPN can or cannot access the cancellation page on a given day. You can email or write to us. All perfectly legal, lawful, and backed by merchant account providers.
Do they know who to email while the site is blocked? At least that's something.
But I'm not even asking about things fluctuating from day to day, I'm worried about a situation where a VPN can sign up but the same VPN at the same time can't be used to cancel.
You can email or write to us.
How do I find your email or postal address if you're blocking every request from a given region? My original point was about companies that do that.
If you're not, I agree that there's much less of a problem (some jurisdictions require online cancellation methods, though).
I can imagine a merchant winning a chargeback if a customer e.g. signs up over a VPN for a service that isn't actually usable over the same VPN and then wants their money for the first month back.
But if cancellation of future charges is also not possible, I'd consider that an instance of a merchant not being responsive to attempts at cancellation, similar to them simply not picking up the phone or responding to emails.
I've seen some European issuing banks completely misinterpret the dispute rules and as a result deny cardholder claims that other issuers won without any discussion.
Visa and Mastercard aren't even involved in most disputes. Almost all disputes are settled between issuing and acquiring bank, and the networks only step in after some back and forth if the two really can't figure out liability.
Yes, the issuing and acquiring banks perform an arbitration process, and it's generally a very fair process.
We disputed every chargeback and, post-PSD2 SCA, we won almost all of them, with a 90%+ net recovery rate. Similar US businesses were lucky to hit 10% and were terrified of chargeback limits.
I've seen some European issuing banks completely misinterpret the dispute rules and as a result deny cardholder claims that other issuers won without any discussion.
Are you sure? More likely, the vendor didn't dispute the successful chargebacks.
But "merchant does not let me cancel" isn't a fraud dispute (and in fact would probably be lost by the issuing bank if raised as such). Those "non-fraudulent disagreement with the merchant disputes" work very similarly in the US and in Europe.
I can only assume you are from the US and are assuming your experience will generalise, but it simply does not. Like night and day. Most EU residents who try using chargebacks for illegitimate dispute resolution learn these lessons quickly, as there are far more card cancellations for "friendly fraud" than merchant account closures for excessive chargebacks in the EU - the polar opposite of the US.
I say that because I can't count how many times Google has taken me to a foreign site that either doesn't even ship to the US, or doesn't say one way or the other and treats me like a crazy person for asking.
I was in the UK. I wanted to buy a movie ticket there. Fuck me, because I had an Austrian IP address, since modern mobile backends pass your traffic through your home mobile operator. So I tried to use a VPN. Fuck me, VPN endpoints are blocked too.
I wanted to buy a Belgian train ticket, still from home. Cloudflare fucked me, because I was too suspicious as a foreigner. It broke the site's whole API access, which the site itself relied on.
I wanted to order something while I was in America at my friend’s place. Fuck me of course. Not just my IP was problematic, but my phone number too. And of course my bank card… and I just wanted to order a pizza.
The most annoying is when your fucking app is restricted to your stupid country, and I should use it because your app is a public transport app. Lovely.
And of course, there was that time when I moved to another country… pointless country restrictions everywhere… they really helped.
I remember the times when the saying was that the checkout process should be as frictionless as possible. That sentiment is long gone.
I wanted to order something while I was in America at my friend’s place. Fuck me of course. Not just my IP was problematic, but my phone number too.
Your mobile provider was routing you through Austria while in the US?
When I was in China, using a Chinese SIM had half the internet inaccessible (because China). As I was flying out I swapped my SIM back to my North American one... and even within China I had fully unrestricted (though expensive) access to the entire internet.
I looked into it at the time (now that I had access to non-Chinese internet sites!) and forgot the technical details, but seems that this was how the mobile network works by design. Your provider is responsible for your traffic.
For the record, my website is a front end for a local-only business. Absolutely no reason for anyone outside the US to participate.
In my experience running rather low-traffic sites (thousands of hits a day), doing just that brought every single annoyance from thousands per day to zero.
Yes, people -can- easily get around it via various listed methods, but don't seem to actually do that unless you're a high value target.
Due to frosty diplomatic relations, there is a deliberate policy to do fuck all to enforce complaints when they come from the west, and at least with Russia, this is used as a means of gray zone cyberwarfare.
China and Russia are being antisocial neighbors. Just like in real life, this does have ramifications for how you are treated.
Why stop there? Just block all non-US IPs!
This is a perfectly good solution to many problems, if you are absolutely certain there is no conceivable way your service will be used from some regions.
Surely bad actors wouldn’t use VPNs or botnets, and your customers never travel abroad?
Not a problem. Bad actors motivated enough to use VPNs or botnets are a different class of attack with different types of solutions. If you eliminate 95% of your problems with a single IP filter, then you have no good argument to make against it.
if you are absolutely certain there is no conceivable way your service will be used from some regions.
This isn’t the bar you need to clear.
It’s “if you’re comfortable with people in some regions not being able to use your service.”
There are some that do not provide services in most countries, but Netflix, Disney, and Paramount are pretty much global operations.
HBO and peacock might not be available in Europe but I am guessing they are in Canada.
Funny to see how narrow a perspective some people have…
Netflix doesn't have this issue but I've seen services that seem to make it tough. Though sometimes that's just a phone call away.
Though OTOH whining about this and knowing about VPNs and then complaining about the theoretical non-VPN-knower-but-having-subscriptions-to-cancel-and-is-allergic-to-phone-calls-or-calling-their-bank persona... like sure they exist but are we talking about any significant number of people here?
Not letting you unsubscribe and blocking your IP are very different things.
How so? They did not let me unsubscribe via blocking my IP.
Instead of being able to access at least my account (if not the streaming service itself, which I get – copyright and all), I'd just see a full screen notice along the lines of "we are not available in your market, stay tuned".
Capitalism is a means to an end, and allowable business practices are a two-way street between corporations and consumers, mediated by regulatory bodies and consumer protection agencies, at least in most functioning democracies.
Traffic should be privatized as much as possible between IPv6 addresses (because you still have 'scanners' sweeping the whole internet all the time... "the nice guys scanning the whole internet for your protection", never to sell any scan data, of course).
Public IP services are done for: it's going to be hell whatever you do.
The right answer seems to be significantly big 'security and availability teams' plus open and super simple internet standards. Yep, the JavaScript internet has to go away, and the apps' private protocols too. No more WHATWG cartel web engines, or worst of all: closed network protocols for "apps".
And the most important thing: hardcore protocol simplicity that still does a good enough job. It's common sense, but the planned-obsolescence and kludgy-bloat lovers won't let you...
Surely bad actors wouldn’t use VPNs or botnets, and your customers never travel abroad?
They usually don't bother. Plus it's easier to take action against malicious traffic within your own country or general jurisdiction.
Re: China, their cloud services seem to stretch to Singapore and beyond. I had to blacklist all of Alibaba Cloud and Tencent and the ASNs stretched well beyond PRC borders.
So the Seychelles traffic is likely really disguised Chinese traffic.
Then they're making the claim that those binaries have botnet functionality.
And you're right, kernel anti-cheats are rumored to be weaponized by hackers, making the previous point even worse.
And when the kid is playing his/her game at home, if daddy or mummy is a person of interest, they are already on the home LAN...
Well, you get the picture: nowhere to run, orders of magnitude worse than it was before.
Nowadays, the only protection that administrator/root access rights give you is mitigating any user mistake that would break their system... sad...
[1] https://mybroadband.co.za/news/internet/350973-man-connected...
So the Seychelles traffic is likely really disguised Chinese traffic.
Soon: chineseplayer.io
The explanation is that easy??
It won't be all Chinese companies or people doing the scraping. It's well known that a lot of countries don't mind such traffic as long as it doesn't target themselves or, in the West's case, some allies.
Laws aren't the same everywhere, so companies can get away with behavior in one place that would seem almost criminal in another.
And what better place to put your scrapers than somewhere with no copyright.
Russia also had the same, but since 2012 or so they changed their laws and a lot of the traffic dropped. Companies moved to small islands or small nation states (favoring them with their tax payouts; such states don't mind as long as you bring money) or the few remaining places like China that don't care about copyright.
It's pretty hard to really get rid of such traffic. You can block stuff, but mostly that just changes the response your server gives. The flood is still knocking at the door.
I'd hope someday ISPs or the like get more creative, but maybe they don't have enough access; it's hard to do this without rather creepy visibility into the traffic, or without accidentally censoring the whole thing.
It wouldn't surprise me if this is related somehow. Like maybe these are Indian corporations using a Seychelles offshore entity to do their scanning because then they can offset the costs against their tax or something. It may be that Cyprus has similar reasons. I seem to recall that Cyprus was revealed to be important in providing a storefront to Russia and Putin-related companies and oligarchs.[2]
So Seychelles may be India-related bots and Cyprus Russia-related bots.
[1] https://taxjustice.net/faq/what-is-transfer-pricing/#:~:text...
[2] Yup. My memory originated in the "Panama Papers" leaks https://www.icij.org/investigations/cyprus-confidential/cypr...
My public SFTP servers are still on port 22, but I block a lot of SSH bots by giving them a long VersionAddendum in /etc/ssh/sshd_config, as most of them choke on it. Mine is 720 characters long. Older SSH clients also choke on this, so test it first if going this route. Some botters will go out of their way to block me instead so their bots don't hang. You'll still see the bots in your logs, but there will be far fewer messages and far fewer attempts to log in, as they will be broken, sticky, and confused. Be sure to add offensive words to the VersionAddendum for the sites that log SSH banners and display them on their web pages, like shodan.io.
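For reference, a sketch of generating such a line. VersionAddendum is a standard OpenSSH sshd_config option, but the length and filler content here are just an example, and as noted above, test against your own clients before deploying:

```shell
# Sketch: generate a ~720-character VersionAddendum line for sshd_config.
# The filler is arbitrary base64; 540 random bytes encode to exactly 720
# characters with no padding. Append the printed line to
# /etc/ssh/sshd_config and reload sshd to apply.
ADDENDUM=$(head -c 540 /dev/urandom | base64 | tr -d '\n' | cut -c1-720)
printf 'VersionAddendum %s\n' "$ADDENDUM"
```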
The internet has become a hostile place for any public server, and with the advent of ML tools, bots will make up far more than the current ~50% of all traffic. Captchas and bot detection is a losing strategy as bot behavior becomes more human-like.
Governments will inevitably enact privacy-infringing regulation to deal with this problem, but for sites that don't want to adopt such nonsense, allowlists are the only viable option.
I've been experimenting with a system where allowed users can create short-lived tokens via some out-of-band mechanism, which they can use on specific sites. A frontend gatekeeper then verifies the token, and if valid, opens up the required public ports specifically for the client's IP address, and redirects it to the service. The beauty of this system is that the service itself remains blocked at the network level from the world, and only allowed IP addresses are given access. The only publicly open port is the gatekeeper, which only accepts valid tokens, and can run from a separate machine or network. It also doesn't involve complex VPN or tunneling solutions, just a standard firewall.
This should work well for small personal sites, where initial connection latency isn't a concern, but obviously wouldn't scale well at larger scales without some rethinking. For my use case, it's good enough.
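A rough illustration of the token half of such a gatekeeper. Every name here and the HMAC-over-expiry scheme are my own assumptions, not a description of the actual system; the firewall call is left commented out, and its nftables syntax is likewise assumed:

```python
import base64
import hashlib
import hmac
import time

SECRET = b"shared-out-of-band-secret"  # hypothetical; distributed out of band


def make_token(ttl_seconds: int = 300) -> str:
    """Create a short-lived token: an expiry timestamp plus an HMAC over it."""
    expiry = str(int(time.time()) + ttl_seconds).encode()
    sig = hmac.new(SECRET, expiry, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(expiry + b"." + sig).decode()


def verify_token(token: str) -> bool:
    """Check the signature and that the token has not expired."""
    try:
        # expiry is all digits, so the first "." is always the separator
        expiry, sig = base64.urlsafe_b64decode(token.encode()).split(b".", 1)
    except Exception:
        return False
    expected = hmac.new(SECRET, expiry, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return False
    return int(expiry) >= time.time()


def admit(client_ip: str, token: str) -> bool:
    """Gatekeeper step: on a valid token, open the service port for this IP.

    The firewall command is illustrative only (nftables set syntax assumed).
    """
    if not verify_token(token):
        return False
    # subprocess.run(["nft", "add", "element", "inet", "gate", "allowed",
    #                 "{", client_ip, "}"], check=True)
    return True
```

The service itself stays firewalled off; only the gatekeeper port is public, and a valid token results in a per-IP allow entry, matching the design described above.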
We have been using that instead of VPN and it has been incredibly nice and performant.
There also might be similar solutions for other cloud providers or some Kubernetes-adjacent abomination, but I specifically want something generic and standalone.
CloudFront is fairly good at marking whether someone is accessing from a data centre or a residential/commercial endpoint. It's not 100% accurate, and really bad actors can still use infected residential machines to proxy traffic, but this fix was simple and reduced the problem to a negligible level.
(It sometimes comes to funny situations where malware doesn't enable itself on Windows machines if it detects that russian language keyboard is installed.)
Source: stopping attacks that involve thousands of IPs at my work.
My single-layer thought process:
If they're knowingly running a residential proxy then they'll likely know "the cost of doing business". If they're unknowingly running a residential proxy then blocking them might be a good way for them to find out they're unknowingly running a residential proxy and get their systems deloused.
And what if I'm behind CGNAT? You will block my entire ISP or city all in one go, and get complaints from a lot of people.
Alas, the "enough users get annoyed by being blocked and switch ISPs" step will never happen. Most users only care about the big web properties, and those have the resources to absorb such crawler traffic so they won't get in on the ISP-blocking scheme.
But my main point was in the second paragraph, that "enough of them would" will never happen anyway when the only ones doing the blocking are small websites.
What, exactly, do you want ISPs to do to police their users from earning $10 of cryptocurrency a month, or even worse, from playing free mobile games? Neither one breaks the law btw. Neither one is even detectable. (Not even by the target website! They're just guessing too)
There are also enough websites that nobody is quitting the internet just because they can't get Netflix. They might subscribe to a different streaming service, or take up torrenting. They'll still keep the internet because it has enough other uses, like Facebook. Switching to a different ISP won't help because it will be every ISP because, as I already said, there's nothing the ISP can do about it. Which, on the other hand, means Netflix would ban every ISP and have zero customers left. Probably not a good business decision.
The end user will find out whether their ISP is blocking them or Netflix is blocking them. Usually by asking one of them or by talking to someone who already knows the situation. They will find out Netflix is blocking them, not their ISP.
You seem to think I said users will think the block is initiated by the ISP and not the website. I said no such thing so I'm not sure where you got this idea.
What, exactly, do you want ISPs to do
Respond to abuse reports.
Neither one is even detectable. (Not even by the target website! They're just guessing too)
TFA has IP addresses.
Which, on the other hand, means Netflix would ban every ISP and have zero customers left.
It's almost like I already said, twice even, that the plan won't work because the big web properties won't be in on it.
the ISPs will be motivated to stay in business and police their customers' traffic harder.
You can be completely forgiven if you're speaking from a non-US perspective, but this made me laugh pretty hard -- in this country we usually have a maximum of one broadband ISP available from any one address.
A small fraction of the most populous, mostly East Coast cities have fiber plus a highly asymmetrical DOCSIS cable option. The rest of the country generally has the cable option (if suburban or higher density) and possibly a complete joke of ADSL (like 6-12Mbps down).
There is nearly zero competition, most customers can choose to either keep their current ISP or switch to something with far worse speed/bandwidth caps/latency, such as cellular internet, or satellite.
If you ban a residential proxy IP you're likely to impact real users while the bad actor simply switches.
Are you really? How likely do you think a legit customer/user is to be on the same IP as a residential proxy? Sure, residential IPs get reused, but you can handle that by making the block last 6-8 hours, or a day or two.
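A minimal sketch of such a time-limited block; the class name and the 6-hour default are illustrative, and the IPs in the test are documentation ranges:

```python
import time


class TTLBlocklist:
    """Time-limited IP blocks: entries expire automatically, so a reused
    residential IP isn't still punished long after the proxy has moved on.
    """

    def __init__(self, ttl_seconds: float = 6 * 3600):
        self.ttl = ttl_seconds
        self._blocked: dict[str, float] = {}  # ip -> expiry timestamp

    def block(self, ip: str) -> None:
        self._blocked[ip] = time.time() + self.ttl

    def is_blocked(self, ip: str) -> bool:
        expiry = self._blocked.get(ip)
        if expiry is None:
            return False
        if time.time() >= expiry:
            del self._blocked[ip]  # lazily drop expired entries
            return False
        return True
```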
- Blacklisted IP (Google Cloud, AWS, etc), those were always blocked
- Untrusted IPs (residential IPs) were given some leeway, but quickly got to 429 if they started querying too much
- Whitelisted IPs (IPv4 addresses are used legitimately by many people; for example, my current data plan tells me my IP is from 5 states over), so anything behind a CGNAT.
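The tiering above can be sketched as a simple policy function; the tier names, per-tier budgets, and status codes are my own illustrative choices, not the actual system's:

```python
from enum import Enum


class Tier(Enum):
    BLOCKED = "blocked"      # known cloud/datacenter ranges: always refused
    UNTRUSTED = "untrusted"  # ordinary residential IPs: some leeway
    LENIENT = "lenient"      # shared CGNAT IPs: many real users per address

# Hypothetical per-tier budgets: requests allowed per minute before a 429.
LIMITS = {Tier.UNTRUSTED: 60, Tier.LENIENT: 600}


def decide(tier: Tier, requests_last_minute: int) -> int:
    """Return the HTTP status for a request under the tiering described
    above: 403 for datacenter IPs, 429 when over budget, else 200."""
    if tier is Tier.BLOCKED:
        return 403
    if requests_last_minute > LIMITS[tier]:
        return 429
    return 200
```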
You can probably guess what happens next. Most scrapers were thrown out, but the largest ones just got a modem device farm and ate the cost. They successfully prevented most users from scraping locally, but were quickly beaten by companies profiting from scraping.
I think this was one of many bad decisions Pokémon Go made. Some casual players dropped because they didn't want to play without a map, while the hardcore players started paying for scraping, which hammered their servers even more.
The known good list is IPs and ranges I know are good. The known bad list is specific bad actors. The data center networks list is updated periodically based on a list of ASNs belonging to data centers.
There are a lot of problems with using ASNs, even for well-known data center operators. First, they update so often. Second, they often include massive subnets like /13(!), which can apparently overlap with routes announced by other networks, causing false positives. Third, I had been merging networks (to avoid overlaps causing problems in nginx) with something like https://github.com/projectdiscovery/mapcidr but found that it also caused larger overlaps that introduced false positives from adjacent networks where apparently some legitimate users are. Lastly, I had seen suspicious traffic from data center operators like CATO Networks Ltd and ZScaler that are some kind of enterprise security products that route clients through their clouds. Blocking those resulted in some angry users in places I didn't expect...
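The overlap problem can be seen with the stdlib `ipaddress` module (networks here are documentation ranges, purely illustrative): collapsing truly adjacent, aligned prefixes is safe, but hand-widening "nearby" prefixes into a covering supernet is exactly what pulls in bystanders:

```python
import ipaddress

blocklist = [
    ipaddress.ip_network("198.51.100.0/25"),
    ipaddress.ip_network("198.51.100.128/25"),
]

# collapse_addresses only merges exactly adjacent, aligned networks,
# so this merge is lossless: the two /25s become one /24.
merged = list(ipaddress.collapse_addresses(blocklist))

# A hand-widened "merge" to a supernet, as some tools do for nearby
# (not adjacent-and-aligned) prefixes, is what causes false positives:
supernet = ipaddress.ip_network("198.51.100.0/23")
bystander = ipaddress.ip_address("198.51.101.7")  # never on the list
```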
And none of that accounts for the residential ISPs that bots use to appear like legitimate users: https://www.trendmicro.com/vinfo/us/security/news/vulnerabil....
Would it make sense to have a class of ISPs that didn't peer with these "bad" network participants?
Not sure what my point is here tbh. The internet sucks and I don't have a solution
Google indexes in-country, as do a few other search engines.
Would recommend.
Is there a public curated list of "good IPs" to whitelist?
It should be illegal, at least for companies that still charge me while I’m abroad and don’t offer me any other way of canceling service or getting support.
Cloudflare has been a godsend for protecting my crusty old forum from this malicious, wasteful behavior.
Say you whitelist an address/range and some systems detect "bad things". Now what? You remove that address/range from the whitelist? Do you distribute the removal to your peers? Do you communicate the removal to the owner of the unwhitelisted address/range? How does the owner communicate back that they've dealt with the issue? What if the owner of the range is a hosting provider that doesn't proactively control the content hosted, yet has robust anti-abuse mechanisms in place? And so on.
Whitelist-only is a huge can of worms, and whitelists work best with trusted partners you can maintain out-of-band communication with. Similarly, blacklists work best with trusted partners, but for determining addresses/ranges that are more trouble than they are worth. And somewhere in the middle are grey-zone addresses, e.g. ranges assigned to ISPs with CGNATs: you just cannot reliably label an individual address or even a range of addresses as strictly troublesome or strictly trustworthy by default.
Implement blacklists on known bad actors, e.g. the whole of China and Russia, maybe even cloud providers. Implement whitelists for ranges you explicitly trust to have robust anti-abuse mechanisms, e.g. corporations with strictly internal hosts.
It feels odd because I find I'm writing code to detect anti-bot tools even though I'm trying my best to follow conventions.
Gating robots.txt might be a mistake, but it also might be a quick way to deal with crawlers who mine robots.txt for pages that are more interesting. It's also a page that's never visited by humans. So if you make it a tarpit, you both refuse to give the bot more information and slow it down.
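A robots.txt tarpit can be as simple as dripping the body out one small chunk at a time, so an impatient crawler's connection stays tied up. This is a generic sketch (the chunk size and delay are arbitrary), shaped as a generator you could hand to a WSGI server as the response iterable:

```python
import time
from typing import Iterator

ROBOTS = b"User-agent: *\nDisallow: /\n"


def drip(body: bytes, chunk: int = 1, delay: float = 0.5) -> Iterator[bytes]:
    """Yield the robots.txt body in tiny chunks with a pause before each,
    slowing down crawlers that mine robots.txt for interesting pages."""
    for i in range(0, len(body), chunk):
        time.sleep(delay)
        yield body[i : i + chunk]
```

Since humans essentially never request robots.txt, the latency cost falls almost entirely on bots.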
It's crap that it's affecting your work, but a website owner isn't likely to care about the distinction when they're pissed off at having to deal with bad actors that they should never have to care about.
It's also a page that's never visited by humans.
Never is a strong word. I have definitely visited robots.txt of various websites for a variety of random reasons.
- remembering the format
- seeing what they might have tried to "hide"
- using it like a site's directory
- testing if the website is working if their main dashboard/index is offline
In fairness, however, my daughters ask me that question all the time and it is possible that the verification checkboxes are lying to me as part of some grand conspiracy to make me think I am a human when I am not.
--- though I think passing them is more a sign that you're a robot than anything else.
The latest was a slow loris approach where it takes forever for robots.txt to download.
I'd treat this in a client the same way as I do in a server application. If the peer is behaving maliciously or improperly, I silently drop the TCP connection without notifying the other party. They can waste their resources by continuing to send bytes for the next few minutes until their own TCP stack realizes what happens.
Additionally, it's not going to be using that many resources before your kernel sends it a RST the next time a data packet arrives.
The latest was a slow loris approach where it takes forever for robots.txt to download
Applying penalties that exclusively hurt people who are trying to be respectful seems counterproductive.
Never attribute to malice that which can be adequately explained by incompetence.
The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product. This is good news, because your actual human customers would really enjoy this too.
In that regard, reading my logs has sometimes led me to interesting articles about cybersecurity. Also, log flooding may cause your journaling service to truncate the log, so you miss something important.
Sometimes I find an odd password being probed, search for it on the web and find an interesting story [...].
Yeah, this is beyond irresponsible. You know the moment you're pwned, __you__ become the new interesting story?
For everyone else, use a password manager to pick a random password for everything.
Do they not run web servers on the open web or something?
Until AI crawlers chased me off of the web, I ran a couple of fairly popular websites. I just so rarely see anybody including passwords in the URLs anymore that I didn't really consider that as what the commenter was talking about.
It's not like if someone sends me a request for /wp-login.php that my rails app suddenly becomes WordPress??
You're absolutely right. That's my mistake — you are requesting a specific version of WordPress, but I had written a Rails app. I've rewritten the app as a WordPress plugin and deployed it. Let me know if there's anything else I can do for you.
So unless you're not logging your request path/query string, by your own logic you're doing something very, very wrong :). I can't imagine diagnosing issues with web requests without being given the path + query string. You can diagnose without them, but you're sure not making things easier.
plaintextPassword = POST["password"]
ok = bcryptCompare(hashedPassword, plaintextPassword)
// (now throw away POST and plaintextPassword)
if (ok) { ... }
Bonus points: on user lookup, when no user is found, fetch a dummy hashedPassword, compare, and ignore the result. This will partially mitigate username enumeration via timing attacks.

If you put your server up on the public internet, then this is just table-stakes stuff that you always need to deal with; it doesn't really matter whether the traffic is from botnets or crawlers or AI systems or anything else.
you're always gonna deal with this stuff well before the requests ever get to your application, with WAFs or reverse proxies or (idk) fail2ban or whatever else
also 1000 req/hour is around 1 request every 4 seconds, which is statistically 0 rps for any endpoint that would ever be publicly accessible
Background scanner noise on the internet is incredibly common, but the AI scraping is not at the same level. Wikipedia has published that their infrastructure costs have notably shot up since LLMs started scraping them. I've seen similar idiotic behavior on a small wiki I run; a single AI company took the data usage from "who gives a crap" to "this is approaching the point where I'm not willing to pay to keep this site up." Businesses can "just" pass the costs onto the customers (which is pretty shit at the end of the day,) but a lot of privately run and open source sites are now having to deal with side crap that isn't relevant to their focus.
The botnets and DDOS groups that are doing mass scanning and testing are targeted by law enforcement and eventually (hopefully) taken down, because what they're doing is acknowledged as bad.
AI companies, however, are trying to make a profit off of this bad behavior and we're expected to be okay with it? At some point impacting my services with your business behavior goes from "it's just the internet being the internet" to willfully malicious.
but yeah the issue is that as long as you have something accessible to the public, it's ultimately your responsibility to deal with malicious/aggressive traffic
At some point impacting my services with your business behavior goes from "it's just the internet being the internet" to willfully malicious.
I think maybe the current AI scraper traffic patterns are actually what "the internet being the internet" is from here forward
I think maybe the current AI scraper traffic patterns are actually what "the internet being the internet" is from here forward
Kinda my point was that it's only the internet being the internet if we tolerate it. If enough people give a crap, the corporations doing it will have to knock it off.
if you wanna rage against the machine then more power to you but this line of thinking is dead on arrival in terms of outcome
I don't know that LLMs read sites. I only know when I use one it tells me it's checking site X, Y, Z, thinking about the results, checking sites A, B, C etc.... I assumed it was actually reading the site on my behalf and not just referring to its internal training knowledge.
Like, how are people training LLMs, and how often does each one scrape? From the outside, it feels like the big ones (ChatGPT, Gemini, Claude, etc.) scrape only a few times a year at most.
Also to be clear I doubt those big guys are doing these crawls. I assume it's small startups who think they're gonna build a big dataset to sell or to train their own model.
Also, they might share the common viewpoint of "it's the internet; suck it up."
thousands of requests an hour from bots
That's not much for any modern server so I genuinely don't understand the frustration. I'm pretty certain gitea should be able to handle thousands of read requests per minute (not per hour) without even breaking a sweat.
Serving file content/diff requests from gitea/forgejo is quite expensive computationally
One time, sure. But unauthenticated requests would surely be cached, authenticated ones skip the cache (just like HN works :) ), as most internet-facing websites end up using this pattern.
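A minimal sketch of that pattern (anonymous requests cached with a TTL, authenticated ones bypassing the cache); the class and method names are illustrative:

```python
import time

class AnonCache:
    """Cache rendered responses only for unauthenticated requests."""

    def __init__(self, ttl: float = 60.0):
        self.ttl = ttl
        self.store: dict[str, tuple[float, str]] = {}

    def get_or_render(self, path: str, authenticated: bool, render) -> str:
        if authenticated:
            return render()  # personalized content: always skip the cache
        hit = self.store.get(path)
        if hit and time.monotonic() - hit[0] < self.ttl:
            return hit[1]    # fresh cached copy
        body = render()
        self.store[path] = (time.monotonic(), body)
        return body
```

Even a short TTL collapses a bot hammering the same expensive diff page into one backend render per minute.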
Saying “just cache this” is not sustainable. And this is only one repository; the only reasonable way to deal with this is some sort of traffic mitigation. You cannot just treat the traffic as the happy path.
It blocks a lot of bots, but I feel like just running on a high port number (10,000+) would likely do better.
There are attackers out there that send SIP/2.0 OPTIONS requests to the GOPHER port, over TCP.
If this is actually impacting perceived QoS then I think a gitea bug report would be justified. Clearly there's been some kind of a performance regression.
Just looking at the logs seems to be an infohazard for many people. I don't see why you'd want to inspect the septic tanks of the internet unless absolutely necessary.
I encountered exactly one actual problem: the temporary folder for zip snapshots filled up the disk since bots followed all snapshot links and it seems gitea doesn't delete generated snapshots. I made that directory read-only, deleted its contents, and the problem was solved, at the cost of only breaking zip snapshots.
I experienced no other problems.
I did put some user-agent checks in place a while later, but that was just for fun to see if AI would eventually ingest false information.
The problem I ran into was that performance was bimodal. We had this one group of users that was lightning fast and the rest were far slower. I chased down a few obvious outliers (that one forum thread with 11,000 replies that some guy leaves open in a browser tab all the time, etc.), but it was still bimodal. Eventually I just changed the application-level code to display known bots as one performance trace and everything else as another.
60% of all requests were known bots. This doesn't even count the random-ass bot that some guy started up at an ISP. Yes, this really happened. We were a paying customer of a company that decided to just conduct a DoS attack on us at 2 PM one afternoon. It took down the website.
Not only that, the bots effectively always got a cached response since they all seemed to love to hammer the same pages. Users never got a cached response, since LRU cache eviction meant the actual discussions with real users were always evicted. There were bots that would just rescrape every page they had ever seen every few minutes. There were bots that would just increase their throughput until the backend app would start to slow down.
There were bots that would run the javascript for whatever insane reason and start emulating users submitting forms, etc.
You probably are thinking "but you got to appear in a search index so it is worth it". Not really. Google's bot was one of the few well behaved ones and would even slow scraping if it saw a spike in the response times. Also we had an employee who was responsible for categorizing our organic search performance. While we had a huge amount of traffic from organic search, it was something like 40% to just one URL.
Retrospectively I'm now aware that a bunch of this was early stage AI companies scraping the internet for data.
Google's bot was one of the few well behaved ones and would even slow scraping if it saw a spike in the response times.
Google has invested decades of core research with an army of PhDs into its crawler, particularly around figuring out when to recrawl a page. For example (a bit dated, but you can follow the refs if you're interested):
https://www.niss.org/sites/default/files/Tassone_interface6....
We also had a period where we generated bad URLs for a week or two, and the worst part was I think they were on links marked nofollow. Three years later there was a bot still trying to load those pages.
And if you 429 Google’s bots they will reduce your pagerank. That’s straight up extortion from a company that also sells cloud services.
I don't agree with you about Google being well behaved. They were following nofollow links, and they're also terrible if you're serving content on vanity URLs. Any throttling they do on one domain name just hits two more.
they were on links marked nofollow
if i'm understanding you correctly you had an indexable page that contained links with nofollow attribute on the <a> tags.
It's possible some other mechanism got those URLs into the crawler like a person visiting them? Nofollow on the link won't prevent the URL from being crawled or indexed. If you're returning a 404 for them, you ought to be able to use webmaster tools or whatever it's called now, to request removal.
They were meant to be interactive URLs on search pages. Someone implemented them I think trying to allow A11y to work but the bots were slamming us. We also weren’t doing canonical URLs right in the destination page so they got searched again every scan cycle. So at least three dumb things were going on, but the sorts of mistakes that normal people could make.
And if you 429 Google’s bots they will reduce your pagerank. That’s straight up extortion from a company that also sells cloud services.
Googlebot uses different IP space from gcp
The point is they’re getting paid to run cloud servers to keep their bots happy and not dropping your website to page six.
btw, you don't get dropped if you issue temporary 429s; only when it's consistent and/or the site is broken. that is well documented. and wtf else are they supposed to do if you don't allow them to crawl it and it goes stale?
The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product.
I wonder what all those people are doing that their server can't handle the traffic. Wouldn't a simple IP-based rate limit be sufficient? I only pay $1 per month for my VPS, and even that piece of trash can handle 1000s of requests per second.
I only pay $1 per month for my VPS, and even that piece of trash can handle 1000s of requests per second.
Depends on the computational cost per request. If you're serving static content from memory, 10k/s sounds easy. If you constantly have to calculate diffs across ranges of commits, I imagine a couple dozen can bring your box down.
Also: who's your webhost? $1/m sounds like a steal.
Serving up a page that takes a few dozen db queries is a lot different than serving a static page.
The bonus is my actual customers get the same benefits and don't notice any material loss from my content _not_ being scraped. How you see this as me being secretly taken advantage of is completely beyond me.
However, it's obviously not a real solution. It depends on people knowing about it, and adding the complexity of checking it to their crawler. Are there other more serious solutions? It seems like we've heard about "micropayments" and "a big merkle tree of real people" type solutions forever and they've never materialized.
It depends on people knowing about it, and adding the complexity of checking it to their crawler.
I can't believe any bot writer doesn't know about robots.txt. They're just so self-obsessed and can't comprehend why the rules should apply to them, because obviously their project is special and it's just everyone else's bot that causes trouble.
-A PREROUTING -i eth0 -p tcp -m tcp -d $INTERNET_IP --syn -m tcpmss ! --mss 1280:1460 -j DROP
Example rule from the netfilter raw table. This will not help against headless Chrome.

The reason this is useful is that many bots first scan for port 443 and then try to enumerate it. The bots that look up domain names to scan will still try, and many of those come from new certs being created in Let's Encrypt. That is one of the reasons I use the DNS method: get a wildcard and sit on it for a while.
Another thing that helps is setting a default host in one's load balancer or web server that serves a simple static page from a RAM disk saying something like "It Worked!", and disabling logging for that default site. In HAProxy one should look up the option "strict-sni". Very old API clients can get blocked if they do not support SNI, but along that line, most bots are really old unsupported code that the botter could not update if their life depended on it.
Of course. The nifty thing about open source is that I can configure a system to allow or disallow anything. Each server operator can monitor their legit users' traffic, find what they need to allow, and dump the rest. Corporate VPNs will be using known values. "Free" VPNs can vary wildly, but one need not support them if they choose not to. On some systems I only allow an MSS of 1460, and I also block TCP SYN packets with a TTL greater than 64, but that matches my user base.
We have no Chinese users/customers, so in theory this doesn't affect business at all. Also, Russia is sanctioned and our Russian userbase doesn't actually live in Russia, so blocking Russia didn't affect users at all.
I did a quick search and found a few databases but none of them looks like the obvious winner.
If you want to test your IP blocks, we have servers in both China and Russia; we can try to take a screenshot from there to see what we get (free, no signup): https://testlocal.ly/
If your site is behind cloudflare, blocking/challenging by country is a built-in feature.
Your offhand comment also doesn't make sense in the context of this subthread. The effort companies have to invest to do business with Russia and China is prohibitively high, and that's a completely valid concern. It's not that everyone universally hates or loves these countries. It's simply impractical for most businesses to navigate those markets.
I've been playing cat and mouse trying to block them for the past week and here are a couple of observations/ideas, in case this is helpful to someone:
* As mentioned above, the bulk of the traffic comes from a large number of IPs, each issuing only a few requests a day, and they pretend to be real UAs.
* Most of them don't bother sending the referrer URL, but not all (some bots from Huawei Cloud do, but they currently don't generate much traffic).
* The first thing I tried was to throttle bandwidth for URLs that contain id= (which on a cgit instance generate the bulk of the bot traffic). So I set the bandwidth to 1 KB/s and thought surely most of the bots would not be willing to wait 10-20s to download the page. Surprise: they didn't care. They just waited and kept coming back.
* BTW, they also used keep-alive connections when those were offered. So another thing I did was disable keep-alive for the /cgit/ locations; without that, enough bots would routinely hog all the available connections.
* My current solution is to deny requests for all URLs containing id= unless they also contain the `notbot` parameter in the query string (which I suggest legitimate users add, via the custom error message for 403). I also currently only do this if the referrer is not present, but I may have to change that if the bots adapt. Overall, this helped with the load and freed up connections for legitimate users, but the bots didn't go away. They still request, get 403, but keep coming back.
My conclusion from this experience is that you really only have two options: either do something ad hoc and very specific to your site (like notbot in the query string) that whoever runs the bots won't bother adapting to, or employ someone with enough resources (like Cloudflare) to fight them for you. Using some "standard" solution (rate limits, Anubis, etc.) is not going to work; they have enough resources to eat the cost and/or adapt.
* https://geminiprotocol.net/docs/protocol-specification.gmi#r...
The reasoning for disallowing them in GEMINI pretty much applies to static HTTP service (which is what publicfile provides) as it does to static GEMINI service. They moreover did not actually work in Bernstein publicfile unless a site administrator went to extraordinary lengths to create multiple oddly-named filenames (non-trivial to handle from a shell on a Unix or Linux-based system, because of the metacharacter) with every possible combination of query parameters, all naming the same file.
* https://jdebp.uk/Softwares/djbwares/guide/publicfile-securit...
* https://jdebp.uk/Softwares/djbwares/guide/commands/httpd.xml
* https://jdebp.uk/Softwares/djbwares/guide/commands/geminid.x...
Before I introduced this, attempted (and doomed to fail) exploits against weak CGI and PHP scripts were a large fraction of all of the file not found errors that httpd had been logging. These things were getting as far as hitting the filesystem and doing namei lookups. After I introduced this, they are rejected earlier in the transaction, without hitting the filesystem, when the requested URL is decomposed into its constituent parts.
Bernstein publicfile is rather late to this party, as there are over 2 decades of books on the subject of static sites versus dynamic sites (although in fairness it does pre-date all of them). But I can report that the wisdom when it comes to queries holds up even today, in 2025, and if anything a stronger position can be taken on them now.
To those running static sites, I recommend taking this good idea from GEMINI and applying it to query parameters as well.
Unless you are brave enough to actually attempt to provide query parameter support with static site tooling. (-:
There's a recent phishing campaign with sites hosted by Cloudflare and spam sent through either "noobtech.in" (103.173.40.0/24) or through "worldhost.group" (many, many networks).
"noobtech.in" has no web site, can't accept abuse complaints (their email has spam filters), and they don't respond at all to email asking them for better communication methods. The phishing domains have "mail.(phishing domain)" which resolves back to 103.173.40.0/24. Their upstream is a Russian network that doesn't respond to anything. It's 100% clear that this network is only used for phishing and spam.
It's trivial to block "noobtech.in".
"worldhost.group", though, is a huge hosting conglomerate that owns many, many hosting companies and many, many networks spread across many ASNs. They do not respond to any attempts to communicate with them, but since their web site redirects to "hosting.com", I've sent abuse complaints to them. "hosting.com" has autoresponders saying they'll get back to me, but so far not a single ticket has been answered with anything but the initial autoresponder.
It's really, really difficult to imagine how one would block them, and also difficult to imagine what kind of collateral impact that'd have.
These huge providers, Tencent included, get away with way too much. You can't communicate with them, they don't give the slightest shit about harmful, abusive and/or illegal behavior from their networks, and we have no easy way to simply block them.
I think we, collectively, need to start coming up with things we can do that would make their lives difficult enough for them to take notice. Should we have a public listing of all netblocks that belong to such companies and, as an example, we could choose to autorespond to all email from "worldhost.group" and redirect all web browsing from Tencent so we can tell people that their ISP is malicious?
I don't know what the solution is, but I'd love to feel a bit less like I have no recourse when it comes to these huge mega-corporations.
# Note: nginx only allows rewrite-module directives inside "if", so set the
# built-in $limit_rate variable there and hand matching agents off to a large
# junk file (nginx refuses to serve /dev/zero, as it is not a regular file):
if ($http_user_agent ~* "BadBot") {
    set $limit_rate 1k;
    rewrite ^ /junk.bin last;
}
Blocking IPs is much cheaper for the blocker.
https://en.wikipedia.org/wiki/Zeno%27s_paradoxes#Dichotomy_p...
IP blocking is useless if your sources are hundreds of thousands of people worldwide just playing a "free" game on their phone that once in a while on wifi fetches some webpages in the background for the game publisher's scraping as a service side revenue deal.
IP blocking is useless if your sources are hundreds of thousands of people worldwide just playing a "free" game on their phone that once in a while on wifi fetches some webpages in the background for the game publisher's scraping as a service side revenue deal.
That's traffic I want to block, and that's behaviour that I want to punish / discourage. If a set of users get caught up in that, even when they've just been given recycled IP addresses, then there's more chance to bring the shitty 'scraping as a service' behaviour to light, thus to hopefully disinfect it.
(opinion coming from someone definitely NOT hosting public information that must be accessible by the common populace - that's an issue requiring more nuance, but luckily has public funding behind it to develop nuanced solutions - and can just block China and Russia if it's serving a common populace outside of China and Russia).
I can't believe the entitlement.
And no, I do not use those paid services, even though it would make it much easier.
If you feel like you need to do anything at all, I would suggest treating it like any other denial-of-service vulnerability: Fix your server or your application. I can handle 100k clients on a single box, which equates to north of 8 billion daily impressions, and so I am happy to ignore bots and identify them offline in a way that doesn't reveal my methodologies any further than I absolutely have to.
After some fine-tuning and eliminating false positives, it is running smoothly. It logs all the temporarily banned and reported IPs (to CrowdSec) and posts them to a Discord channel. On average it blocks a few dozen different IPs each day.
From what I see, there are far more American IPs trying to access non-public resources and attempting to exploit CVEs than there are Chinese ones.
I don't really mind anyone scraping publicly accessible content and the rest is either gated by SSO or located in intranet.
For me personally there is no need to block a specific country, I think that trying to block exploit or flooding attempts is a better approach.
The directory structure had changed, and the page is now 1 level lower in the tree, correctly hyperlinked long since, in various sitemaps long since, and long since discovered by genuine HTTP clients.
The URL? It now only exists in 1 place on the WWW according to Google. It was posted to Hacker News back in 2017.
(My educated guess is that I am suffering from the page-preloading fallout from repeated robotic scraping of old Hacker News stuff by said U.S.A. subsidiary.)
You can blunt instrument 403 geoblock entire countries if you want, or any user agent, or any netblock or ASN. It’s entirely up to you and it’s your own server and nobody will be legitimately mad at you.
You can rate limit IPs to x responses per day or per hour or per week, whatever you like.
This whole AI scraper panic is so incredibly overblown.
I’m currently working on a sniffer that tracks all inbound TCP connections and UDP/ICMP traffic and can trigger firewall rule addition/removal based on traffic attributes (such as firewalling or rate limiting all traffic from certain ASNs or countries) without actually having to be a reverse proxy in the HTTP flow. That way your in-kernel tables don’t need to be huge and they can just dynamically be adjusted from userspace in response to actual observed traffic.
This whole AI scraper panic is so incredibly overblown.
The problem is that it's eating into people's costs. And if you're not concerned with money, I'm just asking: can you send me $50.00 USD?
Care to share how I can make that happen given scrapers are hellbent on ignoring any rules / agreements on how to conduct themselves?
Then turn the tables on them and make the Great Firewall do your job! Just choose a random snippet about illegal Chinese occupation of Tibet or human rights abuses of Uyghur people each time you generate a page and insert it as a breaker between paragraphs. This should get you blocked in no time :)
Well, my user agents work for me, not for you - the server guy who is complaining about this and that. "Your business model is not my problem". Block me if you don't want me.
The problem is that there is no way to "block me if you don't want me". That's the entire issue. The methods these scrapers use mean it's nigh on impossible to block them.
I suspect we'll get integrity attestation or tokens before it becomes an insurmountable problem to block bots.
43.131.0.0/18 43.129.32.0/20 101.32.0.0/20 101.32.102.0/23 101.32.104.0/21 101.32.112.0/23 101.32.112.0/24 101.32.114.0/23 101.32.116.0/23 101.32.118.0/23 101.32.120.0/23 101.32.122.0/23 101.32.124.0/23 101.32.126.0/23 101.32.128.0/23 101.32.130.0/23 101.32.13.0/24 101.32.132.0/22 101.32.132.0/24 101.32.136.0/21 101.32.140.0/24 101.32.144.0/20 101.32.160.0/20 101.32.16.0/20 101.32.17.0/24 101.32.176.0/20 101.32.192.0/20 101.32.208.0/20 101.32.224.0/22 101.32.228.0/22 101.32.232.0/22 101.32.236.0/23 101.32.238.0/23 101.32.240.0/20 101.32.32.0/20 101.32.48.0/20 101.32.64.0/20 101.32.78.0/23 101.32.80.0/20 101.32.84.0/24 101.32.85.0/24 101.32.86.0/24 101.32.87.0/24 101.32.88.0/24 101.32.89.0/24 101.32.90.0/24 101.32.91.0/24 101.32.94.0/23 101.32.96.0/20 101.33.0.0/23 101.33.100.0/22 101.33.10.0/23 101.33.10.0/24 101.33.104.0/21 101.33.11.0/24 101.33.112.0/22 101.33.116.0/22 101.33.120.0/21 101.33.128.0/22 101.33.132.0/22 101.33.136.0/22 101.33.140.0/22 101.33.14.0/24 101.33.144.0/22 101.33.148.0/22 101.33.15.0/24 101.33.152.0/22 101.33.156.0/22 101.33.160.0/22 101.33.164.0/22 101.33.168.0/22 101.33.17.0/24 101.33.172.0/22 101.33.176.0/22 101.33.180.0/22 101.33.18.0/23 101.33.184.0/22 101.33.188.0/22 101.33.24.0/24 101.33.25.0/24 101.33.26.0/23 101.33.30.0/23 101.33.32.0/21 101.33.40.0/24 101.33.4.0/23 101.33.41.0/24 101.33.42.0/23 101.33.44.0/22 101.33.48.0/22 101.33.52.0/22 101.33.56.0/22 101.33.60.0/22 101.33.64.0/19 101.33.64.0/23 101.33.96.0/22 103.52.216.0/22 103.52.216.0/23 103.52.218.0/23 103.7.28.0/24 103.7.29.0/24 103.7.30.0/24 103.7.31.0/24 43.130.0.0/18 43.130.64.0/18 43.130.128.0/19 43.130.160.0/19 43.132.192.0/18 43.133.64.0/19 43.134.128.0/18 43.135.0.0/18 43.135.64.0/18 43.135.192.0/19 43.153.0.0/18 43.153.192.0/18 43.154.64.0/18 43.154.128.0/18 43.154.192.0/18 43.155.0.0/18 43.155.128.0/18 43.156.192.0/18 43.157.0.0/18 43.157.64.0/18 43.157.128.0/18 43.159.128.0/19 43.163.64.0/18 43.164.192.0/18 43.165.128.0/18 43.166.128.0/18 43.166.224.0/19 
49.51.132.0/23 49.51.140.0/23 49.51.166.0/23 119.28.64.0/19 119.28.128.0/20 129.226.160.0/19 150.109.32.0/19 150.109.96.0/19 170.106.32.0/19 170.106.176.0/20
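To consume a list like this programmatically, Python's ipaddress module can test membership directly; only a few prefixes from the list above are shown here for brevity:

```python
import ipaddress

# A handful of the Tencent prefixes listed above; the full list
# would be loaded the same way (e.g. from a file, one prefix per line).
PREFIXES = [ipaddress.ip_network(p) for p in
            ("43.131.0.0/18", "49.51.132.0/23", "119.28.64.0/19")]

def in_blocklist(addr: str) -> bool:
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in PREFIXES)
```

For production-sized lists you would want a trie or the OS firewall's own set type rather than a linear scan, but the membership logic is the same.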
Here's a useful tool/site:
You can feed it an ip address to get an AS ("Autonomous System"), then ask it for all prefixes associated with that AS.
I fed it the first IP address from that list (43.131.0.0) and it showed me the same Tencent-owned AS132203, and it gives back all the prefixes they have here:
https://bgp.tools/as/132203#prefixes
(Looks like roguebloodrage might have missed at least the 1.12.x.x and 1.201.x.x prefixes?)
I started searching about how to do that after reading a RachelByTheBay post where she wrote:
Enough bad behavior from a host -> filter the host.
Enough bad hosts in a netblock -> filter the netblock.
Enough bad netblocks in an AS -> filter the AS. Think of it as an "AS death penalty", if you like.
(from the last part of https://rachelbythebay.com/w/2025/06/29/feedback/ )
If I'd seen the two you identified, they would have been added. I do strike a balance between "might be a game CDN" or "legit server" and an outright VPS that is being used to abuse other servers.
But thanks, I will keep an eye on those two ranges.
e.g. Chuck 'Tencent' into the text box and execute.
Edit: I also checked my Apache logs, I couldn't find any recent logs for "thinkbot".
You can also filter by allowing, but this risks allowing the wrong thing, as headers are easy to set, so it's better to do it via blocking (sadly).
But ultimately it's worth it, you are responsible for your neighbours.
[Y]ou are responsible for [how] your neighbours [use the Internet].
Nope.
I'm very much not responsible for snooping on my neighbor's private communications. If anyone is responsible for doing any sort of abuse monitoring, it is the ISP chosen by my neighbor.
If there's a neighbour in your building running a Bitcoin farm, it's going to cause issues for you. If people from your country commit crimes in other countries and violate visas, then you are going to face a quota because of them. If you bank at ACME Bank and it turns out they were arms traffickers, your funds were pooled and helped launder their money; you are responsible by association.

Reputation is not only individual; there is group reputation, regardless of whether you like it or not.
If there's a neighbour...
What ass-backwards jurisdiction do you live in where any of the things you mention in this paragraph are true, let alone the notion that uninvolved bystanders would be responsible for the behavior of others?
If there's a neighbour in your building who is running a bitcoin farm on your residential building, it's going to cause issues for you.
Natural phenomenon, not legal, your power block will go down.
If people from your country commit crime in other countries and violate visas, then you are going to face a quota due to them.
https://www.whitehouse.gov/presidential-actions/2025/06/rest...
Visa overstays are tracked and they may affect policy decisions on immigration. That's common in many countries, not just the US.
If you bank at ACME Bank, and then it turns out they were arms traffickers, your funds were pooled and helped launder their money, you are responsible by association.
I don't know if you've ever done international banking of any significant amount, but try receiving money from a Seychelles account or something like that. Whatever jurisdiction you open an account in, you share the reputation of that jurisdiction.
I'll add another one: spam in email is combated not only on a domain- and IP-reputation basis; whole IP blocks or even ASNs can be marked for spam. And another: opening a company in a jurisdiction buys you the reputation of said jurisdiction.
Reputation is not only individual but group-based. This is because identities can be forged by an identity provider, be it a passport-issuing country, an ASN, a bank, a company registry, etc.
[Y]our power block will go down.
So, it is my responsibility to prevent my neighbors from buying a high-end gaming PC for every member of their family, an induction stove, central A/C, and an electric car because my local power company might not be able to provide the contracted service. Right. The rest of your examples are just as poor as this one.
You seem to have confused "being responsible for" and "being affected by". I am affected by the effects that the geography of the region I live in has on the local weather. I am not responsible for that geography.
/128: single application
/64: single computer
/56: entire building
/48: entire (digital) neighborhood
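The granularity heuristic above can be sketched with Python's stdlib `ipaddress` module. The mapping from scope to prefix length is the commenter's rule of thumb, not a standard, and the addresses are illustrative:

```python
import ipaddress

# Heuristic block granularities from the list above:
# /64 ~ one computer, /56 ~ one building, /48 ~ one (digital) neighborhood.
WIDTHS = {"computer": 64, "building": 56, "neighborhood": 48}

def block_prefix(addr: str, scope: str = "computer") -> ipaddress.IPv6Network:
    """Widen a single offending IPv6 address to the chosen block granularity."""
    ip = ipaddress.ip_address(addr)
    return ipaddress.ip_network(f"{ip}/{WIDTHS[scope]}", strict=False)

print(block_prefix("2001:db8:abcd:1234::42", "building"))  # 2001:db8:abcd:1200::/56
```

Blocking the wider prefix is what makes banning one abusive /128 actually stick, since a scraper can rotate addresses freely inside its /64.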
There’s a great talk on this: Defense by numbers: Making Problems for Script Kiddies and Scanner Monkeys https://www.youtube.com/watch?v=H9Kxas65f7A
What I’d really love to see - but probably never will - is companies joining forces to share data or support open projects like Common Crawl. That would raise the floor for everyone. But, you know… capitalism, so instead we all reinvent the wheel in our own silos.
I would guess directory listing? But I'm an idiot, so any elucidation would be appreciated.
On the other hand, I had to deploy Anubis for the SVN web interface for tug.org. SVN is way slower than Git (most pages take 5 seconds to load), and the server didn't even have basic caching enabled, but before last year, there weren't any issues. But starting early this year, the bots started scraping every revision, and since the repo is 20+ years old and has 300k files, there are a lot of pages to scrape. This was overloading the entire server, making every other service hosted there unusable. I tried adding caching and blocking some bad ASNs, but Anubis was (unfortunately) the only solution that seems to have worked.
So, I think that the main commonality is popular-ish sites with lots of pages that are computationally-expensive to generate.
Another issue is that cloud hosts will happily overlap their ranges with legitimate business ranges, so if you go that route you will inadvertently block legitimate things too. Not that a regular person cares too much about that, but an abuse list should be accurate.
For what it's worth, I'm also guilty of this, even if I made my site to replace one that died.
None of these are the main traffic drivers, just the main resource hogs, and the main reason my site turns slow (usually an AI company, Microsoft, or Facebook ignoring any common sense).
China and co. are only a very small portion of my malicious traffic, thankfully. It's usually US companies who disrespect my robots.txt and DNS rate limits that cause me the most problems.
There is no reason to query all my sub-sites; it's like a search engine with way too many theoretical pages.
Facebook also did aggressive, daily indexing of way too many pages, using large IP ranges, until I blocked it. I get maybe one user per week from them; no idea what they want.
And Bing, I learned, "simply" needs hard-enforced rate limits, which it kind of learns to respect.
It will work better than a regex. A lot of these companies rely on "but we are clearly recognizable", e.g. via these user agents, as an excuse to put the burden on sysadmins to maintain blocklists instead of the other way around (keeping a list of scrapables...).
Maybe someone mathy can unburden them?
You could also look at who asks for nonexistent resources, and block anyone who asks for more than X (with X large enough that a config issue or the like doesn't kill regular clients). The block might last just a minute, so you don't take on much risk when a false positive occurs. It will likely be enough to make the scraper turn away.
There are many things to do depending on context, app complexity, load, etc. The problem is that there's no really easy way to do any of them.
ML should be able to help a lot in such a space??
Alex Schroeder's Butlerian Jihad
That's Frank Herbert's Butlerian Jihad.
Speaking of the Butlerian Jihad: Frank Herbert's son Brian and another author, Kevin J. Anderson, co-wrote a few books in the Dune universe, and one of them was about the Butlerian Jihad. I read it. It was good, not as good as Frank Herbert's books, but I still enjoyed it. One of the authors is clearly not as good as the other, because you can kind of tell the writing quality changing from chapter to chapter.
A further check showed that all the network blocks are owned by one organization, Tencent. I'm seriously thinking that the CCP encourages this, maybe in the hope of externalizing the cost of the Great Firewall to the rest of the world.
A simple check against the IP addresses 170.106.176.0, 150.109.96.0, 129.226.160.0, 49.51.166.0, and 43.135.0.0 shows that they are allocated to Tencent Cloud, a Google Cloud-like rental service.
I'm using their product personally; it's really cheap, a little more than $12-$20 a year for a VPS, and it's from one of the top Internet companies.
Sure, this can't completely rule out the possibility that Tencent is behind all of this, but I don't think the Communist Party needs to attack your website through Tencent; it's simply not logical.
More likely it's just some company that rented servers on Tencent and is crawling the Internet. The rest is probably just your xenophobia-fueled paranoia.
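A check like the one described can be sketched with Python's stdlib `ipaddress` module. The prefix lengths below are illustrative guesses (the comment gives only the network start addresses), so treat them as placeholders for whatever whois actually reports:

```python
import ipaddress

# Blocks mentioned above, with illustrative prefix lengths
# (the exact lengths are an assumption, not from the comment):
TENCENT_BLOCKS = [ipaddress.ip_network(n) for n in (
    "170.106.176.0/20", "150.109.96.0/19", "129.226.160.0/19",
    "49.51.166.0/23", "43.135.0.0/16",
)]

def in_tencent(addr: str) -> bool:
    """True if addr falls inside any of the listed network blocks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in TENCENT_BLOCKS)

print(in_tencent("43.135.12.34"))   # True
print(in_tencent("8.8.8.8"))        # False
```

The same membership test is what a firewall drop rule for those CIDRs effectively performs on every incoming packet.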
I have a firewall that logs every incoming connection to every port. If I get a connection to a port that has nothing behind it, then I consider the IP address that sent the connection to be malicious, and I block the IP address from connecting to any actual service ports.
This works for me, but I run very few things to serve very few people, so there's minimal collateral damage when 'overblocking' happens - the most common thing is that I lock myself out of my VPN (lolfacepalm).
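The approach described above can be sketched as a log filter: given the set of ports that actually have services behind them, any source hitting a different port goes on the blocklist. The log format and regex here are assumptions modeled on iptables-style `LOG` lines; adjust them to whatever your firewall emits:

```python
import re

SERVICE_PORTS = {22, 80, 443}          # ports with something actually behind them (example)
blocklist: set[str] = set()

# Assumed iptables LOG-style line format; adapt the regex to your firewall's output.
LINE = re.compile(r"SRC=(?P<src>[0-9a-f.:]+).*DPT=(?P<dpt>\d+)")

def process(line: str) -> None:
    """Blocklist any source that connects to a port we don't serve on."""
    m = LINE.search(line)
    if m and int(m.group("dpt")) not in SERVICE_PORTS:
        blocklist.add(m.group("src"))

process("kernel: IN=eth0 SRC=203.0.113.9 DST=198.51.100.1 PROTO=TCP SPT=54321 DPT=23")
process("kernel: IN=eth0 SRC=198.51.100.50 DST=198.51.100.1 PROTO=TCP SPT=40000 DPT=443")
print(blocklist)   # {'203.0.113.9'}
```

The self-lockout failure mode mentioned above is exactly why a real version of this wants an allowlist for your own addresses before anything gets blocked.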
I occasionally look at the database of IP addresses and do some pivot tabling to find the most common networks and have identified a number of cough security companies that do incessant scanning of the IPv4 internet among other networks that give me the wrong vibes.
[0] Uninvited Activity: https://github.com/UninvitedActivity/UninvitedActivity
P.S. If there aren't any Chinese or Russian IP addresses / networks in my lists, then I probably block them outright prior to the logging.
Here's how it identifies itself: “Mozilla/5.0 (compatible; Thinkbot/0.5.8; +In_the_test_phase,_if_the_Thinkbot_brings_you_trouble,_please_block_its_IP_address._Thank_you.)”.
I mean you could just ban the user agent?
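For a bot that announces itself like that, a case-insensitive substring match at the application layer is enough. A minimal sketch; "Thinkbot" is from the user agent quoted above, and the second entry is just an illustrative extra, not from the thread:

```python
# "Thinkbot" is the bot quoted above; "Bytespider" is only an illustrative extra entry.
BANNED_UA_SUBSTRINGS = ("Thinkbot", "Bytespider")

def is_banned(user_agent: str) -> bool:
    """Case-insensitive substring match; simpler and less brittle than a regex here."""
    ua = user_agent.lower()
    return any(s.lower() in ua for s in BANNED_UA_SUBSTRINGS)

ua = "Mozilla/5.0 (compatible; Thinkbot/0.5.8; +In_the_test_phase...)"
print(is_banned(ua))   # True
```

Of course, this only works as long as the bot keeps sending an honest User-Agent, which is exactly the limitation raised in the next comment.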
The real issue is with bots pretending not to be bots.
Though it would seem all bets are off and anyone will scrape anything. Now we're left with middlemen like Cloudflare that cost people millions of hours of time ticking boxes to prove they're human beings.
The TL;DR is that there are malicious browser plugins that make the browser into a web scraping bot.
I see this all the time in web server logs; it is recognizable as a GET on a deep link coming from some random IP, usually residential.
I don't know if it's because they operate in the service of capital rather than China, as here, but use of those methods in the former case seems to get more of a pass here.
So, are hackers and internet shittery coming from China? Block China's ASNs. Too bad ISPs won't do that, so you have to do it yourself. Keep it blocked until China enforces its computer fraud and abuse laws.