
How to Block Bad Bots and Spiders using .htaccess

Is your site suffering from spam comments, content scrapers, bandwidth leeches, and other bad bots?

In this Knowledge Base article, we’ll cover how to block bad bots with minimal effort to keep the trash away from your site and free up valuable hosting resources.

Let’s begin!


If you’re a ChemiCloud customer, you’re covered! We use custom security rules that block the following bots, which are known to crawl clients’ websites heavily and consume unnecessary resources.

• PetalBot
• MJ12bot
• DotBot
• SeznamBot
• 8LEGS
• Nimbostratus-Bot
• Semrush
• Ahrefs
• AspiegelBot
• AhrefsBot
• MauiBot
• BLEXBot
• Sogou

If you’re using one of these services yourself (Ahrefs, for example), our techs can disable the corresponding security rule if needed. Don’t hesitate to reach out to our support team. We’d be glad to help!

Identifying Bad Bots

The first step in blocking bad bots and other bad requests is identifying them. There are a few ways to do this, including keeping an eye on your website’s log files. Analyzing these log files is a lot like reading tea leaves: it takes practice and is more of an art than an exact science.

You can also look around on Google for log-parsing or log-analysis software, but being in the hosting industry, we like to look at the raw data. You may prefer other tools, so we can’t really recommend any specific apps, but there is a great way to do this with Excel described in an old, yet still relevant forum post.
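If you’d rather stay on the command line, here’s a quick sketch of pulling the noisiest user agents out of a raw access log in combined log format. We build a tiny sample log here so the commands are runnable as-is; on a real account you would point LOG at your actual log file (the path varies by host).

```shell
# Build a small sample access log (combined log format) so this
# snippet runs standalone; point LOG at your real log in practice.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
203.0.113.7 - - [12/Aug/2021:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "MJ12bot/v1.4.8"
203.0.113.7 - - [12/Aug/2021:10:00:01 +0000] "GET /page HTTP/1.1" 200 5120 "-" "MJ12bot/v1.4.8"
198.51.100.9 - - [12/Aug/2021:10:00:02 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
EOF

# The user agent is the last quoted field on each line; count each
# agent's requests and list the most frequent first. The first
# column of the output is the request count.
awk -F'"' '{print $(NF-1)}' "$LOG" | sort | uniq -c | sort -rn | head -10
```

Agents that account for a disproportionate share of requests are good candidates for a closer look.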

Once you’ve identified your bad bots, you can use several methods to block them, including:

  • Blocking via Request URI
  • Blocking via User-Agent
  • Blocking via Referrer
  • Blocking via IP Address

Before you use one of these methods, be sure to investigate the requests coming to your server/site to determine whether they should be blocked. The best way to do this is by Googling the bot or query, but there are also help forums and databases of known bad bots you can use to get more information.

Let’s cover how to block bots using each of the methods mentioned above!

Blocking via Request URI

Suppose you’ve examined your server logs and you’re seeing a lot of requests like the ones below:

https://www.example.com/asdf-crawl/request/?scanx=123
https://www.example2.net/sflkjfglkj-crawl/request/?scanx=123445

These requests all likely have different user agents, IP addresses, and referrers. So the only way to block similar future requests is to target the request string itself. Essentially, you would use .htaccess to block all requests that match that same pattern. The trick to this blocking technique is to find the best pattern. Ideally, you want to find the most common factor for the type of request you want to block.

In the above example, we have the following common patterns:

  • 123
  • /request
  • crawl
  • scanx

When deciding on a pattern to block, it’s important to choose one that isn’t used by any legitimate resources on your site. For this example, we could choose to block all requests that include the string “crawl”.

To do this, you can use the RedirectMatch directive from mod_alias by adding the following code to the .htaccess file at the root of your website, i.e. the public_html directory.

# Block via Request URI
<IfModule mod_alias.c>
	RedirectMatch 403 /crawl/
</IfModule>

Later on, if you decide you also want to block all requests that include the string ‘scanx’, you can add it to the pattern using the following syntax:

# Block via Request URI
<IfModule mod_alias.c>
	RedirectMatch 403 /(crawl|scanx)/
</IfModule>

Keep in mind, this technique only works when the target pattern is included in the main part of the request URI. To also block these patterns if included in the query-string portion of the request (i.e. the part after the question mark), you would use mod_rewrite instead, as seen below:

# Block via Query String
<IfModule mod_rewrite.c>
	RewriteEngine On
	RewriteCond %{QUERY_STRING} (crawl|scanx) [NC]
	RewriteRule (.*) - [F,L]
</IfModule>

The regular expression (regex) with mod_rewrite works the same as it does with mod_alias. Once this code is in place, all requests that include either of the banned strings will be denied access.

Remember to test your site for proper functionality before going live with this feature!
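One quick way to sanity-check a pattern offline is to run it against sample request URIs with grep, which uses the same extended regular expression syntax as RedirectMatch and RewriteCond. The sample URIs below are made-up examples:

```shell
# Run the blocking pattern against a few sample request URIs.
# Anything grep prints is a URI the .htaccess rule would block.
printf '%s\n' \
  '/asdf-crawl/request/?scanx=123' \
  '/blog/how-to-crawl-data/' \
  '/about-us/' \
| grep -E 'crawl|scanx'
```

Note that the second URI matches too, even though it looks like a legitimate blog post. That’s exactly the kind of false positive to watch for when picking a pattern.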

Blocking via User-Agent

Below we will demonstrate how to block bad bots via their user agent. Let’s say you’ve noticed a bunch of nasty spam requests all reporting one of the following user agents:

EvilBotHere
SpamSpewer
SecretAgentAgent

These are obviously not legit bots and you probably don’t want them sucking up your hosting resources.

To block all requests from any of these user agents (bots), add the following code to your .htaccess file:

# Block via User Agent
<IfModule mod_rewrite.c>
	RewriteEngine On
	RewriteCond %{HTTP_USER_AGENT} (EvilBotHere|SpamSpewer|SecretAgentAgent) [NC]
	RewriteRule (.*) - [F,L]
</IfModule>

Save the file and upload it to the public_html folder of your hosting account by using cPanel’s built-in File Manager. Before going live, you should test to be sure your site is still working properly.

You can also use a free online tool like Bots vs Browsers to look up bots to block and to verify that they’re blocked, using its test tools.

If you want or need to add more bots to that list, you can do so by separating the bot names with a pipe character ( | ), all on a single line, like this:

RewriteCond %{HTTP_USER_AGENT} (EvilBotHere|SpamSpewer|SecretAgentAgent|AnotherBot|AndSoOn) [NC]
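Because of the [NC] flag, the match is case-insensitive. You can mimic that locally with grep -Ei to check which user-agent strings a pattern catches; the agent strings below are made-up examples:

```shell
# grep -Ei mirrors a case-insensitive RewriteCond match: the pattern
# catches the bot even when the log reports its name in all caps.
printf '%s\n' \
  'EVILBOTHERE/2.1' \
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' \
| grep -Ei 'EvilBotHere|SpamSpewer|SecretAgentAgent'
```

Only the bot line is printed; the ordinary browser agent passes through untouched.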

Blocking via Referrer

If you find your site is being targeted by people who are leeching (stealing) your site’s resources and bandwidth, you can easily block requests from those specific referrers.

For example, let’s say you’re seeing the following referrers in your logs:

http://www.spamreferrer1.org/
http://bandwidthleech.com/
http://www.contentthieves.ru/

You can use Apache’s built-in mod_rewrite to block these referrers. To do this, add the following code to your site’s .htaccess:

# Block via Referrer
<IfModule mod_rewrite.c>
	RewriteEngine On
	RewriteCond %{HTTP_REFERER} ^https?://(.*)spamreferrer1\.org [NC,OR]
	RewriteCond %{HTTP_REFERER} ^https?://(.*)bandwidthleech\.com [NC,OR]
	RewriteCond %{HTTP_REFERER} ^https?://(.*)contentthieves\.ru [NC]
	RewriteRule (.*) - [F,L]
</IfModule>

This code does the following:

  1. Enables mod_rewrite, if it wasn’t already enabled.
  2. Checks the referrer for any of the URLs on the list.
  3. If the referrer is a match, it’s blocked with a 403 “Forbidden” response.

You can easily add new referrers to the list by adding a similar RewriteCond.

The important thing to remember is that the last RewriteCond must NOT include an OR flag.

Blocking via IP Address

When you’re dealing with specific users, blocking via IP address can be very handy. Unfortunately, many bots use huge IP ranges and don’t always disclose what they are (not to mention proxies, reverse proxies, caches, spoofing, and the like), so this technique is only useful in specific cases.

If you do want to block a user based on their associated IP address, you can use the following code (the address shown is just an example from the reserved documentation range):

# Block via IP Address
<IfModule mod_rewrite.c>
	RewriteEngine On
	RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.50$
	RewriteRule (.*) - [F,L]
</IfModule>

That’s all there is to that one. Keep in mind, you’re escaping the dots with a backslash, \.

This tells Apache (or LiteSpeed in ChemiCloud’s case) to treat the dots as literal characters rather than the any-character wildcard an unescaped dot means in a regular expression. Together with the ^ and $ anchors, this ensures we’re only blocking the specified IP address, so there won’t be any false positives.
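One subtlety worth knowing: a pattern that only anchors the start will also match longer addresses that share the same prefix. You can see the difference with grep, which uses the same extended-regex syntax (the addresses are examples from the reserved documentation range):

```shell
# A prefix-only pattern matches any address that starts with it,
# so both addresses below are printed:
printf '%s\n' '203.0.113.5' '203.0.113.50' | grep -E '^203\.0\.113\.5'

# Adding a $ anchor at the end matches only the exact address,
# so only the first one is printed:
printf '%s\n' '203.0.113.5' '203.0.113.50' | grep -E '^203\.0\.113\.5$'
```

When you intend to block a single IP, ending the pattern with $ avoids accidentally catching its neighbors.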

If you want to block more than one IP, the code would look like this:

# Block via IP Address
<IfModule mod_rewrite.c>
	RewriteEngine On
	RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.50$ [OR]
	RewriteCond %{REMOTE_ADDR} ^198\.51\.100\.25$ [OR]
	RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.99$
	RewriteRule (.*) - [F,L]
</IfModule>

As in the previous mod_rewrite techniques we’ve covered, the last RewriteCond must NOT include the [OR] flag. Other than that, everything is straightforward.

To block a RANGE of IP addresses, match only the leading octets and leave the end of the pattern open, as in the code below:

# Block via IP Address
<IfModule mod_rewrite.c>
	RewriteEngine On
	RewriteCond %{REMOTE_ADDR} ^203\.           [OR]
	RewriteCond %{REMOTE_ADDR} ^198\.51\.       [OR]
	RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
	RewriteRule (.*) - [F,L]
</IfModule>

In that code, we’re blocking the following:

  • All IP addresses that begin with 203.
  • All IP addresses that begin with 198.51.
  • All IP addresses that begin with 192.0.2.

And that’s how you block different forms of bots or users from your website using .htaccess!

Updated on August 12, 2021
