Question 1

What is a robots.txt file and why do I need one?

Accepted Answer

A robots.txt file is a plain text file placed at the root of your website (e.g. https://example.com/robots.txt) that tells web crawlers which paths they may or may not visit. Crawlers such as Googlebot check it before crawling. Its main use is protecting crawl budget: if Google spends its crawl quota on admin panels, checkout flows, duplicate pages, or internal search results, it may reach your important content less often. A well-configured robots.txt directs crawlers toward the pages that matter and away from the ones that don't.
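As a minimal sketch of what "checking robots.txt before crawling" means, the snippet below parses a small robots.txt with Python's standard urllib.robotparser module and asks whether two URLs may be fetched. The rules and URLs are illustrative assumptions, not a recommended configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /checkout/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler asks this question before fetching each URL.
admin_allowed = parser.can_fetch("*", "https://example.com/admin/settings")
blog_allowed = parser.can_fetch("*", "https://example.com/blog/first-post")

print("admin:", admin_allowed)
print("blog:", blog_allowed)
```

Paths that match no Disallow rule stay crawlable by default, so the file only needs to list what should be skipped.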

Question 2

What is the difference between Disallow and noindex in robots.txt?

Accepted Answer

Disallow in robots.txt tells a crawler not to fetch a URL, so a compliant crawler never reads the page's content. noindex is not a robots.txt rule: it is a robots meta tag or an X-Robots-Tag HTTP header that tells a crawler "visit the page but don't include it in search results." The critical mistake to avoid: if you Disallow a URL, Googlebot cannot see the noindex on it, so the URL may still appear in search results if it is linked from elsewhere. To fully remove a page from Google, add noindex to the page and keep it crawlable (remove the Disallow rule). Use Disallow only to save crawl budget on content whose indexing you don't care about.
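The interaction between the two can be sketched in a few lines. The helper name crawler_can_see_noindex and the example URL are hypothetical; the check relies on Python's standard urllib.robotparser, which is a simplified model of how a real search engine evaluates the file.

```python
from urllib.robotparser import RobotFileParser

# Two standard ways to send noindex (either one is enough):
NOINDEX_META_TAG = '<meta name="robots" content="noindex">'
NOINDEX_HTTP_HEADER = "X-Robots-Tag: noindex"


def crawler_can_see_noindex(robots_txt: str, url: str,
                            user_agent: str = "Googlebot") -> bool:
    """A crawler can only read a page's noindex if robots.txt lets it fetch the page."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical page we want removed from search results.
URL = "https://example.com/old-page"

blocked_rules = "User-agent: *\nDisallow: /old-page\n"
open_rules = "User-agent: *\nDisallow:\n"  # an empty Disallow blocks nothing

seen_when_blocked = crawler_can_see_noindex(blocked_rules, URL)
seen_when_open = crawler_can_see_noindex(open_rules, URL)

print("noindex visible with Disallow:", seen_when_blocked)
print("noindex visible without Disallow:", seen_when_open)
```

If the first check is the situation on your site, the noindex on that page has no effect until the Disallow rule is removed.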

Question 3

Where do I upload the robots.txt file?

Accepted Answer

Upload robots.txt to the root directory of your domain. It must be accessible at exactly https://yourdomain.com/robots.txt; a copy in a subdirectory is ignored. For WordPress, place it in the site's root directory (commonly the public_html folder on shared hosting). For Next.js static exports, put it in the public/ folder. For Apache/Nginx, place it in the web root (usually /var/www/html/). Each subdomain needs its own robots.txt at the subdomain root. You can verify it is accessible by opening the URL directly in your browser.
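To make the "root of the host" rule concrete, here is a small sketch that derives the robots.txt location for any page URL. The function name robots_txt_url is a hypothetical helper, and the URLs are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page.

    The file always lives at the root of the page's scheme and host,
    so the path and query string of the page are discarded.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


main_site = robots_txt_url("https://example.com/blog/post?id=1")
shop_subdomain = robots_txt_url("https://shop.example.com/cart/")

print(main_site)
print(shop_subdomain)
```

The two results differ by host, which is why a subdomain is not covered by the main domain's file.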

Question 4

What paths should I block in robots.txt for an ecommerce site?

Accepted Answer

Ecommerce sites typically block: /cart/ (session-specific, no SEO value), /checkout/ (private), /account/ and /login/ (private user paths), /search (or ?q= query strings that generate duplicate pages), /wishlist/, /compare/, and any /admin/ paths. Keep /products/, /categories/, /collections/, /blog/, and your canonical product pages open. Blocking paginated or filtered URLs (/products?sort=price or /category/page/2/) is common but requires care — only block filters that create true duplicates, not URLs with unique content.
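The block list above could be written and sanity-checked roughly as follows. This is a sketch under assumptions: the store URLs are made up, and Python's urllib.robotparser uses simple prefix matching without wildcard support, so it only approximates how a given search engine evaluates the same file.

```python
from urllib.robotparser import RobotFileParser

# Rules taken from the list above; adjust paths to your own store's URL scheme.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /login/
Disallow: /search
Disallow: /wishlist/
Disallow: /compare/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(path: str) -> bool:
    """Check a path on the hypothetical store against the rules above."""
    return parser.can_fetch("*", "https://shop.example.com" + path)


# Rules are prefix matches: "Disallow: /search" also covers /search?q=shoes.
for path in ["/cart/", "/search?q=shoes", "/products/blue-widget", "/blog/sizing-guide"]:
    print(path, is_crawlable(path))
```

Because matching is by prefix, a rule like Disallow: /search would also block a page such as /search-tips, so it is worth testing real URLs from your site before deploying.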

Question 5

Should robots.txt include a Sitemap directive?

Accepted Answer

Yes. The Sitemap directive tells search engines exactly where to find your XML sitemap without waiting for them to discover it through other means. Add a line like: Sitemap: https://example.com/sitemap.xml (use the full absolute URL). The directive applies to all crawlers, not just the User-agent block it appears near. If you have multiple sitemaps (images, videos, news), add a separate Sitemap line for each. This is especially useful for new sites that haven't accumulated many inbound links yet.
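A short sketch of a file with two Sitemap lines, read back with Python's urllib.robotparser (the site_maps() method requires Python 3.8 or later). The sitemap file names are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one rule group plus two sitemap references.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-images.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Sitemap lines are collected independently of any User-agent group.
sitemaps = parser.site_maps()
print(sitemaps)
```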

Question 6

What does Crawl-delay do in robots.txt?

Accepted Answer

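Crawl-delay specifies the minimum number of seconds a crawler should wait between requests to your server. For example, Crawl-delay: 2 asks bots to pause 2 seconds between page requests. This is useful for shared hosting or low-capacity servers that get overwhelmed by aggressive bots. Note: Googlebot ignores Crawl-delay. The crawl rate limiter setting in Google Search Console was retired in early 2024; Google now adjusts its crawl rate automatically based on how your server responds, and temporarily returning 500, 503, or 429 status codes is its documented way to slow crawling in an emergency. Crawl-delay is honored mainly by Bing and some other crawlers; support varies, so check each crawler's documentation.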
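From the crawler's side, honoring the directive might look like the sketch below. The helper delay_for and the bot name are hypothetical; crawl_delay() is part of Python's urllib.robotparser (Python 3.6 or later) and returns None when no delay is declared.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file asking all crawlers to wait 2 seconds between requests.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def delay_for(user_agent: str, default: float = 0.0) -> float:
    """Seconds a polite crawler should sleep between requests."""
    delay = parser.crawl_delay(user_agent)
    return default if delay is None else float(delay)


wait_seconds = delay_for("ExampleBot")
print(wait_seconds)
# A polite crawler would call time.sleep(wait_seconds) between page fetches.
```

Nothing enforces the pause: a crawler that does not read or respect the value simply proceeds at its own rate.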

Question 7

Can I use robots.txt to block my staging site from Google?

Accepted Answer

You can use Disallow: / in the staging site's robots.txt to discourage compliant crawlers, but this is not access control — the file itself is public, and the content can still be accessed directly. For true privacy on a staging environment, use HTTP authentication (basic auth), IP allowlists, or environment-level access restrictions. The robots.txt approach works well as a second layer to reduce accidental indexing of staging content that gets discovered through links, but should never be your only protection.
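One common pattern is to generate the file per environment so that staging always gets a blanket Disallow. The sketch below assumes a hypothetical environment name passed in by your deployment setup, with placeholder production rules; it does not replace authentication.

```python
from urllib.robotparser import RobotFileParser

# Placeholder rule sets; only the staging one matters for this example.
PRODUCTION_RULES = "User-agent: *\nDisallow: /admin/\n"
STAGING_RULES = "User-agent: *\nDisallow: /\n"


def robots_txt_for(environment: str) -> str:
    """Return blanket-disallow rules for anything that is not production."""
    return PRODUCTION_RULES if environment == "production" else STAGING_RULES


def is_crawlable(robots_txt: str, url: str) -> bool:
    """Ask whether a compliant crawler would fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", url)


staging_ok = is_crawlable(robots_txt_for("staging"), "https://staging.example.com/")
production_ok = is_crawlable(robots_txt_for("production"), "https://example.com/pricing")

print("staging crawlable:", staging_ok)
print("production crawlable:", production_ok)
```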

Question 8

Does robots.txt affect all search engines equally?

Accepted Answer

No. Major crawlers such as Googlebot, Bingbot, Yandex's crawler, and Applebot respect robots.txt, but rogue bots and scrapers ignore it entirely. The robots.txt protocol is voluntary; there is no technical enforcement. Support for individual directives also differs: Googlebot, for example, ignores Crawl-delay. The User-agent: * group applies to every compliant crawler that has no more specific group of its own. You can write crawler-specific rules using the crawler's User-agent name, like User-agent: Googlebot to target only Google, followed by the Disallow or Allow rules for that crawler. A crawler that finds a group naming it follows that group and ignores the * group.
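The effect of a crawler-specific group can be sketched like this. The paths are illustrative assumptions, and Python's urllib.robotparser is used as a stand-in for a compliant crawler, not as a model of any particular search engine.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: Googlebot gets its own group, everyone else falls back to *.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /drafts/
Disallow: /archive/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

ARCHIVE_URL = "https://example.com/archive/2020"

# Googlebot matches its own group, which does not mention /archive/.
googlebot_archive = parser.can_fetch("Googlebot", ARCHIVE_URL)
# A crawler with no group of its own is governed by the * group.
other_bot_archive = parser.can_fetch("Bingbot", ARCHIVE_URL)

print("Googlebot:", googlebot_archive)
print("Bingbot:", other_bot_archive)
```

Note that /drafts/ had to be repeated inside the Googlebot group: rules in the * group are not inherited by a crawler that has its own group.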

Robots.txt Generator — Create Crawl Rules for SEO
