The Small File That Shapes Everything Google Does on Your Site
Most website owners think about robots.txt once — usually when something breaks — and never again.
That’s a mistake.
robots.txt is a plain text file that sits at the root of your domain. Every time a crawler visits your site, it reads this file first. It’s the first conversation between your website and Google, Bing, and every other bot that finds you on the web.
What you say in that conversation determines how efficiently Google crawls your site, which pages get indexed, whether your admin panel is exposed to automated probes, and whether AI companies can freely use your content to train their models.
A misconfigured robots.txt can quietly block your most important pages from Google. An absent one means you’re leaving crawl budget decisions entirely up to Google — which doesn’t always prioritize what you’d prioritize. The right configuration, built on accurate technical understanding, makes your entire SEO operation run more cleanly.
This guide covers everything from first principles to advanced configuration — what robots.txt actually does, how to write it correctly, how to set it up on every major platform, and how to fix the errors that silently undermine indexation.
Part 1: What robots.txt Is, What It Does, and What It Doesn’t
The Actual Definition
robots.txt is a text file hosted at yoursite.com/robots.txt. When a web crawler visits your site, it checks this location before crawling anything else. The file contains directives — rules — that tell the crawler which parts of your site it may or may not access.
Here is what a straightforward robots.txt looks like:
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /cart/
Disallow: /checkout/
Sitemap: https://yoursite.com/sitemap.xml
Breaking this down line by line:
User-agent: * means these rules apply to all crawlers. You can also target specific crawlers by name — more on that shortly.
Allow: / tells crawlers they’re permitted to access the root of the site and everything under it, by default.
Disallow: /admin/ tells crawlers not to access anything under the /admin/ path.
Sitemap: points crawlers to your XML sitemap, helping them find all your pages efficiently.
The entire concept is that simple. The complexity is in knowing what to include and why.
What robots.txt Actually Controls
robots.txt controls crawling — whether a bot reads the content of a URL. It does not directly control indexing — whether a page appears in search results.
This is the most important technical distinction in understanding robots.txt, and it’s the source of most configuration mistakes.
If you disallow a page in robots.txt, Google’s crawler won’t read it. But if that page is linked to from an external site, Google may still index a reference to it without knowing what’s on the page. The URL can then appear in search results as a bare listing with no description.
To prevent a page from appearing in search results entirely, you need a noindex meta tag on the page itself — not a disallow in robots.txt. These two mechanisms are complementary, not interchangeable.
robots.txt is the right tool for:
- Preventing crawlers from reading pages that don’t need to be indexed (admin pages, duplicate parameter URLs, staging areas)
- Conserving crawl budget by directing Google away from low-value pages
- Pointing all crawlers to your sitemap
- Blocking specific bots from accessing your content
robots.txt is the wrong tool for:
- Hiding pages from search results (use noindex instead)
- Securing sensitive information (use server-level access controls)
- Blocking determined malicious actors (robots.txt is a voluntary protocol — malicious scrapers ignore it)
The Voluntary Protocol Problem
robots.txt works because legitimate bots choose to follow it. Google, Bing, major SEO tools, and most professional crawlers read your robots.txt and respect its rules. This covers the overwhelming majority of automated traffic to most websites.
Malicious scrapers, unauthorized AI training bots, and vulnerability scanners frequently ignore robots.txt entirely. If you’re concerned about them, robots.txt is a first-layer signal — not a firewall. Blocking them properly requires server-level rate limiting, IP blocking, and WAF (Web Application Firewall) rules.
This distinction matters because many guides describe robots.txt as a security tool. It isn’t — not in any technical sense. It’s a communication protocol between your site and well-behaved bots. Understanding this prevents both over-reliance and under-utilization.
Part 2: The Complete robots.txt Syntax Reference
Basic Structure
Every robots.txt file follows the same structure: groups of rules, each starting with a User-agent declaration, followed by Allow and Disallow directives.
User-agent: [crawler name or *]
Allow: [path]
Disallow: [path]
Crawl-delay: [seconds]
Sitemap: [full URL]
Rules:
- One directive per line
- A colon and space after each directive: Disallow: /path/
- Paths must start with /
- Blank lines separate rule groups for different user-agents
- Comments begin with #
- The Sitemap: directive goes at the end, outside any user-agent group
Correct vs. Incorrect Syntax
The most common syntax errors that silently break robots.txt:
# WRONG
Disallow /admin/ ← missing colon
Disallow: admin/ ← missing leading slash
# WORKS, BUT INCONSISTENT
user-agent: * ← directive names are case-insensitive; pick one style
Allow:/blog/ ← the space after the colon is optional; include it for readability
# CORRECT
User-agent: *
Disallow: /admin/
Allow: /blog/
One malformed line doesn’t necessarily break the entire file, but it can cause unpredictable behavior. Always validate before publishing.
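The checks above can be automated. The following is a minimal sketch in Python, not a full validator: the function name lint_robots and the set of known directives are assumptions made for illustration, and it only catches the mistakes listed in this section.

```python
KNOWN = {"user-agent", "allow", "disallow", "crawl-delay", "sitemap"}


def lint_robots(text):
    """Return (line_number, message) pairs for common robots.txt mistakes."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing colon"))
            continue
        name, value = (part.strip() for part in line.split(":", 1))
        if name.lower() not in KNOWN:
            problems.append((number, f"unknown directive: {name}"))
        elif (name.lower() in {"allow", "disallow"}
              and value and not value.startswith(("/", "*"))):
            problems.append((number, "path should start with /"))
    return problems


sample = "User-agent: *\nDisallow /admin/\nDisallow: admin/\nAllow: /blog/\n"
for number, message in lint_robots(sample):
    print(f"line {number}: {message}")
```

Run against the sample, it flags line 2 (missing colon) and line 3 (missing leading slash) and stays silent on the valid lines.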
Wildcards and Pattern Matching
robots.txt supports two wildcard characters:
* matches any sequence of characters. $ matches the end of the URL.
Disallow: /*? ← Block all URLs containing a query string
Disallow: /*.pdf$ ← Block all URLs ending in .pdf
Disallow: /search/ ← Block exact path and everything under it
Googlebot and most modern crawlers support these patterns. Some older crawlers do not. Keep patterns simple where possible — complex wildcard rules are harder to debug and more likely to produce unintended matches.
Regular expressions are not supported. You cannot use [a-z] or (pattern1|pattern2) syntax.
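To see how the two wildcards behave, it can help to translate a rule into a regular expression yourself. The sketch below is an illustration of the matching semantics described above, not Google’s parser; the function name rule_matches is an assumption.

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern matches a URL path (plus query)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then turn the escaped '*' back into "any characters".
    regex = re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    # Rules always match from the start of the path.
    return re.match(regex, path) is not None


for pattern, path in [
    ("/*?", "/shoes?color=red"),
    ("/*.pdf$", "/files/guide.pdf"),
    ("/*.pdf$", "/files/guide.pdf?download=1"),
    ("/search/", "/search/results"),
]:
    print(pattern, path, rule_matches(pattern, path))
```

Note the third case: because of the $ anchor, a PDF URL with a query string appended is not matched by /*.pdf$.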
The Allow/Disallow Hierarchy
When rules conflict, the more specific rule takes precedence, where “more specific” means the longer matching path. If an Allow and a Disallow match with equal length, Google applies the Allow.
User-agent: *
Disallow: /blog/
Allow: /blog/featured/
This blocks all of /blog/ except the /blog/featured/ subdirectory. The more specific path wins.
User-agent: *
Disallow: /
Allow: /public/
This blocks everything except /public/. A common pattern for staging sites that need one section publicly accessible.
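The precedence logic in the two examples above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: it handles plain path prefixes only (no wildcards), and the function name is_allowed and the tuple format are hypothetical.

```python
def is_allowed(rules, path):
    """Decide whether `path` may be crawled.

    rules: list of ("allow" | "disallow", path_prefix) tuples from one group.
    """
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for kind, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = kind == "allow"
        # The longer prefix wins; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


blog_rules = [("disallow", "/blog/"), ("allow", "/blog/featured/")]
staging_rules = [("disallow", "/"), ("allow", "/public/")]

print(is_allowed(blog_rules, "/blog/some-post"))
print(is_allowed(blog_rules, "/blog/featured/launch"))
print(is_allowed(staging_rules, "/public/docs"))
print(is_allowed(staging_rules, "/internal"))
```

The order of the rules in the file does not matter in this model; only the length of the matching path does.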
The Crawl-delay Directive
Crawl-delay tells crawlers to wait a specified number of seconds between requests:
Crawl-delay: 2
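One critical fact: Googlebot does not support Crawl-delay. Google ignores this directive entirely and manages its own crawl rate based on your server’s response times. The legacy Crawl Rate setting in Google Search Console was retired in early 2024, so it is no longer an option. If Googlebot is overloading your server, Google’s documented approach is to temporarily return 500, 503, or 429 status codes, which causes it to slow down.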
Other crawlers (Bingbot, many SEO tools) do honor Crawl-delay. On shared hosting environments where server load is a concern, setting Crawl-delay: 1 is a reasonable default for non-Google crawlers.
For most modern hosting setups, Crawl-delay is unnecessary and can be omitted.
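If you run your own crawler or monitoring script, this is roughly how a polite client would read the directive. The sketch uses Python’s standard urllib.robotparser; the helper name delay_for, the default of 0 seconds, and the sample rules are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: Bingbot
Crawl-delay: 2

User-agent: *
Disallow: /admin/
"""


def delay_for(robots_text, user_agent, default=0):
    """Seconds a polite crawler should wait between requests."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return default if delay is None else delay


print(delay_for(ROBOTS, "Bingbot"))
print(delay_for(ROBOTS, "SomeOtherBot"))
```

A crawler that matches the Bingbot group gets the 2-second delay; any other crawler falls back to the default because the wildcard group sets none.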
Part 3: What to Block, What to Allow, and Why
The Core Configuration for Most Websites
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /wp-includes/
Disallow: /wp-json/
Disallow: /?s=
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /thank-you/
Sitemap: https://yoursite.com/sitemap.xml
This is the right starting point for most WordPress sites. What each block accomplishes:
/wp-admin/ and /wp-login.php: Your backend. No value for Google. Blocking keeps these URLs out of the crawl; it does not stop automated login probes, which ignore robots.txt.
/wp-includes/: WordPress core files, not content. Some themes and plugins load front-end scripts and styles from this directory, so if the URL Inspection tool reports blocked page resources, remove this rule.
/wp-json/: REST API endpoints. If you have a headless setup that needs these crawled, remove this rule. For standard sites, these pages have no SEO value.
/?s=: Search result pages generated when users search your site. Each search creates a unique URL with no independent SEO value, and large volumes of such thin, dynamically generated pages waste crawl budget and dilute the quality signals Google sees for the site.
/cart/, /checkout/, /my-account/: E-commerce and membership pages. Personal to each user, generate no search traffic, waste crawl budget.
/thank-you/: Post-conversion pages. Not useful in search results.
What Not to Block
The most costly robots.txt mistake is accidentally blocking content that should rank. Never disallow:
Your main content directories: /blog/, /articles/, /products/, /services/, /category/
CSS, JavaScript, and image files. Google needs to render your pages accurately to understand them. Blocking these files prevents Google from seeing your pages the way users see them, which impairs rendering and, with it, indexation quality.
# WRONG — Don't do this
Disallow: /wp-content/uploads/
Disallow: /wp-content/themes/
Disallow: /css/
Disallow: /js/
Disallow: /images/
Canonical pages. If you’ve set canonical URLs, the canonical version must be crawlable. Blocking a canonical URL defeats its purpose entirely.
Managing Duplicate Content With robots.txt
URL parameter variations are one of the most common sources of unintentional duplicate content. Faceted navigation in e-commerce creates thousands of filter combinations that are effectively the same page with slightly different product subsets.
# Block filter/sort variations
Disallow: /*?sort=
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?page=
Disallow: /*?ref=
Disallow: /*?utm_
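One detail worth checking in rules like these: /*?sort= only matches when sort is the first parameter after the question mark. A URL such as /shoes?color=red&sort=price contains &sort= instead, so it slips through. A small generator can emit both variants for each parameter. This is a sketch; the function name param_rules and the token list are assumptions.

```python
def param_rules(tokens):
    """Build Disallow lines that match a query token in any position."""
    lines = []
    for token in tokens:
        lines.append(f"Disallow: /*?{token}")  # token is the first parameter
        lines.append(f"Disallow: /*&{token}")  # token follows another parameter
    return lines


print("\n".join(param_rules(["sort=", "color=", "utm_"])))
```

Paste the output into the relevant user-agent group, then test a few real URLs from your site against it before publishing.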
The alternative approach — preferred for e-commerce — is to use canonical tags on filtered pages pointing to the base category page, combined with selective robots.txt blocking. This gives Google more signal than blocking alone.
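For pagination, Google’s guidance is to leave paginated pages crawlable and link them to each other with ordinary anchor links rather than blocking them with robots.txt. Google has stated that it no longer uses rel="next" / rel="prev" markup as an indexing signal, though the markup is harmless and other search engines may still read it. If your pagination relies on a ?page= parameter, leave the /*?page= rule out of the block list above.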
Blocking AI Training Crawlers
Several AI companies run crawlers that harvest web content for training large language models. These are distinct from search engine crawlers — they don’t drive referral traffic to your site in exchange for indexing your content.
Blocking them in robots.txt is straightforward. The key is using the correct user-agent strings, which change as new bots emerge:
# OpenAI
User-agent: GPTBot
Disallow: /
# Common Crawl (used to train many AI models)
User-agent: CCBot
Disallow: /
# Anthropic
User-agent: ClaudeBot
Disallow: /
# Google AI (Bard/Gemini training)
User-agent: Google-Extended
Disallow: /
# Perplexity AI
User-agent: PerplexityBot
Disallow: /
# Meta
User-agent: FacebookBot
Disallow: /
# Apple
User-agent: Applebot-Extended
Disallow: /
Important: this list evolves as new AI products launch. Review quarterly and check current crawler documentation from each company. Using an incorrect user-agent string means your block does nothing — the crawler simply doesn’t recognize it as a rule that applies to it.
robots.txt blocking works here because these are legitimate companies whose crawlers voluntarily respect the protocol. Unauthorized scrapers that ignore robots.txt require different countermeasures.
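Because a misspelled user-agent token fails silently, it can be worth checking your file against the list of crawlers you intend to block. The sketch below uses Python’s standard urllib.robotparser; the function name unblocked_agents and the sample file are assumptions, and the parser’s matching is simpler than Google’s, so treat the result as a sanity check.

```python
from urllib.robotparser import RobotFileParser


def unblocked_agents(robots_text, agents):
    """Return the agents from `agents` that may still fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [agent for agent in agents if parser.can_fetch(agent, "/")]


robots_text = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

print(unblocked_agents(robots_text, ["GPTBot", "CCBot", "ClaudeBot"]))
```

Here ClaudeBot is reported as unblocked because the sample file has no group for it, which is exactly the kind of gap a quarterly review is meant to catch.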
Part 4: How to Create and Deploy robots.txt on Every Platform
WordPress
WordPress does not create a physical robots.txt file for new installations. Instead it serves a minimal virtual one at /robots.txt when no physical file exists. The major SEO plugins let you override it and give you a clean interface for managing rules.
Using Yoast SEO:
Navigate to Yoast SEO → Tools → File Editor. You’ll see a robots.txt editor directly in your WordPress dashboard. Edit the file and save. Yoast writes changes to your actual robots.txt file on the server.
Note: Yoast automatically adds a reference to your XML sitemap. Verify it’s pointing to the correct URL.
Using Rank Math:
Navigate to Rank Math → General Settings → Edit robots.txt. The interface is similar to Yoast. Rank Math also handles the sitemap reference automatically.
Manual upload:
If you prefer full control, create a text file named robots.txt, write your rules, and upload it to your root directory (/public_html/robots.txt) via FTP or your hosting file manager. WordPress will use this file directly.
If both a physical file and a plugin-managed file exist, the physical file takes precedence. Verify which one Google is reading by checking yoursite.com/robots.txt in your browser and comparing it against your plugin settings.
Shopify
Shopify generates a default robots.txt automatically. Since 2021, Shopify allows merchants to customize it through the theme editor using a Liquid template.
To access it: Online Store → Themes → Actions → Edit Code → Templates → robots.txt.liquid
The default template handles most standard configurations correctly. Common customizations include blocking specific URL parameters generated by apps or adding AI crawler blocks.
To add your own rules, leave the default {% for group in robots.default_groups %} ... {% endfor %} loop in place and append plain-text directives after it:
User-agent: GPTBot
Disallow: /
If you’re not comfortable editing Liquid templates, Shopify’s default robots.txt is functional for most stores. The more impactful optimization is submitting your sitemap to Google Search Console.
Wix
Wix provides a robots.txt editor in the SEO settings section of your dashboard. Navigate to Marketing & SEO → SEO Tools → Robots.txt.
Wix generates default rules that block its internal infrastructure pages, which is correct behavior. You can add custom directives for AI bot blocking and additional path restrictions through this interface.
The editor validates syntax before saving, which prevents the most common formatting errors.
Squarespace
Squarespace automatically generates and manages robots.txt. Direct editing is not available through the standard interface.
For page-level indexing control, use the noindex setting available in each page’s SEO settings. This is more granular than robots.txt anyway for most Squarespace use cases.
If your Squarespace site has specific crawl control requirements, the standard approach is to contact Squarespace support. Most common configurations — blocking cart pages, staging areas — are handled by Squarespace’s auto-generated rules.
Custom-Built Websites
Static sites: Create a robots.txt file, write your rules in any text editor, and place it in the root directory of your server. It must be accessible at yourdomain.com/robots.txt.
Node.js/Express:
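const express = require('express');
const app = express();

// Serve robots.txt as plain text from the site root
app.get('/robots.txt', (req, res) => {
  res.type('text/plain');
  res.send(
    'User-agent: *\n' +
    'Allow: /\n' +
    'Disallow: /admin/\n' +
    'Disallow: /api/\n\n' +
    'Sitemap: https://yoursite.com/sitemap.xml\n'
  );
});

app.listen(3000); // port is an assumption; use the one your app already listens on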
PHP: Create the file manually and upload to your server root. PHP doesn’t require any special routing for static files at the root directory.
Django: Serve the file from a URL pattern at /robots.txt, for example with a TemplateView that renders a robots.txt template with content_type="text/plain". Alternatively, use the django-robots package for database-driven robot rules.
Next.js: Place robots.txt in your /public directory. Next.js serves all files in /public from the root path automatically.
Submitting to Google Search Console
Creating and uploading your robots.txt file doesn’t automatically inform Google of its existence. While Googlebot will find it eventually during regular crawling, the best practice is to verify it in Google Search Console.
Go to Google Search Console, select your property, and navigate to Settings → robots.txt. Search Console shows the current version Google has cached, when it was last fetched, and any issues detected.
There’s no separate “submit” process for robots.txt the way there is for sitemaps. Verification in Search Console confirms Google can read your file and shows you exactly what it sees.
After making significant changes to robots.txt, use the URL Inspection tool to test specific pages and confirm they’re accessible as intended before the next scheduled crawl.
Part 5: The 40 Questions Website Owners Ask About robots.txt
Section A: Foundations
Q1: What is robots.txt and do I need one?
robots.txt is a text file at your root domain that controls how web crawlers interact with your site. Every professional website should have one.
Without robots.txt, crawlers make their own decisions about what to crawl. For a small, well-structured site with clean internal linking, this works adequately. For any site with admin pages, user-specific content, duplicate URL variations, or large volumes of pages, an unconfigured crawl wastes the budget that should go toward your most important content.
The setup takes five minutes. The benefits (cleaner indexation, faster discovery of new content, less crawl budget spent on admin areas) are lasting and require little ongoing work for most sites.
Q2: Does robots.txt actually stop search engines from crawling my site?
It instructs them. Legitimate crawlers — Google, Bing, most professional SEO tools — voluntarily read and follow robots.txt rules. This covers the vast majority of automated traffic on the web.
Malicious scrapers, unauthorized data harvesters, and vulnerability scanners typically ignore robots.txt. The file is a communication protocol, not an access control system. For actual access control, you need server-level security: password protection, IP blocking, rate limiting, or a WAF.
Understanding this distinction prevents the common mistake of treating robots.txt as a security measure when it isn’t one — and the equally common mistake of dismissing it because it doesn’t stop all bots, when it effectively manages the majority of legitimate ones.
Q3: How do I fix the “Blocked by robots.txt” error in Google Search Console?
This error means a page you want crawled is being blocked by a rule in your robots.txt file. The fix is to identify which rule is causing the block and either modify or remove it.
In Google Search Console, open the Page indexing report (Indexing → Pages) and look under “Why pages aren’t indexed” for “Blocked by robots.txt.” Click into any affected URL and use the URL Inspection tool to confirm that crawling is disallowed, then match the URL against your Disallow rules to find the one responsible.
Open your robots.txt file, find the rule, and determine whether the block is intentional. If the page should be crawled, remove or narrow the relevant Disallow directive. If the block is correct but the page is appearing in your index report anyway, that indicates an inconsistency between your sitemap and robots.txt worth resolving.
After editing, validate your updated robots.txt in Search Console and re-request indexation for any previously blocked pages that should now be crawlable.
Q4: What should I put in my robots.txt file?
Block only what genuinely shouldn’t be crawled. The most common and universally applicable blocks:
Admin and backend pages: /admin/, /wp-admin/, /wp-login.php. These have no SEO value and only consume crawl budget.
Search result pages: /?s=, /search/ — dynamically generated, low quality, and can create enormous volumes of thin duplicate pages.
User account and transactional pages: /cart/, /checkout/, /my-account/, /dashboard/ — user-specific and irrelevant in search.
Test and staging paths: /staging/, /test/, /dev/ — never let these get indexed accidentally.
Duplicate parameter variations: /*?sort=, /*?ref=, /*?utm_ — these create parameter-based duplicates with no independent ranking value.
Leave everything else unconstrained. Your blog content, product pages, category pages, service pages, landing pages — everything you want ranking should be explicitly or implicitly allowed.
Q5: Is blocking crawlers with robots.txt legal?
In general, yes. You control your domain, and robots.txt is the industry-standard protocol for stating which automated systems may access it. Publishing those preferences is an accepted, routine practice.
Blocking specific companies’ crawlers, including AI training bots from major technology companies, is a recognized and increasingly common practice. How far those preferences are legally enforceable against a crawler that ignores them is less settled and varies by jurisdiction.
Q6: How do I create my first robots.txt file?
The fastest path for most website owners is using their CMS plugin. WordPress users with Yoast SEO or Rank Math already have a robots.txt editor in their dashboard — navigate there, paste in the standard configuration from Part 3 of this guide, and save.
For any other platform, open a plain text editor (Notepad on Windows, TextEdit on Mac — set to plain text mode), type your rules, and save the file as robots.txt with no additional extension. Upload it to the root directory of your server.
The key detail that catches many people: the file must be named exactly robots.txt (all lowercase, with .txt as the only extension and no prefix). Robots.txt, robot.txt, and robots.txt.txt are all wrong and won’t be recognized.
After uploading, visit yourdomain.com/robots.txt in a browser. If you see your rules displayed as plain text, the file is correctly placed and accessible.
Q7: Can Google index my pages if robots.txt blocks them?
Yes, and this surprises most people.
When Google’s crawler is blocked from a page, it cannot read the content. But if the page is linked to from external sites Google knows about, Google may still add it to its index as an “uncrawled” URL. The listing in search results will lack a description because Google never read the page — but the URL can still appear.
To prevent a page from appearing in search results at all, you need a noindex meta tag in the page’s HTML. robots.txt blocking alone doesn’t achieve this.
For pages that should be completely invisible in search, don’t stack the two: leave the page crawlable and add noindex, so Google can actually read the instruction. Once the page has dropped out of the index you can add a Disallow rule if you also want to stop crawling. For genuinely sensitive content, use authentication rather than either mechanism.
Q8: robots.txt Disallow vs. noindex meta tag — which should I use?
They solve different problems:
robots.txt Disallow prevents crawling. Google’s bot won’t read the page at all. Use it for admin areas, duplicate parameter URLs, backend pages, and any URL that simply doesn’t need to be crawled.
The noindex meta tag prevents indexing while allowing crawling. Google reads the page, processes its content, but excludes it from search results. Use it for pages that exist but shouldn’t rank — archive pages, filtered variations with canonical URLs elsewhere, thank-you pages.
The critical practical point: if you block a page with robots.txt, Google can’t read the noindex tag on it, because it won’t crawl the page to find the tag. For pages where noindex is your goal, the page must be crawlable. Don’t combine a robots.txt block with a noindex tag on the same page — the block prevents Google from ever seeing the noindex instruction.
Q9: Why is my robots.txt file not working?
The most common causes, in order of frequency:
Wrong file location. The file must be at yourdomain.com/robots.txt. Place it anywhere else and crawlers won’t find it. Verify by visiting the URL directly in your browser.
Syntax error. A missing colon, missing leading slash, or extra space can cause a directive to be ignored. Disallow: /admin/ is correct; Disallow /admin/ and Disallow: admin/ are both wrong. Run your file through a validator.
Cached version being used. Browsers cache robots.txt, and Google generally works from a cached copy for up to 24 hours. After making changes, clear your browser cache and wait for Google to re-fetch the file before diagnosing issues.
Plugin conflict (WordPress). If you have a physical robots.txt file on your server and a WordPress SEO plugin managing a virtual one, they can conflict. Check which version Google is actually reading via Search Console.
Q10: How do I verify my robots.txt is configured correctly?
Visit yourdomain.com/robots.txt in your browser. You should see your rules as plain text. If you see a 404 error, the file doesn’t exist or isn’t in the right location.
For validation against Google’s interpretation specifically, go to Google Search Console → Settings → robots.txt. Search Console shows the version it has cached and highlights any syntax issues it detected.
Use the URL Inspection tool on specific pages to verify they’re accessible. Enter a URL, run the inspection, and the report will show whether crawling is allowed or blocked by robots.txt. If it is blocked, compare the URL against your Disallow rules to find the one responsible.
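The browser check can also be scripted, which is handy if you manage several sites. The following is a rough sketch using only Python’s standard library; the function names diagnose and fetch_and_diagnose and the specific checks are assumptions, and passing this script does not guarantee that Google interprets the file the same way.

```python
import urllib.error
import urllib.request


def diagnose(status, content_type, body):
    """Return a list of likely problems with a fetched robots.txt response."""
    problems = []
    if status == 404:
        problems.append("no file at /robots.txt")
    elif status != 200:
        problems.append(f"unexpected HTTP status {status}")
    if status == 200 and not content_type.lower().startswith("text/plain"):
        problems.append(f"served as {content_type}, expected text/plain")
    if status == 200 and "user-agent" not in body.lower():
        problems.append("no User-agent line found")
    return problems


def fetch_and_diagnose(url):
    """Fetch `url` over the network and run the checks above."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
            content_type = response.headers.get("Content-Type", "")
            return diagnose(response.status, content_type, body)
    except urllib.error.HTTPError as error:
        return diagnose(error.code, "", "")
```

Call fetch_and_diagnose("https://yourdomain.com/robots.txt") with your own domain; an empty list means none of these basic problems were found.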
Section B: Technical Deep Dive
Q11: Should I block AI crawlers in robots.txt?
This is a decision that depends on your goals, and there’s no universal right answer — but there are clear considerations.
AI training crawlers harvest content to train large language models. Unlike search engine crawlers, they don’t send traffic back to your site. You’re contributing to their training data without a corresponding benefit.
Major AI companies operate crawlers that respect robots.txt. Blocking them is straightforward and effective for those companies. The current major crawlers to consider:
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
Note: Google-Extended is a control token rather than a separate crawler. Disallowing it tells Google not to use your content for AI training, without affecting Googlebot (the search crawler). You can block Google-Extended while keeping your site fully indexed by Google Search.
The list of AI crawlers grows regularly. Checking quarterly and updating as new crawlers are identified is a reasonable maintenance habit.
Q12: What is llms.txt and how does it relate to robots.txt?
llms.txt is an emerging proposed convention, introduced in 2024, that gives website owners a structured way to present their content to AI systems. It’s a separate file from robots.txt, placed at yoursite.com/llms.txt and written in Markdown.
Where robots.txt controls access, llms.txt offers guidance: a short summary of the site plus curated links to the pages a language model should read first. It does not grant or restrict permissions, so it is not a substitute for crawler rules.
The proposal is not yet universally adopted. Major AI companies haven’t committed to supporting it with the same consistency they follow robots.txt. At this point, robots.txt blocking remains the more reliable mechanism for controlling AI access.
For sites where AI usage is a priority concern, the two files are complementary: robots.txt decides which AI crawlers get in, and llms.txt shapes what the ones you allow are pointed toward.
Q13: Can robots.txt prevent AI companies from training on my content?
Partially, for companies whose crawlers respect robots.txt. OpenAI (GPTBot), Anthropic (ClaudeBot), Google (Google-Extended), and most other major AI labs have publicly committed to respecting robots.txt directives for their training crawlers.
Blocking these crawlers via robots.txt prevents them from harvesting new content. It doesn’t retroactively remove content they’ve already collected from previous crawls.
For content that was crawled before you added blocks, the options are more limited: direct requests to the companies’ opt-out systems (OpenAI, Google, and others offer these), DMCA notices for specific misuse, and in some jurisdictions, emerging legal frameworks around web scraping and intellectual property.
The practical reality: robots.txt is effective for the major legitimate players. It has no effect on unauthorized scrapers or smaller operations that don’t comply with the protocol.
Q14: robots.txt vs. robots meta tag — when do I use each?
robots.txt operates at the host level, controlling access by path. A single robots.txt file governs one hostname (each subdomain needs its own), and rules apply to URL patterns rather than individual pages.
The robots meta tag operates at the page level:
<meta name="robots" content="noindex, nofollow">
This tag goes in the HTML head of a specific page and gives you fine-grained control over that page’s indexation behavior.
Use robots.txt when: the rule applies to a category of pages or a directory. Blocking all URLs under /account/ or all search result pages matching /*?s= is much cleaner done in robots.txt than adding meta tags to every affected page.
Use the robots meta tag when: the rule is specific to one page or a small set of pages that don’t share a consistent URL pattern. A privacy policy, a low-quality archive page, or a dynamically generated page with a unique URL structure is better managed with a meta tag.
One technical rule that bears repeating: a page blocked by robots.txt will never be crawled, which means Google cannot read any meta tags on that page, including noindex. If noindex is your goal, the page must be crawlable.
Q15: How much does Crawl-delay actually matter?
Less than most guides suggest, for most sites.
Googlebot does not honor the Crawl-delay directive. Google manages its own crawl rate based on your server’s actual response times. If your server is slow, Google automatically crawls more slowly. The manual Crawl Rate setting that Search Console used to offer was retired in early 2024; if you need Googlebot to back off, Google’s documented approach is to temporarily serve 500, 503, or 429 responses.
Bingbot and many third-party SEO tools do honor Crawl-delay. On shared hosting with limited resources, adding Crawl-delay: 1 is a reasonable way to prevent non-Google crawlers from creating load spikes.
For sites on modern hosting infrastructure (VPS, dedicated, or cloud), Crawl-delay is rarely necessary and can be omitted without consequence.
Q16: What is the correct robots.txt syntax?
The rules are straightforward. Each directive occupies its own line, with a colon and a space separating the directive from its value.
Required format:
User-agent: [value]
Disallow: /[path]/
Allow: /[path]/
Paths must start with a forward slash. The directive names are case-insensitive (Disallow and disallow both work), but consistent capitalization is standard practice. The path values are case-sensitive: /Admin/ and /admin/ are different rules.
Blank lines separate rule groups. A blank line after a Disallow directive and before the next User-agent declaration creates a clean boundary between rules for different crawlers.
Comments use the # character. Anything after # on a line is ignored by crawlers:
# Standard site rules
User-agent: *
Allow: /
Disallow: /admin/ # Block backend
The Sitemap: directive stands outside any user-agent group, typically at the bottom of the file. It takes a full URL, not a path:
Sitemap: https://yoursite.com/sitemap.xml
Q17: How do wildcards work in robots.txt?
Two wildcard characters are supported by Googlebot and most modern crawlers.
* matches any sequence of characters, including none:
Disallow: /wp-admin/* ← Blocks all paths under /wp-admin/
Disallow: /*?s= ← Blocks any URL containing ?s=
Disallow: /*.pdf ← Blocks any URL containing .pdf
$ matches the end of the URL string:
Disallow: /archive/$ ← Blocks exactly /archive/ but not /archive/post-title
Combining both:
Disallow: /*.pdf$ ← Blocks URLs ending exactly in .pdf
These patterns are powerful but require careful testing. A pattern like Disallow: /*? blocks every URL that contains a query string, which can be useful for eliminating parameter duplicates, but it also blocks pagination and any legitimate dynamic pages that rely on query parameters. Google retired the standalone robots.txt Tester, so test with a third-party validator before deploying, then confirm the result with the robots.txt report and the URL Inspection tool in Search Console.
Q18: How does robots.txt work with subdomains?
Each subdomain requires its own robots.txt file. The file at yoursite.com/robots.txt governs only yoursite.com. It has no effect on blog.yoursite.com or shop.yoursite.com.
Each subdomain needs its own file at its own root:
yoursite.com/robots.txt
blog.yoursite.com/robots.txt
shop.yoursite.com/robots.txt
If a subdomain doesn’t have a robots.txt file, crawlers assume all content on that subdomain is accessible. This is the correct behavior for most public subdomains. Only create a robots.txt for a subdomain if you need specific crawl rules for that subdomain.
In Google Search Console, subdomains are typically separate properties with their own crawl data, sitemap submissions, and robots.txt configuration. Manage them independently.
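The one-file-per-host rule can be expressed as a small helper that answers "which robots.txt governs this page?". This is an illustrative sketch; the function name and the sample URLs are assumptions made for the example.

```python
from urllib.parse import urlsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location that governs the given page URL."""
    parts = urlsplit(page_url)
    # Scheme and host (including any subdomain and port) identify the file.
    # The path of the page itself plays no part.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

for url in [
    "https://yoursite.com/pricing",
    "https://blog.yoursite.com/2024/some-post",
    "https://shop.yoursite.com/cart?item=42",
]:
    print(url, "->", robots_txt_url(url))
```

Each of the three sample pages resolves to a different file, which is why a rule added to the main domain never reaches the blog or shop subdomain.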
Q19: Can I write different rules for different crawlers?
Yes. Giving specific crawlers specific rules is one of robots.txt’s most useful features.
# Googlebot gets full access
User-agent: Googlebot
Allow: /
# Bingbot gets full access
User-agent: Bingbot
Allow: /
# AI training crawler gets no access
User-agent: GPTBot
Disallow: /
# All other crawlers follow standard rules
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Sitemap: https://yoursite.com/sitemap.xml
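When a group names a specific user-agent, that crawler follows that group’s rules and ignores the User-agent: * wildcard group entirely. A crawler that doesn’t match any specific user-agent falls back to the wildcard rules. In the example above, Googlebot and Bingbot are therefore not bound by the /admin/ and /wp-admin/ rules; repeat those Disallow lines inside their groups if you want them to apply.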
This lets you maintain full search engine accessibility while blocking AI training crawlers — a common and practical configuration.
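The group-selection behavior can be checked locally with Python's standard-library urllib.robotparser. The sketch below parses the example file from this answer as a string, so nothing is fetched over the network. The allowed helper and the SomeOtherBot name are assumptions made for the demonstration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /wp-admin/

Sitemap: https://yoursite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent: str, path: str) -> bool:
    """Ask the parsed rules whether the named crawler may fetch the path."""
    return parser.can_fetch(agent, "https://yoursite.com" + path)

print(allowed("GPTBot", "/blog/post"))        # False: GPTBot's own group blocks everything
print(allowed("Googlebot", "/blog/post"))     # True
print(allowed("Googlebot", "/admin/"))        # True: Googlebot's group has no /admin/ rule
print(allowed("SomeOtherBot", "/admin/"))     # False: falls back to the * group
print(allowed("SomeOtherBot", "/blog/post"))  # True
```

The third line is the one to watch: a crawler with its own group inherits nothing from the wildcard group.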
Q20: How often should I update robots.txt?
For most stable sites, robots.txt requires minimal ongoing maintenance. Update it when:
You add new sections to your site that should be blocked (a new admin area, a new membership section, a new checkout flow). Don’t wait until after these pages have been crawled — add the blocks when you launch.
You launch a staging or development environment on a subdomain. Add a full Disallow: / to that subdomain’s robots.txt immediately.
New AI crawlers emerge that you want to block. This is worth reviewing quarterly — new models and companies introduce new crawlers regularly.
You restructure your URL architecture. Old path-based rules may no longer apply and new rules may be needed.
A quarterly review is a reasonable cadence for most sites. The review itself takes about fifteen minutes: load your robots.txt, compare it against your current site structure, check whether any rules are obsolete or missing, and update accordingly.
Section C: Platform-Specific
Q21: How do I configure robots.txt in WordPress?
WordPress creates a virtual robots.txt when no physical file exists. This virtual file is managed by your SEO plugin. Both Yoast SEO and Rank Math provide a direct editor in their settings.
Yoast SEO: SEO → Tools → File Editor → robots.txt tab. Edit directly and save. Changes write immediately to either a virtual or physical file depending on your setup.
Rank Math: Rank Math → General Settings → Edit robots.txt. Same interface, same behavior.
Important consideration: If a physical robots.txt file exists in your root directory (uploaded via FTP or file manager), it takes precedence over WordPress’s virtual file. The two can conflict. If your plugin changes don’t seem to take effect, check whether a physical file exists using your hosting file manager.
To verify which version Google is reading: visit yoursite.com/robots.txt in a browser and compare its contents against what’s in your plugin editor.
Q22: How do I configure robots.txt in Shopify?
Since 2021, Shopify supports robots.txt customization through a Liquid template. Access it via: Online Store → Themes → Actions → Edit Code → Templates → robots.txt.liquid.
Shopify’s default template generates sensible rules for most stores — blocking cart, checkout, account, and order-related paths. For most Shopify stores, the default configuration is correct and requires no modification.
Common additions:
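Blocking AI crawlers (keep the loop that prints Shopify’s default rules, then add your own groups after it):
{% for group in robots.default_groups %}
  {{- group.user_agent }}
  {%- for rule in group.rules -%}
    {{ rule }}
  {%- endfor -%}
  {%- if group.sitemap != blank -%}
    {{ group.sitemap }}
  {%- endif -%}
{% endfor %}

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /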
Blocking specific app-generated URL patterns that create duplicate content requires identifying those patterns in your site’s URL structure first. If you have an app creating filter or search pages at URLs like /collections/all?filter.p.tag=sale, you can add a Disallow directive for that pattern.
Q23: How do I configure robots.txt in Wix?
Wix provides a robots.txt editor in the SEO section. Navigate to Marketing & SEO → SEO Tools → Robots.txt in your Wix dashboard.
The editor validates syntax before saving. You can add standard directives, including AI crawler blocks, and custom path rules.
One limitation: the auto-generated rules that cover Wix’s internal infrastructure are managed by Wix. Leave those lines in place rather than removing or modifying them.
If robots.txt customization doesn’t appear in your dashboard, it may require a Business or Business Elite plan. Check your current plan level.
Q24: How do I configure robots.txt in Squarespace?
Squarespace auto-generates robots.txt and doesn’t provide direct editing access through the standard interface. The generated file handles Squarespace’s internal URLs and checkout flows correctly by default.
For page-level control — preventing specific pages from appearing in search results — use Squarespace’s built-in SEO settings on each page. The “Hide Page from Search Engines” option in Page Settings adds a noindex directive to that page.
For site-wide concerns beyond what Squarespace’s default configuration covers, contacting Squarespace support is the available path. If your use case requires full robots.txt control, a self-hosted WordPress installation provides that flexibility.
Q25: Can I have multiple robots.txt files?
One per domain, and one per subdomain. You cannot have yoursite.com/blog/robots.txt — crawlers only check the root directory for the robots.txt file. Rules for all paths under your domain are contained in the single root-level file.
If you need different rules for different sections of your site, put them all in the root robots.txt file with path-specific Disallow and Allow directives.
Subdomains each get their own independent file. yoursite.com/robots.txt and blog.yoursite.com/robots.txt are completely separate and each governs only its own subdomain.
Q26: How do I block AI crawlers across all my content?
Add user-agent specific blocks for each AI crawler you want to exclude. The current major ones:
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
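You can place these blocks above or below your general User-agent: * rules; keeping them together makes the file easier to read. Order does not decide which group applies. Under the Robots Exclusion Protocol (RFC 9309), a crawler follows the group whose user-agent most specifically matches its own name and falls back to User-agent: * only when no named group matches.
Note: Google-Extended is not a separate crawler. It is a control token that tells Google whether content fetched by its existing crawlers may be used to train its AI models. Adding this block does not affect Googlebot (Google Search). You can block Google-Extended while keeping your site fully indexed in Google Search.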
Verify user-agent strings against each company’s official crawler documentation before implementing. Incorrect strings do nothing.
Q27: What’s the right robots.txt setup for a small business website?
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /wp-includes/
Disallow: /wp-json/
Disallow: /?s=
Disallow: /search/
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
Sitemap: https://yoursite.com/sitemap.xml
This configuration, which assumes a WordPress site, covers what a small business site needs: full crawl access to all public content, crawler exclusion from admin and backend areas, blocking of duplicate search-result pages, and an opt-out from the major AI training crawlers that honor robots.txt. It’s clean, maintainable, and covers the cases that matter most.
Q28: Do I need both robots.txt and an XML sitemap?
Yes — they serve genuinely different purposes and work together.
robots.txt tells crawlers which parts of your site they should not visit. It’s about directing crawlers away from low-value areas.
An XML sitemap tells crawlers exactly which pages exist and should be crawled. It’s about directing crawlers toward your important content.
The combination creates a complete crawl guidance system: your sitemap says “here’s everything worth indexing,” your robots.txt says “here’s everything you can skip.” Together, they give Google a precise picture of your site’s structure and maximize crawl budget efficiency.
Always reference your sitemap from robots.txt using the Sitemap: directive. This ensures every crawler that reads your robots.txt file also knows where to find your full page inventory.
Q29: How do I test whether my robots.txt is working correctly?
Google Search Console’s robots.txt report is the most reliable tool. Navigate to Settings → robots.txt in Search Console. It shows the current version of your file as Google sees it, including the last fetch date and any syntax issues detected. The older standalone robots.txt Tester has been retired.
For testing specific URLs, use the URL Inspection tool. Enter any URL on your site and run the inspection. The report states whether crawling is allowed or blocked by robots.txt. To find the directive responsible, compare the URL against the rules in your file.
Manually, you can visit yourdomain.com/robots.txt in a browser to confirm the file loads correctly and contains your intended rules.
After making changes, re-test your key pages — homepage, main content pages, a product or blog post, and your admin URL — to confirm that what should be accessible is accessible and what should be blocked is blocked.
Q30: Does robots.txt help or hurt SEO?
Correct robots.txt configuration helps SEO through two concrete mechanisms.
First, crawl budget optimization. Google allocates a crawl budget to each domain — a practical limit on how many pages it will crawl within a given period. On large sites, this budget genuinely constrains how quickly new content gets indexed. By blocking low-value pages (admin areas, parameter variations, search result pages, checkout flows), you direct more of this budget toward content that actually benefits from crawling.
Second, duplicate content reduction. URL parameter variations, pagination, filter combinations, and tracking parameters can create dozens or hundreds of near-identical URLs for the same underlying content. Blocking these with robots.txt prevents Google from trying to evaluate them as distinct pages, which simplifies its understanding of your site’s content. One trade-off: Google cannot read a rel=canonical tag on a URL it is not allowed to crawl, so reserve robots.txt for URLs with no search value and use canonical tags where you want duplicate signals consolidated.
The caveat: these benefits are most pronounced on large sites. A twenty-page business website probably won’t see a measurable change in crawl efficiency from robots.txt optimization. The fundamentals — correct content, functioning sitemap, clean site structure — matter more at that scale.
Section D: Troubleshooting
Q31: Google Search Console shows pages are “Blocked by robots.txt.” How do I fix this?
Start with the URL Inspection tool. Enter the blocked URL, run the inspection, and look at the “Page indexing” section (labeled “Coverage” in older versions of Search Console). It confirms whether crawling is currently blocked by robots.txt.
Open your robots.txt file and find the directive whose path matches the blocked URL. Determine whether the block is intentional or accidental. Common accidental blocks:
Disallow: / accidentally placed at the top of a rule group blocks your entire site. This usually happens when someone sets up a staging site with a full block and then reuses the file without updating it.
Disallow: /*? blocks all URLs containing query strings. This is often intended to block parameter duplicates, but can also block paginated content, filtered pages, or search functionality that you want indexed.
Disallow: /wp-content/ blocks WordPress media and theme files, which prevents Google from rendering your pages correctly.
After identifying and correcting the rule, validate your updated robots.txt in Search Console. Then use the URL Inspection tool to confirm the previously blocked pages are now accessible, and request indexation for any that were incorrectly excluded.
Q32: What happens if I don’t have a robots.txt file at all?
Crawlers default to treating all content as accessible when no robots.txt file exists. Google will crawl everything it can reach through internal links, including admin pages, parameter variations, and any other paths on your site.
For a brand new, small site with clean URL structure and no admin exposure, this isn’t catastrophic. Google will crawl your site and index your content.
The practical problems that emerge over time: admin and backend paths that should never be indexed become visible in Google’s index. Parameter-based duplicate URLs accumulate in your coverage report. Crawl budget gets distributed across your entire site rather than concentrated on your important content. New AI crawlers access your content without any friction.
None of these are insurmountable, but they all require retroactive cleanup that proper robots.txt configuration prevents from the start.
Q33: Can robots.txt protect my site from hackers?
Not directly. robots.txt is a voluntary protocol that well-behaved bots follow. Someone actively attempting to compromise your site will not be deterred by a text file.
What robots.txt does accomplish in a security context is modest. It keeps well-behaved crawlers away from common admin paths, login pages, and backend endpoints, so those URLs are fetched less often and are less likely to surface through compliant tools and search engines. Scanners and probes that ignore the protocol are unaffected, and because robots.txt is publicly readable, listing a path there also reveals that it exists.
This isn’t security. A determined attacker will find /wp-admin/ regardless. It’s noise reduction: fewer compliant bots request your admin pages.
Actual security requires: strong passwords and two-factor authentication, keeping software updated, HTTPS, a web application firewall, and regular security audits. robots.txt is not a substitute for any of these.
Q34: Is robots.txt mandatory?
Not technically. Sites without robots.txt function, and Google indexes them. But the professional standard is to have one, and there’s no legitimate argument for omitting it.
The configuration is not complex, the setup takes minutes, the maintenance is minimal, and the benefits — cleaner indexation, protection of admin areas, control over AI crawler access — apply to every site regardless of size.
Q35: How do I completely hide my site from all search engines?
To block all crawlers from your entire domain:
User-agent: *
Disallow: /
This is appropriate for: development environments, staging sites, internal tools, and any site not intended for public search discovery.
For production sites temporarily not ready for indexation, this is the correct approach. When ready to launch, remove the Disallow: / rule, submit your sitemap to Google Search Console, and request indexation.
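One important note: this blocks crawling but doesn’t guarantee no search visibility. If your domain has external links pointing to it, Google may still index the URLs as references without crawling them. For complete search invisibility, a noindex sent at the server level (the X-Robots-Tag: noindex HTTP response header) is more reliable than robots.txt alone. A crawler has to fetch a page to see that header, so it only takes effect on URLs that are not disallowed. For sites that must stay private, password protection is the stronger option.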
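How the header gets added depends on your server or framework. As one possible illustration, the sketch below wraps a Python WSGI application so that every response carries X-Robots-Tag: noindex. The add_noindex_header wrapper and the hello_app stand-in are hypothetical names for this example, not part of any particular framework.

```python
def add_noindex_header(app):
    """Wrap a WSGI app so every response carries an X-Robots-Tag: noindex header."""
    def wrapped(environ, start_response):
        def start_with_noindex(status, headers, exc_info=None):
            # Drop any existing X-Robots-Tag so the header is not duplicated.
            headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", "noindex"))
            return start_response(status, headers, exc_info)
        return app(environ, start_with_noindex)
    return wrapped

def hello_app(environ, start_response):
    """Hypothetical stand-in for the real application."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"staging build\n"]

# Serve this wrapped object instead of the bare app on the environment
# that should stay out of search results.
application = add_noindex_header(hello_app)
```

For the header to be seen, the URLs must remain crawlable, so this approach replaces a Disallow: / rule rather than sitting alongside it.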
Q36: Does robots.txt protect against content scraping?
For legitimate bots that respect the protocol, yes. For malicious scrapers, no.
The practical implication: adding Disallow: / for known scraper user-agents is useful for the subset of scrapers that respect robots.txt — typically commercial tools and data services that need to maintain good standing in the web ecosystem. It does nothing for purpose-built scrapers that ignore the file.
For meaningful scraping protection, the effective tools are: rate limiting at the server or CDN level (blocking IPs that make excessive requests), CAPTCHAs on content that’s frequently targeted, content delivery systems that make large-scale scraping technically difficult, and legal action when identifiable infringement occurs.
robots.txt is a first-layer signal, not a content protection system.
Q37: How do I manage robots.txt across multiple websites?
Each domain gets its own separate file. Rules don’t carry between domains.
For teams managing multiple sites, the practical approach is maintaining a template robots.txt that covers your standard configuration — core blocks, AI crawler exclusions, sitemap reference — and customizing per-site from that template. This reduces setup time and ensures consistency across properties.
Store your templates in version control. When you update your AI crawler block list or add a new standard rule, apply the update across all sites systematically rather than site by site on demand.
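A template kept in version control can be as simple as a short script that renders each site's file from shared lists. The sketch below is one way to structure it. The crawler list, core paths, and site names are placeholders to adapt, and the function name is an assumption for the example.

```python
AI_CRAWLERS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended"]
CORE_DISALLOWS = ["/admin/", "/wp-admin/", "/wp-login.php"]

def render_robots_txt(domain: str, extra_disallows=()) -> str:
    """Build one site's robots.txt from the shared template plus site-specific paths."""
    lines = ["User-agent: *", "Allow: /"]
    lines += [f"Disallow: {path}" for path in [*CORE_DISALLOWS, *extra_disallows]]
    # One group per AI crawler, each separated from the previous by a blank line.
    for agent in AI_CRAWLERS:
        lines += ["", f"User-agent: {agent}", "Disallow: /"]
    lines += ["", f"Sitemap: https://{domain}/sitemap.xml", ""]
    return "\n".join(lines)

# Hypothetical site inventory: domain -> extra paths to block on that site only.
SITES = {
    "yoursite.com": [],
    "shop.yoursite.com": ["/cart/", "/checkout/"],
}
for domain, extras in SITES.items():
    print(f"# --- {domain} ---")
    print(render_robots_txt(domain, extras))
```

Adding a new crawler to AI_CRAWLERS and re-rendering updates every site in one step, which is the consistency benefit described above.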
Q38: What are the most damaging robots.txt mistakes?
The mistakes that cause the most actual harm:
Accidentally blocking your entire site. A misplaced Disallow: / under User-agent: * makes your entire domain invisible to Google. This happens more often than it should, usually when copying rules from a staging environment. Check your live robots.txt after any configuration change.
Blocking CSS, JavaScript, and media files. Google renders your pages to evaluate them for indexation. Blocking these files prevents accurate rendering, which impairs Google’s understanding of your content. Never disallow /wp-content/themes/, /css/, /js/, or /images/.
Blocking core content directories. Disallow: /blog/, Disallow: /products/, Disallow: /services/ removes your entire content inventory from Google’s reach.
Using incorrect AI crawler user-agent strings. A block for anthropic-ai doesn’t block Anthropic’s actual crawler (ClaudeBot). Incorrect strings do nothing. Verify user-agent strings from official documentation.
Never updating after launch. A robots.txt written at launch and never reviewed accumulates outdated rules and misses new crawlers. Quarterly review catches drift before it becomes a problem.
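Several of these mistakes can be caught mechanically before a file is deployed. The following is a rough lint sketch, not a full validator. It flags only the patterns named in this answer, and the list of risky prefixes is an assumption you would tailor to your own site.

```python
RISKY_PREFIXES = ("/wp-content/", "/css/", "/js/", "/images/",
                  "/blog/", "/products/", "/services/")

def lint_robots_txt(text: str) -> list[str]:
    """Return warnings for rules that commonly block a site by accident."""
    warnings = []
    agents = []       # user-agents of the group currently being read
    in_rules = False  # True once the current group has at least one rule
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field == "disallow":
            in_rules = True
            if value == "/" and "*" in agents:
                warnings.append(f"line {number}: entire site blocked for all crawlers")
            elif value.startswith(RISKY_PREFIXES):
                warnings.append(f"line {number}: blocks content or assets ({value})")
            elif value.startswith("/*?"):
                warnings.append(f"line {number}: blocks every URL with a query string")
        elif field == "allow":
            in_rules = True
    return warnings

SAMPLE = """\
User-agent: *
Disallow: /
Disallow: /wp-content/themes/

User-agent: GPTBot
Disallow: /
"""
for warning in lint_robots_txt(SAMPLE):
    print(warning)
```

The sample produces warnings for lines 2 and 3 only. The Disallow: / under GPTBot is left alone because blocking a single named crawler is a deliberate choice rather than a site-wide accident.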
Q39: Should blog posts ever be blocked in robots.txt?
No. Your blog content is the substance of your site’s organic presence in search. Blocking it removes all SEO value from those posts.
The only blog-related paths worth considering for robots.txt blocks:
Draft posts published accidentally. If your CMS gives drafts a live URL before they’re ready, block the draft path pattern: Disallow: /blog/draft/ or Disallow: /?post_status=draft.
Tag and author archive pages, if they generate thin duplicate content with no independent search value. Many WordPress sites block /tag/ and /author/ for this reason.
Paginated variations beyond the first page, depending on your content strategy and whether paginated pages have sufficient unique content to merit independent indexation.
Everything else — every published post — should be accessible. That’s the content you want ranking.
Q40: Why is robots.txt important to overall SEO strategy?
robots.txt sits at the foundation of how Google experiences your site. Before it crawls a single page of content, before it evaluates a single title tag, before it follows a single internal link — it reads your robots.txt file.
What it finds there shapes every subsequent crawl decision. Which paths it prioritizes, which it skips, how efficiently it allocates its crawl budget, which pages end up in its index. A misconfigured file creates downstream problems throughout your SEO operation: pages that should rank don’t get indexed, duplicate content confuses Google’s understanding of your site structure, crawl budget drains on pages that provide no return.
The positive framing: a correctly configured robots.txt is infrastructure that silently improves everything else you do. It doesn’t produce a visible spike in a dashboard. It creates the conditions — clean indexation, efficient crawling, minimal duplication — that allow your content, link-building, and on-page optimization work to produce their full effect.
That’s what foundational SEO infrastructure does. You don’t notice it when it works correctly. You notice its absence when it doesn’t.
Part 6: Robots.txt Best Practices — The Professional Standard
What a Well-Maintained robots.txt Looks Like in Practice
Professionals treat robots.txt as living infrastructure, not a one-time setup. The file needs to reflect your current site structure and the current crawler ecosystem.
At launch: Configure your core rules — block admin areas, search result pages, transactional paths, and duplicate parameter URLs. Reference your sitemap. Add AI crawler blocks for the major providers. Validate in Google Search Console before going live.
When adding site sections: Any new section that shouldn’t be crawled needs a robots.txt rule from the moment it launches. Don’t let admin panels, member areas, or staging paths accumulate crawls before you add the block.
Quarterly review: Check for obsolete rules (paths that no longer exist on your site), new crawlers that should be blocked, and any changes to Google’s crawler documentation that affect best practices. The fifteen minutes spent on this quarterly prevents larger cleanup efforts later.
After major site changes: URL restructuring, CMS migrations, and e-commerce platform changes all require robots.txt review. Old rules may reference paths that no longer exist. New URL structures may expose paths that need blocking.
The Connection Between robots.txt and XML Sitemap
These two files work as a system. Your sitemap says “here are all my important pages.” Your robots.txt says “here’s what to ignore.” Together they give Google a complete and consistent picture of your site.
The consistency requirement: if a URL appears in your sitemap, it must be accessible in robots.txt. Including a page in your sitemap while blocking it in robots.txt creates a contradiction that Google reports as a sitemap error in Search Console. Audit both files together and verify they don’t conflict.
The Sitemap: directive in robots.txt closes the loop — it tells every crawler that reads your robots.txt exactly where to find your full URL inventory.
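The consistency check between the two files can be automated. The sketch below uses Python's standard library to list sitemap URLs that the robots.txt rules would block. The sample file contents and the function name are illustrative. Note that urllib.robotparser does plain prefix matching, so rules that rely on * or $ wildcards need a separate check.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_conflicts(robots_txt: str, sitemap_xml: str, agent: str = "Googlebot") -> list[str]:
    """Return sitemap URLs that the robots.txt rules would block for the agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]
    return [url for url in urls if not parser.can_fetch(agent, url)]

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search/
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yoursite.com/</loc></url>
  <url><loc>https://yoursite.com/blog/first-post</loc></url>
  <url><loc>https://yoursite.com/search/widgets</loc></url>
</urlset>
"""

print(sitemap_conflicts(ROBOTS_TXT, SITEMAP_XML))
# ['https://yoursite.com/search/widgets']
```

Any URL this returns should either be removed from the sitemap or unblocked in robots.txt.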
Technical SEO Efficiency
For large sites where crawl budget is a genuine constraint, robots.txt configuration is one of the highest-leverage technical SEO actions available. Reducing the number of crawlable low-value URLs directly increases the percentage of crawl budget spent on your important content.
The pages worth auditing for potential blocking: faceted navigation combinations, URL parameter variations, paginated archive pages beyond page two or three, site search results, printer-friendly versions, and legacy URL formats kept for redirect purposes but not intended for indexation.
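None of this requires sophisticated tools. A crawl report from Screaming Frog shows you every URL a crawler can reach by following your links, and your server logs or the Crawl Stats report in Search Console show which of those Googlebot actually requests. Sorting by URL pattern reveals which categories of pages are consuming crawl budget and which of those have no SEO value.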
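Grouping by URL pattern does not need a dedicated tool either. As a rough sketch, assuming you have exported a plain list of crawled URLs, the Python below counts them by first path segment and marks parameterized URLs separately. The sample URLs and the "?*" label are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

def summarize_by_pattern(urls):
    """Count crawled URLs by first path segment, separating parameterized URLs."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        section = "/" + segments[0] + "/" if segments else "/"
        if parts.query:
            section += "?*"  # label for "this section, with a query string"
        counts[section] += 1
    return counts.most_common()

crawled = [
    "https://yoursite.com/blog/post-a",
    "https://yoursite.com/blog/post-b",
    "https://yoursite.com/shop/shirts?color=red",
    "https://yoursite.com/shop/shirts?color=blue",
    "https://yoursite.com/shop/shirts?color=blue&size=m",
    "https://yoursite.com/search/?q=shirts",
]
for section, count in summarize_by_pattern(crawled):
    print(f"{count:4}  {section}")
```

Sections dominated by parameterized URLs are the first candidates to review for a Disallow rule.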
Part 7: Your robots.txt Implementation Checklist
If your site doesn’t have a robots.txt file or you haven’t reviewed it recently, here’s the complete sequence:
Step 1 — Audit your current state. Visit yourdomain.com/robots.txt. If you see a 404, no file exists. If you see rules, note them and evaluate whether they’re correct for your current site structure.
Step 2 — Build your configuration. Start with the standard configuration from Part 3. Add AI crawler blocks. Add any site-specific paths that should be excluded. Verify you’re not accidentally blocking important content.
Step 3 — Validate. Run your file through a robots.txt validator. Fix any syntax errors. Confirm the file is correctly formatted and all directives are recognized.
Step 4 — Deploy. Upload to your root directory or save through your CMS plugin. Visit yourdomain.com/robots.txt in a browser to confirm it loads correctly and displays your rules.
Step 5 — Verify in Google Search Console. Check the robots.txt report in Search Console. Confirm Google can read the file. Test specific URLs with the URL Inspection tool to verify expected accessibility.
Step 6 — Cross-reference with your sitemap. Confirm that every URL in your sitemap is accessible according to your robots.txt rules. Fix any conflicts.
Step 7 — Set a quarterly review reminder. Put it in your calendar. The review takes fifteen minutes. The cost of missing it compounds.
The Takeaway
robots.txt is one of the smallest files on your server and one of the most consequential for how search engines experience your site.
It doesn’t replace great content. It doesn’t replace a strong backlink profile. It doesn’t replace technical SEO fundamentals like fast load times and clean site structure.
What it does is create the conditions for everything else to work properly. Crawlers that aren’t wasting time on admin pages and parameter duplicates are spending that time on your content. Content that gets found efficiently gets indexed faster. Content that gets indexed faster starts competing for rankings sooner.
Configure it correctly, review it quarterly, and let it do its job quietly in the background — the way good infrastructure should.The Small File That Shapes Everything Google Does on Your Site
Most website owners think about robots.txt once — usually when something breaks — and never again.
That’s a mistake.
robots.txt is a plain text file that sits at the root of your domain. Every time a crawler visits your site, it reads this file first. It’s the first conversation between your website and Google, Bing, and every other bot that finds you on the web.
What you say in that conversation determines how efficiently Google crawls your site, which pages get indexed, whether your admin panel is exposed to automated probes, and whether AI companies can freely use your content to train their models.
A misconfigured robots.txt can quietly block your most important pages from Google. An absent one means you’re leaving crawl budget decisions entirely up to Google — which doesn’t always prioritize what you’d prioritize. The right configuration, built on accurate technical understanding, makes your entire SEO operation run more cleanly.
This guide covers everything from first principles to advanced configuration — what robots.txt actually does, how to write it correctly, how to set it up on every major platform, and how to fix the errors that silently undermine indexation.
Part 1: What robots.txt Is, What It Does, and What It Doesn’t
The Actual Definition
robots.txt is a text file hosted at yoursite.com/robots.txt. When a web crawler visits your site, it checks this location before crawling anything else. The file contains directives — rules — that tell the crawler which parts of your site it may or may not access.
Here is what a straightforward robots.txt looks like:
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /cart/
Disallow: /checkout/
Sitemap: https://yoursite.com/sitemap.xml
Breaking this down line by line:
User-agent: * means these rules apply to all crawlers. You can also target specific crawlers by name — more on that shortly.
Allow: / tells crawlers they’re permitted to access the root of the site and everything under it, by default.
Disallow: /admin/ tells crawlers not to access anything under the /admin/ path.
Sitemap: points crawlers to your XML sitemap, helping them find all your pages efficiently.
The entire concept is that simple. The complexity is in knowing what to include and why.
What robots.txt Actually Controls
robots.txt controls crawling — whether a bot reads the content of a URL. It does not directly control indexing — whether a page appears in search results.
This is the most important technical distinction in understanding robots.txt, and it’s the source of most configuration mistakes.
If you disallow a page in robots.txt, Google’s crawler won’t read it. But if that page is linked to from an external site, Google may still index a reference to it — it just won’t know what’s on the page. The URL can still appear in search results as an “unverified” listing.
To prevent a page from appearing in search results entirely, you need a noindex meta tag on the page itself — not a disallow in robots.txt. These two mechanisms are complementary, not interchangeable.
robots.txt is the right tool for:
- Preventing crawlers from reading pages that don’t need to be indexed (admin pages, duplicate parameter URLs, staging areas)
- Conserving crawl budget by directing Google away from low-value pages
- Pointing all crawlers to your sitemap
- Blocking specific bots from accessing your content
robots.txt is the wrong tool for:
- Hiding pages from search results (use
noindexinstead) - Securing sensitive information (use server-level access controls)
- Blocking determined malicious actors (robots.txt is a voluntary protocol — malicious scrapers ignore it)
The Voluntary Protocol Problem
robots.txt works because legitimate bots choose to follow it. Google, Bing, major SEO tools, and most professional crawlers read your robots.txt and respect its rules. This covers the overwhelming majority of automated traffic to most websites.
Malicious scrapers, unauthorized AI training bots, and vulnerability scanners frequently ignore robots.txt entirely. If you’re concerned about them, robots.txt is a first-layer signal — not a firewall. Blocking them properly requires server-level rate limiting, IP blocking, and WAF (Web Application Firewall) rules.
This distinction matters because many guides describe robots.txt as a security tool. It isn’t — not in any technical sense. It’s a communication protocol between your site and well-behaved bots. Understanding this prevents both over-reliance and under-utilization.
Part 2: The Complete robots.txt Syntax Reference
Basic Structure
Every robots.txt file follows the same structure: groups of rules, each starting with a User-agent declaration, followed by Allow and Disallow directives.
User-agent: [crawler name or *]
Allow: [path]
Disallow: [path]
Crawl-delay: [seconds]
Sitemap: [full URL]
Rules:
- One directive per line
- A colon and space after each directive:
Disallow: /path/ - Paths must start with
/ - Blank lines separate rule groups for different user-agents
- Comments begin with
# - The
Sitemap:directive goes at the end, outside any user-agent group
Correct vs. Incorrect Syntax
The most common syntax errors that silently break robots.txt:
# WRONG
user-agent: * ← lowercase works but inconsistent
Disallow /admin/ ← missing colon
Disallow: admin/ ← missing leading slash
Allow:/blog/ ← missing space after colon
# CORRECT
User-agent: *
Disallow: /admin/
Allow: /blog/
One malformed line doesn’t necessarily break the entire file, but it can cause unpredictable behavior. Always validate before publishing.
Wildcards and Pattern Matching
robots.txt supports two wildcard characters:
* matches any sequence of characters. $ matches the end of the URL.
Disallow: /*? ← Block all URLs containing a query string
Disallow: /*.pdf$ ← Block all URLs ending in .pdf
Disallow: /search/ ← Block exact path and everything under it
Googlebot and most modern crawlers support these patterns. Some older crawlers do not. Keep patterns simple where possible — complex wildcard rules are harder to debug and more likely to produce unintended matches.
Regular expressions are not supported. You cannot use [a-z] or (pattern1|pattern2) syntax.
The Allow/Disallow Hierarchy
When rules conflict, the more specific rule takes precedence. Allow overrides Disallow for the same path.
User-agent: *
Disallow: /blog/
Allow: /blog/featured/
This blocks all of /blog/ except the /blog/featured/ subdirectory. The more specific path wins.
User-agent: *
Disallow: /
Allow: /public/
This blocks everything except /public/. A common pattern for staging sites that need one section publicly accessible.
The Crawl-delay Directive
Crawl-delay tells crawlers to wait a specified number of seconds between requests:
Crawl-delay: 2
One critical fact: Googlebot does not support Crawl-delay. Google ignores this directive entirely and manages its own crawl rate based on your server’s response times. If you need to limit Google’s crawl rate, do it in Google Search Console under Crawl Rate settings.
Other crawlers (Bingbot, many SEO tools) do honor Crawl-delay. On shared hosting environments where server load is a concern, setting Crawl-delay: 1 is a reasonable default for non-Google crawlers.
For most modern hosting setups, Crawl-delay is unnecessary and can be omitted.
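If you write your own crawler or audit script, Python’s standard library can read the Crawl-delay value for a given user agent. The sketch below parses an in-memory file; the rules and the bot name SomeOtherBot are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Bingbot
Crawl-delay: 2
Disallow: /admin/

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.crawl_delay("Bingbot"))       # 2
print(parser.crawl_delay("SomeOtherBot"))  # None
```

A polite crawler would sleep for the returned number of seconds between requests and fall back to its own default when the value is None.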
Part 3: What to Block, What to Allow, and Why
The Core Configuration for Most Websites
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /wp-includes/
Disallow: /wp-json/
Disallow: /?s=
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /thank-you/
Sitemap: https://yoursite.com/sitemap.xml
This is the right starting point for most WordPress sites. What each block accomplishes:
/wp-admin/ and /wp-login.php: Your backend. No value for Google. Blocking keeps well-behaved crawlers from spending crawl budget here. It does not stop malicious login probes, which ignore robots.txt.
/wp-includes/: WordPress core files, not content. This directory also holds scripts and styles that some themes load on public pages, so drop this rule if Google’s rendered view of your pages looks broken.
/wp-json/: REST API endpoints. If you have a headless setup that needs these crawled, remove this rule. For standard sites, these URLs have no SEO value.
/?s=: Search result pages generated when users search your site. Each search creates a unique URL with no independent SEO value, and large volumes of thin, dynamically generated pages waste crawl budget and can lower Google’s assessment of the site’s overall quality.
/cart/, /checkout/, /my-account/: E-commerce and membership pages. Personal to each user, generate no search traffic, waste crawl budget.
/thank-you/: Post-conversion pages. Not useful in search results.
What Not to Block
The most costly robots.txt mistake is accidentally blocking content that should rank. Never disallow:
Your main content directories: /blog/, /articles/, /products/, /services/, /category/
CSS, JavaScript, and image files. Google needs to render your pages accurately to understand them. Blocking these files prevents Google from seeing your pages the way users see them, which impairs how accurately it renders, understands, and indexes them.
# WRONG — Don't do this
Disallow: /wp-content/uploads/
Disallow: /wp-content/themes/
Disallow: /css/
Disallow: /js/
Disallow: /images/
Canonical pages. If you’ve set canonical URLs, the canonical version must be crawlable. Blocking a canonical URL defeats its purpose entirely.
Managing Duplicate Content With robots.txt
URL parameter variations are one of the most common sources of unintentional duplicate content. Faceted navigation in e-commerce creates thousands of filter combinations that are effectively the same page with slightly different product subsets.
# Block filter/sort variations
# Each pattern matches only when the parameter comes first in the query string
Disallow: /*?sort=
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?ref=
Disallow: /*?utm_
The alternative approach — preferred for e-commerce — is to use canonical tags on filtered pages pointing to the base category page, combined with selective robots.txt blocking. This gives Google more signal than blocking alone.
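For pagination, Google’s current guidance is to leave paginated pages crawlable, link them to one another with ordinary links, and give each page its own canonical URL rather than blocking them with robots.txt. Google has said it no longer uses rel=“next” / rel=“prev” markup as an indexing signal, so a blanket rule such as Disallow: /*?page= usually does more harm than good.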
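One detail of parameter rules is easy to miss: a pattern such as /*?sort= contains a literal question mark, so it only matches when sort is the first parameter. The sketch below illustrates this with a simplified matcher that handles * but not $. The function name blocked_by and the sample URLs are assumptions for illustration.

```python
import re

def blocked_by(pattern, path_and_query):
    """True when a robots.txt pattern (with * wildcards) matches from the start."""
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.match(regex, path_and_query) is not None

rule = "/*?sort="
print(blocked_by(rule, "/shoes?sort=price"))            # True
print(blocked_by(rule, "/shoes?color=red&sort=price"))  # False

# A second pattern covers the parameter in any later position.
print(blocked_by("/*&sort=", "/shoes?color=red&sort=price"))  # True
```

If parameter order on your site is not fixed, pairing each ?name= rule with an &name= rule closes the gap.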
Blocking AI Training Crawlers
Several AI companies run crawlers that harvest web content for training large language models. These are distinct from search engine crawlers — they don’t drive referral traffic to your site in exchange for indexing your content.
Blocking them in robots.txt is straightforward. The key is using the correct user-agent strings, which change as new bots emerge:
# OpenAI
User-agent: GPTBot
Disallow: /
# Common Crawl (used to train many AI models)
User-agent: CCBot
Disallow: /
# Anthropic
User-agent: ClaudeBot
Disallow: /
# Google AI (Bard/Gemini training)
User-agent: Google-Extended
Disallow: /
# Perplexity AI
User-agent: PerplexityBot
Disallow: /
# Meta
User-agent: FacebookBot
Disallow: /
# Apple
User-agent: Applebot-Extended
Disallow: /
Important: this list evolves as new AI products launch. Review quarterly and check current crawler documentation from each company. Using an incorrect user-agent string means your block does nothing — the crawler simply doesn’t recognize it as a rule that applies to it.
robots.txt blocking works here because these are legitimate companies whose crawlers voluntarily respect the protocol. Unauthorized scrapers that ignore robots.txt require different countermeasures.
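Because the list changes over time, it can be easier to keep the user-agent names in one place and generate the rules from it. This is a small sketch: the list simply mirrors the names used above, and build_block_rules is a hypothetical helper name.

```python
AI_CRAWLERS = [
    "GPTBot", "CCBot", "ClaudeBot", "Google-Extended",
    "PerplexityBot", "FacebookBot", "Applebot-Extended",
]

def build_block_rules(user_agents):
    """Return one 'Disallow: /' group per user agent, separated by blank lines."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in user_agents]
    return "\n\n".join(groups) + "\n"

print(build_block_rules(AI_CRAWLERS))
```

The output can be pasted into robots.txt or written out by a deploy script. Updating the list during a quarterly review then regenerates every group at once.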
Part 4: How to Create and Deploy robots.txt on Every Platform
WordPress
WordPress does not write a physical robots.txt file to disk. Until you create one, it serves a minimal virtual file at /robots.txt. The major SEO plugins give you a clean interface for managing the rules.
Using Yoast SEO:
Navigate to Yoast SEO → Tools → File Editor. You’ll see a robots.txt editor directly in your WordPress dashboard. Edit the file and save. Yoast writes changes to your actual robots.txt file on the server.
Note: Yoast automatically adds a reference to your XML sitemap. Verify it’s pointing to the correct URL.
Using Rank Math:
Navigate to Rank Math → General Settings → Edit robots.txt. The interface is similar to Yoast. Rank Math also handles the sitemap reference automatically.
Manual upload:
If you prefer full control, create a text file named robots.txt, write your rules, and upload it to your root directory (/public_html/robots.txt) via FTP or your hosting file manager. WordPress will use this file directly.
If both a physical file and a plugin-managed file exist, the physical file takes precedence. Verify which one Google is reading by checking yoursite.com/robots.txt in your browser and comparing it against your plugin settings.
Shopify
Shopify generates a default robots.txt automatically. Since 2021, Shopify allows merchants to customize it through the theme editor using a Liquid template.
To access it: Online Store → Themes → Actions → Edit Code → Templates → robots.txt.liquid
The default template handles most standard configurations correctly. Common customizations include blocking specific URL parameters generated by apps or adding AI crawler blocks. The template builds Shopify’s default rules with a loop over robots.default_groups. Leave that loop in place and add custom groups as plain text after it:
liquid
User-agent: GPTBot
Disallow: /
If you’re not comfortable editing Liquid templates, Shopify’s default robots.txt is functional for most stores. The more impactful optimization is submitting your sitemap to Google Search Console.
Wix
Wix provides a robots.txt editor in the SEO settings section of your dashboard. Navigate to Marketing & SEO → SEO Tools → Robots.txt.
Wix generates default rules that block its internal infrastructure pages, which is correct behavior. You can add custom directives for AI bot blocking and additional path restrictions through this interface.
The editor validates syntax before saving, which prevents the most common formatting errors.
Squarespace
Squarespace automatically generates and manages robots.txt. Direct editing is not available through the standard interface.
For page-level indexing control, use the noindex setting available in each page’s SEO settings. This is more granular than robots.txt anyway for most Squarespace use cases.
If your Squarespace site has specific crawl control requirements, the standard approach is to contact Squarespace support. Most common configurations — blocking cart pages, staging areas — are handled by Squarespace’s auto-generated rules.
Custom-Built Websites
Static sites: Create a robots.txt file, write your rules in any text editor, and place it in the root directory of your server. It must be accessible at yourdomain.com/robots.txt.
Node.js/Express:
javascript
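const express = require('express');
const app = express();

// Serve robots.txt as plain text from the site root
app.get('/robots.txt', (req, res) => {
  res.type('text/plain');
  res.send(
    'User-agent: *\n' +
    'Allow: /\n' +
    'Disallow: /admin/\n' +
    'Disallow: /api/\n\n' +
    'Sitemap: https://yoursite.com/sitemap.xml'
  );
});

app.listen(3000); // port 3000 is an assumption; use your own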
PHP: Create the file manually and upload to your server root. PHP doesn’t require any special routing for static files at the root directory.
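Django: Serve the file from a URL pattern at /robots.txt, for example a TemplateView that renders a robots.txt template with content_type set to text/plain, or have your web server serve a static file at that path. Alternatively, use the django-robots package for database-driven robot rules.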
Next.js: Place robots.txt in your /public directory. Next.js serves all files in /public from the root path automatically.
Submitting to Google Search Console
Creating and uploading your robots.txt file doesn’t automatically inform Google of its existence. While Googlebot will find it eventually during regular crawling, the best practice is to verify it in Google Search Console.
Go to Google Search Console, select your property, and navigate to Settings → robots.txt. Search Console shows the current version Google has cached, when it was last fetched, and any issues detected.
There’s no separate “submit” process for robots.txt the way there is for sitemaps. Verification in Search Console confirms Google can read your file and shows you exactly what it sees.
After making significant changes to robots.txt, use the URL Inspection tool to test specific pages and confirm they’re accessible as intended before the next scheduled crawl.
Part 5: The 40 Questions Website Owners Ask About robots.txt
Section A: Foundations
Q1: What is robots.txt and do I need one?
robots.txt is a text file at your root domain that controls how web crawlers interact with your site. Every professional website should have one.
Without robots.txt, crawlers make their own decisions about what to crawl. For a small, well-structured site with clean internal linking, this works adequately. For any site with admin pages, user-specific content, duplicate URL variations, or large volumes of pages, an unconfigured crawl wastes the budget that should go toward your most important content.
The setup takes five minutes. The benefits — cleaner indexation, faster discovery of new content, protection of admin areas — are permanent and require no ongoing work for most sites.
Q2: Does robots.txt actually stop search engines from crawling my site?
It instructs them. Legitimate crawlers — Google, Bing, most professional SEO tools — voluntarily read and follow robots.txt rules. This covers the vast majority of automated traffic on the web.
Malicious scrapers, unauthorized data harvesters, and vulnerability scanners typically ignore robots.txt. The file is a communication protocol, not an access control system. For actual access control, you need server-level security: password protection, IP blocking, rate limiting, or a WAF.
Understanding this distinction prevents the common mistake of treating robots.txt as a security measure when it isn’t one — and the equally common mistake of dismissing it because it doesn’t stop all bots, when it effectively manages the majority of legitimate ones.
Q3: How do I fix the “Blocked by robots.txt” error in Google Search Console?
This error means a page you want crawled is being blocked by a rule in your robots.txt file. The fix is to identify which rule is causing the block and either modify or remove it.
In Google Search Console, open Indexing → Pages and look under “Why pages aren’t indexed” for the “Blocked by robots.txt” reason. Click into any affected URL and run the URL Inspection tool to confirm the block, then compare the URL against your Disallow rules to find the directive responsible.
Open your robots.txt file, find the rule, and determine whether the block is intentional. If the page should be crawled, remove or narrow the relevant Disallow directive. If the block is correct but the page is appearing in your index report anyway, that indicates an inconsistency between your sitemap and robots.txt worth resolving.
After editing, validate your updated robots.txt in Search Console and re-request indexation for any previously blocked pages that should now be crawlable.
Q4: What should I put in my robots.txt file?
Block only what genuinely shouldn’t be indexed. The most common and universally applicable blocks:
Admin and backend pages: /admin/, /wp-admin/, /wp-login.php — no SEO value and reduces exposure to automated probes.
Search result pages: /?s=, /search/ — dynamically generated, low quality, and can create enormous volumes of thin duplicate pages.
User account and transactional pages: /cart/, /checkout/, /my-account/, /dashboard/ — user-specific and irrelevant in search.
Test and staging paths: /staging/, /test/, /dev/ — never let these get indexed accidentally.
Duplicate parameter variations: /*?sort=, /*?ref=, /*?utm_ — these create parameter-based duplicates with no independent ranking value.
Leave everything else unconstrained. Your blog content, product pages, category pages, service pages, landing pages — everything you want ranking should be explicitly or implicitly allowed.
Q5: Is blocking crawlers with robots.txt legal?
In general, yes. You control your domain and can state which automated systems you want to access it. robots.txt is the industry-standard protocol for communicating these preferences (formalized as RFC 9309 in 2022), and publishing one is ordinary practice.
Blocking specific companies’ crawlers, including AI training bots from major technology companies, is a recognized and increasingly common practice. How far the law backs those preferences against crawlers that ignore them varies by jurisdiction and is still developing, so treat robots.txt as a statement of intent rather than a legal guarantee.
Q6: How do I create my first robots.txt file?
The fastest path for most website owners is using their CMS plugin. WordPress users with Yoast SEO or Rank Math already have a robots.txt editor in their dashboard — navigate there, paste in the standard configuration from Part 3 of this guide, and save.
For any other platform, open a plain text editor (Notepad on Windows, TextEdit on Mac — set to plain text mode), type your rules, and save the file as robots.txt with no additional extension. Upload it to the root directory of your server.
The key detail that catches many people: the file must be named exactly robots.txt, all lowercase, with no extra extension or prefix. Robots.txt, robot.txt, and robots.txt.txt are all wrong and won’t be recognized.
After uploading, visit yourdomain.com/robots.txt in a browser. If you see your rules displayed as plain text, the file is correctly placed and accessible.
Q7: Can Google index my pages if robots.txt blocks them?
Yes, and this surprises most people.
When Google’s crawler is blocked from a page, it cannot read the content. But if the page is linked to from external sites Google knows about, Google may still add it to its index as an “uncrawled” URL. The listing in search results will lack a description because Google never read the page — but the URL can still appear.
To prevent a page from appearing in search results at all, you need a noindex meta tag in the page’s HTML (or an X-Robots-Tag: noindex response header), and the page must stay crawlable so Google can see it. robots.txt blocking alone doesn’t achieve this.
For sensitive pages that should be completely invisible in search, don’t stack a robots.txt block on top of noindex: the block stops Google from ever reading the tag. Leave the page crawlable with noindex, or put it behind authentication so neither crawlers nor the public can load it.
Q8: robots.txt Disallow vs. noindex meta tag — which should I use?
They solve different problems:
robots.txt Disallow prevents crawling. Google’s bot won’t read the page at all. Use it for admin areas, duplicate parameter URLs, backend pages, and any URL that simply doesn’t need to be crawled.
The noindex meta tag prevents indexing while allowing crawling. Google reads the page, processes its content, but excludes it from search results. Use it for pages that exist but shouldn’t rank — archive pages, filtered variations with canonical URLs elsewhere, thank-you pages.
The critical practical point: if you block a page with robots.txt, Google can’t read the noindex tag on it, because it won’t crawl the page to find the tag. For pages where noindex is your goal, the page must be crawlable. Don’t combine a robots.txt block with a noindex tag on the same page — the block prevents Google from ever seeing the noindex instruction.
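This conflict is easy to audit if you have a list of the URLs that carry a noindex tag. The sketch below uses Python’s standard urllib.robotparser, which handles plain path prefixes but not wildcards or Google’s longest-match rule, so treat it as a first pass. The function name hidden_noindex and the sample URLs are assumptions.

```python
from urllib.robotparser import RobotFileParser

def hidden_noindex(robots_txt, noindex_urls, user_agent="Googlebot"):
    """Return the noindex URLs that the crawler is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls
            if not parser.can_fetch(user_agent, url)]

robots_txt = """\
User-agent: *
Disallow: /thank-you/
"""
pages_with_noindex = [
    "https://yoursite.com/thank-you/",
    "https://yoursite.com/archive/2019/",
]
print(hidden_noindex(robots_txt, pages_with_noindex))
# ['https://yoursite.com/thank-you/']
```

Every URL in the result has a noindex tag that Google cannot read. Either remove the Disallow rule for it or accept that the URL may still be listed without a description.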
Q9: Why is my robots.txt file not working?
The most common causes, in order of frequency:
Wrong file location. The file must be at yourdomain.com/robots.txt. Place it anywhere else and crawlers won’t find it. Verify by visiting the URL directly in your browser.
Syntax error. A missing colon, missing leading slash, or extra space can cause a directive to be ignored. Disallow: /admin/ is correct; Disallow /admin/ and Disallow: admin/ are both wrong. Run your file through a validator.
Cached version being read. Google generally caches robots.txt for up to 24 hours, and your browser may also show a stale copy. After making changes, clear your browser cache and wait for Google to re-fetch the file before diagnosing issues.
Plugin conflict (WordPress). If you have a physical robots.txt file on your server and a WordPress SEO plugin managing a virtual one, they can conflict. Check which version Google is actually reading via Search Console.
Q10: How do I verify my robots.txt is configured correctly?
Visit yourdomain.com/robots.txt in your browser. You should see your rules as plain text. If you see a 404 error, the file doesn’t exist or isn’t in the right location.
For validation against Google’s interpretation specifically, go to Google Search Console → Settings → robots.txt. Search Console shows the version it has cached and highlights any syntax issues it detected.
Use the URL Inspection tool on specific pages to verify they’re accessible. Enter a URL, run the inspection, and the report will show whether crawling is allowed or the URL is blocked by robots.txt. If it is blocked, compare the URL against your Disallow patterns to find the rule responsible.
Section B: Technical Deep Dive
Q11: Should I block AI crawlers in robots.txt?
This is a decision that depends on your goals, and there’s no universal right answer — but there are clear considerations.
AI training crawlers harvest content to train large language models. Unlike search engine crawlers, they don’t send traffic back to your site. You’re contributing to their training data without a corresponding benefit.
Major AI companies operate crawlers that respect robots.txt. Blocking them is straightforward and effective for those companies. The current major crawlers to consider:
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
Note: Google-Extended is a control token rather than a separate crawler. Google’s existing crawlers fetch the pages, and the token tells Google whether that content may be used for its AI models. It does not affect Googlebot or Search: you can disallow Google-Extended while keeping your site fully indexed by Google Search.
The list of AI crawlers grows regularly. Checking quarterly and updating as new crawlers are identified is a reasonable maintenance habit.
Q12: What is llms.txt and how does it relate to robots.txt?
llms.txt is a proposed convention, introduced in 2024, for a Markdown file placed at yoursite.com/llms.txt. It gives AI systems a concise, curated overview of a site and links to the pages you most want them to read. It’s a separate file from robots.txt.
Where robots.txt controls crawler access, llms.txt offers guidance about what to read. It is not an access-control or opt-out mechanism, and it does not carry enforceable licensing or usage terms.
The proposal is not yet universally adopted. Major AI companies haven’t committed to supporting it with the same consistency they follow robots.txt. At this point, robots.txt blocking remains the more reliable mechanism for controlling AI access.
The two are complementary rather than redundant: use robots.txt to decide which AI crawlers may fetch your content, and consider llms.txt if you want the systems you do allow to find your key pages more easily.
Q13: Can robots.txt prevent AI companies from training on my content?
Partially, for companies whose crawlers respect robots.txt. OpenAI (GPTBot), Anthropic (ClaudeBot), Google (Google-Extended), and most other major AI labs have publicly committed to respecting robots.txt directives for their training crawlers.
Blocking these crawlers via robots.txt prevents them from harvesting new content. It doesn’t retroactively remove content they’ve already collected from previous crawls.
For content that was crawled before you added blocks, the options are more limited: direct requests to the companies’ opt-out systems (OpenAI, Google, and others offer these), DMCA notices for specific misuse, and in some jurisdictions, emerging legal frameworks around web scraping and intellectual property.
The practical reality: robots.txt is effective for the major legitimate players. It has no effect on unauthorized scrapers or smaller operations that don’t comply with the protocol.
Q14: robots.txt vs. robots meta tag — when do I use each?
robots.txt operates at the site level, controlling access by path. A single robots.txt file governs your entire domain, and rules apply to URL patterns rather than individual pages.
The robots meta tag operates at the page level:
html
<meta name="robots" content="noindex, nofollow">
This tag goes in the HTML head of a specific page and gives you fine-grained control over that page’s indexation behavior.
Use robots.txt when: the rule applies to a category of pages or a directory. Blocking all URLs under /account/ or all search result pages matching /*?s= is much cleaner done in robots.txt than adding meta tags to every affected page.
Use the robots meta tag when: the rule is specific to one page or a small set of pages that don’t share a consistent URL pattern. A privacy policy, a low-quality archive page, or a dynamically generated page with a unique URL structure is better managed with a meta tag.
One technical rule that bears repeating: a page blocked by robots.txt will never be crawled, which means Google cannot read any meta tags on that page, including noindex. If noindex is your goal, the page must be crawlable.
Q15: How much does Crawl-delay actually matter?
Less than most guides suggest, for most sites.
Googlebot does not honor the Crawl-delay directive. Google manages its own crawl rate based on your server’s actual response times. If your server is slow, Google automatically crawls more slowly. Search Console no longer offers a manual crawl rate setting (the legacy limiter was retired in early 2024). If you need Googlebot to back off temporarily, Google’s documented approach is to return 500, 503, or 429 status codes as a short-term measure until load recovers.
Bingbot and many third-party SEO tools do honor Crawl-delay. On shared hosting with limited resources, adding Crawl-delay: 1 is a reasonable way to prevent non-Google crawlers from creating load spikes.
For sites on modern hosting infrastructure (VPS, dedicated, or cloud), Crawl-delay is rarely necessary and can be omitted without consequence.
Q16: What is the correct robots.txt syntax?
The rules are straightforward. Each directive occupies its own line, with a colon and a space separating the directive from its value.
Required format:
User-agent: [value]
Disallow: /[path]/
Allow: /[path]/
Paths must start with a forward slash. Directive names are case-insensitive (Disallow and disallow both work), but consistent capitalization is standard practice. Path values are case-sensitive: /Admin/ and /admin/ are different rules.
Blank lines separate rule groups. A blank line after a Disallow directive and before the next User-agent declaration creates a clean boundary between rules for different crawlers.
Comments use the # character. Anything after # on a line is ignored by crawlers:
# Standard site rules
User-agent: *
Allow: /
Disallow: /admin/ # Block backend
The Sitemap: directive stands outside any user-agent group, typically at the bottom of the file. It takes a full URL, not a path:
Sitemap: https://yoursite.com/sitemap.xml
Q17: How do wildcards work in robots.txt?
Two wildcard characters are supported by Googlebot and most modern crawlers.
* matches any sequence of characters, including none:
Disallow: /wp-admin/* ← Blocks all paths under /wp-admin/
Disallow: /*?s= ← Blocks any URL containing ?s=
Disallow: /*.pdf ← Blocks any URL containing .pdf, including /file.pdf?download=1
$ matches the end of the URL string:
Disallow: /archive/$ ← Blocks exactly /archive/ but not /archive/post-title
Combining both:
Disallow: /*.pdf$ ← Blocks URLs ending exactly in .pdf
These patterns are powerful but require careful testing. A pattern like Disallow: /*? blocks every URL that contains a query string. That can be useful for eliminating parameter duplicates, but it also blocks pagination and any legitimate dynamic pages that rely on parameters. Test before deploying. Search Console’s standalone robots.txt tester has been retired, so check patterns against sample URLs with a parser that follows Google’s matching rules, such as Google’s open-source robots.txt library.
Q18: How does robots.txt work with subdomains?
Each subdomain requires its own robots.txt file. The file at yoursite.com/robots.txt governs only yoursite.com. It has no effect on blog.yoursite.com or shop.yoursite.com.
Each subdomain needs its own file at its own root:
yoursite.com/robots.txt
blog.yoursite.com/robots.txt
shop.yoursite.com/robots.txt
If a subdomain doesn’t have a robots.txt file, crawlers assume all content on that subdomain is accessible. This is the correct behavior for most public subdomains. Only create a robots.txt for a subdomain if you need specific crawl rules for that subdomain.
In Google Search Console, subdomains are typically separate properties with their own crawl data, sitemap submissions, and robots.txt configuration. Manage them independently.
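The rule that each host has its own file can be expressed in a few lines: the robots.txt that governs a page is always at the root of that page’s scheme and host. The helper name robots_url below is an assumption for illustration.

```python
from urllib.parse import urlsplit

def robots_url(page_url):
    """Return the robots.txt URL that governs page_url."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

print(robots_url("https://blog.yoursite.com/2024/post?ref=home"))
# https://blog.yoursite.com/robots.txt
print(robots_url("https://yoursite.com/blog/post"))
# https://yoursite.com/robots.txt
```

Note that /blog/ as a directory on the main domain and blog. as a subdomain resolve to different files, which is a common source of confusion during migrations.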
Q19: Can I write different rules for different crawlers?
Yes. Giving specific crawlers specific rules is one of robots.txt’s most useful features.
# Googlebot gets full access
User-agent: Googlebot
Allow: /
# Bingbot gets full access
User-agent: Bingbot
Allow: /
# AI training crawler gets no access
User-agent: GPTBot
Disallow: /
# All other crawlers follow standard rules
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Sitemap: https://yoursite.com/sitemap.xml
When a specific user-agent rule exists, that crawler follows those rules and ignores the User-agent: * wildcard group. A crawler that doesn’t match any specific user-agent falls back to the wildcard rules.
This lets you maintain full search engine accessibility while blocking AI training crawlers — a common and practical configuration.
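The group selection described here can be sketched as a lookup: a group that names the crawler wins, and the wildcard group is only a fallback. This is a simplification, since real crawlers may also merge duplicate groups or match related product tokens. The function select_group and the dictionary layout are assumptions for illustration.

```python
def select_group(groups, crawler_token):
    """groups maps a User-agent value to its list of rule lines.

    A group naming the crawler wins over the '*' group,
    wherever it sits in the file.
    """
    wanted = crawler_token.lower()
    for agent, rules in groups.items():
        if agent.lower() == wanted:
            return rules
    return groups.get("*", [])

groups = {
    "*": ["Disallow: /admin/", "Disallow: /wp-admin/"],
    "Googlebot": ["Allow: /"],
    "GPTBot": ["Disallow: /"],
}
print(select_group(groups, "gptbot"))       # ['Disallow: /']
print(select_group(groups, "Googlebot"))    # ['Allow: /']
print(select_group(groups, "DuckDuckBot"))  # ['Disallow: /admin/', 'Disallow: /wp-admin/']
```

One consequence worth noticing: a crawler with its own group inherits nothing from the wildcard group, so Googlebot in this example is not blocked from /admin/ unless that rule is repeated in its group.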
Q20: How often should I update robots.txt?
For most stable sites, robots.txt requires minimal ongoing maintenance. Update it when:
You add new sections to your site that should be blocked (a new admin area, a new membership section, a new checkout flow). Don’t wait until after these pages have been crawled — add the blocks when you launch.
You launch a staging or development environment on a subdomain. Add a full Disallow: / to that subdomain’s robots.txt immediately.
New AI crawlers emerge that you want to block. This is worth reviewing quarterly — new models and companies introduce new crawlers regularly.
You restructure your URL architecture. Old path-based rules may no longer apply and new rules may be needed.
A quarterly review is a reasonable cadence for most sites. The review itself takes five minutes — load your robots.txt, compare it against your current site structure, check whether any rules are obsolete or missing, and update accordingly.
Section C: Platform-Specific
Q21: How do I configure robots.txt in WordPress?
WordPress creates a virtual robots.txt when no physical file exists. This virtual file is managed by your SEO plugin. Both Yoast SEO and Rank Math provide a direct editor in their settings.
Yoast SEO: SEO → Tools → File Editor → robots.txt tab. Edit directly and save. Changes write immediately to either a virtual or physical file depending on your setup.
Rank Math: Rank Math → General Settings → Edit robots.txt. Same interface, same behavior.
Important consideration: If a physical robots.txt file exists in your root directory (uploaded via FTP or file manager), it takes precedence over WordPress’s virtual file. The two can conflict. If your plugin changes don’t seem to take effect, check whether a physical file exists using your hosting file manager.
To verify which version Google is reading: visit yoursite.com/robots.txt in a browser and compare its contents against what’s in your plugin editor.
Q22: How do I configure robots.txt in Shopify?
Since 2021, Shopify supports robots.txt customization through a Liquid template. Access it via: Online Store → Themes → Actions → Edit Code → Templates → robots.txt.liquid.
Shopify’s default template generates sensible rules for most stores — blocking cart, checkout, account, and order-related paths. For most Shopify stores, the default configuration is correct and requires no modification.
Common additions:
Blocking AI crawlers:
liquid
{% capture ai_blocks %}
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
{% endcapture %}
{{ ai_blocks }}{{ default_robots_txt }}
Blocking specific app-generated URL patterns that create duplicate content requires identifying those patterns in your site’s URL structure first. If you have an app creating filter or search pages at URLs like /collections/all?filter.p.tag=sale, you can add a Disallow directive for that pattern.
Q23: How do I configure robots.txt in Wix?
Wix provides a robots.txt editor in the SEO section. Navigate to Marketing & SEO → SEO Tools → Robots.txt in your Wix dashboard.
The editor validates syntax before saving. You can add standard directives, including AI crawler blocks, and custom path rules. Wix’s auto-generated rules handle its internal infrastructure — don’t remove those lines.
One limitation: you cannot use Wix’s robots.txt editor to block Wix’s own crawlers for its internal infrastructure. Those rules are managed by Wix and shouldn’t be modified.
If robots.txt customization doesn’t appear in your dashboard, it may require a Business or Business Elite plan. Check your current plan level.
Q24: How do I configure robots.txt in Squarespace?
Squarespace auto-generates robots.txt and doesn’t provide direct editing access through the standard interface. The generated file handles Squarespace’s internal URLs and checkout flows correctly by default.
For page-level control — preventing specific pages from appearing in search results — use Squarespace’s built-in SEO settings on each page. The “Hide Page from Search Engines” option in Page Settings adds a noindex directive to that page.
For site-wide concerns beyond what Squarespace’s default configuration covers, contacting Squarespace support is the available path. If your use case requires full robots.txt control, a self-hosted WordPress installation provides that flexibility.
Q25: Can I have multiple robots.txt files?
One per domain, and one per subdomain. You cannot have yoursite.com/blog/robots.txt — crawlers only check the root directory for the robots.txt file. Rules for all paths under your domain are contained in the single root-level file.
If you need different rules for different sections of your site, put them all in the root robots.txt file with path-specific Disallow and Allow directives.
Subdomains each get their own independent file. yoursite.com/robots.txt and blog.yoursite.com/robots.txt are completely separate and each governs only its own subdomain.
Q26: How do I block AI crawlers across all my content?
Add user-agent specific blocks for each AI crawler you want to exclude. The current major ones:
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
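Order within the file does not matter to compliant crawlers: each one looks for the group that names it most specifically and falls back to User-agent: * only if nothing else matches. Placing these blocks above your general rules is purely a readability choice.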
Note: Google-Extended is the token that controls whether Google may use your content for its AI models. It is not a separate crawler, and adding this block does not affect Googlebot (Google Search). You can block Google-Extended while keeping your site fully indexed in Google Search.
Verify user-agent strings against each company’s official crawler documentation before implementing. Incorrect strings do nothing.
Q27: What’s the right robots.txt setup for a small business website?
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /wp-includes/
Disallow: /wp-json/
Disallow: /?s=
Disallow: /search/
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
Sitemap: https://yoursite.com/sitemap.xml
This configuration accomplishes everything a small business site needs: full indexation of all public content, protection of admin and backend areas, blocking of duplicate search-result pages, and exclusion from major AI training pipelines. It’s clean, maintainable, and covers the cases that matter most.
Q28: Do I need both robots.txt and an XML sitemap?
Yes — they serve genuinely different purposes and work together.
robots.txt tells crawlers which parts of your site they should not visit. It’s about directing crawlers away from low-value areas.
An XML sitemap tells crawlers exactly which pages exist and should be crawled. It’s about directing crawlers toward your important content.
The combination creates a complete crawl guidance system: your sitemap says “here’s everything worth indexing,” your robots.txt says “here’s everything you can skip.” Together, they give Google a precise picture of your site’s structure and maximize crawl budget efficiency.
Always reference your sitemap from robots.txt using the Sitemap: directive. This ensures every crawler that reads your robots.txt file also knows where to find your full page inventory.
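To confirm that a robots.txt file really advertises a sitemap, the Sitemap: lines can be read back with Python's standard library (site_maps() is available from Python 3.8). This is a minimal sketch; the helper name sitemap_urls and the sample file are assumptions.

```python
from urllib.robotparser import RobotFileParser


def sitemap_urls(robots_txt: str) -> list[str]:
    """Return every URL declared with a Sitemap: line, in file order."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when the file declares no sitemap.
    return parser.site_maps() or []


sample = """\
User-agent: *
Disallow: /admin/

Sitemap: https://yoursite.com/sitemap.xml
"""
print(sitemap_urls(sample))
```

An empty list is a signal that the Sitemap: directive is missing or misspelled.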
Q29: How do I test whether my robots.txt is working correctly?
Google Search Console's robots.txt report is the most reliable place to start. Navigate to Settings → robots.txt in Search Console. It shows the robots.txt files Google found for your property, when each was last crawled, and any warnings or errors detected. The older standalone robots.txt Tester has been retired.
For testing specific URLs, use the URL Inspection tool. Enter any URL on your site and run the inspection. The report states whether crawling is allowed or blocked by robots.txt. To find the rule responsible, compare the URL against the Disallow patterns in the user-agent group that applies to Googlebot.
Manually, you can visit yourdomain.com/robots.txt in a browser to confirm the file loads correctly and contains your intended rules.
After making changes, re-test your key pages — homepage, main content pages, a product or blog post, and your admin URL — to confirm that what should be accessible is accessible and what should be blocked is blocked.
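The re-test of key pages can also be scripted, so the same expectations run after every change. The following is a sketch under stated assumptions: the paths and the helper find_mismatches are hypothetical, and Python's parser only approximates Google's matching (no wildcard support, first matching rule wins within a group).

```python
from urllib.robotparser import RobotFileParser


def find_mismatches(robots_txt, expectations, user_agent="Googlebot"):
    """Return (path, expected, actual) for each path whose crawl status is wrong.

    expectations maps a path to True (should be crawlable) or False
    (should be blocked).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    mismatches = []
    for path, expected in expectations.items():
        actual = parser.can_fetch(user_agent, "https://yoursite.com" + path)
        if actual != expected:
            mismatches.append((path, expected, actual))
    return mismatches


# A file with an accidental block on /blog/.
robots = "User-agent: *\nDisallow: /admin/\nDisallow: /blog/\n"
expected = {"/": True, "/blog/post-1": True, "/admin/": False}
print(find_mismatches(robots, expected))
```

An empty result means every key page behaves as intended; anything else names the path to investigate.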
Q30: Does robots.txt help or hurt SEO?
Correct robots.txt configuration helps SEO through two concrete mechanisms.
First, crawl budget optimization. Google allocates a crawl budget to each domain — a practical limit on how many pages it will crawl within a given period. On large sites, this budget genuinely constrains how quickly new content gets indexed. By blocking low-value pages (admin areas, parameter variations, search result pages, checkout flows), you direct more of this budget toward content that actually benefits from crawling.
Second, duplicate content reduction. URL parameter variations, pagination, filter combinations, and tracking parameters can create dozens or hundreds of near-identical URLs for the same underlying content. Blocking these with robots.txt stops Google from spending crawl requests on them and from evaluating them as distinct pages. One limit to keep in mind: Google cannot read a canonical tag on a URL it is not allowed to crawl, so duplicates that attract links are often better handled with rel="canonical" than with a block.
The caveat: these benefits are most pronounced on large sites. A twenty-page business website probably won’t see a measurable change in crawl efficiency from robots.txt optimization. The fundamentals — correct content, functioning sitemap, clean site structure — matter more at that scale.
Section D: Troubleshooting
Q31: Google Search Console shows pages are “Blocked by robots.txt.” How do I fix this?
Start with the URL Inspection tool. Enter the blocked URL, run the inspection, and look at the Page indexing section, which shows whether crawling is allowed. Then compare the URL against the Disallow rules in your robots.txt to identify the directive responsible for the block.
Open your robots.txt file and find that directive. Determine whether the block is intentional or accidental. Common accidental blocks:
Disallow: / accidentally placed at the top of a rule group blocks your entire site. This usually happens when someone sets up a staging site with a full block and then reuses the file without updating it.
Disallow: /*? blocks all URLs containing query strings. This is often intended to block parameter duplicates, but can also block paginated content, filtered pages, or search functionality that you want indexed.
Disallow: /wp-content/ blocks WordPress media and theme files, which prevents Google from rendering your pages correctly.
After identifying and correcting the rule, validate your updated robots.txt in Search Console. Then use the URL Inspection tool to confirm the previously blocked pages are now accessible, and request indexation for any that were incorrectly excluded.
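The three accidental blocks listed above are easy to scan for automatically. Here is a small sketch of such a check; it is not a full robots.txt parser, and the function name risky_rules and the warning wording are illustrative assumptions.

```python
def risky_rules(robots_txt: str) -> list[str]:
    """Flag Disallow lines that match the common accidental blocks."""
    warnings = []
    agents = []             # user-agents named by the current group
    reading_agents = False  # True while consecutive User-agent lines are read
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                agents = []  # a new group starts here
            agents.append(value)
            reading_agents = True
            continue
        reading_agents = False
        if field != "disallow":
            continue
        if value == "/" and "*" in agents:
            warnings.append("Disallow: / under User-agent: * blocks the whole site")
        elif value.startswith("/*?"):
            warnings.append(f"Disallow: {value} blocks every URL with a query string")
        elif value.startswith("/wp-content"):
            warnings.append(f"Disallow: {value} can block files needed for rendering")
    return warnings


sample = "User-agent: *\nDisallow: /\nDisallow: /wp-content/\n\nUser-agent: GPTBot\nDisallow: /\n"
for warning in risky_rules(sample):
    print(warning)
```

The full block for GPTBot in the sample is not flagged, because a Disallow: / is only suspicious when it applies to all crawlers.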
Q32: What happens if I don’t have a robots.txt file at all?
Crawlers default to treating all content as accessible when no robots.txt file exists. Google will crawl everything it can reach through internal links, including admin pages, parameter variations, and any other paths on your site.
For a brand new, small site with clean URL structure and no admin exposure, this isn’t catastrophic. Google will crawl your site and index your content.
The practical problems that emerge over time: admin and backend paths that should never be indexed become visible in Google’s index. Parameter-based duplicate URLs accumulate in your coverage report. Crawl budget gets distributed across your entire site rather than concentrated on your important content. New AI crawlers access your content without any friction.
None of these are insurmountable, but they all require retroactive cleanup that proper robots.txt configuration prevents from the start.
Q33: Can robots.txt protect my site from hackers?
Not directly. robots.txt is a voluntary protocol that well-behaved bots follow. Someone actively attempting to compromise your site will not be deterred by a text file.
What robots.txt can do in a security context is limited. Well-behaved crawlers stay out of the admin paths, login pages, and backend endpoints you disallow, so those URLs are less likely to surface in search results. Vulnerability scanners and automated probes generally ignore the file, and because robots.txt is publicly readable, every path you list in it is visible to anyone who looks.
This isn't security: a determined attacker will find /wp-admin/ regardless. At best it's noise reduction, keeping legitimate crawlers off pages they have no reason to visit. Never rely on robots.txt to hide a path that must stay secret.
Actual security requires: strong passwords and two-factor authentication, keeping software updated, HTTPS, a web application firewall, and regular security audits. robots.txt is not a substitute for any of these.
Q34: Is robots.txt mandatory?
Not technically. Sites without robots.txt function, and Google indexes them. But the professional standard is to have one, and there’s no legitimate argument for omitting it.
The configuration is not complex, the setup takes minutes, the maintenance is minimal, and the benefits — cleaner indexation, protection of admin areas, control over AI crawler access — apply to every site regardless of size.
Q35: How do I completely hide my site from all search engines?
To block all crawlers from your entire domain:
User-agent: *
Disallow: /
This is appropriate for: development environments, staging sites, internal tools, and any site not intended for public search discovery.
For production sites temporarily not ready for indexation, this is the correct approach. When ready to launch, remove the Disallow: / rule, submit your sitemap to Google Search Console, and request indexation.
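One important note: this blocks crawling but doesn't guarantee no search visibility. If external links point to your domain, Google may still index the bare URLs without crawling them. For complete search invisibility, a noindex directive sent in the HTTP response headers (X-Robots-Tag: noindex) is more reliable than robots.txt alone, but Google only sees it on pages it is allowed to crawl, so don't combine it with Disallow: / on the same URLs. For staging sites and internal tools, password protection is the most dependable option.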
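A header-level noindex is normally sent as X-Robots-Tag. The sketch below checks a set of response headers for it; the helper name has_noindex and the sample headers are hypothetical, and the check deliberately ignores crawler-specific forms such as "googlebot: noindex".

```python
def has_noindex(headers: dict[str, str]) -> bool:
    """Return True if an X-Robots-Tag header carries noindex.

    The value "none" is treated as noindex too, since it is shorthand
    for "noindex, nofollow". Header names are compared case-insensitively.
    """
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":
            continue
        directives = [d.strip().lower() for d in value.split(",")]
        if "noindex" in directives or "none" in directives:
            return True
    return False


print(has_noindex({"Content-Type": "text/html", "X-Robots-Tag": "noindex, nofollow"}))
print(has_noindex({"Content-Type": "text/html"}))
```

Feeding this the headers returned by your staging server is a quick way to confirm the directive is actually being sent.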
Q36: Does robots.txt protect against content scraping?
For legitimate bots that respect the protocol, yes. For malicious scrapers, no.
The practical implication: adding Disallow: / for known scraper user-agents is useful for the subset of scrapers that respect robots.txt — typically commercial tools and data services that need to maintain good standing in the web ecosystem. It does nothing for purpose-built scrapers that ignore the file.
For meaningful scraping protection, the effective tools are: rate limiting at the server or CDN level (blocking IPs that make excessive requests), CAPTCHAs on content that’s frequently targeted, content delivery systems that make large-scale scraping technically difficult, and legal action when identifiable infringement occurs.
robots.txt is a first-layer signal, not a content protection system.
Q37: How do I manage robots.txt across multiple websites?
Each domain gets its own separate file. Rules don’t carry between domains.
For teams managing multiple sites, the practical approach is maintaining a template robots.txt that covers your standard configuration — core blocks, AI crawler exclusions, sitemap reference — and customizing per-site from that template. This reduces setup time and ensures consistency across properties.
Store your templates in version control. When you update your AI crawler block list or add a new standard rule, apply the update across all sites systematically rather than site by site on demand.
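A template kept in version control can be as simple as a function that renders the shared rules plus per-site additions. This is one possible shape, not a prescribed one: the names BASE_DISALLOWS, AI_CRAWLERS, and render_robots, and the paths inside them, are assumptions chosen for the example.

```python
# Shared configuration, edited in one place and applied to every site.
BASE_DISALLOWS = ["/admin/", "/search/"]
AI_CRAWLERS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended"]


def render_robots(domain: str, extra_disallows: tuple[str, ...] = ()) -> str:
    """Build a robots.txt body for one domain from the shared template."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in [*BASE_DISALLOWS, *extra_disallows]]
    for bot in AI_CRAWLERS:
        # Blank line before each group keeps the groups visually separate.
        lines += ["", f"User-agent: {bot}", "Disallow: /"]
    lines += ["", f"Sitemap: https://{domain}/sitemap.xml"]
    return "\n".join(lines) + "\n"


print(render_robots("shop.yoursite.com", extra_disallows=("/cart/", "/checkout/")))
```

Adding a crawler to AI_CRAWLERS and re-rendering every site is then a single change, which is the systematic update the paragraph above describes.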
Q38: What are the most damaging robots.txt mistakes?
The mistakes that cause the most actual harm:
Accidentally blocking your entire site. A misplaced Disallow: / under User-agent: * makes your entire domain invisible to Google. This happens more often than it should, usually when copying rules from a staging environment. Check your live robots.txt after any configuration change.
Blocking CSS, JavaScript, and media files. Google renders your pages to evaluate them for indexation. Blocking these files prevents accurate rendering, which impairs Google’s understanding of your content. Never disallow /wp-content/themes/, /css/, /js/, or /images/.
Blocking core content directories. Disallow: /blog/, Disallow: /products/, Disallow: /services/ removes your entire content inventory from Google’s reach.
Using incorrect AI crawler user-agent strings. A block for anthropic-ai doesn’t block Anthropic’s actual crawler (ClaudeBot). Incorrect strings do nothing. Verify user-agent strings from official documentation.
Never updating after launch. A robots.txt written at launch and never reviewed accumulates outdated rules and misses new crawlers. Quarterly review catches drift before it becomes a problem.
Q39: Should blog posts ever be blocked in robots.txt?
No. Your blog content is the substance of your site’s organic presence in search. Blocking it removes all SEO value from those posts.
The only blog-related paths worth considering for robots.txt blocks:
Draft posts published accidentally. If your CMS gives drafts a live URL before they’re ready, block the draft path pattern: Disallow: /blog/draft/ or Disallow: /?post_status=draft.
Tag and author archive pages, if they generate thin duplicate content with no independent search value. Many WordPress sites block /tag/ and /author/ for this reason.
Paginated variations beyond the first page, depending on your content strategy and whether paginated pages have sufficient unique content to merit independent indexation.
Everything else — every published post — should be accessible. That’s the content you want ranking.
Q40: Why is robots.txt important to overall SEO strategy?
robots.txt sits at the foundation of how Google experiences your site. Before it crawls a single page of content, before it evaluates a single title tag, before it follows a single internal link — it reads your robots.txt file.
What it finds there shapes every subsequent crawl decision. Which paths it prioritizes, which it skips, how efficiently it allocates its crawl budget, which pages end up in its index. A misconfigured file creates downstream problems throughout your SEO operation: pages that should rank don’t get indexed, duplicate content confuses Google’s understanding of your site structure, crawl budget drains on pages that provide no return.
The positive framing: a correctly configured robots.txt is infrastructure that silently improves everything else you do. It doesn’t produce a visible spike in a dashboard. It creates the conditions — clean indexation, efficient crawling, minimal duplication — that allow your content, link-building, and on-page optimization work to produce their full effect.
That’s what foundational SEO infrastructure does. You don’t notice it when it works correctly. You notice its absence when it doesn’t.
Part 6: Robots.txt Best Practices — The Professional Standard
What a Well-Maintained robots.txt Looks Like in Practice
Professionals treat robots.txt as living infrastructure, not a one-time setup. The file needs to reflect your current site structure and the current crawler ecosystem.
At launch: Configure your core rules — block admin areas, search result pages, transactional paths, and duplicate parameter URLs. Reference your sitemap. Add AI crawler blocks for the major providers. Validate in Google Search Console before going live.
When adding site sections: Any new section that shouldn’t be crawled needs a robots.txt rule from the moment it launches. Don’t let admin panels, member areas, or staging paths accumulate crawls before you add the block.
Quarterly review: Check for obsolete rules (paths that no longer exist on your site), new crawlers that should be blocked, and any changes to Google’s crawler documentation that affect best practices. The fifteen minutes spent on this quarterly prevents larger cleanup efforts later.
After major site changes: URL restructuring, CMS migrations, and e-commerce platform changes all require robots.txt review. Old rules may reference paths that no longer exist. New URL structures may expose paths that need blocking.
The Connection Between robots.txt and XML Sitemap
These two files work as a system. Your sitemap says “here are all my important pages.” Your robots.txt says “here’s what to ignore.” Together they give Google a complete and consistent picture of your site.
The consistency requirement: if a URL appears in your sitemap, robots.txt must not disallow it. Including a page in your sitemap while blocking it in robots.txt sends Google contradictory signals, and Search Console flags the URL as blocked by robots.txt. Audit both files together and verify they don't conflict.
The Sitemap: directive in robots.txt closes the loop — it tells every crawler that reads your robots.txt exactly where to find your full URL inventory.
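The audit of both files can be automated by parsing the sitemap and testing each URL against the robots.txt rules. A sketch using only the standard library follows; the helper name blocked_sitemap_urls and the sample data are assumptions, and Python's parser approximates rather than reproduces Google's matching (no wildcards, first matching rule wins within a group).

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_sitemap_urls(robots_txt: str, sitemap_xml: str,
                         user_agent: str = "Googlebot") -> list[str]:
    """Return sitemap URLs that robots_txt forbids user_agent to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


ROBOTS = "User-agent: *\nDisallow: /checkout/\n"
SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yoursite.com/</loc></url>
  <url><loc>https://yoursite.com/blog/post-1</loc></url>
  <url><loc>https://yoursite.com/checkout/thanks</loc></url>
</urlset>
"""
print(blocked_sitemap_urls(ROBOTS, SITEMAP))
```

Every URL this returns should either leave the sitemap or be unblocked in robots.txt.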
Technical SEO Efficiency
For large sites where crawl budget is a genuine constraint, robots.txt configuration is one of the highest-leverage technical SEO actions available. Reducing the number of crawlable low-value URLs directly increases the percentage of crawl budget spent on your important content.
The pages worth auditing for potential blocking: faceted navigation combinations, URL parameter variations, paginated archive pages beyond page two or three, site search results, printer-friendly versions, and legacy URL formats kept for redirect purposes but not intended for indexation.
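None of this requires sophisticated tools. A crawl report from Screaming Frog shows you every URL a crawler can reach by following your internal links, while your server logs or Search Console's Crawl stats report show what Googlebot actually requests. Sorting by URL pattern reveals which categories of pages are consuming crawl budget and which of those have no SEO value.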
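Sorting by URL pattern can be done with a few lines of code once you have a list of crawled URLs. The sketch below groups URLs by first path segment and marks those with a query string; the grouping rule, the function names, and the sample URLs are assumptions, and a real audit would likely need finer patterns.

```python
from collections import Counter
from urllib.parse import urlsplit


def url_pattern(url: str) -> str:
    """Reduce a URL to a coarse pattern: first path segment, plus '?' if it has a query."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    pattern = "/" + (segments[0] + "/" if segments else "")
    return pattern + ("?" if parts.query else "")


def count_patterns(urls):
    """Return (pattern, count) pairs, most frequent first."""
    return Counter(url_pattern(u) for u in urls).most_common()


crawled = [
    "https://yoursite.com/shop/shoes?color=red",
    "https://yoursite.com/shop/shoes?color=blue",
    "https://yoursite.com/shop/shoes",
    "https://yoursite.com/blog/post-1",
]
for pattern, count in count_patterns(crawled):
    print(pattern, count)
```

Patterns ending in "?" with high counts are the usual candidates for parameter-based duplication.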
Part 7: Your robots.txt Implementation Checklist
If your site doesn’t have a robots.txt file or you haven’t reviewed it recently, here’s the complete sequence:
Step 1 — Audit your current state. Visit yourdomain.com/robots.txt. If you see a 404, no file exists. If you see rules, note them and evaluate whether they’re correct for your current site structure.
Step 2 — Build your configuration. Start with the standard configuration from Part 3. Add AI crawler blocks. Add any site-specific paths that should be excluded. Verify you’re not accidentally blocking important content.
Step 3 — Validate. Run your file through a robots.txt validator. Fix any syntax errors. Confirm the file is correctly formatted and all directives are recognized.
Step 4 — Deploy. Upload to your root directory or save through your CMS plugin. Visit yourdomain.com/robots.txt in a browser to confirm it loads correctly and displays your rules.
Step 5 — Verify in Google Search Console. Check the robots.txt report in Search Console. Confirm Google can read the file. Test specific URLs with the URL Inspection tool to verify expected accessibility.
Step 6 — Cross-reference with your sitemap. Confirm that every URL in your sitemap is accessible according to your robots.txt rules. Fix any conflicts.
Step 7 — Set a quarterly review reminder. Put it in your calendar. The review takes fifteen minutes. The cost of missing it compounds.
The Takeaway
robots.txt is one of the smallest files on your server and one of the most consequential for how search engines experience your site.
It doesn’t replace great content. It doesn’t replace a strong backlink profile. It doesn’t replace technical SEO fundamentals like fast load times and clean site structure.
What it does is create the conditions for everything else to work properly. Crawlers that aren’t wasting time on admin pages and parameter duplicates are spending that time on your content. Content that gets found efficiently gets indexed faster. Content that gets indexed faster starts competing for rankings sooner.
Configure it correctly, review it quarterly, and let it do its job quietly in the background — the way good infrastructure should.
