Robots.txt File Generator — What It Is and How to Create One
Learn what a robots.txt file does, how crawl rules work, and how to generate a correct robots.txt for your website without touching code.
Every website on the internet gets visited by bots — Googlebot, Bingbot, AI crawlers, and dozens of others. A robots.txt file is how you tell them what they are and are not allowed to crawl. Get it right and you control how crawlers spend their time on your site. Get it wrong and you can accidentally block your entire site from Google.
What is a robots.txt file?
A robots.txt file is a plain text file placed at the root of your website that tells web crawlers which pages or sections they may crawl and which they should skip. It follows the Robots Exclusion Protocol — a long-informal convention, standardized in 2022 as RFC 9309, that virtually every major crawler respects.
When Googlebot visits your site, the very first URL it requests is:
```
https://yourdomain.com/robots.txt
```
If the file exists, the bot reads the rules and adjusts its crawl accordingly. If it does not exist, the bot assumes everything is open for crawling.
Important: robots.txt is a directive, not a security measure. It tells well-behaved bots what not to crawl — it does not prevent access. Malicious bots and scrapers may ignore it entirely. Never rely on robots.txt to hide sensitive content.
robots.txt file structure
A robots.txt file is made up of one or more records. Each record consists of:
- A `User-agent` line — which bot the rules apply to
- One or more `Disallow` or `Allow` lines — the crawl rules
- An optional `Crawl-delay` directive
- An optional `Sitemap` directive (at file level)
```
User-agent: Googlebot
Disallow: /admin/
Allow: /admin/public/

User-agent: *
Disallow: /private/
Crawl-delay: 10

Sitemap: https://yourdomain.com/sitemap.xml
```
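You can sanity-check a record structure like this with Python's standard-library parser. A quick sketch — note that the stdlib parser applies rules first-match and ignores wildcards, so Allow overrides and wildcard patterns may not behave exactly as Google's longest-match logic does:

```python
import urllib.robotparser

ROBOTS = """\
User-agent: Googlebot
Disallow: /admin/
Allow: /admin/public/

User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# The Googlebot record applies to Googlebot; every other bot falls back to '*'
print(rp.can_fetch("Googlebot", "https://yourdomain.com/admin/users"))   # False
print(rp.can_fetch("Googlebot", "https://yourdomain.com/products/"))     # True
print(rp.can_fetch("SomeOtherBot", "https://yourdomain.com/private/x"))  # False
print(rp.crawl_delay("SomeOtherBot"))                                    # 10
```

This is handy for quick local checks before you deploy a new set of rules.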
Key directives explained
| Directive | What it does |
|---|---|
| `User-agent` | Specifies which crawler the following rules apply to. `*` means all bots. |
| `Disallow` | Blocks the specified path from being crawled. |
| `Allow` | Explicitly permits a path, even within a disallowed parent directory. |
| `Crawl-delay` | Asks the bot to wait N seconds between requests (not supported by Google). |
| `Sitemap` | Points crawlers to your XML sitemap. |
User-agent values — who are you talking to?
Each crawler has a unique user-agent string. The most important ones:
| User-agent | Crawler | Engine |
|---|---|---|
| `*` | All crawlers | — |
| `Googlebot` | Google web crawler | Google Search |
| `Googlebot-Image` | Google Images crawler | Google Images |
| `Googlebot-Video` | Google Video crawler | Google Video |
| `Bingbot` | Microsoft Bing crawler | Bing Search |
| `Slurp` | Yahoo! crawler | Yahoo Search |
| `DuckDuckBot` | DuckDuckGo crawler | DuckDuckGo |
| `Baiduspider` | Baidu crawler | Baidu Search |
| `YandexBot` | Yandex crawler | Yandex Search |
| `GPTBot` | OpenAI training crawler | ChatGPT |
| `ClaudeBot` | Anthropic training crawler | Claude |
| `CCBot` | Common Crawl bot | Various AI datasets |
Rules are applied per user-agent. If a bot matches a specific user-agent record, those rules apply. If no specific record exists, the * (wildcard) rules apply.
Disallow and Allow — how path matching works
Disallow
Disallow: /path/ blocks that path and everything under it.
```
Disallow: /admin/        # blocks /admin/, /admin/users, /admin/login, etc.
Disallow: /private.html  # blocks that file (and any URL starting with this prefix)
Disallow: /              # blocks the entire site
Disallow:                # empty value = allow everything (no restriction)
```
Allow
Allow overrides a Disallow for a more specific path. For Google, the rule with the longest matching path wins; on a tie, Allow wins.
```
User-agent: Googlebot
Disallow: /products/
Allow: /products/featured/   # Googlebot CAN crawl this despite the Disallow above
```
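The "more specific rules win" precedence can be sketched in a few lines of Python. This is a simplified illustration (plain prefixes only, no wildcards; the function name is mine, not part of any spec):

```python
def decide(rules, path):
    """Pick 'allow' or 'disallow' for a path, Google-style.

    rules: list of (directive, path_prefix) pairs from one user-agent record.
    The rule with the longest matching prefix wins; if an Allow and a
    Disallow match at the same length, Allow wins.
    """
    matches = [(len(prefix), directive == "allow")
               for directive, prefix in rules
               if path.startswith(prefix)]
    if not matches:
        return "allow"  # no rule matched: crawling is permitted by default
    _, is_allow = max(matches)  # longest prefix first, Allow breaks ties
    return "allow" if is_allow else "disallow"

rules = [("disallow", "/products/"), ("allow", "/products/featured/")]
print(decide(rules, "/products/featured/sale"))  # allow
print(decide(rules, "/products/other"))          # disallow
```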
Wildcards
Most crawlers (including Google) support two wildcard characters:
| Pattern | Meaning | Example |
|---|---|---|
| `*` | Matches any sequence of characters | `Disallow: /*?` — blocks any URL containing a query string |
| `$` | Matches the end of the URL | `Disallow: /*.pdf$` — blocks URLs ending in `.pdf` |
```
Disallow: /*?            # blocks all URLs with query strings
Disallow: /*.pdf$        # blocks all PDF files
Disallow: /tag/*/page/   # blocks paginated tag archive pages
```
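These patterns map naturally onto regular expressions anchored at the start of the URL path. A rough Python sketch of the matching rule (the helper names are mine; real crawlers implement this internally):

```python
import re

def pattern_to_regex(pattern):
    # '*' matches any character sequence; a trailing '$' anchors the end
    # of the URL. Everything else is matched literally.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.compile(regex)

def blocked(pattern, path):
    # robots.txt rules match from the beginning of the path, so re.match
    # (anchored at the start) is the right call here, not re.search.
    return pattern_to_regex(pattern).match(path) is not None

print(blocked("/*.pdf$", "/files/report.pdf"))   # True
print(blocked("/*.pdf$", "/files/report.pdfx"))  # False
print(blocked("/*?", "/search?q=test"))          # True
print(blocked("/tag/*/page/", "/tag/news/page/2"))  # True
```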
Common robots.txt patterns
Allow everything (default behavior)
```
User-agent: *
Disallow:
```
An empty Disallow means no restrictions. This is equivalent to having no robots.txt at all — but it is good practice to have the file present so you can add rules later.
Block the entire site (e.g., staging environment)
```
User-agent: *
Disallow: /
```
Use this on development, staging, or preview environments to prevent them from being indexed.
Block specific directories
```
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/
```
Block AI training crawlers
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```
E-commerce site — block non-indexable pages
```
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /wishlist/
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?ref=
Allow: /products/
Allow: /collections/

Sitemap: https://yourdomain.com/sitemap.xml
```
WordPress site
```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-login.php
Disallow: /wp-includes/
Disallow: /?s=
Disallow: /search/
Disallow: /trackback/

Sitemap: https://yourdomain.com/sitemap.xml
```
How to generate a robots.txt file without writing code
Writing robots.txt by hand is error-prone — a single typo can block pages you intended to allow. Our Robots.txt Generator lets you:
- Select which bots to target
- Add allow and disallow rules via a simple form
- Set crawl delay if needed
- Add your sitemap URL
- Copy or download the finished file instantly
After generating, use the Robots.txt Tester to verify your rules work as intended before deploying.
Where to place your robots.txt file
The file must be at the root of your domain:
```
https://yourdomain.com/robots.txt             ✓ correct
https://yourdomain.com/robots/robots.txt      ✗ wrong
https://subdomain.yourdomain.com/robots.txt   ✓ correct (for that subdomain)
```
Each subdomain needs its own robots.txt. A file at www.yourdomain.com/robots.txt does not apply to blog.yourdomain.com.
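The per-host rule falls out of how crawlers construct the robots URL: they keep the scheme and host of the page they are crawling and fix the path to /robots.txt. A small illustrative helper (function name is mine):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # robots.txt applies per scheme + host: strip the path, query, and
    # fragment, and point at /robots.txt on the same host.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://blog.yourdomain.com/posts/1"))
# https://blog.yourdomain.com/robots.txt
print(robots_url("https://www.yourdomain.com/about?x=1"))
# https://www.yourdomain.com/robots.txt
```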
Deploying robots.txt
Static site (HTML): Upload robots.txt to the root of your web server's public directory (/public_html/, /dist/, /public/, etc.).
WordPress: Place it in the root of your WordPress installation. Many SEO plugins (Yoast, RankMath) manage it automatically via the admin panel.
Next.js: Place robots.txt in the /public folder, or generate it programmatically with a robots.js / robots.ts file in /app (Next.js 13+).
Vercel / Netlify: Place in /public — it will be served from the root automatically on deployment.
robots.txt and SEO — what to get right
Do not block CSS and JavaScript
A common legacy practice was to block /wp-content/ or /assets/ to save crawl budget. This backfires: Google needs to render your pages to understand them, and blocking CSS/JS prevents that. Only block what you genuinely do not want indexed.
robots.txt does not prevent indexing — noindex does
Disallow prevents Google from crawling a URL. It does not prevent Google from indexing it if another site links to it. To prevent indexing, use the noindex meta tag or X-Robots-Tag HTTP header on the page itself.
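For example, to keep a page out of the index entirely, add the directive to the page itself — and note that the page must remain crawlable, or Google will never see it:

```html
<!-- In the page's <head> -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, send the equivalent HTTP response header instead: `X-Robots-Tag: noindex`.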
Sitemap declaration
Always include your sitemap URL in robots.txt — it is the most reliable way to make sure crawlers find it:
```
Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/sitemap-images.xml
```
You can list multiple sitemaps.
Crawl budget
Large sites (thousands of pages) benefit most from robots.txt optimization. Blocking low-value pages (filtered URLs, internal search results, duplicate thin content) helps Google spend its crawl budget on pages that actually matter.
Validating your robots.txt file
After creating your file, verify it before deploying:
- Robots.txt Tester — paste your file and test specific URLs to confirm allow/disallow behavior
- Google Search Console — the Crawl Stats report shows how Google is crawling your site; the URL Inspection tool shows whether specific pages are blocked
- Manual check — visit https://yourdomain.com/robots.txt after deploying to confirm the file is live and being served correctly
Frequently asked questions
Does Google always follow robots.txt?
Google respects Disallow directives for crawling. However, Google may still index a disallowed URL if it finds links to it — a Disallow blocks the crawl, not the index entry. Use noindex on the page itself to prevent indexing.
Can I have multiple User-agent blocks for the same bot?
It is best not to. Google merges multiple groups for the same user-agent into a single set of rules, but other crawlers may honor only the first matching block, so behavior can vary. Combine all rules for a given user-agent into a single record.
What happens if my robots.txt has a syntax error?
Most crawlers will either ignore the malformed rule or stop parsing at the error. Google will typically continue with the rules it parsed successfully before the error. Test your file before deploying.
Should I block Googlebot-Image?
Only if you specifically do not want your images appearing in Google Images results. If you sell photography or run an image-heavy site, blocking Googlebot-Image could reduce traffic significantly.
How often do crawlers re-read robots.txt?
Google typically caches robots.txt for up to 24 hours. After you update the file, changes may take up to a day to be reflected in Google's crawl behavior.
Does robots.txt affect page speed or Core Web Vitals?
No. robots.txt only affects crawl behavior, not how pages load or perform for real users.
robots.txt file size and limits
- Google supports robots.txt files up to 500 KB in size
- Files larger than 500 KB are truncated — rules beyond that size are ignored
- No official limit on the number of rules, but keep files organized and concise
- UTF-8 encoding is recommended; ASCII also works
Related tools
- Robots.txt Generator — generate a robots.txt file using a form-based interface
- Robots.txt Tester — validate your rules and test specific URLs
- Sitemap Generator — generate an XML sitemap to pair with your robots.txt
- Meta Tag Generator — generate SEO meta tags for your pages