What is a robots.txt file and what does it do?

A robots.txt file is a small plain text file that lives at the root of your website and gives instructions to automated crawlers before they read the rest of your site. When a search engine bot or an AI crawler arrives, the first thing it usually checks is whether a robots.txt file exists at your domain root. The rules inside tell it which paths it is welcome to crawl and which it should leave alone.

The file follows a simple, line-based format, standardized as RFC 9309, that almost any crawler understands. It is not a security measure and it does not hide pages from the public. Instead it is a set of polite requests that well-behaved bots agree to follow. Search engines like Google honor it, and the major AI companies have published named crawlers that respect it too. That makes robots.txt one of the most direct levers you have over who reads your content at scale.

Because it is the first file most bots request, getting it right matters. A clean robots.txt can keep crawlers out of low-value areas, point them at your sitemap, and now signal whether AI systems are allowed to ingest your pages at all.

What is the basic syntax of robots.txt?

Robots.txt is built from a few directives that you stack into groups. Each group starts with a User-agent line that names the bot the rules apply to, followed by one or more rules for that bot. The most common rules are Disallow, which blocks a path, and Allow, which permits one. You can also add a Sitemap line, which stands outside any group and points to your XML sitemap so crawlers can discover the URLs you want indexed.

A User-agent of an asterisk means the rules apply to every crawler that does not have its own named group; a bot that does have a named group follows only that group and ignores the asterisk rules. A Disallow with a single slash blocks the entire site, while an empty Disallow value blocks nothing and effectively allows everything. Paths are matched as prefixes from the root, so a Disallow line for a folder such as /private/ blocks that folder and everything inside it.

  • User-agent: names the crawler a block of rules applies to, or an asterisk for all bots.
  • Disallow: tells the named crawler not to crawl a given path or folder.
  • Allow: carves out an exception so a specific path stays crawlable inside a blocked folder.
  • Sitemap: gives the full URL of your XML sitemap so crawlers can discover your pages.
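The four directives above can be checked with Python's built-in urllib.robotparser. This is a minimal sketch, not a reference implementation: example.com, ExampleBot, AnyBot, and the /private/ paths are placeholders invented for illustration. Python's parser applies the first rule that matches, while Google and RFC 9309 pick the most specific one, so the Allow line is listed before the Disallow line to get the same answer either way. The site_maps() call needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for example.com; every name and path is a placeholder.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press/
Disallow: /private/

User-agent: ExampleBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text):
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# A bot without its own group falls back to the "*" group.
print(rules.can_fetch("AnyBot", "https://example.com/blog/post"))          # True
print(rules.can_fetch("AnyBot", "https://example.com/private/notes"))      # False
print(rules.can_fetch("AnyBot", "https://example.com/private/press/kit"))  # True

# A bot with its own group follows only that group.
print(rules.can_fetch("ExampleBot", "https://example.com/blog/post"))      # False

# Sitemap lines are collected separately from the groups.
print(rules.site_maps())  # ['https://example.com/sitemap.xml']
```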

How do you use this robots.txt generator?

The tool turns the syntax above into a few clicks. You start by toggling which search crawlers you want to allow or block, such as Google and Bing. Then you do the same for AI crawlers, deciding bot by bot whether systems like GPTBot, ClaudeBot, and PerplexityBot may read your content. Each toggle writes the correct User-agent and Disallow lines for you, so you never have to remember the exact spelling of a bot name or worry about a misplaced slash.

Next you add the full URL of your sitemap, for example the address of your sitemap.xml file. The generator appends a clean Sitemap line so crawlers can find every page you want indexed. As you flip toggles, the preview updates live, so you can see exactly what bots will read before you publish anything.

When the file looks right, you copy it to your clipboard or download it as robots.txt, ready to upload to your server. There is no guesswork and no need to validate the format by hand, because every rule the tool emits is already valid.
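To make the toggle idea concrete, here is a rough sketch of what such a generator might do internally. It is an assumption for illustration only, not the tool's actual code: the function name build_robots_txt and its parameters are hypothetical. Each blocked bot becomes its own group with a site-wide Disallow, everyone else gets an empty Disallow, and the sitemap URL is appended last.

```python
def build_robots_txt(blocked_bots, sitemap_url=None):
    """Return robots.txt text that blocks each named bot from the whole site.

    blocked_bots: user-agent names whose toggle is set to "block".
    sitemap_url:  optional absolute URL of the XML sitemap.
    """
    # Default group: an empty Disallow allows everything for unnamed bots.
    lines = ["User-agent: *", "Disallow:", ""]

    # One group per blocked bot, each disallowing the site root.
    for bot in blocked_bots:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]

    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")

    # Normalize to exactly one trailing newline.
    return "\n".join(lines).rstrip("\n") + "\n"


generated = build_robots_txt(
    ["GPTBot", "CCBot"],
    sitemap_url="https://example.com/sitemap.xml",  # placeholder URL
)
print(generated)
```

Keeping the default group first and the Sitemap line last is only a readability choice; crawlers do not depend on the order of groups.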

How do you block AI crawlers like GPTBot, and should you?

Blocking an AI crawler works exactly like blocking any other bot. You add a User-agent line for the crawler you want to stop, then a Disallow line with a single slash to block your whole site for that bot. To block GPTBot, you write a group naming GPTBot and disallow the root. The generator does this for you when you flip the matching toggle, so an AI crawler blocker is really just the right combination of User-agent and Disallow lines.
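As a quick illustration, the two-line group that blocks GPTBot can be tested locally with Python's standard library before you publish it. The example.com URL is a placeholder, and the check only shows how a compliant parser reads the rule, not how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# The whole rule: name the bot, then disallow the site root.
BLOCK_GPTBOT = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(BLOCK_GPTBOT.splitlines())

page = "https://example.com/any/page"  # placeholder URL
gptbot_allowed = parser.can_fetch("GPTBot", page)
googlebot_allowed = parser.can_fetch("Googlebot", page)

# GPTBot is blocked; bots that are not named stay unaffected.
print(gptbot_allowed, googlebot_allowed)  # False True
```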

The major AI crawlers each have published names you can target. OpenAI uses GPTBot for training, ChatGPT-User when a person browses inside ChatGPT, and OAI-SearchBot for its search features. Google offers Google-Extended, a control token rather than a separate crawler, to decide whether your content trains its AI products. Anthropic uses ClaudeBot; anthropic-ai is an older name that still appears in many block lists. Perplexity uses PerplexityBot, Common Crawl uses CCBot, ByteDance uses Bytespider, and Apple offers Applebot-Extended, also a control token, for AI training. Knowing the exact names is the whole game, since a typo means the rule silently does nothing.

On whether you should block them, the honest answer for most brands is no. If you want to show up when people ask questions inside ChatGPT, Perplexity, or AI Overviews, you need those systems to be able to read you, so you should allow the AI crawlers. Blocking makes sense only when you genuinely do not want your content used by AI, for example paywalled work, proprietary research, or material you license separately. Knowing how to block AI crawlers is useful, but for anyone chasing AI visibility, an open robots.txt usually wins.

What is the difference between blocking crawling and blocking AI training?

These two things look similar in a robots.txt file but lead to different outcomes, and the difference comes down to which bot you are addressing. Blocking crawling stops a bot from reading your pages at all. If you disallow ChatGPT-User or OAI-SearchBot, you reduce the chance your content appears when someone asks a live question inside that product, because the system cannot fetch you in the moment.

Blocking AI training is narrower. User agents like GPTBot, Google-Extended, and Applebot-Extended exist mainly to govern content that may train future models. Disallowing those specific names tells the company not to use your pages for training, while still leaving the door open for the crawlers that power live answers and search. That separation lets you make a precise choice: you can keep earning citations in AI answers today while opting out of having your work absorbed into a model.

Because the bots are named separately, you control each outcome independently. The tool exposes them as distinct toggles so you do not have to guess which crawler does what.
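One way to picture the split is a file that opts out of training while leaving live-answer crawlers alone. The sketch below groups the three training-related names under one Disallow and checks the result with Python's standard parser. The page URL is a placeholder, and whether each company treats these names exactly as described is taken from the text above rather than verified here.

```python
from urllib.robotparser import RobotFileParser

# Several User-agent lines may share one set of rules.
TRAINING_OPT_OUT = """\
User-agent: GPTBot
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(TRAINING_OPT_OUT.splitlines())

page = "https://example.com/guide"  # placeholder URL
agents = ["GPTBot", "Google-Extended", "Applebot-Extended",
          "OAI-SearchBot", "ChatGPT-User", "Googlebot"]

# Map each user agent to whether it may fetch the page.
access = {agent: parser.can_fetch(agent, page) for agent in agents}

for agent, allowed in access.items():
    print(f"{agent:18} {'allowed' if allowed else 'blocked'}")
```

The three training-related names come out blocked, while OAI-SearchBot, ChatGPT-User, and Googlebot fall through to the open default group.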

Where do you host the robots.txt file?

Robots.txt only works in one place: the root of your domain, served at the path /robots.txt. So if your site is example.com, the file must be reachable at example.com/robots.txt. Crawlers do not look anywhere else, so a file placed in a subfolder or under a different name will be ignored completely.

Each subdomain needs its own file, because crawlers treat blog.example.com and shop.example.com as separate hosts. The file should return a normal 200 response and be plain text. Once you download the file from this tool, upload it to your web root with whatever you use to deploy, then confirm it loads in a browser at your domain followed by /robots.txt.
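If you want to work out where a crawler will look for a given page, the rule is mechanical: keep the scheme and host, replace everything else with /robots.txt. A small sketch, with the helper name robots_url chosen here purely for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL that governs the host serving page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any subdomain); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/pricing"))
# https://example.com/robots.txt
print(robots_url("https://blog.example.com/posts/hello?ref=home"))
# https://blog.example.com/robots.txt
```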

What are the most common robots.txt mistakes?

The most damaging mistake is blocking your entire site by accident. A Disallow with a single slash under a User-agent of an asterisk tells every crawler to stay out, which can quietly remove you from search results. This often slips into production when a staging rule is copied to the live server, so it is worth checking after every deploy.

Blocking CSS and JavaScript is another frequent error. Modern search engines render your pages to understand them, and if you disallow the folders that hold your styles and scripts, the crawler may see a broken layout and judge the page poorly. Leave those assets crawlable unless you have a strong reason not to.

Typos in user-agent names are the quiet killers. Nothing validates a robots.txt file when you publish it, so if you misspell GPTBot or PerplexityBot, the rule simply does nothing and the bot crawls anyway. Matching ignores upper and lower case, but the spelling itself has to be exact. Other common slips include forgetting the Sitemap line, relying on robots.txt to hide sensitive pages (it does not, since the file is public), and putting the file anywhere other than the domain root. Using a generator removes most of these because the bot names and structure are correct by default.
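Two of these mistakes, the accidental site-wide block and the misspelled bot name, are easy to catch with a few lines of code. The following is a hedged sketch of such a check, not a full validator: lint_robots_txt is a hypothetical helper, and KNOWN_BOTS is just the set of names mentioned in this article plus Googlebot and bingbot, so a real list would need to be kept up to date.

```python
import difflib

# Names taken from this article, plus Googlebot and bingbot (illustrative list).
KNOWN_BOTS = ["Googlebot", "bingbot", "GPTBot", "ChatGPT-User", "OAI-SearchBot",
              "Google-Extended", "ClaudeBot", "PerplexityBot", "CCBot",
              "Bytespider", "Applebot-Extended"]
KNOWN_LOWER = {name.lower() for name in KNOWN_BOTS}


def lint_robots_txt(text):
    """Return a list of warning strings for two common robots.txt mistakes."""
    warnings = []
    agents = []        # user agents named by the current group
    in_rules = False   # True once the current group has at least one rule
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
            if value != "*" and value.lower() not in KNOWN_LOWER:
                guess = difflib.get_close_matches(value, KNOWN_BOTS, n=1)
                hint = f" (did you mean {guess[0]}?)" if guess else ""
                warnings.append(f"unrecognized user-agent {value!r}{hint}")
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                warnings.append("'Disallow: /' under 'User-agent: *' blocks the whole site")
    return warnings


risky = "User-agent: *\nDisallow: /\n\nUser-agent: GTPBot\nDisallow: /\n"
for warning in lint_robots_txt(risky):
    print(warning)
```

An unrecognized name is only a warning, since a bot can be legitimate without appearing in your list.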

How does robots.txt affect crawl budget and SEO?

Crawl budget is the amount of crawling a search engine is willing to spend on your site in a given window. On large sites with many low-value URLs, such as filtered listings, internal search results, or endless pagination, robots.txt helps you steer that budget toward the pages that matter. By disallowing the thin or duplicate sections, you encourage crawlers to spend their time on the content you actually want ranked.
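To see what such rules do to a concrete URL list, you can split the URLs into crawlable and blocked sets. This is a sketch under assumptions: the /search/ and /filter/ paths and the sample URLs are invented for illustration, and Python's parser does plain prefix matching, so it will not evaluate the wildcard patterns that some search engines support.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: /search/ and /filter/ stand in for low-value sections.
CRAWL_BUDGET_RULES = """\
User-agent: *
Disallow: /search/
Disallow: /filter/
"""


def split_by_crawlability(robots_text, urls, agent="Googlebot"):
    """Return (crawlable, blocked) URL lists for the given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    crawlable, blocked = [], []
    for url in urls:
        (crawlable if parser.can_fetch(agent, url) else blocked).append(url)
    return crawlable, blocked


SAMPLE_URLS = [
    "https://example.com/products/widget",
    "https://example.com/search/widgets",
    "https://example.com/filter/color-red/",
    "https://example.com/blog/launch",
]

crawlable, blocked = split_by_crawlability(CRAWL_BUDGET_RULES, SAMPLE_URLS)
print(len(crawlable), len(blocked))  # 2 2
```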

For smaller sites, crawl budget is rarely a constraint, so the bigger SEO win from robots.txt is simply avoiding self-inflicted damage. Keep your important pages crawlable, keep CSS and JavaScript open, point to your sitemap, and only block what truly does not need to be crawled. Done well, the file is invisible. Done badly, it can undo months of work, which is exactly why a clean, validated robots.txt is worth the two minutes it takes to build.