UtilitySEO
SEO·9 December 2025

Robots.txt Pitfalls on Headless and JAMstack Sites

The robots.txt configuration mistakes that are common on JAMstack and headless sites, and the correct patterns to use instead.

Robots.txt on a traditional CMS is usually straightforward. Robots.txt on a JAMstack or headless setup is where teams routinely ship subtle errors, because the build process generates the file in ways that obscure what it actually contains. This article is about the specific robots.txt pitfalls that hit static-site and headless deployments, with the correct patterns for each framework.

The "staging accidentally indexed" pattern

The most common and most damaging error occurs when the same codebase deploys to both staging and production. The robots.txt logic depends on an environment variable that is not set during the production build, so production ships with the staging default of User-agent: * followed by Disallow: /. Crawlers stop fetching every page, and within days the production site starts dropping out of search results. The fix is to make the production build explicitly select the production robots.txt and to fail the build if the variable is missing.
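A minimal sketch of the "fail the build if the variable is missing" guard follows. The variable name DEPLOY_ENV and the three environment labels are assumptions made for illustration; substitute whatever your pipeline actually sets.

```typescript
// Hypothetical names: DEPLOY_ENV and these three labels are assumptions
// for this sketch, not something any framework defines.
type DeployEnv = "production" | "staging" | "preview";

const KNOWN_ENVS: readonly DeployEnv[] = ["production", "staging", "preview"];

// Returns the validated environment, or throws so the build stops
// instead of silently falling back to a staging default.
function requireDeployEnv(env: Record<string, string | undefined>): DeployEnv {
  const raw = env["DEPLOY_ENV"];
  if (raw === undefined || raw.trim() === "") {
    throw new Error("DEPLOY_ENV is not set; refusing to build robots.txt");
  }
  const value = raw.trim();
  const match = KNOWN_ENVS.find((known) => known === value);
  if (match === undefined) {
    throw new Error(`DEPLOY_ENV has unexpected value "${value}"`);
  }
  return match;
}
```

In a Node build script this would typically be called with process.env before anything writes robots.txt. The important design choice is that there is no default branch: an unset or misspelled value stops the build.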

Next.js: the public folder versus the API route

Next.js can serve robots.txt as a static file at public/robots.txt, or generate it in code: app/robots.ts in the App Router, or an API route such as pages/api/robots.ts exposed at /robots.txt through a rewrite in the Pages Router. Defining it in more than one place makes it unclear which version is served, and the outcome can depend on the Next.js version. Pick one approach and stick to it. For most sites the static file is simpler and more reliable.
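One way to enforce "pick one approach" is a small pre-build check that looks for more than one robots.txt source in the project. The sketch below works on a plain list of file paths, so it makes no assumptions about how you collect them. The candidate list is itself an assumption and may need extending for a src/ layout or an API route behind a rewrite.

```typescript
// Files that can each end up answering /robots.txt in a Next.js project.
// This list is an assumption for the sketch; adjust it to your layout.
const ROBOTS_SOURCES: readonly string[] = [
  "public/robots.txt",
  "app/robots.txt",
  "app/robots.ts",
  "app/robots.js",
];

// Given the project's relative file paths, return every robots source present.
function findRobotsSources(projectFiles: readonly string[]): string[] {
  const normalized = projectFiles.map((p) =>
    p.replace(/\\/g, "/").replace(/^\.\//, ""),
  );
  return ROBOTS_SOURCES.filter((source) => normalized.includes(source));
}

// Throws unless exactly one source exists, so a conflict is caught at
// build time rather than discovered in production.
function assertSingleRobotsSource(projectFiles: readonly string[]): string {
  const found = findRobotsSources(projectFiles);
  const [only, ...rest] = found;
  if (only === undefined || rest.length > 0) {
    throw new Error(
      `expected exactly one robots.txt source, found ${found.length}: ${found.join(", ")}`,
    );
  }
  return only;
}
```

For example, a project containing both public/robots.txt and app/robots.ts would make assertSingleRobotsSource throw, while a project with only public/robots.txt returns that path.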

Nuxt: the module versus the generated file

Nuxt has a robots module that generates robots.txt during build. Configuring it correctly requires understanding which environment the build runs in. The common error is configuring robots in development but not in production, resulting in the development robots.txt shipping to production.

Headless WordPress: who serves what

In headless WordPress, the API backend has its own robots.txt and the frontend has its own. Both are accessible. Make sure the API backend disallows everything (you do not want the GraphQL endpoint indexed) and the frontend serves the actual production robots.txt. The default WordPress robots.txt almost certainly is not what you want for a headless setup.

Per-environment robots.txt

The pattern that works reliably across frameworks is per-environment robots.txt files committed to the repo, with the build process choosing which to deploy based on an explicit environment flag. Names like robots.production.txt, robots.staging.txt, robots.preview.txt make the intent obvious. Avoid clever runtime generation when a static file per environment will do.
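The selection step can be as small as the sketch below. It only computes which committed file goes where; the output directory name and the DeployEnv labels are assumptions chosen to match the file names mentioned above.

```typescript
// Environment labels matching the committed files robots.production.txt,
// robots.staging.txt and robots.preview.txt.
type DeployEnv = "production" | "staging" | "preview";

interface CopyPlan {
  from: string; // committed per-environment file
  to: string; // path the static host will serve as /robots.txt
}

// Pure function: decides the copy, leaves the file I/O to the build script.
function planRobotsCopy(env: DeployEnv, outDir: string): CopyPlan {
  const trimmedOutDir = outDir.replace(/\/+$/, "");
  return {
    from: `robots.${env}.txt`,
    to: `${trimmedOutDir}/robots.txt`,
  };
}
```

A Node build script could pass the result to fs.copyFileSync(plan.from, plan.to). Keeping the decision in a pure function makes it trivial to unit test, and because env is a closed union type there is no code path that produces a robots.txt for an unnamed environment.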

Sitemap references

Robots.txt should reference your sitemap with an absolute URL. Many robots.txt files reference /sitemap.xml as a relative path, but Google's documentation calls for a fully qualified URL, so a relative path may simply be ignored. Use the full URL including protocol and domain, and verify it loads in a browser before committing.
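A quick check for relative sitemap references might look like the following sketch. It treats anything that does not start with http:// or https:// plus a host as relative, which is a deliberately simple heuristic rather than full URL validation.

```typescript
// Collect the value of every Sitemap line. The field name is matched
// case-insensitively and trailing # comments are stripped first.
function sitemapValues(robotsTxt: string): string[] {
  const values: string[] = [];
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim();
    const match = /^sitemap\s*:\s*(\S+)/i.exec(line);
    const value = match?.[1];
    if (value !== undefined) {
      values.push(value);
    }
  }
  return values;
}

// Return the Sitemap values that are not absolute http(s) URLs.
function relativeSitemaps(robotsTxt: string): string[] {
  const absolute = /^https?:\/\/[^\s\/]+/i;
  return sitemapValues(robotsTxt).filter((value) => !absolute.test(value));
}
```

Given a file containing both Sitemap: /sitemap.xml and Sitemap: https://www.example.com/sitemap.xml, relativeSitemaps returns only the first value.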

User-agent specific rules

Most sites do not need user-agent-specific rules. The temptation to block specific bots like AhrefsBot or SemrushBot is real but rarely produces meaningful value — the bots respect robots.txt, and blocking them mostly just signals to your competitors which tools you use. Stick to User-agent: * for almost everything unless there is a specific reason to discriminate.

Crawl-delay and other deprecated directives

Crawl-delay was never part of the robots.txt standard, and support for it is inconsistent. Google ignores it. Bing partially honours it. Most other bots ignore it. If you have a server load problem, the solution is not robots.txt; it is rate limiting or scaling the server. Remove Crawl-delay entries from any robots.txt that has them, because they create the false impression that you have configured something useful.

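Allow versus Disallow on conflicting paths

Rule order does not decide conflicts. For Google and other crawlers that follow RFC 9309, the longest matching rule wins, and on a tie the Allow wins. If you have Disallow: /api/ and Allow: /api/public/, the Allow wins for /api/public/ and the Disallow wins everywhere else under /api/. Google has retired its standalone robots.txt tester, so test specific paths with a robots.txt parser (Google publishes its own as open source) and review the robots.txt report in Search Console before assuming a complicated rule set works.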
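The longest-match rule is easy to reason about once written out. The sketch below is deliberately simplified: it handles plain path prefixes within a single user-agent group and does not implement the * and $ wildcards, so treat it as an illustration of precedence rather than a full parser. Resolving ties in favour of Allow follows RFC 9309.

```typescript
type Rule = { type: "allow" | "disallow"; path: string };

// Simplified matcher: prefix matching only, no * or $ wildcards.
function isAllowed(rules: readonly Rule[], urlPath: string): boolean {
  let best: Rule | undefined;
  for (const rule of rules) {
    // An empty path (for example a bare "Disallow:") matches nothing.
    if (rule.path === "" || !urlPath.startsWith(rule.path)) {
      continue;
    }
    const longer = best === undefined || rule.path.length > best.path.length;
    const tieForAllow =
      best !== undefined &&
      rule.path.length === best.path.length &&
      rule.type === "allow";
    if (longer || tieForAllow) {
      best = rule;
    }
  }
  // No matching rule means the path is crawlable.
  return best === undefined || best.type === "allow";
}

const rules: Rule[] = [
  { type: "disallow", path: "/api/" },
  { type: "allow", path: "/api/public/" },
];
```

With these rules, isAllowed(rules, "/api/public/docs") is true, isAllowed(rules, "/api/private") is false, and reversing the order of the two rules changes neither result.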

Validation in CI

The most reliable defence against robots.txt errors is a CI check that fails the build if robots.txt does not match the expected production pattern. Twenty lines of test code prevents the entire category of deployment-mistake errors. Most teams do not have this check; they should.
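As a rough idea of what such a check involves, here is a sketch that inspects the built robots.txt and returns a list of problems. What counts as "the expected production pattern" is an assumption here: no blanket Disallow: / in the User-agent: * group, and a Sitemap line matching an expected URL passed in by the caller.

```typescript
// Returns human-readable problems; an empty array means the file passes.
function validateProductionRobots(
  robotsTxt: string,
  expectedSitemap: string,
): string[] {
  const problems: string[] = [];
  let inWildcardGroup = false;
  let lastWasUserAgent = false;
  let sitemapFound = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim();
    const sep = line.indexOf(":");
    if (sep === -1) continue; // blank or malformed line
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      // Consecutive User-agent lines belong to the same group.
      const isWildcard = value === "*";
      inWildcardGroup = lastWasUserAgent
        ? inWildcardGroup || isWildcard
        : isWildcard;
      lastWasUserAgent = true;
      continue;
    }
    lastWasUserAgent = false;

    if (field === "disallow" && value === "/" && inWildcardGroup) {
      problems.push("User-agent: * has a blanket Disallow: /");
    }
    if (field === "sitemap" && value === expectedSitemap) {
      sitemapFound = true;
    }
  }

  if (!sitemapFound) {
    problems.push(`missing Sitemap: ${expectedSitemap}`);
  }
  return problems;
}
```

A CI step would read the built file, call this function, print the problems, and exit non-zero if the array is not empty. Blocking a single named bot with Disallow: / does not trip the check, because only the wildcard group is inspected.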

The audit cadence

Robots.txt should be on every quarterly audit checklist regardless of framework, because changes ship silently. A continuous audit tool that flags robots.txt changes catches issues within hours of deploy rather than weeks later.

For JAMstack and headless sites specifically, robots.txt is one of the highest-risk files in the build because it can take an entire site offline from Google's perspective with a single character change. Investing in explicit environment-aware configuration and CI validation pays back immediately the first time it catches a near-miss. UtilitySEO and similar audit tools track robots.txt changes over time, which is the correct shape for monitoring a file that should change rarely and dramatically when it does.

Frequently asked questions

Why is my production site getting deindexed because of robots.txt?

The usual cause is a staging configuration that was accidentally deployed to production, so the live robots.txt contains a Disallow: / rule that blocks search engine crawlers.

  • This happens when environment variables for robots.txt are not correctly set during production builds.
  • Ensure your production build explicitly uses the correct production robots.txt file.
  • Implement build checks to fail if the necessary environment variables are missing.

What is the best way to configure robots.txt in a Next.js project?

For most Next.js projects, the simplest and most reliable way to configure robots.txt is a static file at public/robots.txt.

  • Next.js supports both static files and dynamic API routes for robots.txt.
  • Avoid mixing both approaches as it can lead to file resolution conflicts based on Next.js version.
  • The static file approach typically offers more predictability and ease of management.

How should I manage robots.txt for a headless WordPress setup?

In a headless WordPress setup, the API backend and the frontend each serve their own robots.txt, and each needs to be configured deliberately.

  • Ensure the API backend's robots.txt disallows all crawling, preventing API endpoints from being indexed.
  • The frontend should serve the actual production robots.txt for your public-facing site.
  • Do not rely on the default WordPress robots.txt for your headless frontend.

What is the most reliable strategy for managing robots.txt across different environments?

The most reliable strategy is to use per-environment robots.txt files committed directly to your repository.

  • Name files clearly, like robots.production.txt or robots.staging.txt, to indicate their purpose.
  • Configure your build process to explicitly select and deploy the correct file based on an environment flag.
  • This approach is generally more robust than clever runtime generation logic.

How should I properly reference my sitemap in a robots.txt file?

Always reference your sitemap using an absolute URL, including protocol and domain.

  • Google's documentation calls for a fully qualified URL, so a relative path may be ignored.
  • For example, use Sitemap: https://www.example.com/sitemap.xml instead of /sitemap.xml.
  • Always verify that the absolute sitemap URL loads correctly in a browser before deployment.
