Question 1

Why is my production site getting deindexed because of robots.txt?

Accepted Answer

Production sites usually get deindexed this way when a staging configuration is accidentally deployed, so production ends up serving a Disallow: / rule that blocks search engine crawlers.
- This happens when environment variables for robots.txt are not correctly set during production builds.
- Ensure your production build explicitly uses the correct production robots.txt file.
- Implement build checks to fail if the necessary environment variables are missing.
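
The build check from the last bullet could look roughly like the sketch below. It is only an illustration: the variable name ROBOTS_ENV and both function names are assumptions, not part of any framework, and the parser handles only the simple case of a Disallow: / rule in a group that applies to all crawlers.

```typescript
// Hypothetical build-time guard. ROBOTS_ENV is an assumed variable name.
type BuildEnv = Record<string, string | undefined>;

// Returns true when a group addressed to "User-agent: *" contains "Disallow: /".
function blocksAllCrawlers(robotsTxt: string): boolean {
  let groupIsWildcard = false;
  let lastWasUserAgent = false;
  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines belong to the same group.
      groupIsWildcard = (lastWasUserAgent && groupIsWildcard) || value === "*";
      lastWasUserAgent = true;
      continue;
    }
    if (field === "allow" || field === "disallow") {
      lastWasUserAgent = false;
      if (field === "disallow" && value === "/" && groupIsWildcard) return true;
    }
  }
  return false;
}

// Collects reasons why the build should fail; an empty array means "looks fine".
function checkProductionRobots(env: BuildEnv, robotsTxt: string): string[] {
  const problems: string[] = [];
  const target = env["ROBOTS_ENV"];
  if (!target) {
    problems.push("ROBOTS_ENV is not set");
  } else if (target === "production" && blocksAllCrawlers(robotsTxt)) {
    problems.push("production robots.txt disallows / for all crawlers");
  }
  return problems;
}
```

A build script could pass its environment and the contents of the robots.txt it is about to ship, then exit with a non-zero status if the returned array is not empty.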

Question 2

What is the best way to configure robots.txt in a Next.js project?

Accepted Answer

For most Next.js projects, the simplest and most reliable way to configure robots.txt is to place a static file at /public/robots.txt.
- Next.js supports both static files and dynamic API routes for robots.txt.
- Avoid mixing both approaches, as it can lead to file resolution conflicts that vary with the Next.js version.
- The static file approach typically offers more predictability and ease of management.
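
To catch a mixed setup early, a small check over the list of project files may help. The sketch below is hypothetical: the function name is invented, and the path lists are assumptions that should be adjusted to your router and Next.js version.

```typescript
// Assumed locations of robots.txt sources. Extend these lists for your project layout.
const STATIC_ROBOTS = ["public/robots.txt"];
const DYNAMIC_ROBOTS = ["app/robots.ts", "pages/api/robots.ts"];

interface RobotsSources {
  staticFiles: string[];
  dynamicFiles: string[];
  conflict: boolean; // true when both approaches are present
}

function findRobotsSources(projectFiles: string[]): RobotsSources {
  // Normalize Windows separators and a leading "./" so paths compare cleanly.
  const normalized = projectFiles.map((p) =>
    p.replace(/\\/g, "/").replace(/^\.\//, "")
  );
  const staticFiles = normalized.filter((p) => STATIC_ROBOTS.indexOf(p) !== -1);
  const dynamicFiles = normalized.filter((p) => DYNAMIC_ROBOTS.indexOf(p) !== -1);
  return {
    staticFiles,
    dynamicFiles,
    conflict: staticFiles.length > 0 && dynamicFiles.length > 0,
  };
}
```

If conflict is true, pick one approach and delete the other file rather than relying on whichever one the framework happens to resolve first.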

Question 3

How should I manage robots.txt for a headless WordPress setup?

Accepted Answer

In a headless WordPress setup, the API backend and the frontend each serve their own robots.txt, and each one needs to be configured deliberately.
- Ensure the API backend's robots.txt disallows all crawling so that crawlers stay off the API endpoints.
- The frontend should serve the actual production robots.txt for your public-facing site.
- Do not rely on the default WordPress robots.txt for your headless frontend.
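
The split between the two hosts can be made explicit in code. The following is a minimal sketch under assumptions: the role names, the function name, the permissive frontend rule, and the /sitemap.xml path are all illustrative and not something WordPress or any frontend framework defines.

```typescript
// Hypothetical roles for the two hosts in a headless setup.
type Role = "api-backend" | "frontend";

// Builds the robots.txt body for one host. siteOrigin is the public frontend origin.
function robotsFor(role: Role, siteOrigin: string): string {
  if (role === "api-backend") {
    // The backend host asks every crawler to stay away.
    return ["User-agent: *", "Disallow: /", ""].join("\n");
  }
  // The frontend serves the real production rules (assumed to be fully open here).
  return [
    "User-agent: *",
    "Allow: /",
    `Sitemap: ${siteOrigin}/sitemap.xml`,
    "",
  ].join("\n");
}
```

For example, robotsFor("frontend", "https://www.example.com") produces a file whose Sitemap line is the absolute URL https://www.example.com/sitemap.xml.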

Question 4

What is the most reliable strategy for managing robots.txt across different environments?

Accepted Answer

The most reliable strategy is to commit per-environment robots.txt files directly to your repository.
- Name files clearly, like robots.production.txt or robots.staging.txt, to indicate their purpose.
- Configure your build process to explicitly select and deploy the correct file based on an environment flag.
- This approach is generally more robust than clever runtime generation logic.
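
The selection step can stay deliberately boring. Here is a minimal sketch, assuming the two file names from the bullets above and a hypothetical function name; the point is that an unknown or missing flag fails loudly instead of falling back to a default.

```typescript
// Maps an environment flag to the committed robots file.
// Unknown or missing flags throw, so a misconfigured build stops early.
function robotsFileFor(envFlag: string | undefined): string {
  switch (envFlag) {
    case "production":
      return "robots.production.txt";
    case "staging":
      return "robots.staging.txt";
    default:
      throw new Error(
        `No robots.txt file is mapped to environment "${String(envFlag)}"`
      );
  }
}
```

The build step would then copy the returned file to wherever the site serves robots.txt from. There is intentionally no fallback branch, since a silent default is how a staging file ends up in production.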

Question 5

How should I properly reference my sitemap in a robots.txt file?

Accepted Answer

Always reference your sitemap with an absolute URL that includes the protocol and domain.
- Google's documentation asks for a fully qualified sitemap URL, so do not count on crawlers resolving a relative path; an absolute URL removes that ambiguity.
- For example, use Sitemap: https://www.example.com/sitemap.xml instead of /sitemap.xml.
- Always verify that the absolute sitemap URL loads correctly in a browser before deployment.
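
A quick scan for relative Sitemap lines is easy to automate. This is a hedged sketch with a hypothetical function name; it only checks that each Sitemap value starts with http:// or https:// followed by a host, and it does not verify that the URL actually loads.

```typescript
// Returns every Sitemap value that is not an absolute http(s) URL.
function findRelativeSitemaps(robotsTxt: string): string[] {
  const offenders: string[] = [];
  for (const raw of robotsTxt.split(/\r?\n/)) {
    // Field names are case-insensitive, so match "Sitemap", "sitemap", etc.
    const match = /^\s*sitemap\s*:\s*(\S*)/i.exec(raw);
    if (match === null) continue;
    const target = match[1];
    if (!/^https?:\/\/[^\s\/]+/i.test(target)) {
      offenders.push(target);
    }
  }
  return offenders;
}
```

Run against a file containing both Sitemap: https://www.example.com/sitemap.xml and Sitemap: /sitemap.xml, it reports only the second value.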

Robots.txt Pitfalls on Headless and JAMstack Sites

The "staging accidentally indexed" pattern

Next.js: the public folder versus the API route

Nuxt: the module versus the generated file

Headless WordPress: who serves what

Per-environment robots.txt

Sitemap references

User-agent specific rules

Crawl-delay and other deprecated directives

Allow before Disallow on conflicting paths

Validation in CI

The audit cadence

Frequently asked questions