Sitemap Generation

Generate XML sitemaps from completed crawl results, following the sitemaps.org protocol. Useful for SEO, site audits, and content inventory management.

REST Endpoint: GET /api/v1/crawls/:id/sitemap
MCP Tool: crawl_sitemap
Permission: crawls:read
Looking to analyze site structure?
Generate sitemaps from crawl results for SEO audits, content inventories, and migration planning.

REST API

GET /api/v1/crawls/:id/sitemap

Parameters

Parameter  Type     Required  Description
id         string   Yes       Crawl job UUID (path parameter)
page       integer  No        Page number for large sitemaps (default: 1)
Note
Sitemaps are generated only for completed crawls: the crawl's status must be "completed" before you request a sitemap.

Examples

Basic Usage
curl -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/v1/crawls/crawl_abc123/sitemap
With Pagination
curl -H "Authorization: Bearer sk_your_api_key" \
  "https://your-server/api/v1/crawls/crawl_abc123/sitemap?page=2"

Response Format

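Responses use the application/xml content type. Crawls with 50,000 URLs or fewer return a single sitemap. Crawls with more than 50,000 URLs return a sitemap index when no page parameter is supplied; each index entry links to one page of the sitemap, which you request with the page parameter.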

Sitemap XML Example
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page1</loc>
    <lastmod>2024-01-15T10:30:15Z</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>https://example.com/page2</loc>
    <lastmod>2024-01-15T10:30:18Z</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
Sitemap Index XML Example (>50K URLs)
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://your-server/api/v1/crawls/crawl_abc123/sitemap?page=1</loc>
    <lastmod>2024-01-15T10:30:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://your-server/api/v1/crawls/crawl_abc123/sitemap?page=2</loc>
    <lastmod>2024-01-15T10:30:00Z</lastmod>
  </sitemap>
</sitemapindex>
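
Because the same endpoint can return either document type, a client usually has to check the root element before processing the response. The following is a minimal Python sketch using only the standard library; the function name classify_sitemap is hypothetical and not part of the API.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def classify_sitemap(xml_text):
    """Return (kind, locs) where kind is "urlset" or "sitemapindex".

    Hypothetical helper: it inspects only the root element and the
    <loc> values, which are the parts shown in the examples above.
    """
    # Parse bytes so the UTF-8 declaration in the document is honoured.
    data = xml_text.encode("utf-8") if isinstance(xml_text, str) else xml_text
    root = ET.fromstring(data)
    # ElementTree renders namespaced tags as "{namespace}localname".
    kind = root.tag.split("}")[-1]
    if kind not in ("urlset", "sitemapindex"):
        raise ValueError(f"unexpected root element: {kind}")
    locs = [el.text.strip() for el in root.iter(f"{{{SITEMAP_NS}}}loc") if el.text]
    return kind, locs
```

For a "sitemapindex" result, each returned loc is a page URL to fetch next; for a "urlset" result, the locs are the crawled page URLs themselves.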

MCP Tool

Generate sitemaps from AI agents using the crawl_sitemap MCP tool:

Parameters

Parameter  Type     Required  Description
id         string   Yes       Crawl job UUID
page       integer  No        Page number for large sitemaps (default: 1)

Examples

Basic Call
{
  "id": "crawl_abc123"
}
With Pagination
{
  "id": "crawl_abc123",
  "page": 2
}
Response Format
{
  "data": {
    "sitemap": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n  <url>\n    <loc>https://example.com/page1</loc>\n    <lastmod>2024-01-15T10:30:15Z</lastmod>\n    <changefreq>weekly</changefreq>\n    <priority>0.5</priority>\n  </url>\n</urlset>"
  }
}
Important
The crawl_sitemap tool returns the sitemap as a string in the data.sitemap field. For large sitemaps (>50K URLs), consider using the REST API directly and streaming the response to a file.
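
Streaming the REST response to a file might look like the sketch below. It uses only the Python standard library; the helper names (build_sitemap_request, stream_to_file, download_sitemap) are hypothetical, and the base URL and API key are placeholders in the same spirit as the curl examples.

```python
import shutil
import urllib.request
from urllib.parse import quote, urlencode


def build_sitemap_request(base_url, crawl_id, api_key, page=None):
    """Build the GET request for /api/v1/crawls/:id/sitemap."""
    url = f"{base_url.rstrip('/')}/api/v1/crawls/{quote(crawl_id, safe='')}/sitemap"
    if page is not None:
        url += "?" + urlencode({"page": page})
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def stream_to_file(response, path, chunk_size=64 * 1024):
    """Copy a binary file-like response to disk in chunks; return bytes written."""
    with open(path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
        return out.tell()


def download_sitemap(base_url, crawl_id, api_key, path, page=None):
    """Fetch one sitemap document without holding it all in memory."""
    request = build_sitemap_request(base_url, crawl_id, api_key, page)
    with urllib.request.urlopen(request) as response:
        return stream_to_file(response, path)
```

Copying in fixed-size chunks keeps memory use flat regardless of sitemap size, which is the main reason to prefer the REST endpoint over the string returned by the MCP tool.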

Sitemap Format

Protocol Compliance

Generated sitemaps follow the sitemaps.org v0.9 specification with these features:

  • URL limit: maximum 50,000 URLs per sitemap
  • Pagination: automatic for crawls exceeding 50K URLs
  • URL filtering: only includes successful pages (HTTP 200-399)
  • Sorting: URLs sorted alphabetically
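
The 50,000-URL limit implies a simple page count for large crawls. The sketch below works through that arithmetic in Python; it assumes every page except the last is filled to the limit, which the documentation does not state explicitly, and both function names are hypothetical.

```python
import math

MAX_URLS_PER_SITEMAP = 50_000  # limit stated in the list above


def sitemap_page_count(url_count):
    """Number of sitemap pages needed for url_count included URLs."""
    if url_count < 0:
        raise ValueError("url_count must be non-negative")
    return math.ceil(url_count / MAX_URLS_PER_SITEMAP)


def page_slice(sorted_urls, page):
    """URLs that would fall on a 1-based page, assuming full pages (an assumption)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    start = (page - 1) * MAX_URLS_PER_SITEMAP
    return sorted_urls[start:start + MAX_URLS_PER_SITEMAP]
```

Under this assumption, a crawl with 120,000 included URLs needs three pages, and requesting a fourth would be out of range.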

URL Filtering Rules

Status Code Range  Included?  Reason
200-299            Yes        Successful requests
300-399            Yes        Successful redirects
400-499            No         Client errors
500-599            No         Server errors
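
To predict which crawled pages will appear in a sitemap, the filtering and sorting rules can be approximated locally. This Python sketch is an illustration, not the server's implementation: the (url, status_code) input shape is an assumption, and "alphabetically" is interpreted as plain string ordering.

```python
def is_included(status_code):
    """True for the status ranges the table marks as included (200-399)."""
    return 200 <= status_code <= 399


def sitemap_urls(pages):
    """Filter (url, status_code) pairs and return the included URLs, sorted.

    Assumption: the service's alphabetical sort matches Python's default
    string ordering; the documentation does not specify the collation.
    """
    return sorted(url for url, status in pages if is_included(status))
```

Status codes outside the four documented ranges (for example 1xx) are treated as excluded here, since the table does not list them.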

Use Cases

1. SEO Audit Workflow

Crawl a competitor's site and generate a sitemap to analyze its page structure:

# 1. Create crawl
curl -X POST \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://competitor.com", "depth": 3}' \
  https://api.example.com/api/v1/crawls

# 2. Generate sitemap when complete
curl -H "Authorization: Bearer sk_your_api_key" \
  https://api.example.com/api/v1/crawls/crawl_xyz/sitemap > competitor-sitemap.xml

# 3. Compare with actual sitemap to find orphan pages
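
Step 3 is not shown above, so here is one way the comparison might be done in Python. It is a sketch under assumptions: both files are plain urlset documents, and a URL listed in the site's published sitemap but absent from the crawl-generated one is treated as an orphan candidate (not reachable by following links). The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

LOC_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"


def sitemap_locs(xml_text):
    """Return the set of <loc> values in a sitemap document."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {el.text.strip() for el in root.iter(LOC_TAG) if el.text}


def compare_sitemaps(crawled_xml, published_xml):
    """Compare the crawl-generated sitemap with the site's published one."""
    crawled = sitemap_locs(crawled_xml)
    published = sitemap_locs(published_xml)
    return {
        # Published but never reached by the crawler.
        "orphan_candidates": sorted(published - crawled),
        # Reached by the crawler but missing from the published sitemap.
        "unlisted": sorted(crawled - published),
    }
```

Crawl depth affects the result: a page deeper than the crawl's depth setting would also show up as an orphan candidate, so the output is a list to review rather than a verdict.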
2. Site Migration Verification

Generate sitemaps before and after migration to ensure all pages were migrated:

# Crawl old site
curl -X POST \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://old-site.com", "depth": 5}' \
  https://api.example.com/api/v1/crawls

# After migration, crawl new site
curl -X POST \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://new-site.com", "depth": 5}' \
  https://api.example.com/api/v1/crawls

# Generate and compare sitemaps
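
Because the old and new sites live on different hosts, comparing full URLs would flag every page as missing. The Python sketch below compares by path and query string instead; it assumes the migration preserved URL paths, and the function names are hypothetical. The inputs are the URL lists extracted from each crawl's sitemap.

```python
from urllib.parse import urlsplit


def url_path(url):
    """Host-independent key for a URL: its path plus any query string."""
    parts = urlsplit(url)
    path = parts.path or "/"  # treat "https://site" and "https://site/" alike
    return path + ("?" + parts.query if parts.query else "")


def missing_after_migration(old_urls, new_urls):
    """Old-site URLs whose path has no counterpart on the new site."""
    new_paths = {url_path(u) for u in new_urls}
    return sorted(u for u in old_urls if url_path(u) not in new_paths)
```

If the migration also renamed paths, a redirect map would be needed before this comparison is meaningful.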
3. Content Inventory for CMS Migration

Crawl the entire site and export a sitemap for import into the new CMS:

# Crawl entire site
curl -X POST \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://legacy-cms.com", "depth": 10}' \
  https://api.example.com/api/v1/crawls

# Export sitemap for import into new CMS
curl -H "Authorization: Bearer sk_your_api_key" \
  https://api.example.com/api/v1/crawls/crawl_abc/sitemap > content-inventory.xml
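
Many CMS import tools accept spreadsheets more readily than XML, so converting the exported sitemap to CSV can be a useful intermediate step. The following Python sketch does that with the standard library; the two-column layout and the function name sitemap_to_csv are assumptions chosen for illustration, not a format any particular CMS requires.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_csv(xml_text):
    """Convert a urlset document into CSV text with loc and lastmod columns."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["loc", "lastmod"])
    for url in root.findall("sm:url", NS):
        # findtext returns the default when the child element is absent.
        loc = url.findtext("sm:loc", default="", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS)
        writer.writerow([loc.strip(), lastmod.strip()])
    return buffer.getvalue()
```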
4. AI-Powered SEO Optimization

Use AI agents to analyze sitemaps and provide SEO recommendations:

"Let me analyze the site structure. I'll crawl the site first, then generate a sitemap to identify which pages are being indexed and recommend improvements."
// Agent uses crawl_create
{
  "url": "https://example.com",
  "depth": 3,
  "format": "markdown",
  "wait": true
}

// Then uses crawl_sitemap
{
  "id": "crawl_xyz"
}

// Agent analyzes sitemap to find:
// - Missing canonical tags
// - Pages with low priority
// - Orphan pages
// - Deep URL structure issues
Pro tip
Combine sitemap generation with the pages endpoint to get both structure and content. Use the sitemap to identify pages of interest, then fetch their content via the pages API.

Error Handling

REST API Errors

Status  Description
401     Invalid or missing API key
403     Key lacks crawls:read permission
404     Crawl ID doesn't exist or crawl not completed
404     Page number out of range
500     Server-side error
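
A client can translate these status codes into readable exceptions. The Python sketch below is one possible shape; SitemapError and check_status are hypothetical names, and because 404 covers several conditions in the table, its message lists all of them rather than guessing which one applied.

```python
SITEMAP_ERRORS = {
    401: "Invalid or missing API key",
    403: "Key lacks crawls:read permission",
    404: "Crawl ID doesn't exist, crawl not completed, or page number out of range",
    500: "Server-side error",
}


class SitemapError(Exception):
    """Raised when the sitemap endpoint returns a non-success status."""

    def __init__(self, status, message):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status


def check_status(status):
    """Return None for 2xx responses; raise SitemapError otherwise."""
    if 200 <= status < 300:
        return None
    raise SitemapError(status, SITEMAP_ERRORS.get(status, "Unexpected response"))
```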

MCP Tool Errors

Error           Description
unauthorized    Invalid or missing API key
forbidden       Key lacks required permissions
not_found       Crawl ID doesn't exist
page_not_found  Page number out of range
no_results      Crawl has no successful URLs

Rate Limiting

  • Sitemap requests count toward standard API rate limits
  • Recommended: Cache sitemap responses locally for completed crawls
  • Use webhooks to regenerate sitemaps when a crawl completes
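
The caching recommendation can be sketched as a small file-backed cache in Python. It assumes a completed crawl's sitemap does not change, as the recommendation implies, and that crawl IDs are safe to use in file names. The cached_sitemap function and the fetch callback are hypothetical; fetch stands in for whatever code performs the HTTP request.

```python
from pathlib import Path


def cached_sitemap(cache_dir, crawl_id, page, fetch):
    """Return sitemap bytes, calling fetch(crawl_id, page) only on a cache miss.

    Assumption: crawl_id contains only characters that are valid in file names.
    """
    name = f"{crawl_id}.xml" if page is None else f"{crawl_id}-page{page}.xml"
    path = Path(cache_dir) / name
    if path.exists():
        return path.read_bytes()
    data = fetch(crawl_id, page)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data
```

Repeated requests for the same crawl and page are then served from disk and do not count toward the API rate limit.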
