Sync Crawls

Crawl a single URL and receive immediate results without creating a background job. Perfect for quick lookups, API integrations, or when you need results right away.

Endpoint: GET /api/v1/crawls/sync
Auth Required: Bearer Token (Private Key)
Max Timeout: 120 seconds

Sync vs Async Crawls

Choose the right approach for your use case:

| Feature | Sync Crawls | Async Crawls |
|---|---|---|
| HTTP Method | GET | POST |
| URLs | Single URL only | Multiple URLs, full site crawl |
| Response | Immediate results (200 OK) | Job ID (202 Accepted), poll for results |
| Depth | Single page only | Configurable depth (default: 1) |
| Webhooks | Not supported | Supported |
| Events | Does not emit events | Emits crawl lifecycle events |
| Persistence | No database record | Creates background job in database |
| Use Case | Quick lookups, API integrations | Large crawls, batch processing |

API Endpoint

GET /api/v1/crawls/sync

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | — | The URL to crawl (must be HTTP or HTTPS) |
| result_format | string | No | html | Output format: html, text, markdown, or json |
| timeout | integer | No | 30000 | Request timeout in milliseconds (max: 120000) |
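As a client-side illustration of the rules in the table, the following Python sketch validates the three parameters and builds the query string. The function name and error messages are illustrative, not part of Mulberry's API.

```python
from urllib.parse import urlencode, urlparse

ALLOWED_FORMATS = {"html", "text", "markdown", "json"}
MAX_TIMEOUT_MS = 120_000  # server-side maximum per the table above

def sync_crawl_query(url, result_format="html", timeout=30_000):
    """Validate parameters and return the query string for /api/v1/crawls/sync."""
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError("url must be HTTP or HTTPS")
    if result_format not in ALLOWED_FORMATS:
        raise ValueError("result_format must be html, text, markdown, or json")
    if not 0 < timeout <= MAX_TIMEOUT_MS:
        raise ValueError("timeout must be between 1 and 120000 milliseconds")
    return urlencode({"url": url, "result_format": result_format, "timeout": timeout})
```

Note that `urlencode` percent-encodes the target URL, which is the safe way to pass one URL inside another's query string.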

Usage Examples

Basic Request

```shell
curl -X GET "https://your-server/api/v1/crawls/sync?url=https://example.com" \
  -H "Authorization: Bearer sk_your_api_key"
```

With Markdown Format

```shell
curl -X GET "https://your-server/api/v1/crawls/sync?url=https://example.com&result_format=markdown" \
  -H "Authorization: Bearer sk_your_api_key"
```

Custom Timeout

```shell
curl -X GET "https://your-server/api/v1/crawls/sync?url=https://example.com&timeout=60000" \
  -H "Authorization: Bearer sk_your_api_key"
```

Output Formats

  • html: Raw HTML content (default). Returns the full HTML source.
  • text: Plain text with HTML tags removed. Clean, readable content.
  • markdown: Converted to Markdown format. Great for AI processing.
  • json: Structured JSON with metadata. Full parsing details.

Response Format

Success Response (200 OK)

```json
{
  "data": [
    {
      "url": "https://example.com",
      "title": "Example Domain",
      "description": "Example Description",
      "content": "<html>...</html>",
      "content_type": "html",
      "status_code": 200,
      "response_time_ms": 150,
      "content_length": 1256,
      "fetched_at": "2024-01-15T10:30:00Z"
    }
  ],
  "errors": []
}
```

Error Response

```json
{
  "data": [],
  "errors": [
    {
      "url": "https://example.com",
      "category": "fetch_failed",
      "message": "Connection refused"
    }
  ]
}
```
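Because a 200 OK can still carry per-URL crawl failures, clients should branch on the `errors` array rather than the HTTP status alone. A minimal sketch (the names here are illustrative):

```python
class CrawlError(Exception):
    """Raised when a sync-crawl response body reports per-URL errors."""
    def __init__(self, errors):
        self.errors = errors
        super().__init__("; ".join(f"{e['category']}: {e['message']}" for e in errors))

def unwrap(body):
    """Return the result list from a decoded sync-crawl response body,
    raising CrawlError if the `errors` array is non-empty."""
    if body.get("errors"):
        raise CrawlError(body["errors"])
    return body["data"]
```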

Error Handling

HTTP Status Codes

| Status | Description |
|---|---|
| 200 | Success (check errors array for crawl failures) |
| 401 | Missing or invalid API key |
| 403 | Public API key used (requires private key) |
| 408 | Request timeout |
| 422 | Invalid URL or parameters |
| 429 | Rate limit exceeded |

Error Categories

| Category | Description |
|---|---|
| fetch_failed | Network error (connection refused, DNS failure, etc.) |
| robots_blocked | URL blocked by robots.txt |
| content_too_large | Response exceeded 10MB limit |
| no_result | No response from crawler |
| internal_error | Unexpected error during crawling |

Rate Limiting
Sync crawls have a lower concurrency limit than async crawls. The response includes a Retry-After header when limits are exceeded. Use async crawls for batch operations.
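One way to honor Retry-After is to parse its delta-seconds form before sleeping and retrying. A sketch, assuming the header carries seconds (the header can also be an HTTP date, which this helper treats as the fallback):

```python
def retry_after_seconds(headers, default=1.0):
    """Read a delta-seconds Retry-After value from a header mapping.

    Falls back to `default` when the header is absent or not numeric
    (e.g. the HTTP-date form)."""
    try:
        return max(float(headers.get("Retry-After")), 0.0)
    except (TypeError, ValueError):
        return default
```

With urllib, a 429 response raises an `HTTPError`; its `headers` attribute can be passed to this helper before retrying.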

Security Features

SSRF Protection

Mulberry blocks access to internal and private addresses to prevent server-side request forgery:

  • localhost and 127.0.0.0/8
  • 10.0.0.0/8 (private network)
  • 172.16.0.0/12 (private network)
  • 192.168.0.0/16 (private network)
  • 169.254.0.0/16 (link-local)
  • ::1 (IPv6 loopback)
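For illustration, the blocked ranges above can be expressed with Python's ipaddress module. This mirrors the published list, not Mulberry's actual implementation, and a real check would also need to resolve hostnames such as localhost to an IP first.

```python
import ipaddress

# The networks from the list above.
BLOCKED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "127.0.0.0/8", "10.0.0.0/8", "172.16.0.0/12",
    "192.168.0.0/16", "169.254.0.0/16", "::1/128",
)]

def is_blocked(ip_text):
    """Return True if the IP falls inside a blocked private/loopback range."""
    ip = ipaddress.ip_address(ip_text)
    # Membership tests across IP versions simply return False, so mixing
    # IPv4 and IPv6 networks in one list is safe.
    return any(ip in net for net in BLOCKED_NETWORKS)
```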

Content Size Limits

Responses larger than 10MB are rejected with a content_too_large error. Adjust the timeout parameter as needed for slow-loading pages.

MCP Integration

The sync crawl is also available as an MCP tool for AI agents:

MCP Tool Call

```json
{
  "url": "https://example.com",
  "result_format": "markdown",
  "timeout": 30000
}
```

When to Use Sync Crawls
Use sync crawls when you need immediate results for a single page. For multi-page crawls, site traversal, or webhook notifications, use async crawls.

Next Steps