Your First Crawl

This guide covers creating crawl jobs, understanding crawl options, monitoring progress, and retrieving results.

Creating a Crawl

Crawls are created via the REST API using a POST request:

curl -X POST \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "depth": 2,
    "format": "markdown"
  }' \
  https://your-server/api/crawls
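
Assuming the creation response uses the same data envelope as the status endpoint shown later, it will include the new crawl's id (for example, crawl_abc123). Keep this id; every later request refers to it. A rough sketch of the response:

{
  "data": {
    "id": "crawl_abc123",
    "status": "pending"
  }
}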

Crawl Options

Basic Options

  • url (required) - The starting URL to crawl
  • depth - How many levels deep to follow links (default: 1)
  • format - Output format: html, text, markdown, or json (default: markdown)

Crawl Modes

Mulberry supports two crawl modes:

  • Site traversal (default) - Starts at a URL and follows links within the same domain
  • URL list - Crawls a specific list of URLs without following links

To use URL list mode, provide a urls array instead of a single url and set mode to url_list:

{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "mode": "url_list"
}
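
Putting this together, a URL list crawl goes to the same endpoint with the same headers; only the request body changes:

curl -X POST \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/page1",
      "https://example.com/page2"
    ],
    "mode": "url_list"
  }' \
  https://your-server/api/crawls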

Performance Options

  • max_workers - Number of concurrent workers (default: 5)
  • delay_ms - Delay between requests in milliseconds (default: 100)
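
Both options can be combined with the basic options above. For example, to crawl more gently with fewer workers and a longer delay (the values here are illustrative, not recommendations):

{
  "url": "https://docs.example.com",
  "depth": 2,
  "max_workers": 2,
  "delay_ms": 500
}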

Filtering Options

Control which pages are crawled using regex patterns:

{
  "url": "https://docs.example.com",
  "include_patterns": ["^/docs/", "^/api/"],
  "exclude_patterns": ["/legacy/", "/deprecated/"]
}

  • include_patterns - Only crawl URLs matching these patterns
  • exclude_patterns - Skip URLs matching these patterns

Monitoring Progress

Check crawl status by fetching the crawl details:

curl -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/crawls/crawl_abc123

The response includes the current status and progress:

{
  "data": {
    "id": "crawl_abc123",
    "url": "https://docs.example.com",
    "status": "running",
    "pages_crawled": 47,
    "pages_total": 120,
    "started_at": "2024-01-15T10:30:00Z"
  }
}
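
To wait for completion from a script, poll this endpoint until the status is no longer pending or running. A minimal shell sketch, assuming jq is installed:

# Poll the crawl status every 5 seconds until it reaches a terminal state
while true; do
  status=$(curl -s -H "Authorization: Bearer sk_your_api_key" \
    https://your-server/api/crawls/crawl_abc123 | jq -r '.data.status')
  echo "status: $status"
  if [ "$status" != "pending" ] && [ "$status" != "running" ]; then
    break
  fi
  sleep 5
done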

Crawl Statuses

  • pending - Crawl is queued, waiting to start
  • running - Crawl is actively processing pages
  • completed - Crawl finished successfully
  • failed - Crawl encountered an error
  • cancelled - Crawl was manually stopped

Retrieving Results

Once a crawl completes, retrieve the results:

curl -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/crawls/crawl_abc123/pages

This returns the crawled pages with their content, paginated:

{
  "data": [
    {
      "url": "https://docs.example.com/",
      "title": "Documentation Home",
      "content": "# Welcome to our docs...",
      "status_code": 200,
      "crawled_at": "2024-01-15T10:30:15Z"
    },
    {
      "url": "https://docs.example.com/getting-started",
      "title": "Getting Started",
      "content": "# Getting Started\n\nThis guide...",
      "status_code": 200,
      "crawled_at": "2024-01-15T10:30:18Z"
    }
  ],
  "meta": {
    "total": 120,
    "page": 1,
    "per_page": 20
  }
}
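
To pull fields out of this response for further processing, any JSON tool will do. For example, with jq (assumed installed), this prints the status code, URL, and title of each page in the current page of results:

curl -s -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/crawls/crawl_abc123/pages | \
  jq -r '.data[] | "\(.status_code) \(.url) \(.title)"'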

Pagination

Results are paginated. Use the page and per_page query parameters:

curl -H "Authorization: Bearer sk_your_api_key" \
  "https://your-server/api/crawls/crawl_abc123/pages?page=2&per_page=50"

Download All Results

For large crawls, you can download all results as a single file:

curl -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/crawls/crawl_abc123/download
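
To save the result to disk instead of printing it to stdout, add curl's -o flag. The filename below is only an example, since the export file format isn't covered in this guide:

curl -H "Authorization: Bearer sk_your_api_key" \
  -o crawl_abc123_export \
  https://your-server/api/crawls/crawl_abc123/download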

Cancelling a Crawl

Stop a running crawl with a DELETE request:

curl -X DELETE \
  -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/crawls/crawl_abc123

Data Retention

Crawl data is retained for a configurable period (default: 3 days). After this, the crawl record remains but page content is deleted. Configure retention in your account settings.

Next Steps