Your First Crawl

This guide covers creating crawl jobs, understanding crawl options, monitoring progress, and retrieving results.

Looking for a quick single-page crawl? Check out Sync Crawls for immediate results without creating a background job.

Creating a Crawl

Crawls are created via the REST API using a POST request:

POST /api/v1/crawls
Request
curl -X POST \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "crawl_job": {
      "start_url": "https://docs.example.com",
      "mode": "website",
      "max_depth": 2,
      "result_format": "markdown"
    }
  }' \
  https://your-server/api/v1/crawls

Crawl Options

Basic Options

All parameters must be nested under a crawl_job key in the request body.

start_url (required): The starting URL to crawl
mode (required): Crawl mode: website (recursive) or url_list (non-recursive)
max_depth (optional): How many levels deep to follow links (default: 3, max: 10)
max_urls (optional): Maximum URLs to crawl (default: 100, max: 1000)
result_format (optional): Output format: html, text, markdown, or json (default: text)

Crawl Modes

Mulberry supports two crawl modes:

Website Mode (default)

Starts at a URL and follows links within the same domain:

{
  "crawl_job": {
    "start_url": "https://example.com",
    "mode": "website",
    "max_depth": 2
  }
}

URL List Mode

Crawls a specific URL without following links

{
  "crawl_job": {
    "start_url": "https://example.com/page1",
    "mode": "url_list"
  }
}

Advanced Options

max_workers (default: 5): Number of concurrent workers (max: 10)
emit_page_events (default: false): Emit a webhook event for each page crawled
delete_on_retrieve (default: false): Delete results after retrieval
results_expires_in_days (default: 7): Days until results expire (max: 7)
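Putting these together, a request body using the advanced options might look like the following (the values are illustrative, not recommendations):

```json
{
  "crawl_job": {
    "start_url": "https://docs.example.com",
    "mode": "website",
    "max_workers": 10,
    "emit_page_events": true,
    "delete_on_retrieve": true,
    "results_expires_in_days": 3
  }
}
```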

Filtering Options

Control which pages get crawled using regex patterns:

Example with filters
{
  "crawl_job": {
    "start_url": "https://docs.example.com",
    "mode": "website",
    "include_patterns": ["^/docs/", "^/api/"],
    "exclude_patterns": ["/legacy/", "/deprecated/"]
  }
}
include_patterns: Only crawl URLs matching these patterns
exclude_patterns: Skip URLs matching these patterns
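As a rough mental model, the filters can be thought of as a two-step check: a URL must match at least one include pattern (when any are given) and must match no exclude pattern. The sketch below is an approximation using Python's re.search against the URL path; the server's exact matching semantics (full URL vs. path, anchoring) may differ.

```python
import re

def should_crawl(path, include_patterns=None, exclude_patterns=None):
    """Approximate include/exclude filtering: the path must match at least
    one include pattern (if any are given) and no exclude pattern."""
    if include_patterns and not any(re.search(p, path) for p in include_patterns):
        return False
    if exclude_patterns and any(re.search(p, path) for p in exclude_patterns):
        return False
    return True

# Using the filters from the example above:
include = ["^/docs/", "^/api/"]
exclude = ["/legacy/", "/deprecated/"]

print(should_crawl("/docs/intro", include, exclude))       # True
print(should_crawl("/docs/legacy/old", include, exclude))  # False (excluded)
print(should_crawl("/blog/post", include, exclude))        # False (no include match)
```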

Monitoring Progress

Check crawl status by fetching the crawl details:

GET /api/v1/crawls/:id
Request
curl -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/v1/crawls/crawl_abc123

Response includes current status and progress:

Response
{
  "data": {
    "id": "crawl_abc123",
    "url": "https://docs.example.com",
    "status": "running",
    "pages_crawled": 47,
    "pages_total": 120,
    "started_at": "2024-01-15T10:30:00Z"
  }
}

Crawl Statuses

pending: Crawl is queued, waiting to start
running: Crawl is actively processing pages
completed: Crawl finished successfully
failed: Crawl encountered an error
cancelled: Crawl was manually stopped
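A common pattern is to poll the status endpoint until the crawl reaches one of the terminal statuses (completed, failed, or cancelled). This is a minimal sketch: fetch_status is a caller-supplied callable that performs the GET request above and returns the status string; the function name and interval are illustrative, and a production client would likely add backoff.

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}

def wait_for_crawl(fetch_status, interval=5.0, max_attempts=60):
    """Poll until the crawl reaches a terminal status, then return it."""
    for _ in range(max_attempts):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval)
    raise TimeoutError("crawl did not reach a terminal status in time")
```

For example, `wait_for_crawl(lambda: get_crawl("crawl_abc123")["data"]["status"])` would block until the crawl finishes (where `get_crawl` is whatever helper your client uses to call GET /api/v1/crawls/:id).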

Retrieving Results

Once a crawl completes, retrieve the results:

GET /api/v1/crawls/:crawl_id/results
Request
curl -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/v1/crawls/crawl_abc123/results

This returns all crawled pages with their content. The content field is formatted according to the result_format specified when creating the crawl (html, text, markdown, or json).

Response
{
  "data": [
    {
      "url": "https://docs.example.com/",
      "title": "Documentation Home",
      "content": "# Welcome to our docs...",
      "status_code": 200,
      "crawled_at": "2024-01-15T10:30:15Z"
    },
    {
      "url": "https://docs.example.com/getting-started",
      "title": "Getting Started",
      "content": "# Getting Started\n\nThis guide...",
      "status_code": 200,
      "crawled_at": "2024-01-15T10:30:18Z"
    }
  ],
  "meta": {
    "total": 120,
    "page": 1,
    "per_page": 20
  }
}

Pagination

Results are paginated. Use page and per_page parameters:

Paginated request
curl -H "Authorization: Bearer sk_your_api_key" \
  "https://your-server/api/v1/crawls/crawl_abc123/results?page=2&per_page=50"

Data Retention

Crawl results are retained for a limited period. Use the results_expires_in_days parameter when creating a crawl to shorten retention (default: 7 days, which is also the maximum). After expiration, the crawl record remains but page content is deleted.

Next Steps