Your First Crawl

This guide covers creating crawl jobs, understanding crawl options, monitoring progress, and retrieving results.

Creating a Crawl

Crawls are created via the REST API using a POST request:

curl -X POST \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "depth": 2,
    "format": "markdown"
  }' \
  https://your-server/api/crawls
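
Assuming the creation response uses the same data envelope as the status endpoint shown later, it will include the new crawl's id (for example, crawl_abc123). Keep this id; every later request refers to it. A rough sketch of the response:

{
  "data": {
    "id": "crawl_abc123",
    "status": "pending"
  }
}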

Crawl Options

Basic Options

  • url (required) - The starting URL to crawl
  • depth - How many levels deep to follow links (default: 1)
  • format - Output format: html, text, markdown, or json (default: markdown)

Crawl Modes

Mulberry supports two crawl modes:

  • Site traversal (default) - Starts at a URL and follows links within the same domain
  • URL list - Crawls a specific list of URLs without following links

To use URL list mode, provide a urls array instead of a single url and set mode to url_list:

{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "mode": "url_list"
}
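
Putting this together, a URL list crawl goes to the same endpoint with the same headers; only the request body changes:

curl -X POST \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/page1",
      "https://example.com/page2"
    ],
    "mode": "url_list"
  }' \
  https://your-server/api/crawls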

Performance Options

  • max_workers - Number of concurrent workers (default: 5)
  • delay_ms - Delay between requests in milliseconds (default: 100)
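
Both options can be combined with the basic options above. For example, to crawl more gently with fewer workers and a longer delay (the values here are illustrative, not recommendations):

{
  "url": "https://docs.example.com",
  "depth": 2,
  "max_workers": 2,
  "delay_ms": 500
}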

Filtering Options

Control which pages are crawled using regex patterns:

{
  "url": "https://docs.example.com",
  "include_patterns": ["^/docs/", "^/api/"],
  "exclude_patterns": ["/legacy/", "/deprecated/"]
}

  • include_patterns - Only crawl URLs matching these patterns
  • exclude_patterns - Skip URLs matching these patterns

Monitoring Progress

Check crawl status by fetching the crawl details:

curl -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/crawls/crawl_abc123

The response includes the current status and progress:

{
  "data": {
    "id": "crawl_abc123",
    "url": "https://docs.example.com",
    "status": "running",
    "pages_crawled": 47,
    "pages_total": 120,
    "started_at": "2024-01-15T10:30:00Z"
  }
}
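
To wait for completion from a script, poll this endpoint until the status is no longer pending or running. A minimal shell sketch, assuming jq is installed:

# Poll the crawl status every 5 seconds until it reaches a terminal state
while true; do
  status=$(curl -s -H "Authorization: Bearer sk_your_api_key" \
    https://your-server/api/crawls/crawl_abc123 | jq -r '.data.status')
  echo "status: $status"
  if [ "$status" != "pending" ] && [ "$status" != "running" ]; then
    break
  fi
  sleep 5
done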

Crawl Statuses

  • pending - Crawl is queued, waiting to start
  • running - Crawl is actively processing pages
  • completed - Crawl finished successfully
  • failed - Crawl encountered an error
  • cancelled - Crawl was manually stopped

Retrieving Results

Once a crawl completes, retrieve the results:

curl -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/crawls/crawl_abc123/pages

This returns the crawled pages with their content, paginated:

{
  "data": [
    {
      "url": "https://docs.example.com/",
      "title": "Documentation Home",
      "content": "# Welcome to our docs...",
      "status_code": 200,
      "crawled_at": "2024-01-15T10:30:15Z"
    },
    {
      "url": "https://docs.example.com/getting-started",
      "title": "Getting Started",
      "content": "# Getting Started\n\nThis guide...",
      "status_code": 200,
      "crawled_at": "2024-01-15T10:30:18Z"
    }
  ],
  "meta": {
    "total": 120,
    "page": 1,
    "per_page": 20
  }
}
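
To pull fields out of this response for further processing, any JSON tool will do. For example, with jq (assumed installed), this prints the status code, URL, and title of each page in the current page of results:

curl -s -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/crawls/crawl_abc123/pages | \
  jq -r '.data[] | "\(.status_code) \(.url) \(.title)"'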

Pagination

Results are paginated. Use the page and per_page query parameters:

curl -H "Authorization: Bearer sk_your_api_key" \
  "https://your-server/api/crawls/crawl_abc123/pages?page=2&per_page=50"

Download All Results

For large crawls, you can download all results as a single file:

curl -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/crawls/crawl_abc123/download
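
To save the result to disk instead of printing it to stdout, add curl's -o flag. The filename below is only an example, since the export file format isn't covered in this guide:

curl -H "Authorization: Bearer sk_your_api_key" \
  -o crawl_abc123_export \
  https://your-server/api/crawls/crawl_abc123/download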

Cancelling a Crawl

Stop a running crawl with a DELETE request:

curl -X DELETE \
  -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/crawls/crawl_abc123

Data Retention

Crawl data is retained for a configurable period (default: 3 days). After this, the crawl record remains but page content is deleted. Configure retention in your account settings.

Next Steps