Your First Crawl
This guide covers creating crawl jobs, understanding crawl options, monitoring progress, and retrieving results.
Check out Sync Crawls for immediate results without creating a background job.
Creating a Crawl
Crawls are created via the REST API using a POST request:
```shell
curl -X POST \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "crawl_job": {
      "start_url": "https://docs.example.com",
      "mode": "website",
      "max_depth": 2,
      "result_format": "markdown"
    }
  }' \
  https://your-server/api/v1/crawls
```

Crawl Options
Basic Options
All parameters must be nested under a crawl_job key in the request body.
| Parameter | Required | Description |
|---|---|---|
| start_url | Required | The starting URL to crawl |
| mode | Required | Crawl mode: website (recursive) or url_list (non-recursive) |
| max_depth | Optional | How many levels deep to follow links (default: 3, max: 10) |
| max_urls | Optional | Maximum URLs to crawl (default: 100, max: 1000) |
| result_format | Optional | Output format: html, text, markdown, or json (default: text) |
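The basic options can be assembled into a request body client-side. The sketch below builds and sanity-checks a crawl_job payload against the limits in the table above; build_crawl_job is a hypothetical helper for illustration, not part of the API, and the server still performs its own validation.

```python
import json

# Allowed values taken from the options table above.
VALID_MODES = {"website", "url_list"}
VALID_FORMATS = {"html", "text", "markdown", "json"}

def build_crawl_job(start_url, mode, max_depth=3, max_urls=100, result_format="text"):
    """Build a crawl_job payload, rejecting values outside the documented limits."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    if not 1 <= max_depth <= 10:
        raise ValueError("max_depth must be between 1 and 10")
    if not 1 <= max_urls <= 1000:
        raise ValueError("max_urls must be between 1 and 1000")
    if result_format not in VALID_FORMATS:
        raise ValueError(f"result_format must be one of {sorted(VALID_FORMATS)}")
    return {"crawl_job": {
        "start_url": start_url,
        "mode": mode,
        "max_depth": max_depth,
        "max_urls": max_urls,
        "result_format": result_format,
    }}

payload = build_crawl_job("https://docs.example.com", "website",
                          max_depth=2, result_format="markdown")
print(json.dumps(payload, indent=2))
```

The resulting dictionary serializes to the same JSON shown in the curl example above, with everything nested under the crawl_job key.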
Crawl Modes
Mulberry supports two crawl modes:
Website Mode
Starts at a URL and follows links within the same domain.

```json
{
  "crawl_job": {
    "start_url": "https://example.com",
    "mode": "website",
    "max_depth": 2
  }
}
```

URL List Mode
Crawls a specific URL without following links.

```json
{
  "crawl_job": {
    "start_url": "https://example.com/page1",
    "mode": "url_list"
  }
}
```

Advanced Options
| Parameter | Default | Description |
|---|---|---|
| max_workers | 5 | Number of concurrent workers (max: 10) |
| emit_page_events | false | Emit webhook events for each page crawled |
| delete_on_retrieve | false | Delete results after retrieval |
| results_expires_in_days | 7 | Days until results expire (max: 7) |
Filtering Options
Control which pages get crawled using regex patterns:
```json
{
  "crawl_job": {
    "start_url": "https://docs.example.com",
    "mode": "website",
    "include_patterns": ["^/docs/", "^/api/"],
    "exclude_patterns": ["/legacy/", "/deprecated/"]
  }
}
```

| Parameter | Description |
|---|---|
| include_patterns | Only crawl URLs matching these patterns |
| exclude_patterns | Skip URLs matching these patterns |
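The combined effect of the two pattern lists can be illustrated with a small sketch. Note this is an assumption about the semantics, not the server's actual implementation: it assumes patterns are regexes matched anywhere against the URL path (consistent with the anchored ^/docs/ examples above), that a URL must match at least one include pattern when any are given, and that any exclude match wins.

```python
import re

def should_crawl(path, include_patterns=None, exclude_patterns=None):
    """Assumed filtering semantics: path must match some include pattern
    (if include_patterns is non-empty) and must match no exclude pattern."""
    if include_patterns and not any(re.search(p, path) for p in include_patterns):
        return False
    if exclude_patterns and any(re.search(p, path) for p in exclude_patterns):
        return False
    return True

include = ["^/docs/", "^/api/"]
exclude = ["/legacy/", "/deprecated/"]
print(should_crawl("/docs/intro", include, exclude))       # True
print(should_crawl("/docs/legacy/old", include, exclude))  # False: excluded
print(should_crawl("/blog/post", include, exclude))        # False: not included
```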
Monitoring Progress
Check crawl status by fetching the crawl details:
```shell
curl -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/v1/crawls/crawl_abc123
```

The response includes the current status and progress:
```json
{
  "data": {
    "id": "crawl_abc123",
    "url": "https://docs.example.com",
    "status": "running",
    "pages_crawled": 47,
    "pages_total": 120,
    "started_at": "2024-01-15T10:30:00Z"
  }
}
```

Crawl Statuses

The status field tracks the crawl lifecycle: a crawl reports running while pages are being fetched, and its results become available for retrieval once it completes.
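A common pattern is to poll the status endpoint until the crawl leaves the running state. The sketch below takes an injected fetch_status callable so it stays HTTP-client-agnostic; wait_for_crawl is a hypothetical helper, not part of the API.

```python
import time

def wait_for_crawl(fetch_status, interval=2.0, timeout=600.0):
    """Poll a crawl until its status is no longer "running".

    fetch_status: a callable returning the crawl's `data` dict, e.g. by
    GET-ing /api/v1/crawls/<id> with your HTTP client of choice and
    decoding the JSON response.
    """
    deadline = time.monotonic() + timeout
    while True:
        data = fetch_status()
        if data["status"] != "running":
            return data
        if time.monotonic() >= deadline:
            raise TimeoutError("crawl did not finish in time")
        time.sleep(interval)
```

Because the fetcher is injected, the helper works with any HTTP library and is easy to exercise with a stub in tests.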
Retrieving Results
Once a crawl completes, retrieve the results:
```shell
curl -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/v1/crawls/crawl_abc123/results
```

This returns all crawled pages with their content. The content field is formatted according to the result_format specified when creating the crawl (html, text, markdown, or json).
```json
{
  "data": [
    {
      "url": "https://docs.example.com/",
      "title": "Documentation Home",
      "content": "# Welcome to our docs...",
      "status_code": 200,
      "crawled_at": "2024-01-15T10:30:15Z"
    },
    {
      "url": "https://docs.example.com/getting-started",
      "title": "Getting Started",
      "content": "# Getting Started\n\nThis guide...",
      "status_code": 200,
      "crawled_at": "2024-01-15T10:30:18Z"
    }
  ],
  "meta": {
    "total": 120,
    "page": 1,
    "per_page": 20
  }
}
```

Pagination
Results are paginated. Use page and per_page parameters:
```shell
curl -H "Authorization: Bearer sk_your_api_key" \
  "https://your-server/api/v1/crawls/crawl_abc123/results?page=2&per_page=50"
```

Data Retention
Crawl results are retained for a configurable period. Set retention with the results_expires_in_days parameter when creating a crawl (default and maximum: 7 days). After expiration, the crawl record remains but page content is deleted.
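Putting retrieval and pagination together, the sketch below walks every results page using the page, per_page, and meta.total fields shown above. fetch_all_results and its injected get_json callable are hypothetical helpers for illustration, not part of the API.

```python
def fetch_all_results(get_json, crawl_id, per_page=50):
    """Collect every crawled page for a finished crawl.

    get_json: a callable that takes a path (with query string), performs an
    authenticated GET against your server, and returns the decoded JSON body.
    """
    results, page = [], 1
    while True:
        body = get_json(f"/api/v1/crawls/{crawl_id}/results"
                        f"?page={page}&per_page={per_page}")
        results.extend(body["data"])
        # Stop when we have everything, or the server returns an empty page.
        if len(results) >= body["meta"]["total"] or not body["data"]:
            break
        page += 1
    return results
```

Wrapping the HTTP call in get_json keeps the pagination logic independent of any particular client library and makes it trivial to test against canned responses.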