Your First Crawl
This guide covers creating crawl jobs, understanding crawl options, monitoring progress, and retrieving results.
Creating a Crawl
Crawls are created via the REST API using a POST request:
curl -X POST \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "depth": 2,
    "format": "markdown"
  }' \
  https://your-server/api/crawls

Crawl Options
Basic Options
- `url` (required) - The starting URL to crawl
- `depth` - How many levels deep to follow links (default: 1)
- `format` - Output format: `html`, `text`, `markdown`, or `json` (default: `markdown`)
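If you are scripting against the API rather than using curl, the same request can be made from code. The sketch below is a non-authoritative example that assumes Python's `requests` package; the server URL and API key are the placeholder values used throughout this guide.

```python
# Sketch: create a crawl with the basic options above.
# Assumes the `requests` package is installed; replace the server URL
# and API key with your own values.
import requests

API_BASE = "https://your-server/api"
HEADERS = {"Authorization": "Bearer sk_your_api_key"}

payload = {
    "url": "https://docs.example.com",  # starting URL (required)
    "depth": 2,                         # follow links two levels deep
    "format": "markdown",               # return page content as Markdown
}

resp = requests.post(f"{API_BASE}/crawls", headers=HEADERS, json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())  # the guide doesn't show the creation response, so inspect it here
```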
Crawl Modes
Mulberry supports two crawl modes:
- Site traversal (default) - Starts at a URL and follows links within the same domain
- URL list - Crawls a specific list of URLs without following links
To use URL list mode:
{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "mode": "url_list"
}

Performance Options
- `max_workers` - Number of concurrent workers (default: 5)
- `delay_ms` - Delay between requests in milliseconds (default: 100)
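These options go in the same request body as the rest. As a rough sketch, here is a create request with more concurrency but a longer per-request delay (same assumed `requests` setup and placeholder values as above):

```python
import requests

# Sketch: combine performance options with the basic options in one request.
payload = {
    "url": "https://docs.example.com",
    "depth": 2,
    "max_workers": 10,  # run up to 10 concurrent workers instead of the default 5
    "delay_ms": 500,    # wait 500 ms between requests instead of the default 100
}

resp = requests.post(
    "https://your-server/api/crawls",
    headers={"Authorization": "Bearer sk_your_api_key"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
```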
Filtering Options
Control which pages get crawled using regex patterns:
{
  "url": "https://docs.example.com",
  "include_patterns": ["^/docs/", "^/api/"],
  "exclude_patterns": ["/legacy/", "/deprecated/"]
}

- `include_patterns` - Only crawl URLs matching these patterns
- `exclude_patterns` - Skip URLs matching these patterns
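To sanity-check patterns before launching a crawl, you can mirror the filter locally. The sketch below assumes, as the `^/docs/` example suggests, that patterns are matched against the URL path, and that a URL is kept when it matches any include pattern and no exclude pattern; treat it as an illustration of the intent rather than the server's exact matching rules.

```python
import re
from urllib.parse import urlparse

# Illustration only: preview which URLs the patterns above would keep,
# assuming path-based matching (any include must match, no exclude may match).
include_patterns = [r"^/docs/", r"^/api/"]
exclude_patterns = [r"/legacy/", r"/deprecated/"]

def would_crawl(url: str) -> bool:
    path = urlparse(url).path
    included = any(re.search(p, path) for p in include_patterns)
    excluded = any(re.search(p, path) for p in exclude_patterns)
    return included and not excluded

for url in [
    "https://docs.example.com/docs/intro",         # kept
    "https://docs.example.com/docs/legacy/setup",  # excluded
    "https://docs.example.com/blog/announcement",  # not included
]:
    print(would_crawl(url), url)
```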
Monitoring Progress
Check crawl status by fetching the crawl details:
curl -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/crawls/crawl_abc123

The response includes the current status and progress:
{
  "data": {
    "id": "crawl_abc123",
    "url": "https://docs.example.com",
    "status": "running",
    "pages_crawled": 47,
    "pages_total": 120,
    "started_at": "2024-01-15T10:30:00Z"
  }
}

Crawl Statuses
- `pending` - Crawl is queued, waiting to start
- `running` - Crawl is actively processing pages
- `completed` - Crawl finished successfully
- `failed` - Crawl encountered an error
- `cancelled` - Crawl was manually stopped
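From a script, the usual pattern is to poll the status endpoint until one of the terminal statuses (completed, failed, or cancelled) is reached. A minimal sketch, assuming the `requests` package and an arbitrary 10-second polling interval:

```python
import time
import requests

API_BASE = "https://your-server/api"
HEADERS = {"Authorization": "Bearer sk_your_api_key"}
crawl_id = "crawl_abc123"

# Poll the crawl until it reaches a terminal status.
while True:
    resp = requests.get(f"{API_BASE}/crawls/{crawl_id}", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    data = resp.json()["data"]
    print(f"{data['status']}: {data.get('pages_crawled', 0)}/{data.get('pages_total', '?')} pages")
    if data["status"] in ("completed", "failed", "cancelled"):
        break
    time.sleep(10)  # arbitrary polling interval
```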
Retrieving Results
Once a crawl completes, retrieve the results:
curl -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/crawls/crawl_abc123/pages

This returns all crawled pages with their content:
{
  "data": [
    {
      "url": "https://docs.example.com/",
      "title": "Documentation Home",
      "content": "# Welcome to our docs...",
      "status_code": 200,
      "crawled_at": "2024-01-15T10:30:15Z"
    },
    {
      "url": "https://docs.example.com/getting-started",
      "title": "Getting Started",
      "content": "# Getting Started\n\nThis guide...",
      "status_code": 200,
      "crawled_at": "2024-01-15T10:30:18Z"
    }
  ],
  "meta": {
    "total": 120,
    "page": 1,
    "per_page": 20
  }
}

Pagination
Results are paginated. Use the `page` and `per_page` query parameters:
curl -H "Authorization: Bearer sk_your_api_key" \
  "https://your-server/api/crawls/crawl_abc123/pages?page=2&per_page=50"
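To collect every result from a script, loop until the `meta.total` count shown in the response above has been reached. A sketch, again assuming the `requests` package:

```python
import requests

API_BASE = "https://your-server/api"
HEADERS = {"Authorization": "Bearer sk_your_api_key"}
crawl_id = "crawl_abc123"

# Walk through the paginated listing and accumulate all crawled pages.
pages = []
page, per_page = 1, 50
while True:
    resp = requests.get(
        f"{API_BASE}/crawls/{crawl_id}/pages",
        headers=HEADERS,
        params={"page": page, "per_page": per_page},
        timeout=30,
    )
    resp.raise_for_status()
    body = resp.json()
    pages.extend(body["data"])
    if page * per_page >= body["meta"]["total"]:
        break
    page += 1

print(f"fetched {len(pages)} pages")
```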
Download All Results

For large crawls, you can download all results as a single file:

curl -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/crawls/crawl_abc123/download
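From code, the export can be streamed straight to disk. In the sketch below the output filename is a placeholder, and the file's actual format depends on the crawl's format setting:

```python
import requests

# Sketch: stream the full export to a local file ("crawl_export" is a placeholder name).
resp = requests.get(
    "https://your-server/api/crawls/crawl_abc123/download",
    headers={"Authorization": "Bearer sk_your_api_key"},
    stream=True,
    timeout=60,
)
resp.raise_for_status()
with open("crawl_export", "wb") as f:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)
```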
Cancelling a Crawl

Stop a running crawl with a DELETE request:
curl -X DELETE \
  -H "Authorization: Bearer sk_your_api_key" \
  https://your-server/api/crawls/crawl_abc123
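The equivalent call from Python is a single DELETE request (sketch, same placeholder values as above):

```python
import requests

# Sketch: cancel a running crawl.
resp = requests.delete(
    "https://your-server/api/crawls/crawl_abc123",
    headers={"Authorization": "Bearer sk_your_api_key"},
    timeout=30,
)
resp.raise_for_status()  # the crawl's status should now report "cancelled"
```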
Data Retention

Crawl data is retained for a configurable period (default: 3 days). After this, the crawl record remains but page content is deleted. Configure retention in your account settings.