Every morning, developers who need to scrape an entire documentation site face the same grind: filtering out duplicate navigation links, stripping feedback buttons, and wrestling with page structures that were never designed for machine consumption. This week, a trending GitHub project called web-crawl-olostep takes a different approach — one that collapses the entire pipeline into a single API call.
How to Crawl an Entire Documentation Site Into Markdown With the Olostep SDK
The project, published by researcher kingabzpro, uses Olostep — a unified API for searching, crawling, scraping, and structuring web data — to extract full documentation sites into clean markdown files. The SDK requires Python 3.11 or higher, and the setup guide points developers to the Olostep dashboard for API key configuration.
Installation is straightforward:

```bash
pip install olostep python-dotenv
```

Create a `.env` file in your project folder:

```text
OLOSTEP_API_KEY=your_api_key_here
```
The core crawling script (`crawl_docs_with_olostep.py`) follows this structure:
```python
import os
import re

from dotenv import load_dotenv
from olostep import OlostepClient

load_dotenv()

# Crawl configuration
START_URL = "https://docs.example.com"
CRAWL_DEPTH = 2
PAGE_LIMIT = 50
OUTPUT_FOLDER = "crawled_docs"
INCLUDE_PATTERNS = ["/docs/"]
EXCLUDE_PATTERNS = ["/blog/", "/community/"]


def url_to_filename(url: str) -> str:
    """Convert a URL to a filesystem-safe filename."""
    return url.replace("https://", "").replace("http://", "").replace("/", "_").replace("?", "_") + ".md"


def clean_markdown(content: str) -> str:
    """Remove unnecessary UI text, repeated line breaks, and feedback prompts."""
    content = re.sub(r'\n{3,}', '\n\n', content)
    content = re.sub(r'(Was this page helpful\?|Edit this page).*', '', content)
    return content.strip()


def save_markdown(url: str, content: str, folder: str):
    """Save cleaned markdown to a file with the source URL at the top."""
    filename = url_to_filename(url)
    filepath = os.path.join(folder, filename)
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(f"<!-- Source: {url} -->\n\n")
        f.write(content)
    print(f"Saved: {filepath}")


def clear_output_folder(folder: str):
    """Remove existing crawled markdown files."""
    if os.path.exists(folder):
        for f in os.listdir(folder):
            if f.endswith('.md'):
                os.remove(os.path.join(folder, f))


# Main crawler
client = OlostepClient(api_key=os.getenv("OLOSTEP_API_KEY"))
clear_output_folder(OUTPUT_FOLDER)
os.makedirs(OUTPUT_FOLDER, exist_ok=True)
crawl = client.crawl(
    url=START_URL,
    max_depth=CRAWL_DEPTH,
    max_pages=PAGE_LIMIT,
    include=INCLUDE_PATTERNS,
    exclude=EXCLUDE_PATTERNS,
    output_format="markdown"
)
for page in crawl.pages:
    cleaned = clean_markdown(page.content)
    save_markdown(page.url, cleaned, OUTPUT_FOLDER)
```
Run the script from the terminal:

```bash
python crawl_docs_with_olostep.py
```

When execution finishes, the `crawled_docs` folder contains one markdown file per crawled page. Every file includes the original URL as a comment at the top, making source tracking straightforward.
What Used to Require Scrapy or Selenium Now Takes a Single API Call
Before this approach, developers had two main options. Scrapy offered deep control but required building a full scraping framework from scratch. Selenium handled JavaScript-rendered pages but introduced browser overhead and fragility. Neither tool was designed for documentation sites specifically, so removing duplicate links, understanding page structure, and converting content into LLM-friendly formats all had to be done manually.
Olostep collapses search, crawling, scraping, and structuring into one API. It natively outputs markdown, plain text, HTML, and structured JSON — formats that large language models can consume directly. The path from URL to usable content shrinks dramatically.
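To make the "LLM-friendly" point concrete: markdown output splits cleanly along headings into JSON-ready records. The sketch below is my own illustration of that step, not part of the Olostep SDK:

```python
import json
import re


def chunk_by_headings(markdown: str, source_url: str) -> list:
    """Split a markdown document into heading-delimited chunks,
    emitting JSON-serializable records an LLM pipeline can ingest."""
    chunks = []
    current = {"source": source_url, "heading": None, "text": ""}
    for line in markdown.splitlines():
        if re.match(r"^#{1,6} ", line):
            if current["text"].strip():
                chunks.append(current)
            current = {"source": source_url, "heading": line.lstrip("# "), "text": ""}
        else:
            current["text"] += line + "\n"
    if current["text"].strip():
        chunks.append(current)
    return chunks
```

Each record keeps its source URL, so a chunk retrieved later can still be traced back to the page it came from.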
The Real Shift: Crawl Results Connect Directly to AI Workflows
Developers no longer spend time assembling a crawling stack. The extracted markdown files feed directly into retrieval-augmented generation pipelines, question-answering systems, and agent-based architectures. The project also includes a Gradio-based web app that lets users input a URL and crawl settings, then preview results without touching the command line.
Launch the web app with:
```bash
python app.py
```

Then open `http://127.0.0.1:7860` in a browser. The full web app code is available at web-crawl-olostep/app.py.
Documentation crawling has stopped being an engineering task and become a single API call.