Every morning, developers who need to scrape an entire documentation site face the same grind: filtering out duplicate navigation links, stripping feedback buttons, and wrestling with page structures that were never designed for machine consumption. This week, a trending GitHub project called web-crawl-olostep takes a different approach — one that collapses the entire pipeline into a single API call.
How to Crawl an Entire Documentation Site Into Markdown With the Olostep SDK
The project, published by researcher kingabzpro, uses Olostep — a unified API for searching, crawling, scraping, and structuring web data — to extract full documentation sites into clean markdown files. The SDK requires Python 3.11 or higher, and the setup guide points developers to the Olostep dashboard for API key configuration.
Installation is straightforward:

```bash
pip install olostep python-dotenv
```

Create a `.env` file in your project folder:

```text
OLOSTEP_API_KEY=your_api_key_here
```
The core crawling script (`crawl_docs_with_olostep.py`) follows this structure:
```python
import os
import re

from dotenv import load_dotenv
from olostep import OlostepClient

load_dotenv()

# Crawl configuration
START_URL = "https://docs.example.com"
CRAWL_DEPTH = 2
PAGE_LIMIT = 50
OUTPUT_FOLDER = "crawled_docs"
INCLUDE_PATTERNS = ["/docs/"]
EXCLUDE_PATTERNS = ["/blog/", "/community/"]


def url_to_filename(url: str) -> str:
    """Convert a URL to a filesystem-safe filename."""
    return url.replace("https://", "").replace("http://", "").replace("/", "_").replace("?", "_") + ".md"


def clean_markdown(content: str) -> str:
    """Remove unnecessary UI text, repeated line breaks, and feedback prompts."""
    content = re.sub(r'\n{3,}', '\n\n', content)
    content = re.sub(r'(Was this page helpful\?|Edit this page).*', '', content)
    return content.strip()


def save_markdown(url: str, content: str, folder: str):
    """Save cleaned markdown to a file with the source URL at the top."""
    filename = url_to_filename(url)
    filepath = os.path.join(folder, filename)
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(f"<!-- Source: {url} -->\n\n")
        f.write(content)
    print(f"Saved: {filepath}")


def clear_output_folder(folder: str):
    """Remove existing crawled markdown files."""
    if os.path.exists(folder):
        for f in os.listdir(folder):
            if f.endswith('.md'):
                os.remove(os.path.join(folder, f))


# Main crawler
client = OlostepClient(api_key=os.getenv("OLOSTEP_API_KEY"))
clear_output_folder(OUTPUT_FOLDER)
os.makedirs(OUTPUT_FOLDER, exist_ok=True)
crawl = client.crawl(
    url=START_URL,
    max_depth=CRAWL_DEPTH,
    max_pages=PAGE_LIMIT,
    include=INCLUDE_PATTERNS,
    exclude=EXCLUDE_PATTERNS,
    output_format="markdown"
)
for page in crawl.pages:
    cleaned = clean_markdown(page.content)
    save_markdown(page.url, cleaned, OUTPUT_FOLDER)
```
Run the script from the terminal:

```bash
python crawl_docs_with_olostep.py
```

When execution finishes, the `crawled_docs` folder contains one markdown file per crawled page. Every file includes the original URL as a comment at the top, making source tracking straightforward.
What Used to Require Scrapy or Selenium Now Takes a Single API Call
Before this approach, developers had two main options. Scrapy offered deep control but required building a full scraping framework from scratch. Selenium handled JavaScript-rendered pages but introduced browser overhead and fragility. Neither tool was designed for documentation sites specifically, so removing duplicate links, understanding page structure, and converting content into LLM-friendly formats all had to be done manually.
Olostep collapses search, crawling, scraping, and structuring into one API. It natively outputs markdown, plain text, HTML, and structured JSON — formats that large language models can consume directly. The path from URL to usable content shrinks dramatically.
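To make the "LLM-friendly" point concrete: markdown output splits cleanly along headings into JSON-ready records. The sketch below is my own illustration of that step, not part of the Olostep SDK:

```python
import json
import re


def chunk_by_headings(markdown: str, source_url: str) -> list:
    """Split a markdown document into heading-delimited chunks,
    emitting JSON-serializable records an LLM pipeline can ingest."""
    chunks = []
    current = {"source": source_url, "heading": None, "text": ""}
    for line in markdown.splitlines():
        if re.match(r"^#{1,6} ", line):
            if current["text"].strip():
                chunks.append(current)
            current = {"source": source_url, "heading": line.lstrip("# "), "text": ""}
        else:
            current["text"] += line + "\n"
    if current["text"].strip():
        chunks.append(current)
    return chunks
```

Each record keeps its source URL, so a chunk retrieved later can still be traced back to the page it came from.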
The Real Shift: Crawl Results Connect Directly to AI Workflows
Developers no longer spend time assembling a crawling stack. The extracted markdown files feed directly into retrieval-augmented generation pipelines, question-answering systems, and agent-based architectures. The project also includes a Gradio-based web app that lets users input a URL and crawl settings, then preview results without touching the command line.
Launch the web app with:
```bash
python app.py
```

Then open `http://127.0.0.1:7860` in a browser. The full web app code is available at web-crawl-olostep/app.py.
Documentation crawling has stopped being an engineering task and become a single API call.