DocPull Turns Websites Into OKF Knowledge Bundles

Google Cloud introduced the Open Knowledge Format as a portable Markdown knowledge layer. DocPull 4.3.0 emits OKF bundles for public web pages, docs, help centers, changelogs, and product sites.

DocPull | Field Note | First-class option | 7 min
Export spec: OKF v0.1
Smoke-tested: 5 bundles
Shape errors: 0
Published: June 13, 2026
DocPull release: 4.3.0
DocPull entrypoints: --format okf, --profile okf
DocPull OKF

Generate an Open Knowledge Format bundle from any public website

One command turns a public page graph into portable Markdown concept files, generated indexes, and a stable corpus manifest.

Command
docpull https://example.com --format okf -o ./site-okf
Bundle tree
site-okf/
  index.md
  corpus.manifest.json
  _root.md
  docs/
    index.md
    getting-started.md

TL;DR

Google Cloud introduced Open Knowledge Format on June 12, 2026 as an open, vendor-neutral specification. Its value is portability: a knowledge bundle is just Markdown files, YAML frontmatter, optional directory indexes, and a small set of conventions agents can rely on.

DocPull 4.3.0 adds first-class OKF output because scraped web pages are one of the most useful raw materials for agent context. The released CLI can fetch public web content, normalize it, and emit an OKF v0.1 bundle with stable manifests, generated indexes, source URLs, hashes, and concept frontmatter.

DocPull is a web scraper first. Documentation is one high-value source, but the same output path applies to product pages, blog posts, release notes, help centers, changelogs, support articles, and other public pages you want agents to understand.

Public websites

Product pages, articles, landing pages, and reference material.

Knowledge surfaces

Help centers, support articles, changelogs, and release notes.

Agent systems

Search indexes, memory layers, citations, and review workflows.

The right framing: OKF is a portable interchange shape for knowledge that should remain useful when the consuming agent, search index, or host application changes.

What OKF standardizes

The OKF v0.1 spec formalizes a pattern many agent teams were already drifting toward: a directory tree of Markdown files that can be read by humans, crawled by ordinary tools, and interpreted predictably by agents.

In OKF, normal Markdown files are concept documents. Each concept carries YAML frontmatter with a required non-empty type. Root index.md can declare okf_version. Other index.md files are directory listings for progressive disclosure. log.md is reserved for chronological update history.
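
As a rough illustration of those file roles, the sketch below sorts bundle paths into root index, directory index, log, concept, or other. The `classify` function and its role names are hypothetical and not part of the spec or of DocPull; it also assumes reserved names are matched case-sensitively and that `log.md` is reserved at any depth.

```python
from pathlib import PurePosixPath

# Reserved filenames named in the OKF v0.1 description above.
RESERVED = {"index.md": "index", "log.md": "log"}


def classify(relative_path: str) -> str:
    """Return the role of a file, given its path relative to the bundle root."""
    path = PurePosixPath(relative_path)
    if path.suffix.lower() != ".md":
        return "other"  # for example corpus.manifest.json
    role = RESERVED.get(path.name)
    if role == "index":
        # Only the root index.md can declare okf_version.
        return "root-index" if len(path.parts) == 1 else "directory-index"
    # Any other Markdown file is treated as a concept document.
    return role or "concept"


for name in [
    "index.md",
    "docs/index.md",
    "docs/getting-started.md",
    "_root.md",
    "corpus.manifest.json",
]:
    print(f"{name}: {classify(name)}")
```

A consumer could use a pass like this to decide which files need frontmatter checks and which are navigation aids.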

That restraint is the feature. A useful OKF bundle should survive being opened in an editor, rendered on GitHub, searched with grep, loaded into a vector index, or handed to an agent.

Why it matters

Agent context usually fails in one of two ways. Either we give the model too much unstructured text and hope retrieval finds the right parts, or we lock knowledge inside one product's schema and make it hard to move later.

OKF sits between those extremes. It keeps the files portable while adding enough structure for a consumer to answer practical questions: What is this document about? Where did it come from? Which directory should I inspect next? Is this a concept file or a generated listing?

For web scraping, this is especially useful. Public pages already have URLs, titles, sections, timestamps, and stable citation needs. That applies to product pages, help centers, knowledge bases, articles, changelogs, and API references. The missing piece is a portable packaging contract for agents.

Scrape (web pages): DocPull fetches URLs or crawls a small site surface.

Normalize (Markdown): HTML becomes clean text with titles, links, and source metadata.

Package (OKF bundle): Concept files, generated indexes, and manifest data stay together.

Use (agent context): The bundle is ready for search, memory, review, or citation.

Concrete output

A scraped product page becomes a normal OKF concept document: producer-defined type, display title, source URL, tags, timestamp, and a Markdown body the agent can read without a custom SDK.

Example OKF concept
---
type: Web Page
title: Example Product Page
description: Normalized public page scraped from example.com.
resource: https://example.com/product
source: https://example.com/product
tags: [product, scraped]
timestamp: 2026-06-13T18:22:00Z
---

# Example Product Page

Clean Markdown body extracted from the public page.

## Pricing

The product starts at $29 per month.

# Citations

[1] [Source page](https://example.com/product)

The important part is not the exact type string. OKF lets producers choose descriptive concept types and requires consumers to tolerate unknown ones. The interoperability surface is the bundle shape: frontmatter first, Markdown body after, normal links and citations throughout.
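
To show how little tooling that shape demands, here is a minimal sketch that splits a concept file into frontmatter fields and Markdown body using only the standard library. The `split_concept` helper is hypothetical, and it handles only flat `key: value` lines; a real consumer would more likely hand the block to a full YAML parser so that lists such as `tags` are decoded properly.

```python
def split_concept(text: str) -> tuple[dict[str, str], str]:
    """Split a concept file into (frontmatter fields, Markdown body).

    Simplification: values are kept as raw strings, not parsed as YAML.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("concept file must start with a frontmatter block")
    try:
        # Index of the closing '---' delimiter.
        end = next(
            i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"
        )
    except StopIteration:
        raise ValueError("frontmatter block is never closed") from None
    fields = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return fields, body


SAMPLE = """---
type: Web Page
title: Example Product Page
resource: https://example.com/product
---

# Example Product Page

Clean Markdown body extracted from the public page.
"""

fields, body = split_concept(SAMPLE)
if not fields.get("type"):
    raise ValueError("OKF concepts need a non-empty type")
print(fields["type"], "|", fields["resource"])
print(body.splitlines()[0])
```

Because the parser splits on the first colon only, URL values such as `resource` survive intact.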

What we added

DocPull 4.3.0 supports two first-class OKF entrypoints:

Generate OKF
docpull https://example.com --format okf -o ./site-okf

# Equivalent profile form
docpull https://example.com --profile okf -o ./site-okf

OKF ships in DocPull 4.3.0 on PyPI. If your installed CLI does not show --format okf or --profile okf, upgrade with pip install -U docpull.

The implementation is intentionally opt-in. The default Markdown output stays unchanged because existing users may parse current filenames and frontmatter. OKF changes reserved filenames and concept metadata, so it belongs behind an explicit format/profile boundary.

Bundle shape

site-okf/
  index.md - Root OKF index. Declares okf_version: "0.1" and lists top-level entries.
  corpus.manifest.json - DocPull manifest. Preserves IDs, hashes, paths, and run identity.
  _root.md - Scraped page concept. Keeps a root or landing page from colliding with index.md.
  docs/index.md - Generated directory listing. Lets agents progressively inspect nested content.
  docs/getting-started.md - Normal concept document. Contains Markdown body plus OKF frontmatter.

Concept files get OKF frontmatter: type, title, description, resource, tags, and timestamp when source metadata is available. DocPull also keeps source as a compatibility extension because existing consumers already use it.

The important naming detail is index.md. OKF reserves it for generated listings, but many websites use landing pages at root or trailing-slash URLs. DocPull writes those scraped concepts to _root.md or _page.md so OKF indexes remain valid.
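
The collision rule can be pictured as a URL-to-path mapping that never produces a concept named index.md. The sketch below is illustrative only and is not DocPull's actual code: the `concept_path` helper is hypothetical, and the split between the two names (site root goes to `_root.md`, nested directory-style URLs go to `<dir>/_page.md`) is an assumption, since the text above does not say when each name is used.

```python
from urllib.parse import urlsplit

# Leaf names that usually mean "the landing page of this directory".
INDEX_NAMES = {"index", "index.html", "index.htm", "index.md"}


def concept_path(url: str) -> str:
    """Map a page URL to a relative concept path that never ends in index.md.

    Assumption: root pages become _root.md, nested landing pages _page.md.
    """
    path = urlsplit(url).path
    parts = [part for part in path.split("/") if part]
    is_directory = path.endswith("/") or not parts
    if parts and parts[-1].lower() in INDEX_NAMES:
        parts = parts[:-1]  # drop the explicit index leaf
        is_directory = True
    if not parts:
        return "_root.md"
    if is_directory:
        return "/".join(parts) + "/_page.md"
    leaf = parts[-1]
    for suffix in (".html", ".htm"):
        leaf = leaf.removesuffix(suffix)
    return "/".join([*parts[:-1], leaf + ".md"])


for url in [
    "https://example.com/",
    "https://example.com/docs/",
    "https://example.com/docs/index.html",
    "https://example.com/docs/getting-started",
]:
    print(url, "->", concept_path(url))
```

Whatever the real mapping looks like, the property worth preserving is the one shown here: scraped content never claims a filename that OKF reserves for generated listings.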

What we tested

We validated the implementation two ways. First, local tests assert the hard OKF rules: every non-reserved Markdown file has parseable frontmatter and non-empty type. Root index.md carries okf_version: "0.1". Nested indexes remain plain listings, and the DocPull manifest reports output_format: okf.

Second, we ran DocPull against real public websites with tight page limits and validated the generated bundles. The live targets below are stable public pages chosen because they are easy to reproduce:

Conformance smoke test

Five generated bundles, zero OKF shape errors. Status: Passed.

Target | Run | Output
Python.org | docs page | 1 concept, 1 index
Pydantic.dev | docs page | 1 concept, 1 index
Django Project | docs page | 1 concept, 1 index
FastAPI | profile okf | 1 concept, 1 index
Python.org | 3-page crawl | 3 concepts, 4 indexes

This is not a retrieval-quality benchmark. It is a conformance smoke test for the generated bundle shape. OKF standardizes packaging; it does not prove source quality, ranking quality, refresh policy, or answer accuracy.

The live pass covered Python, Pydantic, Django, and FastAPI sites. Across five generated bundles, the validator found zero OKF shape errors. The same output path works for other scraped web pages because OKF cares about the generated Markdown bundle, not whether the source page is a docs page.

From the DocPull source checkout, the reproducible check is:

Conformance check
# From the DocPull source checkout
uv run pytest \
  tests/test_outputs_e2e.py::test_okf_output_bundle_local_server \
  tests/test_outputs_e2e.py::test_okf_indexes_include_nested_directories

That command runs the OKF output tests and the nested-index test that enforces the index.md collision behavior.
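
If you want a quick check outside the DocPull test suite, the hard rules listed earlier in this section can be approximated in a few lines. This is a sketch, not DocPull's test code: `read_frontmatter` and `shape_errors` are hypothetical helpers, the frontmatter reader handles only flat `key: value` lines, and it assumes the root index.md declares okf_version in its frontmatter.

```python
import tempfile
from pathlib import Path

RESERVED = {"index.md", "log.md"}


def read_frontmatter(text):
    """Return flat frontmatter fields, or None when no closed block exists."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip("\"'")
    return None  # closing delimiter never found


def shape_errors(bundle: Path) -> list[str]:
    """Collect violations of the shape rules described in this section."""
    errors = []
    root_index = bundle / "index.md"
    root = None
    if root_index.is_file():
        root = read_frontmatter(root_index.read_text(encoding="utf-8"))
    if not root or root.get("okf_version") != "0.1":
        errors.append("index.md: missing okf_version 0.1")
    for path in sorted(bundle.rglob("*.md")):
        if path.name in RESERVED or not path.is_file():
            continue  # indexes and logs are not concept documents
        rel = path.relative_to(bundle).as_posix()
        fields = read_frontmatter(path.read_text(encoding="utf-8"))
        if fields is None:
            errors.append(f"{rel}: no parseable frontmatter")
        elif not fields.get("type"):
            errors.append(f"{rel}: empty or missing type")
    return errors


# Build a tiny throwaway bundle with one deliberate mistake.
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    (bundle / "docs").mkdir()
    (bundle / "index.md").write_text(
        '---\nokf_version: "0.1"\n---\n\n- [_root](_root.md)\n', encoding="utf-8"
    )
    (bundle / "_root.md").write_text(
        "---\ntype: Web Page\n---\n\n# Home\n", encoding="utf-8"
    )
    (bundle / "docs" / "index.md").write_text("- [draft](draft.md)\n", encoding="utf-8")
    (bundle / "docs" / "draft.md").write_text("# Draft\n", encoding="utf-8")
    found = shape_errors(bundle)

for error in found:
    print(error)
```

In this throwaway bundle only docs/draft.md is reported, because it has no frontmatter block at all.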

Where OKF fits

OKF is the packaging layer. Retrieval quality still comes from good source selection, deduplication, chunking, ranking, and citation strategy. The format gives those systems cleaner inputs and a shared shape to work from.

Input: Any public page DocPull can fetch and normalize.
Output: A directory of Markdown concept files plus generated indexes.
Consumer: Agents, search systems, memory layers, and review tools.

That makes it a strong companion to semantic search, access policy, and agent memory systems. The files carry readable knowledge and source metadata; the surrounding application can decide how to rank, secure, cite, and refresh it.

That is why DocPull still writes corpus.manifest.json. OKF gives agents a portable Markdown bundle; the DocPull manifest preserves run identity, content hashes, document IDs, chunk IDs, and relative output paths so regenerated corpora can be diffed and cited.
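
To make the diffing idea concrete without assuming anything about the manifest schema (which is not shown here), the sketch below hashes the Markdown files of two bundle directories directly and reports what was added, removed, or changed. The `content_hashes` and `diff_bundles` helpers are hypothetical; a real workflow would presumably read the hashes DocPull already records in corpus.manifest.json instead of recomputing them.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hashes(bundle: Path) -> dict[str, str]:
    """Map each Markdown file's relative path to the SHA-256 of its bytes."""
    return {
        path.relative_to(bundle).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(bundle.rglob("*.md"))
        if path.is_file()
    }


def diff_bundles(old: Path, new: Path) -> dict[str, list[str]]:
    """Compare two runs by content hash."""
    before, after = content_hashes(old), content_hashes(new)
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(p for p in shared if before[p] != after[p]),
    }


# Two throwaway runs: the landing page changes and a pricing page appears.
with tempfile.TemporaryDirectory() as tmp:
    old, new = Path(tmp, "run-1"), Path(tmp, "run-2")
    old.mkdir()
    new.mkdir()
    (old / "_root.md").write_text("---\ntype: Web Page\n---\n$29 per month\n", encoding="utf-8")
    (new / "_root.md").write_text("---\ntype: Web Page\n---\n$39 per month\n", encoding="utf-8")
    (new / "pricing.md").write_text("---\ntype: Web Page\n---\nPricing\n", encoding="utf-8")
    report = diff_bundles(old, new)

print(report)
```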

Try it

Install or upgrade DocPull, then point it at a website:

CLI
docpull https://example.com --format okf -o ./site-okf

# Equivalent profile form
docpull https://example.com --profile okf -o ./site-okf

Write OKF output to a clean directory. Under the spec, every non-reserved .md file in the bundle tree is treated as a concept document, so mixing unrelated Markdown into the same directory can make the bundle invalid.

The practical next step is simple: generate OKF from your own site, help center, blog, changelog, or docs, open the bundle in an editor, and inspect what an agent would see before adding any vector database or hosted knowledge layer. If the files are clear to a person, they are much easier to make useful to an agent.