The pipeline that turns raw documents into fine-tuning gold: ingest 11 file formats, scrub PII automatically, deduplicate with MinHash LSH, chunk with any of 6 strategies, score quality, then export directly to your embedding API.
```python
from dataclassifier import Pipeline

# Configure once
pipeline = Pipeline(
    pii_action="replace",
    chunking_strategy="semantic",
    max_tokens=512,
    dedup_threshold=0.80,
)

# Process any batch of files
result = pipeline.run(
    files=["contracts/", "tickets.csv"],
    job_id="q4-batch-001",
)
print(result.summary())
# ✓ 341 chunks · 47 PII replaced · $0.018

result.export_jsonl("dataset.jsonl")
# ✓ 341 records written
```
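The `dedup_threshold=0.80` above is a Jaccard-similarity cutoff for the MinHash LSH stage. The library's internals aren't shown here, so the following is a from-scratch sketch of the underlying technique, not dataclassifier's actual code; every name in it (`shingles`, `minhash`, `candidate_pairs`, the band/permutation counts) is illustrative.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64   # hash functions per MinHash signature
BANDS = 16      # LSH bands; NUM_PERM // BANDS rows per band

def shingles(text: str, k: int = 3) -> set[str]:
    """Overlapping k-word shingles: the sets we estimate Jaccard over."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def _h(seed: int, shingle: str) -> int:
    # One seeded 64-bit hash; prefixing the seed makes each "permutation" independent.
    data = seed.to_bytes(4, "big") + shingle.encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash(sh: set[str]) -> tuple[int, ...]:
    """Signature = per-seed minimum hash over the document's shingles."""
    return tuple(min(_h(seed, s) for s in sh) for seed in range(NUM_PERM))

def estimate_jaccard(sig_a, sig_b) -> float:
    """Fraction of agreeing signature slots approximates true Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM

def candidate_pairs(signatures: dict[str, tuple[int, ...]]) -> set[tuple[str, str]]:
    """LSH banding: any two docs that share one full band become duplicate candidates."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(BANDS):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Tiny demo corpus: "a2" duplicates "a"; "c" is unrelated.
docs = {
    "a":  "the quick brown fox jumps over the lazy dog near the old river bank",
    "a2": "the quick brown fox jumps over the lazy dog near the old river bank",
    "c":  "quarterly revenue grew nine percent driven by strong cloud demand",
}
sigs = {doc_id: minhash(shingles(text)) for doc_id, text in docs.items()}
pairs = candidate_pairs(sigs)
# ("a", "a2") collide in every band; "c" lands in its own buckets.
```

A threshold like `0.80` then filters the candidate pairs by `estimate_jaccard`, which is why near-duplicates are caught without comparing every document against every other.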
Run `python3 run_tests.py` and see 319 green lights in 2.7 seconds.