PRIVATE BETA — EARLY ACCESS NOW OPEN
319 TESTS PASSING · ZERO DEPENDENCIES
6 CHUNKING STRATEGIES · 11 FILE FORMATS
CLAUDE CODE MCP INTEGRATION
PII DETECTION · MINHASH DEDUP · QUALITY SCORING
Private Beta — Limited Access

YOUR TRAINING DATA, PRODUCTION READY.

The pipeline that turns raw documents into fine-tuning gold. Ingest 11 file formats, scrub PII automatically, deduplicate with MinHash LSH, chunk with 6 strategies, score quality — then export directly to your embedding API.

11 file formats
6 chunk strategies
11 PII types
0 required deps
319 tests passing
Early Access
GET NOTIFIED
Beta seats are limited. ML teams at the front of the queue get 3 months free on the Growth plan.
247 engineers already on the waitlist
INGEST
CLEAN
CHUNK
CLASSIFY
EMBED
EXPORT


The Pipeline

SIX STAGES.
ONE COMMAND.

1. Ingest: Parse any document into a clean Document object (11 formats)
2. Clean: Normalize, fix encoding, strip boilerplate (UTF-8 safe)
3. PII Scan: Detect 11 entity types, choose your action (replace / redact)
4. Chunk: Split with 6 strategies, score every chunk (quality 0–1.0)
5. Embed: OpenAI or Cohere with cost tracking (coming soon)
6. Export: JSONL, Parquet, HuggingFace, vector DB (fine-tune ready)
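
To make the Clean stage concrete, here is a minimal stdlib-only sketch of what "normalize and tidy" can look like. The function name `clean` and the specific rules (NFKC normalization, whitespace collapsing, blank-line squeezing) are assumptions for illustration, not the product's actual implementation, and the sketch does not attempt encoding repair or boilerplate detection.

```python
import re
import unicodedata

def clean(text: str) -> str:
    """Hypothetical cleaning pass: normalize Unicode and tidy whitespace."""
    # NFKC folds compatibility characters (ligatures, non-breaking spaces).
    text = unicodedata.normalize("NFKC", text)
    # Use one newline convention before working line by line.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs, and trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    # Squeeze three or more consecutive newlines down to one blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()

cleaned = clean("\ufb01ne\u00a0 print\r\n\r\n\r\n\r\nnext  line ")
print(repr(cleaned))
```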
What’s Included

BUILT FOR
ML TEAMS
WHO SHIP.

🔐
PII PROTECTION
Regex + heuristic detection for 11 entity types. Choose replace, redact, flag, or block. Never expose sensitive data in your training set.
EMAIL · PHONE · SSN · CREDIT CARD · +7 more
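
For a feel of how regex-based detection with a configurable action can work, here is a minimal sketch. The patterns, the `scrub` function, and the meaning given to "replace" (typed placeholder) and "redact" (generic mask) are assumptions for illustration. They cover only three of the entity types and are far looser than a production detector would be.

```python
import re

# Illustrative patterns only (assumed, not the product's rules).
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scrub(text: str, action: str = "replace") -> tuple[str, int]:
    """Apply one action to every match; return the new text and match count."""
    if action not in {"replace", "redact"}:
        raise ValueError(f"unsupported action: {action}")
    total = 0
    for label, pattern in PII_PATTERNS.items():
        # "replace" keeps the entity type visible; "redact" hides it.
        token = f"[{label}]" if action == "replace" else "[REDACTED]"
        text, found = pattern.subn(token, text)
        total += found
    return text, total

sample = "Mail jane.doe@example.com or call 555-867-5309. ID on file: 123-45-6789."
scrubbed, count = scrub(sample)
print(scrubbed)
print(count)
```

Keeping the entity label in the placeholder preserves some signal for fine-tuning while removing the sensitive value itself.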
🧩
SMART CHUNKING
Six strategies for every document type. Semantic chunks preserve meaning. Code chunks respect function boundaries. Document chunks follow headings.
SEMANTIC · CODE · DOCUMENT · FIXED · SLIDING
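
As one example of the strategies named above, a sliding-window chunker can be sketched in a few lines. The function name, the `overlap` parameter, and the use of whitespace-separated words as "tokens" are assumptions for illustration. A real tokenizer would count differently.

```python
def sliding_chunks(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Emit overlapping windows of at most max_tokens whitespace-split words."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # this window already reached the end of the text
    return chunks

windows = sliding_chunks("a b c d e f g h i j", max_tokens=4, overlap=1)
print(windows)  # ['a b c d', 'd e f g', 'g h i j']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of some repeated text.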
DEDUPLICATION
MinHash LSH finds near-duplicates in roughly O(n) time by comparing only documents that land in the same hash bucket. Configurable Jaccard threshold. Exact SHA-256 hashing catches identical documents.
MINHASH LSH · SHA-256 · O(n)
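
The two checks described above can be sketched with the standard library alone. This is an illustration under stated assumptions: the function names, the 3-word shingles, and the 64 seeded hashes are hypothetical choices, and the sketch stops at comparing MinHash signatures. It leaves out the LSH bucketing step that avoids comparing every pair.

```python
import hashlib

def shingles(text: str, k: int = 3) -> set[str]:
    """Set of k-word shingles (short texts yield a single shingle)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(items: set[str], num_hashes: int = 64) -> list[int]:
    """One minimum per salted hash; each salt stands in for a permutation."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_fingerprint(text: str) -> str:
    """SHA-256 digest used to catch byte-identical documents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

doc_a = "the supplier shall deliver the goods within thirty days of the order date"
doc_b = "the supplier shall deliver the goods within sixty days of the order date"
similarity = estimated_jaccard(minhash(shingles(doc_a)), minhash(shingles(doc_b)))
is_near_duplicate = similarity >= 0.80  # threshold value taken from the snippet on this page
print(round(similarity, 2), is_near_duplicate)
```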
REST API + MCP
20+ endpoints. Full OpenAPI docs. Claude Code MCP server — let AI agents create pipelines, submit jobs, and export chunks directly.
FASTAPI · MCP SERVER · CLAUDE CODE
📊
QUALITY SCORING
Every chunk gets a 0.0–1.0 quality score. Set a threshold to auto-filter low-signal content before it reaches your embedding API.
INFO DENSITY · COMPLETENESS · STRUCTURE
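
Threshold filtering of this kind can be sketched as follows. The three toy signals below borrow the tag names above, but the formulas, weights, and function names are assumptions made up for illustration and say nothing about how the product actually scores chunks.

```python
def quality_score(chunk: str) -> float:
    """Average three toy signals into a score between 0.0 and 1.0."""
    words = chunk.split()
    if not words:
        return 0.0
    # Info density: share of distinct words, so repetition scores low.
    density = len({w.lower() for w in words}) / len(words)
    # Completeness: full credit if the chunk ends like a finished sentence.
    completeness = 1.0 if chunk.rstrip().endswith((".", "!", "?")) else 0.5
    # Structure: very short fragments are penalized, capped at 20 words.
    structure = min(len(words) / 20, 1.0)
    return round((density + completeness + structure) / 3, 3)

def filter_chunks(chunks: list[str], threshold: float = 0.6) -> list[str]:
    """Keep only chunks whose score meets the threshold."""
    return [c for c in chunks if quality_score(c) >= threshold]

candidates = [
    "click here click here click here",
    "The contract renews annually unless either party gives ninety days "
    "of written notice before the end of the current term.",
]
kept = filter_chunks(candidates)
print(len(kept))  # 1
```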
📄
11 FILE FORMATS
PDF, DOCX, HTML, Markdown, CSV, JSON, XML, XLSX, plain text, and source code. Multi-layer fallback ensures nothing fails silently.
PDF · DOCX · HTML · CSV · +7 more
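
One way to picture a fallback chain that never fails silently: try parsers in order, remember every error, and raise with the full list if none succeeds. The parser chain and the `ingest` function below are hypothetical and use only the standard library. They are a sketch of the idea, not the product's ingestion code.

```python
import json

def parse_json(raw: bytes) -> str:
    """Strict JSON: raises ValueError on anything that is not valid JSON."""
    return json.dumps(json.loads(raw), ensure_ascii=False)

def parse_utf8(raw: bytes) -> str:
    return raw.decode("utf-8")

def parse_latin1(raw: bytes) -> str:
    return raw.decode("latin-1")

DEFAULT_CHAIN = [("json", parse_json), ("utf-8", parse_utf8), ("latin-1", parse_latin1)]

def ingest(raw: bytes, chain=DEFAULT_CHAIN) -> tuple[str, str]:
    """Return (parser_name, text) from the first parser that succeeds."""
    errors = []
    for name, parser in chain:
        try:
            return name, parser(raw)
        except ValueError as exc:  # covers JSON and Unicode decode errors
            errors.append(f"{name}: {exc}")
    # Nothing worked: surface every failure instead of returning empty text.
    raise ValueError("all parsers failed: " + "; ".join(errors))

print(ingest(b'{"a": 1}')[0])   # json
print(ingest(b"caf\xc3\xa9")[0])  # utf-8
print(ingest(b"caf\xe9")[0])    # latin-1
```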
pipeline.py
from dataclassifier import Pipeline

# Configure once
pipeline = Pipeline(
    pii_action        = "replace",
    chunking_strategy = "semantic",
    max_tokens        = 512,
    dedup_threshold   = 0.80,
)

# Process any batch of files
result = pipeline.run(
    files  = ["contracts/", "tickets.csv"],
    job_id = "q4-batch-001",
)

print(result.summary())
# ✓ 341 chunks · 47 PII replaced · $0.018

result.export_jsonl("dataset.jsonl")
# ✓ 341 records written
0
REQUIRED DEPENDENCIES
The core pipeline runs on pure Python stdlib. No spaCy, no torch, no transformers. Install FastAPI only when you need the API server.
319
TESTS, ALL PASSING
Every stage has its own test suite. Run python3 run_tests.py and see 319 green lights in 2.7 seconds.
2.7s
FULL SUITE RUNTIME
Fast enough to run on every commit. Ingestion, cleaning, chunking, API, and integration tests — all done in under 3 seconds.
Pricing

PAY FOR
WHAT YOU
PROCESS.

Starter
For solo researchers and small teams getting started with fine-tuning.
  • 50 GB ingestion / month
  • 1,000 jobs / month
  • 5 team seats
  • All 6 chunking strategies
  • PII detection + replacement
  • REST API + MCP server
  • SSO / SAML
  • Dedicated support
  • On-premise deploy
Join Waitlist
Most Popular
Growth
For ML teams running production fine-tuning pipelines at scale.
  • 500 GB ingestion / month
  • 10,000 jobs / month
  • 25 team seats
  • All 6 chunking strategies
  • PII detection + replacement
  • REST API + MCP server
  • Priority job queue
  • Audit log export
  • On-premise deploy
Join Waitlist
Enterprise
For regulated industries with strict data residency and compliance requirements.
  • Unlimited ingestion
  • Unlimited jobs
  • Unlimited seats
  • All 6 chunking strategies
  • PII detection + replacement
  • REST API + MCP server
  • SSO / SAML
  • Dedicated support SLA
  • On-premise / VPC deploy
Contact Sales
The bottleneck isn’t your model.
It’s your training data.
Bad chunks produce bad fine-tunes.
Leaked PII produces legal nightmares.
Duplicates waste compute and money.


We built the pipeline
we wished existed
when we were the ones cleaning data at 2am.
— The dataclassifier.ai team