The pipeline that turns raw documents into fine-tuning gold: ingest 11 file formats, scrub PII automatically, deduplicate with MinHash LSH, chunk with any of 6 strategies, score quality, then export directly to your embedding API.
```python
from dataclassifier import Pipeline

# Configure once
pipeline = Pipeline(
    pii_action="replace",
    chunking_strategy="semantic",
    max_tokens=512,
    dedup_threshold=0.80,
)

# Process any batch of files
result = pipeline.run(
    files=["contracts/", "tickets.csv"],
    job_id="q4-batch-001",
)
print(result.summary())
# ✓ 341 chunks · 47 PII replaced · $0.018

result.export_jsonl("dataset.jsonl")
# ✓ 341 records written
```
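The `dedup_threshold=0.80` above is a Jaccard-similarity cutoff for the MinHash LSH stage. The library's internals aren't shown here, so the following is a from-scratch sketch of the underlying technique, not dataclassifier's actual code; every name in it (`shingles`, `minhash`, `candidate_pairs`, the band/permutation counts) is illustrative.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64   # hash functions per MinHash signature
BANDS = 16      # LSH bands; NUM_PERM // BANDS rows per band

def shingles(text: str, k: int = 3) -> set[str]:
    """Overlapping k-word shingles: the sets we estimate Jaccard over."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def _h(seed: int, shingle: str) -> int:
    # One seeded 64-bit hash; prefixing the seed makes each "permutation" independent.
    data = seed.to_bytes(4, "big") + shingle.encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash(sh: set[str]) -> tuple[int, ...]:
    """Signature = per-seed minimum hash over the document's shingles."""
    return tuple(min(_h(seed, s) for s in sh) for seed in range(NUM_PERM))

def estimate_jaccard(sig_a, sig_b) -> float:
    """Fraction of agreeing signature slots approximates true Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM

def candidate_pairs(signatures: dict[str, tuple[int, ...]]) -> set[tuple[str, str]]:
    """LSH banding: any two docs that share one full band become duplicate candidates."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(BANDS):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Tiny demo corpus: "a2" duplicates "a"; "c" is unrelated.
docs = {
    "a":  "the quick brown fox jumps over the lazy dog near the old river bank",
    "a2": "the quick brown fox jumps over the lazy dog near the old river bank",
    "c":  "quarterly revenue grew nine percent driven by strong cloud demand",
}
sigs = {doc_id: minhash(shingles(text)) for doc_id, text in docs.items()}
pairs = candidate_pairs(sigs)
# ("a", "a2") collide in every band; "c" lands in its own buckets.
```

A threshold like `0.80` then filters the candidate pairs by `estimate_jaccard`, which is why near-duplicates are caught without comparing every document against every other.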
Run `python3 run_tests.py` and see 319 green lights in 2.7 seconds.