NeMo Data Curator is an open-source, GPU-accelerated data curation pipeline developed by NVIDIA as part of the NeMo Framework. Its primary purpose is to help researchers and ML engineers prepare high-quality text datasets for pretraining large language models (LLMs). It is designed to process datasets efficiently and reproducibly at scales ranging from hundreds of gigabytes to petabytes.
| Modality | Key Capabilities |
|---|---|
| Text | Deduplication • Classification • Quality Filtering • Language Detection |
| Image | Aesthetic Filtering • NSFW Detection • Embedding Generation • Deduplication |
| Video | Scene Detection • Clip Extraction • Motion Filtering • Deduplication |
| Audio | ASR Transcription • Quality Assessment • WER Filtering |
The pipeline can ingest data from Common Crawl (WARC/WET files), HuggingFace datasets, local files, and custom sources. It handles multiple formats including JSON, JSONL, Parquet, and plain text, and converts them into a unified format for downstream processing.
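As a rough illustration of what "a unified format" can mean, the sketch below reads JSONL, JSON, and plain-text files into records with the same three keys. It is not Data Curator's own reader: the function name `read_documents`, the record keys (`id`, `text`, `source`), and the assumption that input records carry a `text` field are all hypothetical. Parquet is left out because it needs a third-party library.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def read_documents(path: str) -> Iterator[Dict[str, str]]:
    """Yield {"id", "text", "source"} records from a .jsonl, .json, or .txt file.

    Hypothetical helper: assumes JSON/JSONL records store their body under "text".
    """
    p = Path(path)
    suffix = p.suffix.lower()

    if suffix == ".jsonl":
        # One JSON object per line; blank lines are skipped.
        with p.open(encoding="utf-8") as f:
            for i, line in enumerate(f):
                line = line.strip()
                if not line:
                    continue
                rec = json.loads(line)
                yield {"id": f"{p.name}:{i}", "text": rec["text"], "source": p.name}
    elif suffix == ".json":
        # Assumed layout: a top-level list of objects.
        with p.open(encoding="utf-8") as f:
            records = json.load(f)
        for i, rec in enumerate(records):
            yield {"id": f"{p.name}:{i}", "text": rec["text"], "source": p.name}
    elif suffix == ".txt":
        # A plain-text file is treated as a single document.
        yield {"id": f"{p.name}:0", "text": p.read_text(encoding="utf-8"), "source": p.name}
    else:
        raise ValueError(f"Unsupported file type: {suffix!r}")
```

Because every branch yields the same record shape, later stages (language filtering, cleaning, redaction) do not need to know which source format a document came from.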
Using libraries like fastText or langdetect, Data Curator can classify and filter documents by language.
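The filtering step itself is independent of which detector is used. The sketch below shows that shape under stated assumptions: `filter_by_language` is a hypothetical helper, and `toy_detect` is a deliberately crude stopword counter standing in for a real language-ID model such as fastText or langdetect. It is not how Data Curator detects languages.

```python
from typing import Callable, Dict, Iterable, Iterator, Set, Tuple

# Toy stand-in for a real language-ID model (assumption, for illustration only).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "und", "ist", "das", "nicht"},
    "fr": {"le", "et", "est", "les", "des"},
}


def toy_detect(text: str) -> Tuple[str, float]:
    """Return (language, score), where score is the share of tokens that are stopwords."""
    tokens = text.lower().split()
    if not tokens:
        return ("unknown", 0.0)
    scores = {
        lang: sum(t in words for t in tokens) / len(tokens)
        for lang, words in STOPWORDS.items()
    }
    lang = max(scores, key=scores.get)
    return (lang, scores[lang]) if scores[lang] > 0 else ("unknown", 0.0)


def filter_by_language(
    docs: Iterable[Dict],
    detect: Callable[[str], Tuple[str, float]],
    keep: Set[str],
    min_score: float = 0.1,
) -> Iterator[Dict]:
    """Keep documents whose detected language is in `keep`, annotating each with it."""
    for doc in docs:
        lang, score = detect(doc["text"])
        if lang in keep and score >= min_score:
            yield {**doc, "language": lang}
```

Passing the detector in as a callable means a real model can replace `toy_detect` without touching the filter; `min_score` is where a confidence threshold would apply.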
For HTML or WARC content, it extracts clean text by removing boilerplate (navigation bars, ads, etc.) using tools like jusText or Trafilatura. It also performs Unicode normalization and basic cleaning, such as fixing encoding artifacts.
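The normalization and cleaning part can be sketched with the standard library alone. The function below is a minimal example, not Data Curator's implementation: the name `clean_text`, the choice of NFC, and the specific cleanup rules are assumptions. It does not attempt to repair mojibake (text decoded with the wrong encoding), which usually calls for a dedicated library such as ftfy.

```python
import re
import unicodedata

# Control characters except tab (\x09), newline (\x0a), and carriage return (\x0d).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_MULTI_SPACE = re.compile(r"[ \t]+")


def clean_text(text: str) -> str:
    """Apply NFC normalization and a few conservative whitespace/control-char fixes."""
    # Compose characters, e.g. "e" + combining acute accent becomes a single code point.
    text = unicodedata.normalize("NFC", text)
    # Unify line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop stray control characters left over from extraction.
    text = _CONTROL_CHARS.sub("", text)
    # Treat non-breaking spaces as ordinary spaces, then collapse runs of spaces/tabs.
    text = text.replace("\u00a0", " ")
    text = _MULTI_SPACE.sub(" ", text)
    # Trim each line and the document as a whole.
    return "\n".join(line.strip() for line in text.split("\n")).strip()
```

NFC is used here because it leaves text visually unchanged; NFKC would additionally fold compatibility characters (for example ligatures and full-width forms), which changes the content more aggressively.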
Data Curator includes tools to detect and redact PII such as names, email addresses, phone numbers, IP addresses, and more. This uses rule-based regex patterns as well as NER (Named Entity Recognition) models, helping teams comply with privacy regulations like GDPR.
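The rule-based half of this step can be illustrated with a few regular expressions. The patterns and placeholder labels below are simplified assumptions for the sketch, not the patterns Data Curator ships. They will miss some formats and can over-match others. Names are not covered at all, since detecting them reliably requires an NER model.

```python
import re

# Order matters: IP addresses are redacted before the looser phone pattern runs.
PII_PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    (
        "IP_ADDRESS",
        re.compile(
            r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}"
            r"(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
        ),
    ),
    (
        "PHONE",
        # North American style numbers with an optional country code (assumption).
        re.compile(
            r"(?<!\w)(?:\+?\d{1,3}[ .-]?)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}(?!\w)"
        ),
    ),
]


def redact_pii(text: str) -> str:
    """Replace each regex match with a placeholder such as <EMAIL>."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"<{label}>", text)
    return text


print(redact_pii("Contact jane.doe@example.com or 555-123-4567 from 192.168.0.1."))
# prints: Contact <EMAIL> or <PHONE> from <IP_ADDRESS>.
```

Replacing matches with typed placeholders rather than deleting them keeps sentences grammatical and leaves a record of what kind of entity was removed.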
Beyond heuristics, you can plug in custom ML classifiers to score and filter documents. NVIDIA provides pre-trained classifiers for things like:
The final stage writes the cleaned, shuffled data as JSONL or Parquet files ready for training.
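For the JSONL case, the output step can be sketched as a seeded shuffle followed by a line-by-line write. This is a minimal in-memory example with a hypothetical helper name, `write_shuffled_jsonl`. It assumes the dataset fits in RAM, which will not hold at the scales the pipeline targets, where shuffling would have to be distributed or done per shard.

```python
import json
import random
from typing import Dict, Iterable


def write_shuffled_jsonl(docs: Iterable[Dict], path: str, seed: int = 42) -> int:
    """Shuffle records with a fixed seed and write one JSON object per line.

    Returns the number of records written.
    """
    records = list(docs)
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(records)
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            # ensure_ascii=False keeps non-ASCII text readable in the output file.
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)
```

Fixing the seed means the same input always produces the same output order, which supports the reproducibility goal stated at the top of the document.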