nvidia-generative-ai-notes

NeMo Curator

NeMo Data Curator is an open-source, GPU-accelerated data curation pipeline developed by NVIDIA as part of the NeMo Framework. Its primary purpose is to help researchers and ML engineers prepare high-quality text datasets for pretraining large language models (LLMs). It is designed to handle datasets at massive scale — from hundreds of gigabytes to petabytes — with efficiency and reproducibility.

Modality  Key Capabilities
Text      Deduplication • Classification • Quality Filtering • Language Detection
Image     Aesthetic Filtering • NSFW Detection • Embedding Generation • Deduplication
Video     Scene Detection • Clip Extraction • Motion Filtering • Deduplication
Audio     ASR Transcription • Quality Assessment • WER Filtering

Text - Key Components and Capabilities

Data Download & Format Handling

The pipeline can ingest data from Common Crawl (WARC/WET files), HuggingFace datasets, local files, and custom sources. It handles multiple formats including JSON, JSONL, Parquet, and plain text, and converts them into a unified format for downstream processing.
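
To make the "unified format" idea concrete, here is a minimal sketch using only the Python standard library. It is not NeMo Curator's own reader API: the function name read_records, the record keys (id, text, source), and the assumption that JSON/JSONL rows carry their content in a "text" field are all hypothetical choices for illustration. Parquet is left out because it needs a third-party reader.

```python
import json
from pathlib import Path
from typing import Iterator


def read_records(path: Path, text_field: str = "text") -> Iterator[dict]:
    """Yield records shaped {"id", "text", "source"} from one input file.

    Hypothetical helper: the record layout is an assumption, not a
    format defined by NeMo Curator.
    """
    suffix = path.suffix.lower()
    source = str(path)
    if suffix == ".jsonl":
        # One JSON object per line; blank lines are skipped.
        with path.open(encoding="utf-8") as fh:
            for i, line in enumerate(fh):
                if line.strip():
                    row = json.loads(line)
                    yield {"id": f"{path.name}:{i}", "text": row[text_field], "source": source}
    elif suffix == ".json":
        # Assumes the file holds a JSON array of objects.
        rows = json.loads(path.read_text(encoding="utf-8"))
        for i, row in enumerate(rows):
            yield {"id": f"{path.name}:{i}", "text": row[text_field], "source": source}
    elif suffix == ".txt":
        # A plain-text file is treated as a single document.
        yield {"id": f"{path.name}:0", "text": path.read_text(encoding="utf-8"), "source": source}
    else:
        raise ValueError(f"unsupported format: {suffix!r}")
```

Every later stage can then work on the same three keys regardless of where a document came from.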

Language Identification

Using libraries like fastText or langdetect, Data Curator can classify and filter documents by language.
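
A rough sketch of language filtering, written against any predictor that returns results in the shape fastText's model.predict does: a tuple of labels such as "__label__en" and a matching tuple of probabilities. The function filter_by_language, its parameters, and the added record keys are hypothetical names, and toy_predict is only a stand-in so the example runs without a model file; it is not a usable language detector.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple

# Shape of fastText's model.predict(text): (labels, probabilities).
Predict = Callable[[str], Tuple[Sequence[str], Sequence[float]]]
LABEL_PREFIX = "__label__"


def filter_by_language(
    docs: Iterable[dict],
    predict: Predict,
    keep: frozenset = frozenset({"en"}),
    min_confidence: float = 0.5,
) -> Iterator[dict]:
    """Yield only documents whose top predicted language is in `keep`."""
    for doc in docs:
        # fastText rejects input containing newlines, so flatten to one line.
        line = " ".join(doc["text"].split())
        if not line:
            continue
        labels, probs = predict(line)
        label = labels[0]
        lang = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        score = float(probs[0])
        if lang in keep and score >= min_confidence:
            yield {**doc, "language": lang, "language_score": score}


def toy_predict(text: str):
    """Hypothetical stand-in for a real model, for demonstration only."""
    words = set(text.lower().split())
    if words & {"the", "and", "is"}:
        return ("__label__en",), (0.9,)
    if words & {"der", "und", "ist"}:
        return ("__label__de",), (0.9,)
    return ("__label__unknown",), (0.1,)
```

With fastText installed and a language-ID model downloaded, the model's predict method could be passed in place of toy_predict.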

Text Extraction and Cleaning

For HTML or WARC content, it extracts clean text by removing boilerplate (navigation bars, ads, etc.) using tools like jusText or Trafilatura. It also performs Unicode normalization and basic cleanup such as repairing encoding artifacts.
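
The two steps can be sketched with the standard library alone. This is a deliberately crude stand-in for jusText or Trafilatura: it drops whole elements by tag name rather than scoring blocks for boilerplate. The names html_to_text and clean_text and the tag lists are assumptions made for this example, and NFC is just one reasonable normalization form.

```python
import re
import unicodedata
from html.parser import HTMLParser

# Assumed tag lists for this sketch; real extractors use richer heuristics.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
BLOCK_TAGS = {"p", "div", "li", "br", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self.chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def clean_text(text: str) -> str:
    """Normalize Unicode, drop control characters, tidy whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Remove control characters (category Cc) except newline and tab.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r"\s*\n\s*", "\n", text)   # collapse blank lines
    return text.strip()


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return clean_text("".join(parser.chunks))
```

Repairing mojibake (text decoded with the wrong codec) is a separate problem that this sketch does not attempt.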

Heuristic Quality Filtering

Deduplication

PII (Personally Identifiable Information) Redaction

Data Curator includes tools to detect and redact PII such as names, email addresses, phone numbers, IP addresses, and more. Detection combines rule-based regex patterns with NER (Named Entity Recognition) models, which helps teams comply with privacy regulations like GDPR.
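
The rule-based half can be illustrated with a few regular expressions. The patterns, placeholder tokens, and the redact_pii name below are assumptions for this sketch rather than the rules NeMo Curator ships: the phone pattern only covers North American-style numbers, and names are not handled at all because they need an NER model.

```python
import re

# Order matters: emails and IPs are replaced before phone numbers so the
# phone pattern never sees their digits.
PII_PATTERNS = [
    ("EMAIL", re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")),
    (
        "IP_ADDRESS",
        re.compile(
            r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}"
            r"(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
        ),
    ),
    (
        "PHONE",
        re.compile(
            r"(?<!\w)(?:\+?\d{1,3}[\s.-]?)?(?:\(\d{3}\)|\d{3})"
            r"[\s.-]?\d{3}[\s.-]?\d{4}(?!\w)"
        ),
    ),
]


def redact_pii(text: str):
    """Replace matches with <LABEL> tokens; return (text, counts per label)."""
    counts = {}
    for label, pattern in PII_PATTERNS:
        text, n = pattern.subn(f"<{label}>", text)
        counts[label] = n
    return text, counts
```

Returning the per-label counts makes it easy to log how much was redacted from each document without keeping the sensitive values themselves.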

Classifier-Based Filtering

Beyond heuristics, you can plug in custom ML classifiers to score and filter documents. NVIDIA provides pre-trained classifiers for things like:

Output Format

The final stage writes the cleaned, shuffled dataset as JSONL or Parquet, ready for training.
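
A small in-memory sketch of the JSONL path, assuming the records fit in RAM (at the scales described above, shuffling would have to be distributed). The function name write_shuffled_jsonl and the seed parameter are hypothetical; the fixed seed is there so that reruns produce the same order.

```python
import json
import random
from pathlib import Path
from typing import Iterable


def write_shuffled_jsonl(records: Iterable[dict], out_path, seed: int = 0) -> int:
    """Shuffle records deterministically and write one JSON object per line."""
    rows = list(records)
    # A private Random instance keeps the shuffle reproducible and avoids
    # touching the global random state.
    random.Random(seed).shuffle(rows)
    with Path(out_path).open("w", encoding="utf-8") as fh:
        for row in rows:
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows)
```

Writing Parquet instead would need a third-party library, so it is not shown here.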