Hugging Face Datasets - Hub of ready-to-use ML datasets
The largest hub of ready-to-use datasets for machine learning models with over 100,000 datasets for NLP, computer vision, audio, and more.
Overview
Hugging Face Datasets is a library for easily accessing and sharing datasets for machine learning tasks. It provides a unified interface to over 100,000 datasets spanning NLP, computer vision, audio, and multimodal tasks. The library handles data downloading, caching, and memory-efficient loading automatically, enabling researchers and practitioners to focus on model development rather than data pipeline engineering.
Technology Stack
Hugging Face Datasets is built with the following technologies:
Language: Python (3.8+)
License: Apache 2.0
Stars: 21,534+
Owner: huggingface
Repo: https://github.com/huggingface/datasets

Core Dependencies:
- pyarrow (15.0.0+) - Columnar storage format
- numpy - Numerical computing
- pandas - Data manipulation
- fsspec - Filesystem abstraction (local, S3, GCS, etc.)
- requests - HTTP library
- tqdm - Progress bars
- xxhash - Fast hashing
- multiprocess - Parallel processing
- dill - Serialization for complex objects

Optional Integrations:
- torch (PyTorch) - Deep learning framework
- tensorflow - Deep learning framework
- jax - High-performance ML framework
- pillow - Image processing
- librosa - Audio processing
- soundfile - Audio I/O
- transformers - Hugging Face model library
Installation
Install the datasets library using pip. The base installation is lightweight, with optional dependencies for specific modalities:
    # Basic installation
    pip install datasets

    # With vision support (for image datasets)
    pip install datasets[vision]

    # With audio support (for audio datasets)
    pip install datasets[audio]

    # With all optional dependencies
    pip install datasets[all]

    # Install from source for latest development version
    pip install git+https://github.com/huggingface/datasets.git
Quick Start - Loading a Dataset
Load any dataset from the Hugging Face Hub with a single line of code. The library automatically downloads, caches, and prepares the data:
    from datasets import load_dataset

    # Load a text classification dataset
    dataset = load_dataset("imdb")

    # Access splits
    train_data = dataset["train"]
    test_data = dataset["test"]

    # Inspect the dataset structure
    print(dataset)
    # DatasetDict({
    #     train: Dataset({
    #         features: ['text', 'label'],
    #         num_rows: 25000
    #     })
    #     test: Dataset({
    #         features: ['text', 'label'],
    #         num_rows: 25000
    #     })
    # })

    # Access individual examples
    print(train_data[0])
    # {'text': 'I loved this movie...', 'label': 1}
Dataset Features and Types
The Datasets library supports rich data types, including text, images, audio, and custom structures. Features are strongly typed and automatically validated:
    from datasets import load_dataset

    # Text dataset
    text_dataset = load_dataset("squad")  # Question answering
    print(text_dataset["train"].features)
    # {'id': Value('string'), 'title': Value('string'), 'context': Value('string'), ...}

    # Image dataset
    image_dataset = load_dataset("mnist")  # Handwritten digits
    print(image_dataset["train"].features)
    # {'image': Image(decode=True), 'label': ClassLabel(num_classes=10)}

    # Audio dataset
    audio_dataset = load_dataset("common_voice", "en", split="train")
    print(audio_dataset.features)
    # {'audio': Audio(sampling_rate=48000), 'sentence': Value('string'), ...}

    # Multi-modal dataset
    multimodal = load_dataset("nlphuji/flickr30k")  # Image captioning
    print(multimodal["test"].features)
    # {'image': Image(), 'caption': Sequence(Value('string'))}
Data Processing and Transformation
Transform datasets efficiently using the map() function, which applies processing in parallel and caches results:
    from datasets import load_dataset

    dataset = load_dataset("imdb", split="train")

    # With batched=True the function receives a batch (a dict of lists),
    # so it returns one word count per example in that batch
    def tokenize_function(examples):
        return {"length": [len(text.split()) for text in examples["text"]]}

    # Process in batches for efficiency
    tokenized = dataset.map(
        tokenize_function,
        batched=True,
        batch_size=1000,
        num_proc=4  # Use 4 processes for parallel processing
    )

    # Filter dataset
    long_reviews = dataset.filter(lambda x: len(x["text"]) > 500)

    # Drop a column, keeping only the text
    text_only = dataset.remove_columns(["label"])

    # Sort by a column
    sorted_dataset = dataset.sort("text")

    # Shuffle the dataset
    shuffled = dataset.shuffle(seed=42)
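With batched=True, map() hands the function a dictionary that maps each column name to a list of values and expects lists back. The sketch below checks that contract on a plain dict standing in for a real batch, so it runs without loading any data. The helper name add_word_count and the sample texts are illustrative assumptions, not part of the library.

```python
def add_word_count(batch):
    """Batched map() function: takes a dict of lists, returns a dict of lists."""
    return {"length": [len(text.split()) for text in batch["text"]]}


# A plain dict shaped like the batch that map(batched=True) would pass in.
fake_batch = {"text": ["a great movie", "terrible"], "label": [1, 0]}

result = add_word_count(fake_batch)
print(result)
```

Once the function behaves as expected on a hand-built batch, it can be passed to dataset.map(add_word_count, batched=True).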
Memory-Efficient Data Loading
Datasets are memory-mapped from disk using Apache Arrow, allowing you to work with datasets larger than RAM:
    from datasets import load_dataset

    # Non-streaming: the dataset is downloaded and cached on disk, then
    # memory-mapped, so rows are read from disk on access instead of held in RAM
    dataset = load_dataset("c4", "en", split="train", streaming=False)

    # For extremely large datasets, use streaming mode
    # Data is loaded on-the-fly without downloading the full dataset
    streaming_dataset = load_dataset(
        "c4",
        "en",
        split="train",
        streaming=True  # Loads data as needed
    )

    # Iterate over a streaming dataset
    for i, example in enumerate(streaming_dataset):
        print(example["text"])
        if i >= 10:
            break  # Stop after the first 11 examples

    # Take a subset from streaming dataset
    subset = streaming_dataset.take(1000)  # First 1000 examples

    # Skip examples in streaming mode
    skipped = streaming_dataset.skip(5000)  # Skip first 5000
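As a conceptual analogy only (this is not how the library implements streaming), take() and skip() on a streaming dataset behave much like slicing a lazy Python iterator: nothing is produced until it is consumed. The generator example_stream below is a hypothetical stand-in for a streaming dataset.

```python
from itertools import islice


def example_stream():
    """Stand-in for a streaming dataset: yields examples one at a time, forever."""
    i = 0
    while True:
        yield {"text": f"document {i}"}
        i += 1


# Analogous to streaming_dataset.take(3): only three examples are ever produced.
first_three = list(islice(example_stream(), 3))

# Analogous to streaming_dataset.skip(5).take(2): five examples are consumed
# and discarded before the next two are kept.
after_skip = list(islice(example_stream(), 5, 7))

print(first_three)
print(after_skip)
```

The practical consequence is that skipping is not free: the skipped examples still have to be read from the source before the first kept example arrives.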
Integration with PyTorch and TensorFlow
Convert datasets to native PyTorch or TensorFlow formats for seamless integration with training loops:
    from datasets import load_dataset

    dataset = load_dataset("imdb", split="train")

    # PyTorch integration
    dataset.set_format(type="torch", columns=["text", "label"])
    # Now dataset returns PyTorch tensors

    # Or use with_format for temporary conversion
    torch_dataset = dataset.with_format("torch")

    # Use with PyTorch DataLoader
    from torch.utils.data import DataLoader

    dataloader = DataLoader(torch_dataset, batch_size=32)
    for batch in dataloader:
        # batch contains PyTorch tensors
        print(batch["label"].shape)  # torch.Size([32])
        break

    # TensorFlow integration
    tf_dataset = dataset.to_tf_dataset(
        columns=["text", "label"],
        batch_size=32,
        shuffle=True
    )

    # Use with TensorFlow/Keras
    import tensorflow as tf

    for batch in tf_dataset.take(1):
        print(batch["label"].shape)  # (32,)
Loading Custom Datasets
Load datasets from local files in various formats including CSV, JSON, Parquet, and more:
    from datasets import load_dataset

    # Load from CSV
    csv_dataset = load_dataset("csv", data_files="my_data.csv")

    # Load from JSON
    json_dataset = load_dataset("json", data_files="my_data.json")

    # Load from multiple files with splits
    dataset = load_dataset(
        "csv",
        data_files={
            "train": "train.csv",
            "test": "test.csv"
        }
    )

    # Load from Parquet (most efficient)
    parquet_dataset = load_dataset("parquet", data_files="data.parquet")

    # Load from remote URL
    remote_dataset = load_dataset(
        "csv",
        data_files="https://example.com/data.csv"
    )

    # Load from cloud storage (S3, GCS, etc.)
    s3_dataset = load_dataset(
        "csv",
        data_files="s3://my-bucket/data.csv"
    )
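To try the multi-file CSV example without existing data, the sketch below writes two tiny CSV files to a temporary directory using only the standard library and builds the data_files mapping that load_dataset("csv", ...) accepts. The file contents and the text and label column names are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample rows, keyed by the file they are written to.
rows = {
    "train.csv": [("good film", 1), ("dull plot", 0)],
    "test.csv": [("fine acting", 1)],
}

data_dir = Path(tempfile.mkdtemp())
for filename, examples in rows.items():
    with open(data_dir / filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])  # header row supplies the column names
        writer.writerows(examples)

# Map split names ("train", "test") to file paths.
data_files = {Path(name).stem: str(data_dir / name) for name in rows}
print(data_files)
```

The resulting mapping can be passed as load_dataset("csv", data_files=data_files) to get a DatasetDict with train and test splits.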
Creating Custom Datasets
Build datasets from scratch using Python dictionaries, pandas DataFrames, or by defining a loading script:
    from datasets import Dataset, DatasetDict
    import pandas as pd

    # From a Python dictionary
    data_dict = {
        "text": ["Hello world", "How are you?", "I'm fine"],
        "label": [0, 1, 0]
    }
    dataset = Dataset.from_dict(data_dict)

    # From a pandas DataFrame
    df = pd.DataFrame({
        "text": ["Example 1", "Example 2"],
        "label": [0, 1]
    })
    dataset = Dataset.from_pandas(df)

    # Create a DatasetDict with multiple splits
    train_dict = {"text": ["train1", "train2"], "label": [0, 1]}
    test_dict = {"text": ["test1"], "label": [0]}
    dataset_dict = DatasetDict({
        "train": Dataset.from_dict(train_dict),
        "test": Dataset.from_dict(test_dict)
    })

    # From a generator function (memory efficient)
    def data_generator():
        for i in range(1000):
            yield {"text": f"Example {i}", "label": i % 2}

    dataset = Dataset.from_generator(data_generator)
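Dataset.from_dict expects column-oriented data (one list per column), while records from an API or database usually arrive row by row. A minimal sketch of the conversion, assuming every row has the same keys; rows_to_columns is a hypothetical helper, not a library function. Recent versions of the library also offer Dataset.from_list for row-oriented input.

```python
def rows_to_columns(rows):
    """Turn a list of row dicts into a dict of column lists."""
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key in columns:
            columns[key].append(row[key])
    return columns


records = [
    {"text": "Hello world", "label": 0},
    {"text": "How are you?", "label": 1},
]

data_dict = rows_to_columns(records)
print(data_dict)
```

The resulting data_dict has the same shape as the dictionary passed to Dataset.from_dict in the example above.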
Sharing Datasets on the Hub
Upload your datasets to the Hugging Face Hub to share with the community or for private use:
    # Install huggingface-hub for authentication
    pip install huggingface-hub

    # Login to Hugging Face (requires account at huggingface.co)
    huggingface-cli login

⚠ Heads up: You'll need a Hugging Face account to upload datasets. Create one at https://huggingface.co/join
Upload Dataset to Hub
Push your dataset to the Hugging Face Hub using the push_to_hub method:
    from datasets import Dataset, DatasetDict

    # Create or load your dataset
    data = {
        "text": ["Example 1", "Example 2", "Example 3"],
        "label": [0, 1, 0]
    }
    dataset = Dataset.from_dict(data)

    # Upload to the Hub (creates a public dataset)
    dataset.push_to_hub("my-username/my-dataset")

    # Upload a private dataset
    dataset.push_to_hub(
        "my-username/my-private-dataset",
        private=True
    )

    # Upload with a specific split name
    dataset.push_to_hub(
        "my-username/my-dataset",
        split="train"
    )

    # Upload a DatasetDict with multiple splits
    train_dataset = Dataset.from_dict({"text": ["train1", "train2"], "label": [0, 1]})
    test_dataset = Dataset.from_dict({"text": ["test1"], "label": [0]})
    dataset_dict = DatasetDict({
        "train": train_dataset,
        "test": test_dataset
    })
    dataset_dict.push_to_hub("my-username/my-dataset")
Dataset Caching and Storage
Datasets are automatically cached to avoid re-downloading. Manage cache location and behavior:
    import os

    # Default cache location: ~/.cache/huggingface/datasets
    # Set custom cache directory via environment variable.
    # Do this before importing datasets, which reads the variable at import time.
    os.environ["HF_DATASETS_CACHE"] = "/path/to/custom/cache"

    from datasets import load_dataset, load_dataset_builder

    # Or pass cache_dir directly
    dataset = load_dataset(
        "imdb",
        cache_dir="/path/to/custom/cache"
    )

    # Ignore any cached copy and download again
    dataset = load_dataset("imdb", download_mode="force_redownload")

    # Reuse cached files if present, download otherwise (the default mode).
    # For cache-only, offline use, set the HF_DATASETS_OFFLINE=1 environment variable.
    dataset = load_dataset("imdb", download_mode="reuse_cache_if_exists")

    # Find the cache location for a specific dataset
    builder = load_dataset_builder("imdb")
    print(builder.cache_dir)  # Shows cache location

    # Remove cache files left over from earlier map()/filter() calls
    dataset.cleanup_cache_files()
Advanced Features - Sharding
Split large datasets into shards for distributed processing or to work with subsets:
    from datasets import load_dataset

    # Stream the dataset so each shard is read lazily (useful for distributed training)
    dataset = load_dataset(
        "c4",
        "en",
        split="train",
        streaming=True
    )

    # Split into 8 shards, take shard 0
    shard_0 = dataset.shard(num_shards=8, index=0)

    # In a distributed setting, each worker gets one shard
    # Worker 0: worker_dataset = dataset.shard(num_shards=4, index=0)
    # Worker 1: worker_dataset = dataset.shard(num_shards=4, index=1)
    # ...

    # select() and train_test_split() need a regular (non-streaming) Dataset
    imdb = load_dataset("imdb", split="train")

    # Select a fixed-size subset
    subset = imdb.select(range(1000))  # First 1000 examples

    # Random subset (10% of data): the "test" side of the split holds the 10% sample
    random_subset = imdb.train_test_split(test_size=0.1, seed=42)["test"]
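For intuition about what "each worker gets one shard" means, the sketch below shows two common ways to partition example indices across shards: strided (every n-th example) and contiguous (consecutive blocks). It uses plain lists rather than the library, and both helper names are hypothetical. Which assignment shard() uses depends on its contiguous argument and on the library version.

```python
def strided_shard(indices, num_shards, index):
    """Shard `index` gets every num_shards-th element, starting at `index`."""
    return indices[index::num_shards]


def contiguous_shard(indices, num_shards, index):
    """Shard `index` gets one consecutive block; earlier shards absorb the remainder."""
    div, mod = divmod(len(indices), num_shards)
    start = index * div + min(index, mod)
    end = start + div + (1 if index < mod else 0)
    return indices[start:end]


indices = list(range(10))
strided = [strided_shard(indices, 4, i) for i in range(4)]
contiguous = [contiguous_shard(indices, 4, i) for i in range(4)]

print(strided)
print(contiguous)
```

Either way, every example lands in exactly one shard, which is the property distributed training relies on.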
Working with Images and Audio
The Datasets library provides specialized support for multimedia data with automatic decoding:
    from datasets import load_dataset, Audio, Image

    # Load image dataset
    image_dataset = load_dataset("cifar10", split="train")

    # Images are automatically decoded to PIL Images
    image = image_dataset[0]["img"]
    print(type(image))  # a PIL.Image.Image subclass (the exact class depends on the stored format)

    # Access image properties
    print(image.size)  # (32, 32)

    # Disable automatic decoding for performance
    image_dataset = image_dataset.cast_column(
        "img",
        Image(decode=False)
    )
    # Now each image is returned undecoded, as a dict with "bytes" and "path" keys

    # Load audio dataset
    audio_dataset = load_dataset(
        "common_voice",
        "en",
        split="train",
        streaming=True
    )

    # Audio is automatically decoded to numpy arrays
    audio_sample = next(iter(audio_dataset))
    print(audio_sample["audio"])
    # {'array': array([...]), 'sampling_rate': 48000, 'path': '...'}

    # Resample audio to different sampling rate
    audio_dataset = audio_dataset.cast_column(
        "audio",
        Audio(sampling_rate=16000)
    )
Dataset Metrics and Evaluation
The datasets library previously included evaluation metrics, but these have been moved to the evaluate library:
    # Install the evaluate library for metrics
    pip install evaluate
Using Evaluation Metrics
Load and compute metrics using the separate evaluate library:
    import evaluate

    # Load a metric
    accuracy = evaluate.load("accuracy")

    # Compute the metric
    predictions = [0, 1, 0, 1]
    references = [0, 1, 1, 1]
    results = accuracy.compute(predictions=predictions, references=references)
    print(results)  # {'accuracy': 0.75}

    # Other common metrics
    f1 = evaluate.load("f1")
    bleu = evaluate.load("bleu")
    rouge = evaluate.load("rouge")
    wer = evaluate.load("wer")  # Word Error Rate for speech

    # Metrics for specific tasks
    glue_metric = evaluate.load("glue", "mrpc")  # GLUE benchmark
    squad_metric = evaluate.load("squad")  # Question answering
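The 0.75 accuracy in the example above is the fraction of positions where the prediction matches the reference. A plain-Python version of that arithmetic, shown for intuition rather than as the evaluate library's implementation:

```python
predictions = [0, 1, 0, 1]
references = [0, 1, 1, 1]

# Count positions where prediction and reference agree (3 of 4 here).
matches = sum(p == r for p, r in zip(predictions, references))
accuracy = matches / len(references)

print(accuracy)  # 0.75
```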
Dataset Configuration and Variants
Many datasets have multiple configurations or subsets. Specify the configuration when loading:
    from datasets import load_dataset, get_dataset_config_names

    # List available configurations
    configs = get_dataset_config_names("super_glue")
    print(configs)
    # ['boolq', 'cb', 'copa', 'multirc', 'record', 'rte', 'wic', 'wsc']

    # Load a specific configuration
    dataset = load_dataset("super_glue", "boolq")

    # Some datasets require a configuration
    try:
        dataset = load_dataset("super_glue")  # Error!
    except ValueError as e:
        print("Must specify config:", e)

    # Load dataset with language configuration
    xnli = load_dataset("xnli", "en")  # English version
    xnli_fr = load_dataset("xnli", "fr")  # French version

    # Check dataset information
    from datasets import load_dataset_builder

    builder = load_dataset_builder("super_glue", "boolq")
    print(builder.info.description)
    print(builder.info.features)
Performance Optimization Tips
Best practices for efficient dataset processing and loading:
1. Use streaming=True for datasets > 10GB to avoid downloading entire dataset
2. Use num_proc parameter in map() for parallel processing
3. Set batched=True in map() to process data in batches (10-100x faster)
4. Use remove_columns to drop unused features before processing
5. Cache processed datasets: dataset.save_to_disk("path/to/cache")
6. Use load_from_disk("path/to/cache") to reload cached datasets
7. For images/audio, set decode=False if you don't need the decoded data immediately
8. Use select() or shard() to work with subsets during development
9. Set keep_in_memory=False for large datasets that don't fit in RAM
10. Use Arrow-native operations when possible (filter, select, sort)
Common Use Cases
Examples of typical workflows with Hugging Face Datasets:
    from collections import Counter

    from datasets import load_dataset, load_from_disk, concatenate_datasets

    # 1. Train-validation split
    dataset = load_dataset("imdb", split="train")
    train_val = dataset.train_test_split(test_size=0.1)
    train = train_val["train"]
    val = train_val["test"]

    # 2. Combine multiple datasets
    # ("dataset1" and "dataset2" are placeholder names; both must share the same features)
    dataset1 = load_dataset("dataset1", split="train")
    dataset2 = load_dataset("dataset2", split="train")
    combined = concatenate_datasets([dataset1, dataset2])

    # 3. Sample a subset for quick experimentation
    small_dataset = dataset.shuffle(seed=42).select(range(1000))

    # 4. Balance classes
    labels = dataset["label"]
    label_counts = Counter(labels)
    # Undersample majority class or oversample minority

    # 5. Save processed dataset
    def preprocessing_function(example):
        # Placeholder preprocessing step: normalize whitespace
        return {"text": " ".join(example["text"].split())}

    processed = dataset.map(preprocessing_function)
    processed.save_to_disk("./processed_data")

    # Later: reload without reprocessing
    processed = load_from_disk("./processed_data")
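The class-balancing workflow in the Common Use Cases example stops at counting labels. One possible way to finish it is random undersampling: keep as many examples of each class as the rarest class has. The sketch below computes the indices to keep from a plain list of labels; undersample_indices is a hypothetical helper and the label list is made-up data.

```python
import random
from collections import defaultdict


def undersample_indices(labels, seed=42):
    """Return sorted indices that keep an equal number of examples per label."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    # Every class is cut down to the size of the rarest class.
    target = min(len(idxs) for idxs in by_label.values())

    rng = random.Random(seed)  # seeded for reproducible sampling
    keep = []
    for idxs in by_label.values():
        keep.extend(rng.sample(idxs, target))
    return sorted(keep)


labels = [0, 0, 0, 0, 1, 1]
balanced = undersample_indices(labels)
print([labels[i] for i in balanced])
```

With a real dataset, the returned indices can be passed to dataset.select(balanced) to build the balanced subset.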
Troubleshooting Common Issues
Solutions to frequently encountered problems:
Problem: Dataset download fails or hangs
Solution: Check internet connection, try download_mode="force_redownload", or use a mirror

Problem: Out of memory errors
Solution: Use streaming=True, process in smaller batches, or increase system swap

Problem: Slow data loading
Solution: Increase num_proc in map(), use batched=True, cache preprocessed data

Problem: "Unable to find config" error
Solution: Specify the config name: load_dataset("name", "config_name")

Problem: Authentication errors with private datasets
Solution: Run 'huggingface-cli login' and use token parameter

Problem: Incompatible dataset version
Solution: Specify revision: load_dataset("name", revision="main")

Problem: Arrow memory mapping errors on Windows
Solution: Use keep_in_memory=True or disable memory mapping

Problem: Multiprocessing issues on Windows
Solution: Use num_proc=1 or wrap code in if __name__ == "__main__":
Additional Resources
Key resources for learning more about Hugging Face Datasets:
Official Documentation: https://huggingface.co/docs/datasets
Dataset Hub: https://huggingface.co/datasets (100,000+ datasets)
GitHub Repository: https://github.com/huggingface/datasets
Community Forum: https://discuss.huggingface.co
Tutorials: https://huggingface.co/course
API Reference: https://huggingface.co/docs/datasets/package_reference
Dataset Cards: Each dataset has a detailed card explaining its contents
Course: https://huggingface.co/learn/nlp-course

Related Libraries:
- transformers: Pre-trained models for NLP, vision, audio
- evaluate: Metrics for ML evaluation
- tokenizers: Fast text tokenization
- accelerate: Distributed training made easy