
Hugging Face Datasets - Hub of ready-to-use ML datasets

The largest hub of ready-to-use datasets for machine learning models with over 100,000 datasets for NLP, computer vision, audio, and more.

Step 1
Overview
Hugging Face Datasets is a library for easily accessing and sharing datasets for machine learning tasks. It provides a unified interface to over 100,000 datasets spanning NLP, computer vision, audio, and multimodal tasks. The library handles data downloading, caching, and memory-efficient loading automatically, enabling researchers and practitioners to focus on model development rather than data pipeline engineering.

Step 2

Technology Stack

Hugging Face Datasets is built with the following technologies:

Language: Python (3.8+)
License: Apache 2.0
Stars: 21,534+
Owner: huggingface
Repo: https://github.com/huggingface/datasets

Core Dependencies:
- pyarrow (15.0.0+) - Columnar storage format
- numpy - Numerical computing
- pandas - Data manipulation
- fsspec - Filesystem abstraction (local, S3, GCS, etc.)
- requests - HTTP library
- tqdm - Progress bars
- xxhash - Fast hashing
- multiprocess - Parallel processing
- dill - Serialization for complex objects

Optional Integrations:
- torch (PyTorch) - Deep learning framework
- tensorflow - Deep learning framework
- jax - High-performance ML framework
- pillow - Image processing
- librosa - Audio processing
- soundfile - Audio I/O
- transformers - Hugging Face model library

Step 3

Installation

Install the datasets library using pip. The base installation is lightweight, with optional dependencies for specific modalities:

# Basic installation
pip install datasets

# With vision support (for image datasets)
pip install datasets[vision]

# With audio support (for audio datasets)
pip install datasets[audio]

# With all optional dependencies
pip install datasets[all]

# Install from source for latest development version
pip install git+https://github.com/huggingface/datasets.git

Step 4

Quick Start - Loading a Dataset

Load any dataset from the Hugging Face Hub with a single line of code. The library automatically downloads, caches, and prepares the data:

from datasets import load_dataset

# Load a text classification dataset
dataset = load_dataset("imdb")

# Access splits
train_data = dataset["train"]
test_data = dataset["test"]

# Inspect the dataset structure
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['text', 'label'],
#         num_rows: 25000
#     })
#     test: Dataset({
#         features: ['text', 'label'],
#         num_rows: 25000
#     })
# })

# Access individual examples
print(train_data[0])
# {'text': 'I loved this movie...', 'label': 1}

Step 5

Dataset Features and Types

The Datasets library supports rich data types, including text, images, audio, and custom structures. Features are strongly typed and automatically validated:

from datasets import load_dataset

# Text dataset
text_dataset = load_dataset("squad")  # Question answering
print(text_dataset["train"].features)
# {'id': Value('string'), 'title': Value('string'), 'context': Value('string'), ...}

# Image dataset
image_dataset = load_dataset("mnist")  # Handwritten digits
print(image_dataset["train"].features)
# {'image': Image(decode=True), 'label': ClassLabel(num_classes=10)}

# Audio dataset
audio_dataset = load_dataset("common_voice", "en", split="train")
print(audio_dataset.features)
# {'audio': Audio(sampling_rate=48000), 'sentence': Value('string'), ...}

# Multi-modal dataset
multimodal = load_dataset("nlphuji/flickr30k")  # Image captioning
print(multimodal["test"].features)
# {'image': Image(), 'caption': Sequence(Value('string'))}

Step 6

Data Processing and Transformation

Transform datasets efficiently using the map() function, which can process examples in batches, run across multiple processes (num_proc), and caches its results:

from datasets import load_dataset

dataset = load_dataset("imdb", split="train")

# Compute a word count per example (with batched=True, each column arrives as a list)
def tokenize_function(examples):
    return {"length": [len(text.split()) for text in examples["text"]]}

# Process in batches for efficiency
tokenized = dataset.map(
    tokenize_function,
    batched=True,
    batch_size=1000,
    num_proc=4  # Use 4 processes for parallel processing
)

# Filter dataset
long_reviews = dataset.filter(lambda x: len(x["text"]) > 500)

# Drop columns you don't need (here, keep only the text)
text_only = dataset.remove_columns(["label"])

# Sort by a column
sorted_dataset = dataset.sort("text")

# Shuffle the dataset
shuffled = dataset.shuffle(seed=42)

Step 7

Memory-Efficient Data Loading

Datasets are memory-mapped from disk using Apache Arrow, allowing you to work with datasets larger than RAM:

from datasets import load_dataset

# Load a large dataset: files are downloaded and cached on disk, then memory-mapped instead of read into RAM
dataset = load_dataset("c4", "en", split="train", streaming=False)

# For extremely large datasets, use streaming mode
# Data is loaded on-the-fly without downloading the full dataset
streaming_dataset = load_dataset(
    "c4",
    "en",
    split="train",
    streaming=True  # Loads data as needed
)

# Iterate over a streaming dataset
for i, example in enumerate(streaming_dataset):
    print(example["text"])
    if i >= 10:
        break  # Process only first 10 examples

# Take a subset from streaming dataset
subset = streaming_dataset.take(1000)  # First 1000 examples

# Skip examples in streaming mode
skipped = streaming_dataset.skip(5000)  # Skip first 5000

Step 8

Integration with PyTorch and TensorFlow

Convert datasets to native PyTorch or TensorFlow formats for seamless integration with training loops:

from datasets import load_dataset

dataset = load_dataset("imdb", split="train")

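# PyTorch integration (set_format changes the dataset in place)
dataset.set_format(type="torch", columns=["text", "label"])
# Now numeric columns such as "label" are returned as PyTorch tensors (strings stay Python str)

# Or use with_format, which returns a new formatted dataset and leaves the original unchanged
torch_dataset = dataset.with_format("torch")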

# Use with PyTorch DataLoader
from torch.utils.data import DataLoader
dataloader = DataLoader(torch_dataset, batch_size=32)

for batch in dataloader:
    # batch contains PyTorch tensors
    print(batch["label"].shape)  # torch.Size([32])
    break

# TensorFlow integration
tf_dataset = dataset.to_tf_dataset(
    columns=["text", "label"],
    batch_size=32,
    shuffle=True
)

# Use with TensorFlow/Keras
import tensorflow as tf
for batch in tf_dataset.take(1):
    print(batch["label"].shape)  # (32,)

Step 9

Loading Custom Datasets

Load datasets from local files in various formats including CSV, JSON, Parquet, and more:

from datasets import load_dataset

# Load from CSV
csv_dataset = load_dataset("csv", data_files="my_data.csv")

# Load from JSON
json_dataset = load_dataset("json", data_files="my_data.json")

# Load from multiple files with splits
dataset = load_dataset(
    "csv",
    data_files={
        "train": "train.csv",
        "test": "test.csv"
    }
)

# Load from Parquet (most efficient)
parquet_dataset = load_dataset("parquet", data_files="data.parquet")

# Load from remote URL
remote_dataset = load_dataset(
    "csv",
    data_files="https://example.com/data.csv"
)

# Load from cloud storage (S3, GCS, etc.)
s3_dataset = load_dataset(
    "csv",
    data_files="s3://my-bucket/data.csv"
)

Step 10

Creating Custom Datasets

Build datasets from scratch using Python dictionaries, pandas DataFrames, or generator functions:

from datasets import Dataset, DatasetDict
import pandas as pd

# From a Python dictionary
data_dict = {
    "text": ["Hello world", "How are you?", "I'm fine"],
    "label": [0, 1, 0]
}
dataset = Dataset.from_dict(data_dict)

# From a pandas DataFrame
df = pd.DataFrame({
    "text": ["Example 1", "Example 2"],
    "label": [0, 1]
})
dataset = Dataset.from_pandas(df)

# Create a DatasetDict with multiple splits
train_dict = {"text": ["train1", "train2"], "label": [0, 1]}
test_dict = {"text": ["test1"], "label": [0]}

dataset_dict = DatasetDict({
    "train": Dataset.from_dict(train_dict),
    "test": Dataset.from_dict(test_dict)
})

# From a generator function (memory efficient)
def data_generator():
    for i in range(1000):
        yield {"text": f"Example {i}", "label": i % 2}

dataset = Dataset.from_generator(data_generator)

Step 11
Sharing Datasets on the Hub
Upload your datasets to the Hugging Face Hub to share with the community or for private use:
```
# Install huggingface-hub for authentication
pip install huggingface-hub

# Login to Hugging Face (requires account at huggingface.co)
huggingface-cli login
```
⚠ Heads up: You'll need a Hugging Face account to upload datasets. Create one at https://huggingface.co/join

Step 12

Upload Dataset to Hub

Push your dataset to the Hugging Face Hub using the push_to_hub method:

from datasets import Dataset

# Create or load your dataset
data = {
    "text": ["Example 1", "Example 2", "Example 3"],
    "label": [0, 1, 0]
}
dataset = Dataset.from_dict(data)

# Upload to the Hub (creates a public dataset)
dataset.push_to_hub("my-username/my-dataset")

# Upload a private dataset
dataset.push_to_hub(
    "my-username/my-private-dataset",
    private=True
)

# Upload with a specific split name
dataset.push_to_hub(
    "my-username/my-dataset",
    split="train"
)

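# Upload a DatasetDict with multiple splits
from datasets import DatasetDict
train_dataset = dataset.select([0, 1])  # first two examples
test_dataset = dataset.select([2])      # last example
dataset_dict = DatasetDict({
    "train": train_dataset,
    "test": test_dataset
})
dataset_dict.push_to_hub("my-username/my-dataset")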

Step 13

Dataset Caching and Storage

Datasets are automatically cached to avoid re-downloading. Manage cache location and behavior:

import os

# Default cache location: ~/.cache/huggingface/datasets

# Set custom cache directory via environment variable
# (set it before importing datasets, which reads it at import time)
os.environ["HF_DATASETS_CACHE"] = "/path/to/custom/cache"

from datasets import load_dataset

# Or pass cache_dir directly
dataset = load_dataset(
    "imdb",
    cache_dir="/path/to/custom/cache"
)

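# Ignore the cache and force a fresh download
dataset = load_dataset("imdb", download_mode="force_redownload")

# Reuse cached data when it exists (the default); otherwise download it
# To fail instead of downloading, set the HF_DATASETS_OFFLINE=1 environment variable
dataset = load_dataset("imdb", download_mode="reuse_cache_if_exists")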

# Find the cache directory for a specific dataset (delete it to clear that cache)
from datasets import load_dataset_builder
builder = load_dataset_builder("imdb")
print(builder.cache_dir)  # Shows cache location

Step 14

Advanced Features - Sharding

Split large datasets into shards for distributed processing or to work with subsets:

from datasets import load_dataset

# Load only a specific shard (useful for distributed training)
dataset = load_dataset(
    "c4",
    "en",
    split="train",
    streaming=True
)

# Split into 8 shards, take shard 0
shard_0 = dataset.shard(num_shards=8, index=0)

# In a distributed setting, each worker gets one shard
# Worker 0:
worker_dataset = dataset.shard(num_shards=4, index=0)
# Worker 1:
worker_dataset = dataset.shard(num_shards=4, index=1)
# ...

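# select() and train_test_split() need a map-style Dataset (not a streaming one)
imdb = load_dataset("imdb", split="train")

# Select the first 1000 examples
subset = imdb.select(range(1000))

# Random 10% subset (the "test" side of a 90/10 split)
train_subset = imdb.train_test_split(test_size=0.1, seed=42)["test"]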

Step 15

Working with Images and Audio

The Datasets library provides specialized support for multimedia data with automatic decoding:

from datasets import load_dataset

# Load image dataset
image_dataset = load_dataset("cifar10", split="train")

# Images are automatically decoded to PIL Images
image = image_dataset[0]["img"]  # a PIL.Image.Image subclass
print(type(image))  # e.g. <class 'PIL.PngImagePlugin.PngImageFile'>

# Access image properties
print(image.size)  # (32, 32)

# Disable automatic decoding for performance
from datasets import Image
image_dataset = image_dataset.cast_column(
    "img",
    Image(decode=False)
)
# Now each image is returned undecoded, as a dict with 'bytes' and 'path'

# Load audio dataset
audio_dataset = load_dataset(
    "common_voice",
    "en",
    split="train",
    streaming=True
)

# Audio is automatically decoded to numpy arrays
audio_sample = next(iter(audio_dataset))
print(audio_sample["audio"])
# {'array': array([...]), 'sampling_rate': 48000, 'path': '...'}

# Resample audio to different sampling rate
from datasets import Audio
audio_dataset = audio_dataset.cast_column(
    "audio",
    Audio(sampling_rate=16000)
)

Step 16
Dataset Metrics and Evaluation
The datasets library previously included evaluation metrics, but these have been moved to the evaluate library:
```
# Install the evaluate library for metrics
pip install evaluate
```

Step 17

Using Evaluation Metrics

Load and compute metrics using the separate evaluate library:

import evaluate

# Load a metric
accuracy = evaluate.load("accuracy")

# Compute the metric
predictions = [0, 1, 0, 1]
references = [0, 1, 1, 1]
results = accuracy.compute(predictions=predictions, references=references)
print(results)  # {'accuracy': 0.75}

# Other common metrics
f1 = evaluate.load("f1")
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
wer = evaluate.load("wer")  # Word Error Rate for speech

# Metrics for specific tasks
glue_metric = evaluate.load("glue", "mrpc")  # GLUE benchmark
squad_metric = evaluate.load("squad")  # Question answering
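
For reference, the 0.75 accuracy shown above is simply the fraction of positions where prediction and reference agree. A plain-Python sketch of that arithmetic (not the evaluate library's implementation):

```python
predictions = [0, 1, 0, 1]
references = [0, 1, 1, 1]

# Count positions where the prediction equals the reference
matches = sum(p == r for p, r in zip(predictions, references))
accuracy_value = matches / len(references)

print(matches, accuracy_value)
```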

Step 18

Dataset Configuration and Variants

Many datasets have multiple configurations or subsets. Specify the configuration when loading:

from datasets import load_dataset, get_dataset_config_names

# List available configurations
configs = get_dataset_config_names("super_glue")
print(configs)  # ['boolq', 'cb', 'copa', 'multirc', 'record', 'rte', 'wic', 'wsc']

# Load a specific configuration
dataset = load_dataset("super_glue", "boolq")

# Some datasets require a configuration
try:
    dataset = load_dataset("super_glue")  # Error!
except ValueError as e:
    print("Must specify config:", e)

# Load dataset with language configuration
xnli = load_dataset("xnli", "en")  # English version
xnli_fr = load_dataset("xnli", "fr")  # French version

# Check dataset information
from datasets import load_dataset_builder
builder = load_dataset_builder("super_glue", "boolq")
print(builder.info.description)
print(builder.info.features)

Step 19

Performance Optimization Tips

Best practices for efficient dataset processing and loading:

1. Use streaming=True for datasets > 10GB to avoid downloading entire dataset
2. Use num_proc parameter in map() for parallel processing
3. Set batched=True in map() to process data in batches (10-100x faster)
4. Use remove_columns to drop unused features before processing
5. Cache processed datasets: dataset.save_to_disk("path/to/cache")
6. Use load_from_disk("path/to/cache") to reload cached datasets
7. For images/audio, set decode=False if you don't need the decoded data immediately
8. Use select() or shard() to work with subsets during development
9. Set keep_in_memory=False for large datasets that don't fit in RAM
10. Use Arrow-native operations when possible (filter, select, sort)

Step 20

Common Use Cases

Examples of typical workflows with Hugging Face Datasets:

from datasets import load_dataset, concatenate_datasets

# 1. Train-validation split
dataset = load_dataset("imdb", split="train")
train_val = dataset.train_test_split(test_size=0.1)
train = train_val["train"]
val = train_val["test"]

# 2. Combine multiple datasets ("dataset1"/"dataset2" are placeholder names; both must share the same features)
dataset1 = load_dataset("dataset1", split="train")
dataset2 = load_dataset("dataset2", split="train")
combined = concatenate_datasets([dataset1, dataset2])

# 3. Sample a subset for quick experimentation
small_dataset = dataset.shuffle(seed=42).select(range(1000))

# 4. Balance classes
from collections import Counter
labels = dataset["label"]
label_counts = Counter(labels)
# Undersample majority class or oversample minority

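# 5. Save processed dataset
def preprocessing_function(example):
    # Placeholder preprocessing: lowercase the text
    return {"text": example["text"].lower()}

processed = dataset.map(preprocessing_function)
processed.save_to_disk("./processed_data")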

# Later: reload without reprocessing
from datasets import load_from_disk
processed = load_from_disk("./processed_data")

Step 21

Troubleshooting Common Issues

Solutions to frequently encountered problems:

Problem: Dataset download fails or hangs
Solution: Check internet connection, try download_mode="force_redownload", or use a mirror

Problem: Out of memory errors
Solution: Use streaming=True, process in smaller batches, or increase system swap

Problem: Slow data loading
Solution: Increase num_proc in map(), use batched=True, cache preprocessed data

Problem: "Unable to find config" error
Solution: Specify the config name: load_dataset("name", "config_name")

Problem: Authentication errors with private datasets
Solution: Run 'huggingface-cli login' and use token parameter

Problem: Incompatible dataset version
Solution: Specify revision: load_dataset("name", revision="main")

Problem: Arrow memory mapping errors on Windows
Solution: Use keep_in_memory=True or disable memory mapping

Problem: Multiprocessing issues on Windows
Solution: Use num_proc=1 or wrap code in if __name__ == "__main__":

Step 22

Additional Resources

Key resources for learning more about Hugging Face Datasets:

Official Documentation: https://huggingface.co/docs/datasets
Dataset Hub: https://huggingface.co/datasets (100,000+ datasets)
GitHub Repository: https://github.com/huggingface/datasets
Community Forum: https://discuss.huggingface.co
Tutorials: https://huggingface.co/course
API Reference: https://huggingface.co/docs/datasets/package_reference
Dataset Cards: Each dataset has a detailed card explaining its contents
Course: https://huggingface.co/learn/nlp-course

Related Libraries:
- transformers: Pre-trained models for NLP, vision, audio
- evaluate: Metrics for ML evaluation
- tokenizers: Fast text tokenization
- accelerate: Distributed training made easy