TechSetupGuides
Level: Intermediate
Tags: datasets, machine-learning, nlp, computer-vision, audio, python, huggingface, pytorch, tensorflow, data-science

Hugging Face Datasets - Hub of ready-to-use ML datasets

The largest hub of ready-to-use datasets for machine learning models with over 100,000 datasets for NLP, computer vision, audio, and more.

  1. Step 1

    Overview

    Hugging Face Datasets is a library for easily accessing and sharing datasets for machine learning tasks. It provides a unified interface to over 100,000 datasets spanning NLP, computer vision, audio, and multimodal tasks. The library handles data downloading, caching, and memory-efficient loading automatically, enabling researchers and practitioners to focus on model development rather than data pipeline engineering.

  2. Step 2

    Technology Stack

    Hugging Face Datasets is built with the following technologies:

    Language: Python (3.8+)
    License: Apache 2.0
    Stars: 21,534+
    Owner: huggingface
    Repo: https://github.com/huggingface/datasets
    
    Core Dependencies:
    - pyarrow (15.0.0+) - Columnar storage format
    - numpy - Numerical computing
    - pandas - Data manipulation
    - fsspec - Filesystem abstraction (local, S3, GCS, etc.)
    - requests - HTTP library
    - tqdm - Progress bars
    - xxhash - Fast hashing
    - multiprocess - Parallel processing
    - dill - Serialization for complex objects
    
    Optional Integrations:
    - torch (PyTorch) - Deep learning framework
    - tensorflow - Deep learning framework
    - jax - High-performance ML framework
    - pillow - Image processing
    - librosa - Audio processing
    - soundfile - Audio I/O
    - transformers - Hugging Face model library
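
Because the integrations above are optional, it can help to check which of them are importable in the current environment before relying on them. The following is a minimal sketch using only the standard library; the helper name `installed_integrations` and the package-to-module mapping are illustrative assumptions, not part of the datasets API.

```python
from importlib.util import find_spec

# Import names for the optional integrations listed above (assumed mapping).
# Note that the pillow package is imported as "PIL".
OPTIONAL_MODULES = {
    "torch": "torch",
    "tensorflow": "tensorflow",
    "jax": "jax",
    "pillow": "PIL",
    "librosa": "librosa",
    "soundfile": "soundfile",
    "transformers": "transformers",
}


def installed_integrations():
    """Map each optional package name to True if its module can be found."""
    status = {}
    for package, module in OPTIONAL_MODULES.items():
        try:
            status[package] = find_spec(module) is not None
        except (ImportError, ValueError):
            # find_spec can raise for partially initialised modules.
            status[package] = False
    return status


for package, available in installed_integrations().items():
    print(f"{package:<12} {'installed' if available else 'missing'}")
```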
  3. Step 3

    Installation

    Install the datasets library using pip. The base installation is lightweight, with optional dependencies for specific modalities:

    # Basic installation
    pip install datasets
    
    # With vision support (for image datasets)
    # Quoting the argument keeps shells such as zsh from expanding the brackets
    pip install "datasets[vision]"
    
    # With audio support (for audio datasets)
    pip install "datasets[audio]"
    
    # With both optional modalities
    pip install "datasets[audio,vision]"
    
    # Install from source for latest development version
    pip install git+https://github.com/huggingface/datasets.git
  4. Step 4

    Quick Start - Loading a Dataset

    Load any dataset from the Hugging Face Hub with a single line of code. The library automatically downloads, caches, and prepares the data:

    from datasets import load_dataset
    
    # Load a text classification dataset
    dataset = load_dataset("imdb")
    
    # Access splits
    train_data = dataset["train"]
    test_data = dataset["test"]
    
    # Inspect the dataset structure
    print(dataset)
    # DatasetDict({
    #     train: Dataset({
    #         features: ['text', 'label'],
    #         num_rows: 25000
    #     })
    #     test: Dataset({
    #         features: ['text', 'label'],
    #         num_rows: 25000
    #     })
    # })
    
    # Access individual examples
    print(train_data[0])
    # {'text': 'I loved this movie...', 'label': 1}
  5. Step 5

    Dataset Features and Types

    The Datasets library supports rich data types, including text, images, audio, and custom structures. Features are strongly typed and automatically validated:

    from datasets import load_dataset
    
    # Text dataset
    text_dataset = load_dataset("squad")  # Question answering
    print(text_dataset["train"].features)
    # {'id': Value('string'), 'title': Value('string'), 'context': Value('string'), ...}
    
    # Image dataset
    image_dataset = load_dataset("mnist")  # Handwritten digits
    print(image_dataset["train"].features)
    # {'image': Image(decode=True), 'label': ClassLabel(num_classes=10)}
    
    # Audio dataset
    audio_dataset = load_dataset("common_voice", "en", split="train")
    print(audio_dataset.features)
    # {'audio': Audio(sampling_rate=48000), 'sentence': Value('string'), ...}
    
    # Multi-modal dataset
    multimodal = load_dataset("nlphuji/flickr30k")  # Image captioning
    print(multimodal["test"].features)
    # {'image': Image(), 'caption': Sequence(Value('string'))}
  6. Step 6

    Data Processing and Transformation

    Transform datasets efficiently using the map() function, which applies processing in parallel and caches results:

    from datasets import load_dataset
    
    dataset = load_dataset("imdb", split="train")
    
    # Compute a word count for each example
    # (with batched=True the function receives a dict of lists)
    def add_length(examples):
        return {"length": [len(text.split()) for text in examples["text"]]}
    
    # Process in batches for efficiency
    with_length = dataset.map(
        add_length,
        batched=True,
        batch_size=1000,
        num_proc=4  # Use 4 processes for parallel processing
    )
    
    # Filter dataset
    long_reviews = dataset.filter(lambda x: len(x["text"]) > 500)
    
    # Drop a column to keep only the text
    text_only = dataset.remove_columns(["label"])
    
    # Sort by a column
    sorted_dataset = dataset.sort("text")
    
    # Shuffle the dataset
    shuffled = dataset.shuffle(seed=42)
  7. Step 7

    Memory-Efficient Data Loading

    Datasets are memory-mapped from disk using Apache Arrow, allowing you to work with datasets larger than RAM:

    from datasets import load_dataset
    
    # Non-streaming mode downloads the full dataset to disk first (C4 "en" is
    # hundreds of gigabytes), then memory-maps it so rows are read on demand
    dataset = load_dataset("c4", "en", split="train", streaming=False)
    
    # For extremely large datasets, use streaming mode
    # Data is loaded on-the-fly without downloading the full dataset
    streaming_dataset = load_dataset(
        "c4",
        "en",
        split="train",
        streaming=True  # Loads data as needed
    )
    
    # Iterate over a streaming dataset
    for i, example in enumerate(streaming_dataset):
        print(example["text"])
        if i >= 10:
            break  # Process only first 10 examples
    
    # Take a subset from streaming dataset
    subset = streaming_dataset.take(1000)  # First 1000 examples
    
    # Skip examples in streaming mode
    skipped = streaming_dataset.skip(5000)  # Skip first 5000
  8. Step 8

    Integration with PyTorch and TensorFlow

    Convert datasets to native PyTorch or TensorFlow formats for seamless integration with training loops:

    from datasets import load_dataset
    
    dataset = load_dataset("imdb", split="train")
    
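    # PyTorch integration (set_format changes the dataset in place)
    dataset.set_format(type="torch", columns=["text", "label"])
    # Numeric columns such as "label" now come back as PyTorch tensors;
    # string columns such as "text" stay Python strings
    
    # Or use with_format, which returns a new formatted dataset instead
    torch_dataset = dataset.with_format("torch")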
    
    # Use with PyTorch DataLoader
    from torch.utils.data import DataLoader
    dataloader = DataLoader(torch_dataset, batch_size=32)
    
    for batch in dataloader:
        # batch contains PyTorch tensors
        print(batch["label"].shape)  # torch.Size([32])
        break
    
    # TensorFlow integration
    tf_dataset = dataset.to_tf_dataset(
        columns=["text", "label"],
        batch_size=32,
        shuffle=True
    )
    
    # Use with TensorFlow/Keras
    import tensorflow as tf
    for batch in tf_dataset.take(1):
        print(batch["label"].shape)  # (32,)
  9. Step 9

    Loading Custom Datasets

    Load datasets from local files in various formats including CSV, JSON, Parquet, and more:

    from datasets import load_dataset
    
    # Load from CSV
    csv_dataset = load_dataset("csv", data_files="my_data.csv")
    
    # Load from JSON
    json_dataset = load_dataset("json", data_files="my_data.json")
    
    # Load from multiple files with splits
    dataset = load_dataset(
        "csv",
        data_files={
            "train": "train.csv",
            "test": "test.csv"
        }
    )
    
    # Load from Parquet (most efficient)
    parquet_dataset = load_dataset("parquet", data_files="data.parquet")
    
    # Load from remote URL
    remote_dataset = load_dataset(
        "csv",
        data_files="https://example.com/data.csv"
    )
    
    # Load from cloud storage (S3, GCS, etc.)
    # Requires the matching fsspec backend, e.g. s3fs for S3, plus credentials
    s3_dataset = load_dataset(
        "csv",
        data_files="s3://my-bucket/data.csv"
    )
  10. Step 10

    Creating Custom Datasets

    Build datasets from scratch using Python dictionaries, pandas DataFrames, or generator functions:

    from datasets import Dataset, DatasetDict
    import pandas as pd
    
    # From a Python dictionary
    data_dict = {
        "text": ["Hello world", "How are you?", "I'm fine"],
        "label": [0, 1, 0]
    }
    dataset = Dataset.from_dict(data_dict)
    
    # From a pandas DataFrame
    df = pd.DataFrame({
        "text": ["Example 1", "Example 2"],
        "label": [0, 1]
    })
    dataset = Dataset.from_pandas(df)
    
    # Create a DatasetDict with multiple splits
    train_dict = {"text": ["train1", "train2"], "label": [0, 1]}
    test_dict = {"text": ["test1"], "label": [0]}
    
    dataset_dict = DatasetDict({
        "train": Dataset.from_dict(train_dict),
        "test": Dataset.from_dict(test_dict)
    })
    
    # From a generator function (memory efficient)
    def data_generator():
        for i in range(1000):
            yield {"text": f"Example {i}", "label": i % 2}
    
    dataset = Dataset.from_generator(data_generator)
  11. Step 11

    Sharing Datasets on the Hub

    Upload your datasets to the Hugging Face Hub to share with the community or for private use:

    # Install huggingface-hub for authentication
    pip install huggingface-hub
    
    # Login to Hugging Face (requires account at huggingface.co)
    huggingface-cli login
    ⚠ Heads up: You'll need a Hugging Face account to upload datasets. Create one at https://huggingface.co/join
  12. Step 12

    Upload Dataset to Hub

    Push your dataset to the Hugging Face Hub using the push_to_hub method:

    from datasets import Dataset
    
    # Create or load your dataset
    data = {
        "text": ["Example 1", "Example 2", "Example 3"],
        "label": [0, 1, 0]
    }
    dataset = Dataset.from_dict(data)
    
    # Upload to the Hub (creates a public dataset)
    dataset.push_to_hub("my-username/my-dataset")
    
    # Upload a private dataset
    dataset.push_to_hub(
        "my-username/my-private-dataset",
        private=True
    )
    
    # Upload with a specific split name
    dataset.push_to_hub(
        "my-username/my-dataset",
        split="train"
    )
    
    # Upload a DatasetDict with multiple splits
    from datasets import DatasetDict
    dataset_dict = DatasetDict({
        "train": Dataset.from_dict({"text": ["train1", "train2"], "label": [0, 1]}),
        "test": Dataset.from_dict({"text": ["test1"], "label": [0]})
    })
    dataset_dict.push_to_hub("my-username/my-dataset")
  13. Step 13

    Dataset Caching and Storage

    Datasets are automatically cached to avoid re-downloading. Manage cache location and behavior:

    import os
    
    # Default cache location: ~/.cache/huggingface/datasets
    
    # Set custom cache directory via environment variable
    # (set it before importing datasets; the library reads it at import time)
    os.environ["HF_DATASETS_CACHE"] = "/path/to/custom/cache"
    
    from datasets import load_dataset
    
    # Or pass cache_dir directly
    dataset = load_dataset(
        "imdb",
        cache_dir="/path/to/custom/cache"
    )
    
    # Ignore the cache and download everything again
    dataset = load_dataset("imdb", download_mode="force_redownload")
    
    # Reuse cached downloads but rebuild the prepared dataset
    dataset = load_dataset("imdb", download_mode="reuse_cache_if_exists")
    
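    # Locate the cache directory for a specific dataset
    from datasets import load_dataset_builder
    builder = load_dataset_builder("imdb")
    print(builder.cache_dir)  # Shows cache location
    
    # Remove cached intermediate files (e.g. from map()) for a loaded dataset
    dataset.cleanup_cache_files()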
  14. Step 14

    Advanced Features - Sharding

    Split large datasets into shards for distributed processing or to work with subsets:

    from datasets import load_dataset
    
    # Load a map-style dataset; select() and train_test_split() used below
    # are not available on streaming datasets
    dataset = load_dataset("imdb", split="train")
    
    # Split into 8 shards, take shard 0
    shard_0 = dataset.shard(num_shards=8, index=0)
    
    # In a distributed setting, each worker gets one shard
    # Worker 0:
    worker_dataset = dataset.shard(num_shards=4, index=0)
    # Worker 1:
    worker_dataset = dataset.shard(num_shards=4, index=1)
    # ...
    
    # Select a fixed-size subset
    subset = dataset.select(range(1000))  # First 1000 examples
    
    # Random subset (10% of data); the "test" side of the split holds the 10%
    subset_10pct = dataset.train_test_split(test_size=0.1, seed=42)["test"]
  15. Step 15

    Working with Images and Audio

    The Datasets library provides specialized support for multimedia data with automatic decoding:

    from datasets import load_dataset, Image
    
    # Load image dataset
    image_dataset = load_dataset("cifar10", split="train")
    
    # Images are automatically decoded to PIL Images
    image = image_dataset[0]["img"]  # PIL.Image.Image
    print(type(image))  # a PIL image subclass; the exact class depends on the stored format
    
    # Access image properties
    print(image.size)  # (32, 32)
    
    # Disable automatic decoding for performance
    image_dataset = image_dataset.cast_column(
        "img",
        Image(decode=False)
    )
    # Now each image is returned as a dict with the raw bytes and/or file path
    
    # Load audio dataset
    audio_dataset = load_dataset(
        "common_voice",
        "en",
        split="train",
        streaming=True
    )
    
    # Audio is automatically decoded to numpy arrays
    audio_sample = next(iter(audio_dataset))
    print(audio_sample["audio"])
    # {'array': array([...]), 'sampling_rate': 48000, 'path': '...'}
    
    # Resample audio to different sampling rate
    from datasets import Audio
    audio_dataset = audio_dataset.cast_column(
        "audio",
        Audio(sampling_rate=16000)
    )
  16. Step 16

    Dataset Metrics and Evaluation

    The datasets library previously included evaluation metrics, but these have been moved to the evaluate library:

    # Install the evaluate library for metrics
    pip install evaluate
  17. Step 17

    Using Evaluation Metrics

    Load and compute metrics using the separate evaluate library:

    import evaluate
    
    # Load a metric
    accuracy = evaluate.load("accuracy")
    
    # Compute the metric
    predictions = [0, 1, 0, 1]
    references = [0, 1, 1, 1]
    results = accuracy.compute(predictions=predictions, references=references)
    print(results)  # {'accuracy': 0.75}
    
    # Other common metrics
    f1 = evaluate.load("f1")
    bleu = evaluate.load("bleu")
    rouge = evaluate.load("rouge")
    wer = evaluate.load("wer")  # Word Error Rate for speech
    
    # Metrics for specific tasks
    glue_metric = evaluate.load("glue", "mrpc")  # GLUE benchmark
    squad_metric = evaluate.load("squad")  # Question answering
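
For reference, the accuracy value printed above is simply the fraction of predictions that match their references. The following plain-Python sketch reproduces that arithmetic; it is an illustration of the formula, not the evaluate library's implementation.

```python
def accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    matches = sum(pred == ref for pred, ref in zip(predictions, references))
    return matches / len(references)


predictions = [0, 1, 0, 1]
references = [0, 1, 1, 1]

# Three of the four predictions match, so the result is 3 / 4.
score = accuracy(predictions, references)
print({"accuracy": score})
```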
  18. Step 18

    Dataset Configuration and Variants

    Many datasets have multiple configurations or subsets. Specify the configuration when loading:

    from datasets import load_dataset, get_dataset_config_names
    
    # List available configurations
    configs = get_dataset_config_names("super_glue")
    print(configs)  # e.g. ['axb', 'axg', 'boolq', 'cb', 'copa', 'multirc', 'record', 'rte', 'wic', 'wsc', 'wsc.fixed']
    
    # Load a specific configuration
    dataset = load_dataset("super_glue", "boolq")
    
    # Some datasets require a configuration
    try:
        dataset = load_dataset("super_glue")  # Error!
    except ValueError as e:
        print("Must specify config:", e)
    
    # Load dataset with language configuration
    xnli = load_dataset("xnli", "en")  # English version
    xnli_fr = load_dataset("xnli", "fr")  # French version
    
    # Check dataset information
    from datasets import load_dataset_builder
    builder = load_dataset_builder("super_glue", "boolq")
    print(builder.info.description)
    print(builder.info.features)
  19. Step 19

    Performance Optimization Tips

    Best practices for efficient dataset processing and loading:

    1. Use streaming=True for datasets > 10GB to avoid downloading entire dataset
    2. Use num_proc parameter in map() for parallel processing
    3. Set batched=True in map() to process data in batches (10-100x faster)
    4. Use remove_columns to drop unused features before processing
    5. Cache processed datasets: dataset.save_to_disk("path/to/cache")
    6. Use load_from_disk("path/to/cache") to reload cached datasets
    7. For images/audio, set decode=False if you don't need the decoded data immediately
    8. Use select() or shard() to work with subsets during development
    9. Set keep_in_memory=False for large datasets that don't fit in RAM
    10. Use Arrow-native operations when possible (filter, select, sort)
  20. Step 20

    Common Use Cases

    Examples of typical workflows with Hugging Face Datasets:

    from datasets import load_dataset, concatenate_datasets
    
    # 1. Train-validation split
    dataset = load_dataset("imdb", split="train")
    train_val = dataset.train_test_split(test_size=0.1)
    train = train_val["train"]
    val = train_val["test"]
    
    # 2. Combine multiple datasets ("dataset1" and "dataset2" are placeholder
    # names; both datasets must have the same features)
    dataset1 = load_dataset("dataset1", split="train")
    dataset2 = load_dataset("dataset2", split="train")
    combined = concatenate_datasets([dataset1, dataset2])
    
    # 3. Sample a subset for quick experimentation
    small_dataset = dataset.shuffle(seed=42).select(range(1000))
    
    # 4. Balance classes
    from collections import Counter
    labels = dataset["label"]
    label_counts = Counter(labels)
    # Undersample majority class or oversample minority
    
    # 5. Save processed dataset (preprocessing_function is a placeholder
    # for your own function)
    processed = dataset.map(preprocessing_function)
    processed.save_to_disk("./processed_data")
    
    # Later: reload without reprocessing
    from datasets import load_from_disk
    processed = load_from_disk("./processed_data")
  21. Step 21

    Troubleshooting Common Issues

    Solutions to frequently encountered problems:

    Problem: Dataset download fails or hangs
    Solution: Check internet connection, try download_mode="force_redownload", or use a mirror
    
    Problem: Out of memory errors
    Solution: Use streaming=True, process in smaller batches, or increase system swap
    
    Problem: Slow data loading
    Solution: Increase num_proc in map(), use batched=True, cache preprocessed data
    
    Problem: "Unable to find config" error
    Solution: Specify the config name: load_dataset("name", "config_name")
    
    Problem: Authentication errors with private datasets
    Solution: Run 'huggingface-cli login', or pass the token parameter to load_dataset()
    
    Problem: Incompatible dataset version
    Solution: Specify revision: load_dataset("name", revision="main")
    
    Problem: Arrow memory mapping errors on Windows
    Solution: Use keep_in_memory=True or disable memory mapping
    
    Problem: Multiprocessing issues on Windows
    Solution: Use num_proc=1 or wrap code in if __name__ == "__main__":
  22. Step 22

    Additional Resources

    Key resources for learning more about Hugging Face Datasets:

    Official Documentation: https://huggingface.co/docs/datasets
    Dataset Hub: https://huggingface.co/datasets (100,000+ datasets)
    GitHub Repository: https://github.com/huggingface/datasets
    Community Forum: https://discuss.huggingface.co
    Tutorials: https://huggingface.co/course
    API Reference: https://huggingface.co/docs/datasets/package_reference
    Dataset Cards: Each dataset has a detailed card explaining its contents
    Course: https://huggingface.co/learn/nlp-course
    
    Related Libraries:
    - transformers: Pre-trained models for NLP, vision, audio
    - evaluate: Metrics for ML evaluation
    - tokenizers: Fast text tokenization
    - accelerate: Distributed training made easy
