
Transformers(3)_Datasets

Datasets is an open-source library from Hugging Face for easily accessing and sharing datasets for audio, computer vision, and natural language processing (NLP) tasks. A dataset can be loaded with a single line of code, and the library's powerful data-processing methods quickly get it ready for training a deep learning model. Backed by the Apache Arrow format, large datasets are processed with zero-copy reads and without memory limits, for optimal speed and efficiency.
Datasets is also deeply integrated with the Hugging Face Hub, which makes it easy to load datasets and share them with the wider machine learning community.

Datasets can be used to work with three kinds of data:

  1. Text
  2. Audio
  3. Images

For demonstration purposes, I will only use text data in the examples below; to learn how audio and image data are handled, please refer to the Hugging Face documentation.

1. Quick start

# Load a dataset from the Hugging Face Hub
from datasets import load_dataset
dataset = load_dataset("nyu-mll/glue", "mrpc", split="train")
# dataset = load_dataset("datas_save/mrcp", split="train")  # If the Hub is unreachable, you can download the dataset first and load it from a local path instead
# Load the bert-base-uncased model and tokenizer
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Load the model and tokenizer from a local path (the model can be downloaded offline, e.g. from the ModelScope community)
model = AutoModelForSequenceClassification.from_pretrained("./models_save/bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("./models_save/bert-base-uncased")
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./models_save/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
# Encode the dataset with the tokenizer
def encode(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length")

dataset = dataset.map(encode, batched=True)
print(dataset[0])
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0, 'input_ids': [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# Add a labels column (Transformers models expect the label key to be named labels)
dataset = dataset.map(lambda examples: {"labels": examples["label"]}, batched=True)
print(dataset[0])
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0, 'input_ids': [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': 1}
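
As an aside, the same effect can be achieved more directly with rename_column, which renames the column instead of copying it; a minimal sketch, meant as an alternative to the map call above:

# Alternative to the map above: rename the column in place (drops the original "label" column)
dataset = dataset.rename_column("label", "labels")
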
# Use the dataset in PyTorch
import torch
# Format the dataset as PyTorch tensors
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
# Build a DataLoader on top of the dataset
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

At this point we can use the dataloader directly to train a PyTorch deep learning model.
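
For completeness, here is a minimal sketch of what such a training loop could look like; the optimizer choice, learning rate and device handling are my own assumptions rather than part of the original example:

import torch
from torch.optim import AdamW

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.train()
optimizer = AdamW(model.parameters(), lr=5e-5)  # assumed hyperparameters

for batch in dataloader:
    # Move each tensor in the batch to the target device
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)   # the model returns a loss when labels are provided
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()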

2. Load a dataset from the Hub

2.1 Inspect the dataset description and feature types

# To avoid pointless downloads, we can inspect a dataset's description and feature types before downloading it
from datasets import load_dataset_builder
ds_builder = load_dataset_builder("nyu-mll/glue", "mrpc")
ds_builder.info.description
''
ds_builder.info.features
{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

2.2 Download the dataset with load_dataset

# Once you have confirmed the dataset is what you need, go ahead and download it with load_dataset
from datasets import load_dataset
dataset = load_dataset("nyu-mll/glue", "mrpc", split="train")
# Earlier we passed the split argument to load a subset of the raw data
# Split names can differ between datasets, so we can look up the dataset's split names first
from datasets import get_dataset_split_names
get_dataset_split_names("nyu-mll/glue", "mrpc")
['train', 'validation', 'test']

We can see that the "nyu-mll/glue", "mrpc" dataset is split into ['train', 'validation', 'test'].

# Pass a different value for split to load a different subset
dataset = load_dataset("nyu-mll/glue", "mrpc", split="validation")
dataset
Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 408
})
# If split is not specified, all splits are loaded into a dictionary by default
dataset = load_dataset("nyu-mll/glue", "mrpc")
dataset
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

2.3 Dataset configurations

Sometimes a large dataset contains several smaller sub-datasets (configurations). In that case you must specify the configuration name when loading the data; otherwise an error is raised.

from datasets import get_dataset_config_names
get_dataset_config_names("nyu-mll/glue")
['ax',
 'cola',
 'mnli',
 'mnli_matched',
 'mnli_mismatched',
 'mrpc',
 'qnli',
 'qqp',
 'rte',
 'sst2',
 'stsb',
 'wnli']
# Specify a different configuration name to load a different sub-dataset
load_dataset("nyu-mll/glue", "cola")
DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

2.4 Using a dataset's own loading script

Some dataset repositories contain a loading script with the Python code used to generate the dataset. If you want to load such a dataset with its own script, you need to set

trust_remote_code=True

otherwise an exception will be raised.

from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset

ds = load_dataset("nyu-mll/glue", "mrpc", split="train", trust_remote_code=True)
get_dataset_config_names("nyu-mll/glue", trust_remote_code=True)
['ax',
 'cola',
 'mnli',
 'mnli_matched',
 'mnli_mismatched',
 'mrpc',
 'qnli',
 'qqp',
 'rte',
 'sst2',
 'stsb',
 'wnli']
get_dataset_split_names("nyu-mll/glue", "mrpc", trust_remote_code=True)
['train', 'validation', 'test']

3. Loading a dataset as an iterable

Datasets has two kinds of dataset objects: the regular Dataset and the IterableDataset.

A Dataset provides fast random access to rows and is memory-mapped, so even large datasets can be loaded while using only a relatively small amount of device memory. But for very large datasets that will not even fit on disk or in memory, an IterableDataset lets you access and use the data without waiting for it to download completely!

from datasets import load_dataset
# Set streaming=True to get an IterableDataset
iterable_dataset = load_dataset("nyu-mll/glue", "mrpc", split="train", streaming=True)
for example in iterable_dataset:
    print(example)
    break
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0}
type(iterable_dataset)
datasets.iterable_dataset.IterableDataset
# A Dataset can also be converted into an IterableDataset
ds_iterable = ds.to_iterable_dataset()
type(ds_iterable)
datasets.iterable_dataset.IterableDataset
# In a regular Dataset we can access the rows we want by index
ds[2:4]
{'sentence1': ['They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .',
  'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .'],
 'sentence2': ["On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .",
  'Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .'],
 'label': [1, 0],
 'idx': [2, 3]}
# An IterableDataset cannot be indexed like this; the data can only be consumed by iterating
ds_iterable[2:4]
---------------------------------------------------------------------------

NotImplementedError                       Traceback (most recent call last)

Cell In[57], line 2
      1 # An IterableDataset cannot be indexed like this; the data can only be consumed by iterating
----> 2 ds_iterable[2:4]


File ~/miniconda3/lib/python3.10/site-packages/torch/utils/data/dataset.py:61, in Dataset.__getitem__(self, index)
     60 def __getitem__(self, index) -> T_co:
---> 61     raise NotImplementedError("Subclasses of Dataset should implement __getitem__.")


NotImplementedError: Subclasses of Dataset should implement __getitem__.
# Fetch data by iterating
next(iter(ds_iterable))
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}
# Take a fixed number of examples by iterating
list(ds_iterable.take(10))
[{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
  'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
  'label': 1,
  'idx': 0},
 {'sentence1': "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
  'sentence2': "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .",
  'label': 0,
  'idx': 1},
 {'sentence1': 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .',
  'sentence2': "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .",
  'label': 1,
  'idx': 2},
 {'sentence1': 'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .',
  'sentence2': 'Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .',
  'label': 0,
  'idx': 3},
 {'sentence1': 'The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange .',
  'sentence2': 'PG & E Corp. shares jumped $ 1.63 or 8 percent to $ 21.03 on the New York Stock Exchange on Friday .',
  'label': 1,
  'idx': 4},
 {'sentence1': 'Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier .',
  'sentence2': "With the scandal hanging over Stewart 's company , revenue the first quarter of the year dropped 15 percent from the same period a year earlier .",
  'label': 1,
  'idx': 5},
 {'sentence1': 'The Nasdaq had a weekly gain of 17.27 , or 1.2 percent , closing at 1,520.15 on Friday .',
  'sentence2': 'The tech-laced Nasdaq Composite .IXIC rallied 30.46 points , or 2.04 percent , to 1,520.15 .',
  'label': 0,
  'idx': 6},
 {'sentence1': 'The DVD-CCA then appealed to the state Supreme Court .',
  'sentence2': 'The DVD CCA appealed that decision to the U.S. Supreme Court .',
  'label': 1,
  'idx': 7},
 {'sentence1': 'That compared with $ 35.18 million , or 24 cents per share , in the year-ago period .',
  'sentence2': 'Earnings were affected by a non-recurring $ 8 million tax benefit in the year-ago period .',
  'label': 0,
  'idx': 8},
 {'sentence1': 'Shares of Genentech , a much larger company with several products on the market , rose more than 2 percent .',
  'sentence2': 'Shares of Xoma fell 16 percent in early trade , while shares of Genentech , a much larger company with several products on the market , were up 2 percent .',
  'label': 0,
  'idx': 10}]

4. Preprocessing data with Datasets

In almost every preprocessing scenario, depending on the modality of your dataset, you will need to:

  • Tokenize a text dataset.
  • Resample an audio dataset.
  • Apply transforms to an image dataset.
from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("./models_save/bert-base-uncased")
dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
# Get a single text example
test_text = dataset[0]["text"]
test_text
'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
# Convert the text into tokens
tokenizer(test_text)
{'input_ids': [101, 1996, 2600, 2003, 16036, 2000, 2022, 1996, 7398, 2301, 1005, 1055, 2047, 1000, 16608, 1000, 1998, 2008, 2002, 1005, 1055, 2183, 2000, 2191, 1037, 17624, 2130, 3618, 2084, 7779, 29058, 8625, 13327, 1010, 3744, 1011, 18856, 19513, 3158, 5477, 4168, 2030, 7112, 16562, 2140, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The above shows how a single example is converted into a token sequence; in practice every text in the dataset needs to be converted this way. The fastest approach is to define a conversion function and use the dataset's map method to tokenize all the texts in batches.

def tokenization(example):
    return tokenizer(example["text"])

dataset = dataset.map(tokenization, batched=True)
print(dataset[0])
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1, 'input_ids': [101, 1996, 2600, 2003, 16036, 2000, 2022, 1996, 7398, 2301, 1005, 1055, 2047, 1000, 16608, 1000, 1998, 2008, 2002, 1005, 1055, 2183, 2000, 2191, 1037, 17624, 2130, 3618, 2084, 7779, 29058, 8625, 13327, 1010, 3744, 1011, 18856, 19513, 3158, 5477, 4168, 2030, 7112, 16562, 2140, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

From the result we can see that this effectively adds three key-value pairs to each example, and those three keys are exactly the tokenizer's outputs.

# Now we only need to convert the dataset into the format the deep learning framework expects (e.g. a PyTorch dataset)
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"])
dataset[0]
{'label': tensor(1),
 'input_ids': tensor([  101,  1996,  2600,  2003, 16036,  2000,  2022,  1996,  7398,  2301,
          1005,  1055,  2047,  1000, 16608,  1000,  1998,  2008,  2002,  1005,
          1055,  2183,  2000,  2191,  1037, 17624,  2130,  3618,  2084,  7779,
         29058,  8625, 13327,  1010,  3744,  1011, 18856, 19513,  3158,  5477,
          4168,  2030,  7112, 16562,  2140,  1012,   102]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}
# Check the data type of input_ids
type(dataset[0]['input_ids'])
torch.Tensor
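
Note that, unlike the quick-start example, the tokenization here did not pad the sequences, so batching examples of different lengths in a DataLoader needs a collator. A minimal sketch using DataCollatorWithPadding; the batch size is an arbitrary choice of mine:

import torch
from transformers import DataCollatorWithPadding

# Dynamically pad each batch to the length of its longest example;
# the collator also renames the "label" column to "labels" for the model
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, collate_fn=data_collator)

batch = next(iter(dataloader))
print(batch["input_ids"].shape)  # (16, longest sequence length in this batch)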

5. Building a custom Dataset

A custom Dataset can be created in two ways:

  1. From a single file
  2. From a folder

5.1 Building from a single file

Single-file datasets support many common formats, such as csv, json/jsonl, parquet, and txt.

from datasets import load_dataset

dataset = load_dataset("csv", data_files="./datas_save/area_street_data.csv", split='train')
dataset[:5]
{'area_name': ['东城区', '东城区', '东城区', '东城区', '东城区'],
 'area_code': [110101, 110101, 110101, 110101, 110101],
 'street_name': ['东华门街道', '景山街道', '交道口街道', '安定门街道', '北新桥街道'],
 'street_code': [110101001, 110101002, 110101003, 110101004, 110101005]}
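
The same pattern works for the other formats; a brief sketch, where the file paths are hypothetical placeholders:

from datasets import load_dataset

# JSON Lines: one JSON object per line
json_dataset = load_dataset("json", data_files="./datas_save/my_data.jsonl", split="train")
# Parquet
parquet_dataset = load_dataset("parquet", data_files="./datas_save/my_data.parquet", split="train")
# Plain text: one example per line, exposed as a single "text" column
text_dataset = load_dataset("text", data_files="./datas_save/my_data.txt", split="train")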

5.2 Building from a folder

Folder-based construction is generally used for large collections of audio and image data files.

These are handled, respectively, with

  • AudioFolder
  • ImageFolder

Since this article only covers text-data processing, I will not go into detail here; to learn more, head over to the Hugging Face documentation.
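
Just to give a flavour, loading an image folder might look like the following sketch; the directory path is hypothetical, and class labels are inferred from the sub-folder names:

from datasets import load_dataset

# Each sub-folder of ./datas_save/images (e.g. cat/, dog/) becomes one class label
image_dataset = load_dataset("imagefolder", data_dir="./datas_save/images", split="train")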

6. Sharing your own Dataset to the Hugging Face Hub

The Hugging Face Hub provides a web UI, so you can log in to the Hugging Face website and upload a dataset manually. See the official tutorial for details.
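
A dataset can also be pushed programmatically with push_to_hub; a minimal sketch, assuming you are already logged in (for example via huggingface-cli login) and where the repository id is a placeholder:

from datasets import load_dataset

dataset = load_dataset("csv", data_files="./datas_save/area_street_data.csv", split="train")
# Upload the dataset to your account on the Hub
dataset.push_to_hub("your-username/area_street_data")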
