# PyTorch: 3-Data And Data Processing

Study notes on the PyTorch framework.

Colab notebook: Data And Data Processing

# Data-Fashion MNIST

## Why Study A Dataset?

Data is the primary ingredient, the raw material, of deep learning.

Data focused considerations:

• Who created the dataset?
• How was the dataset created?
• What transformations were used?
• What intent does the dataset have?
• Possible unintentional consequences?
• Is the dataset biased?
• Are there ethical issues with the dataset?

## What Is The MNIST Dataset?

The MNIST dataset (Modified National Institute of Standards and Technology database) is a famous dataset of handwritten digits that is commonly used for training image processing systems in machine learning.

MNIST is famous because of how often the dataset is used. It’s common for two reasons:

1. Beginners use it because it’s easy.
2. Researchers use it to benchmark (compare) different models.

The dataset consists of 70,000 images of handwritten digits with the following split:

• 60,000 training images
• 10,000 testing images

MNIST has been used so widely, and image recognition technology has improved so much, that the dataset is now considered too easy. This is why the Fashion-MNIST dataset was created.

## What Is Fashion-MNIST?

Fashion-MNIST, as the name suggests, is a dataset of fashion items. Specifically, the dataset has the following ten classes of fashion items:

| Index | Label |
| --- | --- |
| 0 | T-shirt/top |
| 1 | Trouser |
| 2 | Pullover |
| 3 | Dress |
| 4 | Coat |
| 5 | Sandal |
| 6 | Shirt |
| 7 | Sneaker |
| 8 | Bag |
| 9 | Ankle boot |

Fashion-MNIST is based on the assortment on Zalando’s website: its images are the product photos of items sold there. Zalando is a German-based multinational fashion commerce company that was founded in 2008.

The paper describes the specific ways in which Fashion-MNIST mirrors the original dataset, but one of them we have already seen: the number of classes.

• MNIST: has 10 classes (one for each digit 0-9)
• Fashion-MNIST: has 10 classes (this is intentional)

### How Fashion-MNIST Was Built

Unlike the MNIST images, the Fashion-MNIST images weren’t hand-drawn. They are actual images from Zalando’s website, converted through a series of transformations into 28 × 28 images.

There are four general steps that we’ll be following as we move through this project:

1. Prepare the data
2. Build the model
3. Train the model
4. Analyze the model’s results

## The ETL Process

In this post, we’ll kick things off by preparing the data. To prepare our data, we’ll be following what is loosely known as an ETL (extract, transform, load) process.

• Extract data from a data source.

• Transform data into a desirable format.

• Load data into a suitable structure.

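The main components of the PyTorch package:

| Package | Description |
| --- | --- |
| torch | The top-level PyTorch package and tensor library |
| torch.nn | Contains modules and extensible classes for building neural networks |
| torch.nn.functional | A functional interface for building neural networks, such as loss functions, activation functions, and convolution operations |
| torch.utils | Utility classes such as datasets and data loaders that make data preprocessing easier |
| torchvision | Provides popular datasets, model architectures, and image transformations for computer vision |

• torchvision.transforms: An interface that contains common transforms for image processing.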

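• pandas: https://www.pypandas.cn/

pandas is a powerful toolkit for analyzing structured data. It is built on NumPy (which provides high-performance array operations) and is used for data mining and data analysis; it also provides data-cleaning features.

• NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation, and more.

• Matplotlib: https://www.matplotlib.org.cn/

Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, Jupyter notebooks, web application servers, and four graphical user interface toolkits.

For simple plotting, the pyplot module provides a MATLAB-like interface, particularly when combined with IPython. Advanced users have full control of line styles, font properties, axes properties, and so on, through an object-oriented interface or a set of functions familiar to MATLAB users.

• pdb is the Python debugger.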

## Preparing Our Data

1. Extract: Get the Fashion-MNIST image data from the source.
2. Transform: Put our data into tensor form.
3. Load: Put our data into an object to make it easily accessible.

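For these purposes, PyTorch provides us with two classes:

| Class | Description |
| --- | --- |
| torch.utils.data.Dataset | An abstract class for representing a dataset |
| torch.utils.data.DataLoader | Wraps a dataset and provides access to the underlying data |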

To create a custom dataset using PyTorch, we extend the `Dataset` class by creating a subclass that implements its required methods. Our new subclass can then be passed to a PyTorch `DataLoader` object.

All subclasses of the `Dataset` class must override `__len__`, which provides the size of the dataset, and `__getitem__`, which supports integer indexing in the range from 0 to `len(self)` exclusive.
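As a minimal sketch of such a subclass, the toy dataset below implements both required methods. The class name `SquaresDataset` and its contents are hypothetical and chosen only for illustration; they are not part of the original notes.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i**2)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Size of the dataset, used by len() and by DataLoader samplers.
        return self.size

    def __getitem__(self, index):
        # Integer indexing in the range 0 <= index < len(self).
        if not 0 <= index < self.size:
            raise IndexError(index)
        x = torch.tensor(float(index))
        return x, x ** 2


dataset = SquaresDataset(5)
x, y = dataset[3]  # x is tensor(3.), y is tensor(9.)
```

Because the class implements `__len__` and `__getitem__`, an instance can be handed directly to `torch.utils.data.DataLoader`.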

### PyTorch Torchvision Package

The torchvision package gives us access to the following resources:

• Datasets (like MNIST and Fashion-MNIST)
• Models (like VGG16)
• Transforms
• Utils

The PyTorch FashionMNIST dataset simply extends the MNIST dataset and overrides the URLs.

Let’s see now how we can take advantage of torchvision.

#### PyTorch Dataset Class

To get an instance of the FashionMNIST dataset using torchvision, we just create one like so:

We specify the following arguments:

| Parameter | Description |
| --- | --- |
| root | The location on disk where the data is located. |
| train | Whether the dataset is the training set. |
| transform | A composition of transformations that should be performed on the dataset elements. |

To create a DataLoader wrapper for our training set, we do it like this:
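The original snippet is missing here, so this is a sketch under stated assumptions. To keep it runnable offline, `train_set` is replaced by a small stand-in `TensorDataset` of blank 1 × 28 × 28 images; with the real Fashion-MNIST training set, only the `DataLoader(...)` call would be needed.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for train_set (an assumption so the sketch runs offline):
# 3,000 blank 1x28x28 "images" with labels cycling through 0-9.
train_set = TensorDataset(
    torch.zeros(3000, 1, 28, 28),
    torch.arange(3000) % 10,
)

train_loader = DataLoader(
    train_set,
    batch_size=1000,  # samples per batch
    shuffle=True,     # reshuffle the data every epoch
    num_workers=0,    # 0 (the default) loads data in the main process
)

num_batches = len(train_loader)  # 3000 samples / 1000 per batch
```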

We just pass train_set as an argument. Now, we can leverage the loader for tasks that would otherwise be pretty complicated to implement by hand:

• batch_size (1000 in our case)
• shuffle (True in our case)
• num_workers (the default is 0, which means the main process will be used)

## Working With The Training Set

In this post, we are going to see how we can work with the dataset and the data loader objects that we created in the previous post.

### PyTorch Dataset

Suppose we want to see the labels for each image. This can be done like so:

If we want to see how many of each label exists in the dataset, we can use the PyTorch bincount() function like so:
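Both snippets are missing from these notes. In recent torchvision releases the labels are exposed as `train_set.targets` (older releases called the attribute `train_labels`), and the per-class counts come from calling `bincount()` on that tensor. The sketch below uses a synthetic label tensor as a stand-in for `train_set.targets`, which is an assumption made so that it runs without downloading the data.

```python
import torch

# Stand-in for train_set.targets (an assumption): 60,000 labels,
# cycling through the ten class indices 0-9.
targets = torch.arange(60000) % 10

# With the real dataset this would be: train_set.targets.bincount()
counts = targets.bincount()

print(targets[:10])  # the first ten labels
print(counts)        # number of samples per class index
```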

#### Class Imbalance: Balanced And Unbalanced Datasets

The label counts show that the Fashion-MNIST dataset is uniform with respect to the number of samples in each class: the training set has 6,000 samples for each of its 10 classes. As a result, this dataset is said to be balanced. If the classes had a varying number of samples, we would call the set an unbalanced dataset.

To read more about the ways to mitigate unbalanced datasets in deep learning, see this paper: A systematic study of the class imbalance problem in convolutional neural networks.

#### Accessing Data In The Training Set

`iter(object)`

• object: a collection object that supports iteration
• Return value: an iterator object

`next(iterator[, default])`

• Commonly used together with `iter()`
• Returns the next item from the iterator

To access an individual element from the training set, we first pass the train_set object to Python’s iter() built-in function, which returns an object representing a stream of data.

With the stream of data, we can use Python built-in next() function to get the next data element in the stream of data.

After passing the sample to the len() function, we can see that the sample contains two items, because the dataset contains image-label pairs. Each sample we retrieve from the training set contains the image data as a tensor and the corresponding label (a tensor in older torchvision releases, a plain Python int in recent ones).
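The steps above can be sketched as follows. The `train_set` here is a small stand-in `TensorDataset` (an assumption so the example runs offline), so its labels are zero-dimensional tensors; the access pattern is the same for the real Fashion-MNIST training set.

```python
import torch
from torch.utils.data import TensorDataset

# Stand-in for train_set (an assumption): 100 random 1x28x28 images
# with labels cycling through 0-9.
train_set = TensorDataset(
    torch.rand(100, 1, 28, 28),
    torch.arange(100) % 10,
)

sample = next(iter(train_set))  # first element of the dataset
sample_len = len(sample)        # an (image, label) pair

image, label = sample           # sequence unpacking
print(image.shape)              # shape of a single image tensor
print(label)
```

Unpacking with `image, label = sample` is equivalent to indexing `sample[0]` and `sample[1]`.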

## Working With Batches Of Data

We’ll start by creating a new data loader with a smaller batch size of 10 so it’s easy to demonstrate what’s going on:

We get a batch from the loader in the same way that we saw with the training set. We use the iter() and next() functions.

Checking the length of the returned batch, we get 2 just like we did with the training set. Let’s unpack the batch and take a look at the two tensors and their shapes:

The size of each dimension in the tensor that contains the image data is defined by each of the following values:

(batch size, number of color channels, image height, image width)
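A sketch of fetching and unpacking one batch is given below. As before, `train_set` is a stand-in `TensorDataset` and the name `display_loader` is an assumption; only the batch size of 10 comes from the text.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for train_set (an assumption so the sketch runs offline).
train_set = TensorDataset(
    torch.rand(100, 1, 28, 28),
    torch.arange(100) % 10,
)

display_loader = DataLoader(train_set, batch_size=10)

batch = next(iter(display_loader))  # one batch from the loader
images, labels = batch              # image tensor and label tensor

print(len(batch))    # two items: images and labels
print(images.shape)  # (batch size, color channels, height, width)
print(labels.shape)  # one label per image in the batch
```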

`plt.imshow(X)`: when `X` has shape (M, N, 3), it is drawn as a 2D RGB image.
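As an illustration of the shapes `plt.imshow` expects, the sketch below plots a random stand-in image (an assumption, not real Fashion-MNIST data). A single-channel tensor of shape (1, 28, 28) is squeezed to (28, 28) for grayscale display, and a channels-first RGB tensor is permuted to (M, N, 3). The `Agg` backend is selected only so the sketch runs without a display; drop that line in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import torch

image = torch.rand(1, 28, 28)  # stand-in grayscale image, channels first
gray = image.squeeze()         # (1, 28, 28) -> (28, 28)

rgb_chw = torch.rand(3, 28, 28)    # stand-in RGB image, channels first
rgb = rgb_chw.permute(1, 2, 0)     # (3, 28, 28) -> (28, 28, 3)

fig, axes = plt.subplots(1, 2)
axes[0].imshow(gray.numpy(), cmap="gray")  # (M, N) array drawn with a colormap
axes[1].imshow(rgb.numpy())                # (M, N, 3) array drawn as RGB
```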

## Plot Images Using PyTorch DataLoader

Here is another way to plot the images using the PyTorch DataLoader.