# Data Preparation

In the previous tutorial on starting the 2PM Node with a Docker image, we created a directory named `2pm_node` and bound it to the Docker container. After the 2PM Node is initialized, several subdirectories are created within the `2pm_node` folder, including one named `data`, which is designated for storing data files. 2PM Task Management currently supports a few fixed data formats, such as directories and CSV files. For any training task, the required data must be stored in the same format, under the same name, on every node. Developers specify the filename or directory name in the 2PM Task; when the task runs on other nodes in the network, each node loads the data with that name from its own `data` folder.
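As an illustration, the bound directory might be laid out as follows. The dataset names `train.csv` and `my_images` are hypothetical placeholders; the same names would need to appear in the `data` folder on every participating node:

```
2pm_node/
└── data/
    ├── train.csv      # a single-file dataset
    └── my_images/     # a directory dataset
```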

To help developers get started, the Docker image for the 2PM Node includes a command to download the MNIST dataset. The following command automatically downloads the MNIST dataset, keeping a third of the data and deleting the rest, to simulate a privacy computing network in which different nodes each hold a different portion of the data:

```bash
$ docker run -it --rm -v ${PWD}:/app 2pmmpc/2pm-node:latest get-mnist
```

## Supported Data Formats by 2PM Node

In the `data` folder, each file or directory represents a complete dataset.

### **Using Single Files for Entire Samples**

Each file in the `data` folder represents a complete dataset containing all sample data. 2PM currently supports the following file formats:

<table><thead><tr><th width="157">File Extension</th><th width="330">Content Description</th><th>Type Passed to Preprocess Function in 2PM Task</th></tr></thead><tbody><tr><td>npy</td><td>numpy.ndarray where the 0th dimension indexes samples (e.g., data[0] and data[1] are the first and second samples)</td><td>numpy.ndarray containing a single sample's data</td></tr><tr><td>npz</td><td>Same as npy</td><td>Same as npy</td></tr><tr><td>pt</td><td>torch.Tensor where the 0th dimension indexes samples</td><td>torch.Tensor containing a single sample's data</td></tr><tr><td>csv</td><td>Comma-separated values without a header; each row is a sample</td><td>pandas.DataFrame containing a single sample's data</td></tr><tr><td>tsv</td><td>Tab-separated values without a header; each row is a sample</td><td>pandas.DataFrame containing a single sample's data</td></tr><tr><td>txt</td><td>Same as csv, but the delimiter may vary</td><td>pandas.DataFrame containing a single sample's data</td></tr><tr><td>xls/xlsx</td><td>Excel file without a header; each row is a sample</td><td>pandas.DataFrame containing a single sample's data</td></tr></tbody></table>
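The table above can be sketched in code. The snippet below writes a toy dataset in the npy and csv formats and then extracts what a task's Preprocess function would receive for one sample. The filenames (`toy.npy`, `toy.csv`) and the use of a temporary directory in place of `2pm_node/data` are illustrative assumptions, not required names:

```python
# Sketch: single-file datasets in two of the supported formats.
# The temp directory stands in for the node's 2pm_node/data folder.
import os
import tempfile

import numpy as np
import pandas as pd

data_dir = tempfile.mkdtemp()  # hypothetical stand-in for 2pm_node/data

# npy: the 0th dimension indexes samples.
samples = np.random.rand(100, 28, 28)  # 100 samples, each of shape (28, 28)
np.save(os.path.join(data_dir, "toy.npy"), samples)

# csv: no header row, one sample per row.
pd.DataFrame(np.random.rand(100, 4)).to_csv(
    os.path.join(data_dir, "toy.csv"), header=False, index=False)

# Roughly the per-sample data a Preprocess function would see:
sample_npy = np.load(os.path.join(data_dir, "toy.npy"))[0]  # ndarray, (28, 28)
sample_csv = pd.read_csv(
    os.path.join(data_dir, "toy.csv"), header=None).iloc[[0]]  # 1-row DataFrame
```

Note that the csv is written without a header or index column, matching the headerless row-per-sample layout the table describes.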

### **Using Directories for Entire Samples**

When a directory is placed under the `data` folder, the entire directory is treated as a single dataset. Its contents can be organized in either of the following ways:

#### **Subdirectories within the Folder**

Each subdirectory represents a category, with the subdirectory's name as the category name. Each data file within these subdirectories is a sample.

#### **Data Files within the Folder**

In the absence of subdirectories, data files are placed directly in the folder. In this case, each data file is a sample.

The data files placed in the folder, where each file holds a single sample's data, support all of the formats listed above. The difference is that the file contents do not need an extra sample dimension; each file's content already matches the per-sample format passed to the Preprocess function.

In addition to the listed formats, sample files placed in a directory dataset can also be images. Most common image formats are supported; each image file is treated as one sample, read with `PIL.Image.open`, and passed to the Preprocess function.
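The directory layout described above can be sketched as follows. The snippet builds a directory dataset with one subdirectory per category, each holding single-sample npy files with no extra sample dimension. All names here (`digits`, the category names, the sample filenames) are illustrative, and a temporary directory stands in for `2pm_node/data`:

```python
# Sketch: a directory dataset with one subdirectory per category.
import os
import tempfile

import numpy as np

data_dir = tempfile.mkdtemp()               # stand-in for 2pm_node/data
dataset = os.path.join(data_dir, "digits")  # the whole directory is one dataset

for category in ["0", "1"]:                 # subdirectory name = category name
    os.makedirs(os.path.join(dataset, category))
    for i in range(3):
        # Each file is one sample, so its shape is (28, 28),
        # not (n_samples, 28, 28) as in the single-file case.
        np.save(os.path.join(dataset, category, f"sample_{i}.npy"),
                np.random.rand(28, 28))

categories = sorted(os.listdir(dataset))    # ["0", "1"]
```

The flat variant (no subdirectories) would simply place the `sample_*.npy` files directly under `digits/`.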

## Get Dataset from 0G Storage Node

Suppose someone has registered and indexed datasets on 0G using the Data Standardization and Index Contract and the Data Flow Contract on the 0G blockchain.

Developers can request a specific dataset by the name and ID recorded in the smart contracts, using the following command while running the node:

```bash
$ ppm_node_dataset request DATASET_ID DATASET_NAME
```

The details of this process are specified in the corresponding documentation:

{% content-ref url="../../2pm-data-vsies-service/s-data-storage-and-access" %}
[s-data-storage-and-access](https://docs.2pm.network/2pm-data-vsies-service/s-data-storage-and-access)
{% endcontent-ref %}
