Data Preparation
In the previous tutorial on starting the 2PM Node with a Docker image, we created a directory named 2pm_node and bound it to the Docker container. Initializing the 2PM Node creates several subdirectories inside 2pm_node, including one named data, which is where data files are stored. Currently, 2PM Task Management supports a fixed set of data formats, such as directories and CSV files. For any training task, the required data must be stored in the same format and under the same name on every node. Developers specify the filename or directory name in the 2PM Task; when the task runs on another node in the network, it loads the data from that node's data folder using the same name.
To help developers get started, the Docker image for the 2PM Node includes a command that downloads the MNIST dataset. The following command downloads MNIST, then keeps one third of the data and deletes the rest. This simulates a privacy computing network in which each node holds a different portion of the data:
$ docker run -it --rm -v ${PWD}:/app 2pmmpc/2pm-node:latest get-mnist
Supported Data Formats by 2PM Node
In the data folder, each file or directory represents a complete dataset.
Using Single Files for Entire Samples 
A single file placed in the data folder represents a complete dataset and contains the data for all samples. 2PM currently supports the following file formats:
npy
  File contents: numpy.ndarray whose 0th dimension indexes the samples (e.g., data[0] and data[1] are the first and second samples, respectively)
  Per-sample data: numpy.ndarray containing a single sample's data
npz
  File contents: same as npy
  Per-sample data: same as npy
pt
  File contents: torch.Tensor whose 0th dimension indexes the samples
  Per-sample data: torch.Tensor containing a single sample's data
csv
  File contents: comma-separated values without a header; each row is a sample
  Per-sample data: pandas.DataFrame containing a single sample's data
tsv
  File contents: tab-separated values without a header; each row is a sample
  Per-sample data: pandas.DataFrame containing a single sample's data
txt
  File contents: same as csv, but the format may vary
  Per-sample data: pandas.DataFrame containing a single sample's data
xls/xlsx
  File contents: Excel file without a header; each row is a sample
  Per-sample data: pandas.DataFrame containing a single sample's data
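As a rough illustration of the csv convention above (no header, one row per sample), the following sketch writes a small file and reads it back using only the Python standard library. The helper names, the file name toy.csv, and the sample values are hypothetical, and a temporary directory stands in for the node's data folder. This is not how 2PM itself loads data.

```python
import csv
import tempfile
from pathlib import Path


def write_headerless_csv(data_dir, filename, samples):
    """Write one row per sample, with no header row, into data_dir."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    path = data_dir / filename
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(samples)
    return path


def read_samples(path):
    """Read the file back; every row is treated as one sample."""
    with Path(path).open(newline="") as f:
        return [[float(value) for value in row] for row in csv.reader(f)]


# Hypothetical data: three samples, each with two features and a label.
samples = [[0.1, 0.2, 0], [0.3, 0.4, 1], [0.5, 0.6, 0]]

# Assumption: a temporary directory stands in for 2pm_node/data.
demo_dir = Path(tempfile.mkdtemp()) / "2pm_node" / "data"
csv_path = write_headerless_csv(demo_dir, "toy.csv", samples)

rows = read_samples(csv_path)
print(len(rows), "samples; first sample:", rows[0])
```

In a real deployment the same file name would have to exist in the data folder of every participating node, since the task refers to the dataset by name.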
Using Directories for Entire Samples 
When a directory is placed under the data folder, the entire directory is treated as one dataset. The directory can contain either of the following:
Subdirectories within the Folder
Each subdirectory represents a category, with the subdirectory's name as the category name. Each data file within these subdirectories is a sample.
Data Files within the Folder
In the absence of subdirectories, individual data files are directly placed. In this scenario, each data file is a sample.
In both cases, each data file holds the data of one sample and can use any of the file formats listed above. The difference is that the file contents do not need an extra sample dimension; they have the same shape as the data passed to the Preprocess function.
In addition to the listed formats, the per-sample files in a directory can also be images. Most common image formats are supported. Each image file is treated as one sample, read with PIL.Image.open, and passed to the Preprocess function.
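The two directory layouts can be sketched as follows. This is only an illustration of the folder structure described above, built with the Python standard library. The function list_samples, the dataset names animals and scans, and the empty placeholder files are all hypothetical, and the code does not reproduce 2PM's actual loader.

```python
import tempfile
from pathlib import Path


def list_samples(dataset_dir):
    """Return (category, file name) pairs for a directory dataset.

    The category is None when files sit directly in the dataset directory.
    """
    dataset_dir = Path(dataset_dir)
    subdirs = sorted(p for p in dataset_dir.iterdir() if p.is_dir())
    if subdirs:
        # Layout 1: each subdirectory is a category, each file inside is a sample.
        return [
            (d.name, f.name)
            for d in subdirs
            for f in sorted(d.iterdir())
            if f.is_file()
        ]
    # Layout 2: no subdirectories, each file is a sample without a category.
    return [(None, f.name) for f in sorted(dataset_dir.iterdir()) if f.is_file()]


# Assumption: a temporary directory stands in for 2pm_node/data.
root = Path(tempfile.mkdtemp()) / "data"

# Layout 1: category subdirectories (empty placeholder files, not real images).
labeled = root / "animals"
for category, name in [("cat", "0.png"), ("cat", "1.png"), ("dog", "0.png")]:
    (labeled / category).mkdir(parents=True, exist_ok=True)
    (labeled / category / name).touch()

# Layout 2: sample files placed directly in the dataset directory.
flat = root / "scans"
flat.mkdir(parents=True, exist_ok=True)
for name in ["a.npy", "b.npy"]:
    (flat / name).touch()

print(list_samples(labeled))
print(list_samples(flat))
```

A task would then refer to such a dataset by its directory name (animals or scans in this sketch), which again has to be the same on every node.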
Get Dataset from 0G Storage Node
Suppose someone has registered and indexed datasets on 0G using the Data Standardization and Index Contract and the Data Flow Contract on the 0G blockchain.
While the node is running, developers can request a specific dataset by the ID and name recorded in the smart contracts, using the following command:
$ ppm_node_dataset request DATASET_ID DATASET_NAME
The details of this process are specified in the corresponding documentation:
[S] Data Storage and Access
