Data Preparation

In the previous tutorial on starting the 2PM Node with a Docker image, we created a directory named 2pm_node and bound it to the Docker container. After the 2PM Node is initialized, several subdirectories are created inside the 2pm_node folder, including one named data, which is designated for storing data files. 2PM Task Management currently supports a fixed set of data formats, such as directories and CSV files. For any training task, the required data must be present in the same format on every node. Developers specify a filename or directory name in the 2PM Task, and when the task runs on other nodes in the network, each node loads the data with that same name from its own data folder.
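The per-sample contract described above can be sketched as a preprocess callable. This is only an illustration, not the actual 2PM Task API: the function name, signature, and the dataset name "mnist.npz" are assumptions; the one fixed point, per the format table below, is that 2PM passes a single sample (e.g., a numpy.ndarray for .npy datasets) to the preprocess function.

```python
import numpy as np

# Hypothetical preprocess function for illustration only; the real 2PM
# Task API is not shown here. 2PM hands it one sample at a time, e.g. a
# numpy.ndarray when the dataset is a .npy file.
def preprocess(sample: np.ndarray) -> np.ndarray:
    # Example transformation: scale pixel values from [0, 255] to [0.0, 1.0].
    return sample.astype(np.float32) / 255.0
```

Because every node stores its shard under the same name, the same preprocess logic works unchanged on each node's local data.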

To help developers get started, the Docker image for the 2PM Node includes a command to download the MNIST dataset. The following command automatically downloads the MNIST dataset, keeping one third of the data and deleting the rest, to simulate a privacy computing network in which different nodes hold different portions of the data:

$ docker run -it --rm -v ${PWD}:/app 2pmmpc/2pm-node:latest get-mnist

Supported Data Formats by 2PM Node

In the data folder, each file or directory represents a complete dataset.

Using a Single File for All Samples

Each file in the data folder represents a complete dataset containing all sample data. 2PM currently supports the following file formats:

| File Extension | Content Description | Type Passed to Preprocess Function in 2PM Task |
| --- | --- | --- |
| npy | numpy.ndarray where the 0th dimension indexes samples (e.g., data[0] and data[1] are the first and second samples) | numpy.ndarray containing a single sample's data |
| npz | Same as npy | Same as npy |
| pt | torch.Tensor where the 0th dimension indexes samples | torch.Tensor containing a single sample's data |
| csv | Comma-separated values without a header; each row is a sample | pandas.DataFrame containing a single sample's data |
| tsv | Tab-separated values without a header; each row is a sample | pandas.DataFrame containing a single sample's data |
| txt | Same as csv, but format may vary | pandas.DataFrame containing a single sample's data |
| xls/xlsx | Excel file without a header; each row is a sample | pandas.DataFrame containing a single sample's data |
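As a minimal sketch of the layouts the table describes, the snippet below writes a dataset in two of the listed formats. The filenames ("train.npy", "train.csv") and the temporary directory standing in for the node's data folder are illustrative.

```python
import os
import tempfile

import numpy as np
import pandas as pd

data_dir = tempfile.mkdtemp()  # stand-in for the node's data/ directory

# .npy dataset: the 0th dimension indexes samples
# (here, 100 samples of shape 28x28).
samples = np.zeros((100, 28, 28), dtype=np.float32)
np.save(os.path.join(data_dir, "train.npy"), samples)

# .csv dataset: no header row, one sample per row.
pd.DataFrame([[5.1, 3.5, 0], [4.9, 3.0, 1]]).to_csv(
    os.path.join(data_dir, "train.csv"), header=False, index=False)

# Reading the files back confirms the sample-per-index /
# sample-per-row layout.
loaded = np.load(os.path.join(data_dir, "train.npy"))
rows = pd.read_csv(os.path.join(data_dir, "train.csv"), header=None)
```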

Using a Directory for All Samples

When a directory is placed under the data folder, the entire directory is treated as one dataset. Its contents can be organized in either of two ways:

Subdirectories within the Folder

Each subdirectory represents a category, with the subdirectory's name as the category name. Each data file within these subdirectories is a sample.

Data Files within the Folder

If there are no subdirectories, data files are placed directly in the folder, and each data file is one sample.

Data files placed in the folder, where each file holds a single sample, may use any of the formats listed above. The difference is that the file contents do not need an extra sample dimension; each file's content matches the format passed to the Preprocess function.

In addition to the listed formats, single-sample files placed in the folder may also be images. Each image file is treated as one sample; most common image formats are supported, and each image is read with PIL.Image.open before being passed to the Preprocess function.
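The category-subdirectory layout described above can be sketched as follows. The category names ("cats", "dogs"), file names, and the temporary directory standing in for a dataset directory under data/ are all illustrative; note that each file holds exactly one sample, so no leading sample dimension is needed.

```python
import os
import tempfile

import numpy as np

# Stand-in for a dataset directory placed under the node's data/ folder.
dataset_dir = tempfile.mkdtemp()

# One subdirectory per category; each .npy file inside is a single
# 28x28 sample (no leading sample dimension).
for label in ("cats", "dogs"):
    os.makedirs(os.path.join(dataset_dir, label), exist_ok=True)
    for i in range(3):
        np.save(os.path.join(dataset_dir, label, f"{i}.npy"),
                np.zeros((28, 28), dtype=np.float32))

# Enumerate (category, sample) pairs the way this layout implies:
# the subdirectory name is the category, each file is one sample.
pairs = [(label, np.load(os.path.join(dataset_dir, label, fname)))
         for label in sorted(os.listdir(dataset_dir))
         for fname in sorted(os.listdir(os.path.join(dataset_dir, label)))]
```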

Get Dataset from 0G Storage Node

Suppose someone has registered and indexed datasets on 0G using the Data Standardization and Index Contract, as well as the Data Flow Contract, on the 0G blockchain.

Developers can request a specific dataset by the name and ID recorded in the smart contracts, using the following command while running the node:

$ ppm_node_dataset request DATASET_ID DATASET_NAME

The details of this process have been specified in the corresponding documentation:

Data Storage and Access
