Data Preparation
In the previous tutorial on starting the 2PM Node with a Docker image, we created a directory named `2pm_node` and bound it to the Docker container. After the 2PM Node is initialized, several subdirectories are created inside the `2pm_node` folder, including one named `data`, which is designated for storing data files. 2PM Task Management currently supports a few fixed data formats, such as directories, CSV files, etc. For any training task, the required data must be placed in the same format on every node. Developers specify a filename or directory name in the 2PM Task, and when the task runs on other nodes in the network, it loads the data with that same name from each node's own `data` folder.
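The name-based lookup described above can be sketched as follows. This is an illustrative sketch, not the 2PM API: the `resolve_dataset` helper and the `node_a`/`node_b` directories are hypothetical, standing in for each node's mounted `data` folder.

```python
from pathlib import Path

# Hypothetical layout: every node mounts its own data directory.
# A task references a dataset by name only; each node resolves that
# name against its local data folder, so "train.csv" points at a
# different physical file on every node.

def resolve_dataset(data_dir: str, name: str) -> Path:
    """Resolve a dataset name to this node's local copy."""
    path = Path(data_dir) / name
    if not path.exists():
        raise FileNotFoundError(f"dataset {name!r} not found in {data_dir}")
    return path

# Two simulated nodes holding different data under the same filename.
for node in ("node_a", "node_b"):
    d = Path(node) / "data"
    d.mkdir(parents=True, exist_ok=True)
    (d / "train.csv").write_text(f"1,2,{node}\n")

print(resolve_dataset("node_a/data", "train.csv").read_text())
```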
To help developers get started, the Docker image for the 2PM Node includes a command to download the MNIST dataset. The following command automatically downloads MNIST, keeps one third of the data, and deletes the rest, simulating a privacy computing network in which each node holds a different portion of the data:
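The effect of that command, keeping one third of the samples so each node ends up with a different shard, can be sketched as below. This is a stand-in using a synthetic MNIST-shaped array, not the actual download command shipped in the Docker image; the `node_index`/`num_nodes` names are illustrative.

```python
import numpy as np

# Synthetic stand-in for MNIST: 60000 samples of 28x28 pixels.
rng = np.random.default_rng(0)
full = rng.integers(0, 255, size=(60000, 28, 28), dtype=np.uint8)

# Each of three simulated nodes keeps every third sample, starting at
# its own offset, so the shards are disjoint and cover the full set.
node_index, num_nodes = 0, 3
shard = full[node_index::num_nodes]

print(shard.shape)  # (20000, 28, 28)
```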
Each file or directory in the `data` folder represents a complete dataset. For single-file datasets, the file contains all of the sample data; 2PM currently supports the following file formats:
| File Extension | Content Description | Type Passed to Preprocess Function in 2PM Task |
|---|---|---|
| npy | `numpy.ndarray` whose 0th dimension indexes samples (e.g. `data[0]`, `data[1]` are the first and second samples) | `numpy.ndarray` containing a single sample's data |
| npz | Same as npy | Same as npy |
| pt | `torch.Tensor` whose 0th dimension indexes samples | `torch.Tensor` containing a single sample's data |
| csv | Comma-separated values without a header; each row is a sample | `pandas.DataFrame` containing a single sample's data |
| tsv | Tab-separated values without a header; each row is a sample | `pandas.DataFrame` containing a single sample's data |
| txt | Same as csv, but the format may vary | `pandas.DataFrame` containing a single sample's data |
| xls/xlsx | Excel file without a header; each row is a sample | `pandas.DataFrame` containing a single sample's data |

When a directory is placed under the `data` folder, the entire directory is treated as a dataset. Its contents can be organized in one of two ways:

  * Each subdirectory represents a category, with the subdirectory's name as the category name. Each data file within these subdirectories is one sample.
  * Without subdirectories, data files are placed directly in the directory, and each data file is one sample.

The data files placed in the directory, where each file holds a single sample, support all of the formats listed above. The difference is that the file contents no longer need an extra sample dimension; they correspond to the per-sample data format passed to the Preprocess function. In addition to the listed formats, single-sample data files may also be images. Each image file is treated as one sample; most common image formats are supported, and images are read with `PIL.Image.open` before being passed to the Preprocess function.

Suppose someone has successfully registered and indexed some datasets on 0G using the Data Standardization and Index Contract, as well as the Data Flow Contract, on the 0G blockchain. Developers can request specific datasets by the names and IDs recorded in the smart contracts with the following command line while running the node:

The details of this process are specified in the corresponding documentation.