Logistic Regression Task

This is an example of a horizontal logistic regression task written using the 2PM Node Framework.

The data comes from the spector dataset, which contains experimental data on the effectiveness of the Personalized System of Instruction (PSI) program. The data is a CSV file with four columns: GPA (grade point average), TUCE (score on the Test of Understanding of College Economics), PSI (participation in the personalized instruction program), and Grade (whether the student's grade improved). The task is to predict whether a student's grade improved using a logistic regression model.

Importing Necessary Packages

import numpy as np
import pandas
import 2pm.dataset
from 2pm import 2PMNode
from 2pm.statsmodel import LogitTask

Here, we have imported the LogitTask class from the 2pm.statsmodel package. To define a logistic regression task in 2PM, you need to define a subclass that inherits from LogitTask.

Next, we will define the logistic regression task.

Defining the Logistic Regression Task

class SpectorLogitTask(LogitTask):
    def __init__(self) -> None:
        super().__init__(
            name="spector_logit",  # Task name
            min_clients=2,  # Minimum required number of clients, at least 2
            max_clients=3,  # Maximum supported number of clients, must be greater than or equal to min_clients
            wait_timeout=5,  # Wait timeout to control the timeout for a round of computation
            connection_timeout=5,  # Connection timeout to control the timeout for each phase in the process
            verify_timeout=500,  # Verification timeout to control the timeout for the zero-knowledge proof stage
            enable_verify=True  # Whether to enable the zero-knowledge proof stage after task completion
        )

    def dataset(self):
        """
        Define the data required for the task.
        Output: Dictionary, keys are the names of the data, which must correspond with the parameter names in the preprocess method.
        """
        return {
            "data": 2pm.dataset.DataFrame("spector.csv"),
        }

    def preprocess(self, data: pandas.DataFrame):
        """
        Preprocessing function to process the dataset and split it into features (x) and labels (y).
        Input: Corresponds with the return value of the dataset method
        Output: Features (x) and labels (y)
        """
        names = data.columns

        y_name = names[3]
        y = data[y_name].copy()  # type: ignore
        x = data.drop([y_name], axis=1)
        return x, y
    
    def options(self):
        """
        Optional method to configure the training for the logistic regression task.
        Output: Dictionary, training configuration options for the logistic regression task.
        """
        return {
            "maxiter": 35,  # Maximum number of training iterations, default is 35
            "method": "newton",  # Training method, current option is "newton"
            "start_params": None,  # Initial weights for logistic regression, default is None. If None, weights are initialized to zero
            "ord": np.inf,  # Coefficient related to the newton method. Order of the gradient norm
            "tol": 1e-8,  # Coefficient related to the newton method. Tolerance for stopping the training
            "ridge_factor": 1e-10,  # Coefficient related to the newton method. Ridge regression coefficient
        }

The definition of the logistic regression task includes four parts: task configuration, selection of the dataset, preprocessing of the dataset, and configuration of the training.

Task Configuration

The task is configured in the super().__init__() method. Configuration options include the task name (name), the minimum number of clients required (min_clients), the maximum number of clients supported (max_clients), the wait timeout (wait_timeout), which controls the timeout for a computation round, and the connection timeout (connection_timeout), which controls the timeout for each phase of the process.

Additionally, the logistic regression task can enable a zero-knowledge proof stage after task completion to verify the convergence of the final results and the consistency of the data used by each node across the computation. To enable this feature, set the enable_verify parameter in super().__init__() to True. You can control the timeout of the zero-knowledge proof stage with the verify_timeout parameter. Because this stage is currently time-consuming, the default value of verify_timeout is 300 seconds. If a timeout occurs during this stage, it is advisable to increase verify_timeout appropriately.

Since the nodes in the network are not always online and there is some selection process for the tasks they wish to participate in, the number of nodes required for the task is defined here. Once the task is published, nodes independently decide whether to join the task. When the number of nodes that have opted in meets the task requirements, the task will commence.

Here, we assume up to three nodes participate in the logistic regression task, so in the configuration above we set the minimum number of nodes to two and the maximum to three: the task starts once at least two nodes have joined, and accepts at most three.

Dataset

The dataset required for the task is defined in the dataset method. This method returns a dictionary whose keys are the names of the datasets and must correspond with the parameter names of the preprocess method; the corresponding values are instances of 2pm.dataset.DataFrame, whose argument is the file name of the required dataset.

The definition of the dataset primarily clarifies which data is needed for the computation. This data is distributed across different nodes and must be stored with the same naming and format for access by 2PM nodes.

The dataset definition also specifies the data format after it is read in. Here, 2pm.dataset.DataFrame is used, indicating to 2PM to convert the read data into a Pandas DataFrame for subsequent use.

In this case, we are reading from spector.csv. Each node has its own spector.csv file.
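To make the expected file format concrete, the sketch below shows a few rows in the spector.csv layout and reads them with pandas, which is what 2pm.dataset.DataFrame does with each node's local file. The row values here are illustrative samples, not the real dataset.

```python
import io
import pandas as pd

# Illustrative rows in the spector.csv format (values are sample data,
# not the actual spector dataset): GPA, TUCE, PSI, Grade.
csv_text = """GPA,TUCE,PSI,Grade
2.66,20,0,0
3.57,23,1,1
2.89,22,0,0
3.92,29,1,1
"""

# Each node's local spector.csv would be read into a DataFrame like this.
data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)           # (4, 4)
print(list(data.columns))   # ['GPA', 'TUCE', 'PSI', 'Grade']
```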

Data Preprocessing

In the preprocessing function, we process the dataset returned by the dataset method, ultimately returning features (x) and labels (y) for training. The input must correspond with the return value of the dataset method, meaning one input parameter matches an item from the returned dictionary. The outputs x and y can be either a pandas.DataFrame or a numpy.ndarray, where y must be a one-dimensional vector representing class labels.

Since spector.csv contains four columns (GPA, TUCE, PSI, and Grade) and our task is to predict Grade, we take the first three columns as features (x) and the last column as the label (y). No further preprocessing is needed beyond splitting the input DataFrame; the features (x) and labels (y) are returned directly.
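The column split performed inside preprocess can be sketched on its own with plain pandas, using a small DataFrame with made-up values in the same format:

```python
import pandas as pd

# Hypothetical data in the spector.csv format (made-up values).
data = pd.DataFrame({
    "GPA": [2.66, 3.57, 2.89],
    "TUCE": [20, 23, 22],
    "PSI": [0, 1, 1],
    "Grade": [0, 1, 0],
})

# Same logic as the preprocess method: the fourth column is the label,
# the remaining columns are the features.
y_name = data.columns[3]            # "Grade"
y = data[y_name].copy()
x = data.drop([y_name], axis=1)

print(list(x.columns))  # ['GPA', 'TUCE', 'PSI']
print(y.tolist())       # [0, 1, 0]
```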

Logistic Regression Options

This method is optional. In the options method, we can configure various parameters for logistic regression. Common parameters include method (the optimization method for logistic regression, currently only "newton" is available), maxiter (maximum number of iterations, default is 35), and start_params (initial weights for logistic regression, default is None). If start_params is None, the framework will initialize weights to zero.

Additional parameters are specific to each optimization method. For Newton's method, these include ord (order of the gradient norm, default is +inf), tol (tolerance for stopping iterations, default is 1e-8), and ridge_factor (ridge regression coefficient for the Hessian matrix).

All the aforementioned configuration items have default values. If you have no specific requirements, you can omit implementing this method, and all parameters will default to their preset values.
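To see how these options interact, here is a plain (non-federated) Newton's method for logistic regression using only NumPy. This is an illustration of the standard algorithm under the same option names, not the framework's actual implementation: maxiter bounds the iterations, training stops when the gradient norm of order ord drops below tol, and ridge_factor regularizes the Hessian before each solve.

```python
import numpy as np

def logit_newton(x, y, maxiter=35, ord=np.inf, tol=1e-8, ridge_factor=1e-10):
    """Illustrative Newton's method for logistic regression using the
    option names above. Not the framework's actual implementation."""
    w = np.zeros(x.shape[1])  # start_params=None -> zero initialization
    for _ in range(maxiter):
        p = 1.0 / (1.0 + np.exp(-x @ w))      # predicted probabilities
        grad = x.T @ (p - y)                  # gradient of the negative log-likelihood
        # Stop when the gradient norm (of the given order) falls below tol.
        if np.linalg.norm(grad, ord) < tol:
            break
        # Hessian, regularized by the ridge factor before solving.
        hess = x.T @ (x * (p * (1 - p))[:, None])
        hess += ridge_factor * np.eye(x.shape[1])
        w -= np.linalg.solve(hess, grad)      # Newton update
    return w

# Toy usage: an intercept column plus one feature, with made-up labels.
x = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, -0.5],
              [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
w = logit_newton(x, y)
print(w)  # the slope w[1] comes out positive for this data
```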

Specifying the API for the 2PM Node to Execute Tasks

Once the task is defined, we can begin preparations to execute it on the 2PM Node.

The 2PM Task Management allows for direct interaction with the 2PM Node API to dispatch tasks to the 2PM Node for execution. Simply specify the API address of the 2PM Node when initiating the task execution.
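With the task class defined, a dispatch script might look like the sketch below. The API address, the build() method, and create_task() are assumptions made for illustration (this page does not spell out the exact calls); check them against the 2PM Node API reference before use.

```python
from 2pm import 2PMNode  # as imported at the top of this example

# Hypothetical API address of a locally running 2PM Node (assumption).
API_ADDRESS = "http://127.0.0.1:6700"

# Build the task and send it to the node for execution.
# build() and create_task() are assumed method names for illustration.
task = SpectorLogitTask().build()
node = 2PMNode(API_ADDRESS)
node.create_task(task)
```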

Last updated