# Logistic Regression Task

This is an example of a horizontal logistic regression task written using the 2PM Node Framework.

The data comes from the Spector dataset, which records experimental data on the effectiveness of the Personalized System of Instruction (PSI) program. It is stored as a CSV file with four columns: GPA (grade point average), TUCE (score on the Test of Understanding in College Economics), PSI (whether the student participated in the personalized instruction system), and Grade (whether the student's grade improved). The task is to predict whether a student's grade improved, using a logistic regression model.

### **Importing Necessary Packages**

```python
import numpy as np
import pandas
import 2pm.dataset
from 2pm import 2PMNode
from 2pm.statsmodel import LogitTask
```

Here, we have imported the `LogitTask` class from the `2pm.statsmodel` package. To define a logistic regression task in 2PM, you need to define a subclass that inherits from `LogitTask`.

Next, we will define the logistic regression task.

### **Defining the Logistic Regression Task**

```python
class SpectorLogitTask(LogitTask):
    def __init__(self) -> None:
        super().__init__(
            name="spector_logit",  # Task name
            min_clients=2,  # Minimum required number of clients, at least 2
            max_clients=3,  # Maximum supported number of clients, must be greater than or equal to min_clients
            wait_timeout=5,  # Wait timeout to control the timeout for a round of computation
            connection_timeout=5,  # Connection timeout to control the timeout for each phase in the process
            verify_timeout=500,  # Verification timeout to control the timeout for the zero-knowledge proof stage
            enable_verify=True  # Whether to enable the zero-knowledge proof stage after task completion
        )

    def dataset(self):
        """
        Define the data required for the task.
        Output: Dictionary, keys are the names of the data, which must correspond with the parameter names in the preprocess method.
        """
        return {
            "data": 2pm.dataset.DataFrame("spector.csv"),
        }

    def preprocess(self, data: pandas.DataFrame):
        """
        Preprocessing function to process the dataset and split it into features (x) and labels (y).
        Input: Corresponds with the return value of the dataset method
        Output: Features (x) and labels (y)
        """
        names = data.columns

        y_name = names[3]
        y = data[y_name].copy()  # type: ignore
        x = data.drop([y_name], axis=1)
        return x, y
    
    def options(self):
        """
        Optional method to configure the training for the logistic regression task.
        Output: Dictionary, training configuration options for the logistic regression task.
        """
        return {
            "maxiter": 35,  # Maximum number of training iterations, default is 35
            "method": "newton",  # Training method, current option is "newton"
            "start_params": None,  # Initial weights for logistic regression, default is None. If None, weights are initialized to zero
            "ord": np.inf,  # Coefficient related to the newton method. Order of the gradient norm
            "tol": 1e-8,  # Coefficient related to the newton method. Tolerance for stopping the training
            "ridge_factor": 1e-10,  # Coefficient related to the newton method. Ridge regression coefficient
        }

```

The definition of the logistic regression task includes four parts: task configuration, selection of the dataset, preprocessing of the dataset, and configuration of the training.

#### **Task Configuration**

The task is configured in the `super().__init__()` method. Configuration options include the task name (`name`), the minimum number of clients required (`min_clients`), the maximum number of clients supported (`max_clients`), the wait timeout (`wait_timeout`), which controls the timeout for a computation round, and the connection timeout (`connection_timeout`), which controls the timeout for each phase of the process.

Additionally, the logistic regression task can enable a zero-knowledge proof stage after task completion to verify the convergence of the final results and the consistency of the data across the computation process at various nodes. To enable this feature, set the `enable_verify` parameter in `super().__init__()` to `True`. You can control the duration of the zero-knowledge proof stage with the `verify_timeout` parameter. Currently, as the zero-knowledge proof stage is time-consuming, the default value for `verify_timeout` is 300 seconds. If a timeout occurs during this stage, it is advisable to increase `verify_timeout` appropriately.

Nodes in the network are not always online, and each node chooses which tasks it wants to participate in, so the number of nodes required for the task is defined here. Once the task is published, nodes independently decide whether to join it. When the number of nodes that have opted in meets the task requirements, the task starts.

Here, we allow either two or three nodes to participate in the logistic regression task, so we set `min_clients` to 2 and `max_clients` to 3.
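The constraints on `min_clients` and `max_clients`, together with the start condition described above, can be summarized in a few lines. This is only an illustrative sketch: `validate_client_bounds` and `can_start` are hypothetical helpers, not part of the 2PM API, and the scheduling logic inside the framework may differ.

```python
def validate_client_bounds(min_clients: int, max_clients: int) -> None:
    """Check the two constraints stated in the task configuration comments."""
    if min_clients < 2:
        raise ValueError("min_clients must be at least 2")
    if max_clients < min_clients:
        raise ValueError("max_clients must be >= min_clients")


def can_start(joined: int, min_clients: int, max_clients: int) -> bool:
    """Assumed start rule: enough nodes have opted in, and no more than supported."""
    validate_client_bounds(min_clients, max_clients)
    return min_clients <= joined <= max_clients


# With the configuration used in this example (min_clients=2, max_clients=3):
print(can_start(1, 2, 3))  # False: only one node has joined
print(can_start(2, 2, 3))  # True: the minimum is met
```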

#### **Dataset**

The dataset required for the task is defined in the `dataset` method. This method returns a dictionary whose keys are dataset names, which must match the parameter names of the `preprocess` method. The corresponding values are instances of `2pm.dataset.DataFrame`, whose `dataset` parameter gives the name of the required dataset.

The dataset definition mainly specifies which data the computation needs. This data is distributed across different nodes, and each node must store it under the same name and in the same format so that 2PM nodes can access it.

The dataset definition also specifies the data format after it is read in. Here, `2pm.dataset.DataFrame` is used, indicating to 2PM to convert the read data into a Pandas DataFrame for subsequent use.

In this case, we are reading from `spector.csv`. Each node has its own `spector.csv` file.
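Because the dictionary keys must line up with the parameter names of `preprocess`, a mismatch is an easy mistake to make. The sketch below shows one way such a check could be written locally before submitting a task. `check_dataset_keys` is a hypothetical helper and the `preprocess` function here is a stand-in; neither is part of the 2PM API.

```python
import inspect


def check_dataset_keys(dataset_keys, preprocess_fn) -> None:
    """Raise if the dataset keys and the preprocess parameter names differ."""
    params = [
        name
        for name in inspect.signature(preprocess_fn).parameters
        if name != "self"
    ]
    if set(dataset_keys) != set(params):
        raise ValueError(
            f"dataset keys {sorted(dataset_keys)} do not match "
            f"preprocess parameters {sorted(params)}"
        )


def preprocess(self, data):
    # Stand-in with the same signature as SpectorLogitTask.preprocess
    return data


# The key "data" matches the parameter "data", so this passes silently.
check_dataset_keys({"data": "spector.csv"}, preprocess)
```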

#### **Data Preprocessing**

In the preprocessing function, we process the dataset returned by the `dataset` method and return the features (`x`) and labels (`y`) used for training. The input must correspond with the return value of the `dataset` method: each input parameter matches one key of the returned dictionary. The outputs `x` and `y` can each be either a `pandas.DataFrame` or a `numpy.ndarray`, and `y` must be a one-dimensional vector of class labels.

`spector.csv` contains four columns (GPA, TUCE, PSI, and Grade), and our task is to predict Grade, so we take the first three columns as features (`x`) and the last column as the label (`y`). No preprocessing is required beyond splitting the input DataFrame, so `x` and `y` are returned directly.
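To see what the split produces, the same steps can be run on a small DataFrame outside the framework. The rows below are made up for illustration and are not values from `spector.csv`; only the column layout follows the description above.

```python
import pandas as pd

# Made-up rows for illustration only; these are not values from spector.csv.
data = pd.DataFrame(
    {
        "GPA": [2.5, 3.1, 3.8, 2.9],
        "TUCE": [20, 24, 28, 22],
        "PSI": [0, 1, 1, 0],
        "Grade": [0, 0, 1, 0],
    }
)

y_name = data.columns[3]         # the fourth column holds the label
y = data[y_name].copy()          # one-dimensional Series of class labels
x = data.drop(columns=[y_name])  # the remaining three columns are the features

print(list(x.columns))  # ['GPA', 'TUCE', 'PSI']
print(y.ndim)           # 1
```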

#### **Logistic Regression Options**

This method is optional. In the `options` method, we can configure various parameters for logistic regression. Common parameters include `method` (the optimization method for logistic regression, currently only "newton" is available), `maxiter` (maximum number of iterations, default is 35), and `start_params` (initial weights for logistic regression, default is None). If `start_params` is None, the framework will initialize weights to zero.

Additional parameters are specific to each optimization method. For Newton's method, these include `ord` (order of the gradient norm, default is +inf), `tol` (tolerance for stopping iterations, default is 1e-8), and `ridge_factor` (ridge regression coefficient for the Hessian matrix).
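To make the role of these options concrete, here is a minimal single-machine sketch of Newton's method for logistic regression. It is not the 2PM implementation, which runs across several nodes. The stopping rule (gradient norm of order `ord` below `tol`) and the way `ridge_factor` is added to the Hessian diagonal are assumptions based on the option descriptions above. The parameter name `ord` shadows a Python builtin and is kept only to mirror the option name.

```python
import numpy as np


def newton_logit(x, y, maxiter=35, tol=1e-8, ord=np.inf,
                 ridge_factor=1e-10, start_params=None):
    """Fit logistic regression weights with Newton's method (local sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = x.shape
    # start_params=None means the weights start at zero
    if start_params is None:
        params = np.zeros(d)
    else:
        params = np.asarray(start_params, dtype=float)

    for _ in range(maxiter):
        p = 1.0 / (1.0 + np.exp(-x @ params))     # predicted probabilities
        grad = x.T @ (p - y) / n                  # gradient of the mean negative log-likelihood
        if np.linalg.norm(grad, ord=ord) < tol:   # assumed stopping rule
            return params, True
        hess = (x.T * (p * (1.0 - p))) @ x / n    # Hessian of the same objective
        hess[np.diag_indices(d)] += ridge_factor  # small ridge term keeps the solve stable
        params = params - np.linalg.solve(hess, grad)  # Newton step
    return params, False


# Toy, non-separable data (made up): an intercept column plus one feature
x = np.column_stack([np.ones(8), [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1])
params, converged = newton_logit(x, y)
print(params, converged)
```

A larger `ridge_factor` makes the linear solve more robust when features are nearly collinear, at the cost of slightly slower convergence; a looser `tol` stops earlier.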

All the aforementioned configuration items have default values. If you have no specific requirements, you can omit implementing this method, and all parameters will default to their preset values.
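One way to picture how defaults and user-supplied values combine is a dictionary overlay. The sketch below is an assumption about the behavior, not the framework's code: `resolve_options` is a hypothetical helper, and the `ridge_factor` default is taken from the example above because no default is stated for it.

```python
# Defaults as described in this section; the ridge_factor value is taken
# from the example above and is an assumption about the actual default.
DEFAULT_OPTIONS = {
    "maxiter": 35,
    "method": "newton",
    "start_params": None,
    "ord": float("inf"),
    "tol": 1e-8,
    "ridge_factor": 1e-10,
}


def resolve_options(user_options=None):
    """Overlay user-supplied options on the defaults (hypothetical helper)."""
    user_options = user_options or {}
    unknown = set(user_options) - set(DEFAULT_OPTIONS)
    if unknown:
        raise KeyError(f"unknown options: {sorted(unknown)}")
    return {**DEFAULT_OPTIONS, **user_options}


print(resolve_options({"maxiter": 100})["maxiter"])  # 100
print(resolve_options()["method"])                   # newton
```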

### **Specifying the API for the 2PM Node to Execute Tasks**

Once the task is defined, we can begin preparations to execute it on the 2PM Node.

2PM Task Management interacts directly with the 2PM Node API to dispatch tasks to the 2PM Node for execution. You only need to specify the API address of the 2PM Node when starting the task.
