Privacy-preserving techniques

SMPC (Secure Multi-Party Computation)

SMPC is a technique that allows data to be shared securely, making only an aggregated version of the data visible to the coordinator. Without SMPC, the coordinator receives the data of each client directly. When using SMPC, each client splits its data into multiple masked models and the corresponding masks. The masked models and the masks are then sent to different clients. Each client splits its data as described above n times, where n is called the number of shards. Each split is sent to another client, so the number of shards must be a number between 1 and number_clients.
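
To illustrate the idea behind this masking, below is a minimal conceptual sketch of additive secret sharing in Python. It is not the exact protocol used by the FeatureCloud controller (and it only covers the additive case); the function names and the integer masks are assumptions made purely for illustration.

import random

def split_into_shards(value, n_shards, mask_range=10**6):
    # Split a numeric value into n_shards shares: one masked value plus
    # (n_shards - 1) random masks. Each share alone reveals nothing,
    # but all shares together sum back to the original value.
    masks = [random.randint(-mask_range, mask_range) for _ in range(n_shards - 1)]
    masked_value = value - sum(masks)
    return [masked_value] + masks

def aggregate(shares_from_all_clients):
    # The aggregator only ever sees sums of shares, i.e. the aggregate.
    return sum(shares_from_all_clients)

# Example: two clients with local values 3 and 5; the aggregator learns only 8.
shares = split_into_shards(3, 3) + split_into_shards(5, 3)
print(aggregate(shares))  # 8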

Usage

As with general app development as described in Getting Started, either an app template can be used or development can be done from scratch.

Developing applications from scratch (advanced)

To use SMPC when sending data to the coordinator, the response to GET /status must contain the smpc option:

smpc: {
  operation: enum[add, multiply]
  serialization?: enum[json] // default is json
  shards?: number // default is number of participants including coordinator
  exponent?: number // default is 8
}

See app template based development for more information on setting these variables. The data must be serialized as defined by the serialization variable.

Furthermore, note that when SMPC is used, the controller aggregates the data according to the operation option given in the status call. The ONE aggregated package is then sent to the application, serialized as given by serialization. In conclusion, that means that only ONE model is sent (via the POST /data request) and that model is serialized according to serialization.

We suggest giving only the parameters operation and exponent. Leaving out the parameters shards and serialization will use the default values, JSON for serialization and number_clients for shards.

If shards is 0, number_clients shards are used.
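
As an illustration, a GET /status response that enables SMPC as suggested above could contain the following smpc block (all other status fields are omitted and the values are purely illustrative):

smpc: {
  operation: add,
  exponent: 8
  // shards and serialization are left out, so the defaults apply:
  // number_clients shards and JSON serialization
}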

DP (Differential Privacy)

Differential privacy describes a privacy-enhancing technique that conceals the contribution of each individual row of data. This is achieved by adding noise to any numerical data sent.

Usage

As with general app development as described in Getting Started, either an app template can be used or development can be done from scratch.

App template based development (recommended)

  1. First, DP must be configured. Use this method for configuration. The following parameters can be set. See here for a quick guide on how to choose the parameters.

    • noisetype: describes the distribution from which noise is drawn. See here for all possible distributions.

    • epsilon: describes the epsilon privacy budget value. Please refer to here for information on choosing epsilon.

    • delta: describes the delta privacy budget value. Must be 0 for Laplace noise, and should be of a smaller scale than \(\frac{1}{numRows}\), where numRows is the number of rows in the data used to train the model that is sent out. See here for more information.

    • sensitivity: describes the sensitivity of the function that was used on the data. See this guide on how to choose the sensitivity.

    • clippingVal: this value describes the maximum norm of the sent data. This is ensured by scaling the sent data down so that the maximum norm holds. This yields a fixed sensitivity and can therefore be given instead of, or in addition to, the sensitivity. See this guide for more information.

  2. DP can now be used whenever sending data to any other client.

  3. As the serialization of incoming data might differ when data was sent using DP, the functions gathering the data must also be informed that the data was sent with the corresponding serialization. This affects the methods used to gather incoming data; see the sketch below.
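
A minimal sketch of what this could look like inside a FeatureCloud app state is given below. The names configure_dp and use_dp as well as the exact parameters are assumptions based on the app template and may differ from the actual API; please check the app template documentation for the exact signatures.

from FeatureCloud.app.engine.app import AppState, app_state

@app_state('local_training')
class LocalTrainingState(AppState):

    def register(self):
        # Hypothetical follow-up state, only needed to make the sketch complete.
        self.register_transition('aggregate')

    def run(self):
        model = [0.1, 0.2, 0.3]  # stands in for a locally trained model

        # Assumed configuration call; the parameter names mirror the DP options above.
        self.configure_dp(noisetype='laplace', epsilon=0.9, delta=0.0, clippingVal=10.0)

        # Assumed flag telling the template to apply DP before the data leaves the client.
        self.send_data_to_coordinator(model, use_dp=True)

        if self.is_coordinator:
            # Assumed flag so that DP-serialized incoming data is deserialized correctly.
            models = self.gather_data(use_dp=True)
        return 'aggregate'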

Developing applications from scratch (advanced)

Please follow the general steps for developing an app as given in getting started. However, your application should add the following parameters to the response body of the GET /status request:

dp: {
  serialization?: enum[json] // default is json
  noisetype?: enum[laplace, gauss] // default is laplace
  epsilon?: float // default is 0.99999
  delta?: float // default is 0 for laplace noise and 0.01 for gauss noise
  sensitivity?: float
  clippingVal?: float
    // default is 10.0 and only set if neither
    // clippingVal nor sensitivity are given
}

See here for a quick guide on how to choose these parameters. Furthermore, data must be serialized according to the given serialization value in the status call (JSON).
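
For example, a GET /status response that enables Gaussian noise with clipping could contain a dp block like the following (all other status fields are omitted and the values are purely illustrative):

dp: {
  noisetype: gauss,
  epsilon: 0.9,
  delta: 0.01,
  clippingVal: 10.0
  // serialization is left out, so JSON is used
}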

Parameter Guide

This step-by-step guide goes through all parameters needed for DP and how to set them.

  1. sensitivity/clippingVal: DP works on the assumption that some database (a collection of rows/vectors) is used as input to a function. The function must output numerical data. In the context of FeatureCloud, the functions are usually the training algorithms and their output is the local models that are sent around. The input is therefore normally the CSV data. You can read more here.

    There are two ways to find the correct sensitivity.

    1. For many functions, e.g. for any count query, the sensitivity is fixed and can be found with some research. For a simple count query, for example, the sensitivity is 1, since adding or removing one row changes the count by at most 1.

    2. Alternatively, the so-called local sensitivity can be calculated: \(\max_{D'} \|f(D) - f(D')\|_p\), where \(D\) is all data, \(D'\) is all data except for one row and \(p\) is 1 for Laplace noise and 2 for Gaussian noise. In practice, that means generating the model using all data except for one row, for EACH row, and then taking the largest norm of the difference between each of these models and the model trained on all data. This method is computationally intense: it turns a single training run into N+1 training runs, where N is the database size. See this section for more information about this method and what the sensitivity is.

    In case neither of these ways is feasible, or in case clipping the values is beneficial, the clippingVal can be used. The right value for clippingVal depends largely on the data and the training algorithm, but generally it should be chosen as low as possible without the scaling down of values interfering with training. To understand what clipping does, see here.

  2. delta: When using Laplace noise, delta must be 0. When using Gaussian noise, delta must be smaller than 1. We recommend setting delta to a smaller scale than the value \(\frac{1}{numRows}\), as proposed by [Dwork et al 2014].

  3. epsilon: For choosing epsilon, we recommend choosing one of the following 3 tiers as proposed by [Ponomareva et al, 2023]. Generally, the lowest possible epsilon should be chosen. Either different epsilons can be tested locally, or the 3 tiers can be iterated from most strict (1) to most loose (3) until a satisfactory result is reached.

  • Tier 1: Strong formal privacy guarantees: epsilon < 1

    This gives formal guarantees and high protection, but often heavily decreases accuracy.

  • Tier 2: Reasonable privacy guarantees: epsilon <= 10

    This tier is currently the most used. It gives reasonable protection but can still produce acceptable results. Technically, DP with Gaussian noise is not defined for any epsilon > 1, but in practice the protection is still reasonable.

  • Tier 3: epsilon ~ few 100s

    While formally this tier offers no protection, in practice data reconstruction attacks can still be prevented using an epsilon of a few 100s, e.g. up to 300, see e.g. [Balle et al, 2022].
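
To give a feeling for how these parameters interact, the sketch below shows the textbook noise calibration of the Laplace and Gaussian mechanisms. It is only meant to illustrate the epsilon/delta/sensitivity trade-off; it is an assumption, not a guarantee, that the FeatureCloud controller calibrates its noise in exactly this way.

import numpy as np

def laplace_noise(data, sensitivity, epsilon):
    # Laplace mechanism: noise scale = L1-sensitivity / epsilon
    scale = sensitivity / epsilon
    return data + np.random.laplace(loc=0.0, scale=scale, size=np.shape(data))

def gauss_noise(data, sensitivity, epsilon, delta):
    # Classic Gaussian mechanism (requires epsilon <= 1):
    # sigma = L2-sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return data + np.random.normal(loc=0.0, scale=sigma, size=np.shape(data))

# Example: a local model whose 2-norm was clipped to 10, hence sensitivity 2 * 10
# (see the Clipping section below), noised with epsilon below 1 and delta = 0.01.
model = np.array([0.4, -1.2, 3.3])
noisy_model = gauss_noise(model, sensitivity=2 * 10.0, epsilon=0.9, delta=0.01)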

Background

Sensitivity

Sensitivity is a metric that quantifies the privacy loss incurred by publishing the result of some function, in our case by publishing the model of a training algorithm. There are two forms of sensitivity:

  1. Global Sensitivity: \(\Delta f = \max_{D, D'}{\|f(D) - f(D')\|_p}\)

  2. Local Sensitivity: \(\Delta f(D) = \max_{D'}{\|f(D) - f(D')\|_p}\)

Global sensitivity considers ANY possible data, while local sensitivity considers the specific data at hand. \(D'\) contains all of \(D\) except for one row. Local sensitivity tends to be lower and therefore needs less noising, but is also more computationally intense to calculate. The method of finding the local sensitivity is the following:

Input:
  Data D:      A collection of rows, where each row represents only ONE
               individual, e.g. any CSV data WITHOUT repeating IDs.
  Function f:  The training algorithm that gets used and whose output is sent.
  Norm p:      The norm to be used for the sensitivity. p = 1 is used for
               Laplace noise and delivers the L1-Sensitivity, p = 2 is used
               for Gaussian noise and delivers the L2-Sensitivity.
Output:
  Sensitivity: The L1/L2-Sensitivity of f considering D. L1 or L2 is decided
               depending on the given norm p.
Algorithm:
  sensitivity = 0
  basemodel = f(D)
  for row in D:
    D_prime = D.remove(row)
      # remove returns a copy of D without row while not changing D
    sensitivity = max(sensitivity, ||basemodel - f(D_prime)||_p)
  return sensitivity
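
The same procedure as a small, self-contained Python sketch, assuming the output of the training function f can be represented as a NumPy array:

import numpy as np

def local_sensitivity(D, f, p=1):
    # Leave-one-out local sensitivity of training function f on data D.
    # p = 1 gives the L1-Sensitivity (Laplace), p = 2 the L2-Sensitivity (Gauss).
    base_model = np.asarray(f(D))
    sensitivity = 0.0
    for i in range(len(D)):
        # Copy of D without row i; D itself stays unchanged.
        D_prime = [row for j, row in enumerate(D) if j != i]
        model = np.asarray(f(D_prime))
        sensitivity = max(sensitivity, np.linalg.norm(base_model - model, ord=p))
    return sensitivity

# Example with a simple "training function": the column-wise mean of the data.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 9.0]]
print(local_sensitivity(data, f=lambda rows: np.mean(rows, axis=0), p=1))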

Clipping

The clippingVal defines the maximum p-norm of the numerical data that is sent with DP. For Laplace noise, the 1-norm is used, for Gaussian noise the 2-norm. This comes from the fact that Laplace noise uses the L1-Sensitivity, while Gaussian noise uses the L2-Sensitivity. If the norm exceeds the clippingVal, the values are scaled down. The scaling happens according to the following formula:

\(w_{clipped} = w \cdot \min\left(1, \frac{C}{\|w\|_p}\right)\), where \(w\) is the numerical data which gets clipped and \(C\) is the clippingVal.

Given clipping, the sensitivity is fixed as \(2 \cdot C\). This is because the clipped \(w\) always has a p-norm of at most \(C\), so two clipped vectors can differ by a norm of at most \(2 \cdot C\).
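
As a small sketch of this formula (not necessarily how the controller implements it), clipping a vector in Python could look like this:

import numpy as np

def clip(w, clipping_val, p=1):
    # Scale w down so that its p-norm is at most clipping_val
    # (p = 1 for Laplace noise, p = 2 for Gaussian noise).
    w = np.asarray(w, dtype=float)
    norm = np.linalg.norm(w, ord=p)
    return w * min(1.0, clipping_val / norm) if norm > 0 else w

# Example: the 1-norm of [6, -8] is 14, so with C = 10 the vector is scaled by 10/14.
print(clip([6.0, -8.0], clipping_val=10.0, p=1))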