Internet-Draft ECDH-PSI November 2024
Wang, et al. Expires 19 May 2025 [Page]
Workgroup:
Privacy Preserving Measurement
Internet-Draft:
draft-wang-ppm-ecdh-psi-00
Published:
Intended Status:
Informational
Expires:
Authors:
Y. Wang
Ant Group
W. Chang
Ant Group
Y. Lu
Ant Group
C. Hong
Ant Group
J. Peng
Ant Group

PSI based on ECDH

Abstract

This document describes Elliptic Curve Diffie-Hellman Private Set Intersection (ECDH-PSI). It instantiates the classical Meadows match-making protocol with standard elliptic curves and hash-to-curve methods. In ECDH-PSI, data items are encoded to points on an elliptic curve, and masked by the private keys of both parties. After collecting the mutually masked datasets from both parties, a participant computes their intersection and outputs the corresponding original data items as result.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 19 May 2025.

Table of Contents

1. Introduction

Private Set Intersection (PSI) schemes enable the discovery of shared elements among different parties' datasets without revealing individual data. They are widely used when there's a need to identify overlapping data elements between two or more parties while preserving the confidentiality of each party's original data. PSI is one of the most frequently used privacy preserving techniques in business, e.g. it enables a user to detect whether his/her password is leaked without giving away the password to the server [MS21], or multiple companies to find their common customers without giving each other their raw data. Furthermore, many privacy-respecting systems can leverage PSI schemes to collaborate with other data providers for the purpose of enhancing the privacy of data-exchange processes. As an example, a service provider who has aggregated attributes from individuals following [I-D.ietf-ppm-dap][draft-irtf-cfrg-vdaf-12] can use PSIs to compare its results with another entity without disclosing any attribute that does not owned by the partner.

The classical Diffie-Hellman PSI [Meadows86] has been published for more than thirty years. The academic has also proposed numerous PSI schemes with new functionalities and secure guarantees, such as [KKRT16][CHLR18][RR22], but DH-PSI is still preferable due to its simplicity and communication efficiency. This document describes a widely deployed instance of DH-PSI denoted by Elliptic Curve Diffie-Hellman PSI (ECDH-PSI). Compared with the finite field parameters used in the original version, elliptic curves are commonly considered to be more efficient and secure [IMC][LOGJAM], and are extensively adopted by recent standards including [RFC8031][RFC8446][RFC9497].

In document describes 2-party ECDH-PSI as a two stage protocol, including a handshake phase and a data exchange phase. In the handshake phase, the parties agree on the cipher suite and message flow to be used in data exchange phase. Then, during the data exchange phase, original data elements are masked and transmitted in batches, allowing one or both parties to output the result.

1.1. Requirements Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

1.2. Data Structures and Notations

In this document, data structures are defined and encoded according to the conventions laid out in Section 3 of [RFC8446].

2. Background

This section gives brief introductions for elliptic curves and hash-to-curve methods.

2.1. Elliptic Curves

An elliptic curve E is defined by a two-variable equation over a finite field F with prime characteristic p larger than three. Except for a special point called point at infinity, a point on E is a (x,y)-coordinate pair that satisfies the curve equation, where x and y are elements of F. All distinct points of curve E consists an algebraic group of order n, where the point at infinity is the identity element. Applications and standards generally use a prime order (sub)group <G> generated by a point G on E. The order of <G> is denoted by r.

This document uses term "addition" to refer to the group operation of a elliptic curve group, meaning two points P and Q can be added together to get a third point R. Scalar multiplication refers to the operation of adding a point P with itself repeatedly. This document uses scalar_mul to denote the scalar multiplication process. It takes a integer x<r and a point P as inputs and outputs the multiplication result Q, which is written by Q=scalar_mul(P,x).

Many cryptographic applications require the participants to generate elliptic curve key pairs. Usually, a key pair on refers to a (PK,sk) tuple, where sk is a integer generated uniformly between 1 and r-1, and PK=scalar_mul(G,sk). In the context of ECDH-PSI, only sk, namely the private key, is used. Thus, this document uses the notation sk=keygen() to refer to the process of generating a private key on the elliptic curve and omits PK.

2.2. Hashing to Elliptic Curves

A hash_to_curve function encodes data of arbitrary length to a point on a elliptic curve. The encoding process first hashes the byte string to one or more elements of fixed length, and then calculates a point on the curve with the elements using so-called map_to_curve and clear_cofactor functions.

[RFC9380] describes a series of hash_to_curve methods (which are also called "suites") for standard curves such as NIST P-256 and curve25519. It specifies two encoding types denoted by "encode_to_curve" and "hash_to_curve". Both encodings can be proven indifferentiable from random oracles (ROs)[MRH04], but have different output distributions. This document only uses "hash_to_curve" encodings which output uniformly random points. It also uses the notation P=hash_to_curve(data) to refer to the process of encoding data of arbitrary length to a point P on the curve given in the context.

[RFC9380] requires all uses of the hash_to_curve suites must enforce domain separation with domain separation tags (DSTs). In particular, DSTs are distinct strings that enable different applications to simulate distinct RO instances with the same hash function. That is, each application concatenates the DST with its message as the input to hash function. This document allocates each suit with a unique DST. For simplicity, DST strings are omitted when writing the process of calculating the hash value of a string. The notation tag=hash(str) means that the hash function actually takes "dst||str" as its input rather than str alone, where dst is the DST string defined for the corresponding hash_to_curve suite. Similarly, this document also omits DST strings when describing hash_to_curve functions, assuming that the corresponding DST has been taken as a part of the input.

A suite defined by [RFC9380] includes a series of parameters such as the curve, hash function, and encoding type. In applications, these suites are very efficient to be referred to, as the included parameters can also be used beyond the scenario of mapping data to curves. For example, the participants can negotiate a hash_to_curve suite, and then use the curve deduced from the suite for other cryptographic functions such as encryption and key establishment.

To extend the usage of hash_to_curve suites, [RFC9380] also describes methods and naming conventions for defining new suites, which enables developers to define suites with new curves and hash functions.

3. Protocol

We begin with a very brief description on how ECDH-PSI works. Firstly, the participants, A and B, agree on a EC group <G> along with the hash_to_curve suites, and generate private keys over <G>. Then, both A and B convert their data to points over <G> and multiply the points with both parties' private keys. Next, they exchange the sets of masked elements as sets of EC points, and compute the intersection over these sets. Finally, they output the original data items corresponding to the intersection of masked sets.

Assume that the parties have agreed on the suite, and generated private keys of sk_A and sk_B, an simplified protocol flow of ECDH-PSI is shown by Figure 1, and explained step-by-step as follows:

  1. A participant maps its data items to EC points with hash_to_curve, and masks the points locally with its own private key by scalar_mul.
  2. A and B exchange their locally masked elements.
  3. Upon receiving the masked data from its partner, a participant doubly masks the received points with its private key.
  4. A participant sends the doubly-masked elements back to its partner.
  5. The participant calculates intersection of the set calculated in Step 3 and the set received in Step 4, and finally outputs the corresponding original elements.

In order to clarify the statement, this document uses "the first round" to refer to the data exchange process of Step 2, and "the second round" for the process of Step 4.

The classical description of ECDH-PSI enables both parties to output PSI results and thus requires two complete rounds (i.e. Step 2 and Step 4) to exchange masked data. However, many application scenarios require that should only one participant output the intersection and the other one get nothing. In such scenarios, Step 4 and Step 5 are executed unidirectionally, where only one party sends doubly-masked data in Step 4 and only the receiver of that message can output the result.

To output PSI result, the participants should also match the original data items with their corresponding EC points. In particular, the relationship can be established with unique indexes, where the original data item and the masked EC point(s) are linked to the same index.



              A(data_A,sk_A)                  B(data_B,sk_B)
-----------------------------------------------------------------------
Step 1:             |                               |
                    |                               |
            p_A=scalar_mul(sk_A,            p_B=scalar_mul(sk_B,
            hash_to_curve(data_A))          hash_to_curve(data_B))
                    |                               |
Step 2:             |                               |
                    |                               |
                    |-------------p_A-------------->|
                    |<------------p_B---------------|
                    |                               |
Step 3:             |                               |
                    |                               |
       p_AB=scalar_mul(sk_A,p_B)     p_BA = scalar_mul(sk_B,p_A)
                    |                               |
                    |                               |
                    |                               |
Step 4:             |                               |
                    |                               |
                    |-------------p_AB------------->|
                    |<------------p_BA--------------|
                    |                               |
Step 5:             |                               |
                    |                               |
         set_A=intersect(p_AB,p_BA)      set_B=intersect(p_BA,p_AB)
                    |                               |
                    |                               |
         output match(set_A,data_A)      output match(set_B,data_B)

Figure 1: A Simplified Protocol Flow of ECDH-PSI

3.1. Overview

The specification of ECDH-PSI consists of two phases as shown in Figure 2, including a handshake phase which negotiates the protocol parameters and a data exchange phase that performs the operations described by Figure 1.

  • In the handshake phase, a participant, which is denoted by requester, sends a HandshakeRequest message to initiate the ECDH-PSI protocol flow. This message carries lists of parameters supported by the requester and information about the requester's data. Then, the other participant, denoted by responder, selects parameters from the requester's lists, and sends the parameters with its data information in a HandshakeResponse message.
  • Next, in the data exchange phase, both parties perform ECDH-PSI operations with parameters selected in the handshake phase, where EcdhPsiBatch messages are exchanged. The receiver of an EcdhPsiBatch replies with a EcdhPsiBatchResponse message, which tells sender the handling status of last EcdhPsiBatch message.

This specification enables masked data to be transferred by many batches, as the entire data set may be extremely large and requires numerous resources to be handled with properly. For example, the size of a data set may exceed the limit of a single package of the underlying network protocol. The participants can divide the data set to multiple batches and use multiple EcdhPsiBatch messages to send them so as to meet the limits of network, storage or computation resources.

Besides hash_to_curve suites and data formats, the handshake phase also determines the interaction pattern for data exchange phase. The pattern is decided by the following options:

  • The participant that outputs the result. If only one participant (for example, A) has the privilege of obtaining the result, then, in the second round, EcdhPsiBatch messages will be only sent from B to A.
  • The maximum size of a single EcdhPsiBatch message. Upon the determination of maximum batch size, the participants can deduce the number of batch messages sent or received in each round by dividing the size of entire data set with the maximum size.
  • The order of sending EcdhPsiBatch messages. When both participants need to send batch messages in the same round, the requester will always send the first one. Then, they can send EcdhPsiBatch messages in turn, or let the requester send all batches in sequence before the responder sending its data.


                  A                                     B
-----------------------------------------------------------------------
                  |                                     |
Handshake Phase:  |                                     |
                  |----------HandshakeRequest---------->|
                  |<---------HandshakeResponse----------|
                  |                                     |
Data Exchange     |
Phase:            |                                     |
                  |                                     |
                  |-------------EcdhPsiBatch----------->|
                  |<--------EcdhPsiBatchResponse--------|
                  |                   .                 |
                  |                   .                 |
                  |                   .                 |
                  |-------------EcdhPsiBatch----------->|
                  |<--------EcdhPsiBatchResponse--------|
                  |                   .                 |
                  |                   .                 |
                  |                   .                 |
Figure 2: An Overview of the Two-Phase Specification of ECDH-PSI

3.2. Handshake

The handshake phase includes a HandshakeRequest and a HandshakeResponse message. The structures of these messages are documented as follows.

3.2.1. HandshakeRequest

Structure of HandshakeRequest:

  struct {
    uint8 version = 1;
    ProtocolProposal protocol_param;
    DataProposal data_param;
  } HandshakeRequest;

version: The version of ECDH-PSI protocol.

protocol_param: ECDH-PSI protocol configurations supported by the requester.

data_param: The requester's data information, including its data set size and proposals for data exchange pattern.

3.2.1.1. Protocol Proposal

The ProtocolProposal message describes ECDH-PSI protocol parameters supported by the requester.

Structure of this message:

enum {
  null (0),
  P256_XMD:SHA-256_SSWU_NU_ (1),
  P384_XMD:SHA-384_SSWU_NU_ (2),
  P521_XMD:SHA-512_SSWU_NU_ (3),
  curve25519_XMD:SHA-512_ELL2_NU_ (4),
  (255)
} HashToCurveSuite

enum { compressed (0), uncompressed  (1), (255) } PointOctetFormat

enum {
  no_truncation (0),
  128_bit_truncation (1),
  (255)
  } TruncationOption

struct {
  HashToCurveSuite    suites<1..255>;
  PointOctetFormat    point_octet_formats<1..255>;
  TruncationOption    truncation_options<1..255>;
} ProtocolProposal

suites: The list of supported HashToCurveSuite in order of the requester's preference. .

Different suites use different DST strings as required by [RFC9380]. In particular, ECDH-PSI uses "ECDH-PSI-V<xx>-<suites>" as its DST, where <xx> is a two-digit number indicating the current protocol version as specified by the version field of HandshakeRequest, and <suites> is the ASCII string representation of the currently adopted suite as defined by [RFC9380]. As an example, when both participants agree to use the suite for P256 curve, the corresponding DST is "ECDH-PSI-V01-P256_XMD:SHA-256_SSWU_NU_".

point_octet_formats: The list of octet string formats representing EC points, in the order of requester's preference. This field enables the participants to weigh the tradeoffs between the costs of communication and computation as discussed by Section 4.1.

  • For the curves defined in [FIPS186-4], the PointOctetFormat values refer to the corresponding compressed/uncompressed formats documented by [ECDSA].
  • Points on Curve25519 are always encoded as the input and output of X25519 function defined by Section 5 of [RFC7748] as 32 byte strings. Unlike the other curves, the scalar multiplication of curve25519 only relies on one 32-byte coordinate without the cost of decompressing to the (x,y)-coordinate representation, which eliminates the need of making a tradeoff.

truncation_options: Whether the requester supports truncation for data transferred in the second round. The options are listed in the requester's preference. The truncation could decrease the costs of data transmission and storage, and even accelerate the computation process of intersection, at the expense of a (negligible) false-positive rate.

In this document, only 128-bit truncation is allowed, with a restriction that the total number of data items owned by both participants SHOULD be no more than 2^40, ensuring that the probability of collision is smaller than 2^-48. See Section 5.1 for a detailed discussion. The requester MUST set this field by a single no_truncation option when its data set size has already been equal or larger than 2^40.

The truncation process utilizes Key Derivation Functions (KDFs) as many key-exchange protocols that use shared secrets to derive shorter strings which can be seen as uniformly distributed cryptographic keys (e.g., [RFC8446][RFC9382]). In this document, the KDFs are instantiated by Hashed Message Authentication Code (HMAC) [RFC2104], along with HMAC-based KDF (HKDF)[RFC5869], and the hash function included in hash_to_curve suite.

Following [RFC5869], a KDF function kdf() takes a input keying material ikm, a salt value salt, an application specific information info and output length len as its inputs. Let mask_data be the string representation of a doubly-masked element, a participant SHOULD truncate mask_data with:

truncated_mask_data = kdf(mask_data, nil, "ECDH-PSI", 128)

In particular, the KDF takes mask_data as the input keying material. It does not take a salt value, uses ASCII string "ECDH-PSI" as the application specification information, and outputs a 128-bit string as the truncation result.

3.2.1.2. Data Proposal

The data_param field describes requester's data size and provides supported options on how to exchange EcdhPsiBatch messages.

Structure of this message:

enum { continuous (0), interactive (1), (255) } BatchMode

enum { both (0), requester (1), responder (2), (255) } OutputMode

struct{
  BatchMode  supported_batch_modes<1..255>;
  OutputMode supported_output_modes<1..255>;
  uint64     max_batch_size;
  uint64     item_num;
} DataProposal

supported_batch_modes: The list of interaction patterns supported by the requester in the order of preference. BatchMode describes the message interaction pattern when both parties need to send their data sets with multiple EcdhPsiBatch messages in the same round.

  • continuous: A participant sends its entire data set with continuous EcdhPsiBatch messages.

  • interactive: The participants send EcdhPsiBatch messages in turn. If one party has sent all its batch messages before its partner, the partner SHOULD send its messages continuously.

When both parties need to send data batches in the same round, the requester SHOULD always send the first EcdhPsiBatch message.

supported_output_modes: The list of output modes supported by the requester in the order of its preference, where OutputMode describes which party has the privilege of obtaining the result.

The meaning of OutputMode is explained as follows:

  • both: Both parties output the intersection result, which means EcdhPsiBatch are sent mutually in the second round.

  • requester: Only the requester outputs PSI result, which means only the responder sends message batches in the second round.

  • responder: Only the responder outputs PSI result, which means only the requester sends message batches in the second round.

max_batch_size: The maximum byte size of a single EcdhPsiBatch message allowed by the requester. That is, the requester will not send a batch message beyond this limit, and it cannot receive a batch message having larger size. The requester SHOULD decide the value of this field comprehensively according to its hardware/software configuration and environment.

item_num: The total number of requester's data items.

3.2.2. HandshakeResponse

The responder sends HandshakeResponse to reply to a HandshakeRequest message. It includes selected parameters and the responder's data information.

Structure of this message:


enum {
  success(0),
  generic_error         (0x01),
  unsupported_version   (0x02)
  invalid_request       (0x03),
  out_of_resource       (0x04),
  unsupported_parameter (0x05),
  (0xFF)
} HandshakeStatus

struct {
  HandshakeStatus status;
  ProtocolResult protocol_param;
  DataResult data_param;
} HandshakeResponse;

status: Indicating the status of handshake. The meanings of error statuses are listed as follows:

  • generic_error: The error cannot be categorized to the rest error statuses defined by HandshakeStatus.

  • unsupported_version: The responder does not support the ECDH-PSI protocol version given by version.

  • invalid_request: The responder cannot parse the request correctly. The message may be in wrong format and cannot be parsed as a normal HandshakeRequest message.

  • out_of_resource: The responder does not have enough resource to handle the ECDH-PSI process following to the options provided by the requester.

  • unsupported_parameter: The responder rejects all options proposed by one of the suggested option lists included in HandshakeRequest.

The responder MUST ignore all options or parameters that it cannot recognize when parsing the HandshakeRequest message. If one of the suggested option lists is filled with unrecognized parameters, it SHOULD reply with a HandshakeResponse carrying unsupported_parameter.

Upon receiving a HandshakeResponse message without success status, the requester MUST ignore the other fields included in this message.

protocol_param: The protocol parameter selected by the responder, including the hash_to_curve suite, EC point string format and truncation option.

data_param: This message includes the data exchange pattern selected by the responder and the size of responder's dataset.

3.2.2.1. Protocol Result

The structure of ProtocolResult is defined as follows:

struct {
  HashToCurveSuite    suite;
  PointOctetFormat    point_octet_format;
  TruncationOption    truncation_option;
} ProtocolResult;

suite: The hash_to_curve suite selected by the responder.

point_octet_format: The format of EC point octet string chosen by the responder.

truncation_option: The truncation option selected by the responder. When deciding the field, the responder SHOULD consider the sum of both participants' dataset sizes. If the sum is larger than 2^40, truncation MUST no be used, which means the responder MUST set this field by no_truncation.

3.2.2.2. Data Result

The structure of DataInfoResult is defined as follows:

struct  {
  BatchMode   batch_mode;
  OutputMode  output_mode;
  uint64      max_batch_size;
  uint64      item_num;
} DataInfoResult;

batch_mode: The BatchMode selected by responder.

output_mode: The OutputMode selected by responder.

max_batch_size: The maximum size of a EcdhPsiBatch message, which SHOULD NOT be larger than the max_batch_size given by the requester.

item_num: The size of responder's dataset.

3.3. Data Exchange

The data exchange phase consists of two rounds. In each round, a participant may send or receive multiple EcdhPsiBatch messages. Upon receiving a EcdhPsiBatch and handling the data properly, the participant SHOULD respond with a EcdhPsiBatchResponse message, indicating its partner that it is ready to receive the next EcdhPsiBatch message.

3.3.1. EcdhPsiBatch

The structure of EcdhPsiBatch is defined as follows:

enum { transmit(0), retransmit(1), fatal_error(2), (255)} BatchStatus

struct {
  uint64 index;
  opaque octet;
} ECPointOctet;

struct {
  BatchStatus   status;
  uint32        batch_type;
  uint64        batch_index;
  uint64        batch_count;
  uint32        is_last_batch;
  ECPointOctet  data<1..2^64-1>;
} EcdhPsiBatch;

status: Status of current batch message. Values of this field are explained as follows:

  • transmit: It is a normal batch message which has not been transmitted before.

  • retransmit: This message has been transmitted, but it is re-transmitted due to the receiver's request. A batch message may be retransmitted several times before it is properly handled by the receiver. For example, the receiver may need a relative long time to prepare the storage resource, and require the sender to re-send the batch until it has enough space.

  • fatal_error: The sender of this batch message cannot continue the ECDH-PSI protocol procedure due to some fatal errors. The receiver MUST ignore the other fields included in the current EcdhPsiBatch message and terminate the PSI process.

    Such fatal error may occur in the handshake phase or data exchange phase. For example, if the responder sends a HandshakeRespond message with parameters unsupported by the requester, the requester should construct and send a EcdhPsiBatch message carrying fatal_error status so as to terminate the process.

batch_type: This field indicates the mask status of current batch. That is, value "1" stands for that the data included in the current batch are only masked by the owner's private key, and "2" means the data has been doubly-masked with both participants' private keys.

batch_index: The index of current EcdhPsiBatch message, which is an incremental counter starting with "1". The participant SHOULD use independent counters for different rounds and protocol runs. For example, if participant A needs to send dataset p_A in the first round by 10 batches, and sends p_AB in the second round by 5 batches, it should use batch_index from 1 to 10 for the first round, and 1 to 5 for the second.

batch_count: Number of EC points included in the current EcdhPsiBatch message.

is_last_batch: Indicating whether this batch is the last one, where "1" means the batch is the last one from the sender in current round, and "0" for there exists subsequent batches.

data: This field carries multiple EC points with ECPointOctet, which number is specified by the batch_count field. Each ECPointOctet structure includes an unique index allocated by the owner of the corresponding original data, and the encoded octet string.

The purpose of index is to associate the original data item with the (doubly)-masked EC points, such that the participant can match its dataset with the intersection of masked datasets. The values of index are generated by the owner of data so as to identify each data item uniquely. When masking the data from its partner, a participant MUST reserve the index for every record. Section 4.2 gives more discussions on the implementation and maintenance of such indexes.

The content of each octet string is determined by the selected hash_to_curve suite, octet format and truncation option. In the first round, the receiver SHOULD decode the octet strings with the configurations negotiated in the handshake phase in order to mask the data provided by the sender. However, in the second round where masking is not required, the receiver SHOULD treat the contents as octet strings rather than decode them to EC points, as such strings are enough for computing the intersection.

3.3.2. EcdhPsiBatchResponse

The structure of EcdhPsiBatchResponse is defined as follows:

enum {
  success (0),
  retransmit (1),
  fatal_error (2),
  (255)
} BatchResponseStatus;

struct {
  BatchResponseStatus status;
  uint64 batch_index;
} EcdhPsiBatchResponse;

status: Indicating whether the EcdhPsiBatch message identified by batch_index has been handled properly. Values of this field are explained as follows:

  • success: The EcdhPsiBatch message has been handled in a proper way. Upon receiving a response with this status, the sender of EcdhPsiBatch can send the subsequence batch or prepare to receive the EcdhPsiBatch message from its partner.

  • retransmit: Upon receiving retransmit, the participant MUST re-send the EcdhPsiBatch message identified by batch_index.

  • fatal_error: The ECDH-PSI process meets some fatal error such that both the participants MUST terminate the current session, and discard all data received.

4. Implementation Considerations

4.1. The Representation of EC Point

When deciding the point encoding format for the elliptic curves expect curve25519, the participants should consider the aspects of bandwidth, storage and computation comprehensively. Compressed point format could decrease the costs of transmission and storage, but it also needs more computation resources to "decompress" the point. The "decompress" process may be as costly as scalar multiplication [CZZDL24]. Uncompressed format requires more storage spaces and needs more time to transmit, but the participant can perform scalar multiplications without extra effort to recover the points.

For example, when both parties are deployed in the same data center and linked with a high-bandwidth local area network (LAN), they can choose to use the uncompressed format to achieve better performance.

As another example, the parties may be deployed in geographically separated data centers connected with low bandwidth wide area network (WAN), and equipped with high-end computation acceleration devices such as Graphics Processing Units (GPUs). In this case, the computation resource may not be the bottleneck, and the participants can choose the compress format to minimize the costs of data transmission.

4.2. The Management of Index

A RECOMMENDED method of maintaining indexes is storing the records with a database table and use the row numbers or prime keys as indexes. The intersection result can be matched with original data items by a simple join operation. The participant could also design a different indexing mechanism on its own, as long as guaranteeing its uniqueness.

This document uses explicit indexes to identify data items rather than treats the order of records as "implicit" indexes. Compare with the implicit counterpart, explicit indexes can be efficiently created and maintained by modern databases, and do not require the participants to implement extra logic to preserve the order of records.

When masking the EC points from its partner, a participant MUST send the doubly-masked data with correct indexes. To achieve this, it keeps the indexes of each data items received in the first round, masks the records with its private key, and associates the masked records with the same indexes. These operations can also be implemented easily with database operations by viewing the received records and masked records as columns in the same table, which enables them to be linked naturally via the primary key.

4.3. Support More hash_to_curve Suites

Participants can negotiate hash_to_curve suites besides the ones listed by Section 3.2.1.1. They can use other suites defined by [RFC9380], or use a self-defined suite following the syntax and convention given by Section 8.9 and 8.10 of [RFC9380].

As an example, this document next defines a hash_to_curve suite for the ShangMi(SM) curve and cryptography algorithms. The SM algorithms are becoming mandatory in China, and have been accepted by international standards such as [ISO-SM2], [ISO-SM3] and [RFC8998].

The new suite, denoted by curveSM2_XMD:SM3_SSWU_RO, encodes data to points on the so-called curveSM2[GBT.32918.5-2017] with SM3 hash algorithm[GBT.32905-2016], and reuses the expand_message_xmd and Simplified Shallue-van de Woestijne-Ulas (SSWU) methods for message expansion and mapping.

In particular, curveSM2_XMD:SM3_SSWU_RO_ is defined as follows:

  • encoding type: hash_to_curve
  • E: y^2 = x^3 + A * x + B, where

    • A = -3
    • B = 0x28e9fa9e9d9f5e344d5a9e4bcf6509a7f39789f515ab8f92ddbcbd414d940e93
  • p: 2^256-2^224-2^96+2^64-1
  • m: 1
  • k: 128
  • expand_message: expand_message_xmd (Section 5.3.1 of [RFC9380])
  • H: SM3
  • L: 48
  • f: SSWU method (Section 6.6.2 of of [RFC9380])
  • Z: -9
  • h_eff: 1

The value Z is calculated by the sage script given by Appendix H.2 of [RFC9380] with the p, A and B parameters of curveSM2.

The participants SHOULD agree on the meaning of HashToCurveSuite values. In this document, suite curveSM2_XMD:SM3_SSWU_RO_ is identified as follows:

enum {
  null                              (0),
  ... // (1) to (4) follow section 3.2.1.1
  curveSM2_XMD:SM3_SSWU_RO_         (5),
  ... // other hash_to_curve suites
  (255)
} HashToCurveSuite;

5. Security Considerations

5.1. Data Truncation

This section provides a detailed discussion on the truncation mechanism presented in this document.

The truncation, undoubtedly, will raise the probability of message collision. That is, two different data item may be mapped to the same string after the procedures of masking and truncation by coincidence. The collision may happen across the datasets owned by different participants, which causes a false-positive case that two different records are matched by PSI, or happen in the same dataset. The later situation may also lead to a false-positive case. To be more specific, consider a participant A has two different data items data_X and data_Y which are mapped to the same record mask_data_XY. Its partner, say B, also has record data_X which is mapped to mask_data_XY. Finally, A outputs both data_X and data_Y as the result of PSI, as it cannot distinguish which one matches the record of mask_data_XY.

To avoid such false-positive case, we have to consider the collision probability with respect to the sum number of data items owned by A and B. Such probability can be computed with the well-known birthday paradox bound. Let n be the number of sampling and d be the output space, the probability of collision can be approximately computed with:

p(n,d)=1-e^(-n(n-1)/2d)

This document uses a truncated length of 128 bit, which means d=2^128. For different n, which is the sum size of records, the probability are calculated as follows:

  • If n=2^50, p(2^50, 2^128) = 2^-29

  • If n=2^45, p(2^45, 2^128) = 2^-33

  • If n=2^41, p(2^41, 2^128) = 2^-47

  • If n=2^40, p(2^40, 2^128) = 2^-49

  • If n=2^39, p(2^39, 2^128) = 2^-51

That is, if the number of records is less than 2^40, the probability of false-positive will be smaller than 2^-48. The participant can also decide the truncation option by calculating the collision probability, and only use truncation when they both agree that the probability is acceptable.

6. IANA Considerations

This document has no IANA actions.

7. References

7.1. Normative References

[ECDSA]
American National Standards Institute, "Public Key Cryptography for the Financial Services Industry: The Elliptic Curve Digital Signature Algorithm (ECDSA)", ANSI ANS X9.62-2005, .
[FIPS186-4]
National Institute of Standards and Technology (NIST), "Digital Signature Standard (DSS)", FIPS 186-4, DOI 10.6028/NIST.FIPS.186-4, , <https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.186-4.pdf>.
[GBT.32905-2016]
Standardization Administration of China, "Information security technology --- SM3 cryptographic hash algorithm", GB/T 32905-2016, , <http://www.gmbz.org.cn/upload/2018-07-24/1532401392982079739.pdf>.
[GBT.32918.5-2017]
Standardization Administration of the People's Republic of China, "Information security technology --- Public key cryptographic algorithm SM2 based on elliptic curves --- Part 5: Parameter definition", GB/T 32918.5-2017, , <http://www.gmbz.org.cn/upload/2018-07-24/1532401863206085511.pdf>.
[RFC2104]
Krawczyk, H., Bellare, M., and R. Canetti, "HMAC: Keyed-Hashing for Message Authentication", RFC 2104, DOI 10.17487/RFC2104, , <https://www.rfc-editor.org/info/rfc2104>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, , <https://www.rfc-editor.org/info/rfc2119>.
[RFC5869]
Krawczyk, H. and P. Eronen, "HMAC-based Extract-and-Expand Key Derivation Function (HKDF)", RFC 5869, DOI 10.17487/RFC5869, , <https://www.rfc-editor.org/info/rfc5869>.
[RFC7748]
Langley, A., Hamburg, M., and S. Turner, "Elliptic Curves for Security", RFC 7748, DOI 10.17487/RFC7748, , <https://www.rfc-editor.org/info/rfc7748>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, , <https://www.rfc-editor.org/info/rfc8174>.
[RFC9380]
Faz-Hernandez, A., Scott, S., Sullivan, N., Wahby, R. S., and C. A. Wood, "Hashing to Elliptic Curves", .

7.2. Informative References

[CHLR18]
Chen, H., Huang, Z., Laine, K., and P. Rindal, "Labeled PSI from Fully Homomorphic Encryption with Malicious Security", Proceedings of the 2018 {ACM} {SIGSAC} Conference on Computer and Communications Security, {CCS} 2018, Toronto, ON, Canada, October 15-19, 2018 , , <https://doi.org/10.1145/3243734.3243836>.
[CZZDL24]
Chen, Y., Zhang, M., Zhang, C., Dong, M., and W. Liu, "Private Set Operations from Multi-query Reverse Private Membership Test", Public-Key Cryptography - {PKC} 2024 - 27th {IACR} International Conference on Practice and Theory of Public-Key Cryptography, 2024, Proceedings, Part {III} , , <https://doi.org/10.1007/978-3-031-57725-3_13>.
[draft-irtf-cfrg-vdaf-12]
Barnes, R., Cook, D., Patton, C., and P. Schoppmann, "Verifiable Distributed Aggregation Functions", Work in Progress, Internet-Draft, draft-irtf-cfrg-vdaf-12, , <https://datatracker.ietf.org/doc/html/draft-irtf-cfrg-vdaf-12>.
[I-D.ietf-ppm-dap]
Geoghegan, T., Patton, C., Pitman, B., Rescorla, E., and C. A. Wood, "Distributed Aggregation Protocol for Privacy Preserving Measurement", Work in Progress, Internet-Draft, draft-ietf-ppm-dap-12, , <https://datatracker.ietf.org/doc/html/draft-ietf-ppm-dap-12>.
[IMC]
Katz, J. and Y. Lindell, "Introduction to Modern Cryptography, Third Edition".
[ISO-SM2]
International Organization for Standardization, "IT Security techniques -- Digital signatures with appendix -- Part 3: Discrete logarithm based mechanisms", ISO/IEC 14888-3:2018, , <https://www.iso.org/standard/76382.html>.
[ISO-SM3]
International Organization for Standardization, "IT Security techniques -- Hash-functions -- Part 3: Dedicated hash-functions", ISO/IEC 10118-3:2018, , <https://www.iso.org/standard/67116.html>.
[KKRT16]
Kolesnikov, V., Kumaresan, R., Rosulek, M., and N. Trieu, "Efficient Batched Oblivious {PRF} with Applications to Private Set Intersection", Proceedings of the 2016 {ACM} {SIGSAC} Conference on Computer and Communications Security, Vienna, Austria, October 24-28, 2016 , , <https://doi.org/10.1145/2976749.2978381>.
[LOGJAM]
Adrian, D., Bhargavan, K., Durumeric, Z., Gaudry, P., Green, M., Halderman, J. A., Heninger, N., Springall, D., Thomé, E., Valenta, L., VanderSloot, B., Wustrow, E., Zanella-Béguelin, S., and P. Zimmermann, "Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice", Proceedings of the 22nd {ACM} {SIGSAC} Conference on Computer and Communications Security, Denver, CO, USA, October 12-16, 2015 , , <https://doi.org/10.1145/2810103.2813707>.
[Meadows86]
Meadows, C., "A More Efficient Cryptographic Matchmaking Protocol for Use in the Absence of a Continuously Available Third Party", 1986 IEEE Symposium on Security and Privacy , <https://doi.org/10.1109/SP.1986.10022>.
[MRH04]
Maurer, U., Renner, R., and C. Holenstein, "Indifferentiability, Impossibility Results on Reductions, and Applications to the Random Oracle Methodology", In TCC 2004: Theory of Cryptography, pages 21-39, DOI 10.1007/978-3-540-24638-1_2, , <https://doi.org/10.1007/978-3-540-24638-1_2>.
[MS21]
Lauter, K., Kannepalli, S., Laine, K., and R. C. Moreno, "Password Monitor: Safeguarding passwords in Microsoft Edge", <https://www.microsoft.com/en-us/research/blog/password-monitor-safeguarding-passwords-in-microsoft-edge/>.
[RFC8031]
Nir, Y. and S. Josefsson, "Curve25519 and Curve448 for the Internet Key Exchange Protocol Version 2 (IKEv2) Key Agreement", RFC 8031, DOI 10.17487/RFC8031, , <https://www.rfc-editor.org/info/rfc8031>.
[RFC8446]
Rescorla, E., "The Transport Layer Security (TLS) Protocol Version 1.3", RFC 8446, DOI 10.17487/RFC8446, , <https://www.rfc-editor.org/info/rfc8446>.
[RFC8998]
Yang, P., "ShangMi (SM) Cipher Suites for TLS 1.3", RFC 8998, DOI 10.17487/RFC8998, , <https://www.rfc-editor.org/info/rfc8998>.
[RFC9382]
Ladd, W., "SPAKE2, a Password-Authenticated Key Exchange", RFC 9382, DOI 10.17487/RFC9382, , <https://www.rfc-editor.org/info/rfc9382>.
[RFC9497]
Davidson, A., Faz-Hernandez, A., Sullivan, N., and C. A. Wood, "Oblivious Pseudorandom Functions (OPRFs) Using Prime-Order Groups", RFC 9497, DOI 10.17487/RFC9497, , <https://www.rfc-editor.org/info/rfc9497>.
[RR22]
Raghuraman, S. and P. Rindal, "Blazing Fast PSI from Improved OKVS and Subfield VOLE", Proceedings of the 2022 {ACM} {SIGSAC} Conference on Computer and Communications Security, {CCS} 2022, Los Angeles, CA, USA, November 7-11, 2022 , , <https://doi.org/10.1145/3548606.3560658>.

Authors' Addresses

Yuchen Wang
Ant Group
Wenting Chang
Ant Group
Postfach 330440
Beijing
Yufei Lu
Ant Group
Cheng Hong
Ant Group
Jin Peng
Ant Group