# secdir review of draft-ietf-ippm-ioam-data-integrity-12

CC kaduk

I have reviewed this document as part of the security directorate's ongoing effort to review all IETF documents being processed by the IESG. These comments were written primarily for the benefit of the security area directors. Document editors and WG chairs should treat these comments just like any other last call comments.

This is a re-review; I had previously reviewed version -07, and am mostly looking at the diff from that revision (though in some sections that got many changes I just read the current version without trying to ascertain what exactly changed). I see now as I am completing this review that Justin had responded to my previous review comments back in January (https://mailarchive.ietf.org/arch/msg/ippm/32pZbJc-8aEf6_MpH_2Sr4-sBSc/) but I did not respond at that time. I am not entirely sure what happened there; my apologies for the lack of timely response (I will go ahead and reply now to those items that still need a response).

The changes from -07 to -12 have made big improvements on a number of the points raised by my initial review, but there are a few issues that are critical to the security properties of the overall system but do not seem to be completely addressed yet (I suspect that some of these were topics where text changes were pending me responding to the email thread from the previous review). Of particular note (these topics will be covered in more detail below), that response does clarify that we don't want to trust nodes (other than verifiers) very much and seems to support the idea that each node will manage their own nonces (vs reusing the nonce from a received packet). However, this does not seem to have been fully reflected in the document yet. (The proposal to include the previous tag in the next node's AAD is sound.)

In several places in this review I will quote portions from my original review of the -07.
This is just intended to set some context for my subsequent comments; in nearly all cases the comment made on the -07 no longer applies as-is to the -12.

## Discuss

### Key management

When reviewing the -07, I wrote:

> While the specifics of key distribution and management will inherently be out of scope for this specification, I think we do need to settle a core question: will each IOAM Node have its own unique key for signature generation, or do we expect some level of key reuse across machines within a given domain? The latter scenario intuitively seems like it would make for a simpler deployment, but would place some quite stringent limitations on what cryptographic mechanisms we can use at runtime. NIST's key usage limitations for GCM, in particular, might actually prove prohibitive for the "one key shared across nodes" option in practice.

The responses from Justin (both on the ability to use the deterministic nonce construction and on the trust model) seem to confirm that we expect each node to have its own unique key (shared between it and any Validators in the domain but not with any other nodes). This uniqueness of key (in terms of nodes that are populating ICVs in packets) is vital to the safety of the system and needs to be clearly stated as an explicit requirement for using the mechanism. Currently I see some brief mentions in §5.4 ("assumes that symmetric keys have been distributed to the Validator only") and §7 ("need to be exchanged in a secure way between the nodes involved with integrity protection of IOAM-Data-Fields").
The former weakly implies, by using "keys" plural, that each node gets its own key, but the latter implies an exchange of keying material between multiple nodes that are applying integrity protection, which (unless it's just between each node and the Validator(s)) is exactly what we don't want (in particular, it seems that it would violate the "don't trust the nodes" goal, because if two non-Validator nodes share a key, one can successfully impersonate the other to the Validator).

So, I propose a new subsection for "key and nonce management" that covers the requirements and assumptions while avoiding any discussion of the specific key exchange method that might be used. This would probably look something like:

> The Integrity Protection Method defined in this document leverages symmetric keys. In order to use this mechanism, it is required that each encapsulating node and each transit node that updates the ICV has its own unique symmetric key for applying integrity protection. Additional assumptions about these keys apply:
>
> - that each key is securely generated
> - that each key is securely distributed to only the corresponding encapsulating/transit node and any Validator that needs to validate messages protected by that key
>
> The details of key generation and distribution are outside the scope of this document.
>
> In addition to key management, the per-message nonces used with GMAC need to be managed to prevent (key,nonce)-pair reuse. Since reuse of a nonce with a given key allows forgery of arbitrary ciphertexts with a valid authentication tag, it is extremely important to have high confidence in nonce non-reuse. While this document does not prescribe specific nonce-generation mechanisms, it does mention a non-exhaustive list of possibilities that includes: an integer counter, a linear feedback shift register, and a 64-bit timestamp.
> With either of the former two mechanisms, there is some current state (counter or feedback register state) that must be retained on durable storage such that a power interruption or system crash does not lead to nonce reuse. (For efficiency, implementations sometimes commit to durable storage a range of nonce values that can be used without a blocking write to durable storage, to amortize the cost of the write; this results in some nonce values going unused after a crash or power interruption, but not a catastrophic loss of security.) When timestamps are used for nonce generation, the system requires a time source that is strictly monotonic (i.e., never jumps backwards or repeats a value even in the face of leap seconds, daylight-saving time, or other potential disruptions). If a remote time source is used, it must be authenticated in order to provide these properties. Note that the time source does not need to be at all accurate in order to provide these properties!
>
> The integrity protection mechanism defined in this document relies only on the uniqueness of the nonce input to GMAC to provide its security properties. However, use of a counter or timestamp does reveal some information about the rate and/or times at which packets are protected, and the considerations of RFC 9416 apply; in particular, some scenarios may benefit from using an unpredictable nonce, as is provided by the linear feedback shift register.
>
> The details of nonce management are outside the scope of this document.
>
> With the deterministic nonce construction used by the Integrity Protection Method defined in this document, a given key can be used to protect up to 2^64 messages before encountering the catastrophic (key,nonce) reuse problem. Accordingly, the key generation/distribution mechanism will need to be able to rotate any keys that approach this limit before the limit is reached.
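(As an aside on the durable-storage amortization mentioned above, here is a rough Python sketch of what I have in mind. The class name, batch size, and storage mechanics are all mine, purely illustrative: a counter-based invocation-field allocator that persists only a reservation high-water mark, so that after a crash it resumes from the last *reserved* value rather than the last *used* one, wasting some nonce values but never reusing any.)

```python
class NonceCounter:
    """Counter-based invocation-field allocator with amortized durable writes.

    `durable` stands in for persistent storage; only the reservation
    high-water mark is written there, once per RESERVE_BATCH allocations.
    """
    RESERVE_BATCH = 1024

    def __init__(self, durable):
        self.durable = durable
        # After a restart, resume from the reserved bound: some values in
        # the last batch may go unused, but none can ever be reused.
        self.next_value = durable.get("reserved", 0)
        self.reserved = self.next_value

    def allocate(self):
        if self.next_value >= 2**64:
            raise RuntimeError("invocation field exhausted; rotate the key")
        if self.next_value >= self.reserved:
            # Blocking durable write, amortized over RESERVE_BATCH calls.
            self.reserved = self.next_value + self.RESERVE_BATCH
            self.durable["reserved"] = self.reserved
        value = self.next_value
        self.next_value += 1
        return value.to_bytes(8, "big")  # 64-bit invocation field
```

Simulating a crash here is just constructing a new `NonceCounter` over the same durable store: it skips ahead to the reserved bound instead of replaying values.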
> This protocol does not define any specific "key identifier" field, but a seamless key rotation might be achieved by, for example, associating a new node identifier to the new key and informing consumers of node identifiers that the identifiers in question reflect the same underlying device.

This subsection might fit as the first subsection of §5, or perhaps be a subsection under §7 (which might then motivate further division of the security considerations content into subsections). It seems to have expanded a bit since I first started putting it together, which probably impacts where it fits best.

Note that this also addresses some lingering aspects of my previous comment about "GCM key usage limitations" (in particular, specifically documenting that there is a potential need for key rotation).

### Signature as nonce for transit nodes

In my review of the -07 I wrote:

> The specification of the actual cryptographic protection algorithm in §5 includes a provision (step 2) for a transit node to compute a new signature that accounts for the additions it has made to the IOAM data. It seems to be saying that the corresponding cryptographic computation involves using the received signature value as the nonce input for producing this new signature. While SP800-38D does seem to permit using multiple IV lengths with a given key, using the received signature as a nonce seems to set us up for using nonces of length other than 96 bits, which (per {{GCM-Key-usage-limitations}}) strongly limits how much traffic we can send in a given key. It also would entail using data received from the network as the nonce for a given node's private key, which seems like it would make it easy for an attacker to induce (key,nonce) reuse.
> (This might be mitigated if the transit nodes diligently validate the received signature prior to using it as a nonce, but even that would not obviate the need for a fully reliable replay detection mechanism for the lifetime of the key, which seems prohibitively expensive.)
>
> [... more about validation being expensive and a potential alternate option with different tradeoffs]

In §5.2 where we discuss transit node behavior, we now say that the received ICV field is used in the input AAD for the transit node's GMAC calculation, which is a good way to ensure that the encapsulating (or previous transit) node's integrity protection tag gets validated by the ultimate validator of the packet. But we seem to say that the "received Nonce field [is] provided to GMAC" for the transit node's GMAC invocation, and we even go further and say that "a transit node MUST NOT modify [...] the Nonce-Length field, and the Nonce field in the IOAM Integrity Protection header."

As an implementor, the only way I could plausibly interpret that combination of statements is that I should have the transit node use the nonce value from the received packet as the input nonce for GMAC, but that is trivially insecure unless the transit node validates the received ICV before applying its own GMAC (since an attacker could just force a specific nonce value and thus nonce reuse on the transit node), and in general remains insecure even in the face of such validation absent other measures (since an attacker would be able to replay a valid packet/ICV under conditions that cause the transit node to generate a different AAD input for the (key,nonce) pair). (For clarity: I think there are reasonable solutions that do not involve the transit node validating the received ICV and only apply checks to the received nonce, but I needed to qualify the "trivially insecure" statement somehow.)
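(To make the "received ICV in the AAD" construction concrete, a toy Python sketch of the chaining idea follows. It uses HMAC-SHA-256 as a stand-in for GMAC, purely so the sketch is runnable with the standard library -- the draft's mechanism is GMAC -- and all names and data are illustrative. Note that reusing the encapsulating node's nonce is only unproblematic here because HMAC, unlike GMAC, is not nonce-sensitive.)

```python
import hashlib
import hmac
import os

def tag(key: bytes, nonce: bytes, aad: bytes) -> bytes:
    # Stand-in for GMAC(key, nonce, AAD): a deterministic MAC over both inputs.
    return hmac.new(key, nonce + aad, hashlib.sha256).digest()

# Encapsulating node: protects its own data with its own key and a fresh nonce.
k_encap, k_transit = os.urandom(16), os.urandom(16)
nonce = os.urandom(12)  # 96-bit nonce, generated by the encapsulating node
encap_data = b"encap IOAM-Data-Fields"
icv1 = tag(k_encap, nonce, encap_data)

# Transit node: includes the received ICV in its AAD, so the final ICV
# transitively covers the encapsulating node's protection.
transit_data = b"transit IOAM-Data-Fields"
icv2 = tag(k_transit, nonce, icv1 + transit_data)

# Validator (holds both keys): recomputes the chain; tampering with either
# node's data, or with the intermediate ICV, breaks the final comparison.
expected_icv1 = tag(k_encap, nonce, encap_data)
assert hmac.compare_digest(icv2, tag(k_transit, nonce, expected_icv1 + transit_data))
```

The point of the chain is that the Validator never needs the intermediate ICV to be separately trustworthy: it recomputes it from the encapsulating node's key and data, so a forged intermediate tag fails at final validation.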
The most secure option is to have the transit node generate and use a new nonce in the same manner as an encapsulating node, but then we have the problem that both the nonces (from encapsulating node and transit node) need to be included in the packet in order for the final ICV to be validatable. I don't know if there's scope for such expansion in the conveyed nonce (one might potentially be able to increase the Nonce-Length and concatenate them, if we can hold to the invariant that nonces are 96 bits).

If we want to have the transit nodes use the nonce from the received packet then we need to lay out some specific requirements that make that safe, which would include at a minimum: that the encapsulating and transit nodes have distinct keys, that the scheme used to create the invocation field of the nonce is known to the transit node, and that the transit node applies some validation checks on the received nonce in order to avoid nonce reuse. The nature of such checks depends on the nonce-generation scheme, of course, but for a timestamp or counter would include enforcing strict monotonicity (with an optional lookahead/behind window to specifically track a fixed number of possible nonce values and tolerate some level of packet reordering).

I will note that a scheme based on HMAC or KMAC rather than GMAC would not be as impacted by this issue, since nonce reuse is not fatal to those MACs in the same way (we could probably just use the encapsulating node's nonce and tolerate some degree of replay), but that is a big change and would make it impossible to benefit from hardware GCM implementations, so we may be stuck working around GMAC's limitations.

### Bit-level hash inputs (Protected flags for trace and DEX option-types)

In my review of the -07 I wrote:

> I think we need to provide more clear guidance on how to handle the protected flags for the trace and DEX option-types.
> First, a note in passing that interoperability does require locking in which flag bits are (not) covered by the MAC at the time the protected option type is specified (i.e., now), which blocks off future extensions since any new flag bits cannot be integrity protected using this mechanism. That said, it would be possible to divide the unallocated flags range into a protected and unprotected portion, to leave a little bit of flexibility for the future.
>
> My main concern here, though, is to concretely specify, at the bit level, what the input to the MAC (or hash) function is -- do we mask out the unprotected bits (if so, to 0s or 1s?), or do we literally just extract the two bits in question and make the bitstream input to the MAC (or hash) be not byte aligned? I strongly suggest the former, since implementation handling for non-octet inputs to hash functions is very poor, but reading the current text I would conclude that I must attempt to implement the latter.

(I note on re-read that SP800-38D does require the inputs to have bit lengths that are a multiple of 8, so if we decide to use GMAC directly rather than hash-and-MAC, we would need to specify padding if we opt for the latter approach.)

Justin's reply indicated that the intention was to mask out (to zero) bits not covered by the integrity protection, which is fine from the perspective of fully specifying what we want to see and avoiding non-byte-aligned cipher inputs, but does not seem to be reflected in the -12. Maybe this is because protection of flags was removed entirely and thus the behavior does not need to be specified at all? Though I do not see any results in the mail archive for a search on "draft-ietf-ippm-ioam-data-integrity flags" that would line up with the right timeline for that. So I'm curious about why/whether we can get away with not protecting any flags (including ones yet to be defined).
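(For concreteness, the mask-to-zero approach I'd advocate looks something like the following Python sketch. The mask value is invented for illustration, not taken from the draft: the idea is simply that the full flags octet goes into the MAC input with the unprotected bits forced to zero, so the input stays byte aligned.)

```python
# Illustrative mask: pretend only the top two flag bits are integrity protected.
PROTECTED_FLAGS_MASK = 0b1100_0000

def flags_mac_input(flags_octet: int) -> bytes:
    """Zero out unprotected flag bits, keeping the MAC input byte-aligned.

    This avoids non-octet MAC inputs entirely: the whole octet is fed to
    the MAC, with bits outside the protected set masked to zero.
    """
    return bytes([flags_octet & PROTECTED_FLAGS_MASK])
```

Two packets that differ only in unprotected (e.g., future-defined) flag bits then produce the same MAC input, so setting such bits cannot invalidate existing ICVs.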
The following paragraphs are predicated on an assumption that we do integrity protect at least some flag bits, and are irrelevant if no flag bits will need protection.

The topic of how to handle flag bits that are currently undefined but get defined in the future also remains as an item to highlight in my current review. I see Justin's indication that it's a hard problem (and to a large extent agree with the analysis), but leaving things undefined here still feels like it's setting us up for bigger problems down the road. Thinking through that a little more now, I suppose that it would technically be possible to define a new flag, roll out implementation support to all machines in the domain but without making use of the behavior that sets the bit to one, and then start to gradually enable the bit-setting behavior while monitoring for breakage. But that's a very heavy deployment burden and requires all systems to be updated before any systems can make use of the new feature, whereas defining now which bits are/aren't covered would let integrity protection continue to work with only the machines that need to use the new feature needing to get updates.

I have extremely high confidence that the IESG will ask about protocol updates and interoperability if we don't have some text in place by the time they see the document, so I would suggest putting in some text with at least your analysis of the issues and motivation for taking the current approach, even if you don't want to adopt my proposal. (In particular, the current state of affairs implicitly applies a requirement that future protocol flags cannot provide context or meaning to the IOAM-Data-Fields, which seems like a fairly strong limitation on protocol evolution and would force definition of a new Option-Type in order to allow such protocol behavior.)
## Comments

### Threat model

In my review of the -07 I wrote:

> §1 lays out the scenario as having an IOAM-Domain that's under a single administrative control but invokes the possibility of data collected in untrusted or semi-trusted environments as a motivation for integrity protection. Is this just a risk that nodes which are supposed to be under the domain's administrative control get compromised, or are we intending to consider a subtler scenario with semi-trusted entities being authorized parts of the administrative domain directly, or something else?

The situation has improved since then, with §5.4 gaining a discussion of "Zero Trust" and noting that the Validator is the only point of trust, but I don't think this text is fully aligned with general usage of "trust" or "zero trust". (Also, I think "Zero Trust" is not a good name for the Integrity Protection Method Suite, since it's not something unique about this mechanism.) In particular, I think that we still trust encapsulating and transit nodes to behave properly while applying ICVs, but that we do not rely on their behavior for anything other than the ICVs they produce, and we do not have to trust other entities in the system "at all" (not technically true but informally okay for this discussion). So I would propose something more like:

% The Validator is the key trusted entity in this system, as it has access to all of the keying material in use and makes the final determination of whether an ICV is valid. A misbehaving Validator would be able to use those keys to forge or modify IOAM data as if it passed through the encapsulating or transit nodes in question, and could render incorrect assessments of an ICV's (in)validity, and we have no recourse or means to detect such misbehavior.
% In contrast, we only trust non-IOAM nodes to convey packets without manipulation and can detect such modification by validating received ICV values; and we trust IOAM encapsulating and transit nodes to honestly apply integrity protection to packets they process. Misbehaving encapsulating or transit nodes could forge or drop packets they process but cannot impersonate other IOAM nodes or modify integrity-protected IOAM data produced by other nodes without being detected by the Validator. In particular, a transit node is limited in what forgery can be made without detection because the Validator will validate the encapsulating node's ICV as part of validating the final ICV, so modification to content protected by the encapsulating node would be detected at the time of validation.

### Separation of layers

In my review of the -07, I wrote:

> While it's generally reasonable (as §3 does) to require the lower layer protocol to handle threats at their own layer, I would probably call out that since IOAM is defined as data fields rather than a dedicated packet structure, we also rely on the lower layer to provide integrity protection for which data fields (that is, IOAM Option-Types) are present in a given packet.

but I'm not sure my intent came through properly. Basically, this is just saying that since each integrity-protected IOAM Option-Type is a distinct IOAM-Option with its own ICV, there is no protection for the set of which IOAM Option-Types are expected to be present in a packet -- an attacker could just strip one out or replace it with an unprotected analogue but the unmodified Option-Types would still validate. So we rely on a different layer (possibly out-of-band communication between encapsulator and Validator) to ensure that the set of IOAM Option-Types which are present gets some integrity protection.
We could say as much in the security considerations, e.g.:

% Because each integrity-protected IOAM Option-Type is a distinct IOAM-Option with its own ICV, there is no mapping or listing of which IOAM Option-Types are expected to be present in a packet. Accordingly, an attacker with the ability to modify the packet contents could just remove an integrity-protected IOAM Option-Type or replace it with an unprotected analogue, but the unmodified integrity-protected Option-Types would still validate successfully. If there is a need to provide integrity protection for which IOAM Option-Types are present in a packet, that protection must be supplied at a different layer.

### Mutable fields to skip for signing

In my review of the -07, I wrote:

> The discussion in §5 very quickly glosses over "IOAM-Data-Fields supposed to be modified by other IOAM nodes on the path MUST be excluded from the signature". This is actually a critical point for constructing and validating ciphertexts, and seems like it would merit a longer treatment. Most notably, I would probably want to have a central table (or maybe add to the IANA registry?) to indicate "does this field get skipped for integrity protection: Y/N" to try to leave out any guesswork by the implementor as to what is mutable or not.

and Justin proposed to add a table to this effect. It looks like maybe he was waiting for a reply from me about whether to just do the table or also to shuffle around or remove some other content. Some of that content has been relocated from -07 to -12 already, though, so I am not entirely sure of the current expected status.
I think the current content in the -12 is mostly talking about which header fields are included in the integrity protection, whereas I wanted to list both the header fields and the specific data fields that are (not) covered by the integrity protection (i.e., those that are mutable or written by other nodes, but to specifically list them out in a table rather than leaving it implicit).

### Take action based on IOAM data

In the -07 we had some fairly strong language about "Each node that takes actions triggered by fields in the IOAM Integrity Protected Option-Type header MUST act as a Validator" that doesn't seem to have an analogue in the -12. But that may well be ok, as it's up to the consumer of the data whether they need the protection provided by the integrity protection method or are willing to treat it as functionally unprotected for the purpose to which it is being put. We do, however, probably want to say (in the security considerations section) that nodes which consume the IOAM data without validating the ICV do not get the security benefit that the integrity protection mechanism provides.

### Selection of header fields

While I agree that NodeLen can be computed from the IOAM-Trace-Type, adding an extra byte of AAD input has an extremely low computational cost, no wire protocol overhead, and substantially improves the robustness of implementations (by forcing them to directly verify the computed length cryptographically instead of hoping that different implementations implement the right computation from the (cryptographically validated) trace type). I see that, as a five-bit field, adding it to the AAD would require specifying the bit-level padding, but IMO that is a price worth paying in order to get the property that "you have to have the right value in order to cryptographically interoperate" (a philosophy that has served modern de novo security protocol designs like QUIC very well).
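(Returning to the mutable-fields table idea above: as an illustration of the kind of table I have in mind -- the field names and Y/N values here are invented for the sketch, not taken from the draft -- an implementation could drive its AAD construction directly from one shared table, removing any implementor guesswork about which fields are mutable:)

```python
# Illustrative coverage table: True = included in the integrity protection,
# False = mutable / written by other nodes on the path, therefore skipped.
# Field names and coverage values are examples only, not the draft's actual set.
COVERAGE = [
    ("Namespace-ID",    True),
    ("NodeLen",         True),
    ("IOAM-Trace-Type", True),
    ("RemainingLen",    False),  # updated by each transit node: mutable
]

def build_aad(header_fields: dict) -> bytes:
    """Concatenate only the covered fields, in table order.

    Both signer and Validator derive their AAD from the same table,
    so coverage mismatches become impossible by construction.
    """
    return b"".join(header_fields[name] for name, covered in COVERAGE if covered)
```

A change to a mutable field then provably has no effect on the MAC input, while any change to a covered field does.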
### Decapsulation

In §5.3 we have some text:

> The decapsulating node MUST NOT add the IOAM Integrity Protection header. Also, the decapsulating node MUST NOT modify the Method-ID field, the Nonce-Length field, and the Nonce field in the IOAM Integrity Protection header.

which is roughly analogous to what we say about transit nodes. But some of these listed actions don't seem to make any sense for a decapsulating node and thus may not need to be listed? E.g., if the decapsulating node is removing the whole header, then what is the concern about potential modification to those fields?

### All or none for IOAM Integrity Protection

Since the -07 we have some new text in the security considerations:

> When enabled, the integrity protection MUST be applied to the entire IOAM set, not a subset.

The motivation for having this strict of a requirement is only partially clear to me. At the protocol level, the integrity-protected Option-Types are distinct from the unprotected ones and thus it is easy to identify which packets/options to validate or not validate upon receipt. Now, as I mentioned previously there is no integrity protection for which option-types are present in a given packet, so the integrity-protected option-types could be stripped entirely or even replaced with the unprotected analogue (though of course with the quoted MUST only the "stripping" attack is possible, not replacement), and any Validator would need to have advance knowledge of which option-types should be present in order to detect such modification. If the Validator is already going to need that kind of knowledge, it doesn't seem like a huge stretch for it to also know about a set of Option-Types that should and a set that should not be integrity protected, and such a scheme would have the same level of protection against tampering for the integrity-protected Option-Types that the current state of affairs has.
Following this "MUST" directive, we do go on to give some partial description of the problem it is trying to solve (in particular, the subset of issues where a transit node could pose as an encapsulating node) and propose the mitigation of having the Validator know which Namespace-IDs get integrity protection. However, this is both an incomplete description of the problem and an incomplete mitigation -- the Validator would also need to know which Option-Types are supposed to be added by which encapsulating node in order to be able to detect all the different modification scenarios.

Having said all that, it seems that there is some possibility for an even more flexible policy still: if an encapsulating node is applying a known policy to only integrity protect a subset of traffic, adherence to that policy can be statistically verified at the Validator, and similarly statistical analysis can be performed to attempt to detect whether the protected and unprotected data are qualitatively different, as would happen if the IOAM data was tampered with but only in the unprotected packets.

So I think we need both a more thorough description of the problem statement in this space and some more exposition around the justification for imposing the chosen solution (whether it ends up remaining this "all or nothing" requirement or some more flexible option) -- from where I sit we would get more value out of putting IOAM on all packets and only putting integrity protection on some of them (or some of the Option-Types within them), than we would by only putting IOAM on some packets and integrity protecting all of those that we do. The corresponding cost in terms of a weakening of what statements the Validator can make about traffic non-modification may be tolerable.
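(The statistical policy-adherence check suggested above could be as simple as the following sketch. The function name, threshold, and counts are invented; a real deployment would derive them from the encapsulating node's configured sampling policy and an appropriate statistical test:)

```python
def policy_deviates(protected: int, total: int,
                    policy_fraction: float, tolerance: float = 0.05) -> bool:
    """Flag when the observed fraction of integrity-protected packets drifts
    from the encapsulating node's known sampling policy, as would happen if
    an attacker stripped integrity-protected Option-Types from the stream.
    """
    if total == 0:
        return False  # nothing observed yet, nothing to flag
    return abs(protected / total - policy_fraction) > tolerance
```

This only detects systematic stripping, not tampering with individual unprotected packets; that is the statistical (rather than per-packet) guarantee traded away by such a flexible policy.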
### Weaken recommendation to use 64-bit timestamp as nonce

In §7 (in the context of replay protection) we say: "a suggestion would be to put a 64-bit timestamp in the Invocation field of the nonce", which is probably a stronger recommendation than is defensible (see my discussion in the DISCUSS point); I'd go with just "a possibility" here.

## NITS

### In §5.1

I'd s/mandatory requirements/the mandatory requirements/ in the first line.

### What is meaning?

As a general remark, I'd try to avoid saying that (absent some other context/metadata) a field is "meaningless", since there's probably enough lingering context to imply a "well, it's either X or Y but we're not sure" type of scenario. My preference would be for a phrasing like "cannot be reliably interpreted". But that's largely a matter of style.

### In §5.1.2

In the antepenultimate ¶, s/a E2E/an E2E/ -- whether you read it as "end to end" or "ee two ee", it's still a vowel sound and needs the 'n'.

## Notes

This review is formatted in the "IETF Comments" Markdown format, see https://github.com/mnot/ietf-comments.