Be ye not afraid -- I have reviewed this document as part of the operations directorate's ongoing effort to review all IETF documents being processed by the IESG. These comments were written primarily for the benefit of the operations area directors. Document editors and WG chairs should treat these comments just like any other last call comments.

Version reviewed: draft-ietf-conex-tcp-modifications-09

Summary: I am not sure that this document is ready, and believe that review is needed from the Ops ADs. I believe that interactions with TCP feedback need to be performed very carefully, and I do not have the knowledge to adequately evaluate these; hopefully others with this experience (Transport ADs?) have also evaluated the document. The document needs significant readability cleanup and nit fixing.

Detail:
I found the document difficult to read -- there are a large number of typos, areas with lack of precision, ambiguities and grammar issues. This made a full evaluation difficult. I went to read the 'Document Quality' section of the Shepherd report to see if this was noted, but it was simply boilerplate.

The document is marked as Experimental. I would be more concerned if it were a different stream...

Open questions:
Section 7 Open Areas for Experimentation;
"This decision was taken because most network devices today expirience byte-congestion where the memory is filled exactly with the number of bytes a packet carries. However, there are also devices that may allocate a certain amount of memory per packet, no matter how larger a packet is."

Citation needed; many devices I am familiar with split packets into multiple fixed size cells, e.g., Juniper Q5 chips split packets into 96, 112, 128, 144, 160, or 176 byte cells and spray these across the fabric. I believe the Cisco CRS (and similar architectures) perform similar cell splitting. RFC7141 did not seem to contain this, but I may have missed it.

In my opinion the document should discuss monitoring and reporting.
What parameters should implementations expose to the user / operator?

I think that the document should also discuss how this technique does or does not suffer from the potential of global synchronization (although this may be covered in other ConEx documents that I am not familiar with).

The Security Considerations and Open Areas for Experimentation sections sort of discuss coexistence / deployment, but not in much depth.

Nits are readability issues - in a number of cases I was not entirely sure what was intended, so my suggestions may not be correct.

   Whereas loss has to be minimized, ECN can provide more fine-grained
[O] Whereas loss has to be minimized,
[P] Do you mean While, not whereas? I'm not sure how to parse this.
   feedback information. ConEx-based traffic measurement or management mechanisms could benefit from this.
...
3. Counting congestion
...
   The outstanding bytes counted based on ECN feedback information are maintained in the congestion exposure gauge (CEG), as explained in Section 3.2. When the sender sends a ConEx capable packet with the E or L flag set
[O] When the sender sends a ConEx capable packet with the E or L flag set
[P] When the sender sends a ConEx capable packet with the E or L flag set,
[R] punctuation
   it reduces the respective counter by the byte-size of the packet. This is explained for both counters in Section 4.1. Note that all bytes of an IP packet must be counted in the LEG or CEG to capture the right number of bytes that should be marked. Therefore the sender SHOULD take the payload and headers into account, up to and including the IP header. However, in TCP the information how large the headers of an lost or marked pacekt were is
[O] information how large the headers of an lost or marked pacekt
[P] information regarding how large the headers of a lost or marked packet
[R] readability, grammar, and typo (three changes)
   usually not available, as only payload data will be acknowledged.
   If equal-sized packets, or at least equally distributed packet sizes
[O] or at least equally distributed packet sizes
[P] or at least equally distributed packet sizes,
[R] grammar
   can be assumed, the sender MAY only add and subtract TCP payload bytes.
...
3.1. Loss Detection
   This section applies whether or not SACK support is available. The following subsection in addition handles the case when SACK is not
[O] The following subsection in addition handles
[P] The following subsection (3.1.1) handles
[R] Clarity
   available.

   A TCP sender detects losses and subsequently retransmits the lost data. Therefore, ConEx sender can simply set the ConEx L flag on all retransmissions in order to at least cover the amount of bytes lost. If this aprroach is taken, no LEG is needed.
[O] aprroach
[P] approach
[R] spelling
   However, any retransmission may be spurious. In this case more bytes have been marked than necessary. To compensate this effect a ConEx
[O] To compensate this effect
[P] To compensate for this effect,
[R] grammar
   sender can maintain a local signed counter, the (LEG), that indicats
[O] indicats
[P] indicates
[R] spelling
   the number of outstanding bytes to be sent with the ConEx L flag and also can become negative.

   Using the LEG, when a TCP sender decides that a data segment needs to be retransmitted, it will increase LEG by the size of the TCP payload bytes in the retransmission (assuming equal sized segments such that the retransmitted packet will have the same number of header bytes as the original ones):

      For each retransmision:
[O] retransmision
[P] retransmission
[R] spelling
         LEG += payload

   Note, how the LEG is reduced when the ConEx L marking are set is described in section Section 4.

   Further to accommodate spurious restransmissions, a ConEx sender
[O] restransmissions
[P] retransmissions
[R] spelling
   SHOULD make use of heuristics to detect such spurious retransmissions (e.g.
   F-RTO [RFC5682], DSACK [RFC3708], and Eifel [RFC3522], [RFC4015]) if are already available in a given implementation. If no
[O] if are already available in a given implementation.
[P] if already available in a given implementation.
[R] grammar
   mechanism for detecting spurious retransmissions is available, the ConEx sender MAY chose to implement one of the mechanism stated above. However, given the inaccuracy that ConEx may have anyway and the timeliness of ConEx information, a ConEx MAY also chose to not componsate for spurious retransmission. In this case if spurious
[O] componsate
[P] compensate
[R] spelling
   retransmissions occur, the ConEx sender simple has sent too much
[O] retransmissions occur, the ConEx sender simple has sent too much
[P] retransmissions occur, the ConEx sender simply has sent too many
[R] grammar x 2
   ConEx signals which e.g would decrease the congestion allowance in a ConEx policer unnecessary.
[O] in a ConEx policer unnecessary.
[R] ? Cannot parse.
   If a heuristic to detect spurious retransmission is used and has
[O] If a heuristic to detect spurious retransmission is used
[P] If a heuristic method is used to detect spurious retransmission
[R] readability, and was missing a noun after heuristic. Could use analysis instead.
   determined that a certain number of packets were retransmitted erroneously, the ConEx sender subtracts the payload size of these TCP packets from LEG.

      If a spurious reransmission is detected:
[O] reransmission
[P] retransmission
[R] spelling
         LEG -= payload

   Note that the LEG can get negative, if too many L marking have
[O] Note that the LEG can get negative
[P] Note that LEG can become negative
[R] clarity
   already been sent. This case is further discussed in section Section 6.

3.1.1. Without SACK Support
   If multiple losses occur within one RTT and SACK is not used, it may take several RTTs until all lost data is retransmitted.
   With the scheme described above, the ConEx information will be delayed considerably, but timeliness is important for ConEx. However, for ConEx it is not important to know which data got lost but only how much. During the first RTT after the initial loss detection, the
[O] However, for ConEx it is not important to know which data got lost but only how much.
[P] For ConEx, it is important to know how much data was lost; it is not important to know what data is lost.
[R] readability
   amount of received data and thus also the amount of lost data can be estimated based on the number of received ACKs.
...
3.2. ECN
...
   DeliveredData covers the number of bytes that has been newly delivered to the receiver. Therefore on each arrival of an ACK, DeliveredData will be increased by the newly acknowledged bytes (acked_bytes) as indicated by the current ACK, relative to all past ACKs. The formula depends on whether SACK is available: if SACK is not avaialble SACK_diff is always zero, whereas is ACK information is
[O] avaialble
[P] available
[R] spelling
   available is_dup and is_after_dup are always zero.

   With SACK, DeliveredData is increased by the number of bytes provided by (new) SACK information (SACK_diff). Note, if less unacknowledged bytes are announced in the new SACK information than in the previous ACK, SACK_diff can be negative. In this case, data is newly acknowledged (in acked_bytes), that has previously already been accumulated into DeliveredData based on SACK information. Otherwise without SACK, DeliveredData is increased by 1 SMSS on duplicate acknowledgements as duplicate acknowledgements do not
[O] as duplicate acknowledgements
[P] because duplicate acknowledgements
[R] grammar. Since would also be fine, instead of as.
   acknowlegde any new data (and acked_bytes will be zero).
   For the
[O] acknowlegde
[P] acknowledge
[R] spelling
   subsequent partial or full ACK, acked_bytes cover all newly acknowledged bytes including the ones that where already accounted which the receiption of any duplicate acknowledgement. Therefore
[O] the ones that where already accounted which the receiption of any duplicate acknowledgement
[P] those already accounted for with the receipt of any duplicate acknowledgement.
[R] spelling, grammar, inability to parse the original. I *think* this is what was meant.
   DeliveredData is reduced by one SMSS for each preceding duplicate ACK. Consequently, is_dup is one if the current ACK is a duplicated ACK without SACK, and zero otherwise. is_after_dup is only one for the next full or partial ACK after a number of duplicated ACKs without SACK and num_dup counts the number of duplicated ACKs in a row (which usually is 3 or more).

   With classic ECN, one congestion marked packet causes continuous congestion feedback for a whole round trip, thus hiding the arrival of any further congestion marked packets during that round trip. A more accurate ECN feedback scheme (AccECN) is needed to ensure that feedback properly reflects the extent of congestion marking. The two cases, with and without a receiver capable of AccECN, are discussed in the following sections.

3.2.1. Accurate ECN feedback
   With a more accurate ECN feedback scheme (AccECN) that is supported by the receiver either the number of marked packets or the number of
[O] by the receiver either the number
[P] by the receiver, either the number
[R] grammar/punctuation
   marked bytes will be feed back from the receiver to the sender and is
[O] feed back
[P] fed back
[R] spelling
   therefore know at sender-side. In the latter case the CEG can
[O] therefore know at sender-side. In the latter case the CEG can
[P] therefore know at sender-side. In the latter case, the CEG can
[R] grammar/punctuation
   directly be increased by the number of marked bytes.
   Otherwise if D is assumed to be the number of marks, the gauge (CEG) will be conservatively increased by one SMSS for each marking or at max the number of newly acknowledged bytes:

      CEG += min(SMSS*D, DeliveredData)

3.2.2. Classic ECN support
...
   To extract more than one ECE indication per RTT, a ConEx sender could set the CWR flag continuously to force the receiver to signal only one ECE per CE mark. Unfortunately, the use of delayed ACKs [RFC5681] (which is common) will prevent feedback of every CE mark; if a CWR confirmation is received before the ECE can be sent out on the next ACK, ECN feedback information could get lost (depeding on
[O] depeding
[P] depending
[R] spelling
   the actual receiver implementation). Thus a sender SHOULD set CWR only on those data segments that will presumably trigger a (delayed ACK. The sender would need an additional control loop to estimated
[O] to estimated
[P] to estimate
[R] grammar
   which data segments will trigger an ACK in order to extract more timely congestion notifications. Still the CEG SHOULD be increased
[O] Still the CEG SHOULD be increased
[P] Still, the CEG SHOULD be increased
[R] readability
   by DeliveredData, as one or more CE marked packets could be acknowledged by one delayed ACK.

4. Setting the ConEx Flags
   By setting the X flag, a packet is marked as ConEx-capable. All packets carrying payload MUST be marked with the X flag set, including retransmissions. Only if no congestion feedback information is (currently) available, the X flag SHOULD be zero, such as for control packets on a connection that has not sent any (user) data for some time e.g., sending only pure ACKs which are not carrying any payload.
[O] Only if no congestion feedback information is (currently) available, the X flag SHOULD be zero, such as for control packets on a connection that has not sent any (user) data for some time e.g., sending only pure ACKs which are not carrying any payload.
[P] The X flag SHOULD be zero only if no congestion feedback information is (currently) available (e.g. for control packets on a connection that has not sent any user data for some time and is sending only pure ACKs that are not carrying any payload).
[R] grammar and readability

4.1. Setting the E or the L Flag
   As described in section Section 3.1, the sender needs to maintain a CEG counter and might maintain a LEG counter. If no LEG is used, all retransmission will be marked with the L flag. Further, as long as the LEG or CEG counter is positive, the sender marks each ConEx-capable packet with L or E respectively, and decreases the LEG or CEG counter by the TCP payload bytes carried in the marked packet (assuming headers are not being counted because packet sizes are regular). No matter how small the value of LEG or CEG, if it is positive, the sender MUST NOT defer packet marking to ensure ConEx signals are timely. Therefore the value of LEG and CEG
[O] No matter how small the value of LEG or CEG, if it is positive, the sender MUST NOT defer packet marking to ensure ConEx signals are timely.
[P] No matter how small the value of LEG or CEG, if the value is positive the sender MUST NOT defer packet marking; this ensures ConEx signals are timely.
[R] readability.
   will commonly be negative.

   If both LEG and CEG are positive, the sender MUST mark each ConEx-capable packet with both L and E. If a credit signal is also pending (see next section), the C flag can be set as well.

4.2. Setting the Credit Flag
...
   Recall that CSC will be decreased whenever congestion occurs, therefore CSC will need to be replenished as soon as CSC drops below
[O] Recall that CSC will be decreased whenever congestion occurs, therefore CSC will need
[P] CSC will be decreased whenever congestion occurs; therefore, CSC will need
[R] grammar
   F. Also recall that the sender can set the C flag on a ConEx-capable packet whether or not the E or L flags are also set.
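
For readers trying to follow the gauge bookkeeping quoted from Sections 3.1, 3.2.1 and 4.1, the rules can be sketched roughly as below. This is only an illustration of how I read the quoted text, not the draft's reference behaviour: the class name, method names and the use of a Python set for the flags are my own assumptions, header bytes are ignored (the draft's equal-packet-size assumption), and the credit (C) flag and CSC are left out entirely.

```python
from dataclasses import dataclass


@dataclass
class ConExGauges:
    """Hypothetical sender-side state; only the names LEG and CEG come from the draft."""
    leg: int = 0  # loss exposure gauge, signed, counted in TCP payload bytes
    ceg: int = 0  # congestion exposure gauge, counted in TCP payload bytes

    def on_retransmission(self, payload: int) -> None:
        # Section 3.1: "For each retransmission: LEG += payload"
        self.leg += payload

    def on_spurious_retransmission(self, payload: int) -> None:
        # Section 3.1: "LEG -= payload"; the gauge is allowed to go negative
        self.leg -= payload

    def on_accecn_feedback(self, marks: int, delivered_data: int, smss: int) -> None:
        # Section 3.2.1: CEG += min(SMSS*D, DeliveredData), with D = number of marks
        self.ceg += min(smss * marks, delivered_data)

    def flags_for_packet(self, payload: int) -> set:
        """Pick ConEx flags for one outgoing packet that carries payload."""
        flags = {"X"}  # Section 4: packets carrying payload are marked ConEx-capable
        if self.leg > 0:
            # Section 4.1: mark without deferring, then draw the gauge down
            flags.add("L")
            self.leg -= payload
        if self.ceg > 0:
            flags.add("E")
            self.ceg -= payload
        return flags
```

With one 1000-byte retransmission and AccECN feedback of two marks over 1500 newly delivered bytes (SMSS 1000), the first outgoing 1000-byte packet carries X, L and E, the second carries X and E and leaves CEG at -500, which matches the draft's remark that the gauges will commonly be negative.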
   In TCP Slow Start, the congestion window might grow much larger than during the rest of the transmission. Likely, a sender could consider sending fewer than F credits but risking being penalized by an audit function. However, the credits should at least cover the increase in sending rate. Given the exponential increase as implemented in the TCP Slow Start algorithm which means that the sending rate doubles every RTT, a ConEx sender should at least cover half the number of packets in flight by credits.

   Note that the number of losses or markings within one RTT does not solely depend on the sender's actions. In general, the behavior of the cross traffic, whether active queue management (AQM) is used and how it is parameterized influence how many packets might be dropped or marked. As long as any AQM encountered is not overly aggressive with ECN marking, sending half the flight size as credits should be sufficient whether congestion is signaled by loss or ECN.

   To maintain half of the packet in flight as credits, of course half
[O] To maintain half of the packet in flight as credits, of course half
[P] consider removing "of course" -- otherwise, put commas around it.
[R] readability/grammar
   of the packet of the initial window must be C marked. In Slow Start marking every fourth packet introduces the correct amount of credit as can be seen in Figure 1.
...
5. Loss of ConEx information
   Packets carrying ConEx signals could be discarded themselves. This will be a second order problem (e.g. if the loss probability is 0.1%, the probability of losing a ConEx L signal will be 0.1% of 0.1% = 0.01%). Further, the penality an audit induces should be propotional
[O] Further, the penality an audit induces should be propotional
[P] Further, the penalty an audit induces should be proportionate
[R] spelling x 2
   to the mismatch of expected ConEx marks and observed congestion, therefore the audit might only slightly increase the loss level of this flow.
   Therefore, an implementer MAY choose to ignore this problem, accepting instead the risk that an audit function might wrongly penalize a flow. Nonetheless, a ConEx sender is responsible to always signal
[O] responsible to always signal
[P] responsible for always signalling
[R] grammar
   sufficient congestion feedback and therefore SHOULD remember which packet was marked with either the L, the E or the C flag. If one of these packets is detected as lost, the sender SHOULD increase the respective gauge(s), LEG or CEG, by the number of lost payload bytes in addition to increasing LEG for the loss.

6. Timeliness of the ConEx Signals
   ConEx signals will only be useful to a network node within a time delay of about one RTT after the congestion occurred. To avoid further delays, a ConEx sender SHOULD send the ConEx signaling on the next available packet.

   Any or all of the ConEx flags can be used in the same packet, which allows delay to be minimised when multiple signals are pending. The need to set multiple ConEx flags at the same time, can occur if e.g
[O] at the same time, can occur
[P] at the same time can occur
[R] unnecessary comma; grammar
   an ACK is received by the sender that simultaneously indicates that at least one ECN mark was received, and that one or more segements
[O] segements
[P] segments
[R] spelling
   were lost. This may e.g. happen during excessive congestion, where
[O] may e.g. happen during excessive congestion, where
[P] may happen during excessive congestion, if
[R] readability
   the queues overflow even though ECN was used and currently all forwarded packets are marked, while others have to be dropped nevertheless. Another case when this might happen is when ACKs are
[O] nevertheless.
[P] [delete "nevertheless"]
[R] readability
   lost, so that a subsequent ACK carries summary information not previously available to the sender.

   If a flow becomes application-limited, there could be insufficient bytes to send to reduce the gauges to zero or below.
   In such cases, the sender cannot help but delay ConEx signals. Nonetheless, as long as the sender is marking all outgoing packets, an audit function is unlikely to penalize ConEx-marked packets. Therefore, no matter how long a gauge has been positive, a sender MUST NOT reduce the gauge by more than the ConEx marked bytes it has sent.

   If the CEG or LEG counter is negative, the respective counter MAY be reset to zero within one RTT after it was decreased the last time or one RTT after recovery if no further congestion occurred.

7. Open Areas for Experimentation
   All proposed mechanisms in this document are experimental, and therefore further large-scale experimentation in the Internet is required to evaluate if the signaling provided by these mechanisms is accurate and timely enough to produce value for ConEx-based (traffic management or other) mechanisms.

   The current ConEx specifications assume that congestion is counted in number of bytes (including the IP header that directly encapsulates the CDO and everything that IP header encapsulates) [draft-ietf-conex-destopt]. This decision was taken because most network devices today expirience byte-congestion where the memory is
[O] expirience
[P] experience
[R] spelling
   filled exactly with the number of bytes a packet carries. However, there are also devices that may allocate a certain amount of memory per packet, no matter how larger a packet is. These devices get congested based on the number of packets in their memory and therefore in this case congestion is determined by the number of packets that have been lost or marked. Furthermore, a transport layer endpoint, such as a TCP sender or receiver, might not know the exact number of bytes that a lower layer was carrying. Therefore a TCP endpoint may only be able to estimate the exact number of congested bytes (assuming that all lower layer header have the same length).
   If this estimation is sufficient to work with the ConEx
[O] If this estimation is sufficient to work with the ConEx
[P] If this estimation is sufficient to work with, the ConEx
[R] readability
   signal needs to be further evaluated in tests in the Internet together with different auditor implementations.

   Further, the proposed marking schemes in this document are designed under the assumption that all TCP packets of a ConEx-capable flow are of equal size or that flows have a constant mean packet size over a rather small time frame, like one RTT or less. In most implementations this assumption might be taken as well and probably is true for most of the traffic flows. However, it should be evaluted how much the accuracy degrades if this precondition is not fulfilled, while the proposed scheme is used. Especially evaluating this with real traffic from different application is important to make a decision if the proposed schemes are sufficient or a more complexe scheme is needed.
[O] However, it should be evaluted how much the accuracy degrades if this precondition is not fulfilled, while the proposed scheme is used. Especially evaluating this with real traffic from different application is important to make a decision if the proposed schemes are sufficient or a more complexe scheme is needed.
[P] If this proposed scheme is used, it is necessary to evaluate how much accuracy degrades if this precondition is not met. Evaluating with real traffic from different applications is especially important in making the decision regarding whether the proposed schemes are sufficient or whether a more complex scheme is needed.
[R] grammar, spelling, readability

   In this context the proposed scheme to set credit markings in Slow Start runs a risk to provide an insufficient number of markings which can cause an audit function to penalize this flow.
   Both the proposed credit scheme for Slow Start as well as the scheme in Congestion Avoidance must be evaluated together with one or more specific implementations of an ConEx auditor to ensure that both algorithms, in the sender and in the auditor, work propoerly together with a low
[O] propoerly
[P] properly
[R] spelling
   risk of false positives (which would lead to penalization of an honest sender). However, if a sender is wrongly assumed to cheat, the penalization of the audit should be adequate and should allow an honest sender using a congestion control scheme that is commonly used today to recover quickly.

   Another open issue is the accuracy of the ECN feedback signal. At time of publication of this document there is no AccECN mechanism s pecified yet, and further AccECN will also take some time to be
[O] s pecified yet,
[P] specified yet,
[R] spelling
   widely deployed. This document proposes an advanced compatibility mode for Classic ECN. The proposed mechanism can provide more accurate feedback by utilizing the way Classic ECN is speficed but
[O] speficed
[P] specified
[R] spelling
[O] loosing information
[P] losing information
[R] spelling
   this risk is in a real deployment scenario, further experimental evaluation is needed. The following argument is intended to prove that suppressing repetitions of ECE, however, is still safe against possible congestion collapse due to lost congestion feedback and should be further proven in experimentation:

   Repetition of ECE in classic ECN is intended to ensure reliable delivery of congestion feedback. However, with advanced compatibility mode, it is possible to miss congestion notifications. This can happen in some implementations if delayed acknowledgements are used. Further an ACK containing ECE can simply get lost.
   If
[O] Further an ACK
[P] Further, an ACK
[R] readability
   only a few CE marks are received within one congestion event (e.g., only one), the loss of one acknowledgements due to (heavy) congestion on the reverse path can prevent that any congestion notification is received by the sender.

   However, if loss of feedback exacerbates congestion on the forward path, more forward packets will be CE marked, increasing the likelihood that feedback from at least one CE will get through per RTT. As long as one ECE reaches the sender per RTT, the sender's congestion response will be the same as if CWR were not continuous. The only way that heavy congestion on the forward path could be completely hidden would be if all ACKs on the reverse path were lost. If total ACK loss persisted, the sender would time out and do a congestion response anyway. Therefore, the problem seems confined to potential suppression of a congestion response during light congestion.

   Anyway, even if loss of all ECN feedback leads to no congestion
[O] Anyway
[P] Furthermore
[R] tone
   response, the worst that could happen would be loss instead of ECN-signaled congestion on the forward path. Given compatibility mode does not affect loss feedback, there would be no risk of congestion collapse.

10. Security Considerations
...
   However, if the receiver is solely interested in making the sender draw down its allowance, the net effect will depend on the sender's congestion control algorithm as permanetly adding more and more
[O] permanetly
[P] permanently

--
I don't think the execution is relevant when it was obviously a bad idea in the first place.
This is like putting rabid weasels in your pants, and later expressing regret at having chosen those particular rabid weasels and that pair of pants.
   ---maf