Draft minutes for Endpoint Congestion Management WG meeting
Monday, July 31, IETF 48, Pittsburgh
Notes by Aaron Falk, with editing by Vern Paxson.

The chair began the meeting with an overview of the WG's deliverables. The congestion control principles document is complete and in the RFC pipeline. The abstract CM API has completed WG last call. However, it appears that few people have read the document, leading the chair to wonder whether there has been sufficient WG review to constitute consensus. A third deliverable is a document specifying correct behavior of a congestion controller. The chair proposed that the API document contains sufficient discussion and pointers to previous documents to serve as this deliverable. No comments were heard either in support of or counter to this proposal.

Hari Balakrishnan then led an extensive overview of draft-ietf-ecm-cm-01.txt, using viewgraphs available from http://nms.lcs.mit.edu. Summary of the draft: the desire is to integrate CM across all applications -- not just TCP-based ones. The CM sits just above the IP layer and exposes an API that allows applications to get information about the state of the network.

One issue that needs more discussion: what should the granularity of a macroflow be? This was discussed at the Nov. 99 IETF. The default is to aggregate all streams to a given address. The grouping and ungrouping API allows this to be changed by an application program.

Suggestion from the floor: why not let the application (via cm_update) tell the CM whether or not it is getting receiver feedback, rather than using terms for reporting loss like PERSISTENT and TRANSIENT, which allow too much ambiguity? Suggestion: give applications simple and unambiguous signals, e.g., receiving feedback; no feedback; non-congestion-related loss.

Vern asked why the grant time is expressed in terms of RTT rather than RTO. Hari replied that RTO would not be appropriate because it is possible to build a TCP-friendly application without a notion of RTO.

Hari proposed removing the notion of rttdev from the cm_query() call. Joe Touch suggested that any mechanism collecting data and reporting aggregate values should be calculating deviations to give a "credibility weighting" to reported values (i.e., to distinguish wildly varying values from stable ones). Joe also questioned the utility of reporting a rate if there is no information given as to the interval over which the rate is computed. Hari clarified that rates are computed over a small time window (i.e., one RTT). Two attendees suggested that a jitter measurement would be useful in determining the aggregate jitter for things like RTP streams; this would allow applications to make adjustments using more information than just their individual RTP jitter measurements. Mark Handley suggested keeping rttdev to give new TCPs access to the information when they start. Vern asked whether it should be defined the way TCP currently computes it (including using a deviation rather than a variance). Joe suggested that good statistics for the CM to report would be those pertaining to a group/macroflow rather than a single stream.

Vern asked whether part of the scheduler API will define the scheduling discipline. The answer was yes. Joe thought that an application should be able to use a single call to cm_get/setmacroflow() to forward some data to the scheduler, rather than requiring two calls from the app, one to the CM and one to the scheduler. He further emphasized that we really need a scheduler API in order to evaluate the completeness and adequacy of the CM API.
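A minimal sketch of what the single-call suggestion might look like, assuming a hypothetical signature in which cm_setmacroflow() carries an opaque, scheduler-specific parameter block; the opaque argument and parameter names are illustrative only and do not appear in draft-ietf-ecm-cm-01.txt:

    #include <stddef.h>   /* for size_t */

    /* Hypothetical single-call form: move a CM stream into a macroflow and,
     * in the same call, hand the scheduler an opaque block of scheduler-
     * specific data that the CM passes through without interpreting it. */
    int cm_setmacroflow(int cm_streamid,
                        int macroflow_id,
                        const void *sched_data,   /* opaque scheduler data */
                        size_t sched_datalen);    /* length of sched_data  */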
cm_get/setmacroflow() depends significantly on the choice of scheduler, and it is not possible to evaluate the proposal without understanding the scheduler better. Vern stated that the concept was to nail down the CM part now to enable experimentation with schedulers. This is appropriate for Proposed Standard documents -- there is plenty of leeway to change them, including recycling them at Proposed.

Unattributed question: what will be the incentive for applications to use CM? Hari replied that they will attain better performance in situations such as slow start. Vern added that in the future the IETF may require new protocols to use CM for congestion management rather than inventing their own.

Hari then raised a pending issue regarding temporarily overriding cwnd restrictions. Suppose a TCP loses a packet due to congestion. The sender calls cm_update(), which causes the CM to cut the window. Now the outstanding data exceeds cwnd. So what happens to the retransmission? How does it manage to go out? One solution (a hack): add a priority parameter to cm_request(), perhaps with the restriction that an application can request at most one high-priority packet per RTT. Tim Shepard thought that solution was okay, but prefers FACK and rate-halving (which is more aggressive but allows you to keep sending packets). Hari agreed that if you use FACK this isn't an issue, but we don't want to restrict implementors to a TCP-style congestion controller. Tim mentioned that another alternative would be to not change TCP, and only use CM for other apps. Sally Floyd asked for clarification on what a TCP sender tells the CM when it receives dupacks. Answer: any dupack is treated as feedback that packets have left the pipe.

Tim was also concerned with the default policy of grouping TCP connections together based on the same src/dst IP addresses. NAT boxes may mask a lot of complexity behind a single IP destination address. Vern commented that this is a key issue and we are hoping to crystallize an IRTF effort to look at this. He also pointed out that the assumption of sharing network path properties based on a common destination address is already in use today, in route caches that include ssthresh and rtt/rttvar information. Matt Mathis stated that if there's a NAT box the behavior will still be safe from a network stability perspective, though traffic may be slowed down unnecessarily. Joe mentioned that a slow link and a fast link behind a NAT will result in two connections running at the average of the two rates. Matt countered that this will still result in behavior that is safe, because loss incurred on the slow link slows down the entire aggregate. There was further discussion about possibly pathological situations in which the slow link could in fact be overwhelmed; it was not clear how plausible these scenarios are. Sally mentioned that the CM could recognize when two connections sharing congestion control state have vastly different behaviors (RTT, loss rate), and could move one of the connections to a different macroflow.

Vern then asked about a scenario in which the sender transmits 10 packets with TCP, and they all get lost, incurring a timeout. When does the CM learn that nothing got through, so that it can adjust its notion of how much data is outstanding? Answer: the app tells the CM that it sent 10 packets and, based on the feedback it received (i.e., only implicit feedback due to a timeout), none were received.
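As a hedged illustration of this exchange, the sketch below shows how the timeout report might look if the draft adopted the unambiguous feedback signals suggested earlier. CM_LOST_FEEDBACK is the renaming floated later in the meeting for CM_PERSISTENT; the other names, the signature, and the parameters are assumptions made here for illustration, not the draft's actual API.

    /* Illustrative feedback signals; only CM_LOST_FEEDBACK is a name that
     * came up in the meeting, the others are placeholders. */
    typedef enum {
        CM_FEEDBACK_OK,          /* receiver feedback (ACKs/dupacks) is arriving */
        CM_LOST_FEEDBACK,        /* no feedback at all, e.g., a retransmission timeout */
        CM_NONCONGESTION_LOSS    /* loss known to be unrelated to congestion */
    } cm_feedback_t;

    /* Hypothetical reporting call: of 'nsent' packets covered by this report,
     * 'nrecd' are known to have been received, 'feedback' says how we know,
     * and 'rtt_usec' carries the latest RTT sample (negative if none). */
    void cm_update(int cm_streamid, int nsent, int nrecd,
                   cm_feedback_t feedback, long rtt_usec);

    /* Vern's scenario: 10 packets sent, a timeout fires, none were received,
     * and the only "feedback" was the timer expiring. */
    void report_timeout_example(int cm_streamid)
    {
        cm_update(cm_streamid, 10, 0, CM_LOST_FEEDBACK, -1);
    }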
Joe Touch then raised the question of whether a more complex cm_request() interface is needed, one that can issue a request to send multiple packets. The context in which this comes up is attempting to run a connection very fast, say at 10 Gbps; in this case, a function call is as bad as a kernel crossing, and there should not be a correlation between the number of packets to send and the number of function calls. Matt countered that people who are running at that kind of rate are not going to want to do this kind of congestion aggregation anyway. Vern pointed out that we are defining an abstract API and not a concrete API, so a key question is to what degree this change would affect the abstract API we're documenting. Joe said we would need to delete the wording that the app must call cm_request() per packet. But Hari argued for keeping one call per packet or per MTU, in order to prevent bursts of packets from being sent that lock out other users. Joe thinks this overhead is excessive. Hari suggested we could add something about back-to-back bursts, and asked Sally whether the congestion control principles document includes discussion of bursts. Answer: no, that document isn't meant to be a set of specific mechanisms. Her view is that additional congestion control mechanisms can be defined, and should perhaps be vetted by IETF process, either in ECM or TSV.

The chair then made the following proposal for moving forward with ecm-cm-01:

1. Add opaque (scheduler) data to the API when creating a macroflow.
2. Add a specification of how to compute the RTT variation.
3. Change CM_PERSISTENT to CM_LOST_FEEDBACK, etc.
4. Add a comment to the document that some key experience we don't yet have is with scheduling APIs, and that this may lead to possible changes in the CM API.
5. Resolve the issue Joe raised regarding sending multiple packets.

After addressing these, there would be one more WG Last Call. The chair then asked for a show of WG consensus for this plan, with the understanding that the chair would interpret consensus for the plan, followed by a successful last call, as consensus for the document. A show of hands revealed good consensus and no opposition.

The meeting finished with a brief discussion of the Informational document the WG is tasked to produce, giving example(s) of implementing one or more CM schedulers. The chair asked for volunteers to begin an outline of the document, write up a particular scheduling policy, or serve as editor, noting that the document is already overdue and the chair hopes to expedite it. There were no public volunteers, however.
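Regarding item 5 of the proposal above, one hedged sketch of the batched request Joe described, assuming a hypothetical variant of cm_request() whose name, signature, and grant semantics are illustrative only and would have to be settled by the WG:

    /* Hypothetical batched form of cm_request(): ask for permission to send
     * up to 'npkts' MTU-sized packets in one call, cutting per-packet call
     * overhead at very high rates.  The CM grants some number immediately
     * (possibly zero), capping the grant to avoid back-to-back bursts, and,
     * as with the per-packet call, notifies the application via its callback
     * when more may be sent.  Returns the number of packets granted now. */
    int cm_request_batch(int cm_streamid, int npkts);

Whether grants are returned synchronously or only via the existing callback, and how large a burst the CM should ever grant at once, are exactly the details item 5 leaves open.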