01 · User · Initial request

Please generate sessions.fxt.xml from bigFlows.pcap. This file represents our typical CI/CD runs.

02 · Claude · FXT for the CI/CD-relevant subset

analyze_capture detects 20 protocols. For a CI/CD framing I narrowed to the five that matter for build, deploy and registry traffic: HTTP, TLS/HTTPS, SSH, DNS, DHCP. Each gets its own session-type in sessions.fxt.xml with:

  • multi-match="true" at the root, so a TCP FIN/RST scoped to an app-layer session-type can close that session without suppressing the app-layer message templates themselves.
  • Right key per protocol: tcp.stream for HTTP / TLS / SSH; dns.id for DNS; dhcp.id for DHCP.
  • Explicit session-result outcomes on every stop template — success::http-2xx, success::http-redirect, failure::http-client-error, failure::http-server-error, failure::tcp-reset, success::tls-close-notify, failure::tls-alert-fatal, success::resolved, failure::dns-error, success::ack, failure::nak, etc.
  • Unscoped TCP-analysis templates at the bottom (retransmission, fast retx, duplicate ACK, zero window) so every TCP session is decorated with health markers without being tied to a single session-type.

validate_fxt against the pcap reports: FXT valid. 5 session type(s), 37 template(s). 5,689 messages matched from 100,000 packets scanned — 6 % coverage is expected because the remaining protocols in the capture (SIP, SNMP, VNC, SMB, sFlow, Syslog) are intentionally out of scope for a CI/CD lens.

FXT downloadable here: bigflows-sessions.fxt.xml.

03 · User · Run extraction

Run extract_sessions with it.

04 · Claude · Session counts: first pass

Format auto-resolved to lazy (per-session PDFs render on-demand from the navigator). Counts:

| Session type | Outcome | Count |
| --- | --- | --- |
| dns-resolve | success::resolved | 1,789 |
| dns-resolve | failure::dns-error | 301 |
| dns-resolve | incomplete::no-close | 17 |
| http-transaction | success::http-2xx | 6,109 |
| http-transaction | success::http-redirect | 1,163 |
| http-transaction | failure::http-client-error | 65 |
| http-transaction | failure::http-server-error | 3 |
| http-transaction | failure::tcp-reset | 27 |
| http-transaction | incomplete::no-close | 119 |
| tls-session | success::tcp-fin | 552 |
| tls-session | failure::tls-alert | 397 |
| tls-session | failure::tcp-reset | 74 |
| tls-session | incomplete::no-close | 421 |
| ssh-session | success::tcp-fin | 22 |
| dhcp-transaction | success::ack | 7 |
| dhcp-transaction | incomplete::no-close | 22 |

The session navigator opens to the per-protocol tree; per-session PDFs render on demand.

05 · User · Rank and drill

Rank the protocols by failure / incomplete rate. For each protocol with a non-trivial failure rate (anything above 5 %), pull one representative failing session and tell me the symptom in one sentence and the root cause in two. End with a one-paragraph executive summary an on-call engineer could paste into a ticket.

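06 · Claude · Failure / incomplete ranking

| Rank | Protocol | Total | Bad | Rate |
| --- | --- | --- | --- | --- |
| 1 | dhcp-transaction | 29 | 22 | 75.9 % |
| 2 | tls-session | 1,444 | 892 | 61.8 % |
| 3 | dns-resolve | 2,107 | 318 | 15.1 % |
| 4 | http-transaction | 7,486 | 214 | 2.9 % |
| 5 | ssh-session | 22 | 0 | 0 % |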
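The ranking can be reproduced from the first-pass outcome counts. The sketch below is a minimal illustration, assuming that every outcome not prefixed with success:: counts as "bad" (failure or incomplete); the dictionary layout and the summarise helper are illustrative names, not part of the FXT tooling.

```python
# Outcome counts copied from the first extract_sessions pass.
counts = {
    "dns-resolve": {
        "success::resolved": 1789,
        "failure::dns-error": 301,
        "incomplete::no-close": 17,
    },
    "http-transaction": {
        "success::http-2xx": 6109,
        "success::http-redirect": 1163,
        "failure::http-client-error": 65,
        "failure::http-server-error": 3,
        "failure::tcp-reset": 27,
        "incomplete::no-close": 119,
    },
    "tls-session": {
        "success::tcp-fin": 552,
        "failure::tls-alert": 397,
        "failure::tcp-reset": 74,
        "incomplete::no-close": 421,
    },
    "ssh-session": {"success::tcp-fin": 22},
    "dhcp-transaction": {"success::ack": 7, "incomplete::no-close": 22},
}


def summarise(outcomes):
    """Return (total, bad, bad rate in percent) for one session type."""
    total = sum(outcomes.values())
    # Assumption: anything that is not success:: is failure or incomplete.
    bad = sum(n for name, n in outcomes.items() if not name.startswith("success::"))
    return total, bad, 100.0 * bad / total


ranking = sorted(
    ((proto, *summarise(outcomes)) for proto, outcomes in counts.items()),
    key=lambda row: row[3],
    reverse=True,
)

for rank, (proto, total, bad, rate) in enumerate(ranking, start=1):
    print(f"{rank}  {proto:<17} {total:>5} {bad:>4} {rate:5.1f} %")
```

Sorting on the rate rather than the raw bad count is what puts DHCP first despite its tiny sample of 29 sessions.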

DHCP — incomplete::no-close (session 0x0446379d, 172.16.133.87)

  • Symptom: client sent a DHCPINFORM broadcast to 255.255.255.255:67 and no server reply (ACK or NAK) is in the capture.
  • Root cause: DHCPINFORM is a one-shot config request that terminates on ACK; the FXT marks sessions complete on ACK / NAK / Release / Decline. Most “incomplete” DHCP rows are unanswered INFORM broadcasts — either no DHCP server services INFORM on this segment, or the unicast reply was outside the capture point.

TLS — failure::tls-alert (session tcp.stream=560, 172.16.133.57 → 96.43.146.48:443)

  • Symptom: a full SSL 3.0 handshake completes, application data flows, then the client sends a TLS Alert and the session closes.
  • Root cause: the alert is sent after Change Cipher Spec, so its payload is encrypted. Without session keys tshark can show only that the record is an Alert, not its level or description, so the FXT cannot distinguish a benign close_notify from a fatal alert and conservatively classifies it as failure. The bulk of the 397 failure::tls-alert sessions are almost certainly normal encrypted close_notifys, not real handshake failures; the genuine handshake-time signal failure::tls-alert-fatal is 0.

DNS — failure::dns-error (session 0x8df0, 172.16.133.6 → 8.8.8.8)

  • Symptom: PTR query for 45.66.120.96.in-addr.arpa returns rcode 2 (SERVFAIL) from 8.8.8.8 after ~260 ms.
  • Root cause: 8.8.8.8 could not obtain an answer for the reverse zone covering 96.120.66.45, most likely because the zone's authoritative nameservers are unreachable or misdelegated, so it returns SERVFAIL; a reachable zone that simply lacked the PTR record would normally answer NXDOMAIN (rcode 3) instead. The 301 DNS errors are dominated by reverse lookups on third-party IPs: not a local resolver problem, just broken upstream reverse zones.
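The mapping between the queried PTR name and the IP address can be checked with Python's standard ipaddress module. This is a small sketch; the RCODE table lists only the three standard response codes relevant here.

```python
import ipaddress

# The address named in the root-cause note above.
addr = ipaddress.ip_address("96.120.66.45")

# reverse_pointer builds the in-addr.arpa name used for the PTR query.
ptr_name = addr.reverse_pointer
print(ptr_name)

# Standard DNS response codes relevant to this session.
RCODE = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}
print(RCODE[2])
```

The octets are reversed because reverse DNS delegates from the most significant octet downwards, which is why the query name starts with 45.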

Exec summary (ticket-ready). CI/CD traffic analysis of bigFlows.pcap shows three protocols above the 5 % failure / incomplete threshold, but none of them points to a pipeline failure. DHCP (75.9 %) is misleading: 22 of 29 sessions never closed, most of them unanswered DHCPINFORM broadcasts that never complete by design; investigate only if agents depend on INFORM responses. TLS (61.8 %) is largely a measurement artefact: 397 of 892 “failures” are encrypted post-handshake alerts (most are normal close_notify) that tshark cannot decrypt without keys; only 74 hard TCP resets and 0 fatal handshake alerts indicate genuine connectivity problems. DNS (15.1 %) shows 301 SERVFAILs concentrated on PTR / reverse lookups against third-party IPs through 8.8.8.8, which is an upstream zone-config issue, not a pipeline failure. HTTP (2.9 %) and SSH (0 %) are healthy. Action: before treating TLS/DHCP as incidents, refine the FXT (decryption keylog for TLS, or scope DHCPINFORM as a non-failure outcome); the only real signal worth investigating is the cluster of 27 HTTP TCP-resets and 65 4xx responses.

07 · User · Drill into HTTP failures

Let us drill into the HTTP failures (2.9 %). What kind of failures are we seeing at the HTTP and TCP levels? Are we able to correlate errors across layers?

08 · Claude · HTTP × TCP cross-layer correlation

Failure mix (214 / 7,486 = 2.9 %)

| Category | Count | Layer |
| --- | --- | --- |
| 4xx client error (404 ≈ 55, 403 ≈ 7, 400 ≈ 2) | 65 | HTTP |
| 5xx server error (one each: 500, 503, 504) | 3 | HTTP |
| TCP RST mid-transaction | 27 | TCP |
| Incomplete (request, no response) | 119 | both |

HTTP layer

  • 404s dominate (~85 % of 4xx) — scattered across many hosts (content.plymedia.com, i.ytimg.com, ad servers, monster.com). Content-not-found errors from third-party endpoints; noise from browser / app traffic, not pipeline failures.
  • 403s (~7) — auth / permission denials, again third-party.
  • 5xx are rare and uncorrelated — three different upstreams: content.plymedia.com (500 Internal), sametime.lotus.com (504 Gateway Timeout on a POST), jdn.monster.com (503 Service Unavailable). No shared host.

TCP layer

  • 27 RSTs, several arriving in tight clusters on adjacent tcp.stream IDs to the same destination: streams 17970/17971 → 173.194.43.36 (i.ytimg.com), two YouTube thumbnail GETs reset by the client ~70 ms in. Streams 8306–8313: 8 consecutive resets within a few hundred ms. Streams 5368–5372: similar burst.
  • RST direction is mixed — some server-originated (immediate, before response: stream 132 to pandora.com, RST 254 µs after GET), some client-originated (stream 17970, client RSTs after no response in 70 ms).
  • No retransmissions / zero-windows appear in the failure NDJSONs — the unscoped TCP-analysis templates fired during render but did not co-occur within failing HTTP sessions. The RSTs are not congestion-driven; they are abrupt aborts.

Cross-layer correlation — yes, by tcp.stream

The FXT structure makes correlation cheap: every HTTP transaction and the TCP RST that ends it share the same flow.key = http-transaction::<tcp.stream>. Three patterns are visible:

  1. HTTP request → TCP RST in same session (the 27 RST failures): request line and RST appear in the same NDJSON file with the same flow.key. Stream 132 (pandora.com) shows GET at frame 889, RST at frame 891 — 254 µs gap, server-side connection refusal.
  2. HTTP 5xx → graceful FIN (no RST): all 3 5xx sessions end normally at TCP — the server sent the error and closed cleanly. No TCP-layer signal would have caught these.
  3. Burst RSTs to one host: streams 17970/17971 to 173.194.43.36 (both YouTube thumbnails, same client, same second) — suggests a per-destination throttle or blocklist, not a generic network event.
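Grouping records by flow key is what makes the first two patterns easy to detect. The sketch below is illustrative only: the NDJSON field names (flow_key, frame, event) and the event labels are assumptions, not the real extract_sessions schema, and stream 9001 is synthetic. Only the frame numbers for stream 132 come from the analysis above.

```python
import json
from collections import defaultdict

# Hypothetical NDJSON records. Field names and event labels are assumed
# for illustration; stream 9001 is synthetic sample data.
ndjson = """\
{"flow_key": "http-transaction::132", "frame": 889, "event": "http-request"}
{"flow_key": "http-transaction::132", "frame": 891, "event": "tcp-rst"}
{"flow_key": "http-transaction::9001", "frame": 10, "event": "http-request"}
{"flow_key": "http-transaction::9001", "frame": 12, "event": "http-5xx"}
{"flow_key": "http-transaction::9001", "frame": 13, "event": "tcp-fin"}
"""

# Bucket every record under its flow key so both layers sit side by side.
sessions = defaultdict(list)
for line in ndjson.splitlines():
    record = json.loads(line)
    sessions[record["flow_key"]].append(record)


def pattern(records):
    """Name the cross-layer pattern for one session's records."""
    events = [r["event"] for r in sorted(records, key=lambda r: r["frame"])]
    if "http-request" in events and "tcp-rst" in events:
        if events.index("http-request") < events.index("tcp-rst"):
            return "request-then-rst"
    if "http-5xx" in events and events[-1] == "tcp-fin":
        return "5xx-then-fin"
    return "other"


labels = {key: pattern(records) for key, records in sessions.items()}
for key, label in labels.items():
    print(key, label)
```

Because the key already encodes the session type and the tcp.stream, no join across files is needed; a single pass over the records is enough.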

09 · User · Thresholds: p95 + 60 s idle

Treat sessions whose duration exceeds the 95th percentile as “late,” and treat sessions with more than 1 minute of inactivity as “timed out.” Update sessions.fxt.xml with appropriate max-duration and idle-period values to reflect these thresholds, then run extract_sessions against the updated FXT file.

10 · Claude · Apply p95 + idle-timeout=60

Computed the p95 of observed session durations per type from the first run's NDJSONs, then set max-duration to that p95 rounded up (to the next whole second, or to 300 s for DHCP) and idle-timeout = 60 s on every session-type:

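| Session type | n | median | p95 | max | max-duration | idle-timeout |
| --- | --- | --- | --- | --- | --- | --- |
| dns-resolve | 2,094 | 0.041 s | 2.95 s | 167.4 s | 3 s | 60 s |
| tls-session | 1,440 | 5.89 s | 175.7 s | 293.1 s | 176 s | 60 s |
| http-transaction | 7,371 | 0.070 s | 0.54 s | 183.5 s | 1 s | 60 s |
| ssh-session | 22 | 7.71 s | 11.84 s | 11.88 s | 12 s | 60 s |
| dhcp-transaction | 9 | 0.0 s | 294.0 s | 294.0 s | 300 s | 60 s |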
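A threshold of this kind can be derived as follows. This is a minimal sketch, assuming a nearest-rank percentile and rounding up to whole seconds; the percentile method actually used for the table is not stated, and the sample durations are synthetic.

```python
import math


def p95_nearest_rank(values):
    """95th percentile by the nearest-rank method (one common convention)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def max_duration_for(values):
    """Round the p95 up to the next whole second."""
    return math.ceil(p95_nearest_rank(values))


# Synthetic durations in seconds: mostly fast, with a long tail.
durations = [0.1] * 18 + [0.54, 183.5]
print(p95_nearest_rank(durations), max_duration_for(durations))

# Rounding the p95 values from the table up to whole seconds.
table_p95 = {"dns-resolve": 2.95, "tls-session": 175.7,
             "http-transaction": 0.54, "ssh-session": 11.84}
rounded = {name: math.ceil(p95) for name, p95 in table_p95.items()}
print(rounded)
```

Whole-second rounding reproduces the 3 s, 176 s, 1 s and 12 s values; the DHCP row (p95 = 294.0 s) was set to 300 s in the table, so it does not follow this rule exactly.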

Outcome shifts after re-extracting

  • DNS: 169 sessions reclassified: 65 late:: (slow reverse-DNS PTRs that eventually returned) + 104 timeout::max-duration (no response within 3 s). The earlier failure::dns-error count dropped from 301 to 236; the 65 that moved are SERVFAILs that arrived after 3 s and are now late.
  • TLS: 334 long-lived sessions reclassified: 128 timeout::max-duration + 206 timeout::idle 60s. These were previously categorised as success::tcp-fin or incomplete, but actually sat idle for minutes inside keep-alive.
  • HTTP: 311 reclassified: 182 late + 129 timeout. The 1 s max-duration (p95 = 0.54 s, rounded up) is tight on a noisy mixed-traffic capture.
  • DHCP: 18 of the previous 22 incomplete::no-close now properly labeled timeout::idle 60s — the 60 s gap caught unanswered INFORM / Discover broadcasts cleanly.
  • SSH: unchanged — all 22 sessions completed well under 12 s.
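The two user-defined thresholds can be expressed as a small classifier over a session's packet timestamps. This is a simplified sketch of the definitions given in the request (duration above max-duration is "late", a gap above the idle period is "timed out"); it is not the FXT engine's logic, and it does not model the timeout::max-duration outcome used for sessions that never received a response.

```python
def classify(timestamps, max_duration, idle_timeout=60.0):
    """Label one session from its packet timestamps, given in seconds."""
    if not timestamps:
        raise ValueError("a session needs at least one packet")
    ts = sorted(timestamps)
    # Gaps between consecutive packets reveal periods of inactivity.
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if any(gap > idle_timeout for gap in gaps):
        return "timeout::idle"
    if ts[-1] - ts[0] > max_duration:
        return "late"
    return "on-time"


# Synthetic sessions for illustration.
print(classify([0.0, 0.2], max_duration=1))          # short HTTP exchange
print(classify([0.0, 2.17], max_duration=1))         # slower than the threshold
print(classify([0.0, 5.0, 70.0], max_duration=176))  # 65 s of silence
```

Checking the idle gap before the total duration means a session that is both long and idle is reported as idle, which is one possible precedence and an assumption of this sketch.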

11 · User · Pattern hunting

Do we see a pattern in the late and timeout sessions?

12 · Claude · Three distinct patterns, one real problem

DNS — same reverse zone is dragging both buckets

  • DNS late (65 sessions, 6–13 s) and DNS timeout (104 sessions, no response in 3 s) are the same problem at two latency cutoffs: the reverse-PTR 45.66.120.96.in-addr.arpa accounts for 75 / 133 late queries and 65 / 122 timeouts. Internal resolver 172.16.128.202 carries 60–94 % of both buckets — it is retrying upstream for ~12 s before either giving up (timeout) or eventually returning SERVFAIL (late). 88 of 122 timeouts and 68 of the late ones are PTR records.
  • Pattern: one specific reverse zone is unauthoritative or unreachable upstream; the internal resolver keeps eating the retry budget.

HTTP — “late” is mostly long-poll, “timeout” is real

  • HTTP late (182, p50 = 2.17 s, max = 167 s) is dominated by long-poll endpoints: notify17/24/33.dropbox.com (16 sessions, Dropbox's HTTP long-polling notification channel), realtime.services.disqus.com, and high-latency content sites (LinkedIn × 35, WSJ × 11, TripAdvisor × 7). The 167 s max matches typical 60–120 s long-poll holds — these are working as intended, not slow.
  • HTTP timeout (129, duration = 0) are genuine “request sent, response never arrived within 1 s” — spread across many hosts with no single dominant target.
  • Pattern: the 1 s max-duration is too tight for HTTP traffic that legitimately uses long-polling. Late ≠ slow here.

TLS — Salesforce keep-alive sessions dominate

  • 206 idle-timeouts + 128 max-duration timeouts. Top SNIs: na2.salesforce.com (61), umps2c2.salesforce.com + variants (56), c.na2.visual.force.com + iv.na2.visual.force.com (20), so Salesforce accounts for ~140 of 334 TLS timeouts (~42 %). Google follows with safebrowsing.google.com (13), www.google.com (10) and plus.google.com (8); the rest fall outside these top SNIs.
  • Pattern: SaaS-style HTTPS keep-alives (Salesforce push / poll + Google Safe Browsing background) hold TCP connections idle between bursts. They are long-lived by design.

DHCP — unanswered broadcasts

  • All 18 DHCP timeouts target 255.255.255.255 (broadcast). p95 duration = 10 s suggests clients retry every few seconds within the 60 s window before giving up.
  • Pattern: either no DHCP server is serving these subnets, or the relay / server is not replying to INFORM (which is legitimate — many DHCP servers ignore INFORM).

Bottom line — three distinct phenomena hide behind “late / timeout”

  1. Real infrastructure issue — DNS resolver 172.16.128.202 ↔ specific upstream reverse zones (chiefly 96.120.66.0/24). Worth fixing.
  2. Threshold misfit, not failure — HTTP “late” is long-polling; TLS “timeout” is keep-alive. Either raise max-duration (HTTP → 60 s, TLS → 600 s) or scope these timeouts to requests rather than connections.
  3. Architectural quirk, not failure — DHCP INFORM broadcasts with no responder; this is environmental.

13 · User · Split long-poll out

Split http-transaction into short-poll vs long-poll.

14 · Claude · http-longpoll via Host-header regex

Added an http-longpoll session-type qualified by http.host value-regex on the known long-poll endpoints (Dropbox change-notify, Disqus realtime), declared before the generic http-transaction so the request packet (which carries the Host header) qualifies into long-poll first. Wider thresholds reflect intentional 60–120 s server holds:

<session-type name="http-longpoll" max-duration="180" idle-timeout="120">
  <qualify logic="or">
    <field value-regex="notify\d*\.dropbox\.com">http.host</field>
    <field value-regex="realtime\.services\.disqus\.com">http.host</field>
  </qualify>
  <key>
    <protocol><field>tcp.stream</field></protocol>
  </key>
</session-type>

What landed in http-longpoll

| Outcome | Count |
| --- | --- |
| timeout::exceeded max duration of 180 s | 12 |
| incomplete::no-close | 9 |

13 distinct tcp.streams, all matching the targeted set: notify3/12/17/22/24/32/33/34.dropbox.com and realtime.services.disqus.com. The qualifier hit precisely what was intended.
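The two Host patterns can be sanity-checked outside the FXT with Python's re module. This is a sketch under one assumption: that value-regex behaves like an unanchored search. If the engine anchors the pattern instead, fullmatch would be the closer equivalent.

```python
import re

# The two patterns from the http-longpoll qualifier.
LONGPOLL_PATTERNS = [
    re.compile(r"notify\d*\.dropbox\.com"),
    re.compile(r"realtime\.services\.disqus\.com"),
]


def is_longpoll(host):
    """True when any pattern is found in the Host header value."""
    return any(p.search(host) for p in LONGPOLL_PATTERNS)


# Hosts reported in the http-longpoll bucket.
matched_hosts = [f"notify{n}.dropbox.com" for n in (3, 12, 17, 22, 24, 32, 33, 34)]
matched_hosts.append("realtime.services.disqus.com")

# Other hosts that appear elsewhere in the analysis and must not qualify.
other_hosts = ["i.ytimg.com", "content.plymedia.com", "www.dropbox.com"]

print(all(is_longpoll(h) for h in matched_hosts))
print(any(is_longpoll(h) for h in other_hosts))
```

Note that \d* also accepts zero digits, so a bare notify.dropbox.com would qualify as well.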

Notably 0 long-poll sessions completed with 2xx within 180 s — every captured long-poll either timed out the FXT max-duration (12) or had no response in the capture window (9). That is the signature of working long-polling: the server holds the connection for its full timeout, then either the capture ends or another request starts.

The caveat — multi-match shadows the http-transaction counts

FXT <qualify> supports <not-exists> for field absence but not value-regex negation, so with multi-match="true" (required so TCP RST can close app-layer sessions) each long-poll request packet qualifies for both session-types and creates two parallel sessions with different flow keys (http-longpoll::<stream> vs http-transaction::<stream>).

So:

  • http-longpoll is a clean, correctly-classified bucket of long-poll behaviour — useful for monitoring.
  • http-transaction still shadow-counts those streams as late / timeout. Its “short-poll” stats are not cleaned up by the split.

15 · User · Accept the double-count

The current double counting is acceptable as long as the team understands that http-transaction is the superset.

16 · Claude · Document the overlap in the FXT

Acknowledged. Leaving multi-match="true" and the overlapping classification in place. Added a self-documenting comment at the http-longpoll declaration so the next reader sees the relationship spelled out:

<!-- Long-poll HTTP endpoints (Dropbox change-notify, Disqus realtime).
     Declared BEFORE the generic http-transaction so the request packet
     (which carries http.host) qualifies into long-poll first; the
     tcp.stream key then binds the response leg to the same session.
     Wider thresholds reflect intentional 60-120s server holds.

     NOTE: with multi-match="true" (required so TCP RST can close
     app-layer sessions), each long-poll request also qualifies for
     http-transaction below and creates a parallel session there.
     http-transaction is therefore the SUPERSET (all HTTP, including
     long-poll); http-longpoll is a focused sub-view. Do not sum the
     two - they overlap by design. -->
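When a combined HTTP figure is needed, counting distinct tcp.stream values avoids the double count. The sketch below uses synthetic stream IDs purely for illustration; real IDs would come from the flow keys of each session type.

```python
# Synthetic tcp.stream IDs, for illustration only.
longpoll_streams = {101, 102, 103}
http_streams = {100, 101, 102, 103, 104}  # superset: all HTTP, incl. long-poll

# Summing the two views counts every long-poll stream twice.
naive_total = len(longpoll_streams) + len(http_streams)

# A set union counts each stream once.
distinct_total = len(longpoll_streams | http_streams)

# Set difference recovers the short-poll-only view.
short_poll_only = http_streams - longpoll_streams

print(naive_total, distinct_total, sorted(short_poll_only))
```

Because http-transaction is the superset, the union equals the http-transaction set itself, and the set difference gives the cleaned-up short-poll numbers that the split alone does not provide.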

Method. bigFlows.pcap was obtained from the tcpreplay sample captures and analysed end-to-end through the VisualEther MCP server with Claude. analyze_capture detected 20 protocols, from which the five CI/CD-relevant ones were selected; the resulting bigflows-sessions.fxt.xml was validated with validate_fxt, then driven through extract_sessions in three rounds: baseline, p95 thresholds, and the long-poll split. Each round's NDJSON output was sampled to derive the patterns (PTR SERVFAIL, Salesforce SNI, Dropbox notify Host header), and every conclusion is grounded in those per-session records.