Please generate sessions.fxt.xml from bigFlows.pcap. This file represents our typical CI/CD runs.
analyze_capture detects 20 protocols. For a CI/CD framing I narrowed to the five that matter for build, deploy and registry traffic: HTTP, TLS/HTTPS, SSH, DNS, DHCP. Each gets its own session-type in sessions.fxt.xml with:
- `multi-match="true"` at the root, so a TCP FIN/RST scoped to an app-layer session-type can close that session without suppressing the app-layer message templates themselves.
- Right key per protocol: `tcp.stream` for HTTP / TLS / SSH; `dns.id` for DNS; `dhcp.id` for DHCP.
- Explicit `session-result` outcomes on every stop template: `success::http-2xx`, `success::http-redirect`, `failure::http-client-error`, `failure::http-server-error`, `failure::tcp-reset`, `success::tls-close-notify`, `failure::tls-alert-fatal`, `success::resolved`, `failure::dns-error`, `success::ack`, `failure::nak`, etc.
- Unscoped TCP-analysis templates at the bottom (retransmission, fast retx, duplicate ACK, zero window) so every TCP session is decorated with health markers without being tied to a single session-type.
validate_fxt against the pcap reports: FXT valid. 5 session type(s), 37 template(s). 5,689 messages matched from 100,000 packets scanned — 6 % coverage is expected because the remaining protocols in the capture (SIP, SNMP, VNC, SMB, sFlow, Syslog) are intentionally out of scope for a CI/CD lens.
FXT downloadable here: bigflows-sessions.fxt.xml.
Run extract_sessions with it.
Format auto-resolved to lazy (per-session PDFs render on-demand from the navigator). Counts:
| Session type | Outcome | Count |
|---|---|---|
| dns-resolve | success::resolved | 1,789 |
| dns-resolve | failure::dns-error | 301 |
| dns-resolve | incomplete::no-close | 17 |
| http-transaction | success::http-2xx | 6,109 |
| http-transaction | success::http-redirect | 1,163 |
| http-transaction | failure::http-client-error | 65 |
| http-transaction | failure::http-server-error | 3 |
| http-transaction | failure::tcp-reset | 27 |
| http-transaction | incomplete::no-close | 119 |
| tls-session | success::tcp-fin | 552 |
| tls-session | failure::tls-alert | 397 |
| tls-session | failure::tcp-reset | 74 |
| tls-session | incomplete::no-close | 421 |
| ssh-session | success::tcp-fin | 22 |
| dhcp-transaction | success::ack | 7 |
| dhcp-transaction | incomplete::no-close | 22 |
The session navigator opens to the per-protocol tree; per-session PDFs render on demand.
Rank the protocols by failure / incomplete rate. For each protocol with a non-trivial failure rate (anything above 5 %), pull one representative failing session and tell me the symptom in one sentence and the root cause in two. End with a one-paragraph executive summary an on-call engineer could paste into a ticket.
| Rank | Protocol | Total | Bad | Rate |
|---|---|---|---|---|
| 1 | dhcp-transaction | 29 | 22 | 75.9 % |
| 2 | tls-session | 1,444 | 892 | 61.8 % |
| 3 | dns-resolve | 2,107 | 318 | 15.1 % |
| 4 | http-transaction | 7,486 | 214 | 2.9 % |
| 5 | ssh-session | 22 | 0 | 0 % |
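The ranking can be reproduced from the outcome counts with a few lines of Python. This is a minimal sketch and not part of the VisualEther toolchain; it assumes that "bad" means any outcome that does not start with `success::`, which matches the totals in the table.

```python
from collections import defaultdict

# (session type, outcome, count) rows copied from the extract_sessions counts.
COUNTS = [
    ("dns-resolve", "success::resolved", 1789),
    ("dns-resolve", "failure::dns-error", 301),
    ("dns-resolve", "incomplete::no-close", 17),
    ("http-transaction", "success::http-2xx", 6109),
    ("http-transaction", "success::http-redirect", 1163),
    ("http-transaction", "failure::http-client-error", 65),
    ("http-transaction", "failure::http-server-error", 3),
    ("http-transaction", "failure::tcp-reset", 27),
    ("http-transaction", "incomplete::no-close", 119),
    ("tls-session", "success::tcp-fin", 552),
    ("tls-session", "failure::tls-alert", 397),
    ("tls-session", "failure::tcp-reset", 74),
    ("tls-session", "incomplete::no-close", 421),
    ("ssh-session", "success::tcp-fin", 22),
    ("dhcp-transaction", "success::ack", 7),
    ("dhcp-transaction", "incomplete::no-close", 22),
]


def rank_by_bad_rate(rows):
    """Return (type, total, bad, rate %) tuples, worst rate first."""
    total = defaultdict(int)
    bad = defaultdict(int)
    for session_type, outcome, count in rows:
        total[session_type] += count
        # Assumption: everything that is not success:: counts as bad.
        if not outcome.startswith("success::"):
            bad[session_type] += count
    ranked = [(t, total[t], bad[t], 100.0 * bad[t] / total[t]) for t in total]
    return sorted(ranked, key=lambda row: row[3], reverse=True)


for session_type, total, bad, rate in rank_by_bad_rate(COUNTS):
    print(f"{session_type:18} {total:6,} {bad:5,} {rate:5.1f} %")
```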
DHCP — incomplete::no-close (session 0x0446379d, 172.16.133.87)
- Symptom: client sent a DHCPINFORM broadcast to `255.255.255.255:67` and no server reply (ACK or NAK) is in the capture.
- Root cause: DHCPINFORM is a one-shot config request that terminates on ACK; the FXT marks sessions complete on ACK / NAK / Release / Decline. Most “incomplete” DHCP rows are unanswered INFORM broadcasts: either no DHCP server services INFORM on this segment, or the unicast reply was outside the capture point.
TLS — failure::tls-alert (session tcp.stream=560, 172.16.133.57 → 96.43.146.48:443)
- Symptom: a full SSL 3.0 handshake completes, application data flows, then the client sends a `TLS Alert` and the session closes.
- Root cause: the alert is sent after Change Cipher Spec, so its payload is encrypted; tshark only exposes the bare `tls.record` opcode and the FXT cannot distinguish a benign `close_notify` from a fatal alert, so it conservatively classifies it as failure. The bulk of the 397 `failure::tls-alert` sessions are almost certainly normal encrypted `close_notify`s, not real handshake failures; the genuine handshake-time signal `failure::tls-alert-fatal` is 0.
DNS — failure::dns-error (session 0x8df0, 172.16.133.6 → 8.8.8.8)
- Symptom: PTR query for `45.66.120.96.in-addr.arpa` returns rcode 2 (SERVFAIL) from 8.8.8.8 after ~260 ms.
- Root cause: the reverse-DNS zone covering `96.120.66.45` is not being served authoritatively (a lame delegation, or nameservers unreachable from Google's resolver), so 8.8.8.8 returns SERVFAIL; a zone that merely lacked the PTR record would normally answer NXDOMAIN instead. The 301 DNS errors are dominated by reverse lookups on third-party IPs: not a local resolver problem, but upstream reverse zones that fail to resolve.
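The mapping between the PTR name in the symptom and the IP address in the root cause is mechanical: the octets are reversed and `.in-addr.arpa` is appended. A small sketch using Python's standard `ipaddress` module shows both directions; the helper names `ptr_name` and `ip_from_ptr` are illustrative only.

```python
import ipaddress

SUFFIX = ".in-addr.arpa"


def ptr_name(ip: str) -> str:
    """Return the reverse-lookup name that a PTR query for `ip` would use."""
    return ipaddress.ip_address(ip).reverse_pointer


def ip_from_ptr(name: str) -> str:
    """Recover the IPv4 address from an in-addr.arpa PTR name."""
    if not name.endswith(SUFFIX):
        raise ValueError(f"not an IPv4 reverse name: {name!r}")
    octets = name[: -len(SUFFIX)].split(".")
    # IPv4Address validates that the reversed labels form a real address.
    return str(ipaddress.IPv4Address(".".join(reversed(octets))))


print(ptr_name("96.120.66.45"))
print(ip_from_ptr("45.66.120.96.in-addr.arpa"))
```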
Exec summary (ticket-ready). CI/CD traffic analysis of bigFlows.pcap shows three protocols above the 5 % failure / incomplete threshold, but none of the three is a confirmed pipeline problem. DHCP (75.9 %) is misleading: 22 of 29 sessions never closed, most of them unanswered DHCPINFORM broadcasts; investigate only if agents depend on INFORM responses. TLS (61.8 %) is largely a measurement artefact: 397 of 892 “failures” are encrypted post-handshake alerts (most are normal close_notify) that tshark cannot decrypt without keys; only 74 hard TCP resets and 0 fatal handshake alerts indicate genuine connectivity problems. DNS (15.1 %) shows 301 SERVFAILs concentrated on PTR / reverse lookups against third-party IPs through 8.8.8.8, an upstream zone-config issue and not a pipeline failure. HTTP (2.9 %) and SSH (0 %) are healthy. Action: before treating TLS/DHCP as incidents, refine the FXT (decryption keylog for TLS, or scope DHCPINFORM as a non-failure outcome); the only real signal worth investigating is the cluster of 27 HTTP TCP-resets and 65 4xx responses.
Let us drill into the HTTP failures (2.9 %). What kind of failures are we seeing at the HTTP and TCP levels? Are we able to correlate errors across layers?
Failure mix (214 / 7,486 = 2.9 %)
| Category | Count | Layer |
|---|---|---|
| 4xx client error (404 ≈ 55, 403 ≈ 7, 400 ≈ 2) | 65 | HTTP |
| 5xx server error (one each: 500, 503, 504) | 3 | HTTP |
| TCP RST mid-transaction | 27 | TCP |
| Incomplete (request, no response) | 119 | both |
HTTP layer
- 404s dominate (~85 % of 4xx), scattered across many hosts (`content.plymedia.com`, `i.ytimg.com`, ad servers, `monster.com`). Content-not-found errors from third-party endpoints; noise from browser / app traffic, not pipeline failures.
- 403s (~7): auth / permission denials, again third-party.
- 5xx are rare and uncorrelated, from three different upstreams: `content.plymedia.com` (500 Internal), `sametime.lotus.com` (504 Gateway Timeout on a POST), `jdn.monster.com` (503 Service Unavailable). No shared host.
TCP layer
- 27 RSTs, several arriving in tight clusters on adjacent `tcp.stream` IDs to the same destination: streams 17970/17971 → `173.194.43.36` (i.ytimg.com), two YouTube thumbnail GETs reset by the client ~70 ms in. Streams 8306–8313: 8 consecutive resets within a few hundred ms. Streams 5368–5372: similar burst.
- RST direction is mixed: some server-originated (immediate, before response: stream 132 to `pandora.com`, RST 254 µs after GET), some client-originated (stream 17970, client RSTs after no response in 70 ms).
- No retransmissions / zero-windows appear in the failure NDJSONs: the unscoped TCP-analysis templates fired during render but did not co-occur within failing HTTP sessions. The RSTs are not congestion-driven; they are abrupt aborts.
Cross-layer correlation — yes, by tcp.stream
The FXT structure makes correlation cheap: every HTTP transaction and the TCP RST that ends it share the same flow.key = http-transaction::<tcp.stream>. Three patterns are visible:
- HTTP request → TCP RST in same session (the 27 RST failures): request line and RST appear in the same NDJSON file with the same `flow.key`. Stream 132 (pandora.com) shows GET at frame 889 and RST at frame 891, a 254 µs gap: the server aborted the connection immediately after the request.
- HTTP 5xx → graceful FIN (no RST): all 3 5xx sessions end normally at TCP; the server sent the error and closed cleanly. No TCP-layer signal would have caught these.
- Burst RSTs to one host: streams 17970/17971 to `173.194.43.36` (both YouTube thumbnails, same client, same second), which suggests a per-destination throttle or blocklist, not a generic network event.
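The grouping step behind these patterns can be sketched in Python. Only `flow.key` is taken from the output described above; the `frame` and `template` field names and the template labels (`http-request`, `tcp-rst`, `http-5xx`, `tcp-fin`) are assumptions made for illustration, and the second sample stream (200) is invented, so the real NDJSON schema may differ.

```python
import json
from collections import defaultdict

# Hypothetical records; only the "flow.key" field is known from the run.
SAMPLE_NDJSON = """\
{"flow.key": "http-transaction::132", "frame": 889, "template": "http-request"}
{"flow.key": "http-transaction::132", "frame": 891, "template": "tcp-rst"}
{"flow.key": "http-transaction::200", "frame": 1200, "template": "http-request"}
{"flow.key": "http-transaction::200", "frame": 1204, "template": "http-5xx"}
{"flow.key": "http-transaction::200", "frame": 1210, "template": "tcp-fin"}
"""


def group_by_flow_key(lines):
    """Parse NDJSON lines and bucket the records by their flow.key."""
    sessions = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        sessions[record["flow.key"]].append(record)
    return sessions


def cross_layer_pattern(records):
    """Label a session by which HTTP and TCP events it contains."""
    templates = {record["template"] for record in records}
    if "http-request" in templates and "tcp-rst" in templates:
        return "request-then-rst"
    if "http-5xx" in templates and "tcp-fin" in templates:
        return "5xx-then-fin"
    return "other"


sessions = group_by_flow_key(SAMPLE_NDJSON.splitlines())
for key, records in sessions.items():
    print(key, cross_layer_pattern(records))
```

Because both layers share one key, no time-window join is needed; the correlation is a plain group-by.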
Treat sessions whose duration exceeds the 95th percentile as “late,” and treat sessions with more than 1 minute of inactivity as “timed out.” Update sessions.fxt.xml with appropriate max-duration and idle-period values to reflect these thresholds, then run extract_sessions against the updated FXT file.
Computed p95 of observed session durations per type from the first run's NDJSONs, then set max-duration to p95 rounded up to the next whole second (300 s for DHCP) and idle-timeout = 60 s on every session-type:
| Session type | n | median | p95 | max | max-duration | idle-timeout |
|---|---|---|---|---|---|---|
| dns-resolve | 2,094 | 0.041 s | 2.95 s | 167.4 s | 3 s | 60 s |
| tls-session | 1,440 | 5.89 s | 175.7 s | 293.1 s | 176 s | 60 s |
| http-transaction | 7,371 | 0.070 s | 0.54 s | 183.5 s | 1 s | 60 s |
| ssh-session | 22 | 7.71 s | 11.84 s | 11.88 s | 12 s | 60 s |
| dhcp-transaction | 9 | 0.0 s | 294.0 s | 294.0 s | 300 s | 60 s |
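The threshold derivation can be sketched as follows. The percentile method used in the run is not stated, so the nearest-rank definition here is an assumption, as are the function names. Rounding each p95 up to the next whole second reproduces four of the five max-duration values in the table; the DHCP row uses 300 s, not 294 s.

```python
import math


def p95_nearest_rank(durations):
    """95th percentile by the nearest-rank method (an assumed definition)."""
    ordered = sorted(durations)
    if not ordered:
        raise ValueError("need at least one duration")
    # ceil(0.95 * n) in integer arithmetic, to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def max_duration_seconds(p95):
    """Round a p95 up to a whole number of seconds, with a 1 s floor."""
    return max(1, math.ceil(p95))


# p95 values copied from the table.
P95_FROM_TABLE = {
    "dns-resolve": 2.95,
    "tls-session": 175.7,
    "http-transaction": 0.54,
    "ssh-session": 11.84,
    "dhcp-transaction": 294.0,
}

thresholds = {name: max_duration_seconds(p95) for name, p95 in P95_FROM_TABLE.items()}
for name, seconds in thresholds.items():
    print(f"{name:18} max-duration={seconds} s  idle-timeout=60 s")
```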
Outcome shifts after re-extracting
- DNS: 169 sessions reclassified: 65 `late::` (slow reverse-DNS PTRs that eventually returned) + 104 `timeout::max-duration` (no response within 3 s). The earlier `failure::dns-error` count dropped from 301 → 236; the difference of 65 is the SERVFAILs that arrived after 3 s and are now `late`.
- TLS: 334 long-lived sessions reclassified: 128 `timeout::max-duration` + 206 `timeout::idle 60s`. Previously categorised as `success::tcp-fin` or `incomplete`, but actually sat idle for minutes inside keep-alive.
- HTTP: 311 reclassified: 182 `late` + 129 `timeout`. The 1 s threshold (p95 = 0.54 s, rounded up) is tight on a noisy mixed-traffic capture.
- DHCP: 18 of the previous 22 `incomplete::no-close` now properly labeled `timeout::idle 60s`; the 60 s gap caught unanswered INFORM / Discover broadcasts cleanly.
- SSH: unchanged; all 22 sessions completed in under 12 s.
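One way to read these shifts is as a small decision procedure over each session's timing. The sketch below is an interpretation of the outcomes reported above, not the extractor's actual logic: the `SessionTiming` fields, the precedence of the idle check, and the simplified outcome labels are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionTiming:
    closed_after_s: Optional[float]  # start to stop template; None if never closed
    longest_gap_s: float             # longest silence between consecutive packets
    observed_s: float                # how long the capture ran after session start
    stop_outcome: Optional[str]      # outcome named by the stop template, if any


def classify(session, max_duration_s, idle_timeout_s=60.0):
    """Assumed precedence: idle gap, then missing close, then lateness."""
    if session.longest_gap_s > idle_timeout_s:
        return "timeout::idle"
    if session.closed_after_s is None:
        # No stop template: a timeout only if the capture ran long enough to tell.
        if session.observed_s > max_duration_s:
            return "timeout::max-duration"
        return "incomplete::no-close"
    if session.closed_after_s > max_duration_s:
        return "late"
    return session.stop_outcome


# A SERVFAIL that arrives after 8 s is "late" under the 3 s DNS threshold.
slow_servfail = SessionTiming(8.0, 4.0, 100.0, "failure::dns-error")
print(classify(slow_servfail, max_duration_s=3))
```

Under this reading, a SERVFAIL that arrives after the threshold moves from `failure::dns-error` to `late`, which is the shift seen in the DNS row.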
Do we see a pattern in the late and timeout sessions?
DNS — same reverse zone is dragging both buckets
- DNS late (65 sessions, 6–13 s) and DNS timeout (104 sessions, no response in 3 s) are the same problem at two latency cutoffs: the reverse-PTR `45.66.120.96.in-addr.arpa` accounts for 75 / 133 late queries and 65 / 122 timeouts. Internal resolver `172.16.128.202` carries 60–94 % of both buckets: it is retrying upstream for ~12 s before either giving up (timeout) or eventually returning SERVFAIL (late). 88 of 122 timeouts and 68 of the late ones are PTR records.
- Pattern: one specific reverse zone is unauthoritative or unreachable upstream; the internal resolver keeps eating the retry budget.
HTTP — “late” is mostly long-poll, “timeout” is real
- HTTP late (182, p50 = 2.17 s, max = 167 s) is dominated by long-poll endpoints: `notify17/24/33.dropbox.com` (16 sessions, Dropbox's HTTP long-polling notification channel), `realtime.services.disqus.com`, and high-latency content sites (LinkedIn × 35, WSJ × 11, TripAdvisor × 7). The 167 s max is of the same order as typical 60–120 s long-poll holds; these sessions are working as intended, not slow.
- HTTP timeout (129, duration = 0) are genuine “request sent, response never arrived within 1 s” cases, spread across many hosts with no single dominant target.
- Pattern: the 1 s `max-duration` is too tight for HTTP traffic that legitimately uses long-polling. Late ≠ slow here.
TLS — Salesforce keep-alive sessions dominate
- 206 idle-timeouts + 128 max-duration timeouts, top SNIs: `na2.salesforce.com` (61), `umps2c2.salesforce.com` + variants (56), `c.na2.visual.force.com` + `iv.na2.visual.force.com` (20). Salesforce accounts for ~140 of 334 TLS timeouts (~42 %). `safebrowsing.google.com` (13), `www.google.com` (10) and `plus.google.com` (8) form the next largest group.
- Pattern: SaaS-style HTTPS keep-alives (Salesforce push / poll + Google Safe Browsing background) hold TCP connections idle between bursts. They are long-lived by design.
DHCP — unanswered broadcasts
- All 18 DHCP timeouts target `255.255.255.255` (broadcast). p95 duration = 10 s suggests clients retry every few seconds within the 60 s window before giving up.
- Pattern: either no DHCP server is serving these subnets, or the relay / server is not replying to INFORM (which is legitimate: many DHCP servers ignore INFORM).
Bottom line — three distinct phenomena hide behind “late / timeout”
- Real infrastructure issue: DNS resolver `172.16.128.202` ↔ specific upstream reverse zones (chiefly `96.120.66.0/24`). Worth fixing.
- Threshold misfit, not failure: HTTP “late” is long-polling; TLS “timeout” is keep-alive. Either raise `max-duration` (HTTP → 60 s, TLS → 600 s) or scope these timeouts to requests rather than connections.
- Architectural quirk, not failure: DHCP INFORM broadcasts with no responder; this is environmental.
Split http-transaction into short-poll vs long-poll.
Added an http-longpoll session-type qualified by http.host value-regex on the known long-poll endpoints (Dropbox change-notify, Disqus realtime), declared before the generic http-transaction so the request packet (which carries the Host header) qualifies into long-poll first. Wider thresholds reflect intentional 60–120 s server holds:
```xml
<session-type name="http-longpoll" max-duration="180" idle-timeout="120">
  <qualify logic="or">
    <field value-regex="notify\d*\.dropbox\.com">http.host</field>
    <field value-regex="realtime\.services\.disqus\.com">http.host</field>
  </qualify>
  <key>
    <protocol><field>tcp.stream</field></protocol>
  </key>
</session-type>
```
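The two `value-regex` patterns can be sanity-checked against hostnames seen in the capture before re-running the extraction. The sketch below uses Python's `re.search`; whether the FXT engine applies search or full-match semantics is not known here, so treat the matching mode as an assumption.

```python
import re

# Patterns copied from the http-longpoll <qualify> block.
LONGPOLL_PATTERNS = [
    r"notify\d*\.dropbox\.com",
    r"realtime\.services\.disqus\.com",
]


def is_longpoll(host, patterns=LONGPOLL_PATTERNS):
    """True if any pattern matches somewhere in the Host header value."""
    return any(re.search(pattern, host) for pattern in patterns)


HOSTS = [
    "notify17.dropbox.com",
    "notify3.dropbox.com",
    "realtime.services.disqus.com",
    "i.ytimg.com",
    "na2.salesforce.com",
]

for host in HOSTS:
    print(f"{host:32} {'long-poll' if is_longpoll(host) else 'generic'}")
```

Note that `\d*` also accepts zero digits, so a bare `notify.dropbox.com` would qualify as well; use `\d+` if that is not intended.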
What landed in http-longpoll
| Outcome | Count |
|---|---|
| timeout::exceeded max duration of 180 s | 12 |
| incomplete::no-close | 9 |
13 distinct tcp.streams, all matching the targeted set: notify3/12/17/22/24/32/33/34.dropbox.com and realtime.services.disqus.com. The qualifier hit precisely what was intended.
Notably, 0 long-poll sessions completed with 2xx within 180 s: every captured long-poll either hit the FXT max-duration (12) or had no response in the capture window (9). That is consistent with working long-polling: the server holds the connection for its full timeout, then either the capture ends or another request starts.
The caveat — multi-match shadows the http-transaction counts
FXT <qualify> supports <not-exists> for field absence but not value-regex negation, so with multi-match="true" (required so TCP RST can close app-layer sessions) each long-poll request packet qualifies for both session-types and creates two parallel sessions with different flow keys (http-longpoll::<stream> vs http-transaction::<stream>).
So:
- http-longpoll is a clean, correctly-classified bucket of long-poll behaviour — useful for monitoring.
- http-transaction still shadow-counts those streams as late / timeout. Its “short-poll” stats are not cleaned up by the split.
The current double counting is acceptable as long as the team understands that http-transaction is the superset.
Acknowledged. Leaving multi-match="true" and the overlapping classification in place. Added a self-documenting comment at the http-longpoll declaration so the next reader sees the relationship spelled out:
```xml
<!-- Long-poll HTTP endpoints (Dropbox change-notify, Disqus realtime).
     Declared BEFORE the generic http-transaction so the request packet
     (which carries http.host) qualifies into long-poll first; the
     tcp.stream key then binds the response leg to the same session.
     Wider thresholds reflect intentional 60-120s server holds.
     NOTE: with multi-match="true" (required so TCP RST can close
     app-layer sessions), each long-poll request also qualifies for
     http-transaction below and creates a parallel session there.
     http-transaction is therefore the SUPERSET (all HTTP, including
     long-poll); http-longpoll is a focused sub-view. Do not sum the
     two - they overlap by design. -->
```
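Since the overlap stays in the FXT, a short-poll-only view can still be derived on the reporting side by dropping every http-transaction session whose `tcp.stream` also appears under http-longpoll. This is a sketch under assumptions: the `(flow key, outcome)` tuple shape is hypothetical, and stream IDs 9001 and 9002 are invented for the example.

```python
def stream_of(flow_key: str) -> str:
    """Extract the tcp.stream part of '<session-type>::<tcp.stream>'."""
    _session_type, _sep, stream = flow_key.partition("::")
    return stream


def short_poll_view(http_sessions, longpoll_sessions):
    """Keep http-transaction sessions on streams with no long-poll session."""
    longpoll_streams = {stream_of(key) for key, _outcome in longpoll_sessions}
    return [
        (key, outcome)
        for key, outcome in http_sessions
        if stream_of(key) not in longpoll_streams
    ]


# Hypothetical sample: stream 9001 carries a long-poll request.
http_sessions = [
    ("http-transaction::132", "failure::tcp-reset"),
    ("http-transaction::9001", "timeout::max-duration"),
    ("http-transaction::9002", "success::http-2xx"),
]
longpoll_sessions = [("http-longpoll::9001", "timeout::max-duration")]

for key, outcome in short_poll_view(http_sessions, longpoll_sessions):
    print(key, outcome)
```

The filter works per stream, so any ordinary request that shares a connection with a long-poll request is dropped too.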
Method. bigFlows.pcap was obtained from the tcpreplay sample captures and analysed end-to-end through the VisualEther MCP server with Claude. analyze_capture listed the protocols in the capture, from which the five CI/CD-relevant ones were selected; the resulting bigflows-sessions.fxt.xml was validated with validate_fxt, then driven through extract_sessions in three rounds: baseline, p95 thresholds, and the long-poll split. Each round's NDJSON output was sampled to derive the patterns (PTR SERVFAIL, Salesforce SNI, Dropbox notify Host header), and every conclusion is grounded in those per-session records.