How do you build a sessions.fxt.xml for a mixed-protocol pcap like bigFlows.pcap?

Run analyze_capture to detect protocols, pick the CI/CD-relevant subset (HTTP, TLS, DNS, DHCP, SSH), and declare one session-type per protocol with the right key: tcp.stream for TCP-borne sessions (HTTP, TLS, SSH), dns.id for DNS, dhcp.id for DHCP. Set multi-match="true" so a TCP FIN/RST template can close app-layer sessions. Each session-type gets session-start templates on its protocol's request opcodes (e.g. http.request.method, tls Client Hello, dns query, DHCP Discover/Request/Inform) and session-stop templates on the closing opcodes with explicit session-result outcomes (success::http-2xx, failure::http-server-error, success::tls-close-notify, failure::tls-alert-fatal, success::ack, failure::nak, etc.). validate_fxt against the pcap confirms how many frames match before running extract_sessions.

How do you rank protocols by failure or incomplete rate from extract_sessions output?

Group the session_summary counts returned by extract_sessions per session-type, sum every outcome whose prefix is failure:: or incomplete::, and divide by the total sessions of that type. For bigFlows.pcap this produces: dhcp-transaction 75.9 %, tls-session 61.8 %, dns-resolve 15.1 %, http-transaction 2.9 %, ssh-session 0 %. The high DHCP and TLS rates are largely artefacts — unanswered DHCP INFORM broadcasts and encrypted TLS close_notify alerts that tshark cannot decrypt — not real CI/CD failures. The DNS 15.1 % is a real upstream-zone problem (reverse-DNS SERVFAIL). HTTP at 2.9 % is the only protocol whose failures are worth drilling into.

How do you correlate HTTP-layer and TCP-layer errors in the same session?

Key the HTTP session-type on tcp.stream so that the HTTP request, the HTTP response (if any), and the TCP RST that aborts the connection all share the same flow key (http-transaction::N). With multi-match="true", a TCP RST template scoped to session-type="http-transaction" can mark session-stop with session-result="failure::tcp-reset" on the very stream that carried the GET. In bigFlows.pcap the 27 HTTP TCP-reset sessions each contain the originating GET and the RST in one NDJSON file: some resets are server-initiated within microseconds of the GET, others are client-initiated after about 70 ms without a response. Server-side 5xx errors, by contrast, close gracefully at TCP and are invisible from the TCP layer alone.

What does setting max-duration to the p95 session duration accomplish, and why pick 60 s as idle-timeout?

max-duration is an absolute lifetime cap on a session; once exceeded VisualEther tags the outcome late::exceeded max duration of Xs if a stop arrives later, or timeout::exceeded max duration of Xs if no stop ever arrives. Setting it to the per-type p95 from a baseline run defines "late" as "slower than 95 % of observed sessions" without inventing arbitrary thresholds. idle-timeout caps inactivity between messages within a session; 60 s is the natural cutoff for a CI/CD run where genuine connections rarely sit idle a full minute, so an idle gap larger than that almost always indicates an abandoned connection (HTTP keep-alive holding a TCP socket, DHCP DISCOVER never answered, etc.). The two thresholds together let VisualEther emit late::, timeout::exceeded max duration, and timeout::exceeded idle timeout as three distinct outcome buckets.
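The three outcome buckets described above can be modelled in a few lines of Python. This is only a sketch of the rules as stated in this answer, not VisualEther's implementation: the function name and parameters are hypothetical, and the order of the checks (idle gap first, then lifetime cap) is an assumption, since the text does not say which threshold wins when both are exceeded.

```python
def classify_timing(duration_s, longest_idle_gap_s, stop_seen,
                    max_duration_s, idle_timeout_s):
    """Return a late::/timeout:: label, or None when no threshold trips.

    Assumption: the idle check runs before the lifetime cap. The source
    does not state the precedence when both thresholds are exceeded.
    """
    if longest_idle_gap_s > idle_timeout_s:
        return "timeout::exceeded idle timeout"
    if duration_s > max_duration_s:
        if stop_seen:
            # A stop message did arrive, but after the lifetime cap.
            return f"late::exceeded max duration of {max_duration_s}s"
        # No stop message ever arrived.
        return f"timeout::exceeded max duration of {max_duration_s}s"
    return None


# A DNS reply that arrives after 6 s against a 3 s cap is "late".
print(classify_timing(6.0, 0.5, True, max_duration_s=3, idle_timeout_s=60))
# A DNS query that is never answered is a max-duration timeout.
print(classify_timing(3.5, 0.0, False, max_duration_s=3, idle_timeout_s=60))
```

A session that stays under both thresholds returns None and keeps whatever session-result its stop template assigned.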

Why are most TLS "failures" in bigFlows.pcap encrypted Alert records?

After TLS Change Cipher Spec, every record including the closing Alert is encrypted. tshark exposes the bare tls.record opcode without the Alert description, so a generic Alert template cannot distinguish a benign close_notify from a fatal alert and conservatively classifies it as failure. Alongside the 397 failure::tls-alert sessions in bigFlows.pcap, 0 were classified as failure::tls-alert-fatal (which matches handshake-time fatal alerts sent before encryption starts), so the bulk are almost certainly normal close_notify alerts, not real handshake failures. The TCP RST count (74) is the more reliable signal of actual connection trouble. Loading a TLS keylog via tshark_args="-o tls.keylog_file:…" would resolve the ambiguity for instrumented traffic.

How do you split HTTP into short-poll and long-poll session types in FXT?

Declare an http-longpoll session-type before the generic http-transaction with a qualifier that uses value-regex on http.host (e.g. notify\d*\.dropbox\.com and realtime\.services\.disqus\.com), key it on tcp.stream so the response leg binds to the same session, and give it wider thresholds (max-duration=180, idle-timeout=120) to reflect intentional 60–120 s server holds. The request packet that carries the Host header qualifies into long-poll first; the response packet shares the tcp.stream key and attaches to the same session. Because multi-match="true" is required to let TCP RST close app-layer sessions, the long-poll request also still qualifies for http-transaction below — so http-transaction remains the superset (all HTTP) and http-longpoll is a focused sub-view. The two buckets overlap by design and must not be summed.

bigFlows.pcap Session Analysis: HTTP, TLS, DNS, DHCP — Failures, Late & Long-Poll Splits

01 User: Initial request

Please generate sessions.fxt.xml from bigFlows.pcap. This file represents our typical CI/CD runs.

02 Claude: FXT for the CI/CD-relevant subset

analyze_capture detects 20 protocols. For a CI/CD framing I narrowed to the five that matter for build, deploy and registry traffic: HTTP, TLS/HTTPS, SSH, DNS, DHCP. Each gets its own session-type in sessions.fxt.xml with:

multi-match="true" at the root, so a TCP FIN/RST scoped to an app-layer session-type can close that session without suppressing the app-layer message templates themselves.
Right key per protocol: tcp.stream for HTTP / TLS / SSH; dns.id for DNS; dhcp.id for DHCP.
Explicit session-result outcomes on every stop template — success::http-2xx, success::http-redirect, failure::http-client-error, failure::http-server-error, failure::tcp-reset, success::tls-close-notify, failure::tls-alert-fatal, success::resolved, failure::dns-error, success::ack, failure::nak, etc.
Unscoped TCP-analysis templates at the bottom (retransmission, fast retx, duplicate ACK, zero window) so every TCP session is decorated with health markers without being tied to a single session-type.

validate_fxt against the pcap reports: FXT valid. 5 session type(s), 37 template(s). 5,689 messages matched from 100,000 packets scanned — 6 % coverage is expected because the remaining protocols in the capture (SIP, SNMP, VNC, SMB, sFlow, Syslog) are intentionally out of scope for a CI/CD lens.

FXT downloadable here: bigflows-sessions.fxt.xml.

03 User: Run extraction

Run extract_sessions with it.

04 Claude: Session counts (first pass)

Format auto-resolved to lazy (per-session PDFs render on-demand from the navigator). Counts:

Session type	Outcome	Count
dns-resolve	success::resolved	1,789
dns-resolve	failure::dns-error	301
dns-resolve	incomplete::no-close	17
http-transaction	success::http-2xx	6,109
http-transaction	success::http-redirect	1,163
http-transaction	failure::http-client-error	65
http-transaction	failure::http-server-error	3
http-transaction	failure::tcp-reset	27
http-transaction	incomplete::no-close	119
tls-session	success::tcp-fin	552
tls-session	failure::tls-alert	397
tls-session	failure::tcp-reset	74
tls-session	incomplete::no-close	421
ssh-session	success::tcp-fin	22
dhcp-transaction	success::ack	7
dhcp-transaction	incomplete::no-close	22

The session navigator opens to the per-protocol tree; per-session PDFs render on demand.

05 User: Rank and drill

Rank the protocols by failure / incomplete rate. For each protocol with a non-trivial failure rate (anything above 5 %), pull one representative failing session and tell me the symptom in one sentence and the root cause in two. End with a one-paragraph executive summary an on-call engineer could paste into a ticket.

06 Claude: Failure / incomplete ranking

Rank	Protocol	Total	Bad	Rate
1	dhcp-transaction	29	22	75.9 %
2	tls-session	1,444	892	61.8 %
3	dns-resolve	2,107	318	15.1 %
4	http-transaction	7,486	214	2.9 %
5	ssh-session	22	0	0 %
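
The ranking can be reproduced from the first-pass counts with a short Python sketch. The tuple layout and the function name are assumptions made for illustration (the real shape of the session_summary returned by extract_sessions is not shown here); the counts themselves are copied from the first-pass table.

```python
from collections import defaultdict

# (session type, outcome, count) rows copied from the first-pass counts.
SESSION_SUMMARY = [
    ("dns-resolve", "success::resolved", 1789),
    ("dns-resolve", "failure::dns-error", 301),
    ("dns-resolve", "incomplete::no-close", 17),
    ("http-transaction", "success::http-2xx", 6109),
    ("http-transaction", "success::http-redirect", 1163),
    ("http-transaction", "failure::http-client-error", 65),
    ("http-transaction", "failure::http-server-error", 3),
    ("http-transaction", "failure::tcp-reset", 27),
    ("http-transaction", "incomplete::no-close", 119),
    ("tls-session", "success::tcp-fin", 552),
    ("tls-session", "failure::tls-alert", 397),
    ("tls-session", "failure::tcp-reset", 74),
    ("tls-session", "incomplete::no-close", 421),
    ("ssh-session", "success::tcp-fin", 22),
    ("dhcp-transaction", "success::ack", 7),
    ("dhcp-transaction", "incomplete::no-close", 22),
]

BAD_PREFIXES = ("failure::", "incomplete::")


def rank_by_bad_rate(summary):
    """Return (type, total, bad, rate) rows sorted by descending bad rate."""
    totals = defaultdict(int)
    bad = defaultdict(int)
    for session_type, outcome, count in summary:
        totals[session_type] += count
        if outcome.startswith(BAD_PREFIXES):
            bad[session_type] += count
    rows = [(t, totals[t], bad[t], bad[t] / totals[t]) for t in totals]
    return sorted(rows, key=lambda row: row[3], reverse=True)


for session_type, total, bad, rate in rank_by_bad_rate(SESSION_SUMMARY):
    print(f"{session_type:18} {total:>6} {bad:>5} {rate:6.1%}")
```

Any outcome prefix other than failure:: or incomplete:: counts as good here, so later runs that introduce late:: or timeout:: outcomes would need those prefixes added to BAD_PREFIXES if they should count against a protocol.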

DHCP — incomplete::no-close (session 0x0446379d, 172.16.133.87)

Symptom: client sent a DHCPINFORM broadcast to 255.255.255.255:67 and no server reply (ACK or NAK) is in the capture.
Root cause: DHCPINFORM is a one-shot config request that terminates on ACK; the FXT marks sessions complete on ACK / NAK / Release / Decline. Most “incomplete” DHCP rows are unanswered INFORM broadcasts — either no DHCP server services INFORM on this segment, or the unicast reply was outside the capture point.

TLS — failure::tls-alert (session tcp.stream=560, 172.16.133.57 → 96.43.146.48:443)

Symptom: a full SSL 3.0 handshake completes, application data flows, then the client sends a TLS Alert and the session closes.
Root cause: the alert is sent after Change Cipher Spec, so its payload is encrypted; tshark only exposes the bare tls.record opcode, so the FXT cannot distinguish a benign close_notify from a fatal alert and conservatively classifies it as failure. The bulk of the 397 failure::tls-alert sessions are almost certainly normal encrypted close_notify alerts, not real handshake failures; the count for the genuine handshake-time signal, failure::tls-alert-fatal, is 0.

DNS — failure::dns-error (session 0x8df0, 172.16.133.6 → 8.8.8.8)

Symptom: PTR query for 45.66.120.96.in-addr.arpa returns rcode 2 (SERVFAIL) from 8.8.8.8 after ~260 ms.
Root cause: 8.8.8.8 could not obtain an answer from the authoritative nameservers for the reverse zone covering 96.120.66.45 (most likely a lame delegation or unreachable nameservers), so it returns SERVFAIL; a zone that was reachable but simply lacked the PTR record would normally answer NXDOMAIN instead. The 301 DNS errors are dominated by reverse lookups on third-party IPs, so this is an upstream zone problem rather than a local resolver problem.

Exec summary — ticket-ready. CI/CD traffic analysis of bigFlows.pcap shows three protocols above the 5 % failure / incomplete threshold, but only one is a real concern. DHCP (75.9 %) is misleading — 22 of 29 sessions are unanswered DHCPINFORM broadcasts that never complete by design; investigate only if agents depend on INFORM responses. TLS (61.8 %) is largely a measurement artefact: 397 of 892 “failures” are encrypted post-handshake alerts (most are normal close_notify) that tshark cannot decrypt without keys; only 74 hard TCP resets and 0 fatal handshake alerts indicate genuine connectivity problems. DNS (15.1 %) shows 301 SERVFAILs concentrated on PTR / reverse lookups against third-party IPs through 8.8.8.8 — an upstream zone-config issue, not a pipeline failure. HTTP (2.9 %) and SSH (0 %) are healthy. Action: before treating TLS/DHCP as incidents, refine the FXT (decryption keylog for TLS, or scope DHCPINFORM as a non-failure outcome); the only real signal worth investigating is the cluster of 27 HTTP TCP-resets and 65 4xx responses.

07 User: Drill into HTTP failures

Let us drill into the HTTP failures (2.9 %). What kind of failures are we seeing at the HTTP and TCP levels? Are we able to correlate errors across layers?

08 Claude: HTTP × TCP cross-layer correlation

Failure mix (214 / 7,486 = 2.9 %)

Category	Count	Layer
4xx client error (404 ≈ 55, 403 ≈ 7, 400 ≈ 2)	65	HTTP
5xx server error (one each: 500, 503, 504)	3	HTTP
TCP RST mid-transaction	27	TCP
Incomplete (request, no response)	119	both

HTTP layer

404s dominate (~85 % of 4xx) — scattered across many hosts (content.plymedia.com, i.ytimg.com, ad servers, monster.com). Content-not-found errors from third-party endpoints; noise from browser / app traffic, not pipeline failures.
403s (~7) — auth / permission denials, again third-party.
5xx are rare and uncorrelated — three different upstreams: content.plymedia.com (500 Internal), sametime.lotus.com (504 Gateway Timeout on a POST), jdn.monster.com (503 Service Unavailable). No shared host.

TCP layer

27 RSTs, several arriving in tight clusters on adjacent tcp.stream IDs to the same destination: streams 17970/17971 → 173.194.43.36 (i.ytimg.com), two YouTube thumbnail GETs reset by the client ~70 ms in. Streams 8306–8313: 8 consecutive resets within a few hundred ms. Streams 5368–5372: similar burst.
RST direction is mixed — some server-originated (immediate, before response: stream 132 to pandora.com, RST 254 µs after GET), some client-originated (stream 17970, client RSTs after no response in 70 ms).
No retransmissions / zero-windows appear in the failure NDJSONs — the unscoped TCP-analysis templates fired during render but did not co-occur within failing HTTP sessions. The RSTs are not congestion-driven; they are abrupt aborts.

Cross-layer correlation — yes, by tcp.stream

The FXT structure makes correlation cheap: every HTTP transaction and the TCP RST that ends it share the same flow.key = http-transaction::<tcp.stream>. Three patterns are visible:

HTTP request → TCP RST in same session (the 27 RST failures): request line and RST appear in the same NDJSON file with the same flow.key. Stream 132 (pandora.com) shows GET at frame 889, RST at frame 891 — 254 µs gap, server-side connection refusal.
HTTP 5xx → graceful FIN (no RST): all 3 5xx sessions end normally at TCP — the server sent the error and closed cleanly. No TCP-layer signal would have caught these.
Burst RSTs to one host: streams 17970/17971 to 173.194.43.36 (both YouTube thumbnails, same client, same second) — suggests a per-destination throttle or blocklist, not a generic network event.
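
The first pattern (request and RST under one flow key) can be checked mechanically. The Python sketch below groups NDJSON records by flow.key and reports the sessions that contain both an HTTP request and a TCP RST. Only the flow.key name and the stream 132 frame numbers come from the analysis above; the "frame" and "event" field names, the stream 200 rows, and reading all records from one string rather than one file per session are illustrative assumptions.

```python
import json
from collections import defaultdict

# Hypothetical NDJSON lines. Only "flow.key" is named in the analysis;
# "frame" and "event" are placeholder field names.
SAMPLE_NDJSON = """\
{"flow.key": "http-transaction::132", "frame": 889, "event": "http-request"}
{"flow.key": "http-transaction::132", "frame": 891, "event": "tcp-rst"}
{"flow.key": "http-transaction::200", "frame": 1500, "event": "http-request"}
{"flow.key": "http-transaction::200", "frame": 1504, "event": "http-response"}
"""


def sessions_with_request_and_rst(ndjson_text):
    """Map flow.key -> sorted frame numbers for sessions holding both events."""
    records_by_key = defaultdict(list)
    for line in ndjson_text.splitlines():
        if line.strip():
            record = json.loads(line)
            records_by_key[record["flow.key"]].append(record)

    hits = {}
    for key, records in records_by_key.items():
        kinds = {record["event"] for record in records}
        if {"http-request", "tcp-rst"} <= kinds:
            hits[key] = sorted(record["frame"] for record in records)
    return hits


print(sessions_with_request_and_rst(SAMPLE_NDJSON))
```

Because the join is on the flow key alone, the same grouping would work for any pair of event kinds, for example an HTTP 5xx response together with a graceful FIN.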

09 User: Thresholds: p95 + 60 s idle

Treat sessions whose duration exceeds the 95th percentile as “late,” and treat sessions with more than 1 minute of inactivity as “timed out.” Update sessions.fxt.xml with appropriate max-duration and idle-period values to reflect these thresholds, then run extract_sessions against the updated FXT file.

10 Claude: Apply p95 + idle-timeout=60

Computed the p95 of observed session durations per type from the first run's NDJSONs, then set max-duration to that p95 rounded up (to the next whole second; to 300 s for DHCP) and idle-timeout = 60 s on every session-type:

Session type	n	median	p95	max	max-duration	idle-timeout
dns-resolve	2,094	0.041 s	2.95 s	167.4 s	3 s	60 s
tls-session	1,440	5.89 s	175.7 s	293.1 s	176 s	60 s
http-transaction	7,371	0.070 s	0.54 s	183.5 s	1 s	60 s
ssh-session	22	7.71 s	11.84 s	11.88 s	12 s	60 s
dhcp-transaction	9	0.0 s	294.0 s	294.0 s	300 s	60 s
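
A minimal Python sketch of how such a threshold could be derived from baseline durations follows. The nearest-rank percentile method, the function names, and the sample list are assumptions, since the text does not say which percentile definition was used. Rounding up to a whole second reproduces the DNS, TLS, HTTP and SSH rows of the table; the DHCP row uses a rounder 300 s.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with >= pct% of data at or below it."""
    if not values:
        raise ValueError("need at least one duration")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def max_duration_from_baseline(durations_s, pct=95):
    """Round the percentile up to a whole second, never below 1 s."""
    return max(1, math.ceil(percentile_nearest_rank(durations_s, pct)))


# Hypothetical baseline durations in seconds for one session type.
sample = [0.03, 0.04, 0.05, 0.06, 0.07, 0.07, 0.08, 0.09, 0.10, 0.12,
          0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.54, 183.5]

print(percentile_nearest_rank(sample, 95))   # p95 of the sample
print(max_duration_from_baseline(sample))    # value to use as max-duration
```

With a single long outlier in the sample, the p95 stays near the bulk of the distribution, which is why a percentile is a steadier basis for max-duration than the observed maximum.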

Outcome shifts after re-extracting

DNS: 169 sessions reclassified: 65 late:: (slow reverse-DNS PTR lookups that eventually returned) + 104 timeout::max-duration (no response within 3 s). The earlier failure::dns-error count dropped from 301 to 236; the 65-session difference is the SERVFAILs that arrived after 3 s and are now counted as late.
TLS: 334 long-lived sessions reclassified — 128 timeout::max-duration + 206 timeout::idle 60s. Previously categorised as success::tcp-fin or incomplete, but actually sat idle for minutes inside keep-alive.
HTTP: 311 reclassified: 182 late + 129 timeout. The 1 s threshold (p95 = 0.54 s, rounded up) is tight on a noisy mixed-traffic capture.
DHCP: 18 of the previous 22 incomplete::no-close now properly labeled timeout::idle 60s — the 60 s gap caught unanswered INFORM / Discover broadcasts cleanly.
SSH: unchanged — all 22 sessions completed well under 12 s.

11 User: Pattern hunting

Do we see a pattern in the late and timeout sessions?

12 Claude: Three distinct patterns, one real problem

DNS — same reverse zone is dragging both buckets

DNS late (65 sessions, 6–13 s) and DNS timeout (104 sessions, no response in 3 s) are the same problem at two latency cutoffs: the reverse-PTR 45.66.120.96.in-addr.arpa accounts for 75 / 133 late queries and 65 / 122 timeouts. Internal resolver 172.16.128.202 carries 60–94 % of both buckets — it is retrying upstream for ~12 s before either giving up (timeout) or eventually returning SERVFAIL (late). 88 of 122 timeouts and 68 of the late ones are PTR records.
Pattern: one specific reverse zone is unauthoritative or unreachable upstream; the internal resolver keeps eating the retry budget.

HTTP — “late” is mostly long-poll, “timeout” is real

HTTP late (182, p50 = 2.17 s, max = 167 s) is dominated by long-poll endpoints: notify17/24/33.dropbox.com (16 sessions, Dropbox's HTTP long-polling notification channel), realtime.services.disqus.com, and high-latency content sites (LinkedIn × 35, WSJ × 11, TripAdvisor × 7). The 167 s max matches typical 60–120 s long-poll holds — these are working as intended, not slow.
HTTP timeout (129, duration = 0) are genuine “request sent, response never arrived within 1 s” — spread across many hosts with no single dominant target.
Pattern: the 1 s max-duration is too tight for HTTP traffic that legitimately uses long-polling. Late ≠ slow here.

TLS — Salesforce keep-alive sessions dominate

206 idle-timeouts + 128 max-duration timeouts, top SNIs: na2.salesforce.com (61), umps2c2.salesforce.com + variants (56), c.na2.visual.force.com + iv.na2.visual.force.com (20). Salesforce accounts for ~140 of 334 TLS timeouts (~42 %). Google services add another 31: safebrowsing.google.com (13), www.google.com (10), plus.google.com (8). The remaining sessions go to other SNIs.
Pattern: SaaS-style HTTPS keep-alives (Salesforce push / poll + Google Safe Browsing background) hold TCP connections idle between bursts. They are long-lived by design.

DHCP — unanswered broadcasts

All 18 DHCP timeouts target 255.255.255.255 (broadcast). p95 duration = 10 s suggests clients retry every few seconds within the 60 s window before giving up.
Pattern: either no DHCP server is serving these subnets, or the relay / server is not replying to INFORM (which is legitimate — many DHCP servers ignore INFORM).

Bottom line — three distinct phenomena hide behind “late / timeout”

Real infrastructure issue — DNS resolver 172.16.128.202 ↔ specific upstream reverse zones (chiefly 96.120.66.0/24). Worth fixing.
Threshold misfit, not failure — HTTP “late” is long-polling; TLS “timeout” is keep-alive. Either raise max-duration (HTTP → 60 s, TLS → 600 s) or scope these timeouts to requests rather than connections.
Architectural quirk, not failure — DHCP INFORM broadcasts with no responder; this is environmental.

13 User: Split long-poll out

Split http-transaction into short-poll vs long-poll.

14 Claude: http-longpoll via Host-header regex

Added an http-longpoll session-type qualified by http.host value-regex on the known long-poll endpoints (Dropbox change-notify, Disqus realtime), declared before the generic http-transaction so the request packet (which carries the Host header) qualifies into long-poll first. Wider thresholds reflect intentional 60–120 s server holds:

<session-type name="http-longpoll" max-duration="180" idle-timeout="120">
  <qualify logic="or">
    <field value-regex="notify\d*\.dropbox\.com">http.host</field>
    <field value-regex="realtime\.services\.disqus\.com">http.host</field>
  </qualify>
  <key>
    <protocol><field>tcp.stream</field></protocol>
  </key>
</session-type>

What landed in http-longpoll

Outcome	Count
timeout::exceeded max duration of 180 s	12
incomplete::no-close	9

13 distinct tcp.streams, all matching the targeted set: notify3/12/17/22/24/32/33/34.dropbox.com and realtime.services.disqus.com. The qualifier hit precisely what was intended.
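
The two value-regex patterns can be sanity-checked outside VisualEther before running an extraction. The Python sketch below assumes the pattern must match the whole http.host value (fullmatch); how FXT actually anchors value-regex is not stated here, and the helper name and the www.dropbox.com negative example are hypothetical.

```python
import re

# Patterns copied from the <qualify> block of the http-longpoll session-type.
LONGPOLL_HOST_PATTERNS = [
    re.compile(r"notify\d*\.dropbox\.com"),
    re.compile(r"realtime\.services\.disqus\.com"),
]


def is_longpoll_host(host):
    # Assumption: value-regex has to match the entire field value.
    return any(pattern.fullmatch(host) for pattern in LONGPOLL_HOST_PATTERNS)


for host in ["notify17.dropbox.com", "realtime.services.disqus.com",
             "www.dropbox.com", "i.ytimg.com"]:
    print(host, is_longpoll_host(host))
```

Note that \d* also accepts a bare notify.dropbox.com with no digits. If the engine uses unanchored search semantics instead, the patterns would additionally match longer host names that merely contain these strings.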

Notably 0 long-poll sessions completed with 2xx within 180 s — every captured long-poll either timed out the FXT max-duration (12) or had no response in the capture window (9). That is the signature of working long-polling: the server holds the connection for its full timeout, then either the capture ends or another request starts.

The caveat — multi-match shadows the http-transaction counts

FXT <qualify> supports <not-exists> for field absence but not value-regex negation, so with multi-match="true" (required so TCP RST can close app-layer sessions) each long-poll request packet qualifies for both session-types and creates two parallel sessions with different flow keys (http-longpoll::<stream> vs http-transaction::<stream>).

So:

http-longpoll is a clean, correctly-classified bucket of long-poll behaviour — useful for monitoring.
http-transaction still shadow-counts those streams as late / timeout. Its “short-poll” stats are not cleaned up by the split.
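
If a clean short-poll view is needed without changing the FXT, it can be derived after extraction by subtracting the long-poll streams from the http-transaction streams. This Python sketch relies only on the flow-key format shown above (session type, then ::, then the tcp.stream number); the function names and stream numbers are hypothetical.

```python
def parse_flow_key(flow_key):
    """Split '<session-type>::<tcp.stream>' into (session_type, stream)."""
    session_type, _, stream = flow_key.partition("::")
    return session_type, int(stream)


def split_http_streams(flow_keys):
    """Return the superset, the long-poll subset, and the short-poll remainder."""
    longpoll, all_http = set(), set()
    for key in flow_keys:
        session_type, stream = parse_flow_key(key)
        if session_type == "http-longpoll":
            longpoll.add(stream)
        elif session_type == "http-transaction":
            all_http.add(stream)
    return {
        "all_http": all_http,               # superset: never add longpoll to it
        "longpoll": longpoll,
        "shortpoll_only": all_http - longpoll,
    }


# Hypothetical stream numbers for illustration.
keys = ["http-transaction::10", "http-transaction::11",
        "http-transaction::12", "http-longpoll::11"]
views = split_http_streams(keys)
print(len(views["all_http"]), len(views["longpoll"]), len(views["shortpoll_only"]))
```

The set difference keeps the superset relationship explicit: the total HTTP stream count is the size of all_http, not the sum of the two buckets.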

15 User: Accept the double-count

The current double counting is acceptable as long as the team understands that http-transaction is the superset.

16 Claude: Document the overlap in the FXT

Acknowledged. Leaving multi-match="true" and the overlapping classification in place. Added a self-documenting comment at the http-longpoll declaration so the next reader sees the relationship spelled out:

<!-- Long-poll HTTP endpoints (Dropbox change-notify, Disqus realtime).
     Declared BEFORE the generic http-transaction so the request packet
     (which carries http.host) qualifies into long-poll first; the
     tcp.stream key then binds the response leg to the same session.
     Wider thresholds reflect intentional 60-120s server holds.

     NOTE: with multi-match="true" (required so TCP RST can close
     app-layer sessions), each long-poll request also qualifies for
     http-transaction below and creates a parallel session there.
     http-transaction is therefore the SUPERSET (all HTTP, including
     long-poll); http-longpoll is a focused sub-view. Do not sum the
     two - they overlap by design. -->

Method. bigFlows.pcap was obtained from the tcpreplay sample captures and analysed end-to-end through the VisualEther MCP server with Claude. analyze_capture detected the protocols present, from which the CI/CD-relevant subset was chosen; the resulting bigflows-sessions.fxt.xml was validated with validate_fxt, then driven through extract_sessions in three rounds: baseline, p95 thresholds, and the long-poll split. Each round's NDJSON output was sampled to derive the patterns (PTR SERVFAIL, Salesforce SNI, Dropbox notify Host header), and every conclusion is grounded in those per-session records.