
Commit c27bd51

Performance optimization using eBPF aggregation (#39)
Authored by praveingk and Mario Macias
* Add ebpf flows with per-cpu hashmap
* Add flags to perform eviction
* Add protocol to the evicted entry
* Add new headers to record
* tracer looksup the map on evicted entry
* Handle tcp reset flag
* Handle rst flag
* Enable timeout based eviction of ingress/egress maps
* ongoing performance measurements
* Add in logic in README
* Formatting changes to readme
* Format changes
* Correct the byte count
* Minor edits for tracer
* ebpf code with right byte count
* Update README.md
* Update README.md
* Latest measurements with multiflow and cpu,mem
* Add chart for throughput
* Cosmetic changes to measurements
* v6 support
* Logic for time calculation
* Cleanup of ebpf code
* Remove prints and cleanup
* Minor comments
* Remove unused lines
* bpf binary minor edit
* Alignment corrections
* Remove extra comments

Co-authored-by: Mario Macias <[email protected]>

* Refactor my_flow_id to id
* bug: Add direction variable while export
* Remove printf
* Handle hash collisions and improper deletions
* Remove stray debug entry
* Correct lint errors
* Add monotime module for CLOCK_MONOTONIC
* Use monotime instead of cgo
* Add monotime to go.sum
* Add monotime package to vendor
* tidy imports
* EvictionTimeout as duration
* fixed testmain
* Fixed tests
* fixing getPingFlows verification
* modify comment
* fix e2e test version
* Fix flow timing issue with reference time
* Tidy to fix lint errors
* fix errors in merge
* Remove TCP flag based eviction
* Remove TCP FIN/RST handling
* Remove redundant fields
* Simplify eBPF code optimizations (#49)
* simplifying ebpf agent
* version not really working well
* almost-working tests but I suspect that monotonic time could be doing bad stuff there
* not-yet-100% working tests
* reusing same bpfObjects for all the interfaces. That should decrease memory usage
* define max_entries at userspace
* avoid if/elses in C code map operations
* fixed compilation of unit tests
* wip: re-enable ring-buffer flows
* moved accounter inside tracer
* one single tracer for all the qdiscs and filters
* evict flows on ringbuffer
* Minor changes
* properly tested (and fixed) userspace accounter
* move eBPF system setup to flowtracer creation
* Discard old flows from being aggregated
* Fix zero-valued aggregated flows
* fix timestamp checking for flow discarding
* Unify ingress/egress maps
* Fixed build and test
* Updated generated eBPF binaries
* fix bug that caused that first flow could have start time == 0

Co-authored-by: Pravein <Pravein Govindan Kannan>
Co-authored-by: Mario Macias <[email protected]>
1 parent: 036919c · commit: c27bd51

43 files changed: +1348 additions, −661 deletions

Makefile

Lines changed: 2 additions & 2 deletions
```diff
@@ -81,8 +81,8 @@ generate: prereqs
 .PHONY: docker-generate
 docker-generate:
 	@echo "### Creating the container that generates the eBPF binaries"
-	docker build . -f scripts/generators.Dockerfile -t $(LOCAL_GENERATOR_IMAGE)
-	docker run --rm -v $(shell pwd):/src $(LOCAL_GENERATOR_IMAGE)
+	$(OCI_BIN) build . -f scripts/generators.Dockerfile -t $(LOCAL_GENERATOR_IMAGE)
+	$(OCI_BIN) run --rm -v $(shell pwd):/src $(LOCAL_GENERATOR_IMAGE)
 
 .PHONY: build
 build: prereqs fmt lint test vendors compile
```

bpf/README.md

Lines changed: 50 additions & 0 deletions
## Flows v2: An improved version of Netobserv eBPF Agent

### What Changed?

In the eBPF/TC code, v1 used a ringbuffer to export flow records to the userspace program.
Based on our measurements, the ringbuffer can become a bottleneck, since a record for each packet in the data-path needs to be sent to userspace, which eventually results in loss of records.
Additionally, this leads to high CPU utilization, since the userspace program has to stay constantly active to handle callback events on a per-packet basis.
Refer to the [Measurements slide-deck](../docs/measurements.pptx) for performance measurements.

To tackle this and achieve 100% monitoring coverage, the v2 eBPF/TC code uses a per-CPU hash map to aggregate flow-based records in the eBPF data-path, and proactively sends the records to userspace upon flow termination. The detailed logic is below:
#### eBPF Data-path Logic:

1) Store flow information in a per-CPU hash map. Separate per-CPU hash maps are maintained for ingress and egress to avoid performance bottlenecks.
   One design choice that still needs to be settled with performance measurements is whether v4 and v6 IPs should be kept in the same map or in separate ones.
   At a higher level, we also need to check whether increasing the map size (and hence the hash computation cost) affects throughput.
2) Upon packet arrival, a lookup is performed on the map.
   * If the lookup is successful, update the packet count, byte count, and the current timestamp.
   * If the lookup is unsuccessful, try creating a new entry in the map.
3) If entry creation failed because the map is full, send the entry to the userspace program via the ringbuffer.
4) Upon flow completion (tcp->fin/rst event), send the flow-id to userspace via the ringbuffer.
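The steps above can be modeled in userspace Go as a minimal sketch (not the agent's actual code; `FlowID`, `FlowMetrics`, and `maxEntries` are simplified stand-ins for the eBPF map types in `bpf/flow.h`):

```go
package main

import "fmt"

// FlowID is a simplified stand-in for the eBPF flow_id key.
type FlowID struct {
	SrcPort, DstPort uint16
	Protocol         uint8
}

// FlowMetrics mirrors the aggregated per-flow counters.
type FlowMetrics struct {
	Packets            uint32
	Bytes              uint64
	StartTime, EndTime uint64
}

const maxEntries = 2 // tiny capacity, to demonstrate the ringbuffer fallback

// account models the data-path logic: update an existing entry, create a
// new one, or report that the packet must be exported directly (step 3)
// because the map is full.
func account(flows map[FlowID]*FlowMetrics, id FlowID, bytes, now uint64) (direct bool) {
	if m, ok := flows[id]; ok { // step 2: lookup hit — aggregate in place
		m.Packets++
		m.Bytes += bytes
		m.EndTime = now
		return false
	}
	if len(flows) >= maxEntries { // step 3: map full — send via ringbuffer
		return true
	}
	flows[id] = &FlowMetrics{Packets: 1, Bytes: bytes, StartTime: now, EndTime: now}
	return false
}

func main() {
	flows := map[FlowID]*FlowMetrics{}
	id := FlowID{SrcPort: 443, DstPort: 5000, Protocol: 6}
	account(flows, id, 100, 1)
	account(flows, id, 200, 2)
	fmt.Printf("packets=%d bytes=%d\n", flows[id].Packets, flows[id].Bytes) // packets=2 bytes=300
}
```

In the real data path this map lives in the kernel, so the "direct" branch corresponds to `bpf_ringbuf_reserve`/`bpf_ringbuf_submit` in `bpf/flows.c`.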
##### Hash collisions

One downside of using a hash-based map is that, when flows are hashed into the per-CPU map, hash collisions can occur, causing multiple different flows to map to the same entry and hence leading to inaccurate flow entries. To handle hash collisions we do the following:
1) In each flow entry, we additionally maintain the full key/id.
2) Before a packet's entry is updated in the map, the stored key is compared against the packet's key to check whether another flow is already residing in that entry.
3) If another flow is there, we do not want to update the wrong entry. Hence, we send the new packet entry directly to userspace via the ringbuffer, after setting a flag to report the collision.
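The collision check can be pictured with a toy open-addressed table in Go (a hypothetical sketch; the real check happens inside the eBPF map update in `bpf/flows.c`):

```go
package main

import "fmt"

// entry stores the full flow key alongside its counters, so a bucket hit
// can be verified before updating (steps 1 and 2 above).
type entry struct {
	key   string
	bytes uint64
}

const nBuckets = 4

// bucket is a toy hash function mapping a key to a bucket index.
func bucket(key string) int {
	h := 0
	for _, c := range key {
		h = (h*31 + int(c)) % nBuckets
	}
	return h
}

// upsert returns collision=true when the bucket already holds a different
// flow; the caller would then export the packet directly via the
// ringbuffer with a collision flag (step 3) instead of corrupting the
// resident entry.
func upsert(buckets []*entry, key string, bytes uint64) (collision bool) {
	b := bucket(key)
	switch {
	case buckets[b] == nil:
		buckets[b] = &entry{key: key, bytes: bytes}
	case buckets[b].key == key:
		buckets[b].bytes += bytes
	default:
		return true // a different flow resides here — leave it untouched
	}
	return false
}

func main() {
	buckets := make([]*entry, nBuckets)
	fmt.Println(upsert(buckets, "flowA", 100)) // false: new entry created
	fmt.Println(upsert(buckets, "flowA", 50))  // false: same key, aggregated
}
```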
#### User-space program Logic: (Refer [tracer.go](../pkg/ebpf/tracer.go))

The userspace program has three active threads:

1) **Trace**:
   a) If the flow-id received from the eBPF data-path via the ringbuffer is a flow completion (indicated via the flags), it does the following:
      * ScrubFlow: performs a lookup of the flow-id in the ingress/egress map, aggregates the metrics from the different per-CPU counters, and then deletes the entry corresponding to the flow-id from the map.
      * Exports the aggregated flow record to the accounter pipeline.
   b) If the received flow-id is not a flow completion event, it just forwards the record to the accounter pipeline; the accounter will aggregate it later, upon flow completion.
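A hedged Go sketch of the Trace loop, modeled with channels (`Record`, `scrub`, and the channels are illustrative, not tracer.go's actual API):

```go
package main

import "fmt"

// Record models an event read from the ringbuffer.
type Record struct {
	ID        string
	Completed bool // set by the eBPF side on FIN/RST
	Packets   uint32
	Bytes     uint64
}

// trace drains ringbuffer events: completions trigger a scrub (aggregate
// the per-CPU counters and delete the map entry) and export of the
// aggregate; all other records are forwarded unchanged for the accounter
// to aggregate later.
func trace(events <-chan Record, scrub func(id string) Record, export chan<- Record) {
	for ev := range events {
		if ev.Completed {
			export <- scrub(ev.ID)
		} else {
			export <- ev
		}
	}
}

func main() {
	events := make(chan Record, 2)
	export := make(chan Record, 2)
	events <- Record{ID: "a", Completed: true}
	events <- Record{ID: "b", Bytes: 10}
	close(events)
	// scrub stub: pretend the map held an aggregate for flow "a"
	trace(events, func(id string) Record {
		return Record{ID: id, Completed: true, Packets: 5, Bytes: 500}
	}, export)
	fmt.Println((<-export).Bytes, (<-export).Bytes) // 500 10
}
```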
2) **MonitorIngress**:
   This is a periodic thread that wakes up every n seconds and does the following:
   a) Creates a map iterator and iterates over each entry in the map.
   b) Evicts an entry if the condition is met:
      * The timestamp of the last seen packet in the flow is more than m seconds ago.
      * Other eviction policies could also be implemented, for example based on the packets/bytes observed, or a more aggressive eviction when the map is k% full. These are further improvements to fine-tune map usage for the scenario and use-case.
   c) Each evicted entry is aggregated into a flow-record and forwarded to the accounter pipeline.
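The timeout-based eviction in step b) could look like this in Go (a sketch only; `lastSeen` mapping flow-id to last-seen timestamp is a hypothetical simplification of the per-CPU map's values):

```go
package main

import (
	"fmt"
	"sort"
)

// evictStale deletes flows whose last packet was seen more than timeout
// nanoseconds ago and returns their ids so they can be aggregated into
// flow records for the accounter pipeline.
func evictStale(lastSeen map[string]uint64, now, timeout uint64) []string {
	var evicted []string
	for id, ts := range lastSeen {
		if now-ts > timeout {
			evicted = append(evicted, id)
			delete(lastSeen, id)
		}
	}
	sort.Strings(evicted) // deterministic order for display
	return evicted
}

func main() {
	lastSeen := map[string]uint64{"old": 100, "fresh": 950}
	fmt.Println(evictStale(lastSeen, 1000, 500)) // [old]
}
```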
3) **MonitorEgress**:
   This is a periodic thread that performs the same task as MonitorIngress, but on the egress map.
##### Hash Collision handling in user-space

Despite handling hash collisions in the eBPF datapath, multiple flows can still map to the same entry, because the per-CPU map maintains a separate value per CPU: flows from different CPUs may land in the same map entry while residing in different per-CPU slots. Hence, during aggregation, we check the key before summing the per-CPU values of a flow. When a mismatching slot is detected, we export that entry to the accounter on its own; since the full flow key is stored along with each entry, such collided entries can be recovered and sent to the accounter.
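The per-CPU aggregation with the key check might be sketched as follows (illustrative; `cpuSlot` stands in for one CPU's value in the per-cpu map):

```go
package main

import "fmt"

// cpuSlot models one CPU's value in the per-cpu hash map: the full flow
// key stored alongside the counters, as described above.
type cpuSlot struct {
	Key     string
	Packets uint32
	Bytes   uint64
}

// scrubFlow sums the slots whose stored key matches id; slots holding a
// different flow (a cross-CPU collision) are returned separately so they
// can be exported to the accounter on their own.
func scrubFlow(id string, slots []cpuSlot) (total cpuSlot, collided []cpuSlot) {
	total.Key = id
	for _, s := range slots {
		switch {
		case s.Key == id:
			total.Packets += s.Packets
			total.Bytes += s.Bytes
		case s.Packets > 0:
			collided = append(collided, s)
		}
	}
	return total, collided
}

func main() {
	slots := []cpuSlot{
		{Key: "f1", Packets: 2, Bytes: 200}, // CPU 0
		{Key: "f1", Packets: 1, Bytes: 50},  // CPU 1
		{Key: "f2", Packets: 4, Bytes: 400}, // CPU 2: a different flow collided here
	}
	total, collided := scrubFlow("f1", slots)
	fmt.Println(total.Packets, total.Bytes, len(collided)) // 3 250 1
}
```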

bpf/flow.h

Lines changed: 30 additions & 27 deletions
```diff
@@ -10,37 +10,40 @@ typedef __u16 u16;
 typedef __u32 u32;
 typedef __u64 u64;
 
-// L2 data link layer
-struct data_link {
+typedef struct flow_metrics_t {
+    u32 packets;
+    u64 bytes;
+    // Flow start and end times as monotomic timestamps in nanoseconds
+    // as output from bpf_ktime_get_ns()
+    u64 start_mono_time_ts;
+    u64 end_mono_time_ts;
+} __attribute__((packed)) flow_metrics;
+
+// Attributes that uniquely identify a flow
+typedef struct flow_id_t {
+    u16 eth_protocol;
+    u8 direction;
+    // L2 data link layer
     u8 src_mac[ETH_ALEN];
     u8 dst_mac[ETH_ALEN];
-} __attribute__((packed));
-
-// L3 network layer
-// IPv4 addresses are encoded as IPv6 addresses with prefix ::ffff/96
-// as described in https://datatracker.ietf.org/doc/html/rfc4038#section-4.2
-struct network {
+    // L3 network layer
+    // IPv4 addresses are encoded as IPv6 addresses with prefix ::ffff/96
+    // as described in https://datatracker.ietf.org/doc/html/rfc4038#section-4.2
     struct in6_addr src_ip;
     struct in6_addr dst_ip;
-} __attribute__((packed));
-
-// L4 transport layer
-struct transport {
+    // L4 transport layer
     u16 src_port;
     u16 dst_port;
-    u8 protocol;
-} __attribute__((packed));
-
-// TODO: L5 session layer to bound flows to connections?
-
-// contents in this struct must match byte-by-byte with Go's pkc/flow/Record struct
-struct flow {
-    u16 protocol;
-    u8 direction;
-    struct data_link data_link;
-    struct network network;
-    struct transport transport;
-    u64 bytes;
-} __attribute__((packed));
-
+    u8 transport_protocol;
+    // OS interface index
+    u32 if_index;
+} __attribute__((packed)) flow_id;
+
+// Flow record is a tuple containing both flow identifier and metrics. It is used to send
+// a complete flow via ring buffer when only when the accounting hashmap is full.
+// Contents in this struct must match byte-by-byte with Go's pkc/flow/Record struct
+typedef struct flow_record_t {
+    flow_id id;
+    flow_metrics metrics;
+} __attribute__((packed)) flow_record;
 #endif
```

bpf/flows.c

Lines changed: 120 additions & 57 deletions
```diff
@@ -1,10 +1,33 @@
+/*
+    Flows v2. A Flow-metric generator using TC.
+
+    This program can be hooked on to TC ingress/egress hook to monitor packets
+    to/from an interface.
+
+    Logic:
+        1) Store flow information in a per-cpu hash map.
+        2) Upon flow completion (tcp->fin event), evict the entry from map, and
+           send to userspace through ringbuffer.
+           Eviction for non-tcp flows need to done by userspace
+        3) When the map is full, we send the new flow entry to userspace via ringbuffer,
+           until an entry is available.
+        4) When hash collision is detected, we send the new entry to userpace via ringbuffer.
+*/
+#include <linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
 #include <linux/ip.h>
+#include <linux/if_ether.h>
 #include <linux/ipv6.h>
-#include <linux/in.h>
-#include <linux/tcp.h>
+#include <linux/icmp.h>
+#include <linux/icmpv6.h>
 #include <linux/udp.h>
-#include <linux/bpf.h>
-#include <linux/types.h>
+#include <linux/tcp.h>
+#include <string.h>
+
+#include <stdbool.h>
 #include <linux/if_ether.h>
 
 #include <bpf_helpers.h>
@@ -19,43 +42,51 @@
 #define INGRESS 0
 #define EGRESS 1
 
-// TODO: for performance reasons, replace the ring buffer by a hashmap and
-// aggregate the flows here instead of the Go Accounter
+// Common Ringbuffer as a conduit for ingress/egress flows to userspace
 struct {
     __uint(type, BPF_MAP_TYPE_RINGBUF);
     __uint(max_entries, 1 << 24);
-} flows SEC(".maps");
+} direct_flows SEC(".maps");
+
+// Key: the flow identifier. Value: the flow metrics for that identifier.
+// The userspace will aggregate them into a single flow.
+struct {
+    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
+    __type(key, flow_id);
+    __type(value, flow_metrics);
+} aggregated_flows SEC(".maps");
 
 // Constant definitions, to be overridden by the invoker
 volatile const u32 sampling = 0;
 
 const u8 ip4in6[] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff};
 
 // sets flow fields from IPv4 header information
-static inline int fill_iphdr(struct iphdr *ip, void *data_end, struct flow *flow) {
+static inline int fill_iphdr(struct iphdr *ip, void *data_end, flow_id *id) {
     if ((void *)ip + sizeof(*ip) > data_end) {
         return DISCARD;
     }
 
-    __builtin_memcpy(flow->network.src_ip.s6_addr, ip4in6, sizeof(ip4in6));
-    __builtin_memcpy(flow->network.dst_ip.s6_addr, ip4in6, sizeof(ip4in6));
-    __builtin_memcpy(flow->network.src_ip.s6_addr + sizeof(ip4in6), &ip->saddr, sizeof(ip->saddr));
-    __builtin_memcpy(flow->network.dst_ip.s6_addr + sizeof(ip4in6), &ip->daddr, sizeof(ip->daddr));
-    flow->transport.protocol = ip->protocol;
-
+    __builtin_memcpy(id->src_ip.s6_addr, ip4in6, sizeof(ip4in6));
+    __builtin_memcpy(id->dst_ip.s6_addr, ip4in6, sizeof(ip4in6));
+    __builtin_memcpy(id->src_ip.s6_addr + sizeof(ip4in6), &ip->saddr, sizeof(ip->saddr));
+    __builtin_memcpy(id->dst_ip.s6_addr + sizeof(ip4in6), &ip->daddr, sizeof(ip->daddr));
+    id->transport_protocol = ip->protocol;
+    id->src_port = 0;
+    id->dst_port = 0;
     switch (ip->protocol) {
     case IPPROTO_TCP: {
         struct tcphdr *tcp = (void *)ip + sizeof(*ip);
         if ((void *)tcp + sizeof(*tcp) <= data_end) {
-            flow->transport.src_port = __bpf_ntohs(tcp->source);
-            flow->transport.dst_port = __bpf_ntohs(tcp->dest);
+            id->src_port = __bpf_ntohs(tcp->source);
+            id->dst_port = __bpf_ntohs(tcp->dest);
         }
     } break;
     case IPPROTO_UDP: {
         struct udphdr *udp = (void *)ip + sizeof(*ip);
         if ((void *)udp + sizeof(*udp) <= data_end) {
-            flow->transport.src_port = __bpf_ntohs(udp->source);
-            flow->transport.dst_port = __bpf_ntohs(udp->dest);
+            id->src_port = __bpf_ntohs(udp->source);
+            id->dst_port = __bpf_ntohs(udp->dest);
         }
     } break;
     default:
@@ -65,28 +96,29 @@ static inline int fill_iphdr(struct iphdr *ip, void *data_end, struct flow *flow
 }
 
 // sets flow fields from IPv6 header information
-static inline int fill_ip6hdr(struct ipv6hdr *ip, void *data_end, struct flow *flow) {
+static inline int fill_ip6hdr(struct ipv6hdr *ip, void *data_end, flow_id *id) {
     if ((void *)ip + sizeof(*ip) > data_end) {
         return DISCARD;
     }
 
-    flow->network.src_ip = ip->saddr;
-    flow->network.dst_ip = ip->daddr;
-    flow->transport.protocol = ip->nexthdr;
-
+    id->src_ip = ip->saddr;
+    id->dst_ip = ip->daddr;
+    id->transport_protocol = ip->nexthdr;
+    id->src_port = 0;
+    id->dst_port = 0;
     switch (ip->nexthdr) {
     case IPPROTO_TCP: {
         struct tcphdr *tcp = (void *)ip + sizeof(*ip);
         if ((void *)tcp + sizeof(*tcp) <= data_end) {
-            flow->transport.src_port = __bpf_ntohs(tcp->source);
-            flow->transport.dst_port = __bpf_ntohs(tcp->dest);
+            id->src_port = __bpf_ntohs(tcp->source);
+            id->dst_port = __bpf_ntohs(tcp->dest);
         }
     } break;
     case IPPROTO_UDP: {
         struct udphdr *udp = (void *)ip + sizeof(*ip);
         if ((void *)udp + sizeof(*udp) <= data_end) {
-            flow->transport.src_port = __bpf_ntohs(udp->source);
-            flow->transport.dst_port = __bpf_ntohs(udp->dest);
+            id->src_port = __bpf_ntohs(udp->source);
+            id->dst_port = __bpf_ntohs(udp->dest);
         }
     } break;
     default:
@@ -95,59 +127,90 @@ static inline int fill_ip6hdr(struct ipv6hdr *ip, void *data_end, struct flow *f
     return SUBMIT;
 }
 // sets flow fields from Ethernet header information
-static inline int fill_ethhdr(struct ethhdr *eth, void *data_end, struct flow *flow) {
+static inline int fill_ethhdr(struct ethhdr *eth, void *data_end, flow_id *id) {
     if ((void *)eth + sizeof(*eth) > data_end) {
         return DISCARD;
     }
-    __builtin_memcpy(flow->data_link.dst_mac, eth->h_dest, ETH_ALEN);
-    __builtin_memcpy(flow->data_link.src_mac, eth->h_source, ETH_ALEN);
-    flow->protocol = __bpf_ntohs(eth->h_proto);
-    // TODO: ETH_P_IPV6
-    if (flow->protocol == ETH_P_IP) {
+    __builtin_memcpy(id->dst_mac, eth->h_dest, ETH_ALEN);
+    __builtin_memcpy(id->src_mac, eth->h_source, ETH_ALEN);
+    id->eth_protocol = __bpf_ntohs(eth->h_proto);
+
+    if (id->eth_protocol == ETH_P_IP) {
         struct iphdr *ip = (void *)eth + sizeof(*eth);
-        return fill_iphdr(ip, data_end, flow);
-    } else if (flow->protocol == ETH_P_IPV6) {
+        return fill_iphdr(ip, data_end, id);
+    } else if (id->eth_protocol == ETH_P_IPV6) {
         struct ipv6hdr *ip6 = (void *)eth + sizeof(*eth);
-        return fill_ip6hdr(ip6, data_end, flow);
+        return fill_ip6hdr(ip6, data_end, id);
+    } else {
+        // TODO : Need to implement other specific ethertypes if needed
+        // For now other parts of flow id remain zero
+        memset (&(id->src_ip),0, sizeof(struct in6_addr));
+        memset (&(id->dst_ip),0, sizeof(struct in6_addr));
+        id->transport_protocol = 0;
+        id->src_port = 0;
+        id->dst_port = 0;
     }
     return SUBMIT;
 }
 
-// parses flow information for a given direction (ingress/egress)
-static inline int flow_parse(struct __sk_buff *skb, u8 direction) {
 
+static inline int flow_monitor(struct __sk_buff *skb, u8 direction) {
     // If sampling is defined, will only parse 1 out of "sampling" flows
     if (sampling != 0 && (bpf_get_prandom_u32() % sampling) != 0) {
         return TC_ACT_OK;
     }
-
-    void *data = (void *)(long)skb->data;
     void *data_end = (void *)(long)skb->data_end;
+    void *data = (void *)(long)skb->data;
 
-    struct flow *flow = bpf_ringbuf_reserve(&flows, sizeof(struct flow), 0);
-    if (!flow) {
+    flow_id id;
+    u64 current_time = bpf_ktime_get_ns();
+    struct ethhdr *eth = data;
+    if (fill_ethhdr(eth, data_end, &id) == DISCARD) {
         return TC_ACT_OK;
     }
+    id.if_index = skb->ifindex;
+    id.direction = direction;
 
-    struct ethhdr *eth = data;
-    if (fill_ethhdr(eth, data_end, flow) == DISCARD) {
-        bpf_ringbuf_discard(flow, 0);
+    flow_metrics *aggregate_flow = bpf_map_lookup_elem(&aggregated_flows, &id);
+    if (aggregate_flow != NULL) {
+        aggregate_flow->packets += 1;
+        aggregate_flow->bytes += skb->len;
+        aggregate_flow->end_mono_time_ts = current_time;
+
+        bpf_map_update_elem(&aggregated_flows, &id, aggregate_flow, BPF_EXIST);
     } else {
-        flow->direction = direction;
-        flow->bytes = skb->len;
-        bpf_ringbuf_submit(flow, 0);
+        // Key does not exist in the map, and will need to create a new entry.
+        flow_metrics new_flow = {
+            .packets = 1,
+            .bytes=skb->len,
+            .start_mono_time_ts = current_time,
+            .end_mono_time_ts = current_time,
+        };
+
+        if (bpf_map_update_elem(&aggregated_flows, &id, &new_flow, BPF_NOEXIST) != 0) {
+            /*
+                When the map is full, we directly send the flow entry to userspace via ringbuffer,
+                until space is available in the kernel-side maps
+            */
+            flow_record *record = bpf_ringbuf_reserve(&direct_flows, sizeof(flow_record), 0);
+            if (!record) {
+                return TC_ACT_OK;
+            }
+            record->id = id;
+            record->metrics = new_flow;
+            bpf_ringbuf_submit(record, 0);
+        }
     }
     return TC_ACT_OK;
-}
 
-SEC("tc/ingress_flow_parse")
-static inline int ingress_flow_parse(struct __sk_buff *skb) {
-    return flow_parse(skb, INGRESS);
 }
-
-SEC("tc/egress_flow_parse")
-static inline int egress_flow_parse(struct __sk_buff *skb) {
-    return flow_parse(skb, EGRESS);
+SEC("tc_ingress")
+int ingress_flow_parse (struct __sk_buff *skb) {
+    return flow_monitor(skb, INGRESS);
 }
 
-char __license[] SEC("license") = "GPL";
+SEC("tc_egress")
+int egress_flow_parse (struct __sk_buff *skb) {
+    return flow_monitor(skb, EGRESS);
+}
+char _license[] SEC("license") = "GPL";
```

docs/measurements.pptx — 75.9 KB (binary file not shown)
