OpenLI Tutorial 03: Packet Capture for LI
You may either watch the tutorial lesson on YouTube, or you can download the slides and read them alongside the transcript provided below.
Hi everyone and welcome back to this series of training lectures on the OpenLI lawful intercept system.
In this lesson, I’m going to explain one of the fundamental elements of any lawful interception system: the packet capture technology that will grab network traffic off the wire and deliver those packets into the intercept system.
It’s important to note here that OpenLI does not directly implement packet capture itself; rather, it leverages existing packet capture technologies and libraries to do that work on its behalf. However, there are many different capture technologies available, each with their own pros and cons, and the choice of which technology to use is something that we leave up to each user.
The main goal of this lesson is to give you the knowledge you need to make an informed decision about what packet capture methods will be suitable for your network, with an emphasis on how to successfully capture at high throughputs. I’ll introduce some of the capture technologies that OpenLI supports and give you an expert’s insight into what you should consider when deciding what to use for your OpenLI deployment.
So, as I’ve alluded to already, the capture of network traffic is one of the core activities that your LI system will be required to perform. The communications that you will be charged with intercepting will traverse your network in the form of packets (or frames) and you will need to use some method to capture or intercept those packets. The LI system can then examine the headers on each packet to determine if it should be forwarded on to an interested LEA.
Once a captured packet has been determined to belong to an active interception target, then the LI system can continue with the rest of its process -- that is, to first appropriately encapsulate the packet with the ETSI-required headers and then mediate the resulting record to the requesting LEA. But the first, and key, step is to capture the packet from the underlying network.
Of course, I anticipate that most, if not all, people following this series of tutorials will have some familiarity with pcap. Pcap is a relatively simple capture format that has been around for a very long time and is supported by just about anything and everything. The most common tool used for interfacing with pcap is a command-line tool called tcpdump, which can be used to capture packets from any live interface. You can then use tcpdump to either display the contents of the packets to a terminal or to write the packets to a pcap file. Files can be stored for later analysis or shared with someone else. tcpdump can also be given a pcap file as input and it will parse and display the packets that are present in the file.
If you’ve worked with networks for any length of time, then you’ve probably used tcpdump (or its GUI equivalent, Wireshark) to inspect traffic on a network link, most likely for troubleshooting purposes.
We’ve already established that the ETSI standards require a very specific output format for delivering intercepted packets to LEAs. Neither the pcap file format nor the terminal output produced by tcpdump is suitable for that purpose, as they lack many of the key fields that the ETSI standards mandate. So we definitely can’t use tcpdump by itself to solve our LI problems.
However, tcpdump is just a tool that is wrapped around a software library called libpcap, which implements the packet capture methods used by tcpdump and exposes them as an API that anyone can use. For instance, you may have come across a number of other tools that also use libpcap to do interesting things with pcap captures, such as tcptrace, tcpslice or even bigger projects such as snort or zeek.
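To give a flavour of what that API looks like, here is a minimal, hedged sketch of a libpcap capture loop -- the interface name and snapshot length are illustrative choices, not anything mandated by the library:

```c
/* A minimal sketch of the libpcap API that tcpdump is built on: open a
 * live interface, then invoke a callback for each captured packet. */
#include <pcap.h>
#include <stdio.h>

static void per_packet(u_char *user, const struct pcap_pkthdr *hdr,
                       const u_char *bytes)
{
    /* hdr->caplen bytes of raw packet data are available at 'bytes' */
    printf("captured %u bytes\n", hdr->caplen);
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];

    /* Open "eth0" in promiscuous mode with a 1 second read timeout */
    pcap_t *handle = pcap_open_live("eth0", 65535, 1, 1000, errbuf);
    if (handle == NULL) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return 1;
    }
    pcap_loop(handle, -1, per_packet, NULL);  /* capture until interrupted */
    pcap_close(handle);
    return 0;
}
```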
So given that libpcap exists and is a proven and well-known capture technology, why don’t we just write a libpcap-based tool for lawful intercept and be done with it?
Well, to answer that question we must first delve a little deeper into the mechanics of packet capture and the trade-offs that take place within each capture method. In our LI system, we have already established one key requirement: all packets sent or received by an intercept target must be intercepted -- no packets should be missed or dropped by our deployed system.
That means, first and foremost, that our LI system must be able to capture every packet that it sees, as any packet on the network is potentially interceptable. However, when you consider the environments where you might have used tcpdump to do packet capture in the past, you might realise that packet capture was not the number one priority for those systems. For instance, packet capture on a router would be less important than the router’s core tasks of routing packets and maintaining a route table. Similarly, the networking stack on a generic Linux host would have far higher priorities than ensuring flawless packet capture, such as managing TCP sessions or delivering packet content to applications.
So in most situations, you can expect packet capture to be a lower-priority task with access to relatively few resources (at least in a default configuration). But our LI system needs the opposite to be the case -- capture performance must be optimised and given access to as many resources as it needs. So, how do we achieve the level of performance necessary for a production LI system to function?
The biggest influence on the performance of any packet capture system is the number of packets that it has to handle at any given time. Each individual packet received requires some amount of processing effort, even a packet that your system ends up subsequently ignoring (if, say, it doesn’t belong to an interception target).
The best thing you can do to limit the likelihood of packet loss within any packet capture system is to ensure that only relevant traffic is being presented on the capture interface -- any filtering that you can do to discard uninteresting traffic before it reaches your capture device will provide massive pay-offs in terms of your system’s ability to cope under peak load.
In the context of LI, you will most likely need to mirror customer traffic into your LI system to be captured -- rather than naively mirroring your entire multi-gigabit core link into your LI system, consider using the intelligent features on your routers to mirror just the smaller subset of the traffic that is guaranteed to encompass the communications of the intercept target. This will save the LI system the effort of inspecting packets, only to realise that they are not useful and need to be discarded.
One important thing to emphasise here is that the key metric is the packet rate, in terms of packets observed per second. Traffic volume in bytes is relatively less important.
To illustrate, consider a 10Gbps link. If we saturate the link with 1500-byte packets, anyone capturing that traffic will see around 820,000 packets per second -- which is a lot, but not an impossible workload to deal with. However, in the worst-case scenario where the link is saturated with 64-byte packets, a capture system would now have to deal with nearly 15 million packets per second; an 18-fold increase in the workload. A realistic traffic model would fall somewhere between these two extremes, of course, but this example shows that a particular bitrate can mean vastly different workloads depending on the underlying packet rate.
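To spell out the arithmetic behind those figures (assuming each Ethernet frame also costs roughly 20 bytes of preamble and inter-frame gap on the wire): a 10Gbps link carries 10^10 bits per second, so with 1500-byte packets the rate is 10^10 / ((1500 + 20) × 8) ≈ 822,000 packets per second, while with 64-byte packets it is 10^10 / ((64 + 20) × 8) ≈ 14.9 million packets per second -- roughly 18 times as many.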
Once the incoming packet rate begins to exceed a certain level, it becomes important to be able to both receive and process captured packets in a way that is natively parallel. As you are most likely aware, processor clock speeds have been relatively static while the number of cores available on a chip has grown significantly -- this means that the number of packets that can be captured and processed by a single CPU core is limited. But if the capture process can be parallelised, then the growing number of available cores can be leveraged to capture greater and greater volumes of network traffic.
Which is great, but it is somewhat dependent on there being some native support for parallel capture and processing within the capture method or library itself. Specifically, the incoming packets must be assigned to different CPU cores as soon as they arrive, and must remain on their core throughout the processing. Moving a packet from one core to another is an expensive operation and would invalidate most of the performance gain from using parallel threads in the first place. This requires the capture method to be aware of the need for parallelism and to not lump all received packets into the same memory buffer; rather, each thread needs its own receive buffer, living in memory associated with that thread’s corresponding CPU core.
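To make the core-pinning idea concrete, here is a minimal sketch using the Linux pthread affinity API; the helper name is my own invention, and a real capture engine would pair each pinned thread with its own receive buffer:

```c
/* Sketch: pin a capture worker thread to a fixed CPU core so that the
 * packets assigned to it stay in that core's local buffers and caches
 * for the whole of their processing. Assumes Linux with glibc. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

int pin_thread_to_core(pthread_t thread, int core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    /* After this call, the scheduler will only run 'thread' on the
     * given core, so its packets never migrate between cores. */
    return pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
}
```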
Another thing to bear in mind, especially when using a Linux host to perform packet capture, is that the kernel networking stack is designed for general purpose use; that is, to service the applications that are using the network to communicate with each other. For these applications, any received packet will trigger an interrupt, be copied into a socket buffer, have its TCP/IP headers stripped and eventually be passed on to the application via a system call. This, of course, introduces a lot of extra overhead for each received packet.
A packet capture application doesn’t really need all of that extra preparation work, as it could just as easily do its job on the original packet content as received by the network interface. So any packets that are captured after passing through the kernel stack have effectively just been through a needless obstacle course and that has a direct impact on the eventual achievable capture rate.
Nowadays, there are specialised packet capture techniques that are able to insert modules or drivers into your kernel to bypass the networking stack and therefore achieve better capture performance. Rather than walking through the network stack, packets are intercepted early on and fed directly into your userspace application. The downside of these techniques is that they each have their own specific API and the capture application must use that API to be able to take advantage of the kernel bypass technique -- for instance, a program that currently works with pcap would need significant redevelopment to be able to use a kernel bypass method.
So with all of that in mind, let’s revisit our idea of having a pcap-based LI application. The simple answer is no: we can’t really rely on pcap for any use case that requires high-performance packet capture, and the reasoning is obvious once we think about the elements that influence the performance of a packet capture application. Libpcap is a userspace library that doesn’t do any form of kernel bypass, so we’ve got the networking stack overhead to contend with. The library itself also predates multi-core computing and was not designed with parallel packet capture in mind.
Overall, any high packet rate capture performed using libpcap is very likely to drop some packets, and that’s not going to work in our LI context. If our application supports libpcap, that gives us a handy, ubiquitous method for testing and troubleshooting our LI system, but not one you would want to rely on in production. So we must start to consider alternative, more specialised packet capture techniques.
The first technique I’m going to talk about is the AF_PACKET socket, which is a capture method that has been in the Linux kernel for quite some time now. An AF_PACKET socket is a raw socket that can gain access to any packets sent to or received by a given network interface. The main benefit of the AF_PACKET approach is that it uses a ring buffer to hold any captured packets in physical memory and then maps that buffer directly into virtual memory for the userspace application to have direct access to the packets, without requiring an expensive copy of each packet into userspace.
AF_PACKET also supports a “fan out” mode whereby captured packets can be spread among multiple buffers (say, one buffer per processing thread) to achieve a decent amount of parallelism.
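As a rough illustration of that fan-out mode, here is a hedged sketch of how each worker thread might open its own AF_PACKET socket and join a shared fanout group (a production capture engine would also set up a PACKET_RX_RING and mmap it, which I have omitted for brevity; the helper name and group id are my own):

```c
/* Sketch: one AF_PACKET socket per worker thread, all joined to the
 * same fanout group so the kernel spreads packets across the workers.
 * Error handling is abbreviated. */
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <net/if.h>

int open_fanout_socket(const char *ifname, int fanout_group_id)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;

    /* Bind the raw socket to the capture interface */
    struct sockaddr_ll addr;
    memset(&addr, 0, sizeof(addr));
    addr.sll_family = AF_PACKET;
    addr.sll_protocol = htons(ETH_P_ALL);
    addr.sll_ifindex = if_nametoindex(ifname);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    /* Join the fanout group: flow hashing keeps all packets from a
     * given flow on the same worker, which suits parallel capture. */
    int fanout_arg = (fanout_group_id & 0xffff) | (PACKET_FANOUT_HASH << 16);
    if (setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
                   &fanout_arg, sizeof(fanout_arg)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```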
But while the AF_PACKET method will generally outperform an equivalent libpcap approach, it will still struggle at the higher packet rates because the capture process does not fully bypass the kernel networking stack. However, because the method is readily available on any Linux host and does not require any special hardware or drivers, it is a good choice for situations where you are confident that the packet rate will be relatively low.
Next we have DPDK, which stands for “Data Plane Development Kit”. This is a project that began within Intel, but has since spun off into its own open-source entity. At the risk of oversimplifying, DPDK replaces your standard networking drivers with custom ones that are optimised primarily for high speed packet capture. These drivers will bypass the kernel networking stack and deliver the packets directly into a user space process, although the program code for that process must be written to use the DPDK APIs to be able to access the packets.
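To give a sense of the programming model, here is a rough sketch of the polling receive loop at the heart of a DPDK application -- it assumes the EAL and the port have already been initialised elsewhere, and the burst size is just a common illustrative choice:

```c
/* Sketch: DPDK's poll-mode receive pattern. Packets are pulled from
 * the NIC in userspace in batches ("bursts"), with no interrupts and
 * no kernel networking stack involved. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

void rx_loop(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Poll the NIC queue; returns however many packets arrived */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id,
                                          bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++) {
            /* Packet bytes would be accessed via rte_pktmbuf_mtod() */
            rte_pktmbuf_free(bufs[i]);
        }
    }
}
```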
One of the big advantages of DPDK is that it is proven to be able to capture traffic at 10Gbps line rate (or 15 million packets per second) with minimal tweaking or tuning -- I know this because I have managed it myself. The other advantage is that the project has been around for several years and is therefore reasonably mature and has an active and experienced user base.
DPDK is not perfect, however. Firstly, you need to make sure that your network card is compatible with DPDK -- the safest bet is to buy a standalone Intel 10G NIC rather than relying on the interfaces that come with your motherboard by default. Another downside is that any interface which is running the DPDK drivers is no longer available to the kernel for conventional communications, so you effectively lose that interface (which is another good reason to have a dedicated NIC for DPDK capture).
The last issue, and one that most people with DPDK experience will be quick to warn you about, is that DPDK has never been particularly user-friendly. The learning curve can be steep and the configuration process involves more environment variables and /proc table entries than would be ideal. Having said that, the usability has been improving over time and the project has come a long way from the early days at Intel.
The next capture method that we’re going to look at is the new kid on the block: the eXpress Data Path or XDP. It’s essentially a next-generation raw socket, like AF_PACKET, that is designed specifically for high speed capture. The XDP raw socket is available in Linux from kernel 4.18 onwards and most modern Linux distributions will now have XDP enabled in their kernel by default.
Again, without wanting to get into too much detail, XDP utilises the eBPF programming technology to intercept and redirect packets before they can be passed onto the default kernel networking stack. The packet will still trigger an interrupt, but the rest of the networking overhead can be bypassed.
What this means in practice is that we can use a simple eBPF program to present packets to a user space application without any wasteful copying of the packet in memory. This, combined with the removal of most of the networking stack overhead, means that XDP can comfortably handle 10Gbps line rate much like DPDK can. The other big bonus is that, because XDP is already integrated into the Linux kernel, it is generally a lot less fiddly and annoying to use than DPDK.
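To show what the kernel-side half looks like, here is a minimal sketch of an XDP eBPF program that steers packets into AF_XDP sockets for a userspace capture process -- the map name is my own, and the pattern follows the common AF_XDP redirect approach rather than anything OpenLI-specific:

```c
/* Sketch: an XDP program that redirects each packet to the AF_XDP
 * socket registered for its receive queue, bypassing the rest of the
 * kernel networking stack. Compile with clang -target bpf. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);
    __type(key, __u32);
    __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_redirect_to_socket(struct xdp_md *ctx)
{
    /* Redirect to the socket bound to this RX queue; if no socket is
     * attached, fall back to letting the kernel process the packet. */
    return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

char _license[] SEC("license") = "GPL";
```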
The main downsides to XDP are related to its relative immaturity. For instance, not all NIC drivers may support XDP natively and in those cases the kernel may have to fall back to a slower software implementation instead. The other big downside is that there are relatively few resources and examples available for people approaching XDP for the first time, especially if they don’t already have a background in packet capture. This will likely improve over time, but it might be a bit rough starting out until XDP becomes a bit more mainstream.
A couple of other packet capture options are worth mentioning here. The first is the Endace DAG card, which is a highly specialised hardware packet capture card that offers excellent performance with extremely accurate timestamping. Endace has been the state of the art in packet capture for the past two decades and I would have no problem with recommending a DAG card for any packet capture situation. However, this level of performance does come at a price, literally, and that price is probably outside of your budget if you’re considering using OpenLI for your LI system. Endace themselves are also moving away from selling individual capture cards, preferring to focus their efforts on appliances instead, so the cards can be very difficult to come by.
Another option, and one that I am only downplaying because I don’t have much experience with it, is PF_RING. It is another kernel bypass technique that is implemented as a kernel module, and has been developed by the same team behind ntop. The module isn’t in the mainline Linux kernel but PF_RING itself is very mature and has been around for a long time. PF_RING is widely supported by a variety of NICs and Unix-based operating systems and has a reputation for being straightforward to use (and has very good documentation from what I have seen).
Each of the packet capture methods I’ve just described provides its own software API for accessing the captured packets. This makes it difficult to switch between them, as software written specifically for one capture method would require significant re-development to support another. This places a lot of pressure on the user to choose the right capture method from the outset. But there is a solution...
Libtrace is a library that is designed to simplify the development and use of tools for packet capture and analysis. Libtrace abstracts away the specifics of interfacing with the programming APIs of each supported packet capture method, placing it all behind a relatively simple API of its own. Developers write code using the libtrace API and their software is then portable across multiple different packet capture methods, without requiring any code modification. Libtrace also includes many helper methods for decoding the captured packets and their headers, such as getting direct access to a TCP or IP header.
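Here is a hedged sketch of what that portability looks like in practice -- the URI strings are illustrative (exact formats depend on your libtrace version and build options), but the point is that swapping capture methods means changing only the URI:

```c
/* Sketch: a minimal libtrace reader. Changing "pcapint:eth0" to, say,
 * a DPDK or XDP URI switches the capture method without touching the
 * rest of the code. */
#include <libtrace.h>
#include <arpa/inet.h>
#include <stdio.h>

int main(void)
{
    libtrace_t *trace = trace_create("pcapint:eth0");
    libtrace_packet_t *packet = trace_create_packet();

    if (trace_is_err(trace) || trace_start(trace) == -1) {
        trace_perror(trace, "starting capture");
        return 1;
    }

    while (trace_read_packet(trace, packet) > 0) {
        /* Helper accessors decode the headers for us */
        libtrace_tcp_t *tcp = trace_get_tcp(packet);
        if (tcp) {
            printf("TCP packet, source port %u\n", ntohs(tcp->source));
        }
    }

    trace_destroy_packet(packet);
    trace_destroy(trace);
    return 0;
}
```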
I do feel obligated to mention, though, that libtrace does not simplify all aspects of running a packet capture system -- for instance, the issues with configuring DPDK still apply even when using libtrace unfortunately.
Libtrace supports all of the capture methods that I’ve mentioned in this lesson and OpenLI uses the libtrace API for all packet capture and processing tasks. Therefore, if you deploy OpenLI, you have the flexibility to choose which packet capture method suits your situation best, and to easily change to another method if your situation changes -- OpenLI will be able to transparently support whatever option you choose.
So you’ve got the freedom to choose, but how do you actually make a good decision even with everything that I’ve just told you? Here’s my personal quick guide:
AF_PACKET is just fine if you’re dealing with relatively small amounts of traffic, such as in a testing lab environment, or if you just want to capture the traffic for a particular session management protocol (such as SIP or RADIUS). I would say anything up to 1 gigabit per second of capture is feasible using AF_PACKET in conjunction with 4 or more CPU cores. Anything more than that and you need to start looking at the more sophisticated approaches.
Of these, DPDK is the one that I’ve promoted the most in the past and we definitely have successful OpenLI deployments that are using DPDK for their main packet capture interface. DPDK is reasonably mature and proven to work well, but you do have to be prepared to battle a bit to get over the learning curve. Right now, DPDK is still my main recommendation.
However, XDP is definitely catching up and I expect that this will become my preferred option in the near future. XDP has a lot going for it; it’s generally easier to use but still offers great performance and support for it is built directly into modern kernels. I wouldn’t discourage anyone who is thinking of jumping on board with XDP right now -- it’s a good option and it is only likely to get better.
So let’s sum up. This whole lesson has been about the topic of packet capture, and specifically the key factors that can influence the performance of the packet capture system that will ultimately be driving any LI deployment. We’ve talked about the importance of packet rate (as opposed to the data rate), having the ability to receive and process packets in parallel and the impact of the operating system kernel overheads on packet throughput.
With those factors in mind, I’ve introduced a number of packet capture techniques ranging from conventional pcap through to the specialised kernel-bypass methods, such as DPDK and XDP. Hopefully I’ve given you enough details about the pros and cons of each method to help decide which technique is best suited to your situation, or at least leave you better informed than you were before you started this lesson.
Looking ahead, you should now be equipped with the background knowledge that you need for us to be able to start talking more about OpenLI directly. In our next lesson, I’ll introduce you to the three components that are part of every OpenLI deployment and explain their role within the overall system. I’ll also go over some general principles regarding where you might want to put each component within your network and how to go about estimating what hardware you should dedicate to them. As part of this, I’ll also touch on security and some best practices to help minimise the risk of unauthorised access to your LI system.
Until next time, take care and stay out of trouble! See you soon.