PSARC/2009/232 Berkeley Packet Filter for OpenSolaris
Kais Belgaied
Kais.Belgaied at sun.com
Fri Apr 10 14:15:53 PDT 2009
Hi Darren,
While the architecture looks sound, it has so many pieces and
interactions with other subsystems (the pluggable sockets, MAC, etc.)
that it goes way beyond what's suitable for a fasttrack.
This should be a full case.
Kais.
On 04/10/09 14:01, Darren Reed wrote:
> This is a self sponsored fast track, timeout set for 2 weeks...
>
> Abstract
> ========
> This case seeks to build on the Crossbow (PSARC/2006/357[7]) infrastructure
> and provide a new (to OpenSolaris) mechanism for capturing packets: the
> use of the Berkeley Packet Filter (BPF). The goal of this project is to
> provide a method to capture packets that has higher performance than
> what we have to offer today on Solaris (DLPI based schemes.) It also has
> the added benefit of increasing our compatibility with other software
> that has been built to use BPF.
>
> Release Binding
> ---------------
> This case seeks to obtain approval for minor release binding.
>
> Background
> ==========
> Packet capture on Solaris is currently built around the use of DLPI.
> Whilst the introduction of libdlpi (PSARC/2006/436[1]) has made it easier
> to program using DLPI and the IP Observability Project (PSARC/2006/475[2])
> introduced the means by which packets that are local to the host could
> be intercepted, neither did anything to address the primary problems
> with DLPI: compared to other mechanisms it is slow, its in-kernel
> filtering is either not used or very primitive, and it provides very
> little useful information about the packet capture itself by way
> of statistics.
>
> Introduction
> ============
>
> The architecture of BPF lends itself to more efficient means of doing
> packet capture, where a single read can transfer large numbers of packets
> per call. It also allows the sniffer to choose how much data from each
> packet they wish to copy, be it the entire packet or just the first 128
> bytes to capture headers.
>
> Internal Architecture
> ~~~~~~~~~~~~~~~~~~~~~
> Internally, the architecture of BPF is very simple: it has a lower
> half that receives packets from the NIC drivers, copying matching
> packets into a static buffer and an upper half that implements a
> character pseudo-device.
>
> Buffers
> -------
> The backing for the character pseudo-device is a buffer that the
> driver allocates for storing copied packet data. The size of this
> buffer is set by the application. By default libpcap sets this size
> to the same size as the driver's default: 32k. The maximum this
> project allows is 16M.
>
> Two buffers of this size are allocated by the driver: an "active"
> buffer and a "hold" buffer. This supports applications doing
> sleeping reads, if they aren't using poll, and reading an entire
> buffer of data whilst the system continues to catch new packets.
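
The "active"/"hold" arrangement described above can be sketched as a toy
model. This is only an illustration under stated assumptions: the structure,
the function names and the tiny buffer size are hypothetical and are not
taken from the driver, and a real driver would sleep or time out in read
rather than return 0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EX_BUF_SIZE 8   /* tiny on purpose; the proposal's default is 32k */

/* Hypothetical model of the two capture buffers. */
struct ex_capture_bufs {
    unsigned char buf[2][EX_BUF_SIZE];
    int           active;     /* index of the buffer being filled */
    int           hold_full;  /* non-zero while the other buffer awaits a read */
    size_t        used;       /* bytes stored in the active buffer */
    size_t        hold_len;   /* bytes valid in the hold buffer */
    unsigned long dropped;    /* records dropped because both were busy */
};

/* Store one record; returns 1 if stored, 0 if it had to be dropped. */
static int
ex_capture_store(struct ex_capture_bufs *cb, const void *rec, size_t len)
{
    if (len > EX_BUF_SIZE) {
        cb->dropped++;
        return 0;
    }
    if (cb->used + len > EX_BUF_SIZE) {
        if (cb->hold_full) {          /* reader has not drained "hold" yet */
            cb->dropped++;
            return 0;
        }
        /* Rotate: the full active buffer becomes the hold buffer. */
        cb->hold_len = cb->used;
        cb->hold_full = 1;
        cb->active ^= 1;
        cb->used = 0;
    }
    memcpy(cb->buf[cb->active] + cb->used, rec, len);
    cb->used += len;
    return 1;
}

/* Hand the whole hold buffer to the reader; returns the bytes copied. */
static size_t
ex_capture_read(struct ex_capture_bufs *cb, void *out, size_t outlen)
{
    size_t n;

    if (!cb->hold_full)
        return 0;                     /* a real driver would block here */
    assert(outlen >= cb->hold_len);
    n = cb->hold_len;
    memcpy(out, cb->buf[cb->active ^ 1], n);
    cb->hold_full = 0;
    cb->hold_len = 0;
    return n;
}
```

The point of the model is that capture keeps writing into one buffer while
the reader owns the other, and packets are only dropped once both are full.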
>
> Applications can set the buffer size using libpcap or with the
> BIOCSBLEN ioctl (see man page.)
>
> List of Interfaces
> ------------------
> BPF maintains an internal list of network interfaces that it supports
> capturing packets for. What distinguishes this list from those in either
> the mac or ip modules is that it uses the datalink type as a part
> of the key for determining what is an identical entry. Additionally,
> on OpenSolaris the device structure used inside of the ip module is
> different to the mac module, preventing either one being used as a
> master list by BPF. Answering queries such as returning the complete
> list of datalink types supported by a device (BIOCGDLTLIST), would
> be much more complicated without that internal list.
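
To illustrate why keying the list on the datalink type helps, here is a
minimal sketch of such a list. The structure, the function names and the
second DLT value are hypothetical placeholders chosen for the example; only
the idea of a (name, datalink type) key comes from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EX_DLT_EN10MB 1       /* Ethernet, as in the usual DLT_EN10MB */
#define EX_DLT_IPNET  0x7001  /* hypothetical placeholder value */

/* One entry per (interface name, datalink type) pair. */
struct ex_bpf_if {
    const char  *name;
    unsigned int dlt;
};

/* Find the entry matching both parts of the key, or NULL. */
static const struct ex_bpf_if *
ex_if_lookup(const struct ex_bpf_if *list, size_t n,
    const char *name, unsigned int dlt)
{
    assert(name != NULL);
    for (size_t i = 0; i < n; i++) {
        if (list[i].dlt == dlt && strcmp(list[i].name, name) == 0)
            return &list[i];
    }
    return NULL;
}

/*
 * Collect every datalink type known for one interface name, in the
 * spirit of BIOCGDLTLIST. Returns the number found, which may exceed
 * outlen; only the first outlen values are written.
 */
static size_t
ex_if_dltlist(const struct ex_bpf_if *list, size_t n, const char *name,
    unsigned int *out, size_t outlen)
{
    size_t found = 0;

    for (size_t i = 0; i < n; i++) {
        if (strcmp(list[i].name, name) != 0)
            continue;
        if (found < outlen)
            out[found] = list[i].dlt;
        found++;
    }
    return found;
}

static const struct ex_bpf_if ex_list[] = {
    { "bge0", EX_DLT_EN10MB },
    { "bge0", EX_DLT_IPNET },
    { "lo0",  EX_DLT_IPNET },
};
```

With one flat list, answering "which datalink types does bge0 support" is a
single walk instead of a query against both the mac and ip modules.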
>
> Packet Capture
> --------------
> When BPF is called from the mac layer, it is handed the packet as
> it is received from the NIC driver as part of the promiscuous
> callback handling in the mac layer. It is the same mblk_t for the
> packet that will later be passed on through the stack and has
> neither the mblk_t's nor dblk_t's duplicated. Thus the capturing
> of the packet becomes part of the execution of the datapath for
> each packet.
>
> Interactions with existing technology in Solaris
> ================================================
> This section goes into detail about what impact this project has on
> other areas of Solaris or what impact they have on this project.
>
> Vanity Naming
> ~~~~~~~~~~~~~
> The Vanity Naming Project[6] introduced the means by which link names
> could be changed to be a different name than the underlying mac name.
> This project will only support packet capture on interfaces using the
> interface name allocated by the dls module that was delivered by the
> vanity naming project.
>
> IP Observability
> ~~~~~~~~~~~~~~~~
> The IP observability project introduced the ability to capture packets
> from within IP, presenting them through device files in /dev/ipnet for
> libdlpi to use. This project will update some of the interfaces introduced
> by IP observability.
>
> Updating IPNET
> --------------
> Unfortunately the mechanism used to do this is bound up within IP. To
> build upon the work done here, this project will change the mechanism
> by which IP observability works within IP to use netinfo
> (PSARC/2008/219[3]), enabling external kernel modules to easily subscribe
> to the packet events that make up this feature. A single new event
> will be added, NH_OBSERVING. Callbacks that are activated with this
> event will receive a pointer to a hook_pkt_observe_t structure as
> the hook_data_t value.
>
> This project will introduce a version 2 of the IPNET "protocol", with an
> updated header. Selection of version 1 (the existing version of IPNET
> headers) or version 2 will be achieved using the DLIOCIPNETINFO ioctl
> on the /dev/ipnet STREAMS device.
>
> The version bump with IPNET is required to maintain backward compatibility
> with the current interface delivered and described with lo0(7d) -
> it is currently a Committed interface.
>
> mac networking layer
> ~~~~~~~~~~~~~~~~~~~~
> The design of BPF on BSD places it in the MAC layer where it has
> easy access to functions to enable/disable promiscuous mode and
> the data structures used to represent network interfaces.
>
> Promiscuous callbacks
> ---------------------
> This project does not add or change any of the existing promiscuous
> callbacks that exist in the mac module.
>
> On the receive side, the promiscuous callbacks have been placed in the
> mac module early in the receive path, before any classification work
> is done on the packet. This means that packet sniffing will happen at
> line rate but still in accordance with how the NIC has been programmed
> to queue packets on its rings.
>
> On the transmit side, the promiscuous callbacks are activated before a
> packet is classified and put in the appropriate descriptor ring
> for transmission.
>
> Existing Interfaces
> -------------------
> With the mac layer presented by Crossbow, most of the necessary features
> are provided, albeit via private interfaces:
>
> dls_mgmt_get_linkid
> mac_client_close
> mac_client_handle_t
> mac_client_name - returns the name associated with the mac_client object
> mac_client_open
> mac_handle_t - value returned by a successful call to mac_open that is
> used with calls to other functions in the mac layer
> mac_multicast_add - add another multicast address to the interface
> mac_multicast_remove - remove a multicast address from the interface
> mac_name - returns the name of the MAC device (can be different to the
> name returned by mac_client_name)
> mac_open
> mac_promisc_add - enable receipt of packets "promiscuously" via callbacks
> mac_promisc_handle_t
> mac_promisc_remove - disable callbacks for packets
> mac_sdu_get - returns the MTU for a MAC
> mac_tx - deliver a packet
>
> New Interfaces
> --------------
> BPF requires notification about each NIC that gets added, its datalink
> type and length of the hardware address. It uses this information to
> maintain an internal list of network interfaces that are available for
> capturing packets on. To support this, it is necessary to call into BPF
> from mac inside of mac_register() and mac_unregister(). This is done
> using a function pointer that is set by calling a new function called
> mac_set_bpfattach(). This function will take two parameters, one is
> the function to call when attaching (mac_register), the other is the
> function to call when detaching (mac_unregister). To account for the
> fact that drivers may have already called mac_register() before
> mac_set_bpfattach() is called, mac_set_bpfattach() will walk through
> all existing attached devices and call bpfattach() for each one of
> those. This approach is necessary due to the lack of a mechanism
> equivalent to netinfo's NE_PLUMB/NE_UNPLUMB events for IP, which will
> be used to support IPNET. Whilst the interface mentioned here is
> private, it's listed here for completeness.
>
> mac_set_bpfattach - set function pointers to be called when drivers call
> mac_register
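
The registration scheme described above, including the catch-up walk over
devices that registered before the callbacks were set, can be modelled in a
few lines. This is a user-space sketch only: every name below is
hypothetical, and the real mac_register(), mac_unregister() and
mac_set_bpfattach() signatures are not shown in this case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EX_MAX_MACS 8

typedef void (*ex_attach_fn_t)(const char *name);
typedef void (*ex_detach_fn_t)(const char *name);

static const char    *ex_registered[EX_MAX_MACS];
static size_t         ex_nregistered;
static ex_attach_fn_t ex_attach_cb;   /* NULL until BPF has loaded */
static ex_detach_fn_t ex_detach_cb;

/* Model of a driver registering: notify BPF only if it is listening. */
static void
ex_mac_register(const char *name)
{
    assert(ex_nregistered < EX_MAX_MACS);
    ex_registered[ex_nregistered++] = name;
    if (ex_attach_cb != NULL)
        ex_attach_cb(name);
}

/* Model of a driver unregistering. */
static void
ex_mac_unregister(const char *name)
{
    for (size_t i = 0; i < ex_nregistered; i++) {
        if (strcmp(ex_registered[i], name) != 0)
            continue;
        ex_registered[i] = ex_registered[--ex_nregistered];
        if (ex_detach_cb != NULL)
            ex_detach_cb(name);
        return;
    }
}

/* Set both callbacks, then replay attach for devices already present. */
static void
ex_set_bpfattach(ex_attach_fn_t attach, ex_detach_fn_t detach)
{
    ex_attach_cb = attach;
    ex_detach_cb = detach;
    if (attach == NULL)
        return;
    for (size_t i = 0; i < ex_nregistered; i++)
        attach(ex_registered[i]);
}

/* Stand-ins for bpfattach/bpfdetach that just count attachments. */
static int ex_attached;
static void ex_count_attach(const char *name) { (void)name; ex_attached++; }
static void ex_count_detach(const char *name) { (void)name; ex_attached--; }
```

The replay loop is what makes load order irrelevant: devices registered
before or after the callbacks are set end up on BPF's list either way.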
>
> An additional change is required in the mac module to improve performance:
> the introduction of a new promiscuous callback flag,
> MAC_PROMISC_FLAGS_NO_COPY.
>
> By default, a callback registered with the mac module today using
> mac_promisc_add receives a copymsg() copy of every packet passed in.
> When only peeking at packets, as BPF does, it isn't necessary to create
> a new mblk_t because BPF does not perform any destructive operations on
> the packet. When this flag is presented together with
> MAC_PROMISC_FLAGS_VLAN_TAG_STRIP, the behaviour is to fall back
> to doing the copymsg().
>
> MAC_PROMISC_FLAGS_NO_COPY - promiscuous callback is well behaved and does
> not need a copy of the packet
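
The copy decision described above reduces to a small predicate. The sketch
below assumes hypothetical bit values and a hypothetical function name; only
the two flag names and the fall-back rule come from the text. The fall-back
presumably exists because stripping a VLAN tag modifies the packet, but the
case does not say so explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values; the real definitions live in the mac module. */
#define EX_PROMISC_FLAGS_NO_COPY        0x01
#define EX_PROMISC_FLAGS_VLAN_TAG_STRIP 0x02

/* Returns non-zero when the callback must be handed a copymsg() copy. */
static int
ex_promisc_needs_copy(uint16_t flags)
{
    if ((flags & EX_PROMISC_FLAGS_NO_COPY) == 0)
        return 1;   /* default behaviour: always copy */
    if ((flags & EX_PROMISC_FLAGS_VLAN_TAG_STRIP) != 0)
        return 1;   /* NO_COPY is ignored when tag stripping is requested */
    return 0;       /* well-behaved peeker: pass the original mblk_t */
}
```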
>
> libpcap
> ~~~~~~~
> This project will update the copy of libpcap delivered by PSARC/2008/288[5]
> to use the BPF interface delivered with this project. A follow on project
> may look at extending the libpcap filtering language to easily allow
> filtering on the IPNET header fields.
>
> snoop
> ~~~~~
> While implementation of snoop will continue to use DLPI via libdlpi
> to retrieve packets, it will be updated to understand version 2 of
> the IPNET packet headers and by default will use this version when
> communicating with devices in /dev/ipnet. Support for understanding
> the existing version 1 headers will remain.
>
> libnet
> ~~~~~~
> The implementation of libnet delivered into the SFW consolidation by
> PSARC/2008/409 will not be modified as a part of this case.
>
> etherstubs
> ~~~~~~~~~~
> When an etherstub is being used as the locus for a vnic, it will be
> possible to see the network traffic on the vnic that:
> - is broadcast or multicast at the link layer
> - is moving between zones that are using vnic's on top of the etherstub
> driver to support an exclusive instance of IP
>
> It will not be possible to see link layer traffic for IP traffic
> between two local zones that are using a shared stack, or even
> multiple vnics in the global zone. To observe that traffic, the
> IPNET link layer must be specified.
>
> New Interfaces
> ==============
> This section looks at each of the new interfaces being introduced.
> Those that are described in the man page, bpf.7d, are not discussed
> here.
>
> Loopback DLT type
> ~~~~~~~~~~~~~~~~~
> With the provision of access to loopback data by adapting the mechanism
> used for ipnet, access to packets on the loopback interface as well as
> those inside of IP moving between zones using a shared stack instance
> model can be achieved.
>
> For each network interface that is used with the IP protocols, a tap
> point will be created with a new, distinct datalink type. The datalink
> type for this will be DLT_LOOP_SOLARIS. Both the structure contents
> and datalink type name will be registered with the tcpdump project[4].
>
> The packet header structure used with DLT_LOOP_SOLARIS will be the
> same as used with version 2 of the IP Observability devices in the
> /dev/ipnet directory. The proposed structure will be called
> dl_ipnetinfo_v2_t and the details pertaining to it can be found below.
>
> /dev/bpf
> ~~~~~~~~
> This driver ships using /dev/bpf as the device file for applications
> to open. Whilst the creation of the device cannot be as a clone (it
> is not a STREAMS driver), it is still possible for the driver to
> assign a new minor number each time the device is opened. Thus
> even though the driver isn't a clone driver, it is not necessary
> to have /dev/bpf0-15 for proper BPF semantics.
>
> driver.conf file
> ~~~~~~~~~~~~~~~~
> This project intends to use the driver.conf as the means by which the
> default and maximum buffer sizes can be changed. By default these are
> 32k and 16M respectively. They can be changed using the names "buf_size"
> and "max_buf_size". It is not envisaged that these will ever need
> changing but scope is provided for those that either need or wish to.
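
As an illustration only, a bpf.conf carrying the two tunables might look
like the fragment below. The property names come from the text above; the
pseudo-device line and the assumption that the values are byte counts (32k
and 16M written out) are guesses, not something this case specifies.

```
# bpf.conf (illustrative sketch; layout and units are assumptions)
name="bpf" parent="pseudo" instance=0;
buf_size=32768;
max_buf_size=16777216;
```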
>
> Whilst there are numerous new avenues being explored for datalink and
> IP administration, it needs to be remembered that although this device
> is centered around networking, it is neither a datalink nor an IP
> interface and thus isn't managed by dladm and friends. The need to change
> (increase) the value from that shipped is expected to be a rare event.
>
> <net/bpf.h>
> ~~~~~~~~~~~
> This file contains all of the structure and ioctl definitions that make
> up the programming interface for BPF. Four of the ioctls listed below
> are being introduced as "Project Private" as they form the foundation
> for supporting 32bit applications running against 64bit kernels. This
> is necessary because some of the structures exchanged between the bpf
> driver and applications contain pointers.
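
To show why pointer-carrying structures need separate handling for 32bit
callers, here is a sketch using struct bpf_program. The 32bit layout and the
conversion helper are hypothetical names invented for this example; the case
only states that the *32 ioctls exist for this purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct bpf_insn {
    uint16_t      code;
    unsigned char jt;
    unsigned char jf;
    uint32_t      k;
};

/* Native layout: the pointer is 8 bytes wide in a 64bit kernel. */
struct bpf_program {
    unsigned int     bf_len;
    struct bpf_insn *bf_insns;
};

/* Hypothetical layout as seen by a 32bit application. */
struct ex_bpf_program32 {
    uint32_t bf_len;
    uint32_t bf_insns;   /* user address, only 32 bits wide */
};

/* Widen the 32bit form into the native form before the driver uses it. */
static void
ex_bpf_program_from32(const struct ex_bpf_program32 *p32,
    struct bpf_program *p)
{
    assert(p32 != NULL && p != NULL);
    p->bf_len = p32->bf_len;
    p->bf_insns = (struct bpf_insn *)(uintptr_t)p32->bf_insns;
}
```

Because the two layouts differ in size on a 64bit kernel, the driver cannot
simply reinterpret the caller's bytes; it has to know which form it was
handed, which is what a distinct ioctl number tells it.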
>
> <net/bpfdesc.h>
> ~~~~~~~~~~~~~~~
> This file contains the definitions for structures used in the kernel
> and is thus private to the project.
>
> Manual page
> ~~~~~~~~~~~
> The manual page for bpf will be delivered to section 7d.
> A copy from BSD is provided with this case.
>
> Interface Table
> ~~~~~~~~~~~~~~~
> Interface                 Commitment        Comments
> ------------------------- ----------------- -------------
> usr/kernel/drv/bpf        Project Private
> usr/kernel/drv/bpf.conf   Project Private
> /dev/bpf                  Uncommitted
> <net/bpf.h>               Committed
> <net/bpfdesc.h>           Project Private
> BPF_MAJOR_VERSION         Committed         <net/bpf.h>
> BPF_MINOR_VERSION         Committed         <net/bpf.h>
> BIOCGBLEN                 Committed         <net/bpf.h>
> BIOCSBLEN                 Committed         <net/bpf.h>
> BIOCSETF                  Committed         <net/bpf.h>
> BIOCFLUSH                 Committed         <net/bpf.h>
> BIOCPROMISC               Committed         <net/bpf.h>
> BIOCGDLT                  Committed         <net/bpf.h>
> BIOCGETIF                 Committed         <net/bpf.h>
> BIOCSETIF                 Committed         <net/bpf.h>
> BIOCSORTIMEOUT            Committed         <net/bpf.h>
> BIOCGORTIMEOUT            Committed         <net/bpf.h>
> BIOCGSTATS                Committed         <net/bpf.h>
> BIOCIMMEDIATE             Committed         <net/bpf.h>
> BIOCVERSION               Committed         <net/bpf.h>
> BIOCSTCPF                 Committed         <net/bpf.h>
> BIOCSUDPF                 Committed         <net/bpf.h>
> BIOCGHDRCMPLT             Committed         <net/bpf.h>
> BIOCSHDRCMPLT             Committed         <net/bpf.h>
> BIOCSDLT                  Committed         <net/bpf.h>
> BIOCGDLTLIST              Committed         <net/bpf.h>
> BIOCGSEESENT              Committed         <net/bpf.h>
> BIOCSSEESENT              Committed         <net/bpf.h>
> BIOCSRTIMEOUT             Committed         <net/bpf.h>
> BIOCGRTIMEOUT             Committed         <net/bpf.h>
> BIOCSETF32                Project Private   <net/bpf.h>
> BIOCGDLTLIST32            Project Private   <net/bpf.h>
> BIOCSRTIMEOUT32           Project Private   <net/bpf.h>
> BIOCGRTIMEOUT32           Project Private   <net/bpf.h>
> struct bpf_dltlist        Committed         <net/bpf.h>
> struct bpf_hdr            Committed         <net/bpf.h>
> struct bpf_insn           Committed         <net/bpf.h>
> struct bpf_program        Committed         <net/bpf.h>
> struct bpf_stat           Committed         <net/bpf.h>
> struct bpf_timeval        Committed         <net/bpf.h>
> struct bpf_version        Committed         <net/bpf.h>
> NH_OBSERVING              Committed
> hook_pkt_observe_t        Committed
> DLT_LOOP_SOLARIS          Committed
> dl_ipnetinfo_v2_t         Committed
>
>
> Structure Definitions
> ~~~~~~~~~~~~~~~~~~~~~
> struct bpf_dltlist
> ------------------
> struct bpf_dltlist {
>     u_int  bfl_len;     /* number of entries in the bfl_list array */
>     u_int *bfl_list;    /* array of DLTs */
> };
>
> struct bpf_hdr
> --------------
> struct bpf_hdr {
>     struct bpf_timeval bh_tstamp;   /* time stamp */
>     uint32_t           bh_caplen;   /* length of captured portion */
>     uint32_t           bh_datalen;  /* original length of packet */
>     uint16_t           bh_hdrlen;   /* length of bpf header (this struct
>                                        plus alignment padding) */
> };
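
Since a single read can return many packets, an application has to walk the
buffer record by record using bh_hdrlen and bh_caplen. The sketch below
builds a synthetic buffer and walks it. The alignment macro and helper names
are hypothetical, and rounding each record up to sizeof(long) is an
assumption borrowed from common BSD practice rather than something this case
states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct bpf_timeval { int32_t tv_sec; int32_t tv_usec; };
struct bpf_hdr {
    struct bpf_timeval bh_tstamp;
    uint32_t bh_caplen;
    uint32_t bh_datalen;
    uint16_t bh_hdrlen;
};

/* Assumed record alignment: round up to a multiple of sizeof(long). */
#define EX_WORDALIGN(x) \
    (((size_t)(x) + (sizeof(long) - 1)) & ~(sizeof(long) - 1))

/* Append one record at offset off; returns the offset of the next one. */
static size_t
ex_put_record(unsigned char *buf, size_t buflen, size_t off,
    const void *pkt, uint32_t caplen, uint32_t datalen)
{
    struct bpf_hdr h;

    assert(caplen <= datalen);
    memset(&h, 0, sizeof(h));
    h.bh_caplen = caplen;
    h.bh_datalen = datalen;
    h.bh_hdrlen = (uint16_t)EX_WORDALIGN(sizeof(h));
    assert(off + EX_WORDALIGN(h.bh_hdrlen + caplen) <= buflen);
    memcpy(buf + off, &h, sizeof(h));
    memcpy(buf + off + h.bh_hdrlen, pkt, caplen);
    return off + EX_WORDALIGN(h.bh_hdrlen + caplen);
}

/* Walk len bytes of records; returns the count and sums the caplens. */
static size_t
ex_walk_records(const unsigned char *buf, size_t len, uint32_t *total_caplen)
{
    size_t off = 0, n = 0;

    *total_caplen = 0;
    while (off + sizeof(struct bpf_hdr) <= len) {
        struct bpf_hdr h;

        memcpy(&h, buf + off, sizeof(h));   /* avoid unaligned access */
        if (h.bh_hdrlen == 0 || off + h.bh_hdrlen + h.bh_caplen > len)
            break;                          /* truncated or corrupt record */
        *total_caplen += h.bh_caplen;
        n++;
        off += EX_WORDALIGN(h.bh_hdrlen + h.bh_caplen);
    }
    return n;
}
```

Note that the packet data starts bh_hdrlen bytes into the record, not
sizeof(struct bpf_hdr) bytes, because the header may carry padding.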
>
> struct bpf_insn
> ---------------
> struct bpf_insn {
>     uint16_t code;  /* Instruction */
>     u_char   jt;    /* Jump true */
>     u_char   jf;    /* Jump false */
>     uint32_t k;     /* space for constant */
> };
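
To make the roles of code, jt, jf and k concrete, here is a deliberately
tiny interpreter that understands only three classic BPF instructions. It is
a teaching sketch, not the driver's filter engine: the function and macro
names are hypothetical, and only the three opcode encodings (0x28, 0x15,
0x06) follow the classic BPF instruction set. The value returned by the
filter is the number of bytes to capture, with 0 meaning "reject".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct bpf_insn {
    uint16_t      code;
    unsigned char jt;
    unsigned char jf;
    uint32_t      k;
};

#define EX_LDH_ABS 0x28   /* load 16-bit big-endian value at offset k */
#define EX_JEQ_K   0x15   /* if (acc == k) skip jt insns, else skip jf */
#define EX_RET_K   0x06   /* return k bytes to capture */

static uint32_t
ex_mini_filter(const struct bpf_insn *prog, size_t ninsn,
    const unsigned char *pkt, size_t pktlen)
{
    uint32_t acc = 0;
    size_t pc = 0;

    assert(prog != NULL);
    while (pc < ninsn) {
        const struct bpf_insn *in = &prog[pc++];

        switch (in->code) {
        case EX_LDH_ABS:
            if (in->k > pktlen || pktlen - in->k < 2)
                return 0;             /* load past end of packet: reject */
            acc = ((uint32_t)pkt[in->k] << 8) | pkt[in->k + 1];
            break;
        case EX_JEQ_K:
            pc += (acc == in->k) ? in->jt : in->jf;
            break;
        case EX_RET_K:
            return in->k;
        default:
            return 0;                 /* unsupported opcode: reject */
        }
    }
    return 0;                         /* fell off the end: reject */
}

/* Accept Ethernet frames carrying IPv4, capturing the first 128 bytes. */
static const struct bpf_insn ex_ipv4_only[] = {
    { EX_LDH_ABS, 0, 0, 12 },         /* acc = EtherType */
    { EX_JEQ_K,   0, 1, 0x0800 },     /* IPv4? fall through : skip one */
    { EX_RET_K,   0, 0, 128 },        /* accept, snap length 128 */
    { EX_RET_K,   0, 0, 0 },          /* reject */
};
```

The 128 in the accepting return is the same "just the first 128 bytes"
choice mentioned in the Introduction: the filter, not the reader, decides
how much of each packet is copied.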
>
> struct bpf_stat
> ---------------
> struct bpf_stat {
>     uint64_t bs_recv;         /* number of packets received */
>     uint64_t bs_drop;         /* number of packets dropped */
>     uint64_t bs_capt;         /* number of packets captured */
>     uint64_t bs_padding[13];
> };
>
> struct bpf_timeval
> ------------------
> struct bpf_timeval {
>     int32_t tv_sec;
>     int32_t tv_usec;
> };
>
> struct bpf_version
> ------------------
> struct bpf_version {
>     u_short bv_major;
>     u_short bv_minor;
> };
>
> struct bpf_program
> ------------------
> struct bpf_program {
>     u_int            bf_len;    /* length of program to load */
>     struct bpf_insn *bf_insns;  /* pointer to program to load */
> };
>
> Structure supplied with NH_OBSERVING events
> -------------------------------------------
> typedef struct hook_pkt_observe_s {
>     uint8_t   hpo_version;
>     uint8_t   hpo_family;
>     uint16_t  hpo_htype;
>     uint32_t  hpo_pktlen;
>     uint32_t  hpo_ifindex;
>     uint32_t  hpo_grifindex;
>     uint32_t  hpo_zsrc;
>     uint32_t  hpo_zdst;
>     mblk_t   *hpo_pkt;
> } hook_pkt_observe_t;
>
> hpo_family - protocol family (AF_INET/AF_INET6)
> hpo_zsrc - zone identifier for the source of the packet
> hpo_zdst - zone identifier for the destination of the packet
> hpo_ifindex - interface index number
> hpo_grifindex - group interface index number (for IPMP interfaces)
> hpo_htype - hook type (in, out, local)
> hpo_pkt - start of the mblk_t chain with the packet
>
>
> struct dl_ipnetinfo_v2 {
>     uint8_t  dli_version;
>     uint8_t  dli_family;
>     uint16_t dli_htype;
>     uint32_t dli_pktlen;
>     uint32_t dli_ifindex;
>     uint32_t dli_grifindex;
>     uint32_t dli_zsrc;
>     uint32_t dli_zdst;
> };
> typedef struct dl_ipnetinfo_v2 dl_ipnetinfo_v2_t;
>
> dli_version - version number (2)
> dli_family - protocol family (AF_INET, AF_INET6, etc)
> dli_htype - hook type (in, out, local)
> dli_pktlen - length of the packet excluding this header
> dli_ifindex - interface index number
> dli_grifindex - group interface index number (for IPMP interfaces)
> dli_zsrc - zone identifier for the source of the packet
> dli_zdst - zone identifier for the destination of the packet
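
A consumer of the version 2 header would typically validate it before
touching the payload. The sketch below shows one way to do that; the helper
name is hypothetical, and it assumes the header fields arrive in host byte
order, which this case does not specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct dl_ipnetinfo_v2 {
    uint8_t  dli_version;
    uint8_t  dli_family;
    uint16_t dli_htype;
    uint32_t dli_pktlen;
    uint32_t dli_ifindex;
    uint32_t dli_grifindex;
    uint32_t dli_zsrc;
    uint32_t dli_zdst;
};
typedef struct dl_ipnetinfo_v2 dl_ipnetinfo_v2_t;

/*
 * Copy the header out of a captured record and return a pointer to the
 * IP packet that follows it, or NULL if the record is too short or is
 * not a version 2 header. Byte order handling is omitted (assumption).
 */
static const unsigned char *
ex_ipnet_v2_payload(const unsigned char *buf, size_t len,
    dl_ipnetinfo_v2_t *hdr)
{
    assert(hdr != NULL);
    if (len < sizeof(*hdr))
        return NULL;
    memcpy(hdr, buf, sizeof(*hdr));   /* avoid unaligned access */
    if (hdr->dli_version != 2)
        return NULL;
    return buf + sizeof(*hdr);
}
```

The dli_pktlen field is not compared against the remaining length here,
because a capture may have been truncated to a shorter snap length.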
>
> References
> ==========
> [1] http://sac.eng.sun.com/sac/PSARC/2006/436
> [2] http://sac.eng.sun.com/sac/PSARC/2006/475,
> http://opensolaris.org/os/project/clearview/ipnet/
> [3] http://arc.opensolaris.org/caselog/PSARC/2008/219
> [4] http://www.tcpdump.org/
> [5] http://arc.opensolaris.org/caselog/PSARC/2008/288
> [6] http://opensolaris.org/os/project/clearview/uv/
> [7] http://sac.eng.sun.com/sac/PSARC/2006/357
>
>