Development inside Linux: Linux Kernel Development II - ARP replicating network driver

Here is the second article in the Linux kernel development series. The majority of this article is from my old blog. I was complaining about the scarcity of information on Linux networking in the old article, but that was before I read Understanding Linux Network Internals. I still think similarly, since some parts of the book are outdated, but there are other books I haven't read yet, like Rami Rosen's book, which may fill the gap. There also seem to be a lot more blog posts and wiki articles, like mine, covering various topics. So if you know how everything works in general, it is quite possible nowadays to find information on a specific mechanism or the implementation of a concept.

I will be talking about the fakeARP driver, which you can access from GitHub [source]. It is a network driver which replies to incoming ARP requests with fake ARP replies. (The link directs you to the tutorial tag in Git; development continues on the development and master branches. If you are interested, you can check out the current status too.)

What you need to know and target audience

Network drivers are quite different from the char drivers we considered in the last post.

Interestingly, you don't need to know any kind of socket programming to write a network device driver.
I'll be talking about an ethernet device driver, so you need to know how layer 2 frames are transmitted in the ethernet protocol. You also need to know how ARP works since our driver is about ARP.
It is best if you know the IP stack up to layer 4. You may want to recall encapsulation and header structures. A reference book (I read Can Okan Dirican's network book many years ago, I recommend it for Turkish readers) or Wireshark may be useful for that.

The intended audience is also different from the previous post. You need to understand interrupt handling and concurrency on the kernel side to be able to follow this one (and as I repeated many times in the previous post, you need to read an actual book on kernel development to understand those). It will be easier for you if you have written a char device driver before. I may also compare some things to how they are handled in char devices (which are accessed using files, unlike network drivers).

The driver we are considering is as crappy as the "Hello, world!" char driver. Again it lacks so many things it should be doing and again this is not the correct way to write network device drivers. (I am writing a proper version of this driver though.)

This post will be mostly about receiving and giving network packets (or more precisely layer 2 frames) to kernel. As I said I read LDD3 to learn how to write drivers and unfortunately most of the info about network drivers is outdated there. I wasted so much time trying to find some sources on NAPI and polling. So I am hoping to help people who are struggling with the same topics.

It was really hard to find up-to-date and useful information about the Linux kernel network API on the web when I wrote this. At one point I felt like I was back in the nineties, when search engines were new and you needed to follow links from web pages to other web pages. Some good documentation exists, but it is scattered, so it sometimes feels like solving a puzzle.

The outdated resources also make it harder. Both the inner workings of Linux's networking subsystem and the networking API have undergone many changes since the beginning of the 2.6 series. I guess those were hard times for driver programmers. That also left behind layers of documentation from different versions of the network subsystem and API, which makes finding useful documentation harder for us.

Other than that I read most of the related function and structure definitions in the code. There are good and explanatory comments in those source files. I also read the bridge driver (controlled by brctl in user-space), good old Realtek 8139 and Intel's e1000e Gbit PCIe ethernet card driver source.

I suggest you do the same: read the struct and function definitions in the source and check out some working driver code you might be familiar with. See the API at work. Don't forget: the most accurate source for the Linux kernel is the Linux kernel source.

Just as I said in the char driver tutorial, things I write here may be wrong or misleading. I may have picked up some outdated info or I might have understood some things wrong. Please warn me in the comments if you see something incorrect.

Let's talk about network drivers a little and then what this module does and how it works specifically.

Preliminaries

What is a network driver?

I said there are different kinds of drivers in my post about writing a char driver. Network drivers are one of the kinds which don't work with direct system calls. They take and give information (frames, datagrams, packets etc.) through special kernel mechanisms for networking.

There is a special struct called sk_buff (abbreviated as skb in variable names) which you need to know well. It represents a packet (or a datagram or a frame, whatever it is called in your layer) and all data transfer between the kernel's network sub-system and your device is done with sk_buff structs.

(By the way I will be referring to all packets as packets from now on regardless of their layer.)

You can think of most network device drivers as a pipe which has two sides. One side faces the kernel. The other side faces the physical port of the device, preferably with a cable (or another medium) connecting your device to another network device outside your computer. Network drivers take each packet arriving at one side and pass it to the other side, after making some adjustments to it if necessary. In this context, rx is when you take a packet from the cable (via hardware) and give it to the kernel. Tx is when you take a packet from the kernel and give it to the hardware (which sends the packet over its cable).

What does our module do?

Our device is again a virtual one and unfortunately cannot enjoy a physical port which connects it to the outside world. The benefit is that you don't have to know about DMA, buses, interrupts etc. It simplifies the code very much which is convenient for a tutorial like this.

Still, we will act as if we are driving a device with a cable and trick the kernel into thinking that we are giving it packets coming from the outside world. Our target is IPv4 ARP packets because their length and layout are constant. You can find lots of resources on ARP and the structure of ARP packets. We will intercept the ARP request packets given to us by the kernel to send over our cable. Then we will change a few octets (network people call bytes "octets" FYI), turning the request into a valid ARP response, and give it back to the kernel as if it was coming from outside.

We'll drop any other packets.
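
To make "change a few octets" concrete, here is a minimal userspace sketch of turning an ARP request frame into a reply in place. It is not the fakeARP code: struct arp_frame, make_fake_arp_reply() and the fake_mac parameter are names made up for illustration, and the layout assumes IPv4 over ethernet (14 octets of ethernet header followed by a 28 octet ARP body).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN       6   /* octets in an ethernet (MAC) address */
#define ARP_OP_REQUEST 1
#define ARP_OP_REPLY   2

/* Ethernet header + IPv4-over-ethernet ARP body. Only uint8_t members,
 * so the compiler adds no padding: 14 + 28 = 42 octets. Multi-octet
 * fields are kept as byte arrays in network (big-endian) order. */
struct arp_frame {
    uint8_t eth_dst[ETH_ALEN];
    uint8_t eth_src[ETH_ALEN];
    uint8_t eth_type[2];       /* 0x0806 for ARP */
    uint8_t htype[2];          /* hardware type, 1 for ethernet */
    uint8_t ptype[2];          /* protocol type, 0x0800 for IPv4 */
    uint8_t hlen;              /* 6 */
    uint8_t plen;              /* 4 */
    uint8_t oper[2];           /* 1 = request, 2 = reply */
    uint8_t sha[ETH_ALEN];     /* sender hardware address */
    uint8_t spa[4];            /* sender protocol (IP) address */
    uint8_t tha[ETH_ALEN];     /* target hardware address */
    uint8_t tpa[4];            /* target protocol (IP) address */
};

/* Rewrites an ARP request into a reply claiming that fake_mac owns the
 * requested IP. Returns 0 on success, -1 if the frame is not an ARP
 * request (the caller would drop such a frame). */
int make_fake_arp_reply(struct arp_frame *f, const uint8_t fake_mac[ETH_ALEN])
{
    uint8_t asked_ip[4];

    assert(f != NULL && fake_mac != NULL);

    if (f->eth_type[0] != 0x08 || f->eth_type[1] != 0x06)
        return -1;
    if (f->oper[0] != 0 || f->oper[1] != ARP_OP_REQUEST)
        return -1;

    memcpy(asked_ip, f->tpa, sizeof(asked_ip));

    /* Ethernet header: send it back to the requester, "from" the fake host. */
    memcpy(f->eth_dst, f->eth_src, ETH_ALEN);
    memcpy(f->eth_src, fake_mac, ETH_ALEN);

    f->oper[1] = ARP_OP_REPLY;

    /* The old sender becomes the target of the reply. */
    memcpy(f->tha, f->sha, ETH_ALEN);
    memcpy(f->tpa, f->spa, sizeof(f->tpa));

    /* The fake host becomes the sender and owns the requested IP. */
    memcpy(f->sha, fake_mac, ETH_ALEN);
    memcpy(f->spa, asked_ip, sizeof(f->spa));
    return 0;
}
```

Every other field (hardware type, protocol type, address lengths) stays as it was in the request, which is why ARP is such a convenient target.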

By the way, I'm planning to make fakeARP into a driver which is capable of faking ICMP ping requests and TCP handshakes too. I think it is possible for a driver to fake a whole network behind it.

I guess we can move onto the data structures and module code now.

Related data structures and associating functions

1 - struct net_device

The first data structure we will consider is struct net_device. It is defined in include/linux/netdevice.h in the source tree. It has certain registration and initialization functions, just like struct cdev for char devices. It also holds pointers to other structs associated with it. Note that drivers written for real hardware also use a device struct for the bus that connects the device to the CPU, like struct pci_dev.

Here is a web reference that might be useful. [Outdated Chinese page] It explains some functions, mostly outdated.

ndo functions

struct net_device_ops is also defined in include/linux/netdevice.h and is similar to the file operations struct in char devices. It contains function hooks for various network operations (most of which you will be familiar with from ifconfig). Just like the file operations struct, this struct provides an interface to the net_device.

It has hooks (function pointers) for functions which interact with kernel's internal network sub-system.

It also has hooks to change device configuration by using utilities like ifconfig, iproute2 or ethtool.

Finally it has methods to return information about the device, for example stats including number of rx/tx packets, number of dropped packets etc.

Info like what is expected from the functions and where they are used is provided just before the net_device_ops struct definition in source code [Link to 3.14]. I will write about the ones we will use but read the comments in source code too, they are a good resource.

priv part

Another struct associated with net_device is the private data struct, which is a custom struct provided by us, the module developers. It is used to hold device-specific information and is allocated automatically together with the net_device struct. Yes, we just define the struct and pass its size to the allocation function; the rest is handled by the net_device functions for each device. When we need to access it, we call netdev_priv(struct net_device *dev) to get a pointer to the struct.
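
The idea of "one allocation, private part right after the device struct" can be modelled in userspace as below. This is only a sketch: model_net_device, model_alloc_netdev(), model_netdev_priv() and struct fake_priv are hypothetical names, and the 32 byte alignment is an assumption meant to mirror the kernel's NETDEV_ALIGN.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_ALIGN 32   /* assumption: same role as the kernel's NETDEV_ALIGN */

/* Stand-in for struct net_device; the real one is far larger. */
struct model_net_device {
    char name[16];
    unsigned long rx_packets;
    unsigned long tx_packets;
};

/* Hypothetical driver-private state, defined by the module developer. */
struct fake_priv {
    uint8_t fake_mac[6];
    unsigned long replies_sent;
};

/* Round n up to the next multiple of MODEL_ALIGN. */
static size_t align_up(size_t n)
{
    return (n + MODEL_ALIGN - 1) & ~(size_t)(MODEL_ALIGN - 1);
}

/* One zeroed allocation holds the device struct and the private area. */
struct model_net_device *model_alloc_netdev(size_t sizeof_priv)
{
    return calloc(1, align_up(sizeof(struct model_net_device)) + sizeof_priv);
}

/* The private area starts at the first aligned offset after the device. */
void *model_netdev_priv(struct model_net_device *dev)
{
    assert(dev != NULL);
    return (char *)dev + align_up(sizeof(struct model_net_device));
}
```

Because both parts live in one block, freeing the device also frees the private part; the driver never allocates or frees its private struct on its own.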

interface stats

We will use the old struct net_device_stats, which is available as the stats member of net_device, to count received and transmitted packets. It is read automatically by ifconfig, but you can write a special function that returns a net_device_stats struct after the module has processed it first. To do that, you need to put a pointer to your function in ndo_get_stats in the net_device_ops struct. You can also use the newer ndo_get_stats64 hook with the rtnl_link_stats64 struct.

ethersetup and destructor

Let's talk about registration and initialization functions a little bit. You can look at module init and exit functions.

We first allocate the net_device struct along with a private section struct using alloc_etherdev(). It is a macro in include/linux/etherdevice.h which ends up calling alloc_etherdev_mqs(), defined in net/ethernet/eth.c, with tx and rx queue counts equal to 1. That in turn calls alloc_netdev_mqs(), which is the generic function for allocating network devices.

alloc_etherdev() names our device "eth%d" and we change it after allocation. %d means a number will be assigned. For example our device name will be eth0 if there was no device named eth before and it will be eth3 if there were 3 network devices named eth before our device.

It uses the ether_setup() initialization function right after allocating memory. It sets up the basic properties of the device according to the ethernet requirements. You can read what it changes in the ether_setup() function definition in net/ethernet/eth.c. [Link to 3.14]

As you can see, we can take care of allocation and initialization in just one function call for ethernet devices. Registration is a separate step, done with register_netdev().

We remove our device from the system using unregister_netdev() function. We can also set net_device->destructor function hook to take care of freeing our device automatically when the unregister function is called. If we set it to free_netdev function it will free the memory allocated to net_device struct and the private part. You can of course add any other jobs that should be taken care of before removing the device.

2 - struct sk_buff (skb)

struct sk_buff is the other fundamental struct in the network subsystem of the Linux kernel. It represents basically all packets/fragments/datagrams. There are some good references for struct sk_buff online. One reference that helped me tremendously is this .pdf file [Link]. The sk_buff functions and structures are defined in net/core/skbuff.c and include/linux/skbuff.h.

I will only cover the parts of sk_buff we will use for our driver but you can be sure they are up-to-date (as of 3.14). For any other function and data structure reference, check out the source code. As I said sk_buff plays a central role in network subsystem and apparently network subsystem has changed many times.

I will talk about the journey of a sk_buff in later sections, let's investigate some fields and functions we are going to use in our driver.

Data/payload fields

Data container part of an sk_buff is some kind of a double ended queue made of unsigned chars. You can reserve space in both the beginning and the end. You can also add new data to both ends. One can see why and how this kind of a data structure makes life easier when working on network packets.

head/end: The beginning and end of the whole reserved area.
data/tail: The beginning and end of the area in use, i.e. the packet's payload at the current layer. There may be written data before skb->data and after skb->tail, but it is not used at that layer. For example, when a packet is first read from a device, skb->data points to the beginning of the ethernet header (or a few octets back due to alignment). When the sk_buff is processed by the driver, skb->data is set to point to the end of the ethernet header before it is fed to the network sub-system (by using eth_type_trans()). When it is transferred to the IP layer, skb->data is set to point to the end of the IP header and so on. What I mean is, there may be actual written data before skb->data and after skb->tail, but it is most likely irrelevant in the layer where the sk_buff is currently being processed.
len/size: len = tail - data, which gives the number of useful bytes in a linear sk_buff. size = end - head, which gives the total reserved space in the sk_buff (you can't use more than this unless you reserve/allocate more space by using related functions). Note that the packets will be memory aligned according to the network device's specification. For example in my case received packets are always aligned to 32 bits (i.e. size is always a multiple of 4).
There are some other fields concerning fragmented skbs. For example skb->data_len gives the number of bytes stored outside the linear area, in page fragments or a fragment list attached to the sk_buff, so it is zero for a fully linear skb. Fragmented IP packets are a different matter, and they are one of the reasons why we are working with ARP packets: those are never fragmented. Reassembling fragmented IP packets is handled in the upper layers by the kernel and it can be quite complicated.
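
As a rough illustration of the four pointers, here is a userspace model of a linear buffer holding a 42 octet ARP frame, before and after the ethernet header is stepped over. The names model_skb, model_receive() and model_strip_eth() are hypothetical; the real struct sk_buff has many more fields.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_HLEN 14   /* octets in an ethernet header */

/* Only the four data pointers of an skb, for a linear buffer. */
struct model_skb {
    unsigned char *head;   /* start of the reserved area */
    unsigned char *data;   /* start of the bytes in use at this layer */
    unsigned char *tail;   /* end of the bytes in use */
    unsigned char *end;    /* end of the reserved area */
};

size_t model_len(const struct model_skb *s)  { return (size_t)(s->tail - s->data); }
size_t model_size(const struct model_skb *s) { return (size_t)(s->end - s->head); }

/* Describe a frame of frame_len octets sitting at the start of buf. */
void model_receive(struct model_skb *s, unsigned char *buf,
                   size_t buf_size, size_t frame_len)
{
    assert(frame_len <= buf_size);
    s->head = buf;
    s->data = buf;
    s->tail = buf + frame_len;
    s->end  = buf + buf_size;
}

/* Step over the ethernet header; its bytes stay in the buffer,
 * they are just no longer part of the "in use" area. */
void model_strip_eth(struct model_skb *s)
{
    assert(model_len(s) >= ETH_HLEN);
    s->data += ETH_HLEN;
}
```

After model_strip_eth() the length drops from 42 to 28 while the reserved size stays the same, which is exactly the "written data before skb->data" situation described above.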

Header Information
A network device takes an incoming packet (from another network device outside the host) in raw binary form and the driver writes all of it to the data section of the sk_buff we mentioned above. The information is in layer 1 encapsulation (in other words maximum encapsulation) in a sense. Therefore we cannot know which protocols the packet has passed through, how many headers it contains nested in each other or where the real payload is. However, as the sk_buff structure is processed by the various network layers inside the kernel, the headers are stripped off one by one. Pointers to where each header is located are recorded in some fields of the sk_buff structure.

Similarly, when some data is fed to a socket from userspace to be sent, it is just data. It is encapsulated by various headers as it travels down the network layers.

The pointers to headers concerning TCP/IP were called h, nh and mac for some time and many documents still mention them. Later, with patch b0e380b1d, their names and types were changed. They are now called transport_header, network_header and mac_header, respectively. Then they were turned into offsets from skb->head with patch 2e07fa9cd. Right now the pointers are generated with the inline functions below:

skb_transport_header(skb) = tcp/udp header (layer 4 header - old skb->h)
skb_network_header(skb) = ip header (layer 3 header - old skb->nh)
skb_mac_header(skb) = ethernet header (layer 2 header - old skb->mac)

Pointers to headers of other protocols (other than TCP/IP) are recorded in the same places according to their layer. Headers of protocols higher than layer 4 are processed by the applications in user-space. Some other fields exist to record pointers to the encapsulated packet's headers in case of tunneling another connection.

Also note that these fields are empty in fresh skbs, they are filled as the packets move up and down the network sub-system.
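
The offset-based header fields can be pictured with the userspace model below. It is a sketch with hypothetical names (model_skb, model_reset_*_header(), model_*_header()); it only mirrors the idea behind the kernel's skb_reset_mac_header() family, which records skb->data - skb->head, and the skb_*_header() accessors, which add the stored offset back to skb->head.

```c
#include <assert.h>
#include <stdint.h>

struct model_skb {
    unsigned char *head;         /* start of the buffer */
    unsigned char *data;         /* current position while parsing */
    uint16_t mac_header;         /* offsets from head, not pointers */
    uint16_t network_header;
    uint16_t transport_header;
};

/* Offset of the current data pointer from the start of the buffer. */
static uint16_t model_offset(const struct model_skb *s)
{
    assert(s->data >= s->head && s->data - s->head <= UINT16_MAX);
    return (uint16_t)(s->data - s->head);
}

/* "The header of this layer starts where data points right now." */
void model_reset_mac_header(struct model_skb *s)       { s->mac_header = model_offset(s); }
void model_reset_network_header(struct model_skb *s)   { s->network_header = model_offset(s); }
void model_reset_transport_header(struct model_skb *s) { s->transport_header = model_offset(s); }

/* Pointers are regenerated from head + offset whenever they are needed. */
unsigned char *model_mac_header(const struct model_skb *s)       { return s->head + s->mac_header; }
unsigned char *model_network_header(const struct model_skb *s)   { return s->head + s->network_header; }
unsigned char *model_transport_header(const struct model_skb *s) { return s->head + s->transport_header; }
```

Storing offsets instead of pointers keeps the fields small, and they stay valid if the whole buffer is moved to a new location, since only head has to be updated.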

[Figure: diagram for an HTTP packet carrying some HTML]

There are two other fields which are filled just before the packet is handed to the network subsystem by the driver, namely pkt_type and protocol. I will introduce them with the eth_type_trans() function while explaining packet reception. But in short, pkt_type holds the purpose of a packet inside the kernel, like a packet to send to another host over a NIC, a packet specifically sent to our host from outside or a packet to use inside the host between userspace programs. protocol is the protocol of the packet as seen by the driver. For ethernet devices this is the EtherType value taken from the ethernet header, for example ETH_P_IP or ETH_P_ARP. Note that there are many EtherType values, including ones for tunnel protocols which carry ethernet over ethernet/IP.
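
A simplified userspace model of that classification step is sketched below. It is not the kernel's eth_type_trans(): model_eth_type_trans(), struct model_rx_info and the MODEL_* constants are hypothetical, old 802.3 length-style frames are ignored, and the EtherType is returned in host byte order for readability (the kernel keeps skb->protocol in network byte order).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_ETH_ALEN 6
#define MODEL_ETH_HLEN 14

enum model_pkt_type {
    MODEL_HOST,        /* addressed to this device */
    MODEL_BROADCAST,   /* addressed to everybody */
    MODEL_MULTICAST,   /* addressed to a group */
    MODEL_OTHERHOST    /* addressed to some other host */
};

struct model_rx_info {
    enum model_pkt_type pkt_type;
    uint16_t protocol;         /* EtherType, host byte order in this model */
    const uint8_t *payload;    /* first octet after the ethernet header */
};

struct model_rx_info model_eth_type_trans(const uint8_t *frame, size_t len,
                                          const uint8_t dev_addr[MODEL_ETH_ALEN])
{
    static const uint8_t bcast[MODEL_ETH_ALEN] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};
    struct model_rx_info info;

    assert(frame != NULL && len >= MODEL_ETH_HLEN);

    if (frame[0] & 0x01) {
        /* The group bit is set: broadcast is the all-ones group address. */
        info.pkt_type = memcmp(frame, bcast, MODEL_ETH_ALEN) == 0
                            ? MODEL_BROADCAST : MODEL_MULTICAST;
    } else if (memcmp(frame, dev_addr, MODEL_ETH_ALEN) != 0) {
        info.pkt_type = MODEL_OTHERHOST;
    } else {
        info.pkt_type = MODEL_HOST;
    }

    /* Octets 12 and 13 of the frame carry the big-endian EtherType. */
    info.protocol = (uint16_t)((frame[12] << 8) | frame[13]);
    info.payload = frame + MODEL_ETH_HLEN;
    return info;
}
```

An ARP request is normally sent to the broadcast address with EtherType 0x0806, so in this model it comes out as MODEL_BROADCAST with protocol 0x0806.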

sk_buff functions

There are lots of functions acting on skbs but I will only talk about the ones useful for our driver.

Alloc/free and copy functions

Allocation and copy functions can be quite confusing because you have the option to clone skbs. Also there are functions to copy/clone and allocate header space.

Copying means making deep copies. It allocates new memory for the reserved space and data sections and copies them from head to end from the original skb. All pointers like skb->head, skb->data, skb->tail, skb->end and the header pointers are re-arranged to point into the newly allocated space. The comments in the source code state that fragmented skbs are united into a whole as a side effect of this process.

Cloning means copying all skb members, including the pointers to the original skb's reserved space and data area. No new allocation is made for the reserved and data spaces and those sections are not copied. Both the original and the clone sk_buff structs share the same data, i.e. they are in fact representing one and the same packet.

Therefore one should call copy if one is going to write to or modify the data or header of the packet. Cloning is useful since most of the time you just need to read stuff. When an skb is cloned, the reference count of the shared data area is incremented and skb_cloned(skb) returns true.

alloc_skb(size, gfp_mask) allocates an skb with data section size bytes long. Of course full size of the skb is larger. Initially it has head = data = tail and end = head + size. So basically it consists of tailroom = size. If you use dev_alloc_skb(size) it always allocates with GFP_ATOMIC.
dev_kfree_skb_any(skb) frees the skb. "_any" means the function may be called in either normal context or interrupt context. If it is called in hard interrupt context (or with interrupts disabled), it defers the actual job of freeing to a softirq. If the skb is still referenced elsewhere, this function just reduces the reference count; the skb is freed when the reference count reaches zero.
skb_clone(skb, gfp_mask) clones the skb as I described above. Allocation of the new sk_buff structure is done according to gfp_mask.
skb_copy(skb, gfp_mask) copies the skb and its data section as I described above. Allocation is done according to gfp_mask.
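
The difference between cloning and copying can be shown with a small userspace model. Everything here is hypothetical (model_skb, model_buf, model_clone(), model_copy() and so on); it only imitates the sharing behaviour described above with a reference count on the data area.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* The shared data area with its own reference count. */
struct model_buf {
    int dataref;
    size_t size;
    unsigned char bytes[];     /* flexible array member holding the data */
};

/* The per-skb metadata, which every clone gets a private copy of. */
struct model_skb {
    struct model_buf *buf;
    size_t data_off;
    size_t len;
};

struct model_skb *model_alloc(size_t size)
{
    struct model_skb *s = malloc(sizeof *s);
    if (!s)
        return NULL;
    s->buf = calloc(1, sizeof *s->buf + size);
    if (!s->buf) {
        free(s);
        return NULL;
    }
    s->buf->dataref = 1;
    s->buf->size = size;
    s->data_off = 0;
    s->len = 0;
    return s;
}

/* Clone: new metadata, same data area, one more reference to it. */
struct model_skb *model_clone(const struct model_skb *orig)
{
    struct model_skb *c = malloc(sizeof *c);
    if (!c)
        return NULL;
    *c = *orig;
    c->buf->dataref++;
    return c;
}

/* Copy: new metadata and a new data area with the same contents. */
struct model_skb *model_copy(const struct model_skb *orig)
{
    struct model_skb *c = model_alloc(orig->buf->size);
    if (!c)
        return NULL;
    memcpy(c->buf->bytes, orig->buf->bytes, orig->buf->size);
    c->data_off = orig->data_off;
    c->len = orig->len;
    return c;
}

int model_cloned(const struct model_skb *s)
{
    return s->buf->dataref > 1;
}

/* The data area is released only when its last user goes away. */
void model_free(struct model_skb *s)
{
    assert(s->buf->dataref > 0);
    if (--s->buf->dataref == 0)
        free(s->buf);
    free(s);
}
```

Writing through a clone changes what the original sees, while a copy is isolated; that is the reason for copying before modifying a packet.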

There are other functions for copying lists of skbs and for copying/cloning headers. You can read them in net/core/skbuff.c in the source code.

data field functions

These functions act on the reserved space and data area. One should be careful not to overflow data area over reserved space. If skb->data pointer gets in front of skb->head or skb->tail pointer goes beyond skb->end then you can get a kernel panic. It seems like a harsh punishment for an error in just one packet but developers explain in the source code how it can be critical.

skb_headroom(): returns the length of headroom space (reserved place in the front) which is equal to skb->data - skb->head.
skb_tailroom(): returns the length of tailroom space (reserved place in the back) which is equal to skb->end - skb->tail.
skb_push(): pulls back the skb->data pointer and opens space in the front of the packet for writing. It can be used to write headers in front of packets when the packet is to be given to a lower layer. You should be careful not to take skb->data in front of skb->head.
skb_pull(): pushes the data pointer forward and stretches the headroom. It is usually used to skip the current layer's header before giving the packet to an upper layer, so that the upper layer can access its own payload directly using skb->data. For example the kernel uses skb_pull(skb, ETH_HLEN) on layer 2 packets before giving them to the IP layer.
skb_put(): pushes skb->tail forward, creating free space to write in the back of the packet. It causes tailroom to shrink of course. You can get kernel panic if you push skb->tail beyond skb->end.
skb_trim(): Similar to skb_put() in that it moves skb->tail, but instead of pushing skb->tail by size bytes it sets skb->tail so that the packet length becomes len = size. It is used to shrink the packet's data section to size bytes, hence it is called skb_trim. If size is larger than the current length it does nothing, so you cannot use it to enlarge the data section; use skb_put() for that.
skb_reserve(): pushes both the skb->data and skb->tail pointers forward but does not copy the contents. For empty skbs it moves data section forward while preserving skb->len. It should not be used in skbs which contain data inside.
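
The pointer arithmetic behind these functions can be tried out in userspace with the model below. It is a sketch with hypothetical model_* names, not the kernel implementation. Overflows that make the kernel panic are modelled with assert(), and model_pull() returns NULL when asked to pull more than the current length.

```c
#include <assert.h>
#include <stddef.h>

struct model_skb {
    unsigned char *head, *data, *tail, *end;
    size_t len;
};

/* Fresh buffer: head = data = tail, everything is tailroom. */
void model_init(struct model_skb *s, unsigned char *buf, size_t size)
{
    s->head = s->data = s->tail = buf;
    s->end = buf + size;
    s->len = 0;
}

size_t model_headroom(const struct model_skb *s) { return (size_t)(s->data - s->head); }
size_t model_tailroom(const struct model_skb *s) { return (size_t)(s->end - s->tail); }

/* Open headroom in an empty buffer by moving data and tail together. */
void model_reserve(struct model_skb *s, size_t n)
{
    assert(s->len == 0 && n <= model_tailroom(s));
    s->data += n;
    s->tail += n;
}

/* Grow the data area at the back; returns where the new bytes start. */
unsigned char *model_put(struct model_skb *s, size_t n)
{
    unsigned char *old_tail = s->tail;
    assert(n <= model_tailroom(s));
    s->tail += n;
    s->len += n;
    return old_tail;
}

/* Grow the data area at the front, e.g. to prepend a header. */
unsigned char *model_push(struct model_skb *s, size_t n)
{
    assert(n <= model_headroom(s));
    s->data -= n;
    s->len += n;
    return s->data;
}

/* Skip n bytes at the front, e.g. a header that was already handled. */
unsigned char *model_pull(struct model_skb *s, size_t n)
{
    if (n > s->len)
        return NULL;
    s->data += n;
    s->len -= n;
    return s->data;
}

/* Shrink the data area to new_len bytes; never enlarges it. */
void model_trim(struct model_skb *s, size_t new_len)
{
    if (s->len > new_len) {
        s->len = new_len;
        s->tail = s->data + new_len;
    }
}
```

A typical transmit-side sequence would be model_reserve() for the headers, model_put() for the payload and model_push() for each header on the way down; the receive side walks the other way with model_pull().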

There are other functions for handling data inside skb lists and for handling header space. But I won't write about them since I don't know them well and they are beyond this article's scope.

I am planning to post other part(s) in December. I will talk about how packet reception/transmission from/to NICs happen with NAPI. And I will also describe how my driver works as a tutorial.

Saturday, November 28, 2015
