Sunday, May 1, 2016

Linux Kernel Development II - ARP replicating network driver - part II

The good thing about this "fakeARP" driver is that we don't need to deal with interrupts, PCI or DMA.

I also wanted to write a longer post about how tx and rx work in general, but it was taking so long that I had to put it aside.

Instead, I will walk through the code from the bottom up.

Init


Our module creates a single net_device instance and registers it with the network subsystem. Normally drivers just init some global stuff, and then the kernel calls the related function to create a new net_device struct as NICs are discovered in the system (like when you plug in a USB ethernet card).
fakedev = alloc_etherdev(sizeof(struct fake_priv)); //just like alloc_netdev but uses ether_setup as the setup function
We first allocate the net_device struct we will use.

As I mentioned in the previous post, alloc_etherdev uses alloc_netdev_mqs() which is the generic function to create a network device and then calls ether_setup() to customize our network device as an ethernet device.
fakedev->destructor = free_netdev;              //called as part of unregister_netdev(). frees mem after unregistering
fakedev_ndo.ndo_start_xmit = &fakeARP_tx;       //function to transmit packets to the other side of the cable
fakedev_ndo.ndo_open = &fakeARP_open;           //function used to "up" the device, ie. when user types ifconfig fkdev0 up
fakedev_ndo.ndo_stop = &fakeARP_stop;           //function used to "down" the device, ie. when user types ifconfig fkdev0 down
fakedev->netdev_ops = &fakedev_ndo;
Then we link our destructor and ndo functions.

The destructor is called when a device is removed from the system, to do the housekeeping. We use the default free_netdev, which frees our priv part too.

struct net_device_ops fakedev_ndo is just like struct file_operations for char devices. It defines which functions to call when the kernel needs to interact with the device.
  • ndo_start_xmit is the function kernel uses to give driver a packet (or more in case of hardware segmentation offload) to send. Driver transfers it to the device and the device transmits the packet over the cable.
  • ndo_open is the function to enable the device when it is set "up" as in "ip link set eth0 up" or "ifconfig eth0 up".
  • ndo_stop is the opposite of ndo_open; it is called when the device is set down.
  • We also have a packet receive function but that is not set inside net_device_ops struct.
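To make the "table of function pointers" idea concrete, here is a minimal userspace sketch. It is not kernel code: toy_dev, toy_dev_ops and core_send are hypothetical names I made up for illustration, and the real struct net_device_ops has many more hooks with different signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are NOT the kernel's types. */
struct toy_dev;

struct toy_dev_ops {
    int (*open)(struct toy_dev *dev);
    int (*stop)(struct toy_dev *dev);
    int (*start_xmit)(const void *frame, size_t len, struct toy_dev *dev);
};

struct toy_dev {
    const struct toy_dev_ops *ops; /* plays the role of netdev_ops */
    int is_up;
    unsigned long tx_packets;
};

static int toy_open(struct toy_dev *dev) { dev->is_up = 1; return 0; }
static int toy_stop(struct toy_dev *dev) { dev->is_up = 0; return 0; }

static int toy_xmit(const void *frame, size_t len, struct toy_dev *dev)
{
    (void)frame;
    (void)len;
    dev->tx_packets++; /* pretend the frame went out on the wire */
    return 0;
}

static const struct toy_dev_ops toy_ops = {
    .open = toy_open,
    .stop = toy_stop,
    .start_xmit = toy_xmit,
};

/* The "core" never calls the driver by name, only through the table. */
static int core_send(struct toy_dev *dev, const void *frame, size_t len)
{
    assert(dev != NULL && dev->ops != NULL);
    if (!dev->is_up)
        return -1; /* device is down, refuse to transmit */
    return dev->ops->start_xmit(frame, len, dev);
}
```

The point of the indirection is that the core can drive any device the same way; only the table differs from driver to driver.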
tmp_priv = netdev_priv(fakedev);
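Then we get a pointer to our private section. It does not need a separate allocation: alloc_etherdev already reserved it together with the net_device, using the size we passed in.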

We have the following in private section of our net_device:
struct napi_struct napi;        //napi_struct is held in priv
struct sk_buff *fakeskb;        //fake packet buffer
struct sk_buff *fakeskb_copy;   //the copy we give to napi
int packet_ready;               //1 if we have a packet to give to NAPI, 0 otherwise 
Here napi is the napi context for our net_device. The socket buffers and packet ready flag are used to create fake ARP replies.
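If it is not obvious how a private section can live "inside" the net_device, the following userspace sketch shows the general trick: one allocation holds the main struct followed by the private area, and a helper returns a pointer just past the (aligned) main struct. All names here (toy_netdev, toy_priv, toy_alloc_dev, toy_priv_of) are hypothetical, and the 32 byte alignment is an assumption of the sketch, not something taken from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_ALIGN 32u /* alignment assumed for this sketch */

/* Hypothetical, heavily simplified stand-in for struct net_device. */
struct toy_netdev {
    char name[16];
    unsigned long tx_packets;
};

/* Hypothetical private data, loosely mirroring the fields listed above. */
struct toy_priv {
    void *fakeskb;
    void *fakeskb_copy;
    int packet_ready;
};

/* Round n up to the next multiple of TOY_ALIGN. */
static size_t toy_align_up(size_t n)
{
    return (n + TOY_ALIGN - 1) & ~((size_t)TOY_ALIGN - 1);
}

/* One zeroed allocation: [ toy_netdev | padding | private area ]. */
static struct toy_netdev *toy_alloc_dev(size_t sizeof_priv)
{
    size_t total = toy_align_up(sizeof(struct toy_netdev)) + sizeof_priv;
    return calloc(1, total);
}

/* The private area starts right after the aligned main struct. */
static void *toy_priv_of(struct toy_netdev *dev)
{
    assert(dev != NULL);
    return (char *)dev + toy_align_up(sizeof(struct toy_netdev));
}
```

A single free() of the device pointer releases both parts, which is why freeing the device also takes care of priv.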
tmp_priv->fakeskb = alloc_skb(42, GFP_KERNEL); //allocate the empty skb we'll arrange as our fake reply
Right after that we allocate the fakeskb buffer which acts as our device's "buffer". I will come back to these later.
//register the device to NAPI system for receive polling
netif_napi_add(fakedev, &(tmp_priv->napi), &fakeARP_poll, 16); //16 is weight used for 10M eth
And finally we add our device to the NAPI polling list so that the kernel can ask for packets when we have some. I will talk more about NAPI in rx.
//everything is set, register the device
ret = register_netdev(fakedev);
Now we have everything set for our one and only net_device and we can register it in the network subsystem.

On exit we just call unregister_netdev() and our destructor free_netdev() also takes care of removing stuff allocated by network subsystem for our device and freeing the net_device and priv structs.

Tx


Basically kernel calls the function hooked to ndo_start_xmit to tell our driver to transmit a packet.

For normal ethernet cards the process goes like this:
  • Kernel puts the packets in a transmit queue and the packets in the queue are given to the driver for transmission by the softirq NET_TX_SOFTIRQ. 
    • In fact a packet is given to the driver directly with sch_direct_xmit() if no packets are queued. 
  • The driver hands it off to the device over DMA. 
  • Kernel continues giving packets to driver for delivery as they come. 
  • If the network hardware is busy sending packets given earlier and its buffers are full then the driver tells the kernel to stop sending packets by calling netif_stop_queue(). 
  • Kernel continues to queue packets in the transmit queue but does not deliver them to the driver. 
  • When the network hardware tells the driver that it can send packets again with an interrupt, the driver tells the kernel that it can start sending packets again.
If you look at the source code, it is much more complicated. The book Understanding Linux Network Internals has the basics right, but most of the code has changed since then for optimization. You can read about some of the changes here.
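The stop/wake handshake described in the list above can be modeled in a few lines of userspace C. This is only a toy state machine under my own assumptions: a ring of 4 slots, and hypothetical names (toy_nic, toy_start_xmit, toy_tx_complete_irq, core_drain). The queue_stopped flag stands in for what netif_stop_queue() and netif_wake_queue() toggle in a real driver.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_RING_SIZE 4 /* assumed hardware ring depth */

enum toy_tx_result { TOY_TX_OK = 0, TOY_TX_BUSY = 1 };

struct toy_nic {
    int in_flight;      /* frames handed to the "hardware", not yet sent */
    bool queue_stopped; /* stands in for the stop/wake queue state */
};

/* Driver transmit hook: accept a frame unless the ring is full. */
static enum toy_tx_result toy_start_xmit(struct toy_nic *nic)
{
    if (nic->in_flight == TOY_RING_SIZE) {
        nic->queue_stopped = true;
        return TOY_TX_BUSY;
    }
    nic->in_flight++;
    if (nic->in_flight == TOY_RING_SIZE)
        nic->queue_stopped = true; /* stop before the core overruns us */
    return TOY_TX_OK;
}

/* "Tx complete" interrupt: the hardware finished one frame. */
static void toy_tx_complete_irq(struct toy_nic *nic)
{
    assert(nic->in_flight > 0);
    nic->in_flight--;
    nic->queue_stopped = false; /* room again, wake the queue */
}

/* Core side: hand frames to the driver only while the queue is running.
 * Returns how many frames from the backlog were accepted. */
static int core_drain(struct toy_nic *nic, int *backlog)
{
    int sent = 0;
    while (*backlog > 0 && !nic->queue_stopped) {
        if (toy_start_xmit(nic) != TOY_TX_OK)
            break;
        (*backlog)--;
        sent++;
    }
    return sent;
}
```

With a backlog of 6 frames, the first core_drain() call hands over 4 and leaves 2 queued; after one toy_tx_complete_irq() exactly one more frame gets through before the queue stops again.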

Segmentation offload is also generally used. If a packet consists of multiple skbs, they are serialized into one big packet. If a big packet would normally have to be divided into smaller chunks because of the MTU, it is not divided. Instead of giving multiple small packets to the driver (and in turn to the hardware), the kernel hands over the big packet as-is and expects the network hardware to divide it into smaller chunks. (wikipedia) (LWN)
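To get a feeling for the numbers, here is a small userspace sketch of the arithmetic the hardware effectively performs when it splits one big payload. The function names are hypothetical, and the 1460 byte segment size in the usage note is an assumption (a 1500 byte MTU minus 20 bytes of IPv4 header and 20 bytes of TCP header, no options).

```c
#include <assert.h>
#include <stddef.h>

/* Number of wire segments needed for payload_len bytes when each
 * segment carries at most mss bytes of payload (rounded up). */
static size_t toy_segment_count(size_t payload_len, size_t mss)
{
    assert(mss > 0);
    return (payload_len + mss - 1) / mss;
}

/* Payload length of segment number index (0-based). */
static size_t toy_segment_len(size_t payload_len, size_t mss, size_t index)
{
    size_t offset = index * mss;
    assert(mss > 0 && offset < payload_len);
    size_t left = payload_len - offset;
    return left < mss ? left : mss;
}
```

For example, a 64000 byte payload with 1460 byte segments gives toy_segment_count(64000, 1460) == 44: 43 full segments and a last one of 1220 bytes. Without offload the driver would be called 44 times instead of once.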

Our process goes like this:
  • Check if the packet is an ARP request. 
  • Create a fake ARP reply to it. 
  • Feed it back to the kernel as if it were coming from outside (in the NAPI rx function).
We work on one packet at a time (for the sake of simplicity).

Let's see what happens when kernel calls our fakeARP_tx function and hands us an skb.
if(tmp_priv->packet_ready) {
    printk(KERN_ALERT "we are waiting for kernel to take our previous ARP reply right now, give the packet back\n");
    netif_stop_queue(dev);
    return NETDEV_TX_BUSY; //tell the kernel we could not process the packet and it should resend it sometime later.
}
We first check whether a fake ARP reply is already waiting. If it is, the kernel has to take that one before we can start working on the next, so we tell the kernel that our buffers are full and that it should not give us any packets until we say so. The kernel then takes the skb back and requeues it in the tx queue.
//check if the packet is an ARP request packet
if(data[12]==0x08 && data[13]==0x06) { //after 12 octets of MAC addrs comes the 2 octet long type part. 0x0806 is ARP
    if(data[20]==0x00 && data[21]==0x01) { //opcode 0x0001 is request, 0x0002 is reply
If we are free to work on a fake ARP reply, we first check whether the packet is an ARP request.
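The same check can be written as a standalone userspace function, which makes it easy to play with. This is a sketch, not the driver's code: toy_is_arp_request and toy_read_be16 are hypothetical names, and the length check is my addition (the snippet above indexes the buffer directly). It also assumes an untagged frame; a VLAN tag would shift the offsets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ARP_FRAME_LEN  42 /* 14 byte ethernet header + 28 byte ARP body */
#define OFF_ETHERTYPE  12 /* after the two 6 octet MAC addresses */
#define OFF_ARP_OPCODE 20 /* 14 + htype(2) + ptype(2) + hlen(1) + plen(1) */

/* Read a 16 bit big-endian (network order) value. */
static uint16_t toy_read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* True only for a complete ethernet frame carrying an ARP request. */
static bool toy_is_arp_request(const uint8_t *data, size_t len)
{
    if (data == NULL || len < ARP_FRAME_LEN)
        return false; /* too short to hold an ethernet + ARP frame */
    if (toy_read_be16(data + OFF_ETHERTYPE) != 0x0806)
        return false; /* not ARP */
    return toy_read_be16(data + OFF_ARP_OPCODE) == 0x0001; /* 1 = request */
}
```

Reading the two octets as one big-endian value keeps the comparison next to the protocol constant (0x0806, 0x0001) instead of spreading it over two byte compares.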
netif_stop_queue(dev); //tell kernel we won't be able to take new packets, we can forge only one at a time

if(!fakeARP(skb)) {
    printk(KERN_ALERT "fake arp reply could not be forged because of some error\n"); //no error case in tutorial
    dev_kfree_skb_any(skb); //free the original packet (will be freed in next net_tx_action if we are in irq context)
    netif_wake_queue(dev);
    return NETDEV_TX_OK; //we already freed the skb, so we must not return NETDEV_TX_BUSY (the kernel would requeue a freed skb)
    //the request is simply dropped; if it is really important they'll send another one
}
If it is an ARP request, we first tell the kernel to stop sending packets and start our forging function, fakeARP().

I won't go into the details of fakeARP() here, but basically we create a fake ARP packet which says that the MAC address CC:CC:CC:CC:CC:CC owns the IP address asked about in the ARP request, and then tell the kernel we have received a new packet.

If everything goes well we free the original skb and tell the kernel the transmit operation was successful and it can give us packets to send again.
dev->stats.tx_packets++;
dev->stats.tx_bytes += skb->len;

//if the packet is not an ARP request we are not interested in it
dev_kfree_skb_any(skb); //oops, we dropped it :D
If it is not an ARP request, we just increase our stats as if we transmitted the packet, free the skb and tell the kernel we have successfully transmitted the packet (by returning NETDEV_TX_OK).

fakearp function


I think the comments inside the code are pretty clear. We take the original packet and change/swap a few octets on it. We dump the result to console with print_hex_dump().
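For readers who do not have the source open, here is a userspace sketch of the kind of octet shuffling involved. It is my own reconstruction from the ARP frame layout, not the fakeARP() function from the driver: toy_forge_arp_reply is a hypothetical name, it works on plain 42 byte buffers instead of skbs, and it assumes the request and reply buffers do not overlap.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ARP_FRAME_LEN 42

/* Offsets inside an ethernet + ARP (IPv4 over ethernet) frame. */
#define OFF_ETH_DST 0
#define OFF_ETH_SRC 6
#define OFF_OPCODE  20
#define OFF_SHA     22 /* sender hardware (MAC) address */
#define OFF_SPA     28 /* sender protocol (IP) address  */
#define OFF_THA     32 /* target hardware (MAC) address */
#define OFF_TPA     38 /* target protocol (IP) address  */

/* Build a reply claiming CC:CC:CC:CC:CC:CC owns the requested IP.
 * req and reply must each hold 42 bytes and must not overlap. */
static void toy_forge_arp_reply(const uint8_t *req, uint8_t *reply)
{
    static const uint8_t fake_mac[6] = {0xCC, 0xCC, 0xCC, 0xCC, 0xCC, 0xCC};

    assert(req != NULL && reply != NULL);
    memcpy(reply, req, ARP_FRAME_LEN);                  /* keep type, htype, ptype, lengths */
    memcpy(reply + OFF_ETH_DST, req + OFF_ETH_SRC, 6);  /* send it back to the requester */
    memcpy(reply + OFF_ETH_SRC, fake_mac, 6);           /* "from" the fake host */
    reply[OFF_OPCODE] = 0x00;
    reply[OFF_OPCODE + 1] = 0x02;                       /* opcode 2 = reply */
    memcpy(reply + OFF_SHA, fake_mac, 6);               /* the answer: fake MAC ... */
    memcpy(reply + OFF_SPA, req + OFF_TPA, 4);          /* ... owns the IP that was asked for */
    memcpy(reply + OFF_THA, req + OFF_SHA, 6);          /* target is the original sender */
    memcpy(reply + OFF_TPA, req + OFF_SPA, 4);
}
```

Copying from a separate request buffer avoids the temporary variables an in-place swap would need, at the cost of one extra 42 byte copy.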
tmp_priv->fakeskb_copy = skb_copy(tmp_priv->fakeskb, GFP_ATOMIC);
Then we copy the skb we modified, headers and data. We create a new skb instead of reusing the original one because the old structure has lots of modified fields. Maybe it is still being used in another part of the kernel, maybe it has a clone. Creating a fresh skb is simpler than resetting the old one.
tmp_priv->fakeskb_copy->protocol = eth_type_trans(tmp_priv->fakeskb_copy, fakedev);
After copying the packet we call eth_type_trans function.

It is a critical function normally called in packet receive interrupt handler of ethernet drivers. The driver creates an skb, puts the raw frame data taken from the hardware inside and calls eth_type_trans to decide the packet protocol, packet type and do some adjustments on the skb.

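The protocol here is whatever the ethernet frame carries (ARP in our case), read from the type field. It turns out there were multiple ethernet framings back then, for example 802.3 frames where this field holds a length instead of a type, and eth_type_trans has to tell them apart.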

Packet type defines what to do with this packet. In our case eth_type_trans just decides whether the packet is a broadcast or multicast packet or a packet to another host which ended up in our network device by mistake.
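The two decisions described above can be sketched in userspace like this. It is a simplified model and not the kernel's eth_type_trans(): the toy_ names are hypothetical, the enum values are my own, and only the destination address and the type/length field are looked at.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's packet type values. */
enum toy_pkt_type {
    TOY_PACKET_HOST,      /* addressed to this device */
    TOY_PACKET_BROADCAST, /* ff:ff:ff:ff:ff:ff */
    TOY_PACKET_MULTICAST, /* group bit set, but not broadcast */
    TOY_PACKET_OTHERHOST  /* unicast to some other MAC address */
};

/* Classify a frame by its destination MAC (the first 6 octets). */
static enum toy_pkt_type toy_classify_dest(const uint8_t *frame,
                                           const uint8_t *dev_mac)
{
    static const uint8_t bcast[6] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};

    assert(frame != NULL && dev_mac != NULL);
    if (memcmp(frame, bcast, 6) == 0)
        return TOY_PACKET_BROADCAST;
    if (frame[0] & 0x01) /* lowest bit of the first octet = group address */
        return TOY_PACKET_MULTICAST;
    if (memcmp(frame, dev_mac, 6) == 0)
        return TOY_PACKET_HOST;
    return TOY_PACKET_OTHERHOST;
}

/* Values of 0x0600 and above in octets 12-13 are EtherTypes;
 * smaller values are an 802.3 length field. */
static bool toy_is_ethertype(uint16_t type_or_len)
{
    return type_or_len >= 0x0600;
}
```

Our forged reply is addressed to the MAC that sent the request, which is the fake device's own address, so in this model it would come out as TOY_PACKET_HOST.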
tmp_priv->packet_ready = 1;

napi_schedule(&(tmp_priv->napi)); //tell napi system we have received packets and it should poll our device some time.
printk(KERN_ALERT "napi scheduled, waiting for poller to take the fake ARP reply\n");
Finally we set packet_ready variable to 1 and tell NAPI that we have some packets to give. NAPI will then call our polling function fakeARP_poll next time it is polling the devices.

I will continue with rx and wrap up this article series in next post.
