8/26/2017
#########
- How do we write kernel space services?
- How do they differ from user space code?
There are 2 ways to add a service:
1. Statically compiling the service in, as part of bzImage.
2. Dynamically loading a module – module programming.
Most service developers (FS, protocols etc.) prefer the dynamic interface of
adding kmods. Without knowing much of the existing source layout, and by
following a specific set of rules, we can insert a new service into the kernel.
A kmod goes into ring 0. So there are no APIs, no syscalls, no library
invocations. Include files for a module must come from the kernel headers;
user-space includes are typically from /usr/include.
The files being included live under the following path:
/lib/modules/4.12.0-041200-generic/build/include/linux/module.h
The kmod subsystem is responsible for cleaning up and maintaining the modules.
Using comments is not mandatory. A module is just a mechanism (a package) to
insert code into kernel space.
The module's Makefile is not a standalone Makefile: it hands mod.c (mod.o)
over to the kernel build system, which makes a module out of it.
EXPORT_SYMBOL_GPL – exported symbols that can't be used by proprietary modules.
sysfs – a logical file system; in it you will find a directory called
"module" (/sys/module) listing the loaded modules.
We are interested in the actual module body.
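A minimal sketch of a module body (the name veda_mod is illustrative); the
kbuild Makefile lines it needs are shown in the trailing comment:

#include <linux/init.h>
#include <linux/module.h>

static int __init veda_init(void)
{
        pr_info("veda_mod: loaded into ring 0\n");
        return 0;                       /* 0 = success, module stays loaded */
}

static void __exit veda_exit(void)
{
        pr_info("veda_mod: cleaned up\n");
}

module_init(veda_init);
module_exit(veda_exit);
MODULE_LICENSE("GPL");          /* needed to use EXPORT_SYMBOL_GPL symbols */

/* Makefile (kbuild, not standalone):
 *   obj-m := veda_mod.o
 *   all:
 *           make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
 */

Build with make, insert with sudo insmod veda_mod.ko, and the module shows
up under /sys/module/veda_mod.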
LDD-1
#####
Insert kernel services as drivers instead of plain functions.
Drivers are virtualized services through which applications interact with
the device. A device driver in a GPOS kernel is basically 2 pieces of code:
- The interface to the application (kernel specific).
This part needs to be aware of the kernel architecture.
- The hardware-specific part (device/bus specific).
This part needs to be aware of how the h/w is physically connected to the
cpu: what protocol to use, directly connected, or via a bus?
The kernel-specific part of a driver can be implemented in 3 different ways.
These are 3 different models of writing code – in other words, 3 ways in
which an application can interact with the driver:
1. Character driver approach (Sync devices).
2. Block driver approach (Storage devices, flash, usb).
3. Network driver (Network, WiFi etc).
1. Char drivers: Implemented for all devices where data exchange happens
synchronously, e.g. mouse, keyboard. Not used for bulk data transfer.
Create a device file and allow the application to interact with the driver
using file operations. The kernel views the driver like a file system
(vfs ops), so the common file* API can be reused.
Device files – 2 types
1. Character devices
2. Block devices
Check /proc/devices for the major numbers already in use.
sudo mknod /dev/veda_cdrv c 300 0
crw-r--r-- 1 root root 300, 0 Aug 26 13:49 /dev/veda_cdrv
Acquiring the major number should be done dynamically; otherwise the driver
may fail when ported. Probe the /proc/devices list and use whichever
major,minor is free to create the device file. This is a requirement for
driver portability, as sketched below.
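A hedged sketch of acquiring the major number dynamically (the name
veda_cdrv is illustrative, and veda_fops is assumed to be filled elsewhere):

#include <linux/fs.h>
#include <linux/module.h>

static int major;                               /* picked by the kernel */
static const struct file_operations veda_fops;  /* filled in elsewhere */

static int __init cdrv_init(void)
{
        /* passing 0 asks the kernel for a free major number */
        major = register_chrdev(0, "veda_cdrv", &veda_fops);
        if (major < 0)
                return major;
        pr_info("veda_cdrv: major %d; mknod /dev/veda_cdrv c %d 0\n",
                major, major);
        return 0;
}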
8/27/2017
#########
8/28/2017
#########
#What is kernel?
The kernel is a set of routines that service the 2 kinds of generated events.
Kernel code can be grouped in 2:
1. Process events/context.
2. Device/Interrupt events/context. Higher priority; the cpu scheduler
is disabled.
A process-context routine is low priority and can be pre-empted.
The example chr_drv_skel.c – the open is in process context.
Single threaded driver.
#Deep dive of char dd open.
open→sys_open→driver_open→char_dev_open()
(normal file) > vfs open > driver specific open.
When drivers are implemented, identify whether the driver is concurrent,
i.e. available to 'n' applications in parallel. The policy has to be coded
that way.
What a driver open should consider during the implementation of the service
(see the sketch after this list):
1. Verify the caller request and enforce the driver policy (concurrent or not).
2. Verify whether the target device (status) is ready or not.
(Push to wake it up if needed.)
3. Check whether resources need to be allocated within the driver
(data structures, interrupt service routines, memory etc).
4. Keep a use-count to track the # of driver invocations if needed. This may
be useful while debugging drivers. "current" is the pointer to the current
pcb, so use it to print the pid of the caller etc.
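A sketch of an open method walking through points 1-4, assuming an
illustrative single-open (non-concurrent) policy:

#include <linux/atomic.h>
#include <linux/fs.h>
#include <linux/sched.h>

static atomic_t use_count = ATOMIC_INIT(0);

static int chr_drv_open(struct inode *inode, struct file *filp)
{
        /* 1. enforce the driver policy: here, only one opener at a time */
        if (atomic_cmpxchg(&use_count, 0, 1) != 0)
                return -EBUSY;
        /* 2. device-ready check and 3. resource allocation would go here */
        /* 4. "current" is the caller's pcb */
        pr_info("chr_drv: open by pid %d\n", current->pid);
        return 0;
}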
#Deep dive of char dd close.
close→sys_close→driver_close→char_dev_release()
(normal file close) > vfs close > driver specific close.
1. Release the resources.
2. Reduce the use-count etc.
3. Basically undo all the operations done in open.
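The matching release, reusing use_count from the open sketch above:

static int chr_drv_release(struct inode *inode, struct file *filp)
{
        /* free resources, then drop the use-count taken in open */
        atomic_set(&use_count, 0);
        return 0;
}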
Devices can be accessed via ports or memory (mapped, graphics card etc).
#Deep dive of char dd read.
read→sys_read→driver_read→char_dev_read()
(normal file read) > vfs read> driver specific read.
1. Verify the read arguments; do validation on the read request.
2. Read data from the device (can be from a port, device memory, bus etc).
3. Whatever data the device returns may need to be processed before it is
returned to the application.
4. Transfer the data to the app buffer.
copy_to_user will transfer the data from the kernel to the user address
space.
5. Return the # of bytes transferred (sketch below).
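A hedged read sketch following steps 1-5 (the 64-byte buffer stands in for
data read from a port or device memory):

#include <linux/kernel.h>
#include <linux/uaccess.h>

static ssize_t chr_drv_read(struct file *filp, char __user *buf,
                            size_t count, loff_t *ppos)
{
        char kbuf[64] = { 0 };          /* 2./3. device data, processed */
        size_t n = min_t(size_t, count, sizeof(kbuf));

        if (n == 0)                     /* 1. validate the request */
                return 0;
        if (copy_to_user(buf, kbuf, n)) /* 4. transfer to the app buffer */
                return -EFAULT;
        *ppos += n;
        return n;                       /* 5. # of bytes transferred */
}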
#Deep dive of char dd write.
write→sys_write→driver_write→char_dev_write()
(normal file write) > vfs write> driver specific write.
1. Verify / validate the write request.
2. Copy the data from the app/userspace to a driver-specific buffer.
3. Process the data (maybe format it?).
4. Write the data to the hardware.
5. Increment *ppos.
6. Return nbytes (the number of bytes that were written); sketch below.
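The mirror-image write sketch, again hedged (the driver buffer is
illustrative):

static ssize_t chr_drv_write(struct file *filp, const char __user *buf,
                             size_t count, loff_t *ppos)
{
        char kbuf[64];
        size_t n = min_t(size_t, count, sizeof(kbuf)); /* 1. validate/clamp */

        if (copy_from_user(kbuf, buf, n))       /* 2. user -> driver buffer */
                return -EFAULT;
        /* 3. process/format, then 4. write the data to the hardware */
        *ppos += n;                             /* 5. increment *ppos */
        return n;                               /* 6. nbytes written */
}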
Many frame buffer (char) drivers implement seek/lseek. It is not possible
to seek beyond the device space.
- Concurrency
When drivers are concurrent, there is a question of parallel access. Take
care of the safety of shared data/resources: the driver routines need to be
made re-entrant, and locking functions etc. come in here.
When multiple applications request the same driver functions in parallel
and the concurrent function guarantees the safety of shared data, the
function is called re-entrant.
Even n++ is not guaranteed to be atomic on all processors. Atomic
instructions take a memory lock and are CPU specific; the atomic macros
expand into CPU-specific instructions, as in the sketch below.
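A minimal sketch of the safe alternative to n++, using the kernel's
atomic_t (the counter name is illustrative):

#include <linux/atomic.h>

static atomic_t n = ATOMIC_INIT(0);

static void count_event(void)
{
        atomic_inc(&n);  /* expands to a CPU-specific locked instruction */
}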
There are 2 main types of locking mechanisms:
1. Wait type, which in turn has 2 forms – semaphore and mutex.
If the lock is held and there is contention on the lock from another
thread, that thread is put into a wait state.
2. Poll type – spinlocks.
They keep spinning (polling) without any wait.
A mutex is more lightweight than a semaphore.
If there is contention on a shared resource between process-context code of
the driver and interrupt-context code of the driver (e.g. an interrupt
handler), use a special variant of spin_lock: spin_lock_irqsave,
spin_lock_irq, or spin_lock_bh (see the sketch below).
The normal spin_lock should be used only for contention between driver code
running in process context.
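A hedged sketch of the irqsave variant protecting data shared between the
driver's process-context path and its interrupt-context path:

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);
static int shared_data;

static void process_context_path(void)
{
        unsigned long flags;

        /* masks local interrupts too, so the ISR can't deadlock us */
        spin_lock_irqsave(&my_lock, flags);
        shared_data++;
        spin_unlock_irqrestore(&my_lock, flags);
}

static void interrupt_context_path(void)
{
        spin_lock(&my_lock);    /* IRQs are already off inside the ISR */
        shared_data++;
        spin_unlock(&my_lock);
}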
8/29/2017
#########
#ioctl – Used for special operations on a device file.
Special operations are device and vendor specific, so we can't pre-declare
function pointers for such special operations.
Up to 255 different cases/functionalities for a single ioctl.
ioctl_list – will list all the available ioctls.
For implementing ioctl on the device, do the following:
1. Identify the special operations on the device.
2. For each special op, create a request command.
Also create a header file so that applications can use the ioctl.
3. Implement all the cases (one per request).
Use the ioctl encoding macros to create a unique ioctl request,command tuple.
There are 4 encoding macros that can be used to arrive at unique numbers
easily (example header below):
_IO(type, nr);
_IOW(type, nr, dataitem)  – the size field is calculated using sizeof
_IOR(type, nr, dataitem)
_IOWR(type, nr, dataitem)
inparam – application writes, driver reads (_IOW, application writes).
outparam – application reads, driver writes (_IOR, application reads).
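A sketch of a shared header using the encoding macros (the 'V' magic, the
command names and the struct are all illustrative):

/* veda_ioctl.h – shared by the driver and its applications */
#include <linux/ioctl.h>

struct veda_cfg { int speed; int mode; };

#define VEDA_MAGIC      'V'                        /* the "type" field */
#define VEDA_RESET      _IO(VEDA_MAGIC, 0)         /* no data item */
#define VEDA_SET_MODE   _IOW(VEDA_MAGIC, 1, int)   /* inparam: app writes */
#define VEDA_GET_STATUS _IOR(VEDA_MAGIC, 2, int)   /* outparam: app reads */
#define VEDA_XCHG_CFG   _IOWR(VEDA_MAGIC, 3, struct veda_cfg)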
8/30/2017
#########
#ioctl – continued.
ioctl > sys_ioctl > do_ioctl [acquires big_kernel_lock()] > fops->ioctl >
chr_drv_ioctl, then unlock().
Because do_ioctl() acquires the big_kernel_lock(), ioctl is rendered single
threaded. There is no way 2 apps/threads will enter the ioctl function in
parallel.
unlocked_ioctl – this function pointer in file_operations is used to make
ioctl concurrent.
/usr/src/linux-headers-4.12.0-041200/include/linux/capability.h – this
defines the various privilege levels; they are called CAP_* constants.
The capable() kernel macro, within the ioctl implementation, can check
whether the required privilege level is held (see the sketch below).
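A hedged unlocked_ioctl sketch combining the illustrative VEDA_* requests
from the header above with a capable() check:

static long chr_drv_ioctl(struct file *filp, unsigned int cmd,
                          unsigned long arg)
{
        int mode;

        switch (cmd) {
        case VEDA_RESET:
                if (!capable(CAP_SYS_ADMIN))    /* privilege level check */
                        return -EPERM;
                /* reset the device */
                return 0;
        case VEDA_SET_MODE:
                if (copy_from_user(&mode, (void __user *)arg, sizeof(mode)))
                        return -EFAULT;
                /* apply the mode */
                return 0;
        default:
                return -ENOTTY;                 /* unknown request */
        }
}

static const struct file_operations veda_fops = {
        .unlocked_ioctl = chr_drv_ioctl,  /* concurrent, no big kernel lock */
};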
8/31/2017
#########
#Semaphores.
Semaphore value = 0 means the semaphore is locked.
Value = 1 means the semaphore is unlocked.
down_interruptible() – will try to acquire the lock, and if it doesn't get
the lock it goes into an interruptible wait state: a wait-queue is created
and the pcb of the calling context is enqueued onto the wait-queue.
up() – e.g. in the driver's write – will make the semaphore available
(the unlock call of the semaphore).
The wait-queues that are part of a semaphore are FIFO. This style of locking
with semaphores is discouraged because it can lead to confusion.
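A small sketch of that pattern (4.12-era API, where DEFINE_SEMAPHORE
creates a binary, unlocked semaphore; the function name is illustrative):

#include <linux/semaphore.h>

static DEFINE_SEMAPHORE(sem);           /* value = 1, i.e. unlocked */

static int do_guarded_work(void)
{
        if (down_interruptible(&sem))   /* enqueue on the FIFO wait-queue */
                return -ERESTARTSYS;    /* woken by a signal instead */
        /* ... use the shared resource ... */
        up(&sem);                       /* make the semaphore available */
        return 0;
}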
#completion locks
Another way to let the callers block if the resource is not available.
#Wait-queue
# wait_event_interruptible –
Will push the caller into an interruptible wait state; the calling
process's pcb is enqueued onto the wait-queue.
wake_up_interruptible() is the routine to wake up:
it will send a signal to all the pcbs that are in the wait-queue.
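A hedged wait-queue sketch; the flag and function names are illustrative:

#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(wq);
static int data_ready;

static int reader_side(void)
{
        /* pcb is enqueued on wq until the condition becomes true */
        if (wait_event_interruptible(wq, data_ready != 0))
                return -ERESTARTSYS;
        return 0;
}

static void writer_side(void)
{
        data_ready = 1;
        wake_up_interruptible(&wq);     /* wake the enqueued pcbs */
}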
9/1/2017
########
Linux provides a way for a driver to deliver messages to applications
asynchronously. poll and select are 2 ways for applications to get the
status of the device.
#async notifications:
Async is not used in all applications; when IO is not high priority, this
method is used. It requires applications to register with the driver for
async messages to be delivered.
Applications get SIGIO from the driver and can handle it with a special
signal handler for SIGIO (a sketch follows).
kill_fasync(), which sends the SIGIO, can be put in the Interrupt Service
Routine (ISR).
poll_wait() is a kernel routine that puts the calling application into a
wait state (in the wait queue).
#Introducing delays in kernel routines.
For kernel debugging, we may need to introduce delays. The options are
(sketched below):
Busy wait – an infinite loop; the cpu stays busy.
Yield cpu time – schedule() relinquishes the cpu time and is often used;
this doesn't waste cpu cycles.
Timeouts – state how long we want to delay by providing an expiry timer;
the process resumes after the timer expires.
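A sketch of the three delay styles in one place (durations are arbitrary):

#include <linux/delay.h>
#include <linux/jiffies.h>
#include <linux/sched.h>

static void demo_delays(void)
{
        mdelay(10);                     /* busy wait: cpu stays busy */

        schedule();                     /* yield: relinquish the cpu */

        /* timeout: sleep until the expiry timer fires */
        set_current_state(TASK_INTERRUPTIBLE);
        schedule_timeout(msecs_to_jiffies(10));
}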
- Interrupts
Hardware devices are connected to the interrupt controller via special
lines called IRQs; devices trigger interrupts using their IRQ line.
x86 provides 16 lines; you can have up to 256 vectors.
The IRQ line assigned to a particular device needs to be known to the
driver writer.
The IRQ descriptor table is a linked list of IRQ descriptors.
9/2/2017
########
lspci -v will list the pci devices.
request_irq() is a kernel function that registers an IRQ handler on a
specified IRQ line. It basically creates an instance of 'irqaction' that
hangs off the IRQ descriptor table.
__init my_init() will call request_irq();
__exit my_exit() will call free_irq(); – as in the sketch below.
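A hedged request_irq/free_irq sketch (the IRQ number and names are
illustrative; &my_isr doubles as the dev_id cookie):

#include <linux/interrupt.h>

#define MY_IRQ 19                       /* must match the device's IRQ line */

static irqreturn_t my_isr(int irq, void *dev_id)
{
        /* service the device */
        return IRQ_HANDLED;
}

static int __init my_init(void)
{
        /* hangs an irqaction off the IRQ descriptor for line MY_IRQ */
        return request_irq(MY_IRQ, my_isr, IRQF_SHARED, "my_dev", &my_isr);
}

static void __exit my_exit(void)
{
        free_irq(MY_IRQ, &my_isr);
}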
do_IRQ() is responsible for invoking the ISR routines.
Refer to interrupt.h, which lists the possible flags for writing ISRs.
If we want our ISR routine to be high priority, IRQF_DISABLED can be used
so that the kernel won't pre-empt it.
Exceptions are synchronous interrupts at the CPU level;
device interrupts are asynchronous.
The vector table contains the list of descriptors:
0-31 – exceptions (page faults etc.), of which
0-19 – architecture-defined exceptions and
20-31 – Intel reserved.
IRQ0 = 32, IRQ1 = 33 (external device interrupts).
For the vectors between 32-127, Linux has one response function called
do_IRQ(). do_IRQ() is a low-level routine: for a particular interrupt it
checks whether there is a handler or not. ISRs are called from do_IRQ().
#How interrupts work in Linux.
do_IRQ() is a function of process 0; process 0 (the kernel) responds to
interrupts. Process 1 (init) is responsible for the creation and management
of processes.
When an interrupt occurs, do_IRQ() queries the interrupt controller and
finds which IRQ line was triggered. The Linux kernel configures do_IRQ()
as the default response function for all external interrupts.
do_IRQ(), as a routine of process 0, is responsible for allocating the
interrupt stack and invoking the appropriate ISR routines. The interrupt
stack comes in 2 flavours: if the kernel is configured to use 8K stacks,
there is no separate interrupt stack; if the kernel is configured with 4K
stacks, a separate interrupt stack is allocated. With 4K there is a
performance hit. By default 8K is the size of the kernel stack; on embedded
Linux, where there may be memory constraints, 4K may be needed.
1. Find the interrupt request line on which the interrupt was triggered
(by querying the interrupt controller).
2. Look up the IRQ descriptor table for the addresses of the registered
interrupt service routines.
3. Invoke the registered ISRs.
4. Enable the IRQ line.
5. Execute other priority work.
[More to come here]
6. Invoke the process scheduler. Until step 6, the process scheduler is
disabled.
Interrupt latency is the total time spent by the system in response to an
interrupt. If interrupt latency is high, application performance may be
impacted: high-priority tasks will starve since the system is spending more
time in interrupts, and one device may block other devices. When the timer
interrupt goes off, other interrupts are disabled.
Interrupt latency == the total amount of time the system spends on the
interrupt.
#Factors contributing to interrupt latency.
a] H/W latency: the amount of time the processor takes to acknowledge the
interrupt and invoke the ISR.
b] Kernel latency: in Linux/Windows, or whenever process 0 is responding,
the time process 0 takes to start an ISR is called kernel latency.
c] ISR latency: the time the ISR routine takes once invoked.
ISRs are usually referred to as interrupt handlers.
d] Soft interrupt latency (bottom half).
e] Scheduler latency:
e.1] Check if any high-pri tasks are waiting in the queue.
e.2] Run signal handlers for the pending signals.
e.3] Give the cpu to the high-pri task.
An RTOS has fixed-time latency for interrupts; a GPOS does not have
fixed-time latency.
#For a NIC, both reception and transmission of a pkt will trigger an
interrupt.
#Sample pseudo-code for the interrupt handler for a NIC on
#reception of a pkt (on a network device); a C sketch follows.
1. Allocate a buffer to hold the packet.
2. Copy from the NIC buffer to kernel memory (or vice-versa).
3. Process the packet, especially the physical header.
4. Queue the pkt, handing it over to the upper protocol layers.
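A pseudo-C rendering of those steps, hedged: PKT_LEN, nic_buf and netdev
are assumed to come from the driver, and real drivers use pre-allocated
rings rather than allocating in the ISR.

static irqreturn_t nic_isr(int irq, void *dev_id)
{
        struct sk_buff *skb;

        skb = dev_alloc_skb(PKT_LEN);   /* 1. buffer to hold the packet */
        if (!skb)
                return IRQ_HANDLED;     /* drop on allocation failure */
        /* 2. copy the packet out of the NIC buffer, e.g.:
         * memcpy_fromio(skb_put(skb, PKT_LEN), nic_buf, PKT_LEN);
         */
        skb->protocol = eth_type_trans(skb, netdev); /* 3. physical header */
        netif_rx(skb);                  /* 4. queue to the upper layers */
        return IRQ_HANDLED;
}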
9/3/2017
########
While designing ISRs the following issues are to be considered.
#Don'ts while writing an ISR routine.
1. Avoid calling dynamic memory allocation routines.
2. Avoid transferring data (synchronously) between 2 buffer blocks.
3. Avoid contending for access to global data structures, because you need
to go through locks.
4. Avoid operations on user space addresses.
5. Avoid calls to the scheduler. While the ISR is running the scheduler is
disabled, hence a call to the scheduler may result in a deadlock and needs
to be avoided.
6. Avoid calls to operations which are non-atomic.
#Do's while writing an ISR routine.
1. Use pre-allocated buffers (skb buffers in network drivers).
2. Consider using DMA whenever data needs to be transferred between the
device and memory.
3. Consider using per-CPU data wherever needed.
4. Identify non-critical work and use appropriate deferred routines to
execute it when the system is idle or at another scheduled time.
Anything h/w specific done within the ISR is critical; anything other than
that is non-critical from the interrupt's perspective.
Linux has bottom halves (RTOSes call BHs "deferred functions"). Basically
steps 3 and 4 of the NIC handler above (processing and enqueuing the
packets to the higher layers) can be deferred; that work need not run with
interrupts disabled on the cpu.
ISR()
{
        schedule_bh(func);      /* defer the non-critical work */
}

func()                          /* soft IRQ */
{
        body;
}
Soft IRQ == a piece of code that runs with IRQs enabled but the scheduler
disabled. It runs in interrupt context.
ISR terminates, IRQs enabled, scheduler enabled – bottom half.
Bottom halves can be of 2 types: those running in soft-interrupt context,
called soft IRQs, and work queues.
9/4/2017
########
#SOFTIRQs
Soft IRQs, tasklets and workqueues are the 3 ways to do deferred execution
of a routine while servicing an interrupt. Non-critical work can be
scheduled there.
#Linux 2.6 bottom halves.
1. Soft IRQ: a way a routine can be scheduled for deferred execution.
It's available to static drivers only, not dynamic drivers. This is a
bottom half which runs with interrupts enabled but pre-emption disabled.
Refer to include/linux/interrupt.h.
All softirqs are instances of the softirq_action structure.
There is a maximum of 32 soft irqs.
#Execution of softirq
1. There is a maximum of 32 allowed softirqs, and there is a per-cpu
pending-softirq list. If the ISR runs on cpu1, then the softirq will run on
cpu1, and so on. When raise_softirq() is called, the specified softirq
instance is enqueued onto the (per-cpu) list of pending softirqs.
Both the top half and the bottom half need to be on the same cpu, otherwise
there will be a cache-line problem.
2. The pending softirq list is cleared (run one by one) by the do_IRQ()
routine immediately AFTER the ISR terminates, with the interrupt lines
enabled. Softirqs run with pre-emption disabled but IRQs enabled. If an
interrupt goes off during a softirq, then do_IRQ() runs re-entrantly and
the softirq gets pre-empted; softirq pre-emption happens on nested
interrupts.
3. When can softirqs run? They can run in 3 different ways:
3.1. A softirq can run in the context of do_IRQ().
3.2. A softirq can also run right after spin_unlock_bh();
do_IRQ() can't run a softirq while the spin_lock_bh() is held.
3.3. A softirq may execute in the context of ksoftirqd (a per-cpu kernel
thread). This is in the process context of the ksoftirqd daemon.
static int __init my_init(void)
{
        open_softirq(mysoftirq, func);  /* register the handler */
        request_irq(...);
}

ISR()
{
        raise_softirq(mysoftirq);
}

func()
{
        ..
        ..
}
- Another example of a softirq.
- Both need to run on the same cpu.

func()
{
        /* write to a list */
        while (1)
                raise_softirq(mysoftirq);  /* <<< legal, though bad */
}

read()
{
        spin_lock_bh();
        /* read from the list */
        spin_unlock_bh();
}
The Linux kernel allows a softirq routine to reschedule itself from within
itself: it's possible to do raise_softirq() within the softirq. ksoftirqd
will also clear the pending softirqs when it gets its cpu slice.
#Question1: Why do we allow a softirq to reschedule itself?
Consider 2 cpus as below:

cpu1:                             cpu2:
read()                            func()
{                                 {
    spin_lock_bh()                    spin_lock_bh()
    /* read from list */              /* write to the list */
    spin_unlock_bh()                  spin_unlock_bh()
}                                 }
Suppose cpu1 has acquired the lock and is reading from the list, and an
interrupt is delivered to cpu2. cpu2 has some bh (softirq) func() and it
tries to acquire the lock. The spin_lock_bh() in func() is going to spin on
cpu2 until the lock becomes available. But cpu2 is in interrupt context and
would be wasting cpu cycles, which is not correct. Hence rescheduling of a
softirq from within the bh is permitted.
Rescheduling the bh is the way to relinquish the cpu from within a bottom
half when a critical resource (like a lock) required for the bottom half's
execution is not available.
#Question2: Can the same softirq (bh) run on 2 cpus in parallel?
#Yes, because the interrupt can be delivered to different cpus.
Use appropriate mutual exclusion locks while writing softirqs.
#Limitations of softirqs.
1. Softirqs are concurrent, i.e. the same softirq can run on 'n' cpus in
parallel.
2. While implementing softirq code, using mutual exclusion locks is
mandatory wherever needed.
3. Using locks in interrupt-context code results in variable interrupt
latency.
These are the reasons why softirqs are not available to modules.
Consider softirqs only when you need concurrent execution in the bottom
half.
2. Tasklets.
A tasklet is a softirq that is guaranteed to be serial. Because of the
serial execution, there is no need for locks. Tasklets are dynamic softirqs
that can be used from within module drivers, without concurrency; they are
always executed serially.
Tasklets = dynamic softirqs – concurrency.
3. Work Queue
These are the 3 different ways of deferring routines to run later during
interrupt processing.
9/5/2017
########
#TASKLETS
Built on top of softirqs.
Represented by 2 softirqs (high and normal priority).
Backed by the tasklet_struct structure.
Created both statically and dynamically.
Less restrictive sync requirements.
#Steps involved in tasklets (full sketch below).
1) Declare the tasklet:
DECLARE_TASKLET(name, func, data);
OR
struct tasklet_struct mytasklet;
tasklet_init(&mytasklet, func, data);
2) Implement the BH routine:
void func(unsigned long data);
3) Schedule the tasklet:
tasklet_schedule(&mytasklet); <<< low priority
OR
tasklet_hi_schedule(&mytasklet); <<< high priority – use if you need it.
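Putting the three steps together in a hedged module-side sketch (4.12-era
tasklet API, where the BH takes an unsigned long; names are illustrative):

#include <linux/interrupt.h>

static void my_bh(unsigned long data)           /* 2) the BH routine */
{
        /* non-critical, deferred part of the interrupt work */
}

static DECLARE_TASKLET(my_tasklet, my_bh, 0);   /* 1) declare */

static irqreturn_t my_isr(int irq, void *dev_id)
{
        /* critical, h/w-specific work first */
        tasklet_schedule(&my_tasklet);          /* 3) schedule (low prio) */
        return IRQ_HANDLED;
}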
#When are tasklets run? – execution policy. Tasklets are executed using the
same policy that applies to softirqs, since the interrupt subsystem of the
kernel views a tasklet as an instance of either HI_SOFTIRQ or
TASKLET_SOFTIRQ. The interrupt subsystem guarantees the following with
regard to the execution of tasklets:
- Tasklets are a multithreaded analogue of BHs (from the interrupt.h file).
- If tasklet_schedule() is called, then the tasklet is guaranteed to be
executed on some cpu at least once after this.
- If the tasklet is already scheduled but its execution has not yet
started, it will be executed only once.
- If the tasklet is already running on another CPU (or schedule is called
from the tasklet itself), it is rescheduled for later.
- A tasklet is strictly serialized with respect to itself, but not with
respect to other tasklets. If the client needs some inter-task
synchronization, he makes it with spinlocks.
2 different tasklets from 2 different drivers can run in parallel, so there
may be a need for synchronization: tasklets are strictly serialized with
respect to themselves but not with respect to other tasklets. Inside a
tasklet, if you are accessing a global data structure, locking is required.
Tasklets can run in the same 3 ways listed for softirqs above (in the
context of do_IRQ(), after spin_unlock_bh(), or in ksoftirqd).
- Workqueues
Workqueues are 2.6-kernel only; tasklets and softirqs were there in 2.0.
Workqueues are instances of the work structure (sketch below).
- Timer bottom half – executes whenever the assigned timeout elapses.
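A hedged workqueue sketch to contrast with the tasklet above; work items
run in process context, so they may sleep (names are illustrative):

#include <linux/workqueue.h>

static void my_work_fn(struct work_struct *work)
{
        /* process-context bottom half: sleeping/locking is allowed here */
}

static DECLARE_WORK(my_work, my_work_fn);

static irqreturn_t my_isr(int irq, void *dev_id)
{
        schedule_work(&my_work);  /* defer to the shared kernel workqueue */
        return IRQ_HANDLED;
}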