Why write a kernel driver?
I wanted to understand what actually happens when usermode calls into the OS. Reading documentation explains the interface, but building the implementation reveals the mechanics. The specific question I was trying to answer: how does data actually flow from a usermode application through the kernel, and what primitives does Windows provide for high-performance IPC? Existing tools like Process Hacker showed me what's possible, but I needed to build it myself to understand the "why" behind every API call.
Architecture overview
WdFilterDrv has two main components: the kernel-mode driver itself and a usermode client that communicates with it. The split looks like this:
- Kernel component — driver loading, IRP handling, memory operations, shared memory setup
- Usermode client — driver I/O, shared memory consumer, request orchestration
Shared memory IPC
The core communication channel is a shared memory region mapped into both kernel and usermode address spaces. I chose this over IOCTL for latency reasons — shared memory eliminates the system call overhead for data transfer, which matters when monitoring game memory at high frequency.
The setup involves creating an MDL (Memory Descriptor List) for a non-paged
pool buffer and then mapping it into the requesting process's usermode
address space via MmMapLockedPagesSpecifyCache. Non-paged pool is always
resident, so the MDL only needs to describe the pages (for example with
MmBuildMdlForNonPagedPool) rather than lock them.
The lifetime is tied to the driver handle: when the usermode process closes
the device handle, the driver cleans up both mappings and frees the underlying
buffer.
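The layout of the region once both sides have it mapped is not described here. As a minimal sketch, assuming a single producer (the driver) and a single consumer (the client), one common arrangement is a ring buffer with free-running head and tail counters. The names `shm_ring`, `ring_init`, `ring_push`, and `ring_pop` are hypothetical, and the code is portable C11 rather than driver code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 8u
static_assert((RING_SLOTS & (RING_SLOTS - 1u)) == 0, "slot count must be a power of two");

/* Hypothetical layout of the shared region: two counters, then the slots. */
typedef struct {
    _Atomic uint32_t head;        /* advanced only by the producer */
    _Atomic uint32_t tail;        /* advanced only by the consumer */
    uint64_t slots[RING_SLOTS];
} shm_ring;

static void ring_init(shm_ring *r) {
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side. Returns false when the ring is full. */
static bool ring_push(shm_ring *r, uint64_t value) {
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS) return false;
    r->slots[head % RING_SLOTS] = value;
    /* Release: the slot write must be visible before the new head is. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}

/* Consumer side. Returns false when the ring is empty. */
static bool ring_pop(shm_ring *r, uint64_t *out) {
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail) return false;
    *out = r->slots[tail % RING_SLOTS];
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}
```

Because each counter has exactly one writer, neither side needs a lock, which is the property that makes this shape attractive for polling at high frequency.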
Manual module mapping
Manual mapping loads a DLL into memory without using the Windows loader, which is useful for injecting code without leaving obvious loader traces. I parse the PE headers manually: reading the DOS header to find the NT headers, extracting section headers, allocating memory with the correct permissions, and copying each section to its virtual address relative to the allocated base. The hard parts are relocation fixing, needed when the image can't load at its preferred base, and import resolution, which requires walking the import descriptors and filling each IAT slot with the function resolved from the target DLL. I don't handle TLS callbacks yet; that's still on the todo list because the documentation is sparse.
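The first step of that header walk can be sketched in portable C. This is an illustration under assumptions, not the mapper's actual code: `pe_layout` and `pe_locate_headers` are hypothetical names, and the field offsets (e_lfanew at 0x3C, a 20-byte file header after the 4-byte signature, 40-byte section headers) come from the PE format itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read little-endian fields byte by byte so alignment and host endianness don't matter. */
static uint16_t rd16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t rd32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

typedef struct {
    uint32_t nt_offset;             /* file offset of the "PE\0\0" signature */
    uint16_t section_count;         /* NumberOfSections from the file header */
    uint32_t first_section_offset;  /* file offset of the first section header */
} pe_layout;

/* Returns 0 on success, -1 if the buffer is not a plausible PE image. */
static int pe_locate_headers(const uint8_t *image, size_t size, pe_layout *out) {
    assert(image != NULL && out != NULL);
    if (size < 0x40 || rd16(image) != 0x5A4D) return -1;    /* "MZ" */
    uint32_t nt = rd32(image + 0x3C);                       /* e_lfanew */
    if (nt > size || size - nt < 24) return -1;             /* signature + file header */
    if (rd32(image + nt) != 0x00004550) return -1;          /* "PE\0\0" */

    const uint8_t *file_hdr = image + nt + 4;
    size_t count = rd16(file_hdr + 2);                      /* NumberOfSections */
    size_t opt_size = rd16(file_hdr + 16);                  /* SizeOfOptionalHeader */
    size_t first = (size_t)nt + 24 + opt_size;
    if (first > size || (size - first) / 40 < count) return -1;  /* table must fit */

    out->nt_offset = nt;
    out->section_count = (uint16_t)count;
    out->first_section_offset = (uint32_t)first;
    return 0;
}
```

Every offset read from the file is bounds-checked before use, since a mapper that trusts header fields will read out of bounds on a truncated or hostile image.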
Process-level memory operations
The driver exposes several memory primitives: reading and writing arbitrary
process memory via MmCopyVirtualMemory, virtual-to-physical
address translation by walking page tables, and direct physical memory
access for hardware analysis. Implementing these taught me about Windows
address space layout: each process has its own page tables, kernel space
is shared, and the memory manager's view of "valid" memory differs from
what usermode APIs allow. The most surprising discovery was how much
kernel code trusts callers to validate addresses before passing them in.
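To make the page-table walk concrete, here is a small sketch of how a virtual address decomposes into table indexes. It assumes x64 4-level paging with 4 KiB pages and ignores large pages; `va_parts`, `split_va`, and `pte_to_physical` are hypothetical names, not functions from the driver.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned pml4;    /* bits 47..39: index into the PML4 */
    unsigned pdpt;    /* bits 38..30: index into the page directory pointer table */
    unsigned pd;      /* bits 29..21: index into the page directory */
    unsigned pt;      /* bits 20..12: index into the page table */
    unsigned offset;  /* bits 11..0:  byte offset within the 4 KiB page */
} va_parts;

/* Split a virtual address into the four 9-bit indexes and the page offset. */
static va_parts split_va(uint64_t va) {
    va_parts p;
    p.pml4   = (unsigned)((va >> 39) & 0x1FF);
    p.pdpt   = (unsigned)((va >> 30) & 0x1FF);
    p.pd     = (unsigned)((va >> 21) & 0x1FF);
    p.pt     = (unsigned)((va >> 12) & 0x1FF);
    p.offset = (unsigned)(va & 0xFFF);
    return p;
}

/* Combine the frame address held in a final-level PTE with the page offset. */
static uint64_t pte_to_physical(uint64_t pte, uint64_t va) {
    assert(pte & 1u);  /* bit 0 is the present bit; the entry must be valid */
    return (pte & 0x000FFFFFFFFFF000ull) | (va & 0xFFFull);
}
```

A full walk repeats the same pattern at each level: read the entry at the computed index, check the present bit, and use its frame address as the base of the next table.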
Synchronization and stability
Kernel code has no safety net. A bad pointer dereference doesn't raise an exception — it blue-screens the machine. The defensive patterns I use:
- IRQL discipline — always know what IRQL you're at and what operations are legal
- ProbeForRead/Write — validate usermode buffers before touching them
- __try/__except — wrap accesses to usermode buffers in SEH blocks; SEH does not catch faults on invalid kernel addresses
- Non-paged pool for shared memory — avoiding page faults at high IRQL
- Reference counting — tracking driver handle lifetimes to prevent use-after-free
The bugs I hit were mostly my own fault: forgetting to check IRQL before
calling ExAllocatePool, passing kernel pointers where the usermode client
expected user pointers, and the classic "forgot to initialize a spinlock"
race condition. WinDbg with kernel symbols saved me every time —
analyzing crash dumps revealed exactly which line caused the BSOD.
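The kind of check that buffer probing performs can be modeled in portable C. This is a sketch under assumptions, not the real routine: `range_is_user` is a hypothetical helper that returns a result instead of raising an exception, and `ASSUMED_USER_LIMIT` is an assumed value for the top of user address space on x64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: highest address (exclusive) a usermode buffer may reach on x64. */
#define ASSUMED_USER_LIMIT 0x00007FFFFFFF0000ull

/* Returns true if [addr, addr + len) is aligned and lies entirely in user space. */
static bool range_is_user(uint64_t addr, uint64_t len, uint64_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);  /* power of two */
    if (len == 0) return true;               /* nothing will be touched */
    if (addr & (align - 1)) return false;    /* misaligned start */
    uint64_t end = addr + len;
    if (end < addr) return false;            /* addr + len wrapped around */
    return end <= ASSUMED_USER_LIMIT;
}
```

The wraparound test matters as much as the limit test: a huge length added to a valid-looking user address can wrap to a small value and slip past a naive comparison.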
The debugging process
Kernel debugging is a different discipline from usermode debugging.
I started with local kernel debugging (requires booting with bcdedit /debug on;
test signing mode is what lets the unsigned driver load), but
quickly moved to two-machine debugging via serial cable for reliability.
The workflow: reproduce the crash, note the bug check code, open the
crash dump in WinDbg, run !analyze -v for automatic diagnosis,
then examine the stack with k and registers with r.
Learning to read Windows kernel structures like _EPROCESS and
_ETHREAD was essential for understanding driver context.
WinDbg commands I used frequently:
    !analyze -v       automatic crash analysis with probable cause
    k                 display call stack
    .reload /f        force reload of symbols
    dt nt!_EPROCESS   display process structure layout
    !process 0 0      list all processes
    !pool             pool allocation analysis
What I got wrong
My biggest assumption was that kernel APIs would validate my inputs.
Most don't, and the ones that do fail loudly: passing a kernel-mode pointer
to ProbeForRead raises an access violation, which becomes a bug check when
the call isn't inside __try/__except. I also underestimated the complexity of
PE relocations: my first manual mapper worked on images that loaded
at their preferred base but crashed on anything else. The fix involved
understanding the delta calculation and applying fixups to every
relocation entry. Finally, I learned the hard way that DriverUnload
must clean up everything, or you'll leak memory and have to reboot to reload.
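The delta calculation and fixup loop for relocations can be sketched as follows. The block format (a 32-bit page RVA, a 32-bit block size, then 16-bit entries with the type in the top 4 bits and the offset in the low 12) is the PE base relocation format; `apply_relocs` is a hypothetical name, and the sketch assumes a 64-bit image on a little-endian host and handles only DIR64 entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REL_ABSOLUTE 0u   /* IMAGE_REL_BASED_ABSOLUTE: padding entry, skipped */
#define REL_DIR64    10u  /* IMAGE_REL_BASED_DIR64: 64-bit absolute address */

/* Applies every fixup in a relocation directory to an already-mapped image.
 * Returns the number of fixups applied, or -1 on malformed input. */
static int apply_relocs(uint8_t *image, size_t image_size,
                        const uint8_t *relocs, size_t relocs_size,
                        uint64_t preferred_base, uint64_t actual_base) {
    assert(image != NULL && relocs != NULL);
    /* Unsigned subtraction wraps, so this also works when the image moved down. */
    uint64_t delta = actual_base - preferred_base;
    int applied = 0;
    size_t pos = 0;

    while (relocs_size - pos >= 8) {
        uint32_t page_rva, block_size;
        memcpy(&page_rva, relocs + pos, 4);
        memcpy(&block_size, relocs + pos + 4, 4);
        if (block_size < 8 || block_size > relocs_size - pos) return -1;

        for (size_t i = 8; i + 2 <= block_size; i += 2) {
            uint16_t entry;
            memcpy(&entry, relocs + pos + i, 2);
            unsigned type = entry >> 12;
            size_t target = (size_t)page_rva + (entry & 0xFFFu);
            if (type == REL_ABSOLUTE) continue;
            if (type != REL_DIR64) return -1;       /* other types not handled here */
            if (target > image_size || image_size - target < 8) return -1;

            uint64_t value;
            memcpy(&value, image + target, 8);      /* pointer as the linker wrote it */
            value += delta;                         /* rebase to the actual load address */
            memcpy(image + target, &value, 8);
            applied++;
        }
        pos += block_size;
    }
    return applied;
}
```

When the image loads at its preferred base the delta is zero and the loop changes nothing, which is why a mapper without this step can appear to work until the preferred address is unavailable.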
What I'd do differently
If I rebuilt this today, I'd use proper IOCTLs for control operations
and reserve shared memory only for bulk data transfer. The current
"everything through shared memory" approach is harder to debug.
I'd also implement a proper logging framework using ETW or a circular
buffer, instead of scattered DbgPrint statements that only
show up in the debugger. Finally, I'd add a usermode test harness
that mocks the kernel interface, making development faster than the
current "compile, sign, install, test, crash, analyze" loop.
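For the IOCTL-based control path, the control codes themselves are just packed bit fields. The sketch below reproduces the bit layout of the WDK's CTL_CODE macro so it builds without the WDK; the `SKETCH_` constants mirror the values of FILE_DEVICE_UNKNOWN, METHOD_BUFFERED, and FILE_ANY_ACCESS, and the two IOCTL names are hypothetical examples rather than codes the driver defines.

```c
#include <assert.h>
#include <stdint.h>

/* Same layout as CTL_CODE: device type, access, function, transfer method. */
#define SKETCH_CTL_CODE(type, function, method, access) \
    (((uint32_t)(type) << 16) | ((uint32_t)(access) << 14) | \
     ((uint32_t)(function) << 2) | (uint32_t)(method))

#define SKETCH_DEVICE_UNKNOWN  0x22u  /* value of FILE_DEVICE_UNKNOWN */
#define SKETCH_METHOD_BUFFERED 0u     /* value of METHOD_BUFFERED */
#define SKETCH_ANY_ACCESS      0u     /* value of FILE_ANY_ACCESS */

/* Hypothetical control operations; function codes from 0x800 up are vendor-defined. */
#define IOCTL_SKETCH_MAP_SHARED \
    SKETCH_CTL_CODE(SKETCH_DEVICE_UNKNOWN, 0x800u, SKETCH_METHOD_BUFFERED, SKETCH_ANY_ACCESS)
#define IOCTL_SKETCH_UNMAP_SHARED \
    SKETCH_CTL_CODE(SKETCH_DEVICE_UNKNOWN, 0x801u, SKETCH_METHOD_BUFFERED, SKETCH_ANY_ACCESS)

static_assert(IOCTL_SKETCH_MAP_SHARED == 0x222000u, "unexpected control code layout");

/* Decoders, handy when a raw code shows up in a trace or crash dump. */
static uint32_t ioctl_device_type(uint32_t code) { return code >> 16; }
static uint32_t ioctl_function(uint32_t code)    { return (code >> 2) & 0xFFFu; }
static uint32_t ioctl_method(uint32_t code)      { return code & 0x3u; }
```

Keeping setup and teardown on explicit codes like these leaves the shared region free to carry only bulk data, which is the split proposed above.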