Why write a kernel driver?

I wanted to understand what actually happens when usermode calls into the OS. Reading documentation explains the interface, but building the implementation reveals the mechanics. The specific question I was trying to answer: how does data actually flow from a usermode application through the kernel, and what primitives does Windows provide for high-performance IPC? Existing tools like Process Hacker showed me what's possible, but I needed to build it myself to understand the "why" behind every API call.

Architecture overview

WdFilterDrv has two main components: the kernel-mode driver itself and a usermode client that communicates with it. The driver owns the device object, the shared memory region, and the memory primitives; the client maps the shared region, submits requests through it, and polls for responses.

Shared memory IPC

The core communication channel is a shared memory region mapped into both kernel and usermode address spaces. I chose this over IOCTL for latency reasons — shared memory eliminates the system call overhead for data transfer, which matters when monitoring game memory at high frequency.

The setup involves allocating a non-paged pool buffer, building an MDL (Memory Descriptor List) that describes it (non-paged pool is already resident, so no probe-and-lock pass is needed), and then mapping it into the requesting process's usermode address space via MmMapLockedPagesSpecifyCache. The lifetime is tied to the device handle: when the usermode process closes it, the driver tears down both mappings and frees the underlying buffer.
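As a sketch, the mapping path looks roughly like this. This is kernel-mode C that only builds against the WDK; the buffer size, pool tag, and function name are illustrative, not the driver's actual code:

```c
#include <ntddk.h>

#define SHM_SIZE 0x1000   /* illustrative size */
#define SHM_TAG  'mhSW'   /* illustrative pool tag */

NTSTATUS MapSharedMemory(PVOID *KernelVa, PVOID *UserVa, PMDL *Mdl)
{
    *KernelVa = ExAllocatePoolWithTag(NonPagedPool, SHM_SIZE, SHM_TAG);
    if (*KernelVa == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    *Mdl = IoAllocateMdl(*KernelVa, SHM_SIZE, FALSE, FALSE, NULL);
    if (*Mdl == NULL) {
        ExFreePoolWithTag(*KernelVa, SHM_TAG);
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    /* Non-paged pool is already resident, so build the MDL without probing. */
    MmBuildMdlForNonPagedPool(*Mdl);

    __try {
        /* Map into the current (requesting) process's usermode space.
           With UserMode, this raises an exception on failure. */
        *UserVa = MmMapLockedPagesSpecifyCache(*Mdl, UserMode, MmCached,
                                               NULL, FALSE, NormalPagePriority);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(*Mdl);
        ExFreePoolWithTag(*KernelVa, SHM_TAG);
        return GetExceptionCode();
    }
    return STATUS_SUCCESS;
}
```

Teardown runs the same steps in reverse: MmUnmapLockedPages, IoFreeMdl, then free the pool buffer.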

The trickiest part of shared memory IPC is synchronization. If the kernel writes while usermode is reading, you get data corruption or worse. I used a simple spinlock in the shared memory header combined with request/response sequence numbers. The usermode client polls for response sequence updates, and the kernel only writes when the lock is held. It's not the most elegant solution, but it taught me why user/kernel boundaries require careful concurrency design.
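The protocol can be simulated in portable C11, with an atomic word standing in for the kernel spinlock. The header layout, field names, and the inlined "kernel" response here are hypothetical stand-ins for the real driver's shared region:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout of the shared-memory header: a spinlock word plus
   request/response sequence numbers, followed by a payload area. */
typedef struct {
    atomic_uint lock;          /* 0 = free, 1 = held                */
    atomic_uint request_seq;   /* bumped by usermode to submit      */
    atomic_uint response_seq;  /* bumped by kernel when data ready  */
    uint32_t    payload;       /* real driver carries a larger buffer */
} shm_header;

/* Spin until the lock word can be flipped from 0 to 1. */
static void shm_lock(shm_header *h) {
    unsigned expected = 0;
    while (!atomic_compare_exchange_weak(&h->lock, &expected, 1))
        expected = 0;   /* CAS rewrites 'expected' on failure */
}

static void shm_unlock(shm_header *h) {
    atomic_store(&h->lock, 0);
}

/* "Kernel" side: write a result under the lock, then publish it by
   advancing response_seq to match the outstanding request. */
static void kernel_respond(shm_header *h, uint32_t value) {
    shm_lock(h);
    h->payload = value;
    atomic_store(&h->response_seq, atomic_load(&h->request_seq));
    shm_unlock(h);
}

/* "Usermode" side: submit a request, then poll until response_seq
   catches up with request_seq, then read the payload under the lock. */
static uint32_t user_request(shm_header *h) {
    unsigned seq = atomic_fetch_add(&h->request_seq, 1) + 1;
    kernel_respond(h, 42);              /* stand-in for the driver's work */
    while (atomic_load(&h->response_seq) != seq)
        ;                               /* a real client would pause/yield */
    shm_lock(h);
    uint32_t v = h->payload;
    shm_unlock(h);
    return v;
}
```

The simulation is single-threaded (the "kernel" responds inline), so it shows the handshake shape, not the contention behavior.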

Manual module mapping

Manual mapping loads a DLL into memory without using the Windows loader, which is useful for injecting code without leaving obvious loader traces. I parse the PE headers manually: reading the DOS header to find the NT headers, extracting section headers, allocating memory with the correct permissions, and copying sections to their virtual addresses. The hard parts are relocation fixups when the image can't load at its preferred base, and import resolution, which requires walking the import directory and writing each resolved function address into the IAT. I don't handle TLS callbacks yet; that's still on the todo list because the documentation is sparse.
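The fixup arithmetic itself is small. Here is a sketch in portable C using simplified stand-ins for the winnt.h relocation structures (the names and the flat entry array are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

#define REL_BASED_DIR64 10  /* 64-bit absolute fixup type */

/* Apply one relocation block: each 16-bit entry packs a 4-bit type in the
   high bits and a 12-bit offset into the page the block describes. */
static void apply_reloc_block(uint8_t *image, uint32_t page_rva,
                              const uint16_t *entries, size_t count,
                              int64_t delta) {
    for (size_t i = 0; i < count; i++) {
        uint16_t type   = entries[i] >> 12;
        uint16_t offset = entries[i] & 0x0FFF;
        if (type == REL_BASED_DIR64) {
            /* Shift the absolute address baked in at link time by the
               difference between actual and preferred base. */
            uint64_t *target = (uint64_t *)(image + page_rva + offset);
            *target += (uint64_t)delta;
        }
        /* type 0 (ABSOLUTE) is alignment padding and is skipped */
    }
}
```

`delta` is simply `actual_base - preferred_base`; apply it to every DIR64 entry in every block and the image works at any load address.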

Process-level memory operations

The driver exposes several memory primitives: reading and writing arbitrary process memory via MmCopyVirtualMemory, virtual-to-physical address translation by walking page tables, and direct physical memory access for hardware analysis. Implementing these taught me about Windows address space layout: each process has its own page tables, kernel space is shared, and the memory manager's view of "valid" memory differs from what user-mode APIs allow. The most surprising discovery was how much kernel code trusts callers to validate addresses before passing them in.
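The translation walk starts by slicing the virtual address into per-level table indices. A portable sketch of that decomposition for 4-level x86-64 paging with 4 KiB pages (the struct and field names are mine, not Windows'):

```c
#include <stdint.h>

/* Decompose a virtual address into the four page-table indices and the
   page offset the walk uses: 9 bits per level, 12-bit page offset. */
typedef struct {
    unsigned pml4, pdpt, pd, pt;
    unsigned offset;
} va_indices;

static va_indices decompose_va(uint64_t va) {
    va_indices ix;
    ix.offset = (unsigned)(va & 0xFFF);          /* bits  0-11 */
    ix.pt     = (unsigned)((va >> 12) & 0x1FF);  /* bits 12-20 */
    ix.pd     = (unsigned)((va >> 21) & 0x1FF);  /* bits 21-29 */
    ix.pdpt   = (unsigned)((va >> 30) & 0x1FF);  /* bits 30-38 */
    ix.pml4   = (unsigned)((va >> 39) & 0x1FF);  /* bits 39-47 */
    return ix;
}
```

The actual walk then reads the entry at each index, checks its present bit, and follows the physical frame it names down to the final page.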

Synchronization and stability

Kernel code has no safety net. A bad pointer dereference doesn't raise a catchable exception by default; it blue-screens the machine. The defensive patterns I use: probe every user pointer before touching it, check the IRQL before calling anything with IRQL requirements, and wrap all access to user memory in structured exception handling.
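The probe-and-catch pattern looks roughly like this in kernel-mode C (WDK-only; the function name and the exact status codes are illustrative):

```c
NTSTATUS CopyFromUser(PVOID Dst, PVOID UserSrc, SIZE_T Len)
{
    /* User addresses may be paged out; don't touch them at high IRQL. */
    if (KeGetCurrentIrql() > APC_LEVEL)
        return STATUS_INVALID_DEVICE_STATE;

    __try {
        ProbeForRead(UserSrc, Len, 1);    /* raises on a bad or kernel VA */
        RtlCopyMemory(Dst, UserSrc, Len); /* may still fault; SEH catches */
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        return GetExceptionCode();
    }
    return STATUS_SUCCESS;
}
```

The probe alone isn't enough: the page can be unmapped between the probe and the copy, which is why the copy itself also sits inside the __try block.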

The bugs I hit were mostly my own fault: forgetting to check IRQL before calling ExAllocatePool, passing kernel pointers where usermode expected user pointers, and the classic "forgot to initialize a spinlock" race condition. WinDbg with kernel symbols saved me every time — analyzing crash dumps revealed exactly which line caused the BSOD.

The debugging process

Kernel debugging is a different discipline from usermode debugging. I started with local kernel debugging (enabled through bcdedit's debug boot options, plus test signing mode so my unsigned driver would load), but quickly moved to two-machine debugging via serial cable for reliability. The workflow: reproduce the crash, note the bug check code, open the crash dump in WinDbg, run !analyze -v for automatic diagnosis, then examine the stack with k and registers with r. Learning to read Windows kernel structures like _EPROCESS and _ETHREAD was essential for understanding driver context.

; WinDbg commands I used frequently
; !analyze -v           ; automatic crash analysis with probable cause
; k                     ; display call stack
; .reload /f            ; force reload of symbols
; dt nt!_EPROCESS       ; display process structure layout
; !process 0 0          ; list all processes
; !pool                 ; pool allocation analysis

What I got wrong

My biggest assumption was that kernel APIs would validate my inputs. They don't. Passing a kernel-mode pointer to ProbeForRead raises an access violation, which, uncaught, means an immediate bug check. I also underestimated the complexity of PE relocations: my first manual mapper worked on images that loaded at their preferred base but crashed on anything else. The fix involved understanding the delta calculation and applying fixups to every relocation entry. Finally, I learned the hard way that DriverUnload must clean up everything, or you'll leak memory and have to reboot to reload.

What I'd do differently

If I rebuilt this today, I'd use proper IOCTLs for control operations and reserve shared memory only for bulk data transfer. The current "everything through shared memory" approach is harder to debug. I'd also implement a proper logging framework using ETW or a circular buffer, instead of scattered DbgPrint statements that only show up in the debugger. Finally, I'd add a usermode test harness that mocks the kernel interface, making development faster than the current "compile, sign, install, test, crash, analyze" loop.