While digging around NUMA (both software and hardware) I found something very interesting. According to Books Online (BOL), soft-NUMA directly affects the number of lazy-writer processes per memory node, so if you experience a bottleneck in lazy writes, you should play with this configuration. Well, according to a blog post by Microsoft's CSS engineers, this is not exactly true. There is one lazy-writer process per hardware NUMA node, i.e. a single process per memory node, and soft-NUMA does not change that. What soft-NUMA does affect is the I/O completion ports.
The next question is what an I/O completion port actually is. I/O completion ports provide an efficient threading model for processing multiple asynchronous I/O requests on a multiprocessor system. When a process creates an I/O completion port, the system creates an associated queue object for threads whose sole purpose is to service these requests. Processes that handle many concurrent asynchronous I/O requests can do so more quickly and efficiently by using I/O completion ports together with a pre-allocated thread pool than by creating a new thread each time an I/O request arrives.
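To make the threading model a bit more concrete, here is a minimal sketch in Python. It is only an analogy, not the Windows API and not what SQL Server does internally: a plain `queue.Queue` stands in for the completion port's queue, and a fixed set of threads created up front stands in for the pre-allocated pool. On Windows the real calls are `CreateIoCompletionPort`, `GetQueuedCompletionStatus` and `PostQueuedCompletionStatus`; the function name `run_completion_pool`, the `(request_id, bytes_transferred)` packet shape and the worker names are all made up for illustration.

```python
import queue
import threading


def run_completion_pool(completions, worker_count=4):
    """Service completion packets with a fixed pool of pre-created threads.

    `completions` is an iterable of hypothetical (request_id, bytes_transferred)
    tuples that stand in for finished asynchronous I/O requests.
    """
    port = queue.Queue()           # stands in for the completion port's queue
    results = []
    results_lock = threading.Lock()
    stop = object()                # sentinel that tells a worker to exit

    def worker():
        while True:
            packet = port.get()    # block until a completion packet arrives
            if packet is stop:
                break
            request_id, bytes_transferred = packet
            with results_lock:
                results.append(
                    (request_id, bytes_transferred, threading.current_thread().name)
                )

    # The pool is created once, before any completion arrives.
    workers = [
        threading.Thread(target=worker, name=f"iocp-worker-{i}")
        for i in range(worker_count)
    ]
    for t in workers:
        t.start()

    # Post the completion packets; whichever worker is free picks each one up.
    for packet in completions:
        port.put(packet)

    # One sentinel per worker so every thread shuts down cleanly.
    for _ in workers:
        port.put(stop)
    for t in workers:
        t.join()

    return results


if __name__ == "__main__":
    for entry in run_completion_pool([(i, 4096) for i in range(10)]):
        print(entry)
```

The point of the sketch is that no thread is created per request: ten completions are handled by the same four threads, which sit blocked on the queue until there is work. Which worker handles which packet is not deterministic.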