I play World of Warcraft. Being a Linux user requires me to run this in Wine, a Windows API translation layer. While Wine is continuously improving, its performance in Direct3D games still leaves much to be desired.
Having very little familiarity with Direct3D or the Wine codebase, I decided to spend a weekend diving straight into the deep end to make things better.
We have a bit of a head start on where to look: since the game runs well on Windows, our bottleneck is likely to be CPU-side or synchronization related (either in NVIDIA’s GL driver, or in Wine). Tools such as nvidia-smi show that GPU utilization is fairly low (30-40%), which supports that suspicion.
But before touching a single line of code, we need visibility. What problem are we trying to solve? Where is the slowness coming from?
One of my favourite tools for answering these questions is perf, a performance-counter-based profiler for Linux. perf gives us insight into how CPU time is distributed and, importantly, which functions are the hottest.
That’s all we need to get started here. I opened WoW and traveled to an area of the game that performed much worse under Wine, running at ~14fps versus ~40fps on Windows. perf shed some light on where our time was spent.
Let’s analyze some of the top offenders, and figure out at a high level how to make things better.
Firstly, wined3d_cs_run leads us head first into how Wine’s Direct3D implementation works internally. wined3d is the library responsible for translating Direct3D calls into OpenGL calls. Newer versions of Wine use a command stream to execute the OpenGL calls after translation from their D3D equivalents. Think of this as a separate thread that executes a queue of draw commands and state updates from other threads. This not only sidesteps the nightmarish multithreaded OpenGL story, but lets us parallelize more effectively and ensures that the resulting GL calls are executed in some serialized ordering.
Right, back to wined3d_cs_run. This is the core of the command stream: it’s just a function that busy-waits on a queue for commands from other threads. Some brief analysis of the source code indicates that it does no real work of interest, other than invoking op handlers for the various commands. But at least we know about command streams now!
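To make the idea concrete, here is a minimal, single-threaded sketch of a command queue with per-op handlers. It is not wined3d's code: the opcode names, the struct cs layout and the cs_emit/cs_drain helpers are all made up for illustration, and the real thing runs the drain loop on its own thread with proper synchronization.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes; the real command set is much larger. */
enum cs_op { CS_OP_SET_STATE, CS_OP_DRAW, CS_OP_COUNT };

struct cs_cmd { enum cs_op op; int arg; };

#define CS_QUEUE_SIZE 64u  /* power of two so indices wrap with a mask */

struct cs {
    struct cs_cmd queue[CS_QUEUE_SIZE];
    size_t head, tail;     /* head: next to execute, tail: next free slot */
    int state;             /* stand-in for GL state */
    int draws;             /* number of draws executed */
    int last_draw_state;   /* state seen by the most recent draw */
};

typedef void (*cs_handler)(struct cs *cs, int arg);

void cs_exec_set_state(struct cs *cs, int arg) { cs->state = arg; }

void cs_exec_draw(struct cs *cs, int arg)
{
    (void)arg;
    cs->last_draw_state = cs->state;
    cs->draws++;
}

/* One handler per opcode, indexed by the opcode itself. */
static const cs_handler cs_handlers[CS_OP_COUNT] = {
    [CS_OP_SET_STATE] = cs_exec_set_state,
    [CS_OP_DRAW] = cs_exec_draw,
};

/* Producer side (the D3D thread). Returns 0 if the queue is full. */
int cs_emit(struct cs *cs, enum cs_op op, int arg)
{
    if (cs->tail - cs->head == CS_QUEUE_SIZE)
        return 0;
    cs->queue[cs->tail++ & (CS_QUEUE_SIZE - 1)] = (struct cs_cmd){ op, arg };
    return 1;
}

/* Consumer side: execute everything queued, in order. Returns the count. */
size_t cs_drain(struct cs *cs)
{
    size_t n = 0;
    while (cs->head != cs->tail) {
        struct cs_cmd *cmd = &cs->queue[cs->head++ & (CS_QUEUE_SIZE - 1)];
        cs_handlers[cmd->op](cs, cmd->arg);
        n++;
    }
    return n;
}
```

The point to take away is the ordering guarantee: whatever order commands were emitted in is the order the handlers run in, no matter which thread emitted them.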
wined3d_resource_map is where things get interesting. Fundamentally, it’s a function that maps a slice of GPU memory into the host’s address space, typically for streaming geometry data or texture uploads. Given a “resource” (which is typically just a handle into some kind of GPU memory), it waits on the command stream thread, where the resource is ultimately mapped with glMapBufferRange.
Intuitively, this makes a lot of sense. We need to wait for the command stream to finish before this function can return; otherwise, where would we get our pointer from? We can’t execute any OpenGL commands to map the resource off the command thread, so we’re forced to wait.
I needed to learn more about how WoW uses its buffer maps to figure out what we could do here.
Enter apitrace. By using apitrace to wrap the execution of WoW under Wine, I could intercept the D3D9 calls it was making prior to hitting wined3d.
Using this data, I was able to construct a simple model of how WoW renders dynamic geometry:
1. While there is room left in the buffer, map it with the D3DLOCK_NOOVERWRITE flag, which promises not to overwrite any data involved in an in-flight draw call.
2. Otherwise, map it with the D3DLOCK_DISCARD flag, which invalidates the buffer’s contents.
3. Copy the new vertex data into the mapping with memcpy.
4. Call DrawIndexedPrimitive on the new segment of vertex data in the buffer.
This technique works very well for ensuring that we never have to wait for the GPU. In fact, Microsoft recommends it.
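Here is a small model of that streaming pattern from the application's side. It is only a sketch: the enum and struct dyn_vb are stand-ins I made up rather than the real D3D9 types, and a plain byte array plays the role of the memory a real Lock() call would return.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for the two lock flags discussed above (hypothetical names). */
enum lock_mode { LOCK_NOOVERWRITE, LOCK_DISCARD };

struct dyn_vb {
    unsigned char *storage; /* stand-in for the pointer returned by a lock */
    size_t size;
    size_t cursor;          /* first byte not yet handed to a draw call */
};

/* Append len bytes of vertex data. Returns the lock mode the app would
 * request and stores the offset the following draw call should read from. */
enum lock_mode dyn_vb_append(struct dyn_vb *vb, const void *data, size_t len,
                             size_t *draw_offset)
{
    enum lock_mode mode = LOCK_NOOVERWRITE;

    assert(len <= vb->size);
    if (vb->cursor + len > vb->size) {
        /* Out of room: discard and start over at the beginning. A real
         * driver would hand back fresh memory here; this model simply
         * reuses the same array. */
        mode = LOCK_DISCARD;
        vb->cursor = 0;
    }
    memcpy(vb->storage + vb->cursor, data, len);
    *draw_offset = vb->cursor;
    vb->cursor += len;
    return mode;
}
```

Each append only ever touches bytes past the cursor, which is exactly the promise that lets the no-overwrite case skip synchronization.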
But the story with Wine is less than ideal. In practice, each of these maps forces the D3D thread to wait on the command stream thread.
This is a textbook example of a pipeline stall. By making the command stream thread and the mapping thread synchronize, we’re wasting time that could instead be spent dispatching more draw calls to the GPU. If the command stream thread is busy, the D3D thread could be waiting a nontrivial amount of time for a response! Not only that, but glMapBufferRange, the actual OpenGL call used to map a buffer, is slow. Mapping the buffers takes a long time even when synchronization is explicitly disabled.
We could solve our problem handily by not having to wait for the CS thread. The question is: how?
Suppose we had access to a large, persistently mapped buffer in the host address space, one that we never had to unmap to make a draw call, and where writes were coherently visible to the GPU without any GL calls. Then:
- If D3DLOCK_NOOVERWRITE was provided, we can return the address of the last persistent mapping for that buffer. D3DLOCK_NOOVERWRITE fundamentally lets us ignore synchronization.
- If D3DLOCK_DISCARD was provided, we can remap the buffer to an unused section of persistently mapped GPU memory.
Enter the holy grail: ARB_buffer_storage. This lets us allocate an immutable section of GPU memory (its size and flags are fixed at creation, not its contents), and allows persistent (always mapped) and coherent (write-through) maps of it. We’re effectively replacing the role of the driver here, which would otherwise handle DISCARD (INVALIDATE in GL) and NOOVERWRITE (UNSYNCHRONIZED in GL) buffer maps itself.
This is an AZDO (approaching zero driver overhead) style GL extension. If you’re interested, check out this article by NVIDIA.
wine-pba (short for persistent buffer allocator) is a set of patches I’ve written that leverages ARB_buffer_storage to implement a GL-free GPU heap allocator, vastly improving the speed of buffer maps.
At device initialization, a very large OpenGL buffer is allocated. This buffer is governed by a simple re-entrant heap allocator that allows both the command stream thread and D3D thread to make allocations and recycle them independently from each other.
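For a feel of what such an allocator has to do, here is a deliberately tiny first-fit sketch that hands out offsets into one big buffer. It is not the allocator from the patches: the names (struct gpu_heap, heap_alloc, heap_free) are hypothetical, it has no locking so it is not re-entrant, and it ignores alignment.

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_MAX_FREE 32
#define HEAP_FAIL ((size_t)-1)

struct range { size_t offset, size; };

struct gpu_heap {
    struct range free_list[HEAP_MAX_FREE]; /* unordered list of free ranges */
    size_t free_count;
};

void heap_init(struct gpu_heap *h, size_t size)
{
    h->free_list[0] = (struct range){ 0, size };
    h->free_count = 1;
}

/* First fit: carve size bytes off the front of the first range that is
 * large enough. Returns the offset, or HEAP_FAIL if nothing fits. */
size_t heap_alloc(struct gpu_heap *h, size_t size)
{
    for (size_t i = 0; i < h->free_count; i++) {
        struct range *r = &h->free_list[i];
        if (r->size < size)
            continue;
        size_t offset = r->offset;
        r->offset += size;
        r->size -= size;
        if (r->size == 0) /* range used up: replace it with the last entry */
            *r = h->free_list[--h->free_count];
        return offset;
    }
    return HEAP_FAIL;
}

/* Return a range to the heap, merging it into a free range that touches it
 * when one is found. Returns 0 if the free list is full. */
int heap_free(struct gpu_heap *h, size_t offset, size_t size)
{
    for (size_t i = 0; i < h->free_count; i++) {
        struct range *r = &h->free_list[i];
        if (r->offset + r->size == offset) { r->size += size; return 1; }
        if (offset + size == r->offset) { r->offset = offset; r->size += size; return 1; }
    }
    if (h->free_count == HEAP_MAX_FREE)
        return 0;
    h->free_list[h->free_count++] = (struct range){ offset, size };
    return 1;
}
```

Because the allocator only deals in offsets, no GL call is needed to "allocate": the CPU address is just the persistent mapping's base plus the offset.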
When a D3DLOCK_DISCARD map is made, the D3D thread immediately asks the allocator for a new slice of GPU memory and returns it. The command stream thread is sent an asynchronous message informing it of the discard, with the new buffer location, so that future draw commands on the command stream thread use it. When this happens, the command stream thread returns its old buffer to the heap allocator, along with a fence that ensures the buffer isn’t reused until the GPU is done with it.
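The fence-guarded recycling can be sketched as a small "retire queue". Everything here is an assumption made for illustration: fences are modelled as a monotonically increasing counter (in GL this would presumably be a sync object created with glFenceSync), and the struct and function names are mine, not the patchset's.

```c
#include <assert.h>
#include <stddef.h>

#define RETIRE_MAX 16

struct retired { size_t offset, size; unsigned long fence; };

struct retire_queue {
    struct retired items[RETIRE_MAX];
    size_t count;
};

/* Called on discard: remember the old slice together with the fence value
 * of the last GPU work that may still read from it. Returns 0 if full. */
int retire_push(struct retire_queue *q, size_t offset, size_t size,
                unsigned long fence)
{
    if (q->count == RETIRE_MAX)
        return 0;
    q->items[q->count++] = (struct retired){ offset, size, fence };
    return 1;
}

/* Release every slice whose fence has signalled. completed is the highest
 * fence value the GPU has finished. Returns the number of slices released. */
size_t retire_collect(struct retire_queue *q, unsigned long completed,
                      void (*release)(size_t offset, size_t size, void *ctx),
                      void *ctx)
{
    size_t kept = 0, released = 0;
    for (size_t i = 0; i < q->count; i++) {
        if (q->items[i].fence <= completed) {
            release(q->items[i].offset, q->items[i].size, ctx);
            released++;
        } else {
            q->items[kept++] = q->items[i]; /* still in flight: keep it */
        }
    }
    q->count = kept;
    return released;
}

/* Example release callback: just totals the bytes handed back. */
void count_released_bytes(size_t offset, size_t size, void *ctx)
{
    (void)offset;
    *(size_t *)ctx += size;
}
```

In a real allocator the release callback would put the slice back on the free list; the important property is that nothing is handed out again while the GPU might still be reading it.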
When a D3DLOCK_NOOVERWRITE map is made, we can just return the buffer’s mapped base address plus the offset desired. Sweet!
Otherwise, we fall back to the old synchronous path, except this time without requiring a call to glMapBufferRange (we only wait on a fence).
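Putting the three cases side by side, the map path boils down to a switch like the one below. Treat it as a rough model only: struct pba_ctx, the bump allocator and the two counters are hypothetical stand-ins for the real heap allocator, the message to the command stream thread, and the fence wait.

```c
#include <assert.h>
#include <stddef.h>

enum map_mode { MAP_SYNC, MAP_NOOVERWRITE, MAP_DISCARD };

struct pba_ctx {
    unsigned char *heap_base; /* CPU address of the persistent mapping */
    size_t heap_size;
    size_t next_free;         /* trivial bump allocator stand-in */
    unsigned discards_sent;   /* messages queued for the CS thread */
    unsigned fence_waits;     /* times the caller had to block */
};

struct pba_buffer { size_t offset, size; }; /* slice owned by one D3D buffer */

/* Returns a CPU pointer for the requested map, or NULL if the heap is full. */
void *pba_map(struct pba_ctx *ctx, struct pba_buffer *buf, size_t map_offset,
              enum map_mode mode)
{
    switch (mode) {
    case MAP_NOOVERWRITE:
        /* The caller promised not to touch in-flight data: no sync at all. */
        break;
    case MAP_DISCARD:
        if (ctx->next_free + buf->size > ctx->heap_size)
            return NULL; /* a real allocator would recycle retired slices */
        buf->offset = ctx->next_free;
        ctx->next_free += buf->size;
        ctx->discards_sent++; /* tell the CS thread where the buffer lives now */
        break;
    case MAP_SYNC:
        ctx->fence_waits++;   /* stand-in for blocking on a fence */
        break;
    }
    return ctx->heap_base + buf->offset + map_offset;
}
```

Only the last case ever blocks, and none of them needs a GL call on the D3D thread.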
So, what does this look like?
Unfortunately, WoW does not have great public-facing benchmarking functionality. I settled for the console command /timetest 1, which measures the average FPS while taking a flight path; no ground NPCs or players are loaded while in flight, which reduces variation between test runs. Additionally, I eyeballed the average idle FPS in various common in-game locations.
These benchmarks were performed on patch 7.3.5 running with 4x SSAA at a resolution of 2560x1440. The CPU is an i5-3570k, and the GPU is a GTX 1070. The graphics preset “7” was chosen (as recommended by the game).
I’m fairly satisfied with the results demonstrated by this benchmark. I hope to update this post with additional frame timing data from a GL intercept tool.
You can find an early prototype of wine-pba at github.com/acomminos/wine-pba. This is far from production quality, and makes several (erroneous) assumptions regarding implementation capabilities. I hope to mainline this once the patchset becomes more mature.
I hope you found this post valuable. I look forward to digging deeper into how to use AZDO techniques to improve wine performance, particularly for uniform updates and texture uploads.