As we will immediately attempt to replace it with an inactive one when
moving the data to the GPU, short-circuit that replacement.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
These only existed to work around an include order problem, when kgem
was intended to be entirely separable from sna. Moving the function
pointer into kgem simplifies matters.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As we check for retirement every time we wake up, it is seldom useful to
check again until we know we have invoked an operation that may block.
But when we do check, we do not want to scan the entire active list
looking for flushing candidates, so track those on a separate list.
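A minimal sketch of the idea, not the driver's actual code: buffers that may need flushing are linked on their own list, and the retire pass is skipped entirely until an operation that may block has been issued. The names `struct bo`, `struct cache`, `cache_track_flushing()` and `cache_retire()` are hypothetical, and `gpu_busy` stands in for what would really be a query to the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real kgem structures differ. */
struct bo {
	struct bo *next_flushing;	/* link used only by the flushing list */
	bool gpu_busy;			/* in practice this is asked of the kernel */
	bool on_flushing_list;
};

struct cache {
	struct bo *flushing;	/* flushing candidates, kept apart from the active list */
	bool need_retire;	/* set once an operation that may block was issued */
};

/* Record a buffer as a flushing candidate and arm the next retire pass. */
static void cache_track_flushing(struct cache *c, struct bo *bo)
{
	assert(!bo->on_flushing_list);
	bo->next_flushing = c->flushing;
	bo->on_flushing_list = true;
	c->flushing = bo;
	c->need_retire = true;
}

/* Walk only the flushing list; return how many buffers were retired. */
static int cache_retire(struct cache *c)
{
	struct bo **prev = &c->flushing;
	int retired = 0;

	if (!c->need_retire)
		return 0;	/* nothing could have changed since the last check */

	while (*prev != NULL) {
		struct bo *bo = *prev;

		if (bo->gpu_busy) {
			prev = &bo->next_flushing;
			continue;
		}

		/* Idle: unlink it from the flushing list. */
		*prev = bo->next_flushing;
		bo->next_flushing = NULL;
		bo->on_flushing_list = false;
		retired++;
	}

	/* Keep checking only while candidates remain. */
	c->need_retire = c->flushing != NULL;
	return retired;
}
```

The cost of a retire pass then scales with the number of flushing candidates rather than with the length of the whole active list.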
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
So that we can simply query it from each of the Zaphod instances without
blocking. Requires a fixed kernel...
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
And bump configure.ac to require dri2proto >= 2.6, because
DRI2BufferStencil and DRI2BufferHiz were introduced in that version.
When a client requests DRI2BufferHiz or DRI2BufferStencil,
I830DRI2CreateBuffer() now returns a Y-tiled buffer. The stencil buffer is
handled as a special case due to its quirky pitch requirements.
CC: Eric Anholt <eric@anholt.net>
CC: Ian Romanick <idr@freedesktop.org>
CC: Kristian Høgsberg <krh@bitplanet.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Signed-off-by: Chad Versace <chad@chad-versace.us>
Before this commit, if a client were to request an unrecognized DRI2
buffer, such as DRI2BufferStencil, then I830DRI2CreateBuffer() allocated
and returned an X-tiled buffer by accident. The problem was that
unrecognized tokens were caught by the default case of a switch statement.
Now, when given unrecognized DRI2 tokens, I830DRI2CreateBuffer() returns
NULL.
This shouldn't break older Mesa versions, because they never query (via
DRI2GetBuffersWithFormat) for the drawable's DRI2BufferStencil.
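The shape of the fix can be sketched as follows. This is an illustration only, not the body of I830DRI2CreateBuffer(): the enums are local stand-ins for the dri2proto attachment tokens and the driver's tiling constants, `choose_tiling()` is a hypothetical helper, and treating the colour buffers as X-tiled is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for illustration; not the real dri2proto values. */
enum attachment {
	ATT_FRONT_LEFT,
	ATT_BACK_LEFT,
	ATT_STENCIL,
	ATT_HIZ,
	ATT_UNKNOWN = 99
};

enum tiling { TILING_X, TILING_Y };

/*
 * Pick a tiling mode for a requested attachment. Every recognised token
 * is listed explicitly, so the default case can only mean "unrecognised"
 * and the caller can fail the request instead of allocating a buffer
 * with whatever tiling the fall-through happened to select.
 */
static bool choose_tiling(unsigned attachment, enum tiling *tiling)
{
	assert(tiling != NULL);

	switch (attachment) {
	case ATT_FRONT_LEFT:
	case ATT_BACK_LEFT:
		*tiling = TILING_X;	/* assumption for this sketch */
		return true;
	case ATT_STENCIL:
	case ATT_HIZ:
		*tiling = TILING_Y;
		return true;
	default:
		return false;		/* caller returns NULL to the client */
	}
}
```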
CC: Eric Anholt <eric@anholt.net>
CC: Ian Romanick <idr@freedesktop.org>
CC: Kenneth Graunke <kenneth@whitecape.org>
CC: Kristian Høgsberg <krh@bitplanet.net>
Signed-off-by: Chad Versace <chad@chad-versace.us>
A left-over from before the surface was embedded into the tail of the
batch, we were only checking for room against the total size of the
batch buffer. So under the wrong set of circumstances we ended up
overwriting surface data with batch and triggering a GPU hang on gen4+.
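To illustrate the difference between the two checks, here is a simplified sketch. It assumes a layout in which commands grow upwards from the start of the buffer and surface state grows downwards from the end; `struct batch` and the helper names are hypothetical and do not mirror the driver's own structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified batch buffer measured in dwords. */
struct batch {
	unsigned nbatch;	/* next free dword for commands (grows up) */
	unsigned surface;	/* lowest dword used by surface state (grows down) */
	unsigned size;		/* total size of the buffer */
};

static void batch_reset(struct batch *b, unsigned size)
{
	b->nbatch = 0;
	b->surface = size;
	b->size = size;
}

/* The left-over check: only compares against the total size. */
static bool batch_has_room_old(const struct batch *b, unsigned ndwords)
{
	return b->nbatch + ndwords <= b->size;
}

/* The corrected check: commands must stop below the surface state. */
static bool batch_has_room(const struct batch *b, unsigned ndwords)
{
	assert(b->nbatch <= b->surface && b->surface <= b->size);
	return ndwords <= b->surface - b->nbatch;
}
```

With 900 dwords of commands and surface state starting at dword 960 in a 1024-dword buffer, the old check accepts a 100-dword request that would run straight through the surface state, whereas the corrected check rejects it.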
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Zaphod support is a rudimentary method for creating an Xserver with
multiple screens from a single device. The Device is instantiated, with
a duplication of its resources, as many as required up to a maximum of
the number of its outputs, and each instance is attached to a Screen
and added to the ServerLayout. A Device can be bound to a selection of
outputs using a comma separated list of RandR names.
Note: in general, this is not the preferred solution, and it will be
superseded by per-crtc-pixmaps in RandR-1.4.
For example, the following xorg.conf fragment creates an Xserver with
two screens, one attached to the LVDS panel on the laptop, and the other
to any external output:
Section "Device"
    Identifier "Intel0"
    Driver "intel"
    BusID "PCI:0:2:0"
    Option "ZaphodHeads" "LVDS1"
    Screen 0
EndSection
Section "Device"
    Identifier "Intel1"
    Driver "intel"
    BusID "PCI:0:2:0"
    Option "ZaphodHeads" "DVI1,VGA1"
    Screen 1
EndSection
Section "Screen"
    Identifier "Screen0"
    Device "Intel0"
EndSection
Section "Screen"
    Identifier "Screen1"
    Device "Intel1"
EndSection
Section "ServerLayout"
    Identifier "default"
    Screen "Screen0"
    Screen "Screen1"
EndSection
Based on a patch by Ben Skeggs <bskeggs@redhat.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
830/845 cannot directly sample from an x8r8g8b8 source, but if we know
that we are only sampling from within the confines of the source then we
force the alpha channel to one. (Outside of the source we require the
sampler to return a==0.)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Valgrind detected that I missed initialising a couple of fields for
use with the generic state emission paths:
==28683== Conditional jump or move depends on uninitialised value(s)
==28683== at 0x83BE646: gen6_get_blend (gen6_render.c:251)
==28683== by 0x83BF769: gen6_emit_state (gen6_render.c:818)
==28683== by 0x83C38ED: gen6_emit_copy_state (gen6_render.c:2280)
==28683== by 0x83C3C89: gen6_render_copy_boxes (gen6_render.c:2356)
==28683== Conditional jump or move depends on uninitialised value(s)
==28683== at 0x83C15C3: gen6_rectangle_begin (gen6_render.c:1458)
==28683== by 0x83C177D: gen6_get_rectangles (gen6_render.c:1502)
==28683== by 0x83C3D16: gen6_render_copy_boxes (gen6_render.c:2363)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
... and so avoid having to move it to the GPU, as seen in the wild. It
looks like I will actually need to handle mixed Render/Core operations
on the frontbuffer.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
We were always terminating the batch with the non-pipelined op, and not
just at the end of a BLT sequence.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Unlike the previous commit removing this style of code, the code in
this one was originally wrong, and would fail to clip in the second
pass of clipping when y was > pbox->y2.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=37233
Reviewed-by: Keith Packard <keithp@keithp.com>
We were clipping each span against the bounds of the clip, throwing
out the span early if it was all clipped, and then walked the clip box
clipping against each of the cliprects. We would expect spans to
typically be clipped against one box, and not thrown out, so we were
not saving any work there. For multiple cliprects, we were adding
work. Only for many spans clipped entirely out of a complicated clip
region would it have saved work, and it clearly didn't save bugs as
evidenced by the many fix attempts here.
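The simplified approach, clipping each span directly against every cliprect with no separate pre-pass against the overall bounds, can be sketched like this. It is an illustration under assumed types, not the server's code: `struct box`, `struct span` and `clip_span()` are hypothetical, and boxes are taken to be half-open (x2 and y2 exclusive).

```c
#include <assert.h>

struct box { int x1, y1, x2, y2; };	/* x2 and y2 are exclusive */
struct span { int x, y, width; };

/*
 * Clip one horizontal span against each cliprect in turn. The visible
 * pieces are written to out[] and their count is returned. A span that
 * misses every box simply produces no output.
 */
static int clip_span(const struct span *s,
		     const struct box *boxes, int nboxes,
		     struct span *out, int max_out)
{
	int i, n = 0;

	assert(s->width >= 0);

	for (i = 0; i < nboxes && n < max_out; i++) {
		const struct box *b = &boxes[i];
		int x1, x2;

		/* Reject boxes that do not cover this scanline. */
		if (s->y < b->y1 || s->y >= b->y2)
			continue;

		/* Intersect the span's extent with the box horizontally. */
		x1 = s->x > b->x1 ? s->x : b->x1;
		x2 = s->x + s->width < b->x2 ? s->x + s->width : b->x2;
		if (x1 >= x2)
			continue;

		out[n].x = x1;
		out[n].y = s->y;
		out[n].width = x2 - x1;
		n++;
	}

	return n;
}
```

For the common case of a single cliprect this does one intersection per span, the same work the bounds pre-pass did on its own.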
Reviewed-by: Keith Packard <keithp@keithp.com>
This saves a copy in the typical PutImage to frontbuffer favoured by
flash. And we also happen to fix a bug if we should be requested to
PutImage outside of the clip region...
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
... also make sure that we flush if we change the blend mode for the CA pass.
Reported-by: Ivan Bulatovic <combuster@archlinux.us>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=37946
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
gen4 dies hard if it has two rectangles in the pipeline, and despite the
stringent and crippling efforts to prevent us from efficiently using the
GPU, I missed a flush before submitting the CA rectangle.
Reported-and-tested-by: Fryderyk Dziarmagowski <fdziarmagowski@gmail.com>
References: https://bugs.freedesktop.org/show_bug.cgi?id=28768
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The premise is that switching between rings (i.e. the BLT and
RENDER rings) on SandyBridge imposes a large latency overhead whilst
rendering. The cause is that in order to switch rings, we need to split
the batch earlier than is desired and to add serialisation between the
rings, both of which incur large overhead.
By switching to using a pure 3D blit engine (ok, not so pure as the BLT
engine still has uses for the core drawing model which can not be easily
represented without a combinatorial explosion of shaders) we can take
advantage of additional efficiencies, such as relative relocations, that
have been incorporated into recent hardware advances. However, even
older hardware performs better from avoiding the implicit context
switches and from the batching efficiency of the 3D pipeline...
But this is X, and PolyGlyphBlt still exists and remains in use. So for
the operations that are not worth accelerating in hardware, we introduce a
shadow buffer mechanism throughout and reintroduce pixmap migration.
Doing this efficiently is the cornerstone of ensuring that we do exploit
the increased potential of recent hardware for running old applications and
environments (i.e. so that the latest and greatest chip is actually faster
than gen2!)
For the curious, sna is SandyBridge's New Acceleration. If you are
running older chipsets and welcome the performance increase offered by
this patch, then you may choose to call it Snazzy instead.
Speedups
========
gen3 firefox-fishtank 1203584.56 (1203842.75 0.01%) -> 85561.71 (125146.44 14.87%): 14.07x speedup
gen5 grads-heat-map 3385.42 (3489.73 1.44%) -> 350.29 (350.75 0.18%): 9.66x speedup
gen3 xfce4-terminal-a1 4179.02 (4180.09 0.06%) -> 503.90 (531.88 4.48%): 8.29x speedup
gen4 grads-heat-map 2458.66 (2826.34 4.64%) -> 348.82 (349.20 0.29%): 7.05x speedup
gen3 grads-heat-map 1443.33 (1445.32 0.09%) -> 298.55 (298.76 0.05%): 4.83x speedup
gen3 swfdec-youtube 3836.14 (3894.14 0.95%) -> 889.84 (979.56 5.99%): 4.31x speedup
gen6 grads-heat-map 742.11 (744.44 0.15%) -> 172.51 (172.93 0.20%): 4.30x speedup
gen3 firefox-talos-svg 71740.44 (72370.13 0.59%) -> 21959.29 (21995.09 0.68%): 3.27x speedup
gen5 gvim 8045.51 (8071.47 0.17%) -> 2589.38 (3246.78 10.74%): 3.11x speedup
gen6 poppler 3800.78 (3817.92 0.24%) -> 1227.36 (1230.12 0.30%): 3.10x speedup
gen6 gnome-terminal-vim 9106.84 (9111.56 0.03%) -> 3459.49 (3478.52 0.25%): 2.63x speedup
gen5 midori-zoomed 9564.53 (9586.58 0.17%) -> 3677.73 (3837.02 2.02%): 2.60x speedup
gen5 gnome-terminal-vim 38167.25 (38215.82 0.08%) -> 14901.09 (14902.28 0.01%): 2.56x speedup
gen5 poppler 13575.66 (13605.04 0.16%) -> 5554.27 (5555.84 0.01%): 2.44x speedup
gen5 swfdec-giant-steps 8941.61 (8988.72 0.52%) -> 3851.98 (3871.01 0.93%): 2.32x speedup
gen5 xfce4-terminal-a1 18956.60 (18986.90 0.07%) -> 8362.75 (8365.70 0.01%): 2.27x speedup
gen5 firefox-fishtank 88750.31 (88858.23 0.14%) -> 39164.57 (39835.54 0.80%): 2.27x speedup
gen3 midori-zoomed 2392.13 (2397.82 0.14%) -> 1109.96 (1303.10 30.35%): 2.16x speedup
gen6 gvim 2510.34 (2513.34 0.20%) -> 1200.76 (1204.30 0.22%): 2.09x speedup
gen5 firefox-planet-gnome 40478.16 (40565.68 0.09%) -> 19606.22 (19648.79 0.16%): 2.06x speedup
gen5 gnome-system-monitor 10344.47 (10385.62 0.29%) -> 5136.69 (5256.85 1.15%): 2.01x speedup
gen3 poppler 2595.23 (2603.10 0.17%) -> 1297.56 (1302.42 0.61%): 2.00x speedup
gen6 firefox-talos-gfx 7184.03 (7194.97 0.13%) -> 3806.31 (3811.66 0.06%): 1.89x speedup
gen5 evolution 8739.25 (8766.12 0.27%) -> 4817.54 (5050.96 1.54%): 1.81x speedup
gen3 evolution 1684.06 (1696.88 0.35%) -> 1004.99 (1008.55 0.85%): 1.68x speedup
gen3 gnome-terminal-vim 4285.13 (4287.68 0.04%) -> 2715.97 (3202.17 13.52%): 1.58x speedup
gen5 swfdec-youtube 5843.94 (5951.07 0.91%) -> 3810.86 (3826.04 1.32%): 1.53x speedup
gen4 poppler 7496.72 (7558.83 0.58%) -> 5125.08 (5247.65 1.44%): 1.46x speedup
gen4 gnome-terminal-vim 21126.24 (21292.08 0.85%) -> 14590.25 (15066.33 1.80%): 1.45x speedup
gen5 firefox-talos-svg 99873.69 (100300.95 0.37%) -> 70745.66 (70818.86 0.05%): 1.41x speedup
gen4 firefox-planet-gnome 28205.10 (28304.45 0.27%) -> 19996.11 (20081.44 0.56%): 1.41x speedup
gen5 firefox-talos-gfx 93070.85 (93194.72 0.10%) -> 67687.93 (70374.37 1.30%): 1.37x speedup
gen4 evolution 6696.25 (6854.14 0.85%) -> 4958.62 (5027.73 0.85%): 1.35x speedup
gen3 swfdec-giant-steps 2538.03 (2539.30 0.04%) -> 1895.71 (2050.62 62.43%): 1.34x speedup
gen4 gvim 4356.18 (4422.78 0.70%) -> 3276.31 (3281.69 0.13%): 1.33x speedup
gen6 evolution 1242.13 (1245.44 0.72%) -> 953.76 (954.54 0.07%): 1.30x speedup
gen6 firefox-planet-gnome 4554.23 (4560.69 0.08%) -> 3758.76 (3768.97 0.28%): 1.21x speedup
gen3 firefox-talos-gfx 6264.13 (6284.65 0.30%) -> 5261.56 (5370.87 1.28%): 1.19x speedup
gen4 midori-zoomed 4771.13 (4809.90 0.73%) -> 4037.03 (4118.93 0.85%): 1.18x speedup
gen6 swfdec-giant-steps 1557.06 (1560.13 0.12%) -> 1336.34 (1341.29 0.32%): 1.17x speedup
gen4 firefox-talos-gfx 80767.28 (80986.31 0.17%) -> 69629.08 (69721.71 0.06%): 1.16x speedup
gen6 midori-zoomed 1463.70 (1463.76 0.08%) -> 1331.45 (1336.56 0.22%): 1.10x speedup
Slowdowns
=========
gen6 xfce4-terminal-a1 2030.25 (2036.23 0.25%) -> 2144.60 (2240.31 4.29%): 1.06x slowdown
gen4 swfdec-youtube 3580.00 (3597.23 3.92%) -> 3826.90 (3862.24 0.91%): 1.07x slowdown
gen4 firefox-talos-svg 66112.25 (66256.51 0.11%) -> 71433.40 (71584.31 0.14%): 1.08x slowdown
gen4 gnome-system-monitor 5691.60 (5724.03 0.56%) -> 6707.56 (6747.83 0.33%): 1.18x slowdown
gen3 ocitysmap 3494.05 (3502.44 0.20%) -> 4321.99 (4524.42 2.78%): 1.24x slowdown
gen4 ocitysmap 3628.42 (3641.66 9.37%) -> 5177.16 (5828.74 8.38%): 1.43x slowdown
gen5 ocitysmap 4027.77 (4068.11 0.80%) -> 5748.26 (6282.25 7.38%): 1.43x slowdown
gen6 ocitysmap 1401.61 (1402.24 0.40%) -> 2365.74 (2379.14 4.12%): 1.69x slowdown
[Note that the performance regression for ocitysmap arises because we now
attempt to support rendering to and (more importantly) from large
surfaces. Enabling such operations is the only way to one day be
faster than purely using the CPU; in the meantime we suffer a regression
due to the increased migration and aperture thrashing. The other couple
of regressions will be eliminated with improved span and shader support,
now that the framework for such is in place.]
The performance increase for Cairo completely overlooks the other
critical aspects of the architecture:
World of Padman:
  gen3 (800x600): 57.5 -> 96.2
  gen4 (800x600): 47.8 -> 74.6
  gen6 (1366x768): 100.4 -> 140.3 [F15]
                   144.3 -> 146.4 [drm-intel-next]
x11perf (gen6):
  aa10text: 3.47 -> 14.3 Mglyphs/s [unthrottled!]
  copywinwin10: 1.66 -> 1.99 Mops/s
  copywinpix10: 2.28 -> 2.98 Mops/s
And we do not have a good measure for how much improvement the reworking
of the fallback paths gives, except that xterm is now over 4x faster...
PS: This depends upon the Xorg patchset "Remove the cacheing of the last
scratch PixmapRec" for correct invalidations of scratch Pixmaps (used by
the dix to implement SHM operations, used by chromium and gtk+ pixbufs).
PPS: ./configure --enable-sna
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Using AC_CHECK_FILE will cause cross-builds to fail picking the right file;
instead use compile/preprocessor checks properly, and check for
xf86driproto earlier.
Reviewed-by: Rémi Cardona <remi@gentoo.org>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
To minimise lag in those ever so critical games, we want to ensure that
the copy happens as soon as it is received, so we need to flush the
batch after processing a swap event and before we go to sleep.
References: https://bugs.freedesktop.org/show_bug.cgi?id=37068
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
In order to avoid video lag and jerky playback we need to ensure that
any queued video is flushed before we go to sleep.
Fixes regression from 6f104189bb.
Reported-and-tested-by: Edward Sheldrake <ejsheldrake@gmail.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=37068
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
This should fix the seven-fold repetition of "SandyBridge" in the list
of supported chipsets during start-up... And be more useful in bug
reports!
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Bring intel_module.c into line with the kernel whitespacing rules abided
by everywhere else in the tree.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
This is one less place the new hardware enabler has to spam the
chipset in. The PciChipset is just a match structure from PciId to
the SymTabRec entry token, and our SymTabRec entry tokens are just the
PciId, so it's trivial to construct.
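Roughly, the construction looks like the sketch below. The two structs are simplified stand-ins for the server's SymTabRec and PciChipsets (the real definitions carry more fields), `build_pci_chipsets()` is a hypothetical name, and the table is assumed to end with a NULL-name entry.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the server's SymTabRec and PciChipsets. */
struct sym_tab {
	int token;		/* here the token is simply the PCI id */
	const char *name;	/* NULL marks the end of the table */
};

struct pci_chipset {
	int num_chipset;	/* token to look up in the name table */
	int pci_id;		/* PCI device id to match against */
};

/*
 * Derive the match table from the name table. Because each token is the
 * PCI id itself, both fields of every entry receive the same value. The
 * output is terminated with -1 and the number of entries written,
 * terminator included, is returned.
 */
static size_t build_pci_chipsets(const struct sym_tab *names,
				 struct pci_chipset *out, size_t max)
{
	size_t i;

	assert(max > 0);

	for (i = 0; names[i].name != NULL && i + 1 < max; i++) {
		out[i].num_chipset = names[i].token;
		out[i].pci_id = names[i].token;
	}

	out[i].num_chipset = -1;
	out[i].pci_id = -1;
	return i + 1;
}
```

A new chipset then only needs an entry in the name table; the match table follows from it.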
Acked-by: Kenneth Graunke <kenneth@whitecape.org>
We need to have this array anyway for the xf86 interfaces, apparently,
so just store the name in one location. This drops the i852/i855
subdevice distinction in the name printed, but I haven't seen us ever
care about that.
Acked-by: Kenneth Graunke <kenneth@whitecape.org>
Currently, we require that a batch containing a dirty bo be submitted
before we mark the device as requiring a flush. So if we never submit a
batch between block handlers, we can end up sleeping without ever
flushing either the partial batch or the rendering to the scanout.
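One way to picture the fix is sketched below: the flush requirement is recorded at the moment dirty rendering is queued rather than when a batch is submitted, so the block handler sees it even if no batch was submitted in between. This is an assumption about the shape of the change, not the driver's code; `struct device` and both helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state; the counters exist only to observe behaviour. */
struct device {
	unsigned pending;	/* commands sitting in the partial batch */
	unsigned submitted;	/* batches handed to the kernel */
	unsigned flushes;	/* flushes issued */
	bool need_flush;
};

/* Queue some rendering; note immediately whether it dirties the scanout. */
static void device_queue_render(struct device *d, bool dirties_scanout)
{
	assert(d != NULL);
	d->pending++;
	if (dirties_scanout)
		d->need_flush = true;	/* marked now, not at submit time */
}

/* Called before sleeping: push out the partial batch and flush if needed. */
static void device_block_handler(struct device *d)
{
	assert(d != NULL);
	if (!d->need_flush)
		return;

	if (d->pending > 0) {
		d->submitted++;
		d->pending = 0;
	}
	d->flushes++;
	d->need_flush = false;
}
```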
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=36776
Tested-by: Vasily Khoruzhick <anarsoul@gmail.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>