In an attempt to work around the incoherency in gen2 chipsets, we avoid
using dynamic reallocation as much as possible.
The first step is to disable allocation of pixmaps using GEM and simply
create them in system memory without a backing buffer object. This
forces all rendering to use S/W fallbacks.
The second step is to allocate a shadow front buffer and assign that to
the Screen pixmap. This ensures that the front buffer remains in the GTT
and pinned for scanout. The shadow buffer will be rendered to in the
normal fashion via the Screen pixmap, and be marked dirty. In the block
handler, the dirty shadow buffer is then blitted (using the GPU) over
the front buffer. This should completely avoid having to move pages
around in the GTT and avoid incurring the wrath of those early chipsets.
Secondly, performance should be reasonable as we avoid the ping-pong
caused by the small aperture and weak GPU forcing software fallbacks.
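
The shadow-buffer flow described above can be sketched roughly as follows. This is a minimal illustration, not the driver's code: the names shadow_screen, shadow_mark_dirty and shadow_block_handler are hypothetical, damage is tracked as a single bounding box (an assumption), and memcpy stands in for the GPU blit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; not the driver's actual structures. */
struct box { int x1, y1, x2, y2; };   /* half-open: [x1,x2) x [y1,y2) */

struct shadow_screen {
    uint32_t *shadow;   /* system-memory buffer behind the Screen pixmap */
    uint32_t *front;    /* stand-in for the pinned scanout buffer */
    int width, height;  /* both buffers hold width * height pixels */
    bool dirty;
    struct box damage;  /* bounding box of rendering since the last flush */
};

/* Record that software rendering touched the given rectangle. */
static void shadow_mark_dirty(struct shadow_screen *s, struct box b)
{
    assert(b.x1 >= 0 && b.y1 >= 0 && b.x2 <= s->width && b.y2 <= s->height);
    if (b.x1 >= b.x2 || b.y1 >= b.y2)
        return;                       /* empty rectangle, nothing to do */
    if (!s->dirty) {
        s->damage = b;
        s->dirty = true;
        return;
    }
    if (b.x1 < s->damage.x1) s->damage.x1 = b.x1;
    if (b.y1 < s->damage.y1) s->damage.y1 = b.y1;
    if (b.x2 > s->damage.x2) s->damage.x2 = b.x2;
    if (b.y2 > s->damage.y2) s->damage.y2 = b.y2;
}

/*
 * Run once per server loop iteration. Copies the damaged area from the
 * shadow to the front buffer and clears the dirty flag. A real driver
 * would emit a blit here; memcpy is only a placeholder.
 * Returns the number of pixels copied.
 */
static size_t shadow_block_handler(struct shadow_screen *s)
{
    size_t copied = 0;

    if (!s->dirty)
        return 0;
    for (int y = s->damage.y1; y < s->damage.y2; y++) {
        size_t off = (size_t)y * (size_t)s->width + (size_t)s->damage.x1;
        size_t n = (size_t)(s->damage.x2 - s->damage.x1);
        memcpy(s->front + off, s->shadow + off, n * sizeof(uint32_t));
        copied += n;
    }
    s->dirty = false;
    return copied;
}
```

The point of the sketch is that the front buffer is only ever written from one place, so it never has to be moved or reallocated while rendering proceeds in the shadow.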
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
This avoids a memory leak on server reset.
Signed-off-by: Keith Packard <keithp@keithp.com>
[ickle: Added comments from Keith that explain the necessity of
destroying the pixmap ourselves and why chaining up in this instance is
not the correct approach.]
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Now with streaming uploads and downloads for composite operations in
place, shared memory pixmaps are no longer that dire performance-wise.
With careful use these can in fact be the most efficient means of
transfer between a wholly software renderer in the client and a backing
store. For instance, Chromium renders internally to an ARGB32 image
buffer and uses a shared pixmap to composite dirty regions into the
backing store, thereby using the GPU to perform either the blit or the
format conversion. Enabling shared pixmaps reduces our CPU overhead
whilst scrolling by a factor of 5 or so.
And this is achieved simply by deleting obsolete code!
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
We need to install the acceleration functions so that they are wrapped
by the Damage layer. This fixes the corruption under a compositing WM
introduced in commit 8700673157.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reported-and-tested-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
This is wildly optimistic, but it should work in a surprising number of
error situations and some output in those cases will hopefully be
better than none...
If we submit a batchbuffer and the kernel reports the GPU is hung (which
will be caused by an earlier execbuffer, and so the kernel should have
had enough time to determine whether or not it could reset the GPU) then
disable any further attempt to accelerate gfx and force fallbacks to map
the buffers and use the CPU. We cannot normally map any more buffers if
the GPU is hung, so only those already mapped prior to the hang can be
written to, or those allocated in system memory. However, we can expect
that the framebuffer is already mapped, and so have a reasonable
expectation to continue to see the display update.
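
A rough sketch of the fallback logic described above, under stated assumptions: the names accel_state, batch_submit and fill_solid are hypothetical, the submit hook is a stand-in for the execbuffer call, and treating -EIO as the "GPU is hung" report is an assumption of this example rather than something stated in this message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical driver state; not the actual structure used by the driver. */
struct accel_state {
    bool wedged;   /* set once a submission reports a hung GPU */
};

/* Stand-in for the execbuffer call: returns 0 or a negative errno. */
typedef int (*submit_fn)(void *batch);

static int fake_submit_ok(void *batch)   { (void)batch; return 0; }
static int fake_submit_hung(void *batch) { (void)batch; return -EIO; }

/* Submit a batch; on a reported hang, disable all further acceleration. */
static void batch_submit(struct accel_state *st, submit_fn submit, void *batch)
{
    assert(st && submit);
    if (st->wedged)
        return;
    if (submit(batch) == -EIO)   /* assumption: -EIO signals the hang */
        st->wedged = true;
}

/*
 * Fill `count` pixels of an already-mapped buffer. Returns true if the
 * (pretend) GPU path was taken, false if the CPU fallback wrote the pixels.
 */
static bool fill_solid(struct accel_state *st, submit_fn submit,
                       uint32_t *mapped, size_t count, uint32_t color)
{
    if (!st->wedged) {
        batch_submit(st, submit, NULL);
        if (!st->wedged)
            return true;         /* the batch was accepted */
    }
    /* CPU fallback: only possible because the buffer is already mapped. */
    for (size_t i = 0; i < count; i++)
        mapped[i] = color;
    return false;
}
```

Once the flag is set it is never cleared in this sketch, matching the intent of disabling any further attempt at acceleration.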
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Rewrite glyph rendering to avoid the intermediate buffer, accumulating
the glyph rectangles directly in the backend composite routines. And
modify the glyph cache routines to fully utilise the allocated size of
the tiled buffer on older hardware. To do this we alias all glyph sizes
into the same texture using a technique suggested by Keith Packard.
PineView:
885/856 -> 1150/1110 kglyphs/s (aa/rgb)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Store the cache position directly on the glyph using a devPrivate rather
than through an auxiliary hash table.
x11perf on PineView:
650/638 kglyphs/s -> 701/686 kglyphs/s [aa/rgb]
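
As a simplified illustration of the idea (not the server's devPrivate API), the cache position can hang off the glyph itself so that a lookup is a single pointer read rather than a hash probe. The types and the dev_private field below are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Where the glyph lives inside the cache texture (hypothetical layout). */
struct cache_pos {
    unsigned short x, y;
};

struct glyph {
    unsigned short width, height;
    void *dev_private;   /* stand-in for a per-glyph private slot */
};

/* Attach a cache position to the glyph; pass NULL to mark it uncached. */
static void glyph_set_pos(struct glyph *g, struct cache_pos *pos)
{
    assert(g);
    g->dev_private = pos;
}

/* O(1) lookup with no hashing: NULL means the glyph is not in the cache. */
static struct cache_pos *glyph_get_pos(const struct glyph *g)
{
    assert(g);
    return g->dev_private;
}
```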
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The cost of performing relocations outweighs the advantages of using the
blitter for solids with lots of rectangles.
References:
Bug 22127 - [UXA] 50% performance regression for XRenderFillRectangles
https://bugs.freedesktop.org/show_bug.cgi?id=22127
By using the 3D pipeline we improve our performance by around 4x on
i945, measured by the jxbench microbenchmark, and a factor of 10x by
short-cutting to the 3D pipeline for blended rectangles.
Before, on a i945GME:
19982.412060 Ops/s; rects (!); 15x15
9599.131693 Ops/s; rects (!); 75x75
3803.654743 Ops/s; rects (!); 250x250
6836.743772 Ops/s; rects blended; 15x15
1443.750000 Ops/s; rects blended; 75x75
495.335821 Ops/s; rects blended; 250x250
23247.933884 Ops/s; rects composition (!); 15x15
10993.073048 Ops/s; rects composition (!); 75x75
3595.905172 Ops/s; rects composition (!); 250x250
After:
87271.145975 Ops/s; rects (!); 15x15
32347.744361 Ops/s; rects (!); 75x75
5884.177215 Ops/s; rects (!); 250x250
73500.000000 Ops/s; rects blended; 15x15
33580.882353 Ops/s; rects blended; 75x75
5858.811749 Ops/s; rects blended; 250x250
25582.317073 Ops/s; rects composition (!); 15x15
6664.728682 Ops/s; rects composition (!); 75x75
14965.909091 Ops/s; rects composition (!); 250x250 [suspicious]
This has no impact on Cairo, but I have a suspicion from watching xtrace
that Qt likes to blit thousands of 1x1 rectangles with the same colour.
However, we are still around 2-3x slower than the reported figures for
EXA!
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Maintain a small cache of pixmaps to hold SolidFill pictures. Currently
we create a pixmap the size of the damaged region and fill that using
pixman before downloading it to the GPU and compositing. Needless to say
this is extremely expensive compared to simply emitting the solid
colour. To mitigate this cost, we maintain a small cache of 1x1R
pictures, each of which is recognised by the driver as being a solid,
but at the very least is maintained as a GPU-ready pixmap.
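
A minimal sketch of such a cache, assuming a fixed size of 16 entries and round-robin eviction (neither of which is stated above). The names solid_cache and solid_cache_get are hypothetical, and a plain struct stands in for the 1x1R picture.

```c
#include <assert.h>
#include <stdint.h>

#define SOLID_CACHE_SIZE 16   /* assumed size, chosen for illustration */

/* Stand-in for a 1x1 repeating picture holding a single colour. */
struct solid_picture {
    uint32_t color;
};

struct solid_cache {
    struct solid_picture entry[SOLID_CACHE_SIZE];
    int count;     /* number of valid entries */
    int next;      /* next victim once the cache is full */
    int created;   /* how many pictures had to be (re)created */
};

/* Return a cached picture for `color`, creating or evicting as needed. */
static struct solid_picture *solid_cache_get(struct solid_cache *c,
                                             uint32_t color)
{
    int slot;

    assert(c);
    for (int i = 0; i < c->count; i++)
        if (c->entry[i].color == color)
            return &c->entry[i];          /* hit: nothing to upload */

    if (c->count < SOLID_CACHE_SIZE) {
        slot = c->count++;
    } else {
        slot = c->next;                   /* round-robin eviction */
        c->next = (c->next + 1) % SOLID_CACHE_SIZE;
    }
    c->entry[slot].color = color;         /* a real driver would fill the pixmap */
    c->created++;
    return &c->entry[slot];
}
```

Repeated fills with the same colour then reuse one small pixmap instead of allocating and uploading a region-sized one each time.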
This gives a good boost to cairo-xcb (which uses solid fills) on a gm45:
Before:
gnome-terminal-vim: 41.9s
After:
gnome-terminal-vim: 31.7s
Compare with using a cache of 1x1R pixmaps in cairo-xcb:
gnome-terminal-vim: 31.6s
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
We've talked about doing this since the start of the project, putting it off
until "some convenient time". Just after removing a third of the driver seems
like a convenient time, when backporting's probably not happening much anyway.
Don't do it, treat this the same as every other prepare access call in uxa.
Reviewed-by: Keith Packard <keithp@keithp.com>
Signed-off-by: Owain Ainsworth <zerooa@googlemail.com>
This avoids prepare/finish_access_gc overhead when we're not changing things
(since GCTile is already handled) and gets us the RW flag for the prepare
of the stipple pixmap so things will be synced correctly.
This eliminates the cost of EXA migration management while providing full
pixmap allocation control to the driver. The goal is to make something
useful for UMA drivers.