After further testing, the balance of doubt swings in favour of using the
3D pipeline for copies.
For small copies the BLT unit is faster:
2.14M/sec vs 1.71M/sec for comppixwin10.
And for large copies the RENDER pipeline is faster:
13000/sec vs 8000/sec for comppixwin500.
I think the implication is that we are not efficiently utilising the EU
for small primitives - i.e. something that we might be able to improve.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
For whatever reason, this produces a 30% improvement with the fish-demo
(500 -> 660 fps on i7-3730qm at 1024x768). However, it does cause about
a 5% regression in aa10text. We appear to be able to alleviate that by
only doing the flush when the composite op != PictOpSrc.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The original vblank interface only understood 2 pipes (primary and
secondary) and so selecting the third pipe (introduced with IvyBridge)
requires use of the HIGH_CRTC. Using the second pipe where we meant the
third pipe could result in some spurious timings when waiting on the
vblank.
Reported-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
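The pipe selection described above for the vblank interface can be
sketched as follows. This is a minimal illustration, not the driver's
actual code: the EX_-prefixed constants are local stand-ins whose values
are assumed to mirror the corresponding DRM_VBLANK_* definitions in
libdrm, and ex_pipe_select() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins, assumed to mirror libdrm's DRM_VBLANK_SECONDARY,
 * DRM_VBLANK_HIGH_CRTC_SHIFT and DRM_VBLANK_HIGH_CRTC_MASK. */
#define EX_VBLANK_SECONDARY       0x20000000u
#define EX_VBLANK_HIGH_CRTC_SHIFT 1
#define EX_VBLANK_HIGH_CRTC_MASK  0x0000003eu

/* Hypothetical helper: returns the bits to OR into the vblank request
 * type so that the wait targets the given pipe. */
static uint32_t ex_pipe_select(int pipe)
{
	assert(pipe >= 0 && pipe < 32);

	if (pipe > 1) /* third pipe and beyond: encode the index in HIGH_CRTC */
		return ((uint32_t)pipe << EX_VBLANK_HIGH_CRTC_SHIFT) &
			EX_VBLANK_HIGH_CRTC_MASK;
	if (pipe == 1) /* legacy secondary pipe */
		return EX_VBLANK_SECONDARY;
	return 0; /* primary pipe needs no extra flag */
}
```

The point of the sketch is that the legacy secondary flag is only ever
used for pipe 1, so the third pipe can no longer alias the second.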
We use that flag to decide whether we need to check if the bo is still
busy upon destruction, so only clear it if the bo is marked as idle.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The 3D pipeline is quite versatile, so we only need to force the BLT if
we cannot extract the subregion.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The sampler can in fact handle subregions of large pixmaps quite well,
and so we prefer to keep using the 3D pipeline so long as the operation
fits in. If not, then switch to the BLT in order to avoid the temporary
surface dance.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
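The preference described above (stay on the 3D pipeline while the
operation fits, otherwise use the BLT) might be sketched like this. The
names ex_engine and ex_choose_engine, and the max_3d_size parameter, are
hypothetical; the real limit depends on the hardware generation and is
not stated in the commit message.

```c
#include <assert.h>

enum ex_engine { EX_ENGINE_RENDER, EX_ENGINE_BLT };

/* Hypothetical sketch: pick the 3D (RENDER) pipeline as long as the
 * extents of the operation fit within the assumed 3D surface limit,
 * and fall back to the BLT otherwise. */
static enum ex_engine ex_choose_engine(int op_width, int op_height,
				       int max_3d_size)
{
	assert(op_width > 0 && op_height > 0 && max_3d_size > 0);

	if (op_width <= max_3d_size && op_height <= max_3d_size)
		return EX_ENGINE_RENDER;
	return EX_ENGINE_BLT;
}
```

Note that the test is on the size of the operation, not on the size of
the pixmap it is drawn from, which is what lets a subregion of an
oversized pixmap stay on the 3D pipeline.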
To complete my show of incompetence for the evening, not only do we have
to restore the original source when compositing the mask onto the
destination, we also need to restore the original dst (rather than
composite the mask onto the mask!).
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
In 64a4bcb8ce, I introduced a WHITE source for the purposes of
accumulating the glyph mask correctly. Unfortunately I neglected to
restore the original source picture for compositing the glyph mask on
the destination, resulting in a use-after-free and then corruption.
Reported-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As we may end up using the GPU bo if the CPU bo is busy, we need to
make sure we have initialised the damage extents first.
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Keep the semantic analyser happy by consuming the expected return value
with an assert.
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
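A minimal sketch of the pattern of consuming a return value with an
assert, as described above. The functions ex_try_reserve() and
ex_take_slot() are hypothetical and only serve to show the shape.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical function whose result the caller expects to be true. */
static bool ex_try_reserve(int *slots)
{
	if (*slots <= 0)
		return false;
	--*slots;
	return true;
}

static void ex_take_slot(int *slots)
{
	/* Make the call outside of assert() so that it still happens
	 * when NDEBUG compiles the assertion away. */
	bool ok = ex_try_reserve(slots);

	assert(ok);
	(void)ok; /* avoid an unused-variable warning under NDEBUG */
}
```

The return value is now visibly consumed, which is what a static
analyser looks for, and debug builds still catch an unexpected failure.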
An earlier version was buggy and introduced corruption as it failed to
fall back gracefully with ComponentAlpha glyphs. This is a much simpler
implementation that composites each glyph individually, leaving it to the
backend to optimise away state changes. It should still be many times
faster than incurring the fallback...
Reported-by: Oleksandr Natalenko <pfactum@gmail.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=50508
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As pointed out by Soren Sandmann and Behdad Esfahbod, it is essential to
use white IN glyph when adding to the mask so that the channel expansion
is correctly performed when adding to an incompatible mask format.
For example, loading alpha as the source results in the value 000a being
added to the rgba glyph mask (for mixed subpixel rendering with
grayscale glyphs), whereas the desired value is aaaa.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
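The channel expansion described above can be worked through per pixel.
This is a sketch under stated assumptions: ex_rgba is a hypothetical
struct with channels stored in r, g, b, a order to match the 000a/aaaa
notation, and ex_mul_un8() is a plain rounded 8-bit multiply rather than
any particular library's implementation.

```c
#include <stdint.h>

struct ex_rgba { uint8_t r, g, b, a; };

/* Multiply two normalised 8-bit values, rounding to nearest. */
static uint8_t ex_mul_un8(uint8_t x, uint8_t y)
{
	return (uint8_t)(((unsigned)x * y + 127u) / 255u);
}

/* Using the alpha-only glyph directly as the source: the colour
 * channels read as zero, so the value added to the mask is 000a. */
static struct ex_rgba ex_glyph_as_source(uint8_t coverage)
{
	struct ex_rgba p = { 0, 0, 0, coverage };
	return p;
}

/* white IN glyph: every channel of white (ffff) is scaled by the
 * glyph's alpha, so the value added to the mask is aaaa. */
static struct ex_rgba ex_white_in_glyph(uint8_t coverage)
{
	struct ex_rgba p = {
		ex_mul_un8(0xff, coverage),
		ex_mul_un8(0xff, coverage),
		ex_mul_un8(0xff, coverage),
		ex_mul_un8(0xff, coverage),
	};
	return p;
}
```

With a coverage of 0x80, the first helper yields (00, 00, 00, 80) while
the second yields (80, 80, 80, 80), which is the value a component-alpha
mask needs.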
This issue was raised by Dave Airlie as he is trying to integrate
multiple GPUs into the xserver, and a particular setup has a slave
rendering device that copies the contents from the GPU over a
DisplayLink USB adaptor. As such the slave device is listening for
Damage on the Screen Pixmap and needs the update following pageflips.
Since we already are posting damage for all the SwapBuffers paths other
than pageflip, for consistency we should post damage along the pageflip
path as well.
Reported-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we have never written to a pixmap, then there will be neither a GPU
nor a shadow pointer and we would attempt to copy from a NULL pointer. In
this case, as the user is expecting to copy uninitialised data, we are at
liberty to replace those undefined values with the clear color.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
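A minimal sketch of the guard described above, assuming 32-bit pixels
and a hypothetical ex_copy_pixels() helper; the real code operates on
pixmaps and boxes rather than flat arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n pixels from src to dst. A NULL src
 * stands for a pixmap that has never been written to, in which case
 * the destination is filled with the clear colour instead. */
static void ex_copy_pixels(uint32_t *dst, const uint32_t *src, size_t n,
			   uint32_t clear_color)
{
	assert(dst != NULL);

	if (src == NULL) {
		for (size_t i = 0; i < n; i++)
			dst[i] = clear_color;
		return;
	}

	memcpy(dst, src, n * sizeof(*dst));
}
```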
If we adjust the region for the pixmap offset, be sure that we reset it
before returning it to the caller.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
There seems to be a restriction (observed on 965gm at least) whereby the
sampler cache becomes incoherent if we write within 128 bytes of a busy
buffer. This is either due to a restriction on neighbouring cachelines
(like the earlier BLT limitations) or an effect of sampler prefetch.
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
References: https://bugs.freedesktop.org/show_bug.cgi?id=50477
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Otherwise we gradually introduce garbage into the picture.
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=50477
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As pixman composite performance is atrocious for anything other than
solids, prefer to upload the mask and attempt a composite operation on
the GPU unless we are forcing the fallback.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Not only do we need to make sure the source is available to the CPU, we
need to actually check the right conditions for clipping the box.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
When returning early because the operation is a no-op, we still need to
fill in the function pointers to prevent a later NULL dereference.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
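The early-return case described above can be sketched with a small
operation struct. All names here (ex_fill_op, ex_fill_init and the
callbacks) are hypothetical; the point is only that the no-op path
installs valid callbacks before returning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ex_fill_op {
	void (*box)(struct ex_fill_op *op, int x, int y, int w, int h);
	void (*done)(struct ex_fill_op *op);
	int boxes_emitted;
};

static void ex_noop_box(struct ex_fill_op *op, int x, int y, int w, int h)
{
	(void)op; (void)x; (void)y; (void)w; (void)h;
}

static void ex_noop_done(struct ex_fill_op *op) { (void)op; }

static void ex_real_box(struct ex_fill_op *op, int x, int y, int w, int h)
{
	(void)x; (void)y; (void)w; (void)h;
	op->boxes_emitted++; /* stand-in for emitting a rectangle */
}

static void ex_real_done(struct ex_fill_op *op) { (void)op; }

static bool ex_fill_init(struct ex_fill_op *op, int width, int height)
{
	assert(op != NULL);
	op->boxes_emitted = 0;

	if (width <= 0 || height <= 0) {
		/* Nothing to draw, but callers will still invoke the
		 * callbacks, so they must not be left NULL. */
		op->box = ex_noop_box;
		op->done = ex_noop_done;
		return true;
	}

	op->box = ex_real_box;
	op->done = ex_real_done;
	return true;
}
```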
The goal is to cheaply spot a simple copy operation that can be performed
on the CPU without having to load both parties onto the GPU.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Seems we have enough GPU power to overcome the clumsy shaders. Just
imagine the possibilities when we have a true shader for spans...
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Seemingly only advisable when already committed to using the GPU. This
first pass is still a little naive as it makes no attempt to avoid empty
tiles, nor aims to be efficient.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we can guess that we will only readback the data once, then we can
skip the copy into the shadow.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
One outcome is that inspecting the usage patterns afterwards indicated
that we were missing an opportunity to reduce unaligned boxes to an
inplace operation.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>