If we have no shader support for generic convolutions, we currently
create the convolved texture using pixman. A multipass accumulation
algorithm can be implemented on top of CompositePicture, so try it!
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The only place where we did anything other than use the default was when
creating a new bo for CopyArea. In that case, basing the choice on the
src GPU bo was not only wrong but a potential segfault.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
After 0c12f7cb0 we were setting the width/height of the pixmap *after*
trying to use them to determine if the pixmap could be created on the
GPU. Normally this would be corrected when we attempt to render, except
for the core drawing protocol.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Don't assume that a read/write will clear the active flag if the bo has
been exported to another DRI client.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Most small pixmaps appear to be single shot, so amalgamate them into one
buffer and trim our memory usage.
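The amalgamation idea can be sketched as a trivial bump allocator carving many small, short-lived allocations out of one shared buffer. This is purely illustrative (hypothetical names and sizes), not the driver's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only: many small one-shot pixmaps carved out of
 * one shared buffer with a bump allocator, instead of one bo each. */
#define POOL_SIZE 4096

struct pool {
        uint8_t buf[POOL_SIZE];
        size_t used;
};

/* Returns a sub-allocation, or NULL when the pool is exhausted. */
static void *pool_alloc(struct pool *p, size_t size)
{
        size = (size + 15) & ~(size_t)15;       /* keep 16-byte alignment */
        if (p->used + size > POOL_SIZE)
                return NULL;
        p->used += size;
        return p->buf + p->used - size;
}
```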
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
These are treated by the core drawing routines as replacements for the
front-buffer attached to Windows, and so expect the usual BLT
accelerations are available, for example overlapping blits to handle
scrolling. If we create these pixmaps with Y-tiling and then they are
pinned by the external compositor we are forced to perform a double copy
through the 3D pipeline as it does not handle overlapping blits and the
BLT does not handle Y-tiling.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Only marginally better than falling all the way back to using the CPU
is to perform a double copy to work around the overlapping blit.
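The double-copy workaround amounts to two non-overlapping copies staged through a temporary. A minimal sketch, with plain memcpy standing in for the blitter:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: an engine that cannot handle overlapping source and
 * destination can still perform the move as two non-overlapping
 * copies through a temporary buffer. */
static void
copy_overlapping(uint8_t *base, size_t dst, size_t src, size_t len,
                 uint8_t *tmp)
{
        memcpy(tmp, base + src, len);   /* first pass: source -> temporary */
        memcpy(base + dst, tmp, len);   /* second pass: temporary -> destination */
}
```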
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Short-cut the determination of whether it can be tiled and accelerated
-- we know it can't! This is mainly to clean up the debug logs.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
In the READ==0 case we know that the region does not intersect damage
because we have just subtracted, and checking the intersection causes us
to immediately apply the subtraction operation defeating the
optimisation and forcing the expensive operation each time.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Only those that point into scratch memory need to be synchronized before
control is handed back to the client.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Mark the end of a sequence of CPU operations and force the decision to
map again to be based on the current upload operation.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If the last operation was on the GPU, continue on the GPU if this
operation overlaps any GPU damage or does not overlap CPU damage.
Otherwise, if the last operation was on the CPU only switch to the GPU
if we do not overlap any CPU damage and overlap existing GPU damage.
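The decision rule above can be sketched as a small predicate. The real code tracks damage regions rather than simple booleans; the names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum target { USE_GPU, USE_CPU };

/* Hedged sketch of the placement heuristic described above. */
static enum target
choose_target(bool last_on_gpu,
              bool overlaps_gpu_damage,
              bool overlaps_cpu_damage)
{
        if (last_on_gpu) {
                /* Continue on the GPU if we touch GPU damage
                 * or avoid all CPU damage. */
                if (overlaps_gpu_damage || !overlaps_cpu_damage)
                        return USE_GPU;
                return USE_CPU;
        }

        /* Last operation was on the CPU: only switch to the GPU if we
         * avoid CPU damage and touch existing GPU damage. */
        if (!overlaps_cpu_damage && overlaps_gpu_damage)
                return USE_GPU;
        return USE_CPU;
}
```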
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As we now can accelerate most of the common core drawing operations, we
can create GPU bo for accelerated drawing on first use without undue
fear of readbacks. This benefits Qt especially, which heavily uses the
core drawing operations.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The theory being that we will also require cache space to copy from when
uploading into the shadow.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
When mixing operations and switching between GTT and CPU mappings we
need to restore the original CPU shadow rather than accidentally
overwrite it.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
A missing check before emitting a dword into the batch opened up the
possibility of overflowing the batch and corrupting our state.
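The class of bug is the classic missing space-check before an emit. A minimal sketch of the guard (hypothetical names, not the driver's actual batch API):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch: a fixed-size batch with a guard before every
 * emit -- the check whose omission allows the overflow. */
#define BATCH_SIZE 8

struct batch {
        uint32_t buf[BATCH_SIZE];
        int used;
        int flushes;
};

static void batch_flush(struct batch *b)
{
        /* submit b->buf[0..used) to the kernel, then reset */
        b->used = 0;
        b->flushes++;
}

static void batch_require(struct batch *b, int ndwords)
{
        if (b->used + ndwords > BATCH_SIZE)
                batch_flush(b);
}

static void batch_emit(struct batch *b, uint32_t dword)
{
        batch_require(b, 1);    /* the check that must not be missed */
        b->buf[b->used++] = dword;
}
```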
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Prefer to reuse an available CPU mapping; these are considered precious
and are reaped if we keep too many unused entries available.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
It is no longer true that the rq is NULL when on the flushing list; it
now points to the static request instead.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As we may free a purged bo whilst iterating, we need to keep the next bo
as a local member.
Include the debugging that led to this find.
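The fix is the standard safe-iteration pattern: load the next pointer into a local before the current bo may be freed. A self-contained sketch with illustrative types:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal sketch: walking a singly linked bo list while freeing
 * purged entries, keeping the next bo in a local so the iterator
 * never dereferences freed memory. */
struct bo {
        struct bo *next;
        int purged;
};

static int reap_purged(struct bo **head)
{
        struct bo **link = head;
        struct bo *bo, *next;
        int freed = 0;

        for (bo = *head; bo != NULL; bo = next) {
                next = bo->next;        /* capture before bo may be freed */
                if (bo->purged) {
                        *link = next;
                        free(bo);
                        freed++;
                } else {
                        link = &bo->next;
                }
        }
        return freed;
}
```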
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
After we find a single bo that has been evicted from our cache, the
kernel is likely to have evicted many more so check our caches for any
more bo to reap.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If no VMA might become inactive, there is no point scanning the
inactive lists when searching for a VMA.
This prevents the regression in firefox-fishbowl whilst maintaining most
of the improvement with PutComposite.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Try to recycle VMA by first populating the inactive list before
scanning for a VMA bo to harvest.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we are going to transfer GPU damage back to the CPU bo, then we can
reuse an active buffer and so improve the recycling.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
In order to avoid conflating whether a bo was marked purgeable with its
retained state, we need to carefully handle the errors from madv.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we incurred a context switch to the BLT in order to prepare the
target (uploading damage for instance), we should recheck whether we can
continue the operation on the BLT rather than force a switch back to
RENDER.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we need to halt the 3D engine in order to flush the pipeline for a
dirty source, we may as well re-evaluate whether we can use the BLT
instead.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we will need to extract either the source or the destination, we
should see if we can do the entire operation on the BLT.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As for PutImage, if the damage will be immediately flushed out to the
GPU bo, we may as well do the write directly to the GPU bo rather than
staging it via the shadow.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
One issue with the heuristic is that it is based on total pixmap size
whereas the goal is to pick the placement for the next series of
operations. The next step in refinement is to combine an overall
placement to avoid frequent migrations along with a per-operation
override.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
When the pixmap is large, larger than the L2 cache size, we are unlikely
to benefit from first copying the data to a shadow buffer -- as that
shadow buffer itself will mostly reside in main memory. In such
circumstances we may as well perform the write to the GTT mapping of the
GPU bo. As such, it is a fragile heuristic that may require further
tuning.
Avoiding that extra copy gives a 30% boost to putimage500/shmput500 at
~10% cost to putimage10/shmput10 on Atom (945gm/PineView), without any
noticeable impact upon cairo.
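The threshold test reduces to comparing the upload size against the cache size. A sketch under assumed values (the 512 KiB figure and the names are illustrative, not the driver's tuned constants):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the size heuristic: once the upload exceeds the L2 cache,
 * the shadow would live in main memory anyway, so skip the staging
 * copy and write straight through the GTT mapping. */
#define L2_CACHE_SIZE (512 * 1024)      /* assumed Atom-class figure */

static bool
upload_inplace(size_t width, size_t height, size_t bpp)
{
        size_t bytes = width * height * (bpp / 8);

        return bytes > L2_CACHE_SIZE;
}
```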
Reported-by: Michael Larabel <Michael@phoronix.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>