To enable Daniel's faster pwrite paths. Only one step removed from using
whole page alignment...
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Allow all generations to use the minimum alignment of 4 bytes again as
it appears to be working for me... Or at least what remains broken seems
to be irrespective of this alignment.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
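For illustration only, rounding an upload offset up to the minimum
4-byte alignment (or to a whole page) might look like the sketch below.
The helper name align_up is hypothetical and is not taken from the
driver; it assumes the alignment is a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round offset up to a power-of-two alignment. */
static size_t align_up(size_t offset, size_t alignment)
{
    /* The mask trick below only works for powers of two. */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

With this, align_up(13, 4) yields 16, whereas whole page alignment
would push the same offset to 4096.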
If we reuse a partial buffer for a read, we cannot shrink it during
upload to the device as we do not track how many bytes we actually need
for the read operation.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As the partial bo may be coupled into the execlist, we may as well hang
onto the memory to service the next partial buffer request until it
expires in the next dispatch.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Don't blithely assume that the incoming bytes are appropriately aligned
for the destination buffer. Indeed we may be replacing the destination
bo with the shadow bytes out of another, larger, pixmap, in which case we
do need to create a stride that is appropriate for the upload and
perform the 2D copy.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
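A minimal sketch of the 2D copy described above, assuming the source
rows live inside a larger pixmap with its own pitch. The function
copy_rows and its parameters are hypothetical names chosen for the
example, not the driver's.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy `rows` rows of `row_bytes` each between buffers whose strides
 * differ, rather than assuming one linear memcpy is valid. */
static void copy_rows(const uint8_t *src, size_t src_stride,
                      uint8_t *dst, size_t dst_stride,
                      size_t row_bytes, size_t rows)
{
    /* A row must fit inside both pitches. */
    assert(row_bytes <= src_stride && row_bytes <= dst_stride);
    for (size_t y = 0; y < rows; y++)
        memcpy(dst + y * dst_stride, src + y * src_stride, row_bytes);
}
```

Bytes in the destination padding between row_bytes and dst_stride are
left untouched.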
So that subsequent code resists performing CPU operations with them
(after they have been populated.)
Marking both sides as wholly damaged breaks the rules, but should work
out so long as we check whether we can perform the operation within the
target damage first.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
We should be able to reduce this by disabling dual-stream mode of the
GPU (which we want to achieve anyway for 2D performance). Artefacts
in small uploads demonstrate that we fail to do so.
References: https://bugs.freedesktop.org/show_bug.cgi?id=44150
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As no transform is a special case of affine, we were attempting to
dereference the NULL transform in order to determine if it was a simple
no-rotation matrix. As the operation is extremely simple, add a special
case vertex program to speed it up.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
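The NULL check in question might be sketched as follows. The struct
layout and the name transform_has_no_rotation are assumptions made for
the example; the point is only that a missing transform is tested
before any matrix element is read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical 3x3 transform, row-major. */
struct transform {
    double m[3][3];
};

/* A NULL transform means "no transform" and trivially has no rotation;
 * only a non-NULL matrix may be inspected. */
static bool transform_has_no_rotation(const struct transform *t)
{
    if (t == NULL)
        return true;

    /* The caller is assumed to have classified the transform as affine. */
    assert(t->m[2][0] == 0.0 && t->m[2][1] == 0.0);

    /* No rotation or shear: both off-diagonal terms are zero. */
    return t->m[0][1] == 0.0 && t->m[1][0] == 0.0;
}
```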
I long for the day when this code is obsolete... Until then, this gives
a nice boost in the fishtank.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Similarly to the standard io paths, try to reuse an upload buffer for a
small replacement pixmap.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As we can create the read buffer from an active cached bo, it may
already be in the GPU domain by the time we first finish it, so fix the
broken assertion.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
So that the relocation entries point into the contiguous surface/batch
and can be trivially fixed up.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As we have to upload the dirty data anyway, setting the
alpha-channel to 0xff should be free. Not so for firefox-asteroids on
Atom at least.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
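For reference, the operation under discussion is roughly the loop
below: copying the dirty pixels while forcing the alpha byte to 0xff.
This is a sketch assuming 32-bit pixels with alpha in the top byte;
upload_force_alpha is a hypothetical name. It is an extra OR per pixel
on top of a plain copy, which is the cost the measurement above refers
to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy n 32-bit pixels, setting the alpha channel (top byte) to 0xff. */
static void upload_force_alpha(const uint32_t *src, uint32_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] | 0xff000000u;
}
```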
If we are forced to perform a render operation to a bo too large to fit
in the pipeline, copy to an intermediate and split the operation into
tiles rather than fallback.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
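The tiling step can be sketched as below, assuming a single square
tile limit in pixels. The struct tile_box and split_into_tiles names
are hypothetical, and the copy to the intermediate is left out; the
sketch only shows how the oversized extent is cut into pieces that
each fit.

```c
#include <assert.h>

/* Half-open rectangle [x1, x2) x [y1, y2). */
struct tile_box {
    int x1, y1, x2, y2;
};

/* Split a width x height area into tiles no larger than tile x tile.
 * Returns the number of tiles written, or -1 if out[] is too small. */
static int split_into_tiles(int width, int height, int tile,
                            struct tile_box *out, int max_out)
{
    int n = 0;

    assert(tile > 0);
    for (int y = 0; y < height; y += tile) {
        for (int x = 0; x < width; x += tile) {
            if (n == max_out)
                return -1;
            out[n].x1 = x;
            out[n].y1 = y;
            /* Clamp the last tile in each direction to the edge. */
            out[n].x2 = x + tile < width ? x + tile : width;
            out[n].y2 = y + tile < height ? y + tile : height;
            n++;
        }
    }
    return n;
}
```

For example, a 10000x3000 target with a limit of 8192 splits into two
tiles, the second covering columns 8192 to 10000.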
If we do not fill the whole upload buffer, we may be able to reuse a
smaller buffer that is currently bound in the GTT. Ideally, this will
keep our RSS trim.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If, for instance, we reduce the GPU damage to all, we know that there can
be no CPU damage even though it may still have a region with a list of
subtractions. Take advantage of this knowledge and cheaply discard that
damage without having to evaluate it.
This should prevent a paranoid assertion that there is no cpu damage
when discarding the CPU bo for an active pixmap.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The function was semantically equivalent to moving the pixmap to the CPU
for writing, so replace it with a call to the generic function.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we can create snoopable bo, we prefer to use those as creating a vmap
forces a new bo creation increasing GTT pressure.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As we try to use the diffuse/specular and only resort to using a texture
operation for convenience in the rare case of a solid mask.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If the new mode can be done either using a logic op or with the blend
unit, prefer the currently enabled unit.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we are asked to use the BLT, try to avoid triggering a context switch for
a trivial case where we sample outside of a NONE source and so can
reduce the operation to a clear.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
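The trivial case might be detected along these lines: with a
non-repeating (NONE) source, a sample rectangle that lies wholly
outside the source extents reads only transparent pixels, so the
operation reduces to a clear. The function name and parameters are
hypothetical, and any transform is assumed to have been applied
already.

```c
#include <assert.h>
#include <stdbool.h>

/* True if the w x h sample area at (x, y) does not overlap a source of
 * src_w x src_h pixels anchored at the origin. */
static bool sample_is_outside(int src_w, int src_h,
                              int x, int y, int w, int h)
{
    assert(src_w > 0 && src_h > 0 && w > 0 && h > 0);
    return x >= src_w || y >= src_h || x + w <= 0 || y + h <= 0;
}
```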
Convert the linear gradient to a texture ramp and compute the texture
coordinates in the standard manner.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
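As a sketch of what the texture coordinate computation could look
like, assuming "the standard manner" means projecting the pixel onto
the gradient axis: the ramp coordinate t is 0 at the first point and 1
at the second. The function name is hypothetical, and repeat or
extend handling of t outside [0, 1] is left to the sampler.

```c
#include <assert.h>

/* Project (x, y) onto the axis from (x1, y1) to (x2, y2) and return the
 * normalised position along it, used as the coordinate into the ramp. */
static double linear_gradient_coord(double x1, double y1,
                                    double x2, double y2,
                                    double x, double y)
{
    double dx = x2 - x1;
    double dy = y2 - y1;
    double len2 = dx * dx + dy * dy;

    /* A degenerate gradient has no axis to project onto. */
    assert(len2 > 0.0);
    return ((x - x1) * dx + (y - y1) * dy) / len2;
}
```

Pixels on a line perpendicular to the axis share the same t, so for a
horizontal gradient the y coordinate has no effect.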
For correctness we need to inform GEM of the change of domain for the
buffer so that it knows to invalidate any caches when it is next used by
the GPU.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
This helps SNB on cairo-traces that utilize lots of temporary uploads
(rasterised sources and masks for instance), but comes at a cost of
regressing others...
In order to counter the regression from increasing the GTT cache size,
the CPU/GTT vma caches are split and accounted separately.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
After reducing the used size in the partial buffer, we need to re-sort
the list to keep it in decreasing order of available space.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
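A minimal sketch of restoring that ordering, assuming a singly linked
list ordered by decreasing free space (size minus used). The struct
and function names are hypothetical; the node is simply unlinked and
reinserted at its sorted position.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical partial buffer entry. */
struct partial {
    size_t size, used;
    struct partial *next;
};

static size_t partial_avail(const struct partial *p)
{
    return p->size - p->used;
}

/* Reposition `node` after its `used` field changed. Walks from the head,
 * so it copes with free space having grown or shrunk. Returns the new
 * head of the list. */
static struct partial *partial_resort(struct partial *head,
                                      struct partial *node)
{
    struct partial **link = &head;

    /* Unlink the node from its current position. */
    while (*link != node) {
        assert(*link != NULL); /* node must be on the list */
        link = &(*link)->next;
    }
    *link = node->next;

    /* Reinsert before the first entry with no more space than node. */
    link = &head;
    while (*link != NULL && partial_avail(*link) > partial_avail(node))
        link = &(*link)->next;
    node->next = *link;
    *link = node;

    return head;
}
```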
Allow SandyBridge to specialise its clear routine to reduce the number
of ring switches. It may be interesting to specialise the clear routines
even further and use the special render clear commands...
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Optimistically we would replace the GPU damage with the new set of
trapezoids. However, if any partial damage remains then the next
operation, which is often to composite another layer of trapezoids (for
complex clipmasks) using IN, will then stall.
This fixes a regression in firefox-fishbowl (and lesser regressions
throughout the cairo-traces).
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The first, and likely only, goal is to support SHMPixmap efficiently
(and without compromising SHMImage!) which we want to preserve as vmaps
and never create a GPU bo. For all other use cases, we will want to
create snoopable CPU bo à la the LLC buffers on SandyBridge.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
We trade off the extra copy in the hope that, as we haven't used the GPU
bo before, we won't need it again.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
One restriction common to all generations is that samplers access pairs
of rows and so we need to pad the buffer to accommodate access to that
second row. Do so unconditionally along paths that may be used by the
render pipeline.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
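The padding amounts to rounding the row count up to an even number
before sizing the allocation, roughly as in the sketch below. The
function name is hypothetical and any further tiling or page
alignment is ignored here.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of a buffer whose height is padded to an even number of
 * rows, so that a sampler reading a pair of rows stays inside it. */
static size_t padded_buffer_size(size_t stride, size_t height)
{
    assert(stride > 0 && height > 0);
    size_t rows = (height + 1) & ~(size_t)1;
    return stride * rows;
}
```

An odd height such as 3 rows is sized as 4; an already even height is
unchanged.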