The empty glyph still needs the correct advance, and copying it too late
left it as zero and so we were collapsing spaces in PolyText8 and
friends.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Move the workaround CS stall into the drawrect emission, which is the only
non-pipelined op we emit. This removes the split between deciding
whether we will emit a drawrect and its actual emission.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we are forced to emit a stall for the non-pipelined workaround, we do
not then need to emit a stall for switching samplers.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
On closer inspection, we still need the workaround of forcing a pipeline
stall if we update the samplers.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
This allows us to implement backend-specific workarounds and use the
more appropriate device-specific flushing.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Fixes for resubmitting batches after running out of space for vertex
buffers and also a couple of trivial spans functions.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The gen4+ spec is a little misleading as it states that all BLT pitches
for the XY commands are in dwords. Apparently not, as the upload/download
functions were already demonstrating. This only became apparent when
accelerating core text routines to offscreen pixmaps, such as composited
windows.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we test the area to be drawn against the existing CPU damage and find
it is already on the CPU, we may as well continue to utilize that
damaged region.
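That test amounts to roughly the following sketch. The names (`box_contains`, `prefer_cpu_path`) are invented for illustration and are not the driver's actual API:

```c
#include <stdbool.h>
#include <stddef.h>

struct box { int x1, y1, x2, y2; };

/* Does box a fully contain box b? */
static bool box_contains(const struct box *a, const struct box *b)
{
    return a->x1 <= b->x1 && a->y1 <= b->y1 &&
           a->x2 >= b->x2 && a->y2 >= b->y2;
}

/* Returns true if the area to be drawn is already covered by the
 * existing CPU damage, in which case we keep working on the CPU copy
 * rather than forcing a migration to the GPU. */
bool prefer_cpu_path(const struct box *cpu_damage_extents,
                     const struct box *draw_extents)
{
    return cpu_damage_extents != NULL &&
           box_contains(cpu_damage_extents, draw_extents);
}
```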
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If either the region is busy on the gpu or if we need to read the
destination then we would incur penalties for trying to perform the
operation through the GTT. However, if we are simply streaming pixels to
an unbusy bo then we can do so inplace faster than computing the
corresponding GPU commands and uploading them.
Note: currently it is universally slower to use the GPU here (the
computation of the spans is too slow). However, that is only according
to micro-benchmarks; avoiding the readback is likely to be more
efficient in practice.
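The heuristic reads as roughly the following sketch. The function and its parameters are hypothetical; the actual conditions in the driver are more involved:

```c
#include <stdbool.h>

/* Stream pixels through the GTT mapping only when the bo is idle and
 * the destination need not be read back; otherwise build GPU commands.
 * Illustrative only. */
bool upload_inplace(bool bo_busy, bool needs_readback)
{
    /* Writing through the GTT to a busy bo stalls until the GPU is
     * done, and reading through a WC mapping is prohibitively slow. */
    if (bo_busy || needs_readback)
        return false;

    /* Write-only streaming to an idle bo beats computing the
     * corresponding span commands and uploading them. */
    return true;
}
```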
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The damage layer was detecting that we were asking it to accumulate a
degenerate box emanating from PolySegment, as the unclipped paths made
the fatal assumption that it would not need to filter out degenerate
boxes. However, a degenerate line becomes a point; does the same apply
to a degenerate segment?
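One way to honour that assumption is to filter degenerate boxes before they reach the damage accumulator; a sketch with hypothetical names:

```c
struct box { int x1, y1, x2, y2; };

/* A degenerate box has zero (or negative) width or height. */
static int box_is_degenerate(const struct box *b)
{
    return b->x2 <= b->x1 || b->y2 <= b->y1;
}

/* Compacts an array of boxes in place, dropping degenerate entries;
 * returns the new count. */
int filter_degenerate(struct box *boxes, int n)
{
    int i, j = 0;

    for (i = 0; i < n; i++)
        if (!box_is_degenerate(&boxes[i]))
            boxes[j++] = boxes[i];
    return j;
}
```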
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
A few of the create_elts() routines missed marking the damage as dirty,
so that if only part of the embedded box was used (i.e. the damage
contained fewer than 8 rectangles that needed to be included in the
damage region) then those were being ignored during migration and testing.
Reported-by: Clemens Eisserer <linuxhippy@gmail.com>
References: https://bugs.freedesktop.org/show_bug.cgi?id=44682
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If the write operation fills the entire clip, then we can demote and
possibly avoid having to read back the clip from the GPU, provided that
we do not need the destination data due to an arithmetic operation or mask.
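The condition amounts to something like the following; a sketch of the reasoning with invented names, not the driver's code:

```c
#include <stdbool.h>

/* The readback can be skipped only when the write covers the whole
 * clip and nothing needs the old destination pixels: no reading
 * (arithmetic) operator and no mask. */
bool can_skip_readback(bool write_covers_clip,
                       bool op_reads_dst,
                       bool has_mask)
{
    return write_covers_clip && !op_reads_dst && !has_mask;
}
```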
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The damage tracking code asserts that it only handles clip regions.
However, sna_copy_area() was failing to ensure that its damage region
was being clipped by the source drawable, leading to out of bounds reads
during forced fallback.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we decide to do the CPU fallback inplace on the GPU bo through a WC
mapping (because it is a large write-only operation), make sure that
the new GPU bo we create is not active and so will not^W^W is less likely
to cause a stall when mapped.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
When reducing the damage we may find that it is actually empty, in
which case sna_damage_get_boxes() returns 0, so be prepared.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cachelines will only be dirtied for the bytes accessed, so a better
metric would be based on the total number of pages brought into the TLB
and the total number of cachelines used. Base the decision on whether
to try to amalgamate the upload with others on the number of bytes
copied rather than the overall extents.
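In other words, sum the bytes per box instead of taking the bounding-box area; for a sparse set of boxes the two can differ wildly. A sketch (names illustrative):

```c
struct box { int x1, y1, x2, y2; };

/* Total bytes actually written: the per-box areas times bytes-per-pixel
 * (cpp).  For sparse damage this is far smaller than the area of the
 * overall extents. */
long bytes_copied(const struct box *boxes, int n, int cpp)
{
    long total = 0;
    int i;

    for (i = 0; i < n; i++)
        total += (long)(boxes[i].x2 - boxes[i].x1) *
                 (boxes[i].y2 - boxes[i].y1) * cpp;
    return total;
}
```

Two 10x10 boxes at opposite corners of a 100x100 pixmap copy only 800 bytes at 32bpp, while the extents suggest 40000.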
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
All of the asserts and debug options that led me to believe that the
tiling was completely screwy for some writes.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
search_linear_cache() was updated to track the first good match whilst it
continued to search for a better match. This resulted in the first good
bo being modified and a record of those modifications lost, in
particular the change in tiling.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
FORCE_GPU_ONLY now has no effect except for marking the initial pixmap
as all-damaged on the GPU, and so does not test the paths for which it
was originally introduced.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Just in case we set a mode then fail to emit any dwords. Sounds
inefficient and woe betide the culprit when I find it...
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
We fudge forced use of the BLT ring unless we install a render backend,
and so we must also prevent the ring from being reset when the GPU is
idle. Therefore we make handling the ring status a backend function.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we do not have access to an accelerated render backend, only create
GPU buffers for the scanout and use an accelerated blitter for
upload/download and operating inplace on the scanout.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Found by valgrind:
==13639== Conditional jump or move depends on uninitialised value(s)
==13639== at 0x5520B1E: pixman_region_init_rects (in
/usr/lib/x86_64-linux-gnu/libpixman-1.so.0.24.0)
==13639== by 0x89E6ED7: __sna_damage_reduce (sna_damage.c:489)
==13639== by 0x89E7FEC: _sna_damage_contains_box (sna_damage.c:1161)
==13639== by 0x89CFCD9: sna_drawable_use_gpu_bo (sna_damage.h:175)
==13639== by 0x89D52DA: sna_poly_segment (sna_accel.c:6130)
==13639== by 0x21F87E: damagePolySegment (damage.c:1096)
==13639== by 0x1565A2: ProcPolySegment (dispatch.c:1771)
==13639== by 0x159FB0: Dispatch (dispatch.c:437)
==13639== by 0x1491D9: main (main.c:287)
==13639== Uninitialised value was created by a heap allocation
==13639== at 0x4028693: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13639== by 0x89E6BFB: _sna_damage_create_boxes (sna_damage.c:205)
==13639== by 0x89E78F0: _sna_damage_add_rectangles (sna_damage.c:327)
==13639== by 0x89CD32D: sna_poly_fill_rect_blt.isra.65
(sna_damage.h:68)
==13639== by 0x89DE23F: sna_poly_fill_rect (sna_accel.c:8366)
==13639== by 0x21E9C8: damagePolyFillRect (damage.c:1309)
==13639== by 0x26DD3F: miPaintWindow (miexpose.c:674)
==13639== by 0x18370A: ChangeWindowAttributes (window.c:1553)
==13639== by 0x154500: ProcChangeWindowAttributes (dispatch.c:696)
==13639== by 0x159FB0: Dispatch (dispatch.c:437)
==13639== by 0x1491D9: main (main.c:287)
==13639==
Use 'count' everywhere for consistency.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Detected by valgrind:
==22012== Source and destination overlap in memcpy(0xd101000, 0xd101000,
783360)
==22012== at 0x402A180: memcpy (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22012== by 0x89BD4ED: memcpy_blt (blt.c:209)
==22012== by 0x89F2921: sna_write_boxes (sna_io.c:364)
==22012== by 0x89CFABF: sna_pixmap_move_to_gpu (sna_accel.c:1900)
==22012== by 0x89F49B0: sna_render_pixmap_bo (sna_render.c:571)
==22012== by 0x8A268CE: gen5_composite_picture (gen5_render.c:1908)
==22012== by 0x8A29B8A: gen5_render_composite (gen5_render.c:2252)
==22012== by 0x89E6762: sna_composite (sna_composite.c:485)
==22012== by 0x21D3C3: damageComposite (damage.c:569)
==22012== by 0x215963: ProcRenderComposite (render.c:728)
==22012== by 0x159FB0: Dispatch (dispatch.c:437)
==22012== by 0x1491D9: main (main.c:287)
==22012==
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Instead of checking for the CPU generation, use the libdrm-provided
I915_PARAM_HAS_LLC.
v2: use a define check to verify if we have I915_PARAM_HAS_LLC.
Signed-off-by: Eugeni Dodonov <eugeni.dodonov@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The previous commit undoes a premature optimisation that assumed that
the current damage captured all pixels written. However, it happens to
be a useful optimisation along that path (tracking the upload of partial
images), so add the necessary bookkeeping to watch for when the union
of CPU and GPU damage is no longer the complete set of all pixels
written; that is, if we migrate from one pixmap to the other, the
undamaged region goes untracked. We also take advantage of whenever we
damage the whole pixmap to restore the knowledge that our tracking of
all pixels written is complete.
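A toy model of that bookkeeping; the struct and function names are invented for illustration:

```c
#include <stdbool.h>

/* Records whether cpu_damage ∪ gpu_damage still covers every pixel
 * ever written to the pixmap. */
struct damage_state {
    bool all_damaged;
};

/* Migrating from one pixmap copy to the other drops the undamaged
 * region from tracking, so the flag must be cleared. */
void on_migrate(struct damage_state *s)
{
    s->all_damaged = false;
}

/* Damaging the whole pixmap restores complete knowledge. */
void on_damage_all(struct damage_state *s)
{
    s->all_damaged = true;
}
```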
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we discard the CPU bo, we lose knowledge of whatever regions had been
initialised but were no longer dirty on the GPU, and must instead assume
that the entirety of the GPU bo is dirty.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Put some markers into the debug log as those functions create many
proxies causing a lot of debug noise.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
For SNB, in case you really, really want to use GPU detiling and not
incur the ring switch. Tweaking when to just mmap the target seems to
gain most anyway...
The ulterior motive is that this provides fallback paths for avoiding
the use of TILING_Y with GTT mmaps which is broken on 855gm.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
We only want to create huge pwrite buffers when populating the inactive
cache for mmapped uploads. In the absence of using mmap for upload, be
more conservative with the alignment value so as not to simply waste
valuable aperture and memory.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we steal a write buffer for creating a pixmap for readback, then we
need to be careful, as we will have set the used amount to 0 and would
then incorrectly decrease it by the last row. Fortunately, we do not yet
have any code that attempts to create a 2D buffer for reading.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The batch may legitimately be submitted prior to the attachment of the
read buffer, if, for example, we need to switch rings. Therefore update
the assertion to only check that the bo remains in existence via either
a reference from the exec or from the user.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
When backporting the patches from gen6, I didn't notice the memset that
came later, and this wasn't along the paths checked by rendercheck.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Whilst we can render to and blend with a depth 15 target, we cannot use
it as a texture with the sampling engine.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The hw wants it as demonstrated by the '>' in KDE's menus. Why is it
always KDE that demonstrates coherency problems...
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
One of the side-effects of emitting the composite state is that it
tags the destination surface as dirty as a result of the *forthcoming*
operation. Emitting the flush after the composite state therefore
clears that tag, and we need to restore it for future coherency.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>